Language Models
• Models that assign probabilities to sequences of words are called language
models or LMs.
• The simplest model that assigns probabilities to sentences and sequences of
words is the n-gram.
• An n-gram is a sequence of n words: a 2-gram (which we’ll call bigram) is a two-
word sequence of words like “please turn”, “turn your”, or ”your homework”, and
a 3-gram (a trigram) is a three-word sequence of words like “please turn your”,
or “turn your homework”.
• We are interested in P(w|h), the probability of a word w given some history h. Suppose the history h is
“its water is so transparent that” and we want to know the probability that the
next word is the:
Language Models
• One way to estimate this probability is from relative frequency counts: take a very large
corpus, count the number of times we see its water is so transparent that, and count the
number of times this is followed by the. This answers the question “Out of the
times we saw the history h, how many times was it followed by the word w?”, as follows:

P(the | its water is so transparent that) = C(its water is so transparent that the) / C(its water is so transparent that)
• With a large enough corpus, such as the web, we can compute these counts and estimate the probability.
• While this method of estimating probabilities directly from counts works fine in many cases, it turns out that
even the web isn’t big enough to give us good estimates in most
cases.
• This is because language is creative; new sentences are created all the time, and we won’t always be able to
count entire sentences. Even simple extensions of the example sentence may have counts of zero on the web
(such as “Walden Pond’s water is so transparent that the”; well, used to have counts of zero).
• Similarly, if we wanted to know the joint probability of an entire sequence of words like its water is so
transparent, we could do it by asking “out of all possible sequences of five words, how many of them are its
water is so transparent?” We would have to get the count of its water is so transparent and divide by the sum of
the counts of all possible five word sequences. That seems rather a lot to estimate!
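
To make this concrete, here is a minimal Python sketch of the relative-frequency estimate. The two-sentence corpus string and the function name are invented for illustration, and raw substring counting stands in for proper tokenization.

# Relative-frequency estimate of P(w | h): count how often the history h occurs
# and how often it is followed by w. Substring counting is crude but enough here.
def relative_frequency(corpus: str, history: str, word: str) -> float:
    count_history = corpus.count(history)                     # times we saw h
    count_history_word = corpus.count(history + " " + word)   # times h was followed by w
    return count_history_word / count_history if count_history else 0.0

toy_corpus = ("its water is so transparent that the fish are visible . "
              "its water is so transparent that you can see the bottom .")
print(relative_frequency(toy_corpus, "its water is so transparent that", "the"))  # 0.5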
Sentences:
• Sentence 1: I like AI
• Sentence 2: I like ML
• Sentence 3: AI likes ML
Unigram Model
• In a unigram model, we calculate the probability of each word independently of the other words.
Step 1: Count words
Total words = 9
Word counts:
• I: 2
• like: 2
• AI: 2
• ML: 2
• likes: 1

Unigram probabilities:
• P(I) = 2/9
• P(like) = 2/9
• P(AI) = 2/9
• P(ML) = 2/9
• P(likes) = 1/9
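
As a minimal sketch, the unigram counts and probabilities above can be reproduced with a few lines of Python (standard library only; variable names are illustrative):

from collections import Counter

# The three example sentences from the slides.
sentences = ["I like AI", "I like ML", "AI likes ML"]
tokens = [w for s in sentences for w in s.split()]

counts = Counter(tokens)    # word frequencies
total = len(tokens)         # N = 9

# Unigram probability: P(w) = Count(w) / N
unigram_prob = {w: c / total for w, c in counts.items()}
print(unigram_prob)         # P(I) = 2/9, P(like) = 2/9, ..., P(likes) = 1/9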
Bigram Model
• In a bigram model, we calculate the probability of a word given
the previous word:
• P(w2 | w1) = Count(w1 w2) / Count(w1)

Bigram counts (from the sentences):
• I like: 2
• like AI: 1
• like ML: 1
• AI likes: 1
• likes ML: 1

Unigram counts (needed for the denominator):
• Count(I) = 2
• Count(like) = 2
• Count(AI) = 2
• Count(likes) = 1

Bigram probabilities:
• P(like | I) = 2 / 2 = 1.0
• P(AI | like) = 1 / 2 = 0.5
• P(ML | like) = 1 / 2 = 0.5
• P(likes | AI) = 1 / 2 = 0.5
• P(ML | likes) = 1 / 1 = 1.0
Trigram Model
• In a trigram model, we calculate the probability of a word given the two
previous words:
P(w3 | w1 w2) = Count(w1 w2 w3) / Count(w1 w2)
Trigram counts (from the sentences):
• I like AI: 1
• I like ML: 1
• AI likes ML: 1

Bigram counts (needed for the denominator):
• Count(I like) = 2
• Count(AI likes) = 1

Trigram probabilities:
• P(AI | I like) = 1 / 2 = 0.5
• P(ML | I like) = 1 / 2 = 0.5
• P(ML | AI likes) = 1 / 1 = 1.0
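
The bigram and trigram estimates above can be checked with a short sketch (standard library only; the ngram_counts helper is an illustrative name, not something defined in the slides):

from collections import Counter

sentences = ["I like AI", "I like ML", "AI likes ML"]
tokenized = [s.split() for s in sentences]

def ngram_counts(n):
    """Count n-grams within each sentence (no start/end padding, as in the slides)."""
    return Counter(tuple(sent[i:i + n]) for sent in tokenized
                   for i in range(len(sent) - n + 1))

unigrams, bigrams, trigrams = ngram_counts(1), ngram_counts(2), ngram_counts(3)

# Bigram:  P(w2 | w1)    = Count(w1 w2)    / Count(w1)
# Trigram: P(w3 | w1 w2) = Count(w1 w2 w3) / Count(w1 w2)
print(bigrams[("I", "like")] / unigrams[("I",)])               # P(like | I)    = 1.0
print(bigrams[("like", "AI")] / unigrams[("like",)])           # P(AI | like)   = 0.5
print(trigrams[("I", "like", "ML")] / bigrams[("I", "like")])  # P(ML | I like) = 0.5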
Predicting next word using bi-gram
• Step 2: Predict the next word
• Suppose we are given:
• Prompt: "I“
• We want to predict the next word using bigram probabilities, i.e.,
P(w2 | I)
• From bigram counts:
• P(like | I) = Count(I like) / Count(I) = 2 / 2 = 1.0
• → So, if the current word is "I", the most likely next word is "like".
Predicting next word using bi-gram
• Now try:
• Prompt: "like“
• Possible next words after "like":
• like AI → 1 time
• like ML → 1 time
• So: P(AI | like) = 1 / 2 = 0.5
• P(ML | like) = 1 / 2 = 0.5
• → The model is equally likely to choose AI or ML.
Predicting next word using bi-gram
• Prompt: "AI“
• From bigrams:
• AI likes → 1
• → So, P(likes | AI) = 1 / 2 = 0.5
• (since Count(AI) = 2 in the corpus: AI also appears at the end of "I like AI")
• → Likely next word: likes
• If the prompt is "I like",
• → You check what follows this bigram:
• Trigrams: I like AI, I like ML
• So, P(AI | I like) = 0.5 and P(ML | I like) = 0.5
Summary
• Prompt: I → Predicted next word: like
Prompt: like → Predicted next word: AI or ML
Prompt: AI → Predicted next word: likes
Prompt: likes → Predicted next word: ML
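
A minimal sketch of this prediction step, following the slides' convention of dividing by the unigram count of the prompt word (which is why P(likes | AI) = 0.5: the other half of AI's count comes from AI ending a sentence). The predict_next helper is an illustrative name:

from collections import Counter, defaultdict

sentences = ["I like AI", "I like ML", "AI likes ML"]
tokens = [w for s in sentences for w in s.split()]
unigram_counts = Counter(tokens)

# Bigram counts grouped by the first word.
followers = defaultdict(Counter)
for s in sentences:
    words = s.split()
    for w1, w2 in zip(words, words[1:]):
        followers[w1][w2] += 1

def predict_next(word):
    """Bigram probabilities P(w2 | word) = Count(word w2) / Count(word), as in the slides."""
    return {w2: c / unigram_counts[word] for w2, c in followers[word].items()}

print(predict_next("I"))     # {'like': 1.0}
print(predict_next("like"))  # {'AI': 0.5, 'ML': 0.5}  -> tie between AI and ML
print(predict_next("AI"))    # {'likes': 0.5}          -> remaining mass is sentence-final 'AI'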
Markov Chain

• A Markov chain or Markov process is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event. Informally, this may be thought of as, "What happens next depends only on the state of affairs now."
Markov Chain

• States: Like HMMs, Markov Chains have states. Each state represents a situation
or a condition.

• Transitions: What sets Markov Chains apart is the Markov property, which says
that the probability of transitioning to any particular state depends solely on
the current state and time elapsed. It's memoryless, like a goldfish that only
remembers the last state it was in.

• Probabilities: Each transition between states has a probability associated with it. These probabilities are often arranged in a transition matrix.
Markov Chain
• In NLP, Markov Chains are like linguistic time travelers. Imagine you're walking through a sentence, and
at each word, you decide where to go next based only on your current location. In NLP, this concept is
applied to generate text.

• Text Generation:

• Order 1 (corresponding to a bigram model): The simplest form involves predicting the next word based only on the current word. Each word is like a state, and the next word is chosen with probabilities based on the frequency of transitions in the training data.

• Order 2 (corresponding to a trigram model): Here, the next word depends on the current and the previous word. It's a bit more context-aware.

• Higher Orders: You can go to higher orders, where the next word depends on the last two, three, etc., words. However, this increases the complexity and requires more training data. (A small order-1 generation sketch follows.)
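
A short order-1 generation sketch along these lines. The training text is the toy corpus from the n-gram slides, sentence boundaries are ignored for simplicity, and the generate helper is an illustrative name:

import random
from collections import Counter, defaultdict

# Toy training text; sentence boundaries are ignored for simplicity.
text = "I like AI I like ML AI likes ML".split()

# Order-1 chain: transition counts from each word to its successor.
transitions = defaultdict(Counter)
for w1, w2 in zip(text, text[1:]):
    transitions[w1][w2] += 1

def generate(start, length=5):
    """Walk the chain: each next word depends only on the current word."""
    word, output = start, [start]
    for _ in range(length - 1):
        if word not in transitions:   # dead end: no observed successor
            break
        candidates = transitions[word]
        word = random.choices(list(candidates), weights=list(candidates.values()))[0]
        output.append(word)
    return " ".join(output)

print(generate("I"))   # e.g. "I like ML AI likes"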
Markov Chain
Limitations

• Memoryless: The Markov property can be limiting. It assumes that the future state depends only on the current state, not on how you got there.

• Fixed Transitions: Like HMMs, transitions are fixed during training, which might not capture the complexity of language.
Markov Chain

Applications

• Text Prediction: Predicting the next word in a sequence based on the current word.

• Random Text Generation: Generating text that resembles the training data. It's like a word-based improvisation.
Hidden Markov Models
• The Hidden Markov Model is a probabilistic model used to explain or derive the probabilistic characteristics of a random process. It says that an observed event does not correspond directly to the step-by-step state of the process, but is instead related to it through a set of probability distributions.
Hidden Markov Model
A Hidden Markov Model (HMM) is a statistical model that works with
sequences where we have:
1.Hidden states that we can't directly observe
2.Observable outputs that depend on these hidden states
The "Markov" part means that the current state depends only on the
previous state, not the entire history.
Key Components of HMMs
1.Hidden States: The internal states of the system that
we cannot directly observe
2.Observations: The visible outputs that we can measure
3.Transition Probabilities: The probability of moving
from one hidden state to another
4.Emission Probabilities: The probability of generating a
specific observation from a given hidden state
5.Initial State Probabilities: The probability of starting in
each possible hidden state
Real-time Example: Weather Prediction
Hidden States: The actual weather condition (Sunny, Rainy, Cloudy) - we can't
directly observe the atmospheric conditions.
Observations: Activities of a student (Study, Walk, Shop) that we can observe.
Scenario Setup:
• Initial State Probabilities:
• 40% chance of starting with Sunny
• 30% chance of starting with Rainy
• 30% chance of starting with Cloudy
• Transition Probabilities:
• If today is Sunny:
• 50% chance tomorrow will be Sunny
• 20% chance tomorrow will be Rainy
• 30% chance tomorrow will be Cloudy
• If today is Rainy:
• 30% chance tomorrow will be Sunny
• 40% chance tomorrow will be Rainy
• 30% chance tomorrow will be Cloudy
• If today is Cloudy:
• 40% chance tomorrow will be Sunny
• 30% chance tomorrow will be Rainy
• 30% chance tomorrow will be Cloudy
Emission Probabilities:
• If the weather is Sunny:
• 30% chance the student will Study
• 50% chance the student will Walk
• 20% chance the student will Shop
• If the weather is Rainy:
• 60% chance the student will Study
• 10% chance the student will Walk
• 30% chance the student will Shop
• If the weather is Cloudy:
• 40% chance the student will Study
• 30% chance the student will Walk
• 30% chance the student will Shop
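
Written down as plain Python dictionaries (variable names are illustrative), the scenario above looks like this; the same dictionary form is reused in the later sketches:

# The weather/student HMM from this example, encoded as plain dictionaries.
states = ("Sunny", "Rainy", "Cloudy")
observations = ("Study", "Walk", "Shop")

start_probability = {"Sunny": 0.4, "Rainy": 0.3, "Cloudy": 0.3}

transition_probability = {
    "Sunny":  {"Sunny": 0.5, "Rainy": 0.2, "Cloudy": 0.3},
    "Rainy":  {"Sunny": 0.3, "Rainy": 0.4, "Cloudy": 0.3},
    "Cloudy": {"Sunny": 0.4, "Rainy": 0.3, "Cloudy": 0.3},
}

emission_probability = {
    "Sunny":  {"Study": 0.3, "Walk": 0.5, "Shop": 0.2},
    "Rainy":  {"Study": 0.6, "Walk": 0.1, "Shop": 0.3},
    "Cloudy": {"Study": 0.4, "Walk": 0.3, "Shop": 0.3},
}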
HMM
• The states and observations are:
• states = ('Rainy', 'Sunny')
• observations = ('walk', 'shop', 'clean')

• And the start probability is:
• start probability = {'Rainy': 0.6, 'Sunny': 0.4}
HMM

• The start distribution puts more weight on the rainy state, so a day is more likely to begin rainy. The probabilities for the next day's weather states are as follows:

• transition probability = {'Rainy': {'Rainy': 0.7, 'Sunny': 0.3},
                            'Sunny': {'Rainy': 0.4, 'Sunny': 0.6}}
HMM

• The transition probabilities describe how the weather changes from one day to the next. Given the weather state, the emission probabilities give the chance of each observed activity:

• emission probability = {'Rainy': {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},
                          'Sunny': {'walk': 0.6, 'shop': 0.3, 'clean': 0.1}}
HMM

• Using the emission probabilities, we can predict the states of the weather from the observed activities; using the transition probabilities, we can predict which activity is likely to be performed the next day.
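
As a small illustration of the first direction (inferring the weather from what we observe), the sketch below applies Bayes' rule to a single observed activity using the start and emission probabilities above; posterior_given_observation is an illustrative name:

# Rainy/Sunny parameters from this example.
start_p    = {"Rainy": 0.6, "Sunny": 0.4}
emission_p = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
              "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

def posterior_given_observation(obs):
    """P(state | first observation) via Bayes' rule over the two states."""
    joint = {s: start_p[s] * emission_p[s][obs] for s in start_p}
    total = sum(joint.values())
    return {s: p / total for s, p in joint.items()}

print(posterior_given_observation("walk"))
# ≈ {'Rainy': 0.2, 'Sunny': 0.8}: observing 'walk' makes Sunny much more likely.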
Application of Hidden Markov Model
An application where an HMM is used aims to recover a data sequence that cannot be observed directly, where each element of the sequence depends on the earlier elements. Taking this intuition into account, the HMM can be used in the following applications:

• Document separation in scanning solutions
• Computational finance
• Machine translation
• Speed analysis
• Handwriting recognition
• Speech recognition
• Time series analysis
• Speech synthesis
• Activity recognition
• Part-of-speech tagging
• Sequence classification
• Transportation forecasting
Hidden Markov Models in NLP
• POS tagging is a very useful part of text preprocessing in NLP. Since NLP aims to make a machine able to communicate with a human or with another machine, it is essential for the machine to understand parts of speech.

• Classifying words by their part of speech and assigning the corresponding labels is called part-of-speech tagging (POS tagging, or POST). The set of labels/tags used is called a tagset.
Likelihood Computation

• Likelihood is all about figuring out how probable a sequence of observations is given the model's parameters. It's like asking, "How likely is it that this sequence of words was generated by the HMM?"
Likelihood Computation
The Likelihood Computation Process

• Emission Probability: First, you calculate the probability of observing each word in the sequence given the hidden state. This is the emission probability. It answers the question, "If the model is in this state, what's the probability of it emitting this word?"

• Transition Probability: Next, you consider the transitions between states. What's the likelihood of moving from one part of speech to another? This is the transition probability.

• Initial State Probability: Don't forget the starting point. What's the probability of starting in a particular state?

• Combining Probabilities: Now, you multiply these probabilities together for each step in your sequence, and sum over all possible hidden state sequences. This gives you the likelihood of the entire sequence (see the sketch below).
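
A minimal sketch of this computation for the Rainy/Sunny example, using the standard forward algorithm to sum over all hidden state sequences; sequence_likelihood is an illustrative name:

# Forward algorithm: likelihood of an observation sequence under an HMM.
# Parameters are the Rainy/Sunny example from the earlier slides.
states  = ("Rainy", "Sunny")
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
           "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p  = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
           "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

def sequence_likelihood(observations):
    """P(observations) = sum over all hidden state paths of (initial * transitions * emissions)."""
    # Initialisation: probability of starting in each state and emitting the first observation.
    forward = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    # Recursion: extend every partial path by one step, summing over previous states.
    for obs in observations[1:]:
        forward = {s: sum(forward[prev] * trans_p[prev][s] for prev in states) * emit_p[s][obs]
                   for s in states}
    # Termination: sum over the final state.
    return sum(forward.values())

print(sequence_likelihood(["walk", "shop", "clean"]))  # ≈ 0.033612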


Smoothing techniques
• Smoothing techniques are essential in language modeling to handle
the problem of zero probabilities for unseen n-grams (word
sequences) in the training data.
• If a language model assigns zero probability to an unseen n-gram, it
could severely affect the performance of applications like speech
recognition, machine translation, etc.
Add-One Smoothing (Laplace Smoothing)
Add 1 to the count of every possible word (or n-gram) to avoid zero
probabilities.
• Formula for Unigram: P(w) = (Count(w) + 1) / (N + V), where N is the total number of word tokens and V is the vocabulary size.
Add-One Smoothing (Laplace Smoothing)
• Example:
• Corpus: "the cat sat on the mat“
• Vocabulary V=5 (the, cat, sat, on, mat)
• Total words N=6
• Let's calculate the probability of the word "dog" (which is not in the corpus): P(dog) = (0 + 1) / (6 + 5) = 1/11 ≈ 0.09, instead of 0 without smoothing.
Add-k Smoothing (Generalized Laplace)
• Instead of adding 1, add a smaller value k ∈ (0, 1), such as 0.5 or 0.1, for better results: P(w) = (Count(w) + k) / (N + kV).

• Used when add-one over-smooths the probabilities (see the sketch below).
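
A compact sketch of add-k smoothing for unigrams on the example corpus above (k = 1 recovers add-one/Laplace; smoothed_prob is an illustrative name):

from collections import Counter

corpus = "the cat sat on the mat".split()
counts = Counter(corpus)
N = len(corpus)          # total tokens = 6
V = len(counts)          # vocabulary size = 5

def smoothed_prob(word, k=1.0):
    """Add-k estimate: P(word) = (Count(word) + k) / (N + k * V)."""
    return (counts[word] + k) / (N + k * V)

print(smoothed_prob("the"))          # add-one: (2 + 1) / (6 + 5)     = 3/11 ≈ 0.273
print(smoothed_prob("dog"))          # unseen:  (0 + 1) / (6 + 5)     = 1/11 ≈ 0.091
print(smoothed_prob("dog", k=0.1))   # add-0.1: (0 + 0.1) / (6 + 0.5)       ≈ 0.015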


Good-Turing Smoothing
• Adjust counts based on the frequency of frequencies. If many n-
grams occur only once, it's likely there are many unseen ones too.
