Language Models
• Models that assign probabilities to sequences of words are called language
models or LMs.
• The simplest model that assigns probabilities to sentences and sequences of
words is the n-gram.
• An n-gram is a sequence of n words: a 2-gram (which we’ll call bigram) is a two-
word sequence of words like “please turn”, “turn your”, or “your homework”, and
a 3-gram (a trigram) is a three-word sequence of words like “please turn your”
or “turn your homework”.
• P(w|h) is the probability of a word w given some history h. Suppose the history h is
“its water is so transparent that” and we want to know the probability that the
next word is “the”:
Language Models
• One way to estimate this probability is from relative frequency counts: take a very large
corpus, count the number of times we see its water is so transparent that, and count the
number of times this is followed by the. This answers the question “Out of the
times we saw the history h, how many times was it followed by the word w?”, as follows:
P(the | its water is so transparent that) = Count(its water is so transparent that the) / Count(its water is so transparent that)
• With a large enough corpus, such as the web, we can compute these counts and estimate the probability.
• While this method of estimating probabilities directly from counts works fine in many cases, it turns out that
even the web isn’t big enough to give us good estimates in most
cases.
• This is because language is creative; new sentences are created all the time, and we won’t always be able to
count entire sentences. Even simple extensions of the example sentence may have counts of zero on the web
(such as “Walden Pond’s water is so transparent that the”; well, used to have counts of zero).
• Similarly, if we wanted to know the joint probability of an entire sequence of words like its water is so
transparent, we could do it by asking “out of all possible sequences of five words, how many of them are its
water is so transparent?” We would have to get the count of its water is so transparent and divide by the sum of
the counts of all possible five word sequences. That seems rather a lot to estimate!
Sentences (training corpus):
• Sentence 1: I like AI
• Sentence 2: I like ML
• Sentence 3: AI likes ML
Unigram Model
• In a unigram model, we calculate the probability of each word independently of context (a short Python sketch follows the table below).
Step 1: Count words (total words = 9)

Word      Frequency   Unigram probability
I         2           P(I) = 2/9
like      2           P(like) = 2/9
AI        2           P(AI) = 2/9
ML        2           P(ML) = 2/9
likes     1           P(likes) = 1/9
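A minimal Python sketch of the unigram calculation above; the sentence list, the variable names, and the use of collections.Counter are illustrative choices, not something prescribed by the slides.

from collections import Counter

# Toy corpus from the example sentences
sentences = ["I like AI", "I like ML", "AI likes ML"]
words = [w for s in sentences for w in s.split()]

unigram_counts = Counter(words)   # {'I': 2, 'like': 2, 'AI': 2, 'ML': 2, 'likes': 1}
total = len(words)                # 9 words in total

# Unigram probability: P(w) = Count(w) / total number of words
unigram_probs = {w: c / total for w, c in unigram_counts.items()}
print(unigram_probs)              # P(I)=2/9, P(like)=2/9, ..., P(likes)=1/9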
Bigram Model
• In a bigram model, we calculate the probability of a word given
the previous word (a Python sketch follows the tables below):
• P(w2 | w1) = Count(w1 w2) / Count(w1)
Bigram pairs from the sentences, with counts:

Bigram      Count
I like      2
like AI     1
like ML     1
AI likes    1
likes ML    1

Unigram counts (needed for the denominator):
• Count(I) = 2
• Count(like) = 2
• Count(AI) = 2
• Count(likes) = 1

Bigram probabilities:
• P(like | I) = 2 / 2 = 1.0
• P(AI | like) = 1 / 2 = 0.5
• P(ML | like) = 1 / 2 = 0.5
• P(likes | AI) = 1 / 2 = 0.5
• P(ML | likes) = 1 / 1 = 1.0
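The same style of sketch for the bigram table above, assuming the same toy sentences; the bigram_prob helper is an illustrative name, not an API from any library.

from collections import Counter

sentences = ["I like AI", "I like ML", "AI likes ML"]
tokenized = [s.split() for s in sentences]

unigram_counts = Counter(w for sent in tokenized for w in sent)
bigram_counts = Counter(
    (sent[i], sent[i + 1]) for sent in tokenized for i in range(len(sent) - 1)
)

# Bigram probability: P(w2 | w1) = Count(w1 w2) / Count(w1)
def bigram_prob(w1, w2):
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

print(bigram_prob("I", "like"))    # 2 / 2 = 1.0
print(bigram_prob("like", "AI"))   # 1 / 2 = 0.5
print(bigram_prob("likes", "ML"))  # 1 / 1 = 1.0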
Trigram Model
• In a trigram model, we calculate the probability of a word given the two
previous words (a sketch again follows below):
P(w3 | w1 w2) = Count(w1 w2 w3) / Count(w1 w2)
Trigram sequences from the sentences (I like AI, I like ML, AI likes ML), with counts:

Trigram        Count
I like AI      1
I like ML      1
AI likes ML    1

Bigram counts (needed for the denominator):
• Count(I like) = 2
• Count(AI likes) = 1

Trigram probabilities:
• P(AI | I like) = 1 / 2 = 0.5
• P(ML | I like) = 1 / 2 = 0.5
• P(ML | AI likes) = 1 / 1 = 1.0
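And the corresponding sketch for trigrams, again assuming the same toy corpus; trigram_prob is an illustrative helper name.

from collections import Counter

sentences = ["I like AI", "I like ML", "AI likes ML"]
tokenized = [s.split() for s in sentences]

bigram_counts = Counter(
    (s[i], s[i + 1]) for s in tokenized for i in range(len(s) - 1)
)
trigram_counts = Counter(
    (s[i], s[i + 1], s[i + 2]) for s in tokenized for i in range(len(s) - 2)
)

# Trigram probability: P(w3 | w1 w2) = Count(w1 w2 w3) / Count(w1 w2)
def trigram_prob(w1, w2, w3):
    return trigram_counts[(w1, w2, w3)] / bigram_counts[(w1, w2)]

print(trigram_prob("I", "like", "AI"))    # 1 / 2 = 0.5
print(trigram_prob("AI", "likes", "ML"))  # 1 / 1 = 1.0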
Predicting next word using bi-gram
• Step 2: Predict the next word
• Suppose we are given:
• Prompt: "I“
• We want to predict the next word using bigram probabilities, i.e.,
P(w2 | I)
• From bigram counts:
• P(like | I) = Count(I like) / Count(I) = 2 / 2 = 1.0
• → So, if the current word is "I", the most likely next word is "like".
Predicting next word using bi-gram
• Now try:
• Prompt: "like“
• Possible next words after "like":
• like AI → 1 time
• like ML → 1 time
• So: P(AI | like) = 1 / 2 = 0.5
• P(ML | like) = 1 / 2 = 0.5
• → The model is equally likely to choose AI or ML.
Predicting next word using bi-gram
• Prompt: "AI“
• From bigrams:
• AI likes → 1
• → So, P(likes | AI) = Count(AI likes) / Count(AI) = 1 / 2 = 0.5
• (Count(AI) = 2 because AI also appears at the end of sentence 1, where no word follows it)
• → Likely next word: likes
• If the prompt is: "I like",
• → You check what follows this bigram:
• Trigram: I like AI, I like ML
• So, P(AI | I like) = 0.5 and P(ML | I like) = 0.5
Summary
• Prompt: I → Predicted next word: like
• Prompt: like → Predicted next word: AI or ML
• Prompt: AI → Predicted next word: likes
• Prompt: likes → Predicted next word: ML
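A minimal sketch that reproduces this summary table by predicting the next word from bigram counts; next_word_candidates is an illustrative helper name, and when two words tie (AI vs. ML after "like") max simply returns whichever it meets first.

from collections import Counter

sentences = ["I like AI", "I like ML", "AI likes ML"]
tokenized = [s.split() for s in sentences]

unigram_counts = Counter(w for s in tokenized for w in s)
bigram_counts = Counter(
    (s[i], s[i + 1]) for s in tokenized for i in range(len(s) - 1)
)

def next_word_candidates(prompt_word):
    # Every observed next word with its bigram probability P(w2 | prompt_word)
    return {
        w2: count / unigram_counts[prompt_word]
        for (w1, w2), count in bigram_counts.items()
        if w1 == prompt_word
    }

for prompt in ["I", "like", "AI", "likes"]:
    candidates = next_word_candidates(prompt)
    prediction = max(candidates, key=candidates.get)
    print(prompt, "->", candidates, "| prediction:", prediction)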
Markov Chain
• States: Like HMMs, Markov Chains have states. Each state represents a situation
or a condition.
• Transitions: What sets Markov Chains apart is the Markov property, which says
that the probability of transitioning to any particular state depends solely on
the current state, not on the sequence of states that came before it. It's memoryless,
like a goldfish that only remembers the state it is in right now.
• Text Generation:
• Order 1 (bigram model): The simplest form involves predicting the next word based only on the
current word. Each word is a state, and the next word is chosen with probabilities based on the
frequency of transitions in the training data (a small generation sketch follows this list).
• Order 2 (trigram model): Here, the next word depends on the current and the previous word. It's a
bit more context-aware.
• Higher Orders: You can go to higher orders, where the next word depends on the last two, three, etc.,
words. However, this increases the complexity and requires more training data.
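A minimal order-1 (bigram-style) generation sketch, using the toy sentences from earlier as training data; the generate function, its max_len parameter, and the list-based transition table are illustrative choices. Storing duplicate successors in the list makes random.choice pick next words in proportion to their observed frequency.

import random
from collections import defaultdict

training_sentences = ["I like AI", "I like ML", "AI likes ML"]

# Transition table: current word -> list of observed next words (duplicates keep frequencies)
transitions = defaultdict(list)
for sentence in training_sentences:
    words = sentence.split()
    for i in range(len(words) - 1):
        transitions[words[i]].append(words[i + 1])

def generate(start_word, max_len=5):
    # Random walk: each next word depends only on the current word (Markov property)
    word, output = start_word, [start_word]
    while word in transitions and len(output) < max_len:
        word = random.choice(transitions[word])
        output.append(word)
    return " ".join(output)

print(generate("I"))  # e.g. "I like ML" or "I like AI likes ML"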
Markov Chain
• Limitations: a plain Markov Chain assumes its states can be observed directly. In many problems the next
item in the sequence cannot be observed immediately, yet it still depends on the earlier sequence; taking this
intuition into account leads to the Hidden Markov Model, covered next.
• Applications: transportation forecasting, among other sequence-prediction tasks.
Hidden Markov Models in NLP
• POS tagging is a very useful part of text preprocessing in NLP. Since NLP is a task where we make a
machine able to communicate with a human or with a different machine, it becomes essential for the
machine to understand the part of speech of each word.
• Emission Probability: First, you calculate the probability of observing each word in the sequence given
the hidden state. This is the emission probability. It answers the question, "If the ninja is in this state,
what's the probability of it performing this specific move (emitting this word)?"
• Transition Probability: Next, you consider the transitions between states. What's the likelihood of
moving from the current hidden state to the next one?
• Initial State Probability: Don't forget the starting point. What's the probability of starting in a particular
state?
• Combining Probabilities: Now, you multiply these probabilities together for each step in your sequence
to score how likely that particular sequence of hidden states (tags) is for the observed words.
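A minimal sketch of this combine-the-probabilities step for one candidate tag sequence. Every probability value, tag name, and word below is invented purely for illustration (not trained estimates), and a real tagger would search over all possible tag sequences, for example with the Viterbi algorithm, rather than scoring a single sequence by hand.

# Hypothetical HMM parameters, for illustration only
initial = {"PRON": 0.6, "VERB": 0.1, "NOUN": 0.3}                   # P(first tag)
transition = {("PRON", "VERB"): 0.7, ("VERB", "NOUN"): 0.5}         # P(tag_i | tag_{i-1})
emission = {("PRON", "I"): 0.4, ("VERB", "like"): 0.3, ("NOUN", "AI"): 0.2}  # P(word | tag)

def sequence_score(words, tags):
    # Multiply initial, emission, and transition probabilities step by step
    score = initial.get(tags[0], 0.0) * emission.get((tags[0], words[0]), 0.0)
    for i in range(1, len(words)):
        score *= transition.get((tags[i - 1], tags[i]), 0.0)
        score *= emission.get((tags[i], words[i]), 0.0)
    return score

print(sequence_score(["I", "like", "AI"], ["PRON", "VERB", "NOUN"]))
# 0.6 * 0.4 * 0.7 * 0.3 * 0.5 * 0.2 = 0.00504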