Language Models
• Models that assign probabilities to sequences of words are called language
models or LMs.
• The simplest model that assigns probabilities to sentences and sequences of
words is the n-gram.
• An n-gram is a sequence of n words: a 2-gram (which we’ll call bigram) is a two-
word sequence of words like “please turn”, “turn your”, or ”your homework”, and
a 3-gram (a trigram) is a three-word sequence of words like “please turn your”,
or “turn your homework”.
• We are interested in P(w|h), the probability of a word w given some history h. Suppose the history h is
“its water is so transparent that” and we want to know the probability that the
next word is the:
Language Models
• One way to estimate this probability is from relative frequency counts: take a very large
corpus, count the number of times we see its water is so transparent that, and count the
number of times this is followed by the. This answers the question “Out of the
times we saw the history h, how many times was it followed by the word w?”, as follows:

P(the | its water is so transparent that) = C(its water is so transparent that the) / C(its water is so transparent that)
• With a large enough corpus, such as the web, we can compute these counts and estimate the probability.
• While this method of estimating probabilities directly from counts works fine in many cases, it turns out that
even the web isn’t big enough to give us good estimates in most
cases.
• This is because language is creative; new sentences are created all the time, and we won’t always be able to
count entire sentences. Even simple extensions of the example sentence may have counts of zero on the web
(such as “Walden Pond’s water is so transparent that the”; well, used to have counts of zero).
• Similarly, if we wanted to know the joint probability of an entire sequence of words like its water is so
transparent, we could do it by asking “out of all possible sequences of five words, how many of them are its
water is so transparent?” We would have to get the count of its water is so transparent and divide by the sum of
the counts of all possible five word sequences. That seems rather a lot to estimate!
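
To make this concrete, here is a minimal Python sketch of the relative-frequency estimate. The two-sentence corpus string and the function name are invented for illustration, and raw substring counting stands in for proper tokenization.

# Relative-frequency estimate of P(w | h): count how often the history h occurs
# and how often it is followed by w. Substring counting is crude but enough here.
def relative_frequency(corpus: str, history: str, word: str) -> float:
    count_history = corpus.count(history)                     # times we saw h
    count_history_word = corpus.count(history + " " + word)   # times h was followed by w
    return count_history_word / count_history if count_history else 0.0

toy_corpus = ("its water is so transparent that the fish are visible . "
              "its water is so transparent that you can see the bottom .")
print(relative_frequency(toy_corpus, "its water is so transparent that", "the"))  # 0.5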
Sentences:
• Sentence 1: I like AI
• Sentence 2: I like ML
• Sentence 3: AI likes ML
Unigram Model
• In a unigram model, we calculate the probability of each word independently of the other words.
Step 1: Count words
Total words = 9
Word counts:
• I: 2
• like: 2
• AI: 2
• ML: 2
• likes: 1

Unigram probabilities:
• P(I) = 2/9
• P(like) = 2/9
• P(AI) = 2/9
• P(ML) = 2/9
• P(likes) = 1/9
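
As a minimal sketch, the unigram counts and probabilities above can be reproduced with a few lines of Python (standard library only; variable names are illustrative):

from collections import Counter

# The three example sentences from the slides.
sentences = ["I like AI", "I like ML", "AI likes ML"]
tokens = [w for s in sentences for w in s.split()]

counts = Counter(tokens)    # word frequencies
total = len(tokens)         # N = 9

# Unigram probability: P(w) = Count(w) / N
unigram_prob = {w: c / total for w, c in counts.items()}
print(unigram_prob)         # P(I) = 2/9, P(like) = 2/9, ..., P(likes) = 1/9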
Bigram Model
• In a bigram model, we calculate the probability of a word given
the previous word:
• P(w2 | w1) = Count(w1 w2) / Count(w1)

Bigram counts (from the sentences):
• I like: 2
• like AI: 1
• like ML: 1
• AI likes: 1
• likes ML: 1

Unigram counts (needed for the denominator):
• Count(I) = 2
• Count(like) = 2
• Count(AI) = 2
• Count(likes) = 1

Bigram probabilities:
• P(like | I) = 2 / 2 = 1.0
• P(AI | like) = 1 / 2 = 0.5
• P(ML | like) = 1 / 2 = 0.5
• P(likes | AI) = 1 / 2 = 0.5
• P(ML | likes) = 1 / 1 = 1.0
Trigram Model
• In a trigram model, we calculate the probability of a word given the two
previous words:
P(w3 | w1 w2) = Count(w1 w2 w3) / Count(w1 w2)
Trigram counts (from the sentences):
• I like AI: 1
• I like ML: 1
• AI likes ML: 1

Bigram counts (needed for the denominator):
• Count(I like) = 2
• Count(AI likes) = 1

Trigram probabilities:
• P(AI | I like) = 1 / 2 = 0.5
• P(ML | I like) = 1 / 2 = 0.5
• P(ML | AI likes) = 1 / 1 = 1.0
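
The bigram and trigram estimates above can be checked with a short sketch (standard library only; the ngram_counts helper is an illustrative name, not something defined in the slides):

from collections import Counter

sentences = ["I like AI", "I like ML", "AI likes ML"]
tokenized = [s.split() for s in sentences]

def ngram_counts(n):
    """Count n-grams within each sentence (no start/end padding, as in the slides)."""
    return Counter(tuple(sent[i:i + n]) for sent in tokenized
                   for i in range(len(sent) - n + 1))

unigrams, bigrams, trigrams = ngram_counts(1), ngram_counts(2), ngram_counts(3)

# Bigram:  P(w2 | w1)    = Count(w1 w2)    / Count(w1)
# Trigram: P(w3 | w1 w2) = Count(w1 w2 w3) / Count(w1 w2)
print(bigrams[("I", "like")] / unigrams[("I",)])               # P(like | I)    = 1.0
print(bigrams[("like", "AI")] / unigrams[("like",)])           # P(AI | like)   = 0.5
print(trigrams[("I", "like", "ML")] / bigrams[("I", "like")])  # P(ML | I like) = 0.5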
Predicting next word using bi-gram
• Step 2: Predict the next word
• Suppose we are given:
• Prompt: "I“
• We want to predict the next word using bigram probabilities, i.e.,
P(w2 | I)
• From bigram counts:
• P(like | I) = Count(I like) / Count(I) = 2 / 2 = 1.0
• → So, if the current word is "I", the most likely next word is "like".
Predicting next word using bi-gram
• Now try:
• Prompt: "like“
• Possible next words after "like":
• like AI → 1 time
• like ML → 1 time
• So: P(AI | like) = 1 / 2 = 0.5
• P(ML | like) = 1 / 2 = 0.5
• → The model is equally likely to choose AI or ML.
Predicting next word using bi-gram
• Prompt: "AI“
• From bigrams:
• AI likes → 1
• → So, P(likes | AI) = 1 / 2 = 0.5
• (since Count(AI) = 2 in the corpus: AI also appears at the end of "I like AI")
• → Likely next word: likes
• If the prompt is "I like",
• → You check what follows this bigram:
• Trigrams: I like AI, I like ML
• So, P(AI | I like) = 0.5 and P(ML | I like) = 0.5
Summary
• Prompt: I → Predicted next word: like
Prompt: like → Predicted next word: AI or ML
Prompt: AI → Predicted next word: likes
Prompt: likes → Predicted next word: ML
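
A minimal sketch of this prediction step, following the slides' convention of dividing by the unigram count of the prompt word (which is why P(likes | AI) = 0.5: the other half of AI's count comes from AI ending a sentence). The predict_next helper is an illustrative name:

from collections import Counter, defaultdict

sentences = ["I like AI", "I like ML", "AI likes ML"]
tokens = [w for s in sentences for w in s.split()]
unigram_counts = Counter(tokens)

# Bigram counts grouped by the first word.
followers = defaultdict(Counter)
for s in sentences:
    words = s.split()
    for w1, w2 in zip(words, words[1:]):
        followers[w1][w2] += 1

def predict_next(word):
    """Bigram probabilities P(w2 | word) = Count(word w2) / Count(word), as in the slides."""
    return {w2: c / unigram_counts[word] for w2, c in followers[word].items()}

print(predict_next("I"))     # {'like': 1.0}
print(predict_next("like"))  # {'AI': 0.5, 'ML': 0.5}  -> tie between AI and ML
print(predict_next("AI"))    # {'likes': 0.5}          -> remaining mass is sentence-final 'AI'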
Markov Chain

• A Markov chain or Markov process is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event. Informally, this may be thought of as, "What happens next depends only on the state of affairs now."
Markov Chain

• States: Like HMMs, Markov Chains have states. Each state represents a situation
or a condition.

• Transitions: What sets Markov Chains apart is the Markov property, which says
that the probability of transitioning to any particular state depends solely on
the current state and time elapsed. It's memoryless, like a goldfish that only
remembers the last state it was in.

• Probabilities: Each transition between states has a probability associated with it. These probabilities are often arranged in a transition matrix.
Markov Chain
• In NLP, Markov Chains are like linguistic time travelers. Imagine you're walking through a sentence, and
at each word, you decide where to go next based only on your current location. In NLP, this concept is
applied to generate text.

• Text Generation:

• Order 1 (corresponding to a bigram model): The simplest form involves predicting the next word based only on the current word. Each word is like a state, and the next word is chosen with probabilities based on the frequency of transitions in the training data.

• Order 2 (corresponding to a trigram model): Here, the next word depends on the current and the previous word. It's a bit more context-aware.

• Higher Orders: You can go to higher orders, where the next word depends on the last two, three, etc., words. However, this increases the complexity and requires more training data. (A small order-1 generation sketch follows.)
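
A short order-1 generation sketch along these lines. The training text is the toy corpus from the n-gram slides, sentence boundaries are ignored for simplicity, and the generate helper is an illustrative name:

import random
from collections import Counter, defaultdict

# Toy training text; sentence boundaries are ignored for simplicity.
text = "I like AI I like ML AI likes ML".split()

# Order-1 chain: transition counts from each word to its successor.
transitions = defaultdict(Counter)
for w1, w2 in zip(text, text[1:]):
    transitions[w1][w2] += 1

def generate(start, length=5):
    """Walk the chain: each next word depends only on the current word."""
    word, output = start, [start]
    for _ in range(length - 1):
        if word not in transitions:   # dead end: no observed successor
            break
        candidates = transitions[word]
        word = random.choices(list(candidates), weights=list(candidates.values()))[0]
        output.append(word)
    return " ".join(output)

print(generate("I"))   # e.g. "I like ML AI likes"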
Markov Chain
Limitations

• Memoryless: The Markov property can be limiting. It assumes that the future state depends only on the current state, not on how you got there.

• Fixed Transitions: Like HMMs, transitions are fixed during training, which might not capture the complexity of language.
Markov Chain

Applications

• Text Prediction: Predicting the next word in a sequence based on the current word.

• Random Text Generation: Generating text that resembles the training data. It's like a word-based improvisation.
Hidden Markov Models
• The Hidden Markov Model is a probabilistic model used to explain or derive the probabilistic characteristics of a random process. It says that an observed event does not correspond directly to the step-by-step state of the process, but is instead related to it through a set of probability distributions.
Hidden Markov Model
A Hidden Markov Model (HMM) is a statistical model that works with
sequences where we have:
1.Hidden states that we can't directly observe
2.Observable outputs that depend on these hidden states
The "Markov" part means that the current state depends only on the
previous state, not the entire history.
Key Components of HMMs
1.Hidden States: The internal states of the system that
we cannot directly observe
2.Observations: The visible outputs that we can measure
3.Transition Probabilities: The probability of moving
from one hidden state to another
4.Emission Probabilities: The probability of generating a
specific observation from a given hidden state
5.Initial State Probabilities: The probability of starting in
each possible hidden state
Real-time Example: Weather Prediction
Hidden States: The actual weather condition (Sunny, Rainy, Cloudy) - we can't
directly observe the atmospheric conditions.
Observations: Activities of a student (Study, Walk, Shop) that we can observe.
Scenario Setup:
• Initial State Probabilities:
• 40% chance of starting with Sunny
• 30% chance of starting with Rainy
• 30% chance of starting with Cloudy
• Transition Probabilities:
• If today is Sunny:
• 50% chance tomorrow will be Sunny
• 20% chance tomorrow will be Rainy
• 30% chance tomorrow will be Cloudy
• If today is Rainy:
• 30% chance tomorrow will be Sunny
• 40% chance tomorrow will be Rainy
• 30% chance tomorrow will be Cloudy
• If today is Cloudy:
• 40% chance tomorrow will be Sunny
• 30% chance tomorrow will be Rainy
• 30% chance tomorrow will be Cloudy
Emission Probabilities:
• If the weather is Sunny:
• 30% chance the student will Study
• 50% chance the student will Walk
• 20% chance the student will Shop
• If the weather is Rainy:
• 60% chance the student will Study
• 10% chance the student will Walk
• 30% chance the student will Shop
• If the weather is Cloudy:
• 40% chance the student will Study
• 30% chance the student will Walk
• 30% chance the student will Shop
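
Written down as plain Python dictionaries (variable names are illustrative), the scenario above looks like this; the same dictionary form is reused in the later sketches:

# The weather/student HMM from this example, encoded as plain dictionaries.
states = ("Sunny", "Rainy", "Cloudy")
observations = ("Study", "Walk", "Shop")

start_probability = {"Sunny": 0.4, "Rainy": 0.3, "Cloudy": 0.3}

transition_probability = {
    "Sunny":  {"Sunny": 0.5, "Rainy": 0.2, "Cloudy": 0.3},
    "Rainy":  {"Sunny": 0.3, "Rainy": 0.4, "Cloudy": 0.3},
    "Cloudy": {"Sunny": 0.4, "Rainy": 0.3, "Cloudy": 0.3},
}

emission_probability = {
    "Sunny":  {"Study": 0.3, "Walk": 0.5, "Shop": 0.2},
    "Rainy":  {"Study": 0.6, "Walk": 0.1, "Shop": 0.3},
    "Cloudy": {"Study": 0.4, "Walk": 0.3, "Shop": 0.3},
}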
HMM
• The states and observations are:
• states = ('Rainy', 'Sunny')
• observations = ('walk', 'shop', 'clean')

• And the start probability is:
• start probability = {'Rainy': 0.6, 'Sunny': 0.4}
HMM

• The start distribution puts more weight on the rainy state, so a day is more likely to begin rainy. The probabilities for the next day's weather states are as follows:

• transition probability = {'Rainy': {'Rainy': 0.7, 'Sunny': 0.3},
                            'Sunny': {'Rainy': 0.4, 'Sunny': 0.6}}
HMM

• The transition probabilities describe how the weather changes from one day to the next. Given the weather state, the emission probabilities give the chance of each observed activity:

• emission probability = {'Rainy': {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},
                          'Sunny': {'walk': 0.6, 'shop': 0.3, 'clean': 0.1}}
HMM

• Using the emission probabilities, we can predict the states of the weather from the observed activities; using the transition probabilities, we can predict which activity is likely to be performed the next day.
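
As a small illustration of the first direction (inferring the weather from what we observe), the sketch below applies Bayes' rule to a single observed activity using the start and emission probabilities above; posterior_given_observation is an illustrative name:

# Rainy/Sunny parameters from this example.
start_p    = {"Rainy": 0.6, "Sunny": 0.4}
emission_p = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
              "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

def posterior_given_observation(obs):
    """P(state | first observation) via Bayes' rule over the two states."""
    joint = {s: start_p[s] * emission_p[s][obs] for s in start_p}
    total = sum(joint.values())
    return {s: p / total for s, p in joint.items()}

print(posterior_given_observation("walk"))
# ≈ {'Rainy': 0.2, 'Sunny': 0.8}: observing 'walk' makes Sunny much more likely.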
Application of Hidden Markov Model
An application where an HMM is used aims to recover a data sequence that cannot be observed directly, where each element of the sequence depends on the earlier elements. Taking this intuition into account, the HMM can be used in the following applications:

• Document separation in scanning solutions
• Computational finance
• Machine translation
• Speed analysis
• Handwriting recognition
• Speech recognition
• Time series analysis
• Speech synthesis
• Activity recognition
• Part-of-speech tagging
• Sequence classification
• Transportation forecasting
Hidden Markov Models in NLP
• POS tagging is a very useful part of text preprocessing in NLP. Since NLP aims to make a machine able to communicate with a human or with another machine, it is essential for the machine to understand parts of speech.

• Classifying words by their part of speech and assigning the corresponding labels is called part-of-speech tagging (POS tagging, or POST). The set of labels/tags used is called a tagset.
Likelihood Computation

• Likelihood is all about figuring out how probable a sequence of observations is given the model's parameters. It's like asking, "How likely is it that this sequence of words was generated by the HMM?"
Likelihood Computation
The Likelihood Computation Process

• Emission Probability: First, you calculate the probability of observing each word in the sequence given the hidden state. This is the emission probability. It answers the question, "If the model is in this state, what's the probability of it emitting this word?"

• Transition Probability: Next, you consider the transitions between states. What's the likelihood of moving from one part of speech to another? This is the transition probability.

• Initial State Probability: Don't forget the starting point. What's the probability of starting in a particular state?

• Combining Probabilities: Now, you multiply these probabilities together for each step in your sequence, and sum over all possible hidden state sequences. This gives you the likelihood of the entire sequence (see the sketch below).
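
A minimal sketch of this computation for the Rainy/Sunny example, using the standard forward algorithm to sum over all hidden state sequences; sequence_likelihood is an illustrative name:

# Forward algorithm: likelihood of an observation sequence under an HMM.
# Parameters are the Rainy/Sunny example from the earlier slides.
states  = ("Rainy", "Sunny")
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
           "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p  = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
           "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

def sequence_likelihood(observations):
    """P(observations) = sum over all hidden state paths of (initial * transitions * emissions)."""
    # Initialisation: probability of starting in each state and emitting the first observation.
    forward = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    # Recursion: extend every partial path by one step, summing over previous states.
    for obs in observations[1:]:
        forward = {s: sum(forward[prev] * trans_p[prev][s] for prev in states) * emit_p[s][obs]
                   for s in states}
    # Termination: sum over the final state.
    return sum(forward.values())

print(sequence_likelihood(["walk", "shop", "clean"]))  # ≈ 0.033612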


Smoothing techniques
• Smoothing techniques are essential in language modeling to handle
the problem of zero probabilities for unseen n-grams (word
sequences) in the training data.
• If a language model assigns zero probability to an unseen n-gram, it
could severely affect the performance of applications like speech
recognition, machine translation, etc.
Add-One Smoothing (Laplace Smoothing)
Add 1 to the count of every possible word (or n-gram) to avoid zero
probabilities.
• Formula for Unigram: P(w) = (Count(w) + 1) / (N + V), where N is the total number of word tokens and V is the vocabulary size.
Add-One Smoothing (Laplace Smoothing)
• Example:
• Corpus: "the cat sat on the mat“
• Vocabulary V=5 (the, cat, sat, on, mat)
• Total words N=6
• Let's calculate the probability of the word "dog" (which is not in the corpus): P(dog) = (0 + 1) / (6 + 5) = 1/11 ≈ 0.09, instead of 0 without smoothing.
Add-k Smoothing (Generalized Laplace)
• Instead of adding 1, add a smaller value k ∈ (0, 1), such as 0.5 or 0.1, for better results: P(w) = (Count(w) + k) / (N + kV).

• Used when add-one over-smooths the probabilities (see the sketch below).
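
A compact sketch of add-k smoothing for unigrams on the example corpus above (k = 1 recovers add-one/Laplace; smoothed_prob is an illustrative name):

from collections import Counter

corpus = "the cat sat on the mat".split()
counts = Counter(corpus)
N = len(corpus)          # total tokens = 6
V = len(counts)          # vocabulary size = 5

def smoothed_prob(word, k=1.0):
    """Add-k estimate: P(word) = (Count(word) + k) / (N + k * V)."""
    return (counts[word] + k) / (N + k * V)

print(smoothed_prob("the"))          # add-one: (2 + 1) / (6 + 5)     = 3/11 ≈ 0.273
print(smoothed_prob("dog"))          # unseen:  (0 + 1) / (6 + 5)     = 1/11 ≈ 0.091
print(smoothed_prob("dog", k=0.1))   # add-0.1: (0 + 0.1) / (6 + 0.5)       ≈ 0.015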


Good-Turing Smoothing
• Adjust counts based on the frequency of frequencies. If many n-
grams occur only once, it's likely there are many unseen ones too.
