Unit-II
● In practice it’s more common to use trigram models, which condition on the
previous two words rather than the previous word, or 4-gram or even 5-gram
models, when there is sufficient training data.
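● As a rough illustration (not from the slides), the sketch below estimates bigram probabilities from a tiny hypothetical corpus by maximum-likelihood counting; the toy corpus and the sentence markers <s>, </s> are assumptions made for the example.
```python
from collections import Counter

# Hypothetical toy corpus with sentence-boundary markers (illustrative only).
tokens = ["<s>", "i", "am", "sam", "</s>",
          "<s>", "sam", "i", "am", "</s>"]

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    """Maximum-likelihood estimate P(word | prev) = C(prev, word) / C(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("i", "am"))    # 1.0: in this toy corpus "i" is always followed by "am"
print(bigram_prob("<s>", "i"))   # 0.5: one of the two sentences starts with "i"
```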
Example-2: Bi-gram probabilities
● What is the most probable next word predicted by the model for the following
word sequence:
Publicly available corpora
● The Project Gutenberg archive provides the plain text of a large collection of books.
● Google has also released a publicly available corpus of roughly one trillion words,
containing over 13 million unique words.
N-gram Language Model
● Advantages:
○ Easy to understand and implement
○ Easy to move from one n-gram order to another (e.g., from bigrams to trigrams)
● Disadvantages:
○ Underflow due to multiplication of many small probabilities
■ Solution: work in log space, so the product of probabilities becomes a sum of log probabilities (see the sketch below)
○ Zero probability problem
■ Solution: Use Laplace smoothing
Using Logs to Solve the Underflow Problem
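● A minimal sketch of the idea: multiplying many small probabilities underflows to 0.0 in floating point, while summing their logs stays numerically stable (the probability values below are made up for illustration).
```python
import math

probs = [1e-5] * 100          # hypothetical per-word probabilities

product = 1.0
for p in probs:
    product *= p              # repeated multiplication underflows
print(product)                # 0.0

log_score = sum(math.log(p) for p in probs)   # logs turn the product into a sum
print(log_score)              # about -1151.3, still usable for comparing sequences
```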
Zero Probability Problem
Smoothing
● To keep a language model from assigning zero probability to these
unseen events, we’ll have to shave off a bit of probability mass
from some more frequent events and give it to the events we’ve
never seen.
● This modification is called smoothing (or discounting).
● There are many ways to do smoothing, and some of them are:
○ Add-1 smoothing (Laplace smoothing)
○ Add-k smoothing
○ Backoff and interpolation
○ Kneser-Ney smoothing
Laplace Smoothing
● Add one to all n-gram counts before they are normalized into
probabilities.
● Not the highest-performing technique for language modeling, but a
useful method for tasks such as text classification
Applying Laplace Smoothing
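● A minimal sketch of add-1 smoothing, reusing the counter-based estimate from the earlier bigram example (the function name and the vocabulary-size argument are assumptions for illustration): add 1 to each bigram count and add V, the vocabulary size, to the denominator.
```python
def laplace_bigram_prob(prev, word, bigram_counts, unigram_counts, vocab_size):
    """Add-1 estimate: P(word | prev) = (C(prev, word) + 1) / (C(prev) + V)."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + vocab_size)

# An unseen bigram now receives a small non-zero probability instead of zero.
```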
Add-K Smoothing
● Rather than adding one to each count, add a fractional count (e.g., 0.5, 0.05, or 0.01)
● The value of K can be optimized on a validation set
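● As a sketch in LaTeX, the add-k estimate generalizes the Laplace formula by replacing the added 1 with k (V is the vocabulary size):
```latex
P_{\text{Add-}k}(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i) + k}{C(w_{i-1}) + kV}
```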
Backoff and Interpolation
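● As a hedged sketch of the idea behind this heading: linear interpolation mixes trigram, bigram, and unigram estimates with weights λ that sum to one (usually tuned on held-out data), while backoff falls back to a lower-order estimate only when the higher-order count is zero.
```latex
\hat{P}(w_i \mid w_{i-2} w_{i-1}) =
    \lambda_1\, P(w_i \mid w_{i-2} w_{i-1})
  + \lambda_2\, P(w_i \mid w_{i-1})
  + \lambda_3\, P(w_i),
\qquad \lambda_1 + \lambda_2 + \lambda_3 = 1
```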
● Using an HMM and an available corpus, we can find the most likely sequence of tags
for a sentence from the estimated probability values.
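● As a sketch of what "most likely sequence of tags" means here: the tagger picks the tag sequence that maximizes the product of emission probabilities P(word | tag) and transition probabilities P(tag | previous tag), both estimated from the corpus; the Viterbi algorithm searches this maximum efficiently.
```latex
\hat{t}_{1:n} = \arg\max_{t_1 \ldots t_n}
    \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
```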
Issues with Markov Model Tagging
● Principle of the Max. Entropy Model: given a collection of facts (features), choose the model that is
consistent with all the facts and otherwise as uniform as possible.
Features in Max. Entropy Model
Examples of features:
PoS Tagging in Max. Entropy Model
Entropy in Max Entropy Model
● It measures the uncertainty (surprise) of a distribution.
● For a single event x with probability of occurrence P(x), the surprise (self-information) is log(1/P(x))
● Entropy H for a random variable X with probability distribution P is the expected surprise:
H(X) = ∑x P(x) log(1/P(x)) = −∑x P(x) log P(x)
● So, in the Max. Entropy Model, we choose the model with maximum entropy, subject
to the feature-based constraints.
● We start from a uniform distribution (because it has maximum entropy) and then
add constraints, which decrease the entropy and bring the model closer to the
given data.
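● A minimal sketch (the distributions are made up) showing that the uniform distribution over three tags has the maximum entropy, and that skewing the distribution, as constraints do, lowers it:
```python
import math

def entropy(dist):
    """H(X) = -sum_x P(x) * log2 P(x), in bits."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

print(entropy([1/3, 1/3, 1/3]))    # ~1.585 bits: uniform, the maximum for 3 outcomes
print(entropy([0.9, 0.05, 0.05]))  # ~0.569 bits: a skewed distribution has lower entropy
```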
Example
Example Contd…
● F1 = F1(x,y) = [word = ‘a’ & tag = ‘D’]
● F2 = F2(x,y) = [word = ‘man’ & tag = ‘N’]
● F3 = F3(x,y) = [word = ‘sleeps’ & tag = ‘V’]
● F4 = F4(x,y) = [word ∈ V′ & tag = ‘D’], where V′ = V − {a, man, sleeps};
V is the vocabulary
● F5 = F5(x,y) = [word ∈ V′ & tag = ‘N’]
● F6 = F6(x,y) = [word ∈ V′ & tag = ‘V’]
● Now, P(D|cat) = e^(∑λiFi)/Z
○ ∑λiFi = λ1*0 + λ2*0 + λ3*0 + λ4*1 + λ5*0 + λ6*0 = λ4
○ To calculate Z, we need to calculate P(N|cat) and P(V|cat)
○ P(N|cat) = e^λ5/Z and P(V|cat) = e^λ6/Z; also,
P(D|cat) + P(N|cat) + P(V|cat) = 1 => Z = e^λ4 + e^λ5 + e^λ6
○ Hence, P(D|cat) = e^λ4/(e^λ4 + e^λ5 + e^λ6)
● Similarly, we can calculate P(N|laugh) and P(D|man)
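● A minimal sketch of the computation above in code, with made-up values for λ4, λ5, and λ6 (any word in V′, such as ‘cat’, fires exactly one of F4, F5, F6 per tag, so P(tag|cat) is just a normalized exponential of the corresponding weight):
```python
import math

lambdas = {4: 1.2, 5: 0.8, 6: 0.3}    # hypothetical weights, not from the slides

scores = {"D": math.exp(lambdas[4]),  # F4 fires: word in V' and tag = D
          "N": math.exp(lambdas[5]),  # F5 fires: word in V' and tag = N
          "V": math.exp(lambdas[6])}  # F6 fires: word in V' and tag = V
Z = sum(scores.values())              # Z = e^λ4 + e^λ5 + e^λ6

for tag, score in scores.items():
    print(f"P({tag}|cat) = {score / Z:.3f}")
```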
Example Contd…
● P(D|a) = e^λ1/Z; to calculate Z here, we have:
○ P(V|a) = e^0/Z = 1/Z and P(N|a) = e^0/Z = 1/Z
○ Hence, P(D|a) = e^λ1/(e^λ1 + 2) = 0.9
● Similarly, we can set up the equations for the other given constraints and
solve for the values of the λs, which together represent the overall maximum-entropy
model for the given problem.
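● As a worked step under the constraint value shown (P(D|a) = 0.9), λ1 can be solved in closed form:
```latex
\frac{e^{\lambda_1}}{e^{\lambda_1} + 2} = 0.9
\;\Rightarrow\; e^{\lambda_1} = 0.9\, e^{\lambda_1} + 1.8
\;\Rightarrow\; 0.1\, e^{\lambda_1} = 1.8
\;\Rightarrow\; e^{\lambda_1} = 18
\;\Rightarrow\; \lambda_1 = \ln 18 \approx 2.89
```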