
Language Modeling

Unit-II

Syed Rameem Zahra


(Assistant Professor)
Department of CSE, NSUT
What is a Language Model in NLP?
● A language model learns to predict the probability of a sequence of
words.
● It is a statistical tool that analyzes patterns in human language to predict words by estimating the relative likelihood of different phrases.
● The models are prepared for the prediction of words by learning the
features and characteristics of a language.
● Language models are used in speech recognition, machine
translation, part-of-speech tagging, parsing, Optical Character
Recognition, handwriting recognition, information retrieval,
summarization, spell correction, and many other daily tasks.
Challenges with Language Modeling

● Formal languages (like programming languages) are precisely defined, but natural language is not designed; it evolves with the convenience and learning of its speakers.
● Many terms in natural language can be used in a number of ways, which introduces ambiguity, yet humans can still understand them.
Some Common Examples of Language Models
● Speech Recognition
○ Voice assistants such as Siri and Alexa are examples of how language models help
machines in processing speech audio.
● Machine Translation
○ Google Translate and Microsoft Translator are examples of how NLP models can help in translating text from one language to another.
● Sentiment Analysis
○ This helps in analyzing the sentiments behind a phrase. This use case of NLP models is
used in products that allow businesses to understand a customer’s intent behind
opinions or attitudes expressed in the text.
● Text Suggestions
○ Google services such as Gmail or Google Docs use language models to help users get
text suggestions while they compose an email or create long text documents,
respectively.
● Parsing Tools
○ Parsing involves analyzing whether sentences or words comply with syntax or grammar rules. Spell-checking tools are good examples of language modelling and parsing.
Types of Language Models

● Statistical Language Models:


○ These models use traditional statistical techniques like N-grams,
Hidden Markov Models (HMM) and certain linguistic rules to learn the
probability distribution of words.
● Neural Language Models:
○ These are new players in the NLP town and have surpassed the
statistical language models in their effectiveness. They use different
kinds of Neural Networks to model language.
Goal of Probabilistic Language Modelling
● To calculate the probability of a sentence or a sequence of words:
○ P(W) = P(w1,w2,..., wn)
● This is the Joint Probability
● It can be calculated using Conditional probability:
○ P(w5 | w1,w2,w3,w4)
● E.g. for two words: X, Y, we have:
○ P(X|Y) = P(X,Y)/P(Y)
○ then, P(X,Y) = P(X|Y) P(Y)
● Similarly, for three words: P(X,Y,Z) = P(X) P(Y|X) P(Z|X,Y)
● This chain-rule expansion is practical when we have short/limited sentences.
● For longer sentences, we can use the Markov assumption:
○ P(wn | w1,w2,w3,...,wn-1) ≈ P(wn|wn-1) or P(wn|wn-2,wn-1)
N-Gram Model
● This is one of the simplest approaches to language modelling.
● Here, a probability distribution is created for sequences of ‘n’ words, where ‘n’ can be any number and defines the size of the gram (the sequence of words being assigned a probability).
● If n=4, a gram may look like: “can you help me”.
● Basically, ‘n’ is the amount of context that the model is trained to
consider.
● There are different types of N-Gram models such as unigrams,
bigrams, trigrams, etc.
● The intuition of the n-gram model is that instead of computing the
probability of a word given its entire history, we can approximate
the history by just the last few words.
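As a quick illustration of the sliding-window idea above, here is a minimal Python sketch; the function name ngrams and the sample sentence are illustrative choices, not part of the slides:

```python
# Minimal sketch: extract n-grams from a tokenized sentence with a sliding window.
def ngrams(tokens, n):
    """Return the list of n-grams (tuples of n consecutive tokens)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "can you please help me with this".split()
print(ngrams(tokens, 2))  # bigrams, e.g. ('can', 'you'), ('you', 'please'), ...
print(ngrams(tokens, 4))  # 4-grams, e.g. ('can', 'you', 'please', 'help'), ...
```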
(Uni-) 1-gram model

● The simplest case of the Markov assumption is a window of size one: each word is considered on its own, with no preceding context.
● This gives a grammar that considers only one word at a time; as a result, it produces a set of unrelated words.
● Sentences generated from a unigram model effectively have random word order.
Bigram Model

● Approximates the probability of a word given all the previous words by using only the conditional probability of the immediately preceding word.
● That is, we consider only two-word (tandem) bigram correlations.
● In other words, instead of computing the probability
○ P(the|Walden Pond’s water is so transparent that)
● we approximate it with the probability
○ P(the|that)
How to estimate these bigram or n-gram probabilities?
● An intuitive way to estimate probabilities is called maximum
likelihood estimation or MLE.
● We get the maximum likelihood estimate for the parameters of an n-gram model by getting counts from a corpus and normalizing the counts so that they lie between 0 and 1.
● For example, to compute a particular bigram probability of a word wn given a previous word wn−1, we compute the count of the bigram C(wn−1 wn) and normalize by the sum of all the bigrams that share the same first word wn−1, which is simply the unigram count C(wn−1):
○ P(wn|wn−1) = C(wn−1 wn) / C(wn−1)
Note: A corpus is a collection of authentic text or audio organized into datasets.
Example-1
● We have a mini-corpus of three sentences:
● <s> I am Sam </s>
● <s> Sam I am </s>
● <s> I do not like green eggs and ham </s>
● Here are the calculations for some of the bigram probabilities from this corpus:
○ P(I|<s>) = 2/3, P(Sam|<s>) = 1/3, P(am|I) = 2/3
○ P(</s>|Sam) = 1/2, P(Sam|am) = 1/2, P(do|I) = 1/3
● In practice it’s more common to use trigram models, which condition on the
previous two words rather than the previous word, or 4-gram or even 5-gram
models, when there is sufficient training data.
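A minimal sketch (plain Python, no external libraries) of the MLE bigram estimates on this mini-corpus; the function and variable names are illustrative:

```python
from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def p_bigram(w_prev, w):
    # MLE: count of the bigram divided by count of the preceding word.
    return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]

print(p_bigram("<s>", "I"))     # 2/3
print(p_bigram("I", "am"))      # 2/3
print(p_bigram("Sam", "</s>"))  # 1/2
```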
Example-2: Bi-gram probabilities
● What is the most probable next word predicted by the model for the following
word sequence:
Example-2: Bi-gram probabilities (Contd…)
Publicly available corpora
● The Gutenberg Project provides the text of many books.
● Google has also released a publicly available trillion-word corpus with over 13 million unique words.
N-gram Language Model

● Advantages:
○ Easy to understand and implement
○ Conversion from one type of gram to another is easy.
● Disadvantages:
○ Underflow due to multiplication of many small probabilities
■ Solution: use log probabilities, so products become sums (see the sketch below)
○ Zero-probability problem for unseen n-grams
■ Solution: use Laplace smoothing
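A small sketch of the underflow problem and the log fix; the probability value 1e-5 and the sentence length of 100 are made-up numbers for illustration:

```python
import math

# Multiplying many small probabilities underflows to 0.0 in floating point...
probs = [1e-5] * 100          # a hypothetical 100-word sentence, each word with P = 1e-5
product = 1.0
for p in probs:
    product *= p
print(product)                # 0.0 (underflow)

# ...but summing log-probabilities stays perfectly representable.
log_prob = sum(math.log(p) for p in probs)
print(log_prob)               # about -1151.29
```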
Using Log to solve underflow problem
Zero Probability Problem
Smoothing
● To keep a language model from assigning zero probability to these
unseen events, we’ll have to shave off a bit of probability mass
from some more frequent events and give it to the events we’ve
never seen.
● This modification is called smoothing (or discounting).
● There are many ways to do smoothing, and some of them are:
○ Add-1 smoothing (Laplace Smoothing)
○ Add-k smoothing,
○ Backoff
○ Kneser-Ney smoothing.
Laplace Smoothing

● Add one to all n-gram counts before they are normalized into probabilities:
○ P_Laplace(wn|wn−1) = (C(wn−1 wn) + 1) / (C(wn−1) + V), where V is the vocabulary size
● Not the highest-performing technique for language modeling, but a useful method for text classification.
Applying Laplace Smoothing
Add-K Smoothing

● Rather than adding one to each count, add a fractional count k (e.g., 0.5, 0.05, 0.01):
○ P_add-k(wn|wn−1) = (C(wn−1 wn) + k) / (C(wn−1) + kV)
● The value of k can be optimized on a validation set (see the sketch below)
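A minimal sketch of add-k smoothed bigram estimates, reusing the mini-corpus from Example-1; setting k = 1 gives Laplace (add-one) smoothing. Function and variable names are illustrative:

```python
from collections import Counter

corpus = ["<s> I am Sam </s>", "<s> Sam I am </s>",
          "<s> I do not like green eggs and ham </s>"]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

V = len(unigrams)  # vocabulary size

def p_addk(w_prev, w, k=1.0):
    # Add-k smoothing: k = 1 is Laplace (add-one) smoothing.
    return (bigrams[(w_prev, w)] + k) / (unigrams[w_prev] + k * V)

print(p_addk("I", "am"))               # smoothed estimate for a seen bigram
print(p_addk("Sam", "green"))          # non-zero estimate for an unseen bigram
print(p_addk("Sam", "green", k=0.05))  # smaller k moves less mass to unseen events
```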
Backoff and Interpolation

● Add-K smoothing is useful for some tasks, but still tends to be suboptimal for language modeling.
● Other techniques are:
○ Backoff: we use the trigram if the evidence is sufficient, otherwise
we use the bigram, otherwise the unigram.
■ In other words, we only “back off” to a lower-order n-gram if we
have zero evidence for a higher-order n-gram.
○ Interpolation: we always mix the probability estimates from all the
n-gram estimators, weighing and combining the trigram, bigram, and
unigram counts.
Simple Linear Interpolation

● We combine different order n-grams by linearly interpolating all the models:
○ P_interp(wn|wn−2,wn−1) = λ1 P(wn|wn−2,wn−1) + λ2 P(wn|wn−1) + λ3 P(wn), with λ1 + λ2 + λ3 = 1
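A minimal sketch of simple linear interpolation; the λ weights and the stand-in probability functions below are arbitrary illustrative values, not estimates from a real corpus (in practice the λs are tuned on held-out data and must sum to 1):

```python
def interpolated_p(w, w1, w2, p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    """P_interp(w | w2 w1) = l1*P(w) + l2*P(w|w1) + l3*P(w|w2,w1).

    p_uni, p_bi, p_tri are lookup functions for the individual n-gram models;
    they are assumed to return MLE (or smoothed) probabilities.
    """
    l1, l2, l3 = lambdas
    return l1 * p_uni(w) + l2 * p_bi(w, w1) + l3 * p_tri(w, w1, w2)

# Toy usage with hard-coded probabilities standing in for real models:
p = interpolated_p(
    "water", "Pond's", "Walden",
    p_uni=lambda w: 0.001,
    p_bi=lambda w, w1: 0.02,
    p_tri=lambda w, w1, w2: 0.3,
)
print(p)  # 0.1 * 0.001 + 0.3 * 0.02 + 0.6 * 0.3 = 0.1861
```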
Example
Text Classification
● Text Classification (Text Categorization) is the task of assigning a label or category to an entire text or document.
● Some common text categorization tasks are:
○ Sentiment analysis
■ extraction of sentiment: the positive or negative orientation that a writer expresses toward an object.
○ Spam detection
■ binary classification task of assigning an email to one of the two classes spam or
not-spam.
○ Authorship identification
■ determining a text’s author.
○ Age/gender identification
■ determining a text’s author characteristics like gender and age.
○ Language Identification
■ finding the language of a text.
Text Classification: Definition

● Text classification can be defined as follows:


● Input:
○ a document d
○ a fixed set of classes C = {c1 , c2 ,…, cn }
● Output:
○ a predicted class c ∈ C
Classification Methods: Hand-Coded Rules

● The goal of classification is to take a single observation, extract some useful features, and thereby classify the observation into one of a set of discrete classes.
● One method for classifying text is to use hand-written rules.
● Rules are based on combinations of words or other features (see the sketch below):
○ spam: black-list-address OR (“dollars” AND “have been selected”)
● Accuracy can be high if rules are carefully refined by experts.
● But building and maintaining these hand-written rules can be
expensive.
○ Rules can be fragile.
○ It may require domain knowledge.
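A minimal sketch of the hand-written spam rule shown above; the blacklist addresses and the function name is_spam are hypothetical:

```python
# Hypothetical blacklist used by the rule; real systems maintain a much larger list.
BLACKLISTED_ADDRESSES = {"promo@spammy.example", "winner@lottery.example"}

def is_spam(sender, body):
    """Hand-coded rule: blacklisted address OR ("dollars" AND "have been selected")."""
    text = body.lower()
    return (sender in BLACKLISTED_ADDRESSES
            or ("dollars" in text and "have been selected" in text))

print(is_spam("friend@mail.example", "You have been selected to win a million dollars!"))  # True
print(is_spam("friend@mail.example", "Lunch tomorrow?"))                                   # False
```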
Classification Methods: Supervised Machine Learning

● Most cases of text classification in language processing are done via supervised machine learning methods.
● The goal of a supervised machine learning algorithm is to learn
how to map from a new observation to a correct output.
● Input:
○ a document d
○ a fixed set of classes C = {c1 , c2 ,…, cn }
○ a training set of m hand-labeled documents (d1 ,c1 ),....,(dm,cm)
● Output:
○ a learned classifier model: d → c
Classification Methods: Supervised Machine Learning
● Our goal is to learn a classifier that is capable of mapping from a new
document d to its correct class c ∈ C.
● A probabilistic classifier additionally will tell us the probability of the observation
being in the class.
● Generative classifiers like Naive Bayes build a model of how a class could
generate some input data.
○ Given an observation, they return the class most likely to have generated the observation.
● Discriminative classifiers like logistic regression instead learn what features
from the input are most useful to discriminate between the different possible
classes.
● Some classifiers:
○ Naïve Bayes
○ Logistic regression
○ Support-vector machines
○ k-Nearest Neighbors, …
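A minimal sketch of supervised text classification with a generative classifier (multinomial Naive Bayes over bag-of-words counts), assuming scikit-learn is available; the tiny training set is invented for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hand-labeled training documents (d1, c1), ..., (dm, cm) -- invented examples.
train_docs = ["win a free prize now", "cheap dollars have been selected",
              "meeting agenda for tomorrow", "lunch at noon with the team"]
train_labels = ["spam", "spam", "not-spam", "not-spam"]

# Pipeline: bag-of-words features -> Naive Bayes classifier (a learned model d -> c).
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(train_docs, train_labels)

print(clf.predict(["free prize dollars"]))      # likely 'spam'
print(clf.predict_proba(["meeting at lunch"]))  # class probabilities from the model
```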
Text Classification: Evaluation
● In order to evaluate how good our classifier is, we can use different evaluation metrics.
● In evaluation, we compare the test set results of our classifier with gold
labels (the human labels for the test set documents).
● As a result of this comparison, we first build a contingency table (or confusion matrix) before calculating our evaluation metrics:
Text Classification: Evaluation
● Accuracy is the percentage of all observations that our system labeled correctly.
● Precision measures the percentage of items the system detected (i.e., labeled positive) that are in fact positive.
● Recall measures the percentage of items actually present in the input that were correctly identified by the system.
● Precision and Recall, unlike Accuracy, emphasize true positives.
○ Looking at only one of them can be misleading.
○ tp=1 fp=0 fn=99, then, Precision = 100% (while Recall=1%)
○ tp=1 fp=99 fn=0, then, Recall= 100% (while Precision=1%)
● F-measure is a single metric that incorporates aspects of both precision and
recall.
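A minimal sketch computing precision, recall, and F-measure from the counts in the confusion matrix; the tp/fp/fn values reproduce the first extreme case listed above:

```python
def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1(p, r):
    # F-measure with equal weight on precision and recall (harmonic mean).
    return 2 * p * r / (p + r) if (p + r) else 0.0

tp, fp, fn = 1, 0, 99
p, r = precision(tp, fp), recall(tp, fn)
print(p, r, f1(p, r))  # 1.0, 0.01, ~0.0198 -- high precision alone can be misleading
```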
Text Classification: Evaluation
Text Classification with more than two classes
Microaveraging and Macroaveraging
● In order to derive a single metric that tells us how well the system is
doing, we can combine these values in two ways.
○ In macroaveraging, we compute performance for each class separately, and then average over classes.
○ In microaveraging, we collect the decisions for all classes into a single contingency table, and then compute precision and recall from that table.
Cross-validation
● We randomly choose a training and test set division of our data, train our classifier, and then compute the error rate on the test set.
● Then we repeat with a different randomly selected training set and test
set.
● We do this sampling process 10 times and average these 10 runs to get an average error rate.
● This is called 10-fold cross-validation.
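A minimal plain-Python sketch of the 10-fold procedure; the dataset and the trivial most-frequent-label "classifier" are stand-ins so that the fold mechanics stay visible:

```python
import random

# Stand-in labeled dataset: (document, label) pairs.
data = [(f"doc {i}", "pos" if i % 3 else "neg") for i in range(50)]

def train(train_set):
    # Trivial classifier: always predict the most frequent training label.
    labels = [label for _, label in train_set]
    return max(set(labels), key=labels.count)

def error_rate(predicted_label, test_set):
    return sum(1 for _, label in test_set if label != predicted_label) / len(test_set)

random.seed(0)
random.shuffle(data)
k = 10
fold_size = len(data) // k
errors = []
for i in range(k):
    test = data[i * fold_size:(i + 1) * fold_size]               # the i-th fold is held out
    train_set = data[:i * fold_size] + data[(i + 1) * fold_size:]
    errors.append(error_rate(train(train_set), test))

print(sum(errors) / k)  # average error rate over the 10 folds
```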
Part-of-Speech Tagging
● Given a text in English, identify the part of speech of each word.
Part-of-Speech Tagging

● Open class words (content words):


○ Nouns, verbs, adjectives, adverbs.
○ They refer to objects, actions and features in the world.
○ They are open class because new words are added all the
time.
● Closed class words:
○ Pronouns, determiners, prepositions, connectives,...
○ They are limited.
○ Mostly functional: they tie the concepts of a sentence together.
Part-of-Speech Tagging
POS Tagging: Choosing a target

● For POS tagging, we need to choose a standard set.


● E.g., we could choose very coarse targets like N, V, Adj, Adv, etc.
● A commonly used set is the UPenn TreeBank tagset, which contains 45 tags.
UPenn TreeBank POS tagset
POS Tagging is hard???
● A word in a sentence may have multiple POS tags depending on the
context.
○ E.g., For the word “Back”, we have:
○ The back door: back/JJ -> Adjective
○ On my back: back/NN -> Noun
○ Win the voters back: back/RB -> Adverb
● It has been observed that many words have 2 to 3 possible tags (the ambiguity problem).
● We can use any valid corpus to find the most probable tag for a word while designing a tagging model.
○ E.g.: Some words may only be nouns like arrow
○ Some words are ambiguous like flies
○ Probability may help us if one tag is more likely than another.
○ Also the local context can be used.
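A minimal sketch of the "most frequent tag in the corpus" idea for ambiguous words like back; the tiny tagged corpus is invented for illustration:

```python
from collections import Counter, defaultdict

# Invented (word, tag) pairs standing in for a real tagged corpus.
tagged_corpus = [("the", "DT"), ("back", "JJ"), ("door", "NN"),
                 ("on", "IN"), ("my", "PRP$"), ("back", "NN"),
                 ("win", "VB"), ("the", "DT"), ("voters", "NNS"), ("back", "RB"),
                 ("the", "DT"), ("back", "JJ"), ("seat", "NN")]

tag_counts = defaultdict(Counter)
for word, tag in tagged_corpus:
    tag_counts[word][tag] += 1

def most_frequent_tag(word):
    # Baseline tagger: pick the tag this word received most often in the corpus.
    return tag_counts[word].most_common(1)[0][0] if word in tag_counts else "NN"

print(most_frequent_tag("back"))  # 'JJ' here, since JJ occurs twice for 'back'
print(most_frequent_tag("door"))  # 'NN'
```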
POS Tagging Approaches

● Rule based approach:


○ Assign each word in the input sentence a list of potential POS tags.
○ Then, scale down the list to a single tag using hand-written rules.
● Statistical tagging:
○ Get a training corpus of tagged text and learn transformation rules from the most frequent tags (e.g., the TBL (Transformation-Based Learning) tagger).
○ Find the most likely sequence of tags for a sequence of words using
probability.
Generative and Conditional Models
● Generative (Joint) Model:
○ Generate the observed data from hidden stuff, i.e., put a probability over the observation given
the class: P(d,c) in terms of P(d|c)
○ E.g., Naive Bayes classification, Hidden Markov Models, etc.
● Conditional (Discriminative) Models:
○ Take the data as given and put a probability over hidden structure given the data: P(c|d)
○ E.g., logistic regression, maximum entropy models, SVMs, perceptron, etc.
Generative Model: Hidden Markov Model
● A hidden Markov model (HMM) is a statistical Markov
model in which the system being modeled is assumed to
be a Markov process — call it X with unobservable
("hidden") states.
● As part of the definition, HMM requires that there be an
observable process Y whose outcomes are "influenced" by
the outcomes of X in a known way.
● Since X cannot be observed directly, the goal is to learn
about X by observing Y.
● HMM has an additional requirement that the outcome of Y at time t = t0 must be "influenced" exclusively by the outcome of X at time t = t0, and that the outcomes of X and Y at t < t0 must not affect the outcome of Y at t = t0.
● Similar to N-gram models
● Model the text as a sequence
● For n-grams, we modeled the probability of each word conditioned on the previous n-1 words.
● Here, we model each tag conditioned on the previous tags
● Still uses the Markov assumption: only look back a few tags
● (Figure legend: X = states, y = possible observations, a = state transition probabilities, b = output probabilities)
Generative Model: Hidden Markov Model
Estimating Parameters: MLE in Hidden Markov Model
Example

Note: We can also apply smoothing techniques here (as discussed earlier)
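A minimal sketch of MLE estimation of HMM transition and emission probabilities from a tagged corpus, and of the likelihood of a tagged sentence computed from them; the tiny corpus is invented and smoothing is omitted for brevity:

```python
from collections import Counter

# Invented tagged sentences: lists of (word, tag) pairs.
corpus = [[("the", "DT"), ("dog", "NN"), ("runs", "VBZ")],
          [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")]]

transition = Counter()  # counts of (previous_tag, tag), with <s> as the start state
emission = Counter()    # counts of (tag, word)
tag_totals = Counter()  # counts of transitions out of each state (denominator)
for sent in corpus:
    prev = "<s>"
    for word, tag in sent:
        transition[(prev, tag)] += 1
        emission[(tag, word)] += 1
        tag_totals[prev] += 1
        prev = tag

def p_trans(prev, tag):
    return transition[(prev, tag)] / tag_totals[prev]

def p_emit(tag, word):
    return emission[(tag, word)] / sum(c for (t, w), c in emission.items() if t == tag)

def likelihood(tagged_sentence):
    # P(words, tags) = product over i of P(tag_i | tag_{i-1}) * P(word_i | tag_i)
    p, prev = 1.0, "<s>"
    for word, tag in tagged_sentence:
        p *= p_trans(prev, tag) * p_emit(tag, word)
        prev = tag
    return p

print(likelihood([("the", "DT"), ("dog", "NN"), ("runs", "VBZ")]))  # 0.25
```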


Three Tasks for HMM

● Likelihood: Given a tagged sequence, determine its likelihood


● Decoding: Given an untagged sequence, determine the best
tag sequence for it
● Learning: Given an untagged sequence, and a set of tags,
learn the HMM parameters
Likelihood of a tagged sentence
Example: Disambiguation of “race”

● Using an HMM and an available corpus, we can find the most likely sequence of tags using probability values.
Issues with Markov Model Tagging

● Missing probabilities for unknown words.


○ Solution: use morphological cues (capitalization, suffixes, etc.) to make a more informed guess.
● Limited context may not be sufficient for correct tagging.
○ Solution: use higher order HMM (like 2nd order or 3rd order etc.)
and combine various N-gram models.
Maximum Entropy Model: Discriminative Model
● Uses a combination of a heterogeneous set of features to create a probabilistic model that can select the correct PoS tag for the current word, e.g.:
○ Whether the next word is “to”.
○ Whether one of the last 5 words is a preposition, etc.

● Principle of the Maximum Entropy Model: given a collection of facts (features), choose the model that is consistent with all the facts and otherwise as uniform as possible.
Features in Max. Entropy Model

Examples of features:
PoS Tagging in Max. Entropy Model
Entropy in Max Entropy Model
● It measures the uncertainty (surprise) of a distribution.
● For an event x with probability of occurrence px, the surprise (self-information) is log(1/px).
● The entropy H of a random variable X with probability distribution P is given as:
○ H(X) = Σx P(x) log(1/P(x)) = −Σx P(x) log P(x)
● So, in the Maximum Entropy Model, we choose the model with maximum entropy, subject to feature-based constraints.
● We start from a uniform distribution (because it has maximum entropy) and then add constraints, which decrease the entropy and make the model closer to the given data.
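A minimal sketch computing H(X) for a few distributions, illustrating why the uniform distribution has the maximum entropy:

```python
import math

def entropy(dist):
    # H = sum over outcomes of p * log2(1/p); outcomes with p = 0 contribute 0.
    return sum(p * math.log2(1 / p) for p in dist if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits -- uniform, maximum entropy
print(entropy([0.7, 0.1, 0.1, 0.1]))      # ~1.357 bits -- more peaked, lower entropy
print(entropy([1.0, 0.0, 0.0, 0.0]))      # 0.0 bits -- no uncertainty at all
```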
Example
Example Contd…
● F1 = F1(x,y) = [word = ‘a’ & tag = ‘D’]
● F2 = F2(x,y) = [word = ‘man’ & tag = ‘N’]
● F3 = F3(x,y) = [word = ‘sleeps’ & tag = ‘V’]
● F4 = F4(x,y) = [word ∈ V` & tag = ‘D’], Where V` =V - {a,man,sleeps};
V is Vocabulary
● F5 = F5(x,y) = [word ∈ V` & tag = ‘N’]
● F6 = F6(x,y) = [word ∈ V` & tag = ‘V’]
● Now, P(D|cat) = e^(∑λiFi)/Z
○ ∑λiFi = λ1·0 + λ2·0 + λ3·0 + λ4·1 + λ5·0 + λ6·0 = λ4
○ To calculate Z, we also need P(N|cat) and P(V|cat)
○ P(N|cat) = e^λ5/Z and P(V|cat) = e^λ6/Z; also, P(D|cat) + P(N|cat) + P(V|cat) = 1 ⇒ Z = e^λ4 + e^λ5 + e^λ6
○ Hence, P(D|cat) = e^λ4/(e^λ4 + e^λ5 + e^λ6)
● Similarly, we can calculate P(N|laugh) and P(D|man)
Example Contd…
● P(D|a) = e^λ1/Z; to calculate Z here, we have:
○ P(V|a) = e^0/Z = 1/Z and P(N|a) = e^0/Z = 1/Z
○ Hence, P(D|a) = e^λ1/(e^λ1 + 2) = 0.9
● Similarly, we can write the equations for the other given constraints and solve for the values of the λs, which together represent the overall maximum entropy model for the given problem.
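A small numeric check of the calculation above: solving e^λ1/(e^λ1 + 2) = 0.9 gives e^λ1 = 18, i.e. λ1 = ln 18 ≈ 2.89. The sketch below verifies this and recomputes the tag distribution for the word ‘a’ (the 0.9 target is the constraint used in the slide's example):

```python
import math

target = 0.9                                 # desired P(D | a) from the constraint in the example
lam1 = math.log(2 * target / (1 - target))   # solve e^l / (e^l + 2) = target  ->  e^l = 2*target/(1 - target)
print(lam1)                                  # ln(18) ~= 2.890

# Recompute the normalized distribution over tags {D, N, V} for the word 'a':
scores = {"D": math.exp(lam1), "N": math.exp(0.0), "V": math.exp(0.0)}
Z = sum(scores.values())
probs = {tag: s / Z for tag, s in scores.items()}
print(probs)                                 # {'D': 0.9, 'N': 0.05, 'V': 0.05}
```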
