NLP Digital Notes
Introduction to NLP:
The field of natural language processing began in the 1940s, after World War II. At this time, people recognized
the importance of translation from one language to another and hoped to create a machine that could do this sort
of translation automatically. However, the task was obviously not as easy as people first imagined. By 1958,
some researchers were identifying significant issues in the development of NLP. One of these researchers was
Noam Chomsky, who found it troubling that models of language recognized sentences that were nonsense but
grammatically correct as equally irrelevant as sentences that were nonsense and not grammatically correct.
Chomsky found it problematic that the sentence “Colorless green ideas sleep furiously” was classified as
improbable to the same extent as “Furiously sleep ideas green colorless”; any speaker of English can recognize
the former as grammatically correct and the latter as incorrect, and Chomsky felt the same should be expected of
machine models.
Around the same time in history, from 1957-1970, researchers split into two divisions concerning NLP:
symbolic and stochastic. Symbolic, or rule-based, researchers focused on formal languages and generating
syntax; this group consisted of many linguists and computer scientists who considered this branch the beginning
of artificial intelligence research. Stochastic researchers were more interested in statistical and probabilistic
methods of NLP, working on problems of optical character recognition and pattern recognition between texts.
After 1970, researchers split even further, embracing new areas of NLP as more technology and knowledge
became available. One new area was logic-based paradigms, languages that focused on encoding rules and
language in mathematical logics. This area of NLP research later contributed to the development of the
programming language Prolog. Natural language understanding was another area of NLP that was particularly
influenced by SHRDLU, Professor Terry Winograd’s doctoral thesis. This program placed a computer in a
world of blocks, enabling it to manipulate and answer questions about the blocks according to natural language
instructions from the user. The remarkable part of this system was its capability to learn and understand with
high accuracy, something currently possible only in extremely limited domains (e.g., the block world). In
demonstration transcripts of SHRDLU, the computer is clearly able to resolve relationships between objects and
understand certain ambiguities. Another area of NLP that came into existence after 1970 is discourse modeling.
This area examines interchanges
between people and computers, working out such ideas as the need to change “you” in a speaker’s question to
“me” in the computer’s answer.
From 1983 to 1993, researchers became more united in focusing on empiricism and probabilistic models.
Researchers were able to test certain arguments by Chomsky and others from the 1950s and 60s, discovering
that many arguments that were convincing in text were not empirically accurate. Thus, by 1993, probabilistic
and statistical methods of handling natural language processing were the most common types of models. In the
last decade, NLP has also become more focused on information extraction and generation due to the vast
amounts of information scattered across the Internet. Additionally, personal computers are now everywhere, and
thus consumer level applications of NLP are much more common and an impetus for further research.
Before we talk about processing words, we need to decide what counts as a word. Let's start
by looking at one particular corpus (plural corpora), a computer-readable collection of text or
speech. For example, the Brown corpus is a million-word collection of samples from 500 written
English texts from different genres (newspaper, fiction, non-fiction, academic, etc.), assembled at
Brown University in 1963–64 (Kučera and Francis, 1967). How many words are in the following
Brown sentence?
He stepped out into the hall, was delighted to encounter a water brother.
This sentence has 13 words if we don’t count punctuation marks as words, 15 if we count punctuation.
Whether we treat period (“.”), comma (“,”), and so on as words depends on the task. Punctuation is
critical for finding boundaries of things (commas, periods, colons) and for identifying some aspects of
meaning (question marks, exclamation marks, quotation marks). For some tasks, like part-of-speech
tagging or parsing or speech synthesis, we sometimes treat punctuation marks as if they were separate
words.
The Switchboard corpus of American English telephone conversations between strangers was
collected in the early 1990s; it contains 2430 conversations averaging 6 minutes each, totaling 240
hours of speech and about 3 million words (Godfrey et al., 1992). Such corpora of spoken language
don’t have punctuation but do introduce other complications with regard to defining words. Let’s look
at one utterance from Switchboard; an utterance is the spoken correlate of a sentence:
I do uh main- mainly business data processing
This utterance has two kinds of disfluencies. The broken-off word main- is called a fragment. Words
like uh and um are called fillers or filled pauses. Should we consider these to be words? Again, it
depends on the application. If we are building a speech transcription system, we might want to
eventually strip out the disfluencies.
How about inflected forms like cats versus cat? These two words have the same lemma cat but are
different wordforms. A lemma is a set of lexical forms having the same stem, the same major part-of-
speech, and the same word sense. The wordform is the full inflected or derived form of the word. For
morphologically complex languages like Arabic, we often need to deal with lemmatization. For many
tasks in English, however, wordforms are sufficient.
How many words are there in English? To answer this question we need to distinguish two ways of
talking about words. Types are the number of distinct words in a corpus; if the set of words in the
vocabulary is V, the number of types is the vocabulary size |V|. Tokens are the total
number N of running words. If we ignore punctuation, the following Brown sentence has 16 tokens
and 14 types:
They picnicked by the pool, then lay back on the grass and looked at the stars
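As a minimal illustration (a sketch that simply strips the comma and splits on whitespace rather than using a real tokenizer), we can count the tokens and types of this sentence in Python:
    # Count word tokens and word types, ignoring punctuation.
    sentence = "They picnicked by the pool, then lay back on the grass and looked at the stars"
    tokens = sentence.replace(",", "").split()   # drop the comma, split on whitespace
    types = set(tokens)
    print(len(tokens), "tokens")   # 16 tokens
    print(len(types), "types")     # 14 types ("the" occurs three times)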
Corpora
It’s also quite common for speakers or writers to use multiple languages in a single
communicative act, a phenomenon called code switching. Code switching is enormously common
across the world.
Text Normalization
Before almost any natural language processing of a text, the text has to be normalized. At least three
tasks are commonly applied as part of any normalization process: tokenizing (segmenting) words,
normalizing word formats, and segmenting sentences.
Word Tokenization
While simple Unix text-processing commands can strip out all the numbers and punctuation, for most NLP
applications we'll need to keep these in our tokenization. We often want to break off punctuation as a
separate token; commas are a useful piece of information for parsers, periods help indicate sentence
boundaries. But we’ll often want to keep the punctuation that occurs word internally, in examples like
m.p.h., Ph.D., AT&T, and cap’n. Special characters and numbers will need to be kept in prices
($45.55) and dates (01/02/06); we don’t want to segment that price into separate tokens of “45” and
“55”. And there are URLs (https://www.stanford.edu), Twitter hashtags (#nlproc), or email addresses
([email protected]).
Number expressions introduce other complications as well; while commas normally appear at word
boundaries, commas are used inside numbers in English, every three digits: 555,500.50. Languages,
and hence tokenization requirements, differ on this; many continental European languages like
Spanish, French, and German, by contrast, use a comma to mark the decimal point, and spaces (or
sometimes periods) where English puts commas, for example, 555 500,50.
A tokenizer can also be used to expand clitic contractions that are marked by apostrophes, for
example, converting what're to the two tokens what are, and we're to we are. A clitic is a part of a
word that can’t stand on its own, and can only occur when it is attached to another word. Some such
contractions occur in other alphabetic languages, including articles and pronouns in French (j'ai,
l'homme).
Depending on the application, tokenization algorithms may also tokenize multiword expressions like
New York or rock 'n' roll as a single token, which requires a multiword expression dictionary of some
sort. Tokenization is thus intimately tied up with named entity recognition, the task of detecting
names, dates, and organizations.
One commonly used tokenization standard is known as the Penn Treebank tokenization standard, used
for the parsed corpora (treebanks) released by the Linguistic Data Consortium (LDC),
the source of many useful datasets. This standard separates out clitics (doesn’t becomes does plus n’t),
keeps hyphenated words together, and separates out all punctuation.
In practice, since tokenization needs to be run before any other language processing, it needs to be
very fast. The standard method for tokenization is therefore to use deterministic algorithms based on
regular expressions compiled into very efficient finite state automata which we can implement using
the nltk.regexp_tokenize function of the Python-based Natural Language Toolkit (NLTK).
Carefully designed deterministic algorithms can deal with the ambiguities that arise, such as the fact
that the apostrophe needs to be tokenized differently when used as a genitive marker (as in the book’s
cover), a quotative as in ‘The other class’, she said, or in clitics like they’re.
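The following is a minimal sketch of such a regular-expression tokenizer using nltk.regexp_tokenize; the pattern is illustrative rather than the official Penn Treebank tokenizer, but it shows how abbreviations, prices, and hyphenated words can be kept as single tokens:
    import nltk

    # An illustrative regular-expression tokenizer (not the official Penn Treebank one).
    pattern = r'''(?x)              # verbose regex: allow comments and whitespace
          (?:[A-Z]\.)+              # abbreviations, e.g. U.S.A.
        | \$?\d+(?:\.\d+)?%?        # currency and percentages, e.g. $12.40, 82%
        | \w+(?:[-']\w+)*           # words with optional internal hyphens/apostrophes
        | \.\.\.                    # ellipsis
        | [][.,;"'?():_`-]          # single-character punctuation tokens
    '''
    text = "That U.S.A. poster-print costs $12.40..."
    print(nltk.regexp_tokenize(text, pattern))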
There is a third option to tokenizing text. Instead of defining tokens as words (whether delimited by
spaces or more complex algorithms), or as characters (as in Chinese), we can use our data to
automatically tell us what the tokens should be. This is especially useful in dealing with unknown
words, an important problem in language processing. As we will see in the next chapter, NLP
algorithms often learn some facts about language from one corpus (a training corpus) and then use
these facts to make decisions about a separate test corpus and its language. Thus if our training corpus
contains, say, the words low, new, newer, but not lower, then if the word lower appears in our test
corpus, our system will not know what to do with it.
Subword:
To deal with this unknown word problem, modern tokenizers often automatically induce sets of
tokens that include tokens smaller than words, called subwords. Subwords can be arbitrary
substrings, or they can be meaning-bearing units like the morphemes -est or -er. (A morpheme is the
smallest meaning-bearing unit of a language; for example the word unlikeliest has the morphemes un-, likely, and -est.)
Most tokenization schemes have two parts: a token learner, and a token segmenter. The token learner
takes a raw training corpus (sometimes roughly preseparated into words, for example by whitespace)
and induces a vocabulary, a set of tokens. The token segmenter takes a raw test sentence and segments
it into the tokens in the vocabulary. Three algorithms are widely used: byte-pair encoding (Sennrich et
al., 2016), unigram language modeling (Kudo, 2018), and WordPiece (Schuster and Nakajima, 2012);
there is also a SentencePiece library that includes implementations of the first two of the three (Kudo
and Richardson, 2018a).
Byte-pair Encoding:
In this section we introduce the simplest of the three, the byte-pair encoding or BPE algorithm
(Sennrich et al., 2016); see Fig. 2.13. The BPE token learner begins with a vocabulary that is just the
set of all individual characters. It then examines the training corpus, chooses the two symbols that are
most frequently adjacent (say ‘A’, ‘B’), adds a new merged symbol ‘AB’ to the vocabulary, and
replaces every adjacent ’A’ ’B’ in the corpus with the new ‘AB’. It continues to count and merge,
creating new longer and longer character strings, until k merges have been done creating k novel
tokens; k is thus a parameter of the algorithm. The resulting vocabulary consists of the original set of
characters plus k new symbols.
The algorithm is usually run inside words (not merging across word boundaries), so the input corpus
is first white-space-separated to give a set of strings, each corresponding to the characters of a word,
plus a special end-of-word symbol _, and its counts. Let’s see its operation on the following tiny input
corpus of 18 word tokens with counts for each word (the word low appears 5 times, the word newer 6
times, and so on), which would have a starting vocabulary of 11 letters (_, d, e, i, l, n, o, r, s, t, w):
5 l o w _    2 l o w e s t _    6 n e w e r _    3 w i d e r _    2 n e w _
The BPE algorithm first counts all pairs of adjacent symbols: the most frequent is the pair e r because
it occurs in newer (frequency of 6) and wider (frequency of 3) for a total of 9 occurrences. We then
merge these symbols, treating er as one symbol, and count again:
5 l o w _    2 l o w e s t _    6 n e w er _    3 w i d er _    2 n e w _
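A minimal sketch of the BPE token learner on this toy corpus is shown below; the word counts and the end-of-word symbol _ follow the worked example, while the data structures and tie-breaking are simplifications:
    from collections import Counter

    # Toy BPE token learner: word counts and end-of-word symbol "_" as in the example above.
    corpus = {"l o w _": 5, "l o w e s t _": 2, "n e w e r _": 6,
              "w i d e r _": 3, "n e w _": 2}

    def pair_counts(corpus):
        """Count how often each pair of adjacent symbols occurs, weighted by word frequency."""
        pairs = Counter()
        for word, freq in corpus.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        return pairs

    def merge_pair(pair, corpus):
        """Replace every adjacent occurrence of the pair with the merged symbol."""
        a, b = pair
        return {word.replace(f"{a} {b}", f"{a}{b}"): freq for word, freq in corpus.items()}

    k = 8   # number of merges: a parameter of the algorithm
    for _ in range(k):
        pairs = pair_counts(corpus)
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair; the first merge here is ('e', 'r')
        corpus = merge_pair(best, corpus)
        print("merged:", best)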
Word Normalization
Word normalization is the task of putting words or tokens in a standard format, for example choosing
a single normal form for words with multiple forms like USA and US or uh-huh and uhhuh. This
standardization may be valuable, despite the spelling information that is lost in the normalization
process. For information retrieval or information extraction about the US, we might
want to see information from documents whether they mention the US or the USA.
Case folding is another kind of normalization. Mapping everything to lower case means that
Woodchuck and woodchuck are represented identically, which is very helpful for generalization in
many tasks, such as information retrieval or speech recognition. For sentiment analysis and other text
classification tasks, information extraction, and machine translation, by contrast, case can be quite
helpful and case folding is generally not done. This is because maintaining the difference between, for
example, US the country and us the pronoun can outweigh the advantage in generalization that case
folding would have provided for other words.
For many natural language processing situations we also want two morphologically different forms of
a word to behave similarly. For example in web search, someone may type the string woodchucks but
a useful system might want to also return pages that mention woodchuck with no s. This is especially
common in morphologically complex languages like Polish, where for example the word Warsaw has
different endings when it is the subject (Warszawa), or after a preposition like “in Warsaw” (w
Warszawie), or “to Warsaw” (do Warszawy), and so on.
Sentence Segmentation
Sentence segmentation is the task of splitting a text into sentences. The most useful cues are punctuation
marks like periods, question marks, and exclamation points, but a period is ambiguous between marking an
abbreviation (Mr., Inc.) and marking a sentence boundary, as in the following passage:
The scene is written with a combination of unbridled passion and surehanded control: In the
exchanges of the three characters and the rise and fall of emotions, Mr. Weller has captured the
heartbreaking inexorability of separation.
Related to this is the fact that sometimes sentences do not nicely follow in sequence, but seem to nest
in awkward ways. While normally nested things are not seen as sentences by themselves, but clauses,
this classification can be strained for cases such as the quoting of direct speech, where we get
subsentences:
“You remind me,” she remarked, “of your mother.”
A second problem with such direct speech is that it is standard typesetting practice (particularly in
North America) to place quotation marks after sentence final punctuation. Therefore, the end of the
sentence is not after the period in the example above, but after the close quotation mark that follows
the period.
In practice most systems have used heuristic algorithms of this sort. With enough development effort,
they can work very well, at least within the textual domain for which they were built. But any
such solution suffers from the same problems of heuristic processes in other parts of the tokenization
process. They require a lot of hand-coding and domain knowledge on the part of the person
constructing the tokenizer, and tend to be brittle and domain-specific.
Minimum Edit Distance
Edit distance gives us a way to quantify intuitions about string similarity. More formally, the
minimum edit distance between two strings is defined as the minimum number of editing operations (operations
like insertion, deletion, substitution) needed to transform one string into another.
We can also assign a particular cost or weight to each of these operations. The Levenshtein distance between
two sequences is the simplest weighting factor in which each of the three operations has a cost of 1
(Levenshtein, 1966)—we assume that the substitution of a letter for itself, for example, t for t, has zero cost. The
Levenshtein distance between intention and execution is 5. Levenshtein also proposed an alternative version of
his metric in which each insertion or deletion has a cost of
1 and substitutions are not allowed. (This is equivalent to allowing substitution, but giving each substitution a
cost of 2 since any substitution can be represented by one insertion and one deletion). Using this version, the
Levenshtein distance between intention and execution is 8.
How do we find the minimum edit distance? We can think of this as a search task, in which we are searching for
the shortest path—a sequence of edits—from one string to another.
The space of all possible edits is enormous, so we can’t search naively. However, lots of distinct edit paths will
end up in the same state (string), so rather than recomputing all those paths, we could just remember the shortest
path to a state each time we saw it. We can do this by using dynamic programming. Dynamic programming is
the name for a class of algorithms, first introduced by Bellman (1957), that
apply a table-driven method to solve problems by combining solutions to subproblems. Some of the most
commonly used algorithms in natural language processing make use of dynamic programming, such as the
Viterbi algorithm and the CKY algorithm for parsing.
The intuition of a dynamic programming problem is that a large problem can be solved by properly combining
the solutions to various subproblems. Consider the shortest path of transformed words that represents the
minimum edit distance between the strings intention and execution shown in Fig. 2.16.
The minimum edit distance algorithm was named by Wagner and Fischer (1974) but independently discovered
by many people
Let’s first define the minimum edit distance between two strings. Given two strings, the source string X of
length n, and target string Y of length m, we’ll define D[i, j] as the edit distance between X[1..i] and Y[1..j],
i.e., the first i characters of X and the first j characters of Y. The edit distance between X and Y is thus D[n, m].
We’ll use dynamic programming to compute D[n, m] bottom up, combining solutions to subproblems. In the
base case, with a source substring of length i but an empty target string, going from i characters to 0 requires i
deletes. With a target substring of length j but an empty source, going from 0 characters to j characters requires j
inserts. Having computed D[i, j] for small i, j we then compute larger D[i, j] based on previously computed
smaller values. The value of D[i, j] is computed by taking the minimum of the three possible paths through the
matrix which arrive there:
D[i, j] = min( D[i−1, j] + del-cost(source[i]), D[i, j−1] + ins-cost(target[j]), D[i−1, j−1] + sub-cost(source[i], target[j]) )
If we assume the version of Levenshtein distance in which the insertions and deletions each have a cost of 1
(ins-cost(·) = del-cost(·) = 1), and substitutions have a cost of 2 (except that substitution of identical letters has zero
cost), the computation for D[i, j] becomes:
D[i, j] = min( D[i−1, j] + 1, D[i, j−1] + 1, D[i−1, j−1] + (2 if source[i] ≠ target[j], else 0) )
The algorithm is summarized in Fig. 2.17; Fig. 2.18 shows the results of applying the algorithm to the distance
between intention and execution with the version of Levenshtein in Eq. 2.9.
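A minimal sketch of this dynamic-programming computation, using the Levenshtein variant above (insertion and deletion cost 1, substitution cost 2, matching letters cost 0):
    def min_edit_distance(source, target):
        """Minimum edit distance with ins/del cost 1 and substitution cost 2."""
        n, m = len(source), len(target)
        D = [[0] * (m + 1) for _ in range(n + 1)]   # D[i][j]: distance between source[:i] and target[:j]
        for i in range(1, n + 1):                   # base case: deletions only
            D[i][0] = D[i - 1][0] + 1
        for j in range(1, m + 1):                   # base case: insertions only
            D[0][j] = D[0][j - 1] + 1
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                sub = 0 if source[i - 1] == target[j - 1] else 2
                D[i][j] = min(D[i - 1][j] + 1,        # deletion
                              D[i][j - 1] + 1,        # insertion
                              D[i - 1][j - 1] + sub)  # substitution (or free match)
        return D[n][m]

    print(min_edit_distance("intention", "execution"))   # 8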
To extend the edit distance algorithm to produce an alignment, we can start by visualizing an alignment as a
path through the edit distance matrix. Figure 2.19 shows this path with boldfaced cells. Each boldfaced cell
represents an alignment
Figure 2.19 also shows the intuition of how to compute this alignment path. The computation proceeds in two
steps. In the first step, we augment the minimum edit distance algorithm to store backpointers in each cell. The
backpointer from a cell points to the previous cell (or cells) that we came from in entering the current cell.
We’ve shown a schematic of these backpointers in Fig. 2.19. Some cells have multiple backpointers because the
minimum extension could have come from multiple previous cells. In the second step, we perform a backtrace.
In a backtrace, we start
from the last cell (at the final row and column), and follow the pointers back through the dynamic programming
matrix. Each complete path between the final cell and the initial cell is a minimum distance alignment. Exercise
2.7 asks you to modify the minimum edit distance algorithm to store the pointers and compute the backtrace to
output an alignment.
While we worked our example with simple Levenshtein distance, the algorithm in Fig. 2.17 allows arbitrary
weights on the operations. For spelling correction, for example, substitutions are more likely to happen between
letters that are next to
each other on the keyboard. The Viterbi algorithm is a probabilistic extension of minimum edit distance. Instead
of computing the “minimum edit distance” between two strings, Viterbi computes the “maximum probability
alignment” of one string with another.
Most single-character spelling errors fall into one of the following four classes (Damerau, 1964):
1. Insertion
2. Deletion
3. Substitution
4. Transposition of two adjacent letters
Empirical studies of spelling errors show that about 80% of errors are within edit distance 1 and almost all are
within edit distance 2. We should also allow for the insertion of a space or hyphen.
Channel model:
N-Grams:
Let’s begin with the task of computing P(w|h), the probability of a word w given some history h. Suppose the
history h is “its water is so transparent that” and we want to know the probability that the next word is the.
One way to estimate this probability is from relative frequency counts: take a very large corpus, count the
number of times we see its water is so transparent that, and count the number of times this is followed by the.
This would be answering the question “Out of the times we saw the history h, how many times was it followed
by the word w?”, as follows:
P(the | its water is so transparent that) = C(its water is so transparent that the) / C(its water is so transparent that)
With a large enough corpus, such as the web, we can compute these counts and estimate the probability from
Eq. 3.2. You should pause now, go to the web, and compute this estimate for yourself.
While this method of estimating probabilities directly from counts works fine in many cases, it turns out that
even the web isn’t big enough to give us good estimates in most cases. This is because language is creative; new
sentences are created all the time, and we won’t always be able to count entire sentences. Even simple
extensions of the example sentence may have counts of zero on the web (such as “Walden Pond’s water is so
transparent that the”; well, used to have counts of zero).
Similarly, if we wanted to know the joint probability of an entire sequence of words like its water is so
transparent, we could do it by asking “out of all possible sequences of five words, how many of them are its
water is so transparent?” We would have to get the count of its water is so transparent and divide by the sum of
the counts of all possible five word sequences. That seems rather a lot to estimate!
For this reason, we’ll need to introduce more clever ways of estimating the probability of a word w given a
history h, or the probability of an entire word sequence W. Let’s start with a little formalizing of notation. To
represent the probability of a particular random variable Xi taking on the value “the”, or P(Xi =“the”), we will
use the simplification P(the). We’ll represent a sequence of n words either as w1 ... wn or w1:n (so the expression
w1:n−1 means the string w1, w2, ..., wn−1). For the joint probability of each word in a sequence having a particular
value P(X1 = w1, X2 = w2, X3 = w3, ..., Xn = wn) we’ll use P(w1, w2, ..., wn).
Now, how can we compute probabilities of entire sequences like P(w1, w2, ..., wn)? One thing we can do is
decompose this probability using the chain rule of probability:
P(w1:n) = P(w1) P(w2 | w1) P(w3 | w1:2) ... P(wn | w1:n−1) = ∏ P(wk | w1:k−1)   (product over k = 1, ..., n)
The intuition of the n-gram model is that instead of computing the probability of a word given its entire history,
we can approximate the history by just the last few words.
The bigram model, for example, approximates the probability of a word given all the previous words
P(wn | w1:n−1) by using only the conditional probability of the preceding word, P(wn | wn−1). In other words,
instead of computing the probability P(the | Walden Pond's water is so transparent that), we approximate it with
the probability P(the | that).
When we use a bigram model to predict the conditional probability of the next word, we are thus making the
following approximation:
P(wn | w1:n−1) ≈ P(wn | wn−1)
The assumption that the probability of a word depends only on the previous word is called a Markov
assumption. Markov models are the class of probabilistic models that assume we can predict the probability of
some future unit without looking too far into the past. We can generalize the bigram (which looks one word into
the past) to the trigram (which looks two words into the past) and thus to the n-gram (which looks n−1 words
into the past).
Let’s see a general equation for this n-gram approximation to the conditional probability of the next word in a
sequence. We’ll use N here to mean the n-gram size, so N = 2 means bigrams and N = 3 means trigrams. Then
we approximate the probability of a word given its entire context as follows:
P(wn | w1:n−1) ≈ P(wn | wn−N+1:n−1)
Given the bigram assumption for the probability of an individual word, we can compute the probability of a
complete word sequence by substituting Eq. 3.7 into Eq. 3.4:
P(w1:n) ≈ ∏ P(wk | wk−1)   (product over k = 1, ..., n)
For example, to compute a particular bigram probability of a word wn given a previous word wn−1, we’ll
compute the count of the bigram C(wn−1 wn) and normalize by the sum of all the bigrams that share the same
first word wn−1:
P(wn | wn−1) = C(wn−1 wn) / Σ_w C(wn−1 w)
We can simplify this equation, since the sum of all bigram counts that start with a given word wn−1 must be
equal to the unigram count for that word wn−1 (the reader should take a moment to be convinced of this):
P(wn | wn−1) = C(wn−1 wn) / C(wn−1)
Let’s work through an example using a mini-corpus of three sentences. We’ll first need to augment each
sentence with a special symbol <s> at the beginning of the sentence, to give us the bigram context of the first
word. We’ll also need a special end-symbol </s>:
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
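A minimal sketch of these bigram MLE estimates on the mini-corpus above, treating <s> and </s> as ordinary tokens:
    from collections import Counter

    # Bigram MLE: P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1}), estimated from the mini-corpus.
    sentences = ["<s> I am Sam </s>",
                 "<s> Sam I am </s>",
                 "<s> I do not like green eggs and ham </s>"]
    unigram_counts, bigram_counts = Counter(), Counter()
    for sent in sentences:
        words = sent.split()
        unigram_counts.update(words)
        bigram_counts.update(zip(words, words[1:]))

    def bigram_prob(prev, word):
        return bigram_counts[(prev, word)] / unigram_counts[prev]

    print(bigram_prob("<s>", "I"))     # 2/3
    print(bigram_prob("I", "am"))      # 2/3
    print(bigram_prob("Sam", "</s>"))  # 1/2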
Equation 3.12 (like Eq. 3.11) estimates the n-gram probability by dividing the observed frequency of a particular
sequence by the observed frequency of a prefix. This ratio is called a relative frequency. We said above that this
use of relative frequencies as a way to estimate probabilities is an example of maximum likelihood estimation or
MLE. In MLE, the resulting parameter set maximizes the likelihood
of the training set T given the model M (i.e., P(T|M)). For example, suppose the word Chinese occurs 400 times
in a corpus of a million words like the Brown corpus. What is the probability that a random word selected from
some other text of, say, a million words will be the word Chinese? The MLE of its probability is 400/1,000,000 or
0.0004. Now 0.0004 is not the best possible estimate of the probability of Chinese occurring in all situations; it
might turn out that in some other corpus or context Chinese is a very unlikely word. But it is the probability that
makes it most likely that Chinese will occur 400 times in a million-word corpus. We present ways to modify the
MLE estimates slightly to get better probability estimates in Section 3.5.
Let’s move on to some examples from a slightly larger corpus than our 14-word example above. We’ll use data
from the now-defunct Berkeley Restaurant Project, a dialogue system from the last century that answered
questions about a database of restaurants in Berkeley, California (Jurafsky et al., 1994). The sample user queries
were text-normalized (a sample of 9332 sentences is on the website).
We leave it as Exercise 3.2 to compute the probability of i want chinese food. What kinds of linguistic
phenomena are captured in these bigram statistics? Some of the bigram probabilities encode facts that we think
of as strictly syntactic in nature (for example, what comes after eat is usually a noun or an adjective), while
others encode facts about the task or about the world (for example, that users often ask about Chinese food).
Evaluating Language Models
One way to evaluate a language model is extrinsic: embed it in an application and measure how much the
application improves. Unfortunately, running big NLP systems end-to-end is often very expensive. Instead, it
would be nice to have a metric that can be used to quickly evaluate potential improvements in a language model.
An intrinsic evaluation metric is one that measures the quality of a model independent of any application.
Perplexity
In practice we don’t use raw probability as our metric for evaluating language models, but a variant called
perplexity. The perplexity (sometimes called PPL for short) of a language model on a test set is the inverse
probability of the test set, normalized by the number of words. For a test set W = w1 w2 ... wN:
perplexity(W) = P(w1 w2 ... wN)^(−1/N)
The perplexity of a test set W depends on which language model we use. Here’s the perplexity of W with a
unigram language model (the inverse of the geometric mean of the unigram probabilities):
perplexity(W) = ( ∏ 1/P(wi) )^(1/N)   (product over i = 1, ..., N)
Note that because of the inverse in Eq. 3.15, the higher the conditional probability of the word sequence, the
lower the perplexity. Thus, minimizing perplexity is equivalent to maximizing the test set probability according
to the language model. What we generally use for word sequence in Eq. 3.15 or Eq. 3.17 is the entire sequence
of words in some test set. Since this sequence will cross many sentence boundaries, we need to include the
begin- and end-sentence markers <s> and </s>
in the probability computation. We also need to include the end-of-sentence marker </s> (but not the beginning-
of-sentence marker <s>) in the total count of word tokens N.
There is another way to think about perplexity: as the weighted average branching factor of a language. The
branching factor of a language is the number of possible next words that can follow any word. Consider the task
of recognizing the digits in English (zero, one, two, ..., nine), given that (both in some training set and in some
test set) each of the 10 digits occurs with equal probability P = 1/10. The perplexity of this mini-language is in
fact 10:
perplexity(W) = ( (1/10)^N )^(−1/N) = 10
But suppose that the number zero is really frequent and occurs far more often than other numbers. Let’s say that
0 occurred 91 times in the training set, and each of the other digits occurred 1 time each. Now we see the following
test set: 0 0 0 0 0 3 0 0 0 0. We should expect the perplexity of this test set to be lower since most of the time the
next number will be zero, which is very predictable, i.e. has a high probability. Thus, although the branching
factor is still 10, the perplexity or weighted branching
factor is smaller. We leave this exact calculation as exercise 3.12.
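As a minimal sketch, here is a generic unigram perplexity computation applied to the equal-probability digit language, where the perplexity comes out to the branching factor of 10 (the skewed case is left as the exercise mentioned above):
    import math

    def unigram_perplexity(test_tokens, prob):
        """Inverse probability of the test set, normalized by its length N (computed in log space)."""
        N = len(test_tokens)
        log_prob = sum(math.log(prob[w]) for w in test_tokens)
        return math.exp(-log_prob / N)

    # Equal-probability digit language: perplexity equals the branching factor, 10.
    uniform = {str(d): 1 / 10 for d in range(10)}
    test = list("0123456789")
    print(unigram_perplexity(test, uniform))   # 10.0 (up to floating point)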
We mentioned above that perplexity is a function of both the text and the language model: given a text W,
different language models will have different perplexities. Because of this, perplexity can be used to compare
different n-gram models. Let’s look at an example, in which we trained unigram, bigram, and trigram grammars
on 38 million words (including start-of-sentence tokens) from the Wall Street Journal, using a 19,979 word
vocabulary. We then computed the perplexity of each of these models on a test set of 1.5 million words, using
Eq. 3.16 for unigrams, Eq. 3.17 for bigrams, and the corresponding equation for trigrams. On this test set the
unigram model had a perplexity of 962, the bigram model 170, and the trigram model 109.
As we see above, the more information the n-gram gives us about the word sequence, the higher the probability
the n-gram will assign to the string. A trigram model is less surprised than a unigram model because it has a
better idea of what words might come next, and so it assigns them a higher probability. And the higher the
probability, the lower the perplexity (since as Eq. 3.15 showed, perplexity is related inversely to the likelihood
of the test sequence according to the model). So a
lower perplexity can tell us that a language model is a better predictor of the words in the test set.
Smoothing:
What do we do with words that are in our vocabulary (they are not unknown words) but appear in a test set in an
unseen context (for example they appear after a word they never appeared after in training)? To keep a language
model from assigning zero probability to these unseen events, we’ll have to shave off a bit of probability mass
from some more frequent events and give it to the events we’ve never seen. This modification is called
smoothing or discounting. In this section and the following ones we’ll introduce a variety of ways to do
smoothing: Laplace (add-one) smoothing, add-k smoothing, stupid backoff, and Kneser-Ney smoothing.
Laplace Smoothing
The simplest way to do smoothing is to add one to all the n-gram counts, before we normalize them into
probabilities. All the counts that used to be zero will now have a count of 1, the counts of 1 will be 2, and so on.
This algorithm is called Laplace smoothing. Laplace smoothing does not perform well enough to be used in
modern n-gram models, but it usefully introduces many of the concepts that we
see in other smoothing algorithms, gives a useful baseline, and is also a practical smoothing algorithm for other
tasks like text classification.
Let’s start with the application of Laplace smoothing to unigram probabilities. Recall that the unsmoothed
maximum likelihood estimate of the unigram probability of the word wi is its count ci normalized by the total
number of word tokens N:
P(wi) = ci / N
Laplace smoothing merely adds one to each count, and since there are V words in the vocabulary and each one
was incremented, we also need to adjust the denominator to take into account the extra V observations:
P_Laplace(wi) = (ci + 1) / (N + V)
Instead of changing both the numerator and denominator, it is convenient to describe how a smoothing
algorithm affects the numerator by defining an adjusted count c*. This adjusted count is easier to compare
directly with the MLE counts and can be turned into a probability like an MLE count by normalizing by N. To
define this count, since we are only changing the numerator in addition to adding 1, we’ll also need to multiply
by a normalization factor N/(N+V):
c*_i = (ci + 1) · N / (N + V)
A related way to view smoothing is as discounting (lowering) some non-zero counts in order to get the
probability mass that will be assigned to the zero counts. Thus, instead of referring to the discounted counts c*,
we might describe a smoothing algorithm in terms of a relative discount dc, the ratio of the discounted counts to
the original counts:
dc = c* / c
Now that we have the intuition for the unigram case, let’s smooth our Berkeley Restaurant Project bigrams.
Figure 3.6 shows the add-one smoothed counts for the bigrams in Fig. 3.1.
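A minimal sketch of add-one smoothed bigram estimates, reusing the three-sentence mini-corpus from earlier (V is the vocabulary size, including <s> and </s>):
    from collections import Counter

    # Add-one (Laplace) smoothed bigram estimate:
    # P_Laplace(w_n | w_{n-1}) = (C(w_{n-1} w_n) + 1) / (C(w_{n-1}) + V)
    sentences = ["<s> I am Sam </s>",
                 "<s> Sam I am </s>",
                 "<s> I do not like green eggs and ham </s>"]
    unigram_counts, bigram_counts = Counter(), Counter()
    for sent in sentences:
        words = sent.split()
        unigram_counts.update(words)
        bigram_counts.update(zip(words, words[1:]))
    V = len(unigram_counts)   # vocabulary size, including <s> and </s>

    def laplace_bigram(prev, word):
        return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

    print(laplace_bigram("I", "am"))      # seen bigram: (2 + 1) / (3 + V)
    print(laplace_bigram("am", "green"))  # unseen bigram gets a small non-zero probability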
A variant of add-one smoothing, add-k smoothing, moves a bit less of the probability mass by adding a
fractional count k (for example 0.5 or 0.01) instead of 1:
P_Add-k(wn | wn−1) = (C(wn−1 wn) + k) / (C(wn−1) + kV)
Add-k smoothing requires that we have a method for choosing k; this can be done, for example, by optimizing
on a devset. Although add-k is useful for some tasks (including text classification), it turns out that it still
doesn’t work well for language modeling, generating counts with poor variances and often inappropriate
discounts (Gale and Church, 1994).
In other words, sometimes using less context is a good thing, helping to generalize more for contexts that the
model hasn’t learned much about. There are two ways to use this n-gram “hierarchy”. In backoff, we use the
trigram if the evidence is sufficient, otherwise we use the bigram, otherwise the unigram. In other words, we
only “back off” to a lower-order n-gram if we have zero evidence for a higher-order n-gram. By contrast, in
interpolation, we always mix the probability estimates from all the n-gram estimators, weighting and combining
the trigram, bigram, and unigram counts.
In simple linear interpolation, we combine different order n-grams by linearly interpolating them. Thus, we
estimate the trigram probability P(wn | wn−2 wn−1) by mixing together the unigram, bigram, and trigram
probabilities, each weighted by a λ:
P(wn | wn−2 wn−1) ≈ λ1 P(wn) + λ2 P(wn | wn−1) + λ3 P(wn | wn−2 wn−1), where the λs sum to 1.
In a slightly more sophisticated version of linear interpolation, each λ weight is computed by conditioning on the
context. This way, if we have particularly accurate counts for a particular bigram, we assume that the counts of
the trigrams based on this bigram will be more trustworthy, so we can make the λs for those trigrams higher and
thus give that trigram more weight in the interpolation.
How are these λ values set? Both the simple interpolation and conditional interpolation λs are learned from a
held-out corpus. A held-out corpus is an additional training corpus, so-called because we hold it out from the
training data, that we use to set hyperparameters like these λ values. We do so by choosing the λ values that
maximize the likelihood of the held-out corpus. That is, we fix the n-gram probabilities and then search for the λ
values that, when plugged into Eq. 3.27, give us the highest probability of the held-out set. There are various
ways to find this optimal set of λs. One way is to use the EM algorithm, an iterative learning algorithm that
converges on locally optimal λs (Jelinek and Mercer, 1980).
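A minimal sketch of simple linear interpolation for a trigram estimate; the λ values and the toy component probabilities below are purely illustrative stand-ins, not values learned from a held-out corpus:
    # Simple linear interpolation of unigram, bigram, and trigram estimates.
    # The weights must sum to 1; the values below are purely illustrative.
    l1, l2, l3 = 0.1, 0.3, 0.6          # weights for unigram, bigram, trigram

    def interpolated_trigram(w, prev1, prev2, p_uni, p_bi, p_tri):
        """P_hat(w | prev2 prev1) = l1*P(w) + l2*P(w|prev1) + l3*P(w|prev2 prev1)."""
        return l1 * p_uni(w) + l2 * p_bi(w, prev1) + l3 * p_tri(w, prev1, prev2)

    # Toy (made-up) component probabilities for a single example:
    p_uni = lambda w: {"food": 0.01}.get(w, 0.001)
    p_bi = lambda w, p1: {("food", "chinese"): 0.3}.get((w, p1), 0.0)
    p_tri = lambda w, p1, p2: {("food", "chinese", "want"): 0.5}.get((w, p1, p2), 0.0)

    print(interpolated_trigram("food", "chinese", "want", p_uni, p_bi, p_tri))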
In a backoff n-gram model, if the n-gram we need has zero counts, we approximate it by backing off to the
(n−1)-gram. We continue backing off until we reach a history that has some counts.
In order for a backoff model to give a correct probability distribution, we have to discount the higher-order n-
grams to save some probability mass for the lower order n-grams. Just as with add-one smoothing, if the higher-
order n-grams aren’t discounted and we just used the undiscounted MLE probability, then as soon as we
replaced an n-gram which has zero probability with a lower-order n-gram, we would be adding probability
mass, and the total probability assigned to all possible strings
by the language model would be greater than 1! In addition to this explicit discount factor, we’ll need a function
α to distribute this probability mass to the lower order n-grams.
This kind of backoff with discounting is also called Katz backoff. In Katz backoff we rely on a discounted
probability P* if we’ve seen this n-gram before (i.e., if we have non-zero counts). Otherwise, we recursively
back off to the Katz probability for the shorter-history (n−1)-gram. The probability for a backoff n-gram P_BO is
thus computed as follows:
P_BO(wn | wn−N+1:n−1) = P*(wn | wn−N+1:n−1), if C(wn−N+1:n) > 0
P_BO(wn | wn−N+1:n−1) = α(wn−N+1:n−1) · P_BO(wn | wn−N+2:n−1), otherwise
Important Questions:
1. Calculate the probability of the sentence i want chinese food. Give two probabilities, one using Fig. 3.2
and the ‘useful probabilities’ just below it on page 36, and another using the add-1 smoothed table in
Fig. 3.7. Assume the additional add-1 smoothed probabilities P(i|<s>) = 0.19 and P(</s>|food) = 0.40.
2. We are given the following corpus, modified from the one in the chapter:
Using a bigram language model with add-one smoothing, what is P(Sam | am)? Include <s> and </s>
in your counts just like any other token.
3. Analyze Laplace smoothing with the help of a suitable example.
4. Explain in detail how language models are evaluated.
5. Apply Byte-pair encoding for the following corpus:
“With the Bioethics Unit of the Indian Council of Medical Research (ICMR) placing a consensus
policy statement on Controlled Human Infection Studies (CHIS) for comments, India has taken the first
step in clearing the deck for such studies to be undertaken here.”
Project:
Build a system that checks, against a corpus, whether the spelling of a word is correct or incorrect.
Advanced Smoothing
Kneser-Ney Smoothing
A popular advanced n-gram smoothing method is the interpolated Kneser-Ney algorithm. (Kneser and Ney
1995, Chen and Goodman 1998).
Absolute Discounting
Kneser-Ney has its roots in a method called absolute discounting. Recall that discounting of the
counts for frequent n-grams is necessary to save some probability mass for the smoothing algorithm to
distribute to the unseen n-grams.
To see this, we can use a clever idea from Church and Gale (1991). Consider an n-gram that has count
4. We need to discount this count by some amount. But how much should we discount it? Church and
Gale’s clever idea was to look at a held-out corpus and just see what the count is for all those bigrams
that had count 4 in the training set. They computed a bigram grammar from 22 million words of AP
newswire and then checked the counts of each of these bigrams in another 22 million words. On
average, a bigram that occurred 4 times in the first 22 million words occurred 3.23 times in the next
22 million words. Fig. 3.9 from Church and Gale (1991) shows these counts for bigrams with c from 0
to 9.
Notice in Fig. 3.9 that except for the held-out counts for 0 and 1, all the other bigram counts in the
held-out set could be estimated pretty well by just subtracting 0.75 from the count in the training set!
Absolute discounting formalizes this intuition by subtracting a fixed (absolute) discount d from each count. For
bigrams interpolated with unigrams:
P_AbsoluteDiscounting(wi | wi−1) = (C(wi−1 wi) − d) / Σ_v C(wi−1 v) + λ(wi−1) P(wi)
The first term is the discounted bigram, with 0 ≤ d ≤ 1, and the second term is the unigram with an
interpolation weight λ. By inspection of Fig. 3.9, it looks like just setting all the d values to 0.75 would
work very well, or perhaps keeping a separate second discount value of 0.5 for the bigrams with
counts of 1. There are principled methods for setting d; for example, Ney et al. (1994) set d as a
function of n1 and n2, the number of unigrams that have a count of 1 and a count of 2, respectively:
d = n1 / (n1 + 2·n2)
Kneser-Ney Discounting
Kneser-Ney discounting (Kneser and Ney, 1995) augments absolute discounting with a more
sophisticated way to handle the lower-order unigram distribution. Consider the job of predicting the
next word in this sentence, assuming we are interpolating a bigram and a unigram model:
I can't see without my reading __________.
The word glasses seems much more likely to follow here than, say, the word Kong, so we’d like our
unigram model to prefer glasses. But in fact it’s Kong that is more common, since Hong Kong is a
very frequent word. A standard unigram model will assign Kong a higher probability than glasses. We
would like to capture the intuition that although Kong is frequent, it is mainly only frequent in the
phrase Hong Kong, that is, after the word Hong. The word glasses has a much wider distribution.
In other words, instead of P(w), which answers the question “How likely is w?”, we’d like to create a
unigram model that we might call PCONTINUATION, which answers the question “How likely is w
to appear as a novel continuation?”. How can we estimate this probability of seeing the word w as a
novel continuation, in a new unseen context? The Kneser-Ney intuition is to base our estimate of
PCONTINUATION on the number of different contexts word w has appeared in, that is, the number
of bigram types it completes. Every bigram type was a novel continuation the first time it was seen.
We hypothesize that words that have appeared in more contexts in the past are more likely to appear
in some new context as well. The number of times a word w appears as a novel continuation can be
expressed as:
P_CONTINUATION(w) ∝ |{v : C(v w) > 0}|
To turn this count into a probability, we normalize by the total number of bigram types. An equivalent
formulation based on a different metaphor is to use the number of word types seen to precede w, normalized by
the number of word types preceding all words:
P_CONTINUATION(w) = |{v : C(v w) > 0}| / Σ_w' |{v : C(v w') > 0}|
A frequent word (Kong) occurring in only one context (Hong) will have a low continuation
probability.
The final equation for Interpolated Kneser-Ney smoothing for bigrams is then:
P_KN(wi | wi−1) = max(C(wi−1 wi) − d, 0) / C(wi−1) + λ(wi−1) · P_CONTINUATION(wi)
The λ is a normalizing constant that is used to distribute the probability mass we’ve discounted:
λ(wi−1) = ( d / Σ_v C(wi−1 v) ) · |{w : C(wi−1 w) > 0}|
The first term, d / Σ_v C(wi−1 v), is the normalized discount (the discount d, 0 ≤ d ≤ 1, was introduced in the
absolute discounting section above). The second term, |{w : C(wi−1 w) > 0}|, is the number of word
types that can follow wi−1 or, equivalently, the number of word types that we discounted; in other
words, the number of times we applied the normalized discount.
The general recursive formulation is:
P_KN(wi | wi−n+1:i−1) = max(c_KN(wi−n+1:i) − d, 0) / Σ_v c_KN(wi−n+1:i−1 v) + λ(wi−n+1:i−1) · P_KN(wi | wi−n+2:i−1)
where the definition of the count c_KN depends on whether we are counting the highest-order n-gram
being interpolated (for example trigram if we are interpolating trigram, bigram, and unigram) or one
of the lower-order n-grams (bigram or unigram if we are interpolating trigram, bigram, and unigram):
c_KN(·) is the regular count for the highest-order n-gram and the continuation count (the number of
unique single-word contexts) for the lower-order n-grams.
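A minimal sketch of interpolated Kneser-Ney for bigrams, following the equations above; the discount d = 0.75 and the toy corpus are illustrative choices, and unseen histories are not handled:
    from collections import Counter, defaultdict

    # Toy interpolated Kneser-Ney for bigrams.
    d = 0.75
    tokens = ("i lost my glasses he found his glasses she wears reading glasses "
              "i visited hong kong they love hong kong we left hong kong").split()

    bigram_counts = Counter(zip(tokens, tokens[1:]))
    history_totals = Counter()      # sum_v C(prev v)
    followers = defaultdict(set)    # word types seen after each history
    contexts = defaultdict(set)     # history types seen before each word
    for (prev, w), c in bigram_counts.items():
        history_totals[prev] += c
        followers[prev].add(w)
        contexts[w].add(prev)
    num_bigram_types = len(bigram_counts)

    def p_continuation(w):
        # How many different contexts has w completed, normalized by all bigram types?
        return len(contexts[w]) / num_bigram_types

    def p_kn(w, prev):
        discounted = max(bigram_counts[(prev, w)] - d, 0) / history_totals[prev]
        lam = (d / history_totals[prev]) * len(followers[prev])
        return discounted + lam * p_continuation(w)

    # "kong" and "glasses" both occur three times, but "glasses" follows more
    # distinct words, so its continuation probability is higher.
    print(p_continuation("kong"), p_continuation("glasses"))
    print(p_kn("kong", "hong"), p_kn("glasses", "reading"))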
Computational Morphology:
Morphological analysis is a field of linguistics that studies the structure of words. It identifies how a
word is produced through the use of morphemes. A morpheme is a basic unit of the English language:
the smallest element of a word that has grammatical function and meaning. Free morphemes and
bound morphemes are the two types of morphemes. A single free morpheme can become a complete word.
For instance, bus, bicycle, and so forth. A bound morpheme, on the other hand, cannot stand alone
and must be joined to a free morpheme to produce a word; -ing, un-, and other affixes are examples
of bound morphemes.
Types of Morphology:
1. Inflectional Morphology: modification of a word to express different grammatical categories.
Inflectional morphology is the study of processes, including affixation and vowel change, that
distinguish word forms in certain grammatical categories. Inflectional morphology consists of
at least five categories, provided in the following excerpt from Language Typology and
Syntactic Description: Grammatical Categories and the Lexicon. As the text will explain,
derivational morphology cannot be so easily categorized because derivation isn’t as
predictable as inflection. Examples: cats, men, etc.
2. Derivational Morphology: morphology that creates new lexemes, either by
changing the syntactic category (part of speech) of a base or by adding substantial,
nongrammatical meaning or both. On the one hand, derivation may be distinguished from
inflectional morphology, which typically does not change category but rather modifies
lexemes to fit into various syntactic contexts; inflection typically expresses distinctions like
number, case, tense, aspect, person, among others. On the other hand, derivation may be
distinguished from compounding, which also creates new lexemes, but by combining two or
more bases rather than by affixation, reduplication, subtraction, or internal modification of
various sorts. Although the distinctions are generally useful, in practice applying them is not
always easy.
APPROACHES TO MORPHOLOGY:
1. Morpheme-Based Morphology: words are analyzed as arrangements of morphemes (an
item-and-arrangement approach).
2. Word-Based Morphology: (usually) a word-and-paradigm approach. The theory takes
paradigms as a central notion. Instead of stating rules to combine morphemes into word
forms or to generate word forms from stems, word-based morphology states generalizations
that hold between the forms of inflectional paradigms.
Lemmatization and Stemming
Lemmatization is the task of determining that two words have the same root, despite their surface
differences. The words am, are, and is have the shared lemma be; the words dinner and dinners both
have the lemma dinner. Lemmatizing each of these forms to the same lemma will let us find all
mentions of words in Polish like Warsaw. The lemmatized form of a sentence like He is reading
detective stories would thus be He be read detective story.
How is lemmatization done? The most sophisticated methods for lemmatization involve complete
morphological parsing of the word. Morphology is the study of the way words are built
up from smaller meaning-bearing units called morphemes. Two broad classes of morphemes can be
distinguished: stems—the central morpheme of the word, supplying the main meaning—and
affixes—adding “additional” meanings of various kinds. So, for example, the word fox consists of
one morpheme (the morpheme fox) and the word cats consists of two: the morpheme cat and the
morpheme -s. A morphological parser takes a word like cats and parses it into the two morphemes cat
and s, or parses a Spanish word like amaren (‘if in the future they would love’) into the morpheme
amar ‘to love’, and the morphological features 3PL and future subjunctive.
A simpler but cruder alternative to full lemmatization is stemming, which mainly consists of chopping
off word-final affixes. The most widely used stemming algorithm is the Porter stemmer. For example,
applying the Porter stemmer to the following paragraph:
This was not the map we found in Billy Bones's chest, but an accurate copy,
complete in all things-names and heights and soundings-with the single
exception of the red crosses and the written notes.
produces the following stemmed output:
Thi wa not the map we found in Billi Bone s chest but an accur copi complet
in all thing name and height and sound with the singl except of the red cross
and the written note
The algorithm is based on a series of rewrite rules run in cascade: the output of each pass is fed as input to
the next pass. Here are some sample rules (more details can be found at
https://tartarus.org/martin/PorterStemmer/):
ATIONAL → ATE (e.g., relational → relate)
ING → ε if the stem contains a vowel (e.g., motoring → motor)
SSES → SS (e.g., grasses → grass)
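A minimal sketch using NLTK's implementation of the Porter stemmer (assuming the nltk package is installed):
    from nltk.stem import PorterStemmer

    # Apply NLTK's Porter stemmer to a few inflected and derived forms.
    stemmer = PorterStemmer()
    for word in ["motoring", "grasses", "relational", "sleeping", "singles"]:
        print(word, "->", stemmer.stem(word))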
The approach to spelling rules involves the use of finite state transducers (FSTs). Rather than
jumping straight into this, I’ll briefly consider the simpler finite state automata and how they can be
used in a simple recogniser. Suppose we want to recognise dates (just day and month pairs) written in
the format day/month. The day and the month may be expressed as one or two digits (e.g. 11/2, 1/12
etc). This format corresponds to the following simple FSA, where each character corresponds to one
transition:
Accept states are shown with a double circle. This is a non-deterministic FSA: for instance, an input
starting with the digit 3 will move the FSA to both state 2 and state 3. This corresponds to a local
ambiguity: i.e., one that will be resolved by subsequent context. By convention, there must be no ‘left
over’ characters when the system is in the final state.
To make this a bit more interesting, suppose we want to recognise a comma-separated list of such
dates. The FSA, shown below, now has a cycle and can accept a sequence of indefinite length (note
that this is iteration and not full recursion, however).
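Since a finite state automaton of this kind can equivalently be written as a regular expression, here is a minimal sketch of a recogniser for a comma-separated list of day/month dates; it is an approximation of the FSAs described above (for example, it does not restrict days to 1–31):
    import re

    # One or two digits, a slash, one or two digits; then any number of ",date"
    # repetitions. Anchored with ^ and $ so no characters are 'left over'.
    date = r"\d{1,2}/\d{1,2}"
    date_list = re.compile(rf"^{date}(?:,{date})*$")

    for s in ["11/2", "1/12,3/1,25/12", "11/2,", "1/123"]:
        print(s, bool(date_list.match(s)))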
To illustrate two-level morphology, consider the following FST, which recognises the affix -s, allowing
for environments corresponding to the e-insertion spelling rule shown in §1.4. (The FST is simplified
slightly so that it works correctly, but the correspondence to the spelling rule is not exact: J&M give a
more complex transducer which is an accurate reflection of the spelling rule. They also use an explicit
terminating character, while here we rely on the ‘use all the input’ convention, which results in
simpler rules.)
For instance, with this FST, the surface form cakes would start from 1 and go through the
transitions/states (c:c) 1,(a:a) 1, (k:k) 1, (e:e) 1, (ε:ˆ) 2, (s:s) 3 (accept, underlying cakeˆs) and also
(c:c) 1, (a:a) 1, (k:k) 1, (e:e) 1, (s:s) 4 (accept, underlying cakes). ‘d o g s’ maps to ‘d o g ˆ s’, ‘f o x e
s’ maps to ‘f o x ˆ s’ and to ‘f o x e ˆ s’, and ‘b u z z e s’ maps to ‘b u z z ˆ s’ and ‘b u z z e ˆ s’. When
the transducer is run in analysis mode, this means the system can detect an affix boundary (and hence
look up the stem and the affix in the appropriate lexicons). In generation mode, it can construct the
correct string. This FST is non-deterministic.
Similar FSTs can be written for the other spelling rules for English (although to do consonant
doubling correctly, information about stress and syllable boundaries is required, and there are also
some exceptions).
One issue with this use of FSTs is that they do not allow for any internal structure of the word form.
For instance, we can produce a set of FSTs which will result in unionised being mapped into
unˆionˆiseˆed, but as we’ve seen, the affixes actually have to be applied in the right order and the
bracketing isn’t modelled by the FSTs.
Parts of speech (also known as POS) and named entities are useful clues to sentence structure and
meaning. Knowing whether a word is a noun or a verb tells us about likely neighboring words (nouns
in English are preceded by determiners and adjectives, verbs by nouns) and syntactic structure (verbs
have dependency links to nouns), making part-of-speech tagging a key aspect of parsing. Knowing if
a named entity like Washington is a name of a person, a place, or a university is important to many
natural language processing tasks like question answering, stance detection, or information extraction.
Parts of speech fall into two broad categories: closed class and open class. Closed classes are those
with relatively fixed membership, such as prepositions; new prepositions are rarely coined. By
contrast, nouns and verbs are open classes: new nouns and verbs like iPhone or to fax are continually
being created or borrowed. Closed class words are generally function words like of, it, and, or you,
which tend to be very short, occur frequently, and often have structuring uses in grammar.
Nouns are words for people, places, or things, but include others as well. Common nouns include
concrete terms like cat and mango and abstractions like algorithm.
Verbs refer to actions and processes, including main verbs like draw, provide, and go. English verbs
have inflections (non-third-person-singular (eat), third-person-singular (eats), progressive (eating),
past participle (eaten)). While many scholars believe that all human languages have the categories of
noun and verb, others have argued that some languages, such as Riau Indonesian and Tongan, don’t
even make this distinction (Broschart 1997; Evans 2000; Gil 2000) .
Adjectives often describe properties or qualities of nouns, like color (white, black), age (old, young),
and value (good, bad), but there are languages without adjectives. In Korean, for example, the words
corresponding to English adjectives act as a subclass of verbs, so what is in English an adjective
“beautiful” acts in Korean like a verb meaning “to be beautiful”.
Adverbs are a hodge-podge category. Adverbs generally modify something (often verbs, hence the
name “adverb”, but also other adverbs and entire verb phrases). Directional or locative adverbs (home,
here, downhill) specify the direction or location of some action; degree adverbs (extremely, very,
somewhat) specify the extent of some action, process, or property; manner adverbs (slowly,
slinkily, delicately) describe the manner of some action or process; and temporal adverbs
describe the time that some action or event took place (yesterday, Monday).
Interjections (oh, hey, alas, uh, um) are a smaller open class that also includes greetings (hello,
goodbye) and question responses (yes, no, uh-huh).
English adpositions occur before nouns, hence are called prepositions. They can indicate spatial or
temporal relations, whether literal (on it, before then, by the house) or metaphorical (on time, with
gusto, beside herself), and relations like marking the agent in Hamlet was written by Shakespeare.
A particle resembles a preposition or an adverb and is used in combination with a verb. Particles often
have extended meanings that aren’t quite the same as the prepositions they resemble, as in the particle
over in she turned the paper over. A verb and a particle acting as a single unit is called a phrasal verb.
The meaning of phrasal verbs is often non-compositional—not predictable from the individual
meanings of the verb and the particle. Thus, turn down means ‘reject’, rule out ‘eliminate’, and go on
‘continue’. Determiners like this and that (this chapter, that page) can mark the start of an
English noun phrase. Articles like a, an, and the are a type of determiner that mark discourse
properties of the noun and are quite frequent; the is the most common word in written English, with a
and an right behind.
Conjunctions join two phrases, clauses, or sentences. Coordinating conjunctions like and, or, and but
join two elements of equal status. Subordinating conjunctions are used when one of the elements has
some embedded status. For example, the subordinating conjunction that in “I thought that you might
like some milk” links the main clause I thought with the subordinate clause you might like some milk;
subordinating conjunctions that link a verb to its argument in this way are also called complementizers.
Pronouns act as a shorthand for referring to an entity or event. Personal pronouns refer to persons or
entities (you, she, I, it, me, etc.). Possessive pronouns are forms of personal pronouns that indicate
either actual possession or more often just an abstract relation between the person and some object
(my, your, his, her, its, one's, our, their). Wh-pronouns (what, who, whom, whoever) are used in
certain question forms, or act as complementizers.
Auxiliary verbs mark semantic features of a main verb such as its tense, whether it is completed
(aspect), whether it is negated (polarity), and whether an action is necessary, possible, suggested, or
desired (mood). English auxiliaries include the copula verb be, the two verbs do and have along with
their inflected forms, as well as modal verbs used to mark the mood associated with the event depicted
by the main verb: can indicates ability or possibility, may permission or possibility, must necessity.
An English-specific tagset, the 45-tag Penn Treebank tagset (Marcus et al., 1993), shown in Fig. 8.2,
has been used to label many syntactically annotated corpora like the Penn Treebank corpora, so is
worth knowing about.
Part-of-Speech Tagging
The accuracy of part-of-speech tagging algorithms (the percentage of test set tags that match human
gold labels) is extremely high. One study found accuracies over 97% across 15 languages from the
Universal Dependency (UD) treebank (Wu and Dredze, 2019). Accuracies on various English
treebanks are also 97% (no matter the algorithm; HMMs, CRFs, BERT perform similarly). This 97%
number is also about the human performance on this task, at least for English (Manning, 2011).
We’ll introduce algorithms for the task in the next few sections, but first let’s explore the task. Exactly
how hard is it? Fig. 8.4 shows that most word types (85-86%) are unambiguous (Janet is always NNP,
hesitantly is always RB). But the ambiguous words, though accounting for only 14-15% of the
vocabulary, are very common, and 55-67% of word tokens in running text are ambiguous. Particularly
ambiguous common words include that, back, down, put and set; here are some examples of the 6
different parts of speech for the word back:
Nonetheless, many words are easy to disambiguate, because their different tags aren’t equally likely.
For example, a can be a determiner or the letter a, but the determiner sense is much more likely.
Markov Chains
The HMM is based on augmenting the Markov chain. A Markov chain is a model that tells us
something about the probabilities of sequences of random variables, states, each of which can take on
values from some set. These sets can be words, or tags, or symbols representing anything, for example
the weather. A Markov chain makes a very strong assumption that if we want to predict the future in
the sequence, all that matters is the current state. All the states before the current state have no impact
on the future except via the current state. It’s as if to predict tomorrow’s weather you could examine
today’s weather but you weren’t allowed to look at yesterday’s weather.
More formally, consider a sequence of state variables 𝑞1 , 𝑞2 , … 𝑞𝑖 . A Markov model embodies the
Markov assumption on the probabilities of this sequence: when predicting the future, the past doesn’t
matter, only the present.
Figure 8.8a shows a Markov chain for assigning a probability to a sequence of weather events, for
which the vocabulary consists of HOT, COLD, and WARM. The states are represented as nodes in
the graph, and the transitions, with their probabilities, as edges. The transitions are probabilities: the
values of arcs leaving a given state must sum to 1. Figure 8.8b shows a Markov chain for assigning a
probability to a sequence of words 𝑤1 … 𝑤𝑡 . This Markov chain should be familiar; in fact, it
represents a bigram language model, with each edge expressing the probability 𝑝(𝑤𝑖 |𝑤𝑗 )! Given the
two models in Fig. 8.8, we can assign a probability to any sequence from our vocabulary.
A hidden Markov model (HMM) allows us to talk about both observed events (like words that
we see in the input) and hidden events (like part-of-speech tags) that we think of as causal factors in
our probabilistic model. An HMM is specified by the following components:
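In the standard formulation, these components are:
Q = q_1 q_2 … q_N : a set of N states
A = a_11 … a_NN : a transition probability matrix, where a_ij is the probability of moving from state i to state j (the a_ij leaving each state sum to 1)
O = o_1 o_2 … o_T : a sequence of T observations
B = b_i(o_t) : a sequence of observation likelihoods (emission probabilities), the probability of observation o_t being generated from state i
π = π_1, π_2, …, π_N : an initial probability distribution over states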
A first-order hidden Markov model instantiates two simplifying assumptions. First, as with a first-
order Markov chain, the probability of a particular state depends only on the previous state:
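P(q_i | q_1 … q_{i−1}) = P(q_i | q_{i−1})
Second (the output independence assumption), the probability of an output observation o_i depends only on the state q_i that produced it, and not on any other states or observations:
P(o_i | q_1 … q_T , o_1 … o_T) = P(o_i | q_i)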
The A matrix contains the tag transition probabilities 𝑃(𝑡𝑖 |𝑡𝑖−1 ) which represent the probability of a
tag occurring given the previous tag. For example, modal verbs like will are very likely to be followed
by a verb in the base form, a VB, like race, so we expect this probability to be high. We compute the
maximum likelihood estimate of this transition probability by counting, out of the times we see the
first tag in a labeled corpus, how often the first tag is followed by the second:
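P(t_i | t_{i−1}) = C(t_{i−1}, t_i) / C(t_{i−1})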
In the WSJ corpus, for example, MD occurs 13124 times, of which it is followed by VB 10471 times, for an
MLE estimate of
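P(VB | MD) = C(MD, VB) / C(MD) = 10471 / 13124 ≈ .80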
Let’s walk through an example, seeing how these probabilities are estimated and used in a sample
tagging task, before we return to the algorithm for decoding.
In HMM tagging, the probabilities are estimated by counting on a tagged training corpus. For this
example we’ll use the tagged WSJ corpus.
The B emission probabilities, 𝑃(𝑤𝑖 |𝑡𝑖 ), represent the probability, given a tag (say MD), that it will be
associated with a given word (say will). The MLE of the emission probability is
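P(w_i | t_i) = C(t_i, w_i) / C(t_i)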
Of the 13124 occurrences of MD in the WSJ corpus, it is associated with will 4046 times:
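P(will | MD) = C(MD, will) / C(MD) = 4046 / 13124 ≈ .31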
For any model, such as an HMM, that contains hidden variables, the task of determining the hidden
variable sequence corresponding to the sequence of observations is called decoding. More formally,
given as input an HMM 𝜆 = (A, B) and a sequence of observations O = o_1, o_2, … o_T, the decoding task
is to find the most probable sequence of states Q = q_1 q_2 … q_T.
For part-of-speech tagging, the goal of HMM decoding is to choose the tag sequence 𝑡1 … 𝑡𝑛 that is
most probable given the observation sequence of n words 𝑤1 … 𝑤𝑛 :
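t̂_{1:n} = argmax over t_1 … t_n of P(t_1 … t_n | w_1 … w_n)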
The way we’ll do this in the HMM is to use Bayes’ rule to instead compute:
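t̂_{1:n} = argmax over t_1 … t_n of P(w_1 … w_n | t_1 … t_n) P(t_1 … t_n) / P(w_1 … w_n)
       = argmax over t_1 … t_n of P(w_1 … w_n | t_1 … t_n) P(t_1 … t_n)
(the denominator P(w_1 … w_n) can be dropped because it does not depend on the tag sequence).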
The second assumption, the bigram assumption, is that the probability of a tag is dependent only on
the previous tag, rather than the entire tag sequence;
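Written out, these two simplifying assumptions (the word-independence assumption and the bigram assumption, corresponding to Eq. 8.15 and Eq. 8.16) are:
P(w_1 … w_n | t_1 … t_n) ≈ Π_{i=1}^{n} P(w_i | t_i)
P(t_1 … t_n) ≈ Π_{i=1}^{n} P(t_i | t_{i−1})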
Plugging the simplifying assumptions from Eq. 8.15 and Eq. 8.16 into Eq. 8.14 results in the
following equation for the most probable tag sequence from a bigram tagger:
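t̂_{1:n} = argmax over t_1 … t_n of Π_{i=1}^{n} P(w_i | t_i) P(t_i | t_{i−1})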
The two parts of Eq. 8.17 correspond neatly to the B emission probability and A transition probability
that we just defined above!
Each cell of the lattice, 𝑣𝑡 (𝑗), represents the probability that the HMM is in state j after seeing the first
t observations and passing through the most probable state sequence 𝑞1 … 𝑞𝑡−1 , given the HMM 𝜆.
The value of each cell 𝑣𝑡 (𝑗), is computed by recursively taking the most probable path that could lead
us to this cell. Formally, each cell expresses the probability
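v_t(j) = max over q_1 … q_{t−1} of P(q_1 … q_{t−1}, o_1 o_2 … o_t, q_t = j | 𝜆)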
We represent the most probable path by taking the maximum over all possible previous state
sequences, max over q_1 … q_{t−1}. Like other dynamic programming algorithms, Viterbi fills each cell
recursively. Given that we had already computed the probability of being in every state at time t−1,
we compute the Viterbi probability by taking the most probable of the extensions of the paths that
lead to the current cell. For a given state qj at time t, the value 𝑣𝑡 (𝑗), is computed as
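v_t(j) = max over i = 1 … N of v_{t−1}(i) a_ij b_j(o_t)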
The three factors that are multiplied in Eq. 8.19 for extending the previous paths to compute the
Viterbi probability at time t are
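v_{t−1}(i) : the previous Viterbi path probability from the previous time step
a_ij : the transition probability from previous state q_i to current state q_j
b_j(o_t) : the state observation likelihood of the observation symbol o_t given the current state j
As a concrete illustration, here is a minimal NumPy sketch of Viterbi decoding. The array names A, B, and pi follow the notation above; it assumes words and tags have been mapped to integer indices, and it is an illustrative sketch rather than an optimized tagger.
import numpy as np

def viterbi(A, B, pi, obs):
    """Return the most probable state (tag) sequence for a list of observation indices."""
    N, T = A.shape[0], len(obs)
    v = np.zeros((T, N))                 # Viterbi lattice v_t(j)
    back = np.zeros((T, N), dtype=int)   # backpointers

    v[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = v[t - 1][:, None] * A   # scores[i, j] = v_{t-1}(i) * a_ij
        back[t] = scores.argmax(axis=0)
        v[t] = scores.max(axis=0) * B[:, obs[t]]

    # follow backpointers from the best final state
    path = [int(v[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return list(reversed(path))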
There are a few phases for this algorithm, including the initial phase, the forward phase, the backward
phase, and the update phase. The forward and the backward phase form the E-step of the EM
algorithm, while the update phase itself is the M-step.
Initial phase
In the initial phase, the contents of the parameter matrices A, B, and π₀ are initialized; this can be done
randomly if there is no prior knowledge about them.
Forward phase
1. The alpha function is defined as the joint probability of the observed data up to time k and the
state at time k
2. It is a recursive function because the alpha function appears in the first term of the right hand
side (R.H.S.) of the equation, meaning that the previous alpha is reused in the calculation of
the next. This is also why it is called the forward phase.
3. The second term of the R.H.S. is the state transition probability from A, while the last term is
the emission probability from B.
4. The R.H.S. is summed over all possible states at time k -1.
It should be pointed out that, each alpha contains the information from the observed data up to time k,
and to get the next alpha, we only need to reuse the current alpha, and add information about the
transition to the next state and the next observed variable. This recursive behavior saves computations
of getting the next alpha by freeing us from looking through the past observed data every time.
By the way, we need the following starting alpha to begin the recursion.
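In the usual HMM notation, the alpha function and its recursion are:
α_k(j) = P(o_1, …, o_k, q_k = j | 𝜆)
α_1(j) = π_j b_j(o_1)
α_k(j) = Σ_{i=1}^{N} α_{k−1}(i) a_ij b_j(o_k)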
Backward phase
In this phase we use the following (backward) recursion.
1. The beta function is defined as the conditional probability of the observed data from time k+1
given the state at time k
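In the same notation:
β_k(i) = P(o_{k+1}, …, o_T | q_k = i, 𝜆), with β_T(i) = 1
β_k(i) = Σ_{j=1}^{N} a_ij b_j(o_{k+1}) β_{k+1}(j)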
Firstly, as mentioned, they are both recursive functions, which means that we can reuse the previous
answer as the input for the next one. This is what dynamic programming is about: you save time by
reusing old results!
Secondly, the formula in the forward phase is very useful on its own. Suppose you have a set of well-trained
transition and emission parameters, and your problem is to find out, in real time, the hidden truth
behind the observed data. You can do it like this: when you get one data point (call it p), you put it into
the formula, which gives you the probability distribution of the associated hidden state, and from that
you pick the most probable one as your answer. The story does not stop there: when you get the next
data point (call it q) and put it into the formula, it gives you another probability distribution to pick
from, and this one is based not only on data point q and the transition and emission parameters, but
also on data point p. Such use of the formula is called filtering.
Thirdly, and continuing the above discussion, suppose you have already collected many data points.
You know that the earlier a data point is, the less observed data its answer was based on. You would
therefore like to improve the earlier answers by somehow ‘injecting’ information from the later
data into the earlier ones. This is where the backward formula comes into play. Such use of the
formula is called smoothing.
Fourthly, this is about the combination of the last two paragraphs. With the help of the alpha and the
beta formula, one could determine the probability distribution of the state variable at any time k given
the whole sequence of observed data. This could also be understood mathematically.
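Concretely, combining the two gives the state posterior:
γ_k(i) = P(q_k = i | O, 𝜆) = α_k(i) β_k(i) / Σ_{j=1}^{N} α_k(j) β_k(j)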
Update phase
For the derivation of the above formulas: if you have watched the YouTube videos suggested for the
forward and backward formulas, and you understand them, then you will probably have no problem
deriving these two yourself.
The first formula here just repeats what we have seen above; to recap, it tells us the probability
distribution of a state at time k given all the observed data we have. The second formula, however,
tells us something slightly different: the joint probability of two consecutive states given the data.
Both make use of the alpha function, the beta function, and the transition and emission probabilities that
are already available. These two formulas are then used to perform the final update.
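In the usual notation, with γ_k(i) as defined above, the second quantity and the M-step updates are:
ξ_k(i, j) = α_k(i) a_ij b_j(o_{k+1}) β_{k+1}(j) / P(O | 𝜆)
â_ij = Σ_{k=1}^{T−1} ξ_k(i, j) / Σ_{k=1}^{T−1} γ_k(i)
b̂_j(v) = Σ_{k : o_k = v} γ_k(j) / Σ_{k=1}^{T} γ_k(j)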
The Baum-Welch algorithm is a case of the EM algorithm in which, in the E-step, the forward and the
backward formulas give us the expected hidden states given the observed data and the current set of
parameter matrices. The M-step update formulas then tune the parameter matrices to best fit the
observed data and the expected hidden states. These two steps are iterated over and over again until
the parameters converge, or until the model reaches some required accuracy.
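To make the E-step concrete, here is a minimal NumPy sketch of the forward and backward passes and the resulting state posteriors. The array names A, B, and pi follow the notation above; this is an illustrative sketch, not a full Baum-Welch implementation.
import numpy as np

def forward_backward(A, B, pi, obs):
    """Compute alpha, beta, and the state posteriors gamma for one observation sequence."""
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))

    # forward pass: alpha[t, j] = P(o_1..o_t, q_t = j)
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]

    # backward pass: beta[t, i] = P(o_{t+1}..o_T | q_t = i)
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])

    # posteriors gamma[t, i] = P(q_t = i | all observations)
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    return alpha, beta, gamma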
Like any machine learning algorithm, this algorithm can overfit the data, since by definition the
M-step encourages the model to approach the observed data as closely as possible. Also, although we
have not talked much about the initial phase, it does affect the final performance of the model
(it can trap the model in a local optimum), so one might want to try different ways of
initializing the parameters and see what works better.
One obvious clue we might glean from the sample is the list of allowed translations. For example, we
might discover that the expert translator always chooses among the following five French phrases:
{dans, en, a, au cours de, pendant}. With this information in hand, we can impose our first constraint
on our model p:
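p(dans) + p(en) + p(a) + p(au cours de) + p(pendant) = 1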
This equation represents our first statistic of the process; we can now proceed to search for a suitable
model that obeys this equation. Of course, there are an infinite number of models p for which this
identity holds. One model satisfying the above equation is p(dans) = 1; in other words, the model
always predicts dans. Another model obeying this constraint predicts pendant with a probability of
1/2, and a with a probability of 1/2. But both of these models offend our sensibilities: knowing only
that the expert always chose from among these five French phrases, how can we justify either of these
probability distributions? Each seems to be making rather bold assumptions, with no empirical
justification. Put another way, these two models assume more than we actually know about the
expert's decision-making process. All we know is that the expert chose exclusively from among these
five French phrases; given this the most intuitively appealing model is the following:
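p(dans) = p(en) = p(a) = p(au cours de) = p(pendant) = 1/5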
This model, which allocates the total probability evenly among the five possible phrases, is the most
uniform model subject to our knowledge. It is not, however, the most uniform overall; that model
would grant an equal probability to every possible French phrase.
We might hope to glean more clues about the expert's decisions from our sample. Suppose we notice
that the expert chose either dans or en 30% of the time. We could apply this knowledge to update our
model of the translation process by requiring that p satisfy two constraints:
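p(dans) + p(en) = 3/10
p(dans) + p(en) + p(a) + p(au cours de) + p(pendant) = 1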
Say we inspect the data once more, and this time notice another interesting fact: in half the cases, the
expert chose either dans or a. We can incorporate this information into our model as a third constraint:
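p(dans) + p(a) = 1/2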
We can once again look for the most uniform p satisfying these constraints, but now the choice is not
as obvious. As we have added complexity, we have encountered two difficulties at once. First, what
exactly is meant by "uniform," and how can we measure the uniformity of a model? Second, having
determined a suitable answer to these questions, how do we go about finding the most uniform model
subject to a set of constraints like those we have described?
Our task is to construct a stochastic model that accurately represents the behavior of the random
process. Such a model is a method of estimating the conditional probability that, given a context x,
the process will output y. We will denote by p(y|x) the probability that the model assigns to y in
context x. With a slight abuse of notation, we will also use p(y|x) to denote the entire conditional
probability distribution provided by the model, with the interpretation that y and x are placeholders
rather than specific instantiations. The proper interpretation should be clear from the context.
Training Data
To study the process, we observe the behavior of the random process for some time, collecting a
large number of samples (x1, y1), (x2, y2), …, (xN, yN). In the example we have been considering,
each sample would consist of a phrase x containing the words surrounding in, together with the
translation y of in that the process produced. For now, we can imagine that these training samples
have been generated by a human expert who was presented with a number of random phrases
containing in and asked to choose a good translation for each.
We can summarize the training sample in terms of its empirical probability distribution p̃(x, y), defined by
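p̃(x, y) ≡ (1/N) × (number of times that (x, y) occurs in the sample)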
Typically, a particular pair (x,y) will either not occur at all in the sample, or will occur at most a few
times.
To express the fact that in translates as en when April is the following word, we can introduce the
indicator function:
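f(x, y) = 1 if y = en and April follows in; 0 otherwise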
The expected value of f with respect to the empirical distribution p̃(x, y) is exactly the statistic we are
interested in. We denote this expected value by
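p̃(f) ≡ Σ_{x, y} p̃(x, y) f(x, y)    (1)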
We can express any statistic of the sample as the expected value of an appropriate binary-valued
indicator function f. We call such a function a feature function, or feature for short.
When we discover a statistic that we feel is useful, we can acknowledge its importance by requiring
that our model accord with it. We do this by constraining the expected value that the model assigns to
the corresponding feature function f. The expected value of f with respect to the model p(y|x) is
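p(f) ≡ Σ_{x, y} p̃(x) p(y|x) f(x, y)    (2)
where p̃(x) is the empirical distribution of x in the training sample. We constrain this expected value to be the same as the expected value of f in the training sample; that is, we require
p(f) = p̃(f)    (3)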
Combining (1), (2) and (3) yields the more explicit equation
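Σ_{x, y} p̃(x) p(y|x) f(x, y) = Σ_{x, y} p̃(x, y) f(x, y)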
We call the requirement (3) a constraint equation or simply a constraint. By restricting attention to
those models p(y|x) for which (3) holds, we are eliminating from consideration those models that do
not agree with the training sample on how often the output of the process should exhibit the feature f.
To sum up so far, we now have a means of representing statistical phenomena inherent in a sample of
data (namely, p̃(f)), and also a means of requiring that our model of the process exhibit these
phenomena (namely, p(f) = p̃(f)).
One final note about features and constraints bears repeating: although the words "feature" and
"constraint" are often used interchangeably in discussions of maximum entropy, we will be vigilant in
distinguishing the two and urge the reader to do likewise. A feature is a binary-valued function of (x,
y); a constraint is an equation between the expected value of the feature function in the model and its
expected value in the training data.
Here P is the space of all (unconditional) probability distributions on three points, sometimes called a
simplex. If we impose no constraints (depicted in (a)), then all probability models are allowable.
Imposing one linear constraint C1 restricts us to those p ∈ P that lie on the region defined by C1, as
depicted in (b). Among the models p ∈ C, the maximum entropy philosophy dictates that we select the
most uniform distribution. But now we face a question left open in Section 2: what does "uniform"
mean? A mathematical measure of the uniformity of a conditional distribution p(y|x) is provided by
the conditional entropy
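H(p) ≡ − Σ_{x, y} p̃(x) p(y|x) log p(y|x)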
There is a discriminative sequence model based on log-linear models: the conditional random
field (CRF). We’ll describe here the linear chain CRF, the version of the CRF most commonly used
for language processing, and the one whose conditioning closely matches the HMM.
In a CRF, by contrast, we compute the posterior p(Y|X) directly, training the CRF to discriminate among the possible tag sequences.
Let’s introduce the CRF more formally, again using X and Y as the input and output sequences. A
CRF is a log-linear model that assigns a probability to an entire output (tag) sequence Y, out of all
possible sequences Y, given the entire input (word) sequence X. We can think of a CRF as like a giant
version of what multinomial logistic regression does for a single token. Recall that the feature
function f in regular multinomial logistic regression can be viewed as a function of a tuple: a token x
and a label y (page 89). In a CRF, the function F maps an entire input sequence X and an entire output
sequence Y to a feature vector. Let’s assume we have K features, with a weight 𝑤𝑘 for each feature
𝐹𝑘 :
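p(Y | X) = exp( Σ_{k=1}^{K} w_k F_k(X, Y) ) / Σ_{Y′ ∈ 𝒴} exp( Σ_{k=1}^{K} w_k F_k(X, Y′) )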
It’s common to also describe the same equation by pulling out the denominator into a function Z(X):
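p(Y | X) = (1 / Z(X)) exp( Σ_{k=1}^{K} w_k F_k(X, Y) ), where Z(X) = Σ_{Y′ ∈ 𝒴} exp( Σ_{k=1}^{K} w_k F_k(X, Y′) )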
We’ll call these K functions 𝐹𝑘 (𝑋, 𝑌) global features, since each one is a property of the entire input
sequence X and output sequence Y. We compute them by decomposing into a sum of local features
for each position i in Y:
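F_k(X, Y) = Σ_{i=1}^{n} f_k(y_{i−1}, y_i, X, i)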
Each of these local features f_k in a linear-chain CRF is allowed to make use of the current output
token y_i, the previous output token y_{i−1}, the entire input string X (or any subpart of it), and the
current position i. This constraint to depend only on the current and previous output tokens y_i and
y_{i−1} is what characterizes a linear chain CRF. As we will see, this limitation makes it possible to use
an efficient Viterbi-style decoding algorithm. Let’s look at some of these features in detail, since the
reason to use a discriminative sequence model is that it’s easier to incorporate a lot of features.
Again, in a linear-chain CRF, each local feature 𝑓𝑘 at position i can depend on any information from:
(𝑦𝑖−1 , 𝑦𝑖 ,X, i). So some legal features representing common situations might be the following:
For simplicity, we’ll assume all CRF features take on the value 1 or 0. Above, we explicitly use the
notation 1{x} to mean “1 if x is true, and 0 otherwise”. From now on, we’ll leave off the 1 when we
define features, but you can assume each feature has it there implicitly.
Although the design of which features to use is done by hand by the system designer, the specific features
are automatically populated by using feature templates, as we briefly mentioned in Chapter 5. Here are
some templates that only use information from (y_{i−1}, y_i, X, i):
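For example (illustrative templates in the spirit of the discussion here):
⟨y_i, x_i⟩, ⟨y_i, y_{i−1}⟩, ⟨y_i, x_{i−1}, x_{i+2}⟩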
These templates automatically populate the set of features from every instance in the training and test
set. Thus for our example Janet/NNP will/MD back/VB the/DT bill/NN, when xi is the word back, the
following features would be generated and have the value 1 (we’ve assigned them arbitrary feature
numbers):
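For instance (the feature indices below are made up purely for illustration):
f_3743: y_i = VB and x_i = back
f_156: y_i = VB and y_{i−1} = MD
f_99732: y_i = VB and x_{i−1} = will and x_{i+2} = bill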
It’s also important to have features that help with unknown words. One of the most important is word
shape features, which represent the abstract letter pattern of the word by mapping lower-case letters to
‘x’, upper-case to ‘X’, numbers to ’d’, and retaining punctuation. Thus for example I.M.F. would map
to X.X.X. and DC10-30 would map to XXdd-dd. A second class of shorter word shape features is also
used. In these features consecutive character types are removed, so words in all caps map to X, words
with initial-caps map to Xx, DC10-30 would be mapped to Xd-d but I.M.F would still map to X.X.X.
Prefix and suffix features are also useful. In summary, here are some sample feature templates that
help with unknown words:
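For example, templates along these lines are commonly used (an illustrative, not exhaustive, list):
x_i contains a particular prefix (up to length 4)
x_i contains a particular suffix (up to length 4)
word shape of x_i
short word shape of x_i
x_i contains a number, an upper-case letter, or a hyphen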
The known-word templates are computed for every word seen in the training set; the unknown word
features can also be computed for all words in training, or only on training words whose frequency is
below some threshold. The result of the known-word templates and word-signature features is a very
large set of features. Generally a feature cutoff is used in which features are thrown out if they have
count < 5 in the training set.
How do we find the best tag sequence ˆY for a given input X? We start with Eq. 8.22:
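Ŷ = argmax_Y P(Y | X)
  = argmax_Y (1 / Z(X)) exp( Σ_{k=1}^{K} w_k Σ_{i=1}^{n} f_k(y_{i−1}, y_i, X, i) )
  = argmax_Y Σ_{k=1}^{K} w_k Σ_{i=1}^{n} f_k(y_{i−1}, y_i, X, i)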
We can ignore the exp function and the denominator Z(X), as we do above, because exp doesn’t
change the argmax, and the denominator Z(X) is constant for a given observation sequence X.
Concretely, this involves filling an N × T array with the appropriate values, maintaining backpointers as
we proceed. As with HMM Viterbi, when the table is filled, we simply follow pointers back from the
maximum value in the final column to retrieve the desired set of labels.
The requisite changes from HMM Viterbi have to do only with how we fill each cell. Recall from Eq.
8.19 that the recursive step of the Viterbi equation computes the Viterbi value of time t for state j as
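v_t(j) = max over i = 1 … N of v_{t−1}(i) a_ij b_j(o_t)
For the CRF, the transition and emission probabilities are replaced by the weighted sum of local features, and we work with scores rather than probabilities, so (in a standard formulation) the recursive step becomes:
v_t(j) = max over i = 1 … N of [ v_{t−1}(i) + Σ_{k=1}^{K} w_k f_k(y_{t−1} = i, y_t = j, X, t) ]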
Learning in CRFs relies on the same supervised learning algorithms we presented for logistic
regression. Given a sequence of observations, feature functions, and corresponding outputs, we use
stochastic gradient descent to train the weights to maximize the log-likelihood of the training corpus.
The local nature of linear-chain CRFs means that the forward-backward algorithm introduced for
HMMs in Appendix A can be extended to a CRF version that will efficiently compute the necessary
derivatives. As with logistic regression, L1 or L2 regularization is important.
Important Question:
1. Implement the “most likely tag” baseline. Find a POS-tagged training set, and use it to
compute for each word the tag that maximizes p(t|w). You will need to implement a simple
tokenizer to deal with sentence boundaries. Start by assuming that all unknown words are NN
and compute your error rate on known and unknown words. Now write at least five rules to
do a better job of tagging unknown words, and show the difference in error rates. (A minimal sketch of the baseline step appears after this list.)
2. Build a bigram HMM tagger. You will need a part-of-speech-tagged corpus. First split the
corpus into a training set and a test set. From the labeled training set, train the transition and
observation probabilities of the HMM tagger directly on the hand-tagged data. Then
implement the Viterbi algorithm so you can decode a test sentence. Now run your algorithm
on the test set. Report its error rate and compare its performance to the most frequent tag
baseline.
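A minimal sketch of the “most likely tag” baseline from question 1, using NLTK’s small Penn Treebank sample (it assumes nltk is installed; the 90/10 split and the NN default are illustrative choices):
from collections import Counter, defaultdict
import nltk
nltk.download('treebank')
from nltk.corpus import treebank

tagged = treebank.tagged_sents()
split = int(0.9 * len(tagged))
train, test = tagged[:split], tagged[split:]

# count how often each tag appears with each word in the training portion
counts = defaultdict(Counter)
for sent in train:
    for word, tag in sent:
        counts[word][tag] += 1
most_likely = {w: c.most_common(1)[0][0] for w, c in counts.items()}

# tag the test portion; unknown words default to NN
correct = total = 0
for sent in test:
    for word, gold in sent:
        total += 1
        correct += (most_likely.get(word, 'NN') == gold)
print('baseline accuracy:', correct / total)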
Project :
Build a system to correctly identify different parts of speech to detect gender of a word from a
corpus.
Interview Questions:
1. What is Part of Speech (POS) Tagging?
2. What is a Hidden Markov Model?
3. How does an HMM help in POS tagging?
4. What are Conditional Random Fields?
5. How can a CRF be used in entity name tagging?
Research Papers:
1. Song, S., Zhang, N., & Huang, H. (2019). Named entity recognition based on conditional
random fields. Cluster Computing, 22, 5195-5206.
2. Liu, Z., Tang, B., Wang, X., & Chen, Q. (2017). De-identification of clinical notes via
recurrent neural network and conditional random field. Journal of Biomedical Informatics, 75,
S34-S42.
3. Suleiman, D., Awajan, A., & Al Etaiwi, W. (2017). The use of hidden Markov model in
natural Arabic language processing: A survey. Procedia Computer Science, 113, 240-247.
4. Paul, A., Purkayastha, B. S., & Sarkar, S. (2015, September). Hidden Markov model based
part of speech tagging for Nepali language. In 2015 International Symposium on Advanced
Computing and Communication (ISACC) (pp. 149-156). IEEE.
5. Sun, S., Liu, H., Lin, H., & Abraham, A. (2012, October). Twitter part-of-speech tagging
using pre-classification Hidden Markov model. In 2012 IEEE International Conference on
Systems, Man, and Cybernetics (SMC) (pp. 1118-1123). IEEE.
Constituency Parsing: Constituency parsing involves dividing a sentence into a set of constituents
(phrases) and representing the hierarchical relationships between these constituents. The result is
typically a parse tree, where the root represents the whole sentence, and the internal nodes represent
constituents, such as noun phrases, verb phrases, prepositional phrases, etc. The leaf nodes correspond
to individual words in the sentence.
Dependency Parsing: Dependency parsing represents the grammatical relationships between words in
a sentence using directed arcs. Each word in the sentence is a node in the tree, and the arcs represent
the syntactic dependencies between words, showing which words are dependent on others. The root of
the tree is usually an artificial node representing the root of the sentence.
Syntax parsing can be accomplished using various algorithms and techniques, including rule-based
approaches, statistical parsing models, and neural network-based methods. Data-driven approaches
that learn from annotated parsing data have gained significant popularity due to their ability to
generalize well to unseen sentences.
In recent years, deep learning techniques, particularly transformer-based models like BERT and GPT,
have achieved state-of-the-art performance in syntax parsing, making it a crucial component in
various NLP tasks and applications.
Syntax CKY:
The context-free grammar used in CKY is typically in Chomsky Normal Form (CNF), where each
production rule is of the form:
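A → B C
A → word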
Here, A, B, and C are non-terminal symbols representing constituents (phrases), and "word"
represents individual words in the sentence.
The CKY algorithm works by filling up a table, called the CKY table, for all possible subphrases of
the sentence. The rows of the table represent the starting positions of the subphrases, and the columns
represent the ending positions. Each cell of the table stores the constituents that span the
corresponding subphrase.
1. Initialization: Fill the diagonal cells of the CKY table with the terminal rules corresponding to
the individual words in the sentence.
2. Filling the Table: Traverse the CKY table in a diagonal manner, filling the cells with binary
rules that combine constituents to build larger constituents. The constituents in the cells are
determined based on the grammar rules and the constituents in the adjacent cells.
3. Backtracking: Once the CKY table is filled, the parse tree can be constructed by backtracking
through the table and finding the constituents that span the entire sentence.
The CKY algorithm efficiently explores all possible parse trees for the given sentence and grammar,
eliminating the need to enumerate all possible combinations explicitly. This makes it an efficient
parsing method, especially for CNF grammars.
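To make the procedure concrete, here is a minimal CKY recognizer sketch in Python for a toy CNF grammar (the grammar, lexicon, and function name are illustrative assumptions, not part of the notes):
def cky_parse(words, lexical_rules, binary_rules):
    """CKY recognizer for a CNF grammar.

    lexical_rules: dict word -> set of non-terminals (rules A -> word)
    binary_rules : dict (B, C) -> set of non-terminals (rules A -> B C)
    Returns the table; table[i][j] holds the constituents spanning words[i:j].
    """
    n = len(words)
    table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]

    # Initialization: fill single-word cells with the terminal rules
    for i, word in enumerate(words):
        table[i][i + 1] = set(lexical_rules.get(word, set()))

    # Fill the table for increasing span lengths
    for span in range(2, n + 1):
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):              # split point
                for B in table[i][k]:
                    for C in table[k][j]:
                        table[i][j] |= binary_rules.get((B, C), set())
    return table

# toy grammar: S -> NP VP, NP -> Det N, VP -> V NP
binary = {("NP", "VP"): {"S"}, ("Det", "N"): {"NP"}, ("V", "NP"): {"VP"}}
lexicon = {"the": {"Det"}, "dog": {"N"}, "cat": {"N"}, "saw": {"V"}}
sent = "the dog saw the cat".split()
table = cky_parse(sent, lexicon, binary)
print("S" in table[0][len(sent)])   # True: the sentence is recognized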
The CKY algorithm is widely used in parsing tasks where constituency-based parse trees are required.
It has been employed in various NLP applications, such as information extraction and question answering.
PCFG stands for Probabilistic Context-Free Grammar, and it is a formalism used in Natural Language
Processing (NLP) for modeling the syntax of a language with probabilities. PCFGs extend context-
free grammars by assigning probabilities to each production rule, indicating the likelihood of
generating a particular phrase or constituent in a sentence.
A context-free grammar is a set of production rules that describe how sentences can be generated in a
language. Each production rule consists of a non-terminal symbol on the left-hand side (LHS) and a
sequence of symbols (non-terminals and/or terminals) on the right-hand side (RHS). Non-terminals
are placeholders for constituents or phrases, while terminals represent actual words in the language.
PCFGs:
In a PCFG, each production rule is associated with a probability, denoting the likelihood of using that
rule during the generation process. The probabilities must satisfy certain conditions, such as being
non-negative and summing to one for each non-terminal symbol.
The probabilities in a PCFG are typically estimated from a large corpus of annotated sentences using
techniques such as maximum likelihood estimation or expectation-maximization.
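For example, with maximum likelihood estimation from a treebank, each rule probability is typically estimated as
P(A → β) = Count(A → β) / Σ_γ Count(A → γ)
so that the probabilities of all rules expanding the same non-terminal A sum to one.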
PCFGs are commonly used in syntax parsing tasks, especially for constituency parsing. The
probabilistic nature of PCFGs allows them to capture the most likely syntactic structures of sentences.
During parsing, the goal is to find the most probable parse tree (constituency-based) or dependency
tree (dependency-based) for a given sentence, given the PCFG.
PCFGs have been used in various NLP applications, such as machine translation, speech recognition,
and information extraction. However, like other grammar-based approaches, PCFGs have some
limitations in handling long-range dependencies and capturing semantic information. As a result,
modern NLP techniques, particularly neural network-based models, have become more popular due to
their ability to handle more complex language structures and learn from large amounts of data without
relying on manually crafted grammars.
PCFGs Inside:
1. Syntactic Parsing: PCFGs are commonly used for parsing sentences and generating parse
trees. Given a sentence, the PCFG aims to find the most probable parse tree by selecting the
most likely production rules at each step of parsing. These parse trees represent the
hierarchical syntactic structure of the sentences, which can be useful for various downstream
tasks like information extraction, question answering, and sentiment analysis.
2. Ambiguity Resolution: Natural language often contains ambiguity, where a sentence can have
multiple valid interpretations or parse trees. PCFGs help in disambiguating sentences by
selecting the most probable parse tree based on the assigned probabilities of the production
rules. This is particularly useful in tasks like machine translation, where choosing the correct
translation can depend on the syntactic structure.
3. Language Modeling: PCFGs can be used for language modeling, where they estimate the
probabilities of sentences or sequences of words. By assigning probabilities to different
production rules, PCFGs can calculate the likelihood of generating a particular sentence
according to the grammar, which aids in generating coherent and fluent sentences.
4. Speech Recognition: PCFGs have been used in certain approaches to speech recognition.
They can help in modeling the syntactic structure of spoken sentences and aid in the
conversion of spoken language into written text.
5. Grammar Induction: PCFGs can be used to induce grammars from a set of sentences. This is
helpful in cases where the grammar of a language is not known beforehand or when dealing
with languages with limited resources.
It's worth noting that while PCFGs have been widely used in NLP, more advanced probabilistic
models, such as probabilistic dependency grammars and probabilistic phrase structure grammars, have
been developed to address some of the limitations of PCFGs and achieve better accuracy and
performance in various NLP tasks. Nevertheless, PCFGs remain an essential foundational concept in
the field of computational linguistics and natural language processing.
Outside Probabilities:
In Natural Language Processing (NLP), the concept of "outside probabilities" is often associated with
parsing algorithms, specifically those used in probabilistic context-free grammars (PCFGs) or more
generally, in context-free grammars (CFGs) augmented with probabilities.
The "inside probabilities" refer to the probabilities associated with partial parses (subtrees) that span a
contiguous subsequence of the input sentence. On the other hand, the "outside probabilities" are used
to compute the probabilities of partial parses that cover the remaining portions of the input sentence,
i.e., the portions that are not covered by the inside probabilities.
To explain it further, let's consider a parse tree of a sentence. The inside probabilities are used to
calculate the probabilities of subtrees starting from the leaves of the tree and moving upwards towards the root.
In a PCFG or probabilistic CFG, the inside probabilities are usually calculated using algorithms like
the Inside-Outside algorithm or the CYK algorithm (Cocke-Younger-Kasami). Once the inside
probabilities have been computed, the outside probabilities can be calculated in a complementary
top-down pass, this time starting from the root and moving towards the leaves of the tree.
1. Parsing Accuracy: When parsing sentences, both inside and outside probabilities are crucial
for improving the accuracy of the parse. The combination of inside and outside probabilities
allows for better disambiguation of sentence structures and more accurate parsing results.
2. Probabilistic Parsing: PCFGs can be used to compute probabilities for different parse trees of
a sentence. By incorporating outside probabilities, the overall probabilities of the parse trees
can be computed more accurately.
3. Parameter Estimation: In certain learning algorithms used for PCFGs, such as the
Expectation-Maximization (EM) algorithm, outside probabilities are used to update the model
parameters based on the likelihood of the training data.
4. Parsing with Unannotated Data: Outside probabilities are also employed in semi-supervised
or unsupervised parsing, where the goal is to parse sentences using a combination of
annotated and unannotated data.
It's important to note that while outside probabilities are beneficial for improving parsing accuracy
and probabilistic modeling in NLP, they also increase the computational complexity of parsing
algorithms. Hence, various approximation techniques and optimization methods are used to make the
parsing process more efficient and scalable.
In Natural Language Processing (NLP), "Inside-Outside" (IO) probabilities, the tree-structured
analogue of the "Forward-Backward" probabilities used for HMMs, are used in the context of
probabilistic context-free grammars (PCFGs) and probabilistic models to estimate probabilities of
parse trees and compute expectations of different structures in the data.
Inside Probabilities:
1. Inside probabilities are used to compute the probability of a substructure (partial parse) of a
sentence that spans a contiguous subsequence of the input. In the context of parsing, this
refers to calculating the probability of a partial parse tree rooted at a particular node, given the
observed words in the sentence up to that point. These probabilities are typically calculated in
a bottom-up manner, starting from the leaves of the parse tree and moving towards the root.
The inside probabilities are denoted as α(i, j, X), where "i" and "j" represent the span of the
subsequence, and "X" is a non-terminal symbol.
The inside probabilities are calculated recursively. Assuming a CNF grammar with rules of the form
X → Y Z and X → word, a standard form of the recursion is given below.
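α(i, i, X) = P(X → w_i)
α(i, j, X) = Σ over rules X → Y Z, Σ_{k = i}^{j−1} P(X → Y Z) · α(i, k, Y) · α(k+1, j, Z)
where w_i is the i-th word, the outer sum ranges over binary rules with X on the left-hand side, and k ranges over the possible split points of the span.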
2. Outside probabilities are used to compute the probability of the remaining context of the sentence,
which is not covered by the partial parse. They represent the probability of the unobserved words in
the sentence outside the current span. The outside probabilities are denoted as β(i, j, X).
The outside probabilities are likewise calculated recursively; a standard form of the recursion is given below.
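β(1, n, S) = 1 for the start symbol S spanning the whole sentence (and 0 for the other non-terminals), and
β(i, j, X) = Σ over rules Y → X Z, Σ_{k = j+1}^{n} P(Y → X Z) · β(i, k, Y) · α(j+1, k, Z)
          + Σ over rules Y → Z X, Σ_{k = 1}^{i−1} P(Y → Z X) · β(k, j, Y) · α(k, i−1, Z)
where n is the sentence length; the two sums cover the cases where X is the left child or the right child of its parent Y.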
1. Parsing: Inside-Outside probabilities are crucial for calculating the probabilities of parse trees
and finding the most likely parse for a given sentence.
2. Parameter Estimation: Inside-Outside probabilities are used in the Expectation-Maximization
(EM) algorithm for parameter estimation in PCFGs.
3. Probabilistic Modeling: Inside-Outside probabilities help estimate probabilities of different
linguistic structures in the data, which is essential for various probabilistic models in NLP.
The Inside-Outside algorithm efficiently computes these probabilities and is widely used in various
parsing and probabilistic modeling tasks in NLP.
Dependency grammars and parsing are essential concepts in Natural Language Processing (NLP) that
focus on analyzing the syntactic structure of sentences based on dependencies between words. Unlike
phrase structure grammars, which use constituency-based parsing to represent hierarchical phrase
structures, dependency grammars represent the relationships between individual words in a sentence
using directed links called dependencies.
1. Dependency Grammars:
Dependency grammar is a type of formal grammar that focuses on the relationships between words in
a sentence. In this approach, each word in the sentence is considered a node in a syntactic tree, and the
grammatical relations between words are expressed as directed links from heads to their dependents.
2. Dependency Parsing:
Dependency parsing is the process of automatically analyzing the syntactic structure of a sentence
using a dependency grammar. It involves creating a dependency tree (also known as a parse tree or a
dependency graph) that represents the dependencies between words in the sentence. The goal of
dependency parsing is to determine the grammatical relationships (dependencies) between words and
to construct a tree that represents these relationships in a meaningful way.
3. Dependency Relations:
The links (arcs) in a dependency tree represent different types of dependency relations between
words. Common dependency relations include:
"nsubj": Nominal subject (the word that acts as the subject of the sentence)
"dobj": Direct object (the word that is the direct object of the verb)
"amod": Adjectival modifier (the word that modifies a noun)
"advmod": Adverbial modifier (the word that modifies a verb or adjective)
"conj": Conjunct (connecting words that have the same relationship to another word)
"root": The root of the tree, usually representing the main verb of the sentence
4. Dependency Parsing Algorithms:
There are various algorithms used for dependency parsing, ranging from rule-based approaches to
statistical and machine learning-based methods. Common dependency parsing algorithms include
transition-based methods (e.g., arc-eager, arc-standard) and graph-based methods (e.g., Eisner's
algorithm). These algorithms aim to find the most likely dependency tree for a given sentence based
on the observed dependencies in a training corpus.
Dependency parsing is widely used in NLP for tasks such as information extraction, question
answering, machine translation, and sentiment analysis. It provides valuable insights into the syntactic
relationships between words in a sentence, which can be leveraged for a wide range of natural
language understanding tasks.
Transition-based parsing is a popular approach in Natural Language Processing (NLP) for syntactic
parsing, particularly for dependency parsing. It involves using a sequence of transition actions to
construct a dependency tree for a given sentence. The parser starts with an initial state and iteratively
applies transition actions until it reaches a final state, representing the fully constructed dependency
tree.
Configuration:
A configuration represents the state of the parser at a particular step during parsing. It consists of a
stack, a buffer, and a set of dependency arcs (links between words).
Transition Actions:
Transition actions are rules that define how the parser can change its configuration. Each action
corresponds to an operation the parser can perform at a given state. Common transition actions
include SHIFT (move a word from the buffer to the stack), LEFT-ARC (create a dependency arc from
the top of the stack to the second-top word on the stack), and RIGHT-ARC (create a dependency arc
from the second-top word on the stack to the top of the stack).
2. Parsing Process:
The parsing process starts with an initial configuration where the buffer contains all the words in the
sentence, and the stack is empty except for the ROOT symbol. The parser then iteratively applies
transition actions until the buffer is empty, and the stack contains only the ROOT symbol.
During each iteration, the parser uses a parsing model (e.g., a machine learning model) to predict the
most appropriate transition action based on the current configuration and linguistic features of the
words. The model is trained on annotated data (dependency treebanks) to learn the patterns and
dependencies in the language.
Some common transition-based parsing algorithms include the arc-eager algorithm and the arc-
standard algorithm. These algorithms differ in their transition actions and how they handle the
construction of dependency arcs.
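As an illustration, here is a minimal sketch of the arc-standard transition system in Python (the function name, labels, and fixed action sequence are illustrative; in a real parser each action would be predicted by the trained model described above):
def arc_standard_parse(words, actions):
    """Apply a fixed sequence of arc-standard transitions to a sentence.

    words   : list of tokens; token indices start at 1, index 0 is ROOT
    actions : tuples such as ("SHIFT",), ("LEFT-ARC", "nsubj"), ("RIGHT-ARC", "dobj")
    Returns a list of (head, label, dependent) arcs.
    """
    stack = [0]                              # ROOT starts on the stack
    buffer = list(range(1, len(words) + 1))  # indices of remaining words
    arcs = []

    for action in actions:
        if action[0] == "SHIFT":
            stack.append(buffer.pop(0))      # move next word onto the stack
        elif action[0] == "LEFT-ARC":
            head, dep = stack[-1], stack[-2] # top governs second-top
            arcs.append((head, action[1], dep))
            del stack[-2]
        elif action[0] == "RIGHT-ARC":
            head, dep = stack[-2], stack[-1] # second-top governs top
            arcs.append((head, action[1], dep))
            stack.pop()
    return arcs

# "She ate fish": ROOT -> ate (root), ate -> She (nsubj), ate -> fish (dobj)
actions = [("SHIFT",), ("SHIFT",), ("LEFT-ARC", "nsubj"),
           ("SHIFT",), ("RIGHT-ARC", "dobj"), ("RIGHT-ARC", "root")]
print(arc_standard_parse(["She", "ate", "fish"], actions))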
Transition-based parsing is known for its efficiency and simplicity, making it a popular choice for
dependency parsing in various NLP applications. It has achieved impressive performance with the
help of powerful machine learning models and feature representations.
Formulation:
In Natural Language Processing (NLP), "formulation" refers to the process of converting a natural
language problem or task into a structured representation that can be processed by computational
algorithms. Formulation involves defining the input and output formats, specifying the problem
requirements, and deciding on the appropriate representation for the task at hand.
Problem: Given a piece of text (e.g., a review or a tweet), determine the sentiment expressed in the
text (e.g., positive, negative, neutral).
Formulation: In sentiment analysis, the formulation involves representing the text as a sequence of
words (tokens) and mapping it to one of the sentiment classes (e.g., positive, negative, neutral). This
can be achieved through supervised learning, where a labeled dataset of text examples with
corresponding sentiments is used to train a machine learning model.
Problem: Identify and classify named entities (e.g., person names, locations, organizations) in a given
text.
Formulation: For NER, the formulation involves representing the text as a sequence of tokens and
assigning a label to each token indicating whether it belongs to a named entity and its specific type
(e.g., person, location, organization). This task can be addressed using sequence labeling approaches,
such as Conditional Random Fields (CRFs) or deep learning models like BiLSTMs with CRF.
Problem: Translate a sentence from a source language into a target language.
Formulation: In machine translation, the formulation requires mapping a source sentence in the source
language to a target sentence in the target language. This involves representing the sentences as
sequences of words and finding an appropriate mapping using various translation models, such as
statistical machine translation or neural machine translation.
Problem: Given a question in natural language, find the most relevant answer in a given context or
document.
Formulation: For question answering, the formulation involves representing the question and the
context/document as structured data. This can be achieved by using techniques like word embeddings,
attention mechanisms, and language modeling to align the question with relevant parts of the context
and generate an appropriate answer.
In each of these examples, the process of formulation helps define the task, establish the data
representation, and guide the selection of appropriate algorithms and models to address the NLP
problem effectively.
1. Feature Extraction:
To train a machine learning model for transition-based parsing, relevant features need to be extracted
from the parsing configurations. Features can include information about the words in the stack and
buffer, the current dependency arcs, and other linguistic features like part-of-speech tags and word
embeddings. The goal is to create a feature representation that captures the relevant information
necessary for predicting the next transition action.
2. Training Data:
Training data is essential for supervised learning of the parsing model. The training data consists of
parsing configurations and the corresponding correct transition actions for a given set of sentences.
These configurations can be obtained by applying a gold-standard oracle or an existing parser to
generate the correct sequence of transitions for each sentence in the training set.
3. Model Selection:
Different machine learning models can be used for transition-based parsing, including linear models
(e.g., logistic regression, linear SVM), decision trees, random forests, and neural network-based
models (e.g., feedforward neural networks, recurrent neural networks, or transformers). The choice of
the model depends on the complexity of the task and the availability of training data.
4. Model Training:
The training process involves feeding the extracted features and correct transition actions from the
training data into the selected machine learning model. The model is then trained to learn the mapping
between the feature representations and the correct transition actions. The objective is to minimize the
prediction errors between the model's output and the gold-standard actions.
5. Model Evaluation:
After training, the model's performance is evaluated on a separate development or validation dataset.
This dataset contains sentences that the model has not seen during training. The evaluation measures
the accuracy of the model in predicting the correct transition actions. Metrics such as labeled
attachment score (LAS) or unlabeled attachment score (UAS) are commonly used to evaluate
dependency parsing accuracy.
6. Hyperparameter Tuning:
Machine learning models often have hyperparameters that need to be tuned to optimize the
performance on the validation set. Hyperparameter tuning involves trying different combinations of
hyperparameter values to find the best configuration for the model.
By going through these steps, transition-based parsing models can be effectively trained and applied
to parse sentences, providing valuable insights into the syntactic structure of natural language text.
MST (Minimum Spanning Tree) based dependency parsing is a popular approach in Natural
Language Processing (NLP) for generating dependency trees from sentences. It involves finding the
minimum spanning tree of a graph, where each word in the sentence is represented as a node, and the
dependency relations between words are represented as weighted edges. The resulting minimum
spanning tree corresponds to the dependency tree, where each word has exactly one head (governor)
and a directed edge indicates the dependency relation.
1. Graph Construction:
The first step in MST-based dependency parsing is to construct a graph representation of the sentence.
Each word in the sentence is treated as a node, and the dependency relations between words are
represented as weighted edges. These weights can be assigned based on various features, such as part-
of-speech tags, word embeddings, or other linguistic properties. The goal is to create a graph that
captures the likelihood of each dependency relation.
2. Finding the Minimum Spanning Tree:
The next step is to find the minimum spanning tree of the graph. The minimum spanning tree is a tree
that connects all the nodes (words) in the graph with the minimum total edge weight while avoiding
cycles. In dependency parsing, the minimum spanning tree corresponds to the dependency tree of the
sentence, where each word has a single head (governor), and the edges indicate the dependency
relations.
Once the minimum spanning tree is obtained, the edges in the tree represent the dependency relations
between words in the sentence. The parser then transforms this tree into a labeled dependency tree,
where each edge is labeled with the corresponding dependency relation (e.g., subject, object,
modifier).
MST-based dependency parsing can handle both projective and non-projective dependency structures.
In a projective tree, the edges do not cross each other when drawn above the sentence; non-projective
trees, which arise in languages with freer word order, contain crossing dependencies.
There are various algorithms for finding the minimum spanning tree of a graph, and they differ in
terms of efficiency and optimality. Common algorithms used in MST-based dependency parsing
include Chu-Liu/Edmonds' algorithm and Eisner's algorithm.
MST-based dependency parsing is widely used in NLP due to its efficiency and accuracy. It has been
successfully applied to many languages and has become a standard method for parsing dependency
structures. Machine learning techniques, such as structured prediction with structured perceptron or
neural networks, are often integrated into MST-based parsing to improve performance further.
MST-based dependency parsing can be combined with machine learning techniques to learn the
weights of the edges in the graph representation of the sentence. By learning these weights from
annotated data (dependency treebanks), the parsing model can capture the most probable dependency
relations for different linguistic contexts. This approach is known as "MST-based dependency parsing
with learning" or "graph-based dependency parsing with learning."
1. Feature Extraction:
Features are extracted for each candidate edge (head-dependent pair), typically from the part-of-speech
tags, word forms, and embeddings of the two words and their context, as described above.
2. Training Data:
Training data is essential for supervised learning of the parsing model. The training data consists of
sentences along with their corresponding gold-standard dependency trees, where each dependency
relation is represented as a labeled edge in the graph. For each sentence, the goal is to find the optimal
set of edge weights that results in the correct dependency tree.
3. Model Selection:
Machine learning models are selected to learn the edge weights for the graph representation. Common
choices include structured prediction models such as the structured perceptron or structured SVM,
which can directly optimize the global structure (the dependency tree) rather than considering
individual edges independently.
4. Model Training:
The training process involves feeding the extracted features and the correct dependency trees from the
training data into the selected machine learning model. The model is then trained to learn the optimal
weights for the edges that best fit the gold-standard dependency trees.
5. Model Evaluation:
After training, the model's performance is evaluated on a separate development or validation dataset
using metrics such as labeled attachment score (LAS) or unlabeled attachment score (UAS).
Hyperparameter tuning may be performed to optimize the model's performance on the validation set.
Finally, the trained model is tested on a separate test set to evaluate its performance on unseen data.
The test set evaluation provides a realistic assessment of the model's ability to generalize to new
sentences and produce accurate dependency trees.
By combining MST-based dependency parsing with learning, the model can learn to make informed
decisions about the most likely dependency relations in different linguistic contexts, leading to more
accurate and linguistically informed dependency parsing results. The learned model can be applied to
parse sentences and extract meaningful dependency structures, which are useful for various NLP
tasks, such as information extraction, sentiment analysis, and machine translation.
QUESTIONS:
1. What is semantics?
Semantics is the study of reference, meaning, or truth. The term can be used to refer to
subfields of several distinct disciplines, including philosophy, linguistics, and computer science.
Structured Models
Query expansion is one of the most common methods to solve mismatch. We
use the automatic term mismatch diagnosis to guide query expansion. Other
forms of intervention, e.g. term removal or substitution, can also solve certain
cases of mismatch, but they are not the focus of this work. We show that proper
diagnosis can save expansion effort by 33%, while achieving near optimal
performance. We generate structured expansion queries in Boolean conjunctive
normal form (CNF) -- a conjunction of disjunctions, where each disjunction groups together
alternative expressions of a single query concept.
Word Embeddings
It is an approach for representing words and documents. Word Embedding or
Word Vector is a numeric vector input that represents a word in a lower-
dimensional space. It allows words with similar meaning to have a similar
representation. They can also approximate meaning. A word vector with 50
values can represent 50 unique features.
Features: Anything that relates words to one another, e.g., age, sports, fitness, employed, etc. Each
word vector has values corresponding to these features.
Goal of Word Embeddings
To reduce dimensionality
To use a word to predict the words around it
Inter word semantics must be captured
How are Word Embeddings used?
They are used as input to machine learning models.
Take the words → give their numeric representation → use in training or inference
To represent or visualize any underlying patterns of usage in the
corpus that was used to train them.
Implementations of Word Embeddings:
Word Embeddings are a method of extracting features out of text so that we
can input those features into a machine learning model to work with text data.
They try to preserve syntactical and semantic information. The methods such
as Bag of Words (BOW), CountVectorizer, and TF-IDF rely on the word count
in a sentence but do not save any syntactic or semantic information. In these
algorithms, the size of the vector is the number of elements in the vocabulary.
This typically yields a sparse matrix, since most of the elements are zero. Large input
vectors will mean a huge number of weights which will result in high
computation required for training. Word Embeddings give a solution to these
problems.
Let’s take an example to understand how word vector is generated by taking
emoticons which are most frequently used in certain conditions and transform
each emoji into a vector and the conditions will be our features.
1) Word2Vec:
CBOW (Continuous Bag of Words): In this model we try to fit the neighboring words in the window
to the central word, i.e., the surrounding context is used to predict the central word.
Skip-Gram: In this model, we try to use the central word to predict the neighboring words.
It is the complete opposite of the CBOW model. It has been shown that this method
produces more meaningful embeddings.
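As an illustration, both architectures can be trained with the gensim library (assuming gensim version 4 or later is installed; the toy corpus and parameter values below are illustrative):
from gensim.models import Word2Vec

# a tiny toy corpus: each sentence is a list of tokens
sentences = [
    ["the", "dog", "barked", "at", "the", "cat"],
    ["the", "cat", "chased", "the", "mouse"],
    ["dogs", "and", "cats", "are", "pets"],
]

# sg=1 selects the skip-gram architecture; sg=0 would select CBOW
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["cat"][:5])           # first few dimensions of the vector for "cat"
print(model.wv.most_similar("cat"))  # nearest neighbours in the embedding space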
2) GloVe:
This is another method for creating word embeddings. In this method, we take
the corpus and iterate through it and get the co-occurrence of each word with
other words in the corpus. We get a co-occurrence matrix through this. The
words which occur next to each other get a value of 1, if they are one word
apart then 1/2, if two words apart then 1/3 and so on.
Let us take an example to understand how the matrix is created. We have a
small corpus:
it 0
is 1+1 0
a 1/2+1 1+1/2 0
good 0 0 0 0 1 0
The upper half of the matrix will be a reflection of the lower half. We can
consider a window frame as well to calculate the co-occurrences by shifting
the frame till the end of the corpus. This helps gather information about the
context in which the word is used.
Initially, the vectors for each word are assigned randomly. Then we take pairs of
vectors and see how close they are to each other in space. If they occur
together more often or have a higher value in the co-occurrence matrix and are
far apart in space then they are brought close to each other. If they are close to
each other but are rarely or not frequently used together then they are moved
further apart in space.
After many iterations of the above process, we’ll get a vector space
representation that approximates the information from the co-occurrence
matrix. The performance of GloVe is better than Word2Vec in terms of both
semantic and syntactic capturing.
Pre-trained Word Embedding Models:
People generally use pre-trained models for word embeddings. A few of them
are:
SpaCy
fastText
Lexical Semantics
The purpose of semantic analysis is to draw exact meaning, or you can say
dictionary meaning from the text. The work of semantic analyzer is to check the
text for meaningfulness.
We already know that lexical analysis also deals with the meaning of the words; how, then, is
semantic analysis different from lexical analysis? Lexical analysis is based on smaller tokens,
while semantic analysis focuses on larger chunks. That is why semantic analysis can be divided
into the following two parts −
Studying meaning of individual word
It is the first part of the semantic analysis in which the study of the meaning of
individual words is performed. This part is called lexical semantics.
Studying the combination of individual words
In the field of natural language processing, there are a variety of tasks such as
automatic text classification, sentiment analysis, text summarization, etc. These
tasks are partially based on the pattern of the sentence and the meaning of the
words in different contexts. Two different words may be similar to some degree;
for example, the words ‘jog’ and ‘run’ are partially different and also partially
similar to each other. To perform specific NLP-based tasks, it is required to
understand the intuition of words in different positions and capture the similarity
between the words as well. Here WordNET comes into the picture, helping to solve
the linguistic problems of NLP models.
Structure of WordNET
A synset groups together the synonyms of a word; for example, the synonyms of 'benefit' are stored in an array of synsets along with a definition and an example of the word's usage. Such a synset is related to other synsets, for instance one in which the words benefit and profit have exactly the same meaning.
The reason for explaining these terms here is that in WordNET the most frequent relationships between synsets are based on these hyponym and hypernym relations. They are very useful for linking words such as (paper, piece of paper). Taking purple and violet as an example: in WordNET the category colour includes purple, which in turn includes violet. The root node of the hierarchy is the end point for every noun. If violet is a kind of purple and purple is a kind of colour, then violet is a kind of colour; this hyponymy relation between words is transitive.
Meronymy: WordNET also encodes the meronymy relation, which defines the part-whole relationship between synsets; for example, a bike has two wheels, a handle, and a petrol tank. These components are inherited by subordinate synsets: if a bike has two wheels, then a sports bike has wheels as well. In linguistics, this kind of relationship is also used for words that represent a characteristic of the noun. The parts are inherited in a downward direction, because all bikes and all types of bikes have two wheels, but not all kinds of automobiles have two wheels.
Most of the relations in WordNET hold within the same part of speech. On the basis of part of speech, WordNET can therefore be divided into four subnets, one each for nouns, verbs, adjectives, and adverbs. There are also some cross-PoS pointers available in the network, including morphosemantic links that connect words which share a stem and have related meanings. For example, in many pairs like (reader, read) the noun plays a semantic role with respect to the verb.
Implementation of WordNET
Importing libraries:
import nltk
from nltk.corpus import wordnet
nltk.download('wordnet')
Output:
synonyms = [ ]
antonyms = [ ]
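The lists are then filled by looping over the synsets of a word and collecting lemma names and their antonyms; a minimal sketch, using the word 'evil' discussed below:

for syn in wordnet.synsets("evil"):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())      # collect every synonym lemma
        for ant in lemma.antonyms():
            antonyms.append(ant.name())    # collect antonyms where they exist
print(set(synonyms))
print(set(antonyms))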
Output:
Here we can see the synonyms of the word 'evil', and in the network 'good' and 'goodness' appear as its antonyms.
word1 = wordnet.synset('man.n.01')        # first noun sense of 'man'
word2 = wordnet.synset('boy.n.01')        # first noun sense of 'boy'
print(word1.wup_similarity(word2) * 100)  # Wu-Palmer similarity as a percentage
Output:
Since we know that grown-up boys become men, when we ask for the measure of similarity between 'man' and 'boy' we get a result of around 66%, which is a reasonable estimate of their similarity.
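The hypernym, hyponym, and meronym relations discussed above can be queried in the same way; a small illustrative sketch, assuming the imports above:

purple = wordnet.synset('purple.n.01')
color = wordnet.synset('color.n.01')
bike = wordnet.synset('bicycle.n.01')

print(purple.hypernyms())    # more general synsets that include 'purple'
print(color.hyponyms()[:5])  # more specific kinds of colour
print(bike.part_meronyms())  # parts of a bicycle, e.g. wheel, pedal, handlebar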
3. What are word embeddings, and how are they related to distributional
semantics?
4. Explain the CBOW and Skip-Gram architectures in Word2Vec. When might
one be more suitable than the other?
5. What is lexical semantics, and why is it important in natural language
processing?
1. Extractive Summarization:
Extractive summarization involves selecting and extracting sentences or phrases directly from
the source text to create a summary. It identifies the most relevant and informative segments
of the original text and stitches them together to form a summary. Here's how extractive
summarization works:
Sentence Selection: The sentences with the highest relevance scores are
selected and arranged to create the summary. These selected sentences form
the extractive summary.
Output: The extractive summary consists of sentences directly taken from the
source text, arranged in a logical order.
Extractive summarization is relatively simpler to implement but may not always produce
coherent summaries, as sentences are taken out of context. It's effective when the source text
is well-structured and contains clear topic sentences.
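As an illustration (not part of the notes), here is a minimal frequency-based extractive summarizer; a real system would add stop-word removal and better sentence scoring:

import heapq
import re

def extractive_summary(text, n_sentences=2):
    # split into sentences on ., ! or ? followed by whitespace (a simple heuristic)
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    # score each word by its frequency in the whole document
    freq = {}
    for word in re.findall(r"[a-z']+", text.lower()):
        freq[word] = freq.get(word, 0) + 1
    # score each sentence by summing the frequencies of its words
    scores = {s: sum(freq.get(w, 0) for w in re.findall(r"[a-z']+", s.lower()))
              for s in sentences}
    # keep the highest-scoring sentences in their original order
    best = set(heapq.nlargest(n_sentences, scores, key=scores.get))
    return " ".join(s for s in sentences if s in best)

text = ("Text summarization shortens long documents. "
        "It keeps the most important information from the source. "
        "Extractive methods copy whole sentences from the source text.")
print(extractive_summary(text))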
2. Abstractive Summarization:
Text Understanding: The source text is processed to capture its main ideas,
entities, and relationships. This may involve techniques like natural language
understanding (NLU) and named entity recognition (NER).
Output: The abstractive summary is a new text that may contain paraphrased
content from the source text, expressed in a more concise and coherent form.
Abstractive summarization is more challenging but has the potential to produce summaries
that are more human-like and contextually accurate. It is particularly useful for summarizing
long and complex texts.
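A hedged sketch of abstractive summarization using a pre-trained sequence-to-sequence model from the Hugging Face transformers library (assuming the library is installed and a default model can be downloaded):

from transformers import pipeline

# loads a default pre-trained summarization model on first use
summarizer = pipeline("summarization")

text = ("Automatic text summarization distills the most important information "
        "from a source document to produce a shorter version for a particular "
        "user or task. Abstractive methods generate new sentences instead of "
        "copying them from the source.")

result = summarizer(text, max_length=40, min_length=10, do_sample=False)
print(result[0]["summary_text"])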
Search Engine Snippets: Generating brief descriptions (snippets) for search engine
results.
Summary:
A Summary is a text that is produced from one or more texts that contains
a significant portion of the information in the original text and that is no
longer than half of the original text.
Text summarization:
Text Summarization is the process of distilling the most important
information from a source to produce an abridged version for a particular
user or task. It is the process of generating a short, fluent, and, most importantly, accurate summary of a longer text document. The
main idea behind automatic text summarization is to be able to find a short
subset of the most essential information from the entire set and present it in
a human-readable format. As online textual data grows, automatic text
summarization methods have the potential to be very helpful because more
useful information can be read in a short time.
Text Classification:
Introduction:
Text classification, also known as text categorization, is a classical
problem in natural language processing (NLP), which aims to assign labels
or tags to textual units such as sentences, queries, paragraphs, and
documents. It has a wide range of applications, including question answering.
One classical classifier for text classification is Naive Bayes, which is based on Bayes' theorem:
P(A|B) = P(B|A) x P(A) / P(B)
Here,
A is the proposition;
B is the evidence.
This formula assumes that the predictors or features are independent, and the presence of one does not affect another. Hence, the classifier is called 'naïve.'
Problem Statement:
Consider the following training data, where each sentence is labelled as Sports or Not Sports:

Sentence                        Label
A great game                    Sports
The election was over           Not Sports
Very clean match                Sports
A clean but forgettable game    Sports
It was a close election         Not Sports

Here, you need to find which label the sentence 'A very close game' belongs to.
Applying Bayes' theorem:
P(Sports / a very close game) = P(a very close game / Sports) x P(Sports) / P(a very close game)
We will discard the divisor, since it is the same for both labels, and simply compare
P(a very close game / Sports) x P(Sports)
with
P(a very close game / Not Sports) x P(Not Sports)
But 'A very close game' does not appear anywhere in the training data, so this probability is zero.
Now comes the core 'naïve' part: we assume every word in a sentence is independent of the others, so we no longer look at entire sentences but at single words:
P(a very close game / Sports) = P(a / Sports) x P(very / Sports) x P(close / Sports) x P(game / Sports)
These individual words do appear many times in the training data, so we can compute their probabilities.
Computing Probability
The finishing step is to calculate the probabilities and look at which one is
larger.
First, we calculate the a priori probability of each label from the sentences in the given training data: the probability of a sentence being Sports, P(Sports), will be 3/5, and P(Not Sports) will be 2/5.
While calculating P (game/ Sports), we count the times the word “game”
appears in Sports text (here 2) divided by the words in sports (11).
P(game/Sports) = 2/11
However, the word 'close' never appears in any Sports sentence, so P(close / Sports) = 0. The end result of the product will therefore be 0, and the entire calculation will be nullified. This is not what we want, so we look for a way around it.
Laplace Smoothing
With Laplace smoothing we add 1 to every count so that it is never zero, and we add the number of possible words in the vocabulary (14) to the divisor. For example:
P(game / Sports) = (2 + 1) / (11 + 14) = 3/25
Final Outcome:
P(a / Sports) x P(very / Sports) x P(close / Sports) x P(game / Sports) x P(Sports)
= 2.76 x 10^-5 = 0.0000276
P(a / Not Sports) x P(very / Not Sports) x P(close / Not Sports) x P(game / Not Sports) x P(Not Sports)
= 0.572 x 10^-5 = 0.00000572
Hence, we have finally got our classifier: it gives 'A very close game' the label Sports, since that probability is higher, and we infer that the sentence belongs to the Sports category.
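The same worked example can be reproduced with a library implementation; a minimal sketch using scikit-learn (illustrative, assuming the five training sentences above):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["A great game", "The election was over", "Very clean match",
               "A clean but forgettable game", "It was a close election"]
train_labels = ["Sports", "Not Sports", "Sports", "Sports", "Not Sports"]

# bag-of-words counts, then multinomial Naive Bayes with Laplace smoothing (alpha=1)
# note: CountVectorizer's default tokenizer drops one-letter words such as "a",
# so the probabilities differ slightly from the hand calculation above
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_texts)
clf = MultinomialNB(alpha=1.0).fit(X, train_labels)

test = vectorizer.transform(["A very close game"])
print(clf.predict(test))        # ['Sports']
print(clf.predict_proba(test))  # posterior probability for each label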
We represent a text document as a bag of words, that is, an unordered set of words with their positions ignored, keeping only their frequency in the document. Instead of representing the word order in all the phrases like "I love this movie" and "I would recommend it", we simply note that the word I occurred 5 times in the entire excerpt, the word it 6 times, the words love, recommend, and movie once, and so on.
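For instance, a bag-of-words representation is just a word-count dictionary (a minimal, illustrative sketch):

from collections import Counter

excerpt = "I love this movie. I would recommend it. It is fun and I love it."
# lowercase, strip simple punctuation, and count; word order is thrown away
bag = Counter(word.strip(".,!").lower() for word in excerpt.split())
print(bag.most_common(3))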
Naive Bayes is a probabilistic classifier, meaning that for a document d, out of all classes c ∈ C the classifier returns the class cˆ which has the maximum posterior probability given the document:
cˆ = argmax c∈C P(c|d) ….(4.1)
We use the hat notation ˆ to mean "our estimate of the correct class".
This idea of Bayesian inference has been known since the work of Bayes (1763), and was first applied to text classification by Mosteller and Wallace (1964). The intuition of Bayesian classification is to use Bayes' rule to transform Eq. 4.1 into other probabilities that have some useful properties. Bayes' rule gives us a way to break down any conditional probability P(x|y) into three other probabilities:
P(x|y) = P(y|x) P(x) / P(y) ….(4.2)
Substituting into Eq. 4.1:
cˆ = argmax c∈C P(c|d) = argmax c∈C P(d|c) P(c) / P(d) ….(4.3)
We can conveniently simplify Eq. 4.3 by dropping the denominator P(d). This is possible because we will be computing P(d|c) P(c) / P(d) for each possible class. But P(d) doesn't change for each class; we are always asking about the most likely class for the same document d, which must have the same probability P(d). Thus, we can choose the class that maximizes this simpler formula:
cˆ = argmax c∈C P(d|c) P(c) ….(4.4)
We call Naive Bayes a generative model because we can read Eq. 4.4 as stating a kind of implicit assumption about how a document is generated: first a class is sampled from P(c), and then the words are generated by sampling from P(d|c). Returning to classification, we compute the most probable class cˆ given some document d by choosing the class which has the highest product of two probabilities: the prior probability of the class P(c) and the likelihood of the document P(d|c).
Without loss of generality, we can represent a document d as a set of features f1, f2,..., fn:
cˆ = argmax c∈C P(f1, f2,..., fn | c) P(c) ….(4.6)
Unfortunately, Eq. 4.6 is still too hard to compute directly: without some simplifying
assumptions, estimating the probability of every possible combination of features (for
example, every possible set of words and positions) would require huge numbers of
parameters and impossibly large training sets. Naive Bayes classifiers therefore make two
simplifying assumptions.
The first is the bag-of-words assumption discussed intuitively above: we assume position
doesn’t matter, and that the word “love” has the same effect on classification whether it
occurs as the 1st, 20th, or last word in the document. Thus we assume that the features f1,
f2,..., fn only encode word identity and not position.
The second is commonly called the naive Bayes assumption: this is the conditional independence assumption that the probabilities P(fi|c) are independent given the class c and hence can be 'naively' multiplied as follows:
P(f1, f2,..., fn | c) = P(f1|c) x P(f2|c) x ... x P(fn|c) ….(4.7)
The final equation for the class chosen by a naive Bayes classifier is thus:
cNB = argmax c∈C P(c) ∏f∈F P(f|c) ….(4.8)
To apply the naive Bayes classifier to text, we consider word positions, simply walking an index through every word position in the document:
cNB = argmax c∈C P(c) ∏i∈positions P(wi|c) ….(4.9)
Naive Bayes calculations, like calculations for language modeling, are done in log space to avoid underflow and to increase speed. Thus Eq. 4.9 is generally expressed instead as
cNB = argmax c∈C [ log P(c) + Σi∈positions log P(wi|c) ] ….(4.10)
By considering features in log space, Eq. 4.10 computes the predicted class as a linear function of the input features. Classifiers that use a linear combination of the inputs to make a classification decision, like naive Bayes and also logistic regression, are called linear classifiers.
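A tiny numerical illustration (not from the text) of why log space helps: multiplying many small probabilities underflows to zero in floating point, while summing their logs stays usable:

import math

probs = [1e-5] * 100                 # one hundred small word likelihoods

product = 1.0
for p in probs:
    product *= p
print(product)                       # 0.0 -- the product has underflowed

log_score = sum(math.log(p) for p in probs)
print(log_score)                     # about -1151.3, still fine for taking an argmax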
When the predictors take a continuous value, we assume that these values are sampled from a Gaussian distribution (Gaussian naive Bayes).
Naive Bayes is widely used for sentiment analysis, for example deciding whether review snippets like the following are positive (+) or negative (−):
+ ...zany characters and richly applied satire, and some great plot twists
− It was pathetic. The worst part about it was the boxing scenes...
+ ...awesome caramel sauce and sweet toasty almonds. I love this place!
While standard naive Bayes text classification can work well for sentiment
analysis, some small changes are generally employed that improve
performance. First, for sentiment classification and several other text
classification tasks, whether a word occurs or not seems to matter more
than its frequency. Thus, it often improves performance to clip the word
counts in each document at 1 (see the end of the chapter for pointers to
these results). This variant is called binary multinomial naive Bayes or binary naive Bayes. The variant uses the same algorithm as multinomial naive Bayes except that for each document we remove all duplicate words before concatenating them into the single big document during training, and we likewise remove duplicate words from test documents. A second useful addition deals with negation: a simple baseline is to prepend the prefix NOT_ to every word that follows a token of logical negation (n't, not, no, never) until the next punctuation mark. Newly formed 'words' like NOT_like and NOT_recommend will thus occur more often in negative documents and act as cues for negative sentiment, while words like NOT_bored and NOT_dismiss will acquire positive
associations. We will return in Chapter 20 to the use of parsing to deal
more accurately with the scope relationship between these negation words
and the predicates they modify, but this simple baseline works quite well
in practice.
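A minimal sketch of this negation baseline (an illustration, not the text's exact code), prepending NOT_ to every word after a negation token until the next punctuation mark:

import re

NEGATIONS = {"not", "no", "never"}

def mark_negation(text):
    tokens = re.findall(r"[\w']+|[.,!?;]", text.lower())
    out, negating = [], False
    for tok in tokens:
        if tok in ".,!?;":
            negating = False             # punctuation closes the negation scope
            out.append(tok)
        elif negating:
            out.append("NOT_" + tok)     # words inside the scope get the prefix
        else:
            out.append(tok)
            if tok in NEGATIONS or tok.endswith("n't"):
                negating = True          # start marking after a negation token
    return out

print(mark_negation("I didn't like this movie, but the soundtrack was great."))
# ['i', "didn't", 'NOT_like', 'NOT_this', 'NOT_movie', ',', 'but', 'the', ...]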
Sentiment can also be obtained with an off-the-shelf library such as TextBlob:

from textblob import TextBlob

text = "The movie was surprisingly good and I really enjoyed it."  # illustrative example text
sentiment = TextBlob(text).sentiment.polarity                      # polarity score in [-1, 1]

if sentiment > 0:
    print("Positive sentiment")
elif sentiment < 0:
    print("Negative sentiment")
else:
    print("Neutral sentiment")
TextBlob's sentiment analysis is based on a simple rule-based approach.
Interview Questions:
1. What is text summarization and why is it important in NLP?
3.