like #nlproc. Some languages, like Japanese, don’t have spaces between words,
so word tokenization becomes more difficult. And as we’ll see, for large language
models we’ll use tokens that range greatly in size, from letters to subwords (parts of
words) to words and even sometimes short phrases.
Another part of text normalization is lemmatization, the task of determining
that two words have the same root, despite their surface differences. For example,
the words sang, sung, and sings are forms of the verb sing. The word sing is the
common lemma of these words, and a lemmatizer maps from all of these to sing.
Lemmatization is essential for processing morphologically complex languages like
Arabic. Stemming refers to a simpler version of lemmatization in which we mainly
just strip suffixes from the end of the word. Text normalization also includes
sentence segmentation: breaking up a text into individual sentences, using cues like
periods or exclamation points.
Finally, we’ll need to compare words and other strings. We’ll introduce a metric
called edit distance that measures how similar two strings are based on the number
of edits (insertions, deletions, substitutions) it takes to change one string into the
other. Edit distance has applications throughout language process-
ing, from spelling correction to speech recognition to coreference resolution.
Regular expressions are case sensitive; lower case /s/ is distinct from upper
case /S/ (/s/ matches a lower case s but not an upper case S). This means that
the pattern /woodchucks/ will not match the string Woodchucks. We can solve this
problem with the use of the square braces [ and ]. The string of characters inside the
braces specifies a disjunction of characters to match. For example, Fig. 2.2 shows
that the pattern /[wW]/ matches patterns containing either w or W.
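For concreteness, here is a quick way to try this out in Python's re module (the slashes in the text are just notation for marking off a regex, not part of the pattern; the example string is ours):

import re

# /woodchucks/ is case sensitive, so it misses the capitalized form
print(re.search(r'woodchucks', 'Woodchucks are common here'))     # None

# /[wW]oodchucks/ uses a character class to accept either case of the first letter
print(re.search(r'[wW]oodchucks', 'Woodchucks are common here'))  # matches 'Woodchucks'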
The regular expression /[1234567890]/ specifies any single digit. While such
classes of characters as digits or letters are important building blocks in expressions,
they can get awkward (e.g., it’s inconvenient to specify
/[ABCDEFGHIJKLMNOPQRSTUVWXYZ]/ (2.1)
to mean “any capital letter”). In cases where there is a well-defined sequence asso-
ciated with a set of characters, the brackets can be used with the dash (-) to specify
any one character in a range. The pattern /[2-5]/ specifies any one of the charac-
ters 2, 3, 4, or 5. The pattern /[b-g]/ specifies one of the characters b, c, d, e, f, or
g. Some other examples are shown in Fig. 2.3.
The square braces can also be used to specify what a single character cannot be,
by use of the caret ^. If the caret ^ is the first symbol after the open square brace [,
the resulting pattern is negated. For example, the pattern /[^a]/ matches any single
character (including special characters) except a. This is only true when the caret
is the first symbol after the open square brace. If it occurs anywhere else, it usually
stands for a caret; Fig. 2.4 shows some examples.
How can we talk about optional elements, like an optional s in woodchuck and
woodchucks? We can’t use the square brackets, because while they allow us to say
“s or S”, they don’t allow us to say “s or nothing”. For this we use the question mark
/?/, which means “the preceding character or nothing”, as shown in Fig. 2.5.
We can think of the question mark as meaning “zero or one instances of the
previous character”. That is, it's a way of specifying how many of something we
want. The Kleene star * similarly means “zero or more occurrences of the immediately
previous character or regular expression”, and the Kleene plus + means “one or more
occurrences of the immediately previous character or regular expression”. Another
special character is the period (/./), the wildcard expression that matches any single
character.
The wildcard is often used together with the Kleene star to mean “any string of
characters”. For example, suppose we want to find any line in which a particular
word, for example, aardvark, appears twice. We can specify this with the regular
expression /aardvark.*aardvark/.
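A minimal Python sketch of the same pattern (the example lines are invented):

import re

lines = ["the aardvark dug a burrow",
         "one aardvark greeted another aardvark at dusk"]
# /aardvark.*aardvark/: "aardvark", then any string of characters, then "aardvark" again
pattern = re.compile(r'aardvark.*aardvark')
for line in lines:
    if pattern.search(line):
        print(line)          # prints only the second line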
Anchors are special characters that anchor regular expressions to particular places
in a string. The most common anchors are the caret ^ and the dollar sign $. The caret
^ matches the start of a line. The pattern /^The/ matches the word The only at the
start of a line. Thus, the caret ^ has three uses: to match the start of a line, to in-
dicate a negation inside of square brackets, and just to mean a caret. (What are the
contexts that allow grep or Python to know which function a given caret is supposed
to have?) The dollar sign $ matches the end of a line. So the pattern /␣$/ (where ␣
stands for a space) is a useful
pattern for matching a space at the end of a line, and /^The dog\.$/ matches a
line that contains only the phrase The dog. (We have to use the backslash here since
we want the . to mean “period” and not the wildcard.)
Regex Match
^ start of line
$ end of line
\b word boundary
\B non-word boundary
Figure 2.7 Anchors in regular expressions.
There are also two other anchors: \b matches a word boundary, and \B matches
a non word-boundary. Thus, /\bthe\b/ matches the word the but not the word
other. A “word” for the purposes of a regular expression is defined based on the
definition of words in programming languages as a sequence of digits, underscores,
or letters. Thus /\b99\b/ will match the string 99 in There are 99 bottles of beer on
the wall (because 99 follows a space) but not 99 in There are 299 bottles of beer on
the wall (since 99 follows a number). But it will match 99 in $99 (since 99 follows
a dollar sign ($), which is not a digit, underscore, or letter).
Suppose we want to find all instances of the English article the in a text. A simple
pattern is /the/, but one problem is that this pattern will miss the word when it begins
a sentence and hence is capitalized (i.e., The). This might lead us to the following pattern:
/[tT]he/ (2.3)
But we will still overgeneralize, incorrectly returning texts with the embedded in other
words (e.g., other or there). So we need to specify that we want instances with a
word boundary on both sides:
/\b[tT]he\b/ (2.4)
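To make the difference concrete, here is a small check in Python (the example sentence is ours):

import re

text = "The other day the theater near them showed The Birds."
# /[tT]he/ also fires inside "other", "theater", and "them": false positives
print(re.findall(r'[tT]he', text))
# ['The', 'the', 'the', 'the', 'the', 'The']

# /\b[tT]he\b/ requires a word boundary on both sides
print(re.findall(r'\b[tT]he\b', text))
# ['The', 'the', 'The']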
The simple process we just went through was based on fixing two kinds of errors:
false positives, strings that we incorrectly matched like other or there, and false
negatives, strings that we incorrectly missed, like The. Addressing these two kinds
of errors comes up again and again in language processing. Reducing the overall
error rate for an application thus involves two antagonistic efforts:
• Increasing precision (minimizing false positives)
• Increasing recall (minimizing false negatives)
We’ll come back to precision and recall with more precise definitions in Chapter 4.
Regex Match
* zero or more occurrences of the previous char or expression
+ one or more occurrences of the previous char or expression
? zero or one occurrence of the previous char or expression
{n} exactly n occurrences of the previous char or expression
{n,m} from n to m occurrences of the previous char or expression
{n,} at least n occurrences of the previous char or expression
{,m} up to m occurrences of the previous char or expression
Figure 2.9 Regular expression operators for counting.
Finally, certain special characters are referred to by special notation based on the
backslash (\) (see Fig. 2.10). The most common of these are the newline character
\n and the tab character \t. To refer to characters that are special themselves (like
., *, [, and \), precede them with a backslash, (i.e., /\./, /\*/, /\[/, and /\\/).
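As a quick check of the counting operators in Fig. 2.9 and the backslash escapes, here is a short Python sketch (the strings are invented for illustration):

import re

print(re.findall(r'o+h!?', 'oh ooh oooh!'))            # ['oh', 'ooh', 'oooh!']
print(bool(re.fullmatch(r'[0-9]{3,4}', '2025')))       # True: three to four digits

# Escaping characters that are otherwise special: a literal period, a literal asterisk
print(re.findall(r'\d\.\d', 'pi is 3.1, not 3*1'))     # ['3.1']
print(re.findall(r'\*', 'a*b'))                        # ['*']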
Suppose we need to be able to look for expressions like 6 GHz or 500 GB or $999.99,
for example to help a user shop for a computer online. Let's work out some
regular expressions for this task.
First, let's build a regular expression for prices. Here's a regular expres-
sion for a dollar sign followed by a string of digits:
/$[0-9]+/ (2.5)
Note that the $ character has a different function here than the end-of-line function
we discussed earlier. Most regular expression parsers are smart enough to realize
that $ here doesn’t mean end-of-line. (As a thought experiment, think about how
regex parsers might figure out the function of $ from the context.)
Now we just need to deal with fractions of dollars. We’ll add a decimal point
and two digits afterwards:
/$[0-9]+\.[0-9][0-9]/ (2.6)
This pattern only allows $199.99 but not $199. We need to make the cents optional
and to make sure we’re at a word boundary:
/(^|\W)$[0-9]+(\.[0-9][0-9])?\b/ (2.7)
One last catch! This pattern allows prices like $199999.99 which would be far too
expensive! We need to limit the dollars:
/(^|\W)$[0-9]{0,3}(\.[0-9][0-9])?\b/ (2.8)
Further fixes (like avoiding matching a dollar sign with no price after it) are left as
an exercise for the reader.
How about disk space? We’ll need to allow for optional fractions again (5.5 GB);
note the use of ? for making the final s optional, and the use of / */ to mean “zero
or more spaces” since there might always be extra spaces lying around:
/\b[0-9]+(\.[0-9]+)? *(GB|[Gg]igabytes?)\b/ (2.9)
Modifying this regular expression so that it only matches more than 500 GB is left
as an exercise for the reader.
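Here is one way we might try these last two patterns out in Python. Note one wrinkle: unlike the hypothetical "smart" parser discussed above, Python's re module always treats an unescaped $ as the end-of-line anchor, so the dollar sign has to be written \$. We've also folded in one of the "further fixes", requiring at least one digit after the dollar sign:

import re

# A Python rendering of the price pattern of Eq. 2.8 (with at least one digit required)
price = re.compile(r'(^|\W)\$[0-9]{1,3}(\.[0-9][0-9])?\b')
# and of the disk-space pattern of Eq. 2.9
disk = re.compile(r'\b[0-9]+(\.[0-9]+)? *(GB|[Gg]igabytes?)\b')

print(bool(price.search('It costs $199.99 right now')))   # True
print(bool(price.search('It costs $199999.99')))          # False: too many dollars
print(bool(disk.search('comes with 5.5 GB of storage')))  # True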
An important use of regular expressions is in substitutions. The substitution operator
s/regexp/pattern/ (used in Python and in Unix commands like vim or sed) replaces a
string matching a regular expression with another string:
s/colour/color/ (2.10)
Putting parentheses ( and ) around part of the first pattern lets the replacement refer
back to the matched text with the number operator \1; for example, the following
substitution puts angle brackets around all the integers in a text:
s/([0-9]+)/<\1>/ (2.11)
The parenthesis and number operators can also specify that a certain string or ex-
pression must occur twice in the text. For example, suppose we are looking for the
pattern “the Xer they were, the Xer they will be”, where we want to constrain the two
X’s to be the same string. We do this by surrounding the first X with the parenthesis
operator, and replacing the second X with the number operator \1, as follows:
/the (.*)er they were, the \1er they will be/ (2.12)
Here the \1 will be replaced by whatever string matched the first item in parentheses.
So this will match the bigger they were, the bigger they will be but not the bigger
they were, the faster they will be.
This use of parentheses to store a pattern in memory is called a capture group.
Every time a capture group is used (i.e., parentheses surround a pattern), the re-
sulting match is stored in a numbered register. If you match two different sets of
parentheses, \2 means whatever matched the second capture group. Thus
/the (.*)er they (.*), the \1er we \2/ (2.13)
will match the faster they ran, the faster we ran but not the faster they ran, the faster
we ate. Similarly, the third capture group is stored in \3, the fourth is \4, and so on.
Parentheses thus have a double function in regular expressions; they are used
to group terms for specifying the order in which operators should apply, and they
are used to capture something in a register. Occasionally we might want to use
parentheses for grouping, but don’t want to capture the resulting pattern in a register.
In that case we use a non-capturing group, which is specified by putting the special
commands ?: after the open parenthesis, in the form (?: pattern ).
/(?:some|a few) (people|cats) like some \1/ (2.14)
will match some cats like some cats but not some cats like some some.
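Both behaviors are easy to check in Python, which uses the same \1 and (?: notation (the test strings here are ours):

import re

# \1 and \2 refer back to whatever the first and second capture groups matched
m = re.search(r'the (.*)er they (.*), the \1er we \2',
              'the faster they ran, the faster we ran')
print(m.group(1), m.group(2))          # fast ran

# (?:...) groups without capturing, so \1 is still the (people|cats) group
print(re.search(r'(?:some|a few) (people|cats) like some \1',
                'some cats like some cats').group(1))        # cats
print(re.search(r'(?:some|a few) (people|cats) like some \1',
                'some cats like some people'))               # None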
Substitutions and capture groups are very useful in implementing simple chat-
bots like ELIZA (Weizenbaum, 1966). Recall that ELIZA simulates a Rogerian
psychologist by carrying on conversations like the following:
User1 : Men are all alike.
ELIZA1 : IN WHAT WAY
User2 : They’re always bugging us about something or other.
ELIZA2 : CAN YOU THINK OF A SPECIFIC EXAMPLE
User3 : Well, my boyfriend made me come here.
ELIZA3 : YOUR BOYFRIEND MADE YOU COME HERE
User4 : He says I’m depressed much of the time.
ELIZA4 : I AM SORRY TO HEAR YOU ARE DEPRESSED
Since multiple substitutions can apply to a given input, substitutions are assigned
a rank and applied in order. Creating patterns is the topic of Exercise 2.3, and we
return to the details of the ELIZA architecture in Chapter 15.
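As a sketch of the idea (these few toy rules are our own invention, not Weizenbaum's original script), ranked substitutions can be tried in order until one matches:

import re

# A few ranked ELIZA-style rules; the first (highest-ranked) matching rule wins.
rules = [
    (r".*\bI['’]?m (depressed|sad)\b.*", r'I AM SORRY TO HEAR YOU ARE \1'),
    (r'.*\ball alike\b.*',               r'IN WHAT WAY'),
    (r'.*\balways\b.*',                  r'CAN YOU THINK OF A SPECIFIC EXAMPLE'),
    (r'.+',                              r'PLEASE GO ON'),   # lowest-ranked catch-all
]

def eliza_respond(utterance):
    for pattern, response in rules:
        match = re.match(pattern, utterance, flags=re.IGNORECASE)
        if match:
            return match.expand(response).upper()   # fills in \1 etc. from the match
    return 'PLEASE GO ON'

print(eliza_respond("He says I'm depressed much of the time."))
# I AM SORRY TO HEAR YOU ARE DEPRESSED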
Finally, regular expressions provide a negative lookahead assertion, (?!pattern),
which succeeds only if pattern does not match at the current point, without consuming
any characters. For example, the following pattern matches, at the beginning of a line,
any single alphabetic word that doesn't start with Volcano:
/^(?!Volcano)[A-Za-z]+/ (2.15)
2.2 Words
Before we talk about processing words, we need to decide what counts as a word.
Let's start by looking at one particular corpus (plural corpora), a computer-readable
collection of text or speech. For example the Brown corpus is a million-word col-
lection of samples from 500 written English texts from different genres (newspa-
per, fiction, non-fiction, academic, etc.), assembled at Brown University in 1963–64
(Kučera and Francis, 1967). How many words are in the following Brown sentence?
He stepped out into the hall, was delighted to encounter
a water brother.
This sentence has 13 words if we don’t count punctuation marks as words, 15
if we count punctuation. Whether we treat period (“.”), comma (“,”), and so on as
words depends on the task. Punctuation is critical for finding boundaries of things
(commas, periods, colons) and for identifying some aspects of meaning (question
marks, exclamation marks, quotation marks). For some tasks, like part-of-speech
tagging or parsing or speech synthesis, we sometimes treat punctuation marks as if
they were separate words.
The Switchboard corpus of American English telephone conversations between
strangers was collected in the early 1990s; it contains 2430 conversations averaging
6 minutes each, totaling 240 hours of speech and about 3 million words (Godfrey
et al., 1992). Such corpora of spoken language introduce other complications with
regard to defining words. Let’s look at one utterance from Switchboard; an utter-
ance is the spoken correlate of a sentence:
I do uh main- mainly business data processing
This utterance has two kinds of disfluencies. The broken-off word main- is
called a fragment. Words like uh and um are called fillers or filled pauses. Should
we consider these to be words? Again, it depends on the application. If we are
building a speech transcription system, we might want to eventually strip out the
disfluencies.
How many words are there in English? When we speak about the number of
words in the language, we are generally referring to word types. Fig. 2.11 shows
the rough numbers of types and instances computed from some English corpora.
The larger the corpora we look at, the more word types we find, and in fact this
relationship between the number of types |V| and number of instances N is called
Herdan's Law (Herdan, 1960) or Heaps' Law (Heaps, 1978) after its discoverers
(in linguistics and information retrieval respectively). It is shown in Eq. 2.16, where
k and β are positive constants, and 0 < β < 1.
|V| = kN^β (2.16)
The value of β depends on the corpus size and the genre, but at least for the large
corpora in Fig. 2.11, β ranges from .67 to .75. Roughly then we can say that the
vocabulary size for a text goes up significantly faster than the square root of its
length in words.

1 In earlier tradition, and occasionally still, you might see word instances referred to as word tokens, but
we now try to reserve the word token instead to mean the output of subword tokenization algorithms.
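As a rough numerical illustration of Eq. 2.16 (the constants here are made up; real values of k and β have to be fit to a particular corpus and genre):

# Herdan's/Heaps' Law: |V| = k * N**beta, with invented constants k=10, beta=0.70
k, beta = 10, 0.70
for N in (10_000, 1_000_000, 100_000_000):
    print(f'N = {N:>11,}   predicted |V| = {int(k * N ** beta):,}')
# Vocabulary grows faster than sqrt(N) (since beta > 0.5) but much more slowly than N.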
It’s sometimes useful to make a further distinction. Consider inflected forms like
cats versus cat. We say these two words are different wordforms but have the same
lemma. A lemma is a set of lexical forms having the same stem, and usually the
same major part-of-speech. The wordform is the full inflected or derived form of
the word. The two wordforms cat and cats thus have the same lemma, which we can
represent as cat.
For morphologically complex languages like Arabic, we often need to deal with
lemmatization. For most tasks in English, however, wordforms are sufficient, and
when we talk about words in this book we almost always mean wordforms (although
we will discuss basic algorithms for lemmatization and the related task of stemming
below in Section 2.6). One of the situations even in English where we talk about
lemmas is when we measure the number of words in a dictionary. Dictionary en-
tries or boldface forms are a very rough approximation to (an upper bound on) the
number of lemmas (since some lemmas have multiple boldface forms). The 1989
edition of the Oxford English Dictionary had 615,000 entries.
Finally, we should note that in practice, for many NLP applications (for example
for neural language modeling) we don’t actually use words as our internal unit of
representation at all! We instead tokenize the input strings into tokens, which can
be words but can also be only parts of words. We’ll return to this tokenization
question when we introduce the BPE algorithm in Section 2.5.2.
2.3 Corpora
Words don’t appear out of nowhere. Any particular piece of text that we study
is produced by one or more specific speakers or writers, in a specific dialect of a
specific language, at a specific time, in a specific place, for a specific function.
Perhaps the most important dimension of variation is the language. NLP algo-
rithms are most useful when they apply across many languages. The world has 7097
languages at the time of this writing, according to the online Ethnologue catalog
(Simons and Fennig, 2018). It is important to test algorithms on more than one lan-
guage, and particularly on languages with different properties; by contrast there is
an unfortunate current tendency for NLP algorithms to be developed or tested just
on English (Bender, 2019). Even when algorithms are developed beyond English,
they tend to be developed for the official languages of large industrialized nations
(Chinese, Spanish, Japanese, German etc.), but we don’t want to limit tools to just
these few languages. Furthermore, most languages also have multiple varieties, of-
ten spoken in different regions or by different social groups. Thus, for example,
if we're processing text that uses features of African American English (AAE) or
African American Vernacular English (AAVE)—the variations of English used by
millions of people in African American communities (King 2020)—we must use
NLP tools that function with features of those varieties. Twitter posts might use fea-
tures often used by speakers of African American English, such as constructions like
iont (I don't in Mainstream American English (MAE)), or talmbout corresponding
to MAE talking about, both examples that influence word segmentation (Blodgett
et al. 2016, Jones 2015).
It’s also quite common for speakers or writers to use multiple languages in a
single communicative act, a phenomenon called code switching. Code switching
is enormously common across the world; here are examples showing Spanish and
(transliterated) Hindi code switching with English (Solorio et al. 2014, Jurgens et al.
2017):
(2.17) Por primera vez veo a @username actually being hateful! it was beautiful:)
[For the first time I get to see @username actually being hateful! it was
beautiful:) ]
(2.18) dost tha or rahega ... dont wory ... but dherya rakhe
[“he was and will remain a friend ... don’t worry ... but have faith”]
Another dimension of variation is the genre. The text that our algorithms must
process might come from newswire, fiction or non-fiction books, scientific articles,
Wikipedia, or religious texts. It might come from spoken genres like telephone
conversations, business meetings, police body-worn cameras, medical interviews,
or transcripts of television shows or movies. It might come from work situations
like doctors’ notes, legal text, or parliamentary or congressional proceedings.
Text also reflects the demographic characteristics of the writer (or speaker): their
age, gender, race, socioeconomic class can all influence the linguistic properties of
the text we are processing.
And finally, time matters too. Language changes over time, and for some lan-
guages we have good corpora of texts from different historical periods.
Because language is so situated, when developing computational models for lan-
guage processing from a corpus, it’s important to consider who produced the lan-
guage, in what context, for what purpose. How can a user of a dataset know all these
details? The best way is for the corpus creator to build a datasheet (Gebru et al.,
2020) or data statement (Bender et al., 2021) for each corpus. A datasheet specifies
properties of a dataset like:
Motivation: Why was the corpus collected, by whom, and who funded it?
Situation: When and in what situation was the text written/spoken? For example,
was there a task? Was the language originally spoken conversation, edited
text, social media communication, monologue vs. dialogue?
Language variety: What language (including dialect/region) was the corpus in?
Speaker demographics: What was, e.g., the age or gender of the text’s authors?
Collection process: How big is the data? If it is a subsample how was it sampled?
Was the data collected with consent? How was the data pre-processed, and
what metadata is available?
Annotation process: What are the annotations, what are the demographics of the
annotators, how were they trained, how was the data annotated?
Distribution: Are there copyright or other intellectual property restrictions?
2.4 Simple Unix Tools for Word Tokenization

Before almost any other language processing of a text, the text has to be normalized:
tokenized into words, normalized in word format, and segmented into sentences. In
the next sections we walk through each of these tasks, but we'll first start with
an easy, if somewhat naive version of word tokenization and normalization (and fre-
quency computation) that can be accomplished for English solely in a single Unix
command-line, inspired by Church (1994). We’ll make use of some Unix com-
mands: tr, used to systematically change particular characters in the input; sort,
which sorts input lines in alphabetical order; and uniq, which collapses and counts
adjacent identical lines.
For example let’s begin with the ‘complete words’ of Shakespeare in one file,
sh.txt. We can use tr to tokenize the words by changing every sequence of non-
alphabetic characters to a newline (’A-Za-z’ means alphabetic and the -c option
complements to non-alphabet, so together they mean to change every non-alphabetic
character into a newline. The -s (‘squeeze’) option is used to replace the result
of multiple consecutive changes into a single output, so a series of non-alphabetic
characters in a row would all be ‘squeezed’ into a single newline):
tr -sc 'A-Za-z' '\n' < sh.txt
The output of this command will be:
THE
SONNETS
by
William
Shakespeare
From
fairest
creatures
We
...
Now that there is one word per line, we can sort the lines, and pass them to uniq
-c which will collapse and count them:
tr -sc 'A-Za-z' '\n' < sh.txt | sort | uniq -c
with the following output:
1945 A
72 AARON
19 ABBESS
25 Aaron
6 Abate
1 Abates
5 Abbess
6 Abbey
3 Abbot
...
Alternatively, we can collapse all the upper case to lower case:
tr -sc 'A-Za-z' '\n' < sh.txt | tr A-Z a-z | sort | uniq -c
whose output is
14725 a
97 aaron
1 abaissiez
10 abandon
2 abandoned
2 abase
1 abash
14 abate
3 abated
3 abatement
...
Now we can sort again to find the frequent words. The -n option to sort means
to sort numerically rather than alphabetically, and the -r option means to sort in
reverse order (highest-to-lowest):
tr -sc 'A-Za-z' '\n' < sh.txt | tr A-Z a-z | sort | uniq -c | sort -n -r
The results show that the most frequent words in Shakespeare, as in any other
corpus, are the short function words like articles, pronouns, prepositions:
27378 the
26084 and
22538 i
19771 to
17481 of
14725 a
13826 you
12489 my
11318 that
11112 in
...
Unix tools of this sort can be very handy in building quick word count statistics
for any corpus in English. While in some versions of Unix these command-line tools
also correctly handle Unicode characters and so can be used for many languages,
in general for handling most languages outside English we use more sophisticated
tokenization algorithms.
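For reference, here is a rough Python equivalent of the whole pipeline (one of many possible ways to write it; we assume the same sh.txt file):

import re
from collections import Counter

# Lowercase, split on non-alphabetic characters, and count, like
# tr -sc 'A-Za-z' '\n' < sh.txt | tr A-Z a-z | sort | uniq -c | sort -n -r
with open('sh.txt', encoding='utf-8') as f:
    words = re.findall(r'[a-z]+', f.read().lower())

for word, count in Counter(words).most_common(10):
    print(f'{count:7d} {word}')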
2.5 Word and Subword Tokenization

While the simple Unix pipeline above just threw away all the numbers and punctuation,
for most NLP applications we'll need to keep these in our tokenization. We often
want to break off punctuation as a separate token; commas are a useful piece of infor-
mation for parsers, and periods help indicate sentence boundaries. But we’ll often
want to keep the punctuation that occurs word internally, in examples like m.p.h.,
Ph.D., AT&T, and cap’n. Special characters and numbers will need to be kept in
prices ($45.55) and dates (01/02/06); we don’t want to segment that price into sepa-
rate tokens of “45” and “55”. And there are URLs (https://www.stanford.edu),
Twitter hashtags (#nlproc), or email addresses ([email protected]).
Number expressions introduce complications; in addition to appearing at word
boundaries, commas appear inside numbers in English, every three digits: 555,500.50.
Tokenization differs by language; languages like Spanish, French, and German, for
example, use a comma to mark the decimal point, and spaces (or sometimes periods)
where English puts commas, for example, 555 500,50.
A tokenizer can also be used to expand clitic contractions that are marked by
apostrophes, converting what’re to the two tokens what are, and we’re to we
are. A clitic is a part of a word that can’t stand on its own, and can only occur
when it is attached to another word. Such contractions occur in other alphabetic
languages, including French pronouns (j'ai) and articles (l'homme).
Depending on the application, tokenization algorithms may also tokenize mul-
tiword expressions like New York or rock ’n’ roll as a single token, which re-
quires a multiword expression dictionary of some sort. Tokenization is thus inti-
mately tied up with named entity recognition, the task of detecting names, dates,
and organizations (Chapter 17).
One commonly used tokenization standard is known as the Penn Treebank to-
kenization standard, used for the parsed corpora (treebanks) released by the Lin-
guistic Data Consortium (LDC), the source of many useful datasets. This standard
separates out clitics (doesn't becomes does plus n't), keeps hyphenated words to-
gether, and separates out all punctuation (to save space we're showing visible spaces
' ' between tokens, although newlines are a more common output):
Input: "The San Francisco-based restaurant," they said,
"doesn’t charge $10".
Output: " The San Francisco-based restaurant , " they said ,
" does n’t charge $ 10 " .
One issue that arises for any word-level tokenization scheme is the problem of unknown
words. If our training corpus contains, say, the words low, new, and newer, but not
lower, then if the word lower appears in our test corpus, our system will not know
what to do with it.
To deal with this unknown word problem, modern tokenizers automatically in-
duce sets of tokens that include tokens smaller than words, called subwords. Sub-
words can be arbitrary substrings, or they can be meaning-bearing units like the
morphemes -est or -er. (A morpheme is the smallest meaning-bearing unit of a lan-
guage; for example the word unwashable has the morphemes un-, wash, and -able.)
In modern tokenization schemes, most tokens are words, but some tokens are fre-
quently occurring morphemes or other subwords like -er. Every unseen word like
lower can thus be represented by some sequence of known subword units, such as
low and er, or even as a sequence of individual letters if necessary.
Most tokenization schemes have two parts: a token learner, and a token seg-
menter. The token learner takes a raw training corpus (sometimes roughly pre-
separated into words, for example by whitespace) and induces a vocabulary, a set
of tokens. The token segmenter takes a raw test sentence and segments it into the
tokens in the vocabulary. Two algorithms are widely used: byte-pair encoding
(Sennrich et al., 2016) and unigram language modeling (Kudo, 2018). There is
also a SentencePiece library that includes implementations of both of these (Kudo
and Richardson, 2018), and people often use the name SentencePiece to simply
mean unigram language modeling tokenization.
In this section we introduce the simplest of these, the byte-pair encoding or
BPE algorithm (Sennrich et al., 2016); see Fig. 2.13. The BPE token learner begins
with a vocabulary that is just the set of all individual characters. It then examines the
training corpus, chooses the two symbols that are most frequently adjacent (say ‘A’,
‘B’), adds a new merged symbol ‘AB’ to the vocabulary, and replaces every adjacent
’A’ ’B’ in the corpus with the new ‘AB’. It continues to count and merge, creating
new longer and longer character strings, until k merges have been done creating
k novel tokens; k is thus a parameter of the algorithm. The resulting vocabulary
consists of the original set of characters plus k new symbols.
The algorithm is usually run inside words (not merging across word boundaries),
so the input corpus is first white-space-separated to give a set of strings, each corre-
sponding to the characters of a word, plus a special end-of-word symbol _, and its
counts. Let’s see its operation on the following tiny input corpus of 18 word tokens
with counts for each word (the word low appears 5 times, the word newer 6 times,
and so on), which would have a starting vocabulary of 11 letters:
corpus                   vocabulary
5   l o w _              _, d, e, i, l, n, o, r, s, t, w
2   l o w e s t _
6   n e w e r _
3   w i d e r _
2   n e w _
The BPE algorithm first counts all pairs of adjacent symbols: the most frequent
is the pair e r because it occurs in newer (frequency of 6) and wider (frequency of
3) for a total of 9 occurrences.2 We then merge these symbols, treating er as one
symbol, and count again:
2 Note that there can be ties; we could have instead chosen to merge r _ first, since that pair also has a
frequency of 9.
corpus                   vocabulary
5   l o w _              _, d, e, i, l, n, o, r, s, t, w, er
2   l o w e s t _
6   n e w er _
3   w i d er _
2   n e w _
Now the most frequent pair is er _, which we merge; our system has learned
that there should be a token for word-final er, represented as er_:
corpus                   vocabulary
5   l o w _              _, d, e, i, l, n, o, r, s, t, w, er, er_
2   l o w e s t _
6   n e w er_
3   w i d er_
2   n e w _
Next n e (total count of 8) get merged to ne:
corpus                   vocabulary
5   l o w _              _, d, e, i, l, n, o, r, s, t, w, er, er_, ne
2   l o w e s t _
6   ne w er_
3   w i d er_
2   ne w _
If we continue, the next merges are:
merge         current vocabulary
(ne, w)       _, d, e, i, l, n, o, r, s, t, w, er, er_, ne, new
(l, o)        _, d, e, i, l, n, o, r, s, t, w, er, er_, ne, new, lo
(lo, w)       _, d, e, i, l, n, o, r, s, t, w, er, er_, ne, new, lo, low
(new, er_)    _, d, e, i, l, n, o, r, s, t, w, er, er_, ne, new, lo, low, newer_
(low, _)      _, d, e, i, l, n, o, r, s, t, w, er, er_, ne, new, lo, low, newer_, low_
Figure 2.13 The token learner part of the BPE algorithm for taking a corpus broken up
into individual characters or bytes, and learning a vocabulary by iteratively merging tokens.
Figure adapted from Bostrom and Durrett (2020).
Once we’ve learned our vocabulary, the token segmenter is used to tokenize a
test sentence. The token segmenter just runs on the merges we have learned from
the training data on the test data. It runs them greedily, in the order we learned them.
(Thus the frequencies in the test data don’t play a role, just the frequencies in the
training data). So first we segment each test sentence word into characters. Then
we apply the first rule: replace every instance of e r in the test corpus with er, and
then the second rule: replace every instance of er _ in the test corpus with er_,
and so on. By the end, if the test corpus contained the character sequence n e w e
r _, it would be tokenized as a full word. But the characters of a new (unknown)
word like l o w e r _ would be merged into the two tokens low er_.
Of course in real settings BPE is run with many thousands of merges on a very
large input corpus. The result is that most words will be represented as full symbols,
and only the very rare words (and unknown words) will have to be represented by
their parts.
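To make the procedure concrete, here is a compact sketch of a BPE token learner and greedy segmenter in Python. It follows the description above (using _ as the end-of-word symbol) but is our own simplification, not the reference implementation of Sennrich et al. (2016):

from collections import Counter

def merge_pair(symbols, a, b):
    # Replace every adjacent (a, b) in a tuple of symbols with the merged symbol a+b.
    out, i = [], 0
    while i < len(symbols):
        if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(word_counts, k):
    # Token learner: start from single characters (plus _) and perform k merges.
    corpus = {tuple(w) + ('_',): c for w, c in word_counts.items()}
    vocab = {sym for word in corpus for sym in word}
    merges = []
    for _ in range(k):
        pairs = Counter()
        for word, count in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]     # most frequent adjacent pair
        merges.append((a, b))
        vocab.add(a + b)
        corpus = {merge_pair(word, a, b): c for word, c in corpus.items()}
    return merges, vocab

def segment(word, merges):
    # Token segmenter: greedily apply the learned merges, in order, to a new word.
    symbols = tuple(word) + ('_',)
    for a, b in merges:
        symbols = merge_pair(symbols, a, b)
    return list(symbols)

counts = {'low': 5, 'lowest': 2, 'newer': 6, 'wider': 3, 'new': 2}
merges, vocab = learn_bpe(counts, k=8)
print(segment('lower', merges))    # ['low', 'er_']

On this tiny corpus the eight merges come out as in the walkthrough above (ties are broken by whichever pair was counted first), so the unseen word lower is segmented into low and er_.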
2.6.1 Lemmatization
For other natural language processing situations we also want two morphologically
different forms of a word to behave similarly. For example in web search, someone
may type the string woodchucks but a useful system might want to also return pages
that mention woodchuck with no s. This is especially common in morphologically
complex languages like Polish, where for example the word Warsaw has different
endings when it is the subject (Warszawa), or after a preposition like “in Warsaw” (w
Warszawie), or “to Warsaw” (do Warszawy), and so on. Lemmatization is the task
of determining that two words have the same root, despite their surface differences.
The words am, are, and is have the shared lemma be; the words dinner and dinners
both have the lemma dinner. Lemmatizing each of these forms to the same lemma
will let us find all mentions of words in Polish like Warsaw. The lemmatized form
of a sentence like He is reading detective stories would thus be He be read detective
story.
How is lemmatization done? The most sophisticated methods for lemmatization
involve complete morphological parsing of the word. Morphology is the study of
the way words are built up from smaller meaning-bearing units called morphemes.
Two broad classes of morphemes can be distinguished: stems—the central mor-
pheme of the word, supplying the main meaning—and affixes—adding “additional”
meanings of various kinds. So, for example, the word fox consists of one morpheme
(the morpheme fox) and the word cats consists of two: the morpheme cat and the
morpheme -s. A morphological parser takes a word like cats and parses it into the
two morphemes cat and s, or parses a Spanish word like amaren (‘if in the future
they would love’) into the morpheme amar ‘to love’, and the morphological features
3PL (third person plural) and future subjunctive.
Simple stemmers can be useful in cases where we need to collapse across dif-
ferent variants of the same lemma. Nonetheless, they are less commonly used in
modern systems since they commit errors of both over-generalizing (lemmatizing
policy to police) and under-generalizing (not lemmatizing European to Europe)
(Krovetz, 1993).
2.8 Minimum Edit Distance

I N T E * N T I O N
| | | | | | | | | |
* E X E C U T I O N
d s s     i s
Figure 2.14 Representing the minimum edit distance between two strings as an alignment.
The final row gives the operation list for converting the top string into the bottom string: d for
deletion, s for substitution, i for insertion.
In an alternative version of Levenshtein distance, each insertion or deletion has a cost
of 1 and substitutions are not allowed (equivalent to allowing substitution, but
giving each substitution a cost of 2 since any substitution can be represented by one
insertion and one deletion). Using this version, the Levenshtein distance between
intention and execution is 8.
Figure 2.15 Finding the edit distance viewed as a search problem: from the start state
intention, single edits lead to states such as ntention (deleting the i), intecntion (inserting a
c), and inxention (substituting x for the first t).
The space of all possible edits is enormous, so we can’t search naively. However,
lots of distinct edit paths will end up in the same state (string), so rather than recom-
puting all those paths, we could just remember the shortest path to a state each time
we saw it. We can do this by using dynamic programming. Dynamic programming
is the name for a class of algorithms, first introduced by Bellman (1957), that apply
a table-driven method to solve problems by combining solutions to subproblems.
Some of the most commonly used algorithms in natural language processing make
use of dynamic programming, such as the Viterbi algorithm (Chapter 17) and the
CKY algorithm for parsing (Chapter 18).
The intuition of a dynamic programming problem is that a large problem can
be solved by properly combining the solutions to various subproblems. Consider
the shortest path of transformed words that represents the minimum edit distance
between the strings intention and execution shown in Fig. 2.16.
Imagine some string (perhaps it is exention) that is in this optimal path (whatever
it is). The intuition of dynamic programming is that if exention is in the optimal
operation list, then the optimal sequence must also include the optimal path from
intention to exention. Why? If there were a shorter path from intention to exention,
then we could use it instead, resulting in a shorter overall path, and the optimal
sequence wouldn't be optimal, thus leading to a contradiction.
The minimum edit distance algorithm was named by Wagner and Fischer
(1974) but independently discovered by many people (see the Historical Notes sec-
tion of Chapter 17).
Let’s first define the minimum edit distance between two strings. Given two
strings, the source string X of length n, and target string Y of length m, we’ll define
i n t e n t i o n
delete i
n t e n t i o n
substitute n by e
e t e n t i o n
substitute t by x
e x e n t i o n
insert u
e x e n u t i o n
substitute n by c
e x e c u t i o n
Figure 2.16 Path from intention to execution.
D[i, j] as the edit distance between X[1..i] and Y [1.. j], i.e., the first i characters of X
and the first j characters of Y . The edit distance between X and Y is thus D[n, m].
We’ll use dynamic programming to compute D[n, m] bottom up, combining so-
lutions to subproblems. In the base case, with a source substring of length i but an
empty target string, going from i characters to 0 requires i deletes. With a target
substring of length j but an empty source going from 0 characters to j characters
requires j inserts. Having computed D[i, j] for small i, j we then compute larger
D[i, j] based on previously computed smaller values. The value of D[i, j] is com-
puted by taking the minimum of the three possible paths through the matrix which
arrive there:
D[i, j] = min( D[i−1, j] + del-cost(source[i]),
               D[i, j−1] + ins-cost(target[j]),
               D[i−1, j−1] + sub-cost(source[i], target[j]) )        (2.23)
The algorithm is summarized in Fig. 2.17; Fig. 2.18 shows the results of applying
the algorithm to the distance between intention and execution, using the version of
Levenshtein distance in which insertions and deletions each cost 1 and substitutions
of non-identical letters cost 2.
Alignment Knowing the minimum edit distance is useful for algorithms like find-
ing potential spelling error corrections. But the edit distance algorithm is important
in another way; with a small change, it can also provide the minimum cost align-
ment between two strings. Aligning two strings is useful throughout speech and
language processing. In speech recognition, minimum edit distance alignment is
used to compute the word error rate (Chapter 16). Alignment plays a role in ma-
chine translation, in which sentences in a parallel corpus (a corpus with a text in two
languages) need to be matched to each other.
To extend the edit distance algorithm to produce an alignment, we can start by
visualizing an alignment as a path through the edit distance matrix. Figure 2.19
function MIN-EDIT-DISTANCE(source, target) returns min-distance

n ← LENGTH(source)
m ← LENGTH(target)
Create a distance matrix D[n+1,m+1]
# Initialization: the zeroth row and column is the distance from the empty string
D[0,0] = 0
for each row i from 1 to n do
D[i,0] ← D[i-1,0] + del-cost(source[i])
for each column j from 1 to m do
D[0,j] ← D[0, j-1] + ins-cost(target[j])
# Recurrence relation:
for each row i from 1 to n do
for each column j from 1 to m do
D[i, j] ← MIN( D[i−1, j] + del-cost(source[i]),
D[i−1, j−1] + sub-cost(source[i], target[j]),
D[i, j−1] + ins-cost(target[j]))
# Termination
return D[n,m]
Figure 2.17 The minimum edit distance algorithm, an example of the class of dynamic
programming algorithms. The various costs can either be fixed (e.g., ∀x, ins-cost(x) = 1)
or can be specific to the letter (to model the fact that some letters are more likely to be in-
serted than others). We assume that there is no cost for substituting a letter for itself (i.e.,
sub-cost(x, x) = 0).
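For readers who want to check the numbers in Fig. 2.18 by hand, here is a direct Python rendering of the algorithm in Fig. 2.17, with the cost-2 substitution version of Levenshtein distance as the default:

def min_edit_distance(source, target, del_cost=1, ins_cost=1, sub_cost=2):
    # Minimum edit distance between source and target, following Fig. 2.17.
    # Substituting a letter for itself costs 0.
    n, m = len(source), len(target)
    D = [[0] * (m + 1) for _ in range(n + 1)]

    # Initialization: the zeroth row and column hold distances from the empty string
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + del_cost
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + ins_cost

    # Recurrence relation
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else sub_cost
            D[i][j] = min(D[i - 1][j] + del_cost,       # deletion
                          D[i][j - 1] + ins_cost,       # insertion
                          D[i - 1][j - 1] + sub)        # substitution (or copy)
    return D[n][m]

print(min_edit_distance('intention', 'execution'))      # 8, matching Fig. 2.18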
Src\Tar # e x e c u t i o n
# 0 1 2 3 4 5 6 7 8 9
i 1 2 3 4 5 6 7 6 7 8
n 2 3 4 5 6 7 8 7 8 7
t 3 4 5 6 7 8 7 8 9 8
e 4 3 4 5 6 7 8 9 10 9
n 5 4 5 6 7 8 9 10 11 10
t 6 5 6 7 8 9 8 9 10 11
i 7 6 7 8 9 10 9 8 9 10
o 8 7 8 9 10 11 10 9 8 9
n 9 8 9 10 11 12 11 10 9 8
Figure 2.18 Computation of minimum edit distance between intention and execution with
the algorithm of Fig. 2.17, using Levenshtein distance with cost of 1 for insertions or dele-
tions, 2 for substitutions.
shows this path with boldfaced cells. Each boldfaced cell represents an alignment
of a pair of letters in the two strings. If two boldfaced cells occur in the same row,
there will be an insertion in going from the source to the target; two boldfaced cells
in the same column indicate a deletion.
Figure 2.19 also shows the intuition of how to compute this alignment path. The
computation proceeds in two steps. In the first step, we augment the minimum edit
distance algorithm to store backpointers in each cell. The backpointer from a cell
points to the previous cell (or cells) that we came from in entering the current cell.
We’ve shown a schematic of these backpointers in Fig. 2.19. Some cells have mul-
tiple backpointers because the minimum extension could have come from multiple
previous cells. In the second step, we perform a backtrace. In a backtrace, we start
from the last cell (at the final row and column), and follow the pointers back through
the dynamic programming matrix. Each complete path between the final cell and the
initial cell is a minimum distance alignment. Exercise 2.7 asks you to modify the
minimum edit distance algorithm to store the pointers and compute the backtrace to
output an alignment.
        #    e      x      e      c      u      t      i      o      n
#       0    ←1     ←2     ←3     ←4     ←5     ←6     ←7     ←8     ←9
i       ↑1   ↖←↑2   ↖←↑3   ↖←↑4   ↖←↑5   ↖←↑6   ↖←↑7   ↖6     ←7     ←8
n       ↑2   ↖←↑3   ↖←↑4   ↖←↑5   ↖←↑6   ↖←↑7   ↖←↑8   ↑7     ↖←↑8   ↖7
t       ↑3   ↖←↑4   ↖←↑5   ↖←↑6   ↖←↑7   ↖←↑8   ↖7     ←↑8    ↖←↑9   ↑8
e       ↑4   ↖3     ←4     ↖←5    ←6     ←7     ←↑8    ↖←↑9   ↖←↑10  ↑9
n       ↑5   ↑4     ↖←↑5   ↖←↑6   ↖←↑7   ↖←↑8   ↖←↑9   ↖←↑10  ↖←↑11  ↖↑10
t       ↑6   ↑5     ↖←↑6   ↖←↑7   ↖←↑8   ↖←↑9   ↖8     ←9     ←10    ←↑11
i       ↑7   ↑6     ↖←↑7   ↖←↑8   ↖←↑9   ↖←↑10  ↑9     ↖8     ←9     ←10
o       ↑8   ↑7     ↖←↑8   ↖←↑9   ↖←↑10  ↖←↑11  ↑10    ↑9     ↖8     ←9
n       ↑9   ↑8     ↖←↑9   ↖←↑10  ↖←↑11  ↖←↑12  ↑11    ↑10    ↑9     ↖8
Figure 2.19 When entering a value in each cell, we mark which of the three neighboring
cells we came from with up to three arrows. After the table is full we compute an alignment
(minimum edit path) by using a backtrace, starting at the 8 in the lower-right corner and
following the arrows back. The sequence of bold cells represents one possible minimum
cost alignment between the two strings, again using Levenshtein distance with cost of 1 for
insertions or deletions, 2 for substitutions. Diagram design after Gusfield (1997).
While we worked our example with simple Levenshtein distance, the algorithm
in Fig. 2.17 allows arbitrary weights on the operations. For spelling correction, for
example, substitutions are more likely to happen between letters that are next to
each other on the keyboard. The Viterbi algorithm is a probabilistic extension of
minimum edit distance. Instead of computing the “minimum edit distance” between
two strings, Viterbi computes the “maximum probability alignment” of one string
with another. We’ll discuss this more in Chapter 17.
2.9 Summary
This chapter introduced a fundamental tool in language processing, the regular ex-
pression, and showed how to perform basic text normalization tasks including
word segmentation and normalization, sentence segmentation, and stemming.
We also introduced the important minimum edit distance algorithm for comparing
strings. Here’s a summary of the main points we covered about these ideas:
• The regular expression language is a powerful tool for pattern-matching.
• Basic operations in regular expressions include concatenation of symbols,
disjunction of symbols ([], |), counters (*, +, and {n,m}), anchors (^, $)
and precedence operators ((,)).
• Word tokenization and normalization are generally done by cascades of
simple regular expression substitutions or finite automata.
• The Porter algorithm is a simple and efficient way to do stemming, stripping
off affixes. It does not have high accuracy but may be useful for some tasks.
• The minimum edit distance between two strings is the minimum number of
operations it takes to edit one into the other. Minimum edit distance can be
computed by dynamic programming, which also results in an alignment of
the two strings.
Exercises
2.1 Write regular expressions for the following languages.
1. the set of all alphabetic strings;
2. the set of all lower case alphabetic strings ending in a b;
3. the set of all strings from the alphabet a, b such that each a is immedi-
ately preceded by and immediately followed by a b;
2.2 Write regular expressions for the following languages. By “word”, we mean
an alphabetic string separated from other words by whitespace, any relevant
punctuation, line breaks, and so forth.
1. the set of all strings with two consecutive repeated words (e.g., “Hum-
bert Humbert” and “the the” but not “the bug” or “the big bug”);
2. all strings that start at the beginning of the line with an integer and that
end at the end of the line with a word;
3. all strings that have both the word grotto and the word raven in them
(but not, e.g., words like grottos that merely contain the word grotto);
4. write a pattern that places the first word of an English sentence in a
register. Deal with punctuation.
2.3 Implement an ELIZA-like program, using substitutions such as those described
above for ELIZA. You might want to choose a different domain than a Rogerian psy-
chologist, although keep in mind that you would need a domain in which your
program can legitimately engage in a lot of simple repetition.
2.4 Compute the edit distance (using insertion cost 1, deletion cost 1, substitution
cost 1) of “leda” to “deal”. Show your work (using the edit distance grid).
2.5 Figure out whether drive is closer to brief or to divers and what the edit dis-
tance is to each. You may use any version of distance that you like.
2.6 Now implement a minimum edit distance algorithm and use your hand-computed
results to check your code.
2.7 Augment the minimum edit distance algorithm to output an alignment; you
will need to store pointers and add a stage to compute the backtrace.
Baayen, R. H. 2001. Word frequency distributions. Springer.
Bellman, R. 1957. Dynamic Programming. Princeton University Press.
Bellman, R. 1984. Eye of the Hurricane: an autobiography. World Scientific Singapore.
Bender, E. M. 2019. The #BenderRule: On naming the languages we study and why it matters. Blog post.
Bender, E. M., B. Friedman, and A. McMillan-Major. 2021. A guide for writing data statements for natural language processing. http://techpolicylab.uw.edu/data-statements/.
Bird, S., E. Klein, and E. Loper. 2009. Natural Language Processing with Python. O'Reilly.
Blodgett, S. L., L. Green, and B. O'Connor. 2016. Demographic dialectal variation in social media: A case study of African-American English. EMNLP.
Bostrom, K. and G. Durrett. 2020. Byte pair encoding is suboptimal for language model pretraining. EMNLP.
Chen, X., Z. Shi, X. Qiu, and X. Huang. 2017. Adversarial multi-criteria learning for Chinese word segmentation. ACL.
Church, K. W. 1994. Unix for Poets. Slides from 2nd ELSNET Summer School and unpublished paper ms.
Clark, H. H. and J. E. Fox Tree. 2002. Using uh and um in spontaneous speaking. Cognition, 84:73–111.
Egghe, L. 2007. Untangling Herdan's law and Heaps' law: Mathematical and informetric arguments. JASIST, 58(5):702–709.
Gebru, T., J. Morgenstern, B. Vecchione, J. W. Vaughan, H. Wallach, H. Daumé III, and K. Crawford. 2020. Datasheets for datasets. ArXiv.
Godfrey, J., E. Holliman, and J. McDaniel. 1992. SWITCHBOARD: Telephone speech corpus for research and development. ICASSP.
Gusfield, D. 1997. Algorithms on Strings, Trees, and Sequences. Cambridge University Press.
Heaps, H. S. 1978. Information Retrieval: Computational and Theoretical Aspects. Academic Press.
Herdan, G. 1960. Type-token Mathematics. Mouton.
Jones, T. 2015. Toward a description of African American Vernacular English dialect regions using “Black Twitter”. American Speech, 90(4):403–440.
Jurgens, D., Y. Tsvetkov, and D. Jurafsky. 2017. Incorporating dialectal variability for socially equitable language identification. ACL.
King, S. 2020. From African American Vernacular English to African American Language: Rethinking the study of race and language in African Americans' speech. Annual Review of Linguistics, 6:285–300.
Kiss, T. and J. Strunk. 2006. Unsupervised multilingual sentence boundary detection. Computational Linguistics, 32(4):485–525.
Kleene, S. C. 1951. Representation of events in nerve nets and finite automata. Technical Report RM-704, RAND Corporation. RAND Research Memorandum.
Kleene, S. C. 1956. Representation of events in nerve nets and finite automata. In C. Shannon and J. McCarthy, eds, Automata Studies, 3–41. Princeton University Press.
Krovetz, R. 1993. Viewing morphology as an inference process. SIGIR-93.
Kruskal, J. B. 1983. An overview of sequence comparison. In D. Sankoff and J. B. Kruskal, eds, Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison, 1–44. Addison-Wesley.
Kučera, H. and W. N. Francis. 1967. Computational Analysis of Present-Day American English. Brown University Press.
Kudo, T. 2018. Subword regularization: Improving neural network translation models with multiple subword candidates. ACL.
Kudo, T. and J. Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. EMNLP.
Levenshtein, V. I. 1966. Binary codes capable of correcting deletions, insertions, and reversals. Cybernetics and Control Theory, 10(8):707–710. Original in Doklady Akademii Nauk SSSR 163(4): 845–848 (1965).
Li, X., Y. Meng, X. Sun, Q. Han, A. Yuan, and J. Li. 2019. Is word segmentation necessary for deep learning of Chinese representations? ACL.
Lovins, J. B. 1968. Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11(1–2):9–13.
Manning, C. D., M. Surdeanu, J. Bauer, J. Finkel, S. Bethard, and D. McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. ACL.
NIST. 2005. Speech recognition scoring toolkit (sctk) version 2.1. http://www.nist.gov/speech/tools/.
O'Connor, B., M. Krieger, and D. Ahn. 2010. TweetMotif: Exploratory search and topic summarization for Twitter. ICWSM.
Packard, D. W. 1973. Computer-assisted morphological analysis of ancient Greek. COLING.
Palmer, D. 2012. Text preprocessing. In N. Indurkhya and F. J. Damerau, eds, Handbook of Natural Language Processing, 9–30. CRC Press.
Porter, M. F. 1980. An algorithm for suffix stripping. Program, 14(3):130–137.
Sennrich, R., B. Haddow, and A. Birch. 2016. Neural machine translation of rare words with subword units. ACL.
Simons, G. F. and C. D. Fennig. 2018. Ethnologue: Languages of the World, 21st edition. SIL International.
Solorio, T., E. Blair, S. Maharjan, S. Bethard, M. Diab, M. Ghoneim, A. Hawwari, F. AlGhamdi, J. Hirschberg, A. Chang, and P. Fung. 2014. Overview for the first shared task on language identification in code-switched data. Workshop on Computational Approaches to Code Switching.
Thompson, K. 1968. Regular expression search algorithm. CACM, 11(6):419–422.
Wagner, R. A. and M. J. Fischer. 1974. The string-to-string correction problem. Journal of the ACM, 21:168–173.
Weizenbaum, J. 1966. ELIZA – A computer program for the study of natural language communication between man and machine. CACM, 9(1):36–45.
Weizenbaum, J. 1976. Computer Power and Human Reason: From Judgement to Calculation. W. H. Freeman & Co.