Unix For Poets
Text is available like never before. Data collection efforts such as the Association for Computational
Linguistics’ Data Collection Initiative (ACL/DCI), the Consortium for Lexical Research (CLR), the
European Corpus Initiative (ECI), ICAME, the British National Corpus (BNC), the Linguistic Data
Consortium (LDC), Electronic Dictionary Research (EDR) and many others have done a wonderful job in
acquiring and distributing dictionaries and corpora.1 In addition, there are vast quantities of so-called
Information Super Highway Roadkill: email, bboards, faxes. We now have access to billions and billions of
words, and even more pixels.
What can we do with it all? Now that data collection efforts have done such a wonderful service to the
community, many researchers have more data than they know what to do with. Electronic bboards are
beginning to fill up with requests for word frequency counts, ngram statistics, and so on. Many researchers
believe that they don’t have sufficient computing resources to do these things for themselves. Over the
years, I’ve spent a fair bit of time designing and coding a set of fancy corpus tools for very large corpora
(eg, billions of words), but for a mere million words or so, it really isn’t worth the effort. You can almost
certainly do it yourself, even on a modest PC. People used to do these kinds of calculations on a PDP-11,
which is much more modest in almost every respect than whatever computing resources you are currently
using.
I wouldn’t bring out the big guns (fancy machines, fancy algorithms, data collection committees, bigtime
favors) unless you have a lot of text (e.g., hundreds of millions of words or more), or you are trying to count
really long ngrams (e.g., 50-grams). This chapter will describe a set of simple Unix-based tools that should
be more than adequate for counting trigrams on a corpus the size of the Brown Corpus. I’d recommend that
you do it yourself for basically the same reason that home repair stores like DIY and Home Depot are as
popular as they are. You can always hire a pro to fix your home for you, but a lot of people find that it is
better not to, unless they are trying to do something moderately hard. Hamming used to say it is much
better to solve the right problem naively than the wrong problem expertly.
I am very much a believer in teaching by examples. George Miller (personal communication) has observed
that dictionary definitions are often not as helpful as example sentences. Definitions make a lot of sense if
you already basically know what the word means, but they can be hard going if you have never seen the
word before. Following this spirit, this chapter will focus on examples and avoid definitions whenever
possible. In some cases, I will deliberately use new options and even new commands without defining
them first. The reader is encouraged to try the examples themselves, and when all else fails consult the
documentation. (But hopefully, that shouldn’t be necessary too often, since we all know how boring the
documentation can be.)
We will show how to solve the following exercises using only very simple utilities.
1. Count words in a text
2. Sort a list of words in various ways
3. Compute ngram statistics
4. Make a concordance
__________________
1. For more information on the ACL/DCI and the LDC, see http://www.cis.upenn.edu/˜ldc. The CLR’s web page is:
http://clr.nmsu.edu/clr/CLR.html, and EDR’s web page is: http://www.iijnet.or.jp/edr. Information on the ECI can be found in
http://www.cogsci.ed.ac.uk/elsnet/eci_summary.html, or by sending email to [email protected]. Information on the
BNC can be found in http://info.ox.ac.uk/bnc, or by sending email to [email protected]. Information on the London-
Lund Corpus and other corpora available through ICAME can be found in the ICAME Journal, edited by Stig Johansson,
Department of English, University of Oslo, Norway.
The code fragments in this chapter were developed and tested on a Sun computer running Berkeley Unix.
The code ought to work more or less as is on any Unix system, and even in many PC environments,
running various compatibility packages such as the MKS toolkit.
The problem is to input a text file, say Genesis (a good place to start),2 and output a list of words in the file
along with their frequency counts. The algorithm consists of three steps:
1. Tokenize the text into a sequence of words (tr),
2. Sort the words (sort), and
3. Count duplicates (uniq –c).
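Assembled into a single pipeline, with the input and output redirected to files, the whole program is:

tr -sc '[A-Z][a-z]' '[\012*]' < genesis |
sort |
uniq -c > genesis.hist

(The tr step, explained below, replaces each run of non-alphabetic characters with a newline, so that each word ends up on a line by itself.)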
Periodically, we’ll mention some things to try if one of the examples doesn’t happen to work in your
environment.
The less than symbol ‘‘<’’ indicates that the input should be taken from the file named ‘‘genesis,’’ and the
greater than symbol ‘‘>’’ indicates that the output should be redirected to a file named ‘‘genesis.hist.’’ By
default, input is read from stdin (standard input) and written to stdout (standard output). Standard input is
usually the keyboard and standard output is usually the active window.
The pipe symbol ‘‘|’’ is used to connect the output of one program to the input of the next. In this case, tr
outputs a sequence of tokens (words) which are then piped into sort. Sort outputs a sequence of sorted
tokens which are then piped into uniq, which counts up the runs, producing the desired result.
We can understand this program better by breaking it up into smaller pieces. Lets start by looking at the
beginning of the input file. The Unix program ‘‘sed’’ can be used as follows to show the first five lines of
the genesis file:
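sed 5q < genesis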
In the same way, we can use sed to verify that the first few tokens generated by tr do indeed correspond to
the first few words in the genesis file.
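tr -sc '[A-Z][a-z]' '[\012*]' < genesis | sed 5q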
Genesis
In
the
beginning
Similarly, we can verify that the output of the sort step produces a sequence of (not necessarily distinct)
tokens in lexicographic order.
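tr -sc '[A-Z][a-z]' '[\012*]' < genesis | sort | sed 5q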
A
A
Abel
Abel
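Finally, we can look at the first few lines of output from the complete pipeline:

tr -sc '[A-Z][a-z]' '[\012*]' < genesis | sort | uniq -c | sed 5q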
1
2 A
8 Abel
1 Abelmizraim
1 Abidah
This section will show three variants of the counting program above to illustrate just how easy it is to count
a wide variety of (possibly) useful things. The details of the examples aren’t all that important. The point
is to realize that pipelines are so powerful that they can be easily extended to count words (by almost any
definition one can imagine), ngrams, and much more.
The examples in this section will discuss various (weird) definitions of what is a ‘‘word.’’ Some students
get distracted at this point by these weird definitions and lose sight of the point – that pipelines make it
relatively easy to mix and match a small number of simple programs in very powerful ways. But even
these students usually come around when they see how these very same techniques can be used to count
ngrams, which is really nothing other than yet another weird definition of what is a word/token.
The first of the three examples shows that a straightforward one-line modification to the counting program
can be used to merge the counts for upper and lower case. The first line in the new program below
collapses the case distinction by translating lower case to upper case:
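tr '[a-z]' '[A-Z]' < genesis |
tr -sc '[A-Z]' '[\012*]' |
sort |
uniq -c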
Small changes to the tokenizing rule are easy to make, and can have a dramatic impact. The second
example shows that we can count vowel sequences instead of words by changing the tokenizing rule
(second tr line) to emit sequences of vowels rather than sequences of alphabetic characters.
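tr '[a-z]' '[A-Z]' < genesis |
tr -sc '[AEIOU]' '[\012*]' |
sort |
uniq -c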
The third and final counting example is just like the second, except that it counts sequences of consonants
rather than sequences of vowels.
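One way to write the tokenizing rule for consonants is simply to spell out the consonants:

tr '[a-z]' '[A-Z]' < genesis |
tr -sc '[BCDFGHJKLMNPQRSTVWXYZ]' '[\012*]' |
sort |
uniq -c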
These three examples are intended to show how easy it is to change the definition of what counts as a word.
Sometimes you want to distinguish between upper and lower case, and sometimes you don’t. Sometimes
you want to collapse morphological variants (does hostage = hostages). Different languages use different
character sets. Sometimes I get email from Sweden, for example, where ‘‘{’’ is a vowel. The tokenizer
depends on what you are trying to do. The same basic counting program can be used to count a variety of
different things, depending on how you implement the definition of thing (=token).
You can find the documentation for the tr command (and many other commands as well) by saying
man tr
If you want to see the document for some other command, simply replace the ’tr’ with the name of that
other command.
The man page for tr explains that tr inputs a character at a time and outputs a translation of the character,
depending on the options. In a simple case like
tr ’[a-z]’ ’[A-Z]’
tr translates lowercase characters to uppercase characters. The first argument, ‘‘[a-z],’’ specifies lowercase
characters and the second argument, ‘‘[A-Z],’’ specifies uppercase characters. The specification of
characters is more or less standard across most Unix commands (though be warned that there are more
surprises than there should be). The notation, ‘‘[x-y],’’ denotes a character between ‘‘x’’ and ‘‘y’’ in ascii
order. Ascii order is a generalization of alphabetic order, including not only the standard English
characters, a-z, in both upper and lower case, but also digits, 0-9, punctuation and white space. It is
important to mention, though, that ascii is an American standard and does not generalize well to European
character sets, let alone Asian characters. Some (but not all) Unix systems support
European/Asian/worldwide standards such as Latin1, EUC and Unicode.
Some characters are difficult to input (because they mean different things to different programs). One of
the worst of these is the newline character, since newline is also used to terminate the end of a
line/command (depending on the context). To get around some of these annoyances, you can specify a
character by referring to its ascii code in octal (base 8). If this sounds cryptic, don’t worry about it; it is.
All you need to know is that ‘‘[\012*]’’ is a newline.
The optional flag, ‘‘-s,’’ squeezes out multiple instances of the translation, so that
tr -sc ’[a-z][A-Z]’ ’[\012*]’
won’t output more than one newline in a row. I probably shouldn’t have used this fancy option in the first
example, because it doesn’t change the results very much (and I wanted to establish the principle that it is
ok to use fancy options without explaining them first).
3. sort
The sorting step can also be modified in a variety of different ways. The man page for sort describes a
number of options; some of the more useful ones are listed below.
Example       Explanation
______________________________________
sort -d       dictionary order
sort -f       fold case
sort -n       numeric order
sort -nr      reverse numeric order
sort -u       remove duplicates
sort +1       start with field 1 (starting from 0)
sort +0.50    start with 50th character
sort +1.5     start with 5th character of field 1
These options can be used in straightforward ways to solve the following exercises:
1. Sort the words in Genesis by frequency
tr -sc ’[A-Z][a-z]’ ’[\012*]’ < genesis |
sort |
uniq -c |
sort -nr
2. Sort them by folding case.
3. Sort them by ‘‘rhyming’’ order.
The last two examples above are left as exercises for the reader, but unlike most ‘‘exercises for the reader,’’
the solutions can be found in the appendix of this chapter.
By ‘‘rhyming’’ order, we mean that the words should be sorted from the right rather than the left, as
illustrated below:
...
1 freely
1 sorely
5 Surely
15 surely
1 falsely
1 fly
...
‘‘freely’’ comes before ‘‘sorely’’ because ‘‘yleerf’’ (‘‘freely’’ spelled backwards) comes before ‘‘yleros’’
(‘‘sorely’’ spelled backwards) in lexicographic order. Rhyming dictionaries are often used to help poets
(and linguists who are interested in morphology).
Hint: There is a Unix command ‘‘rev,’’ which reverses the letters on a line:
echo hello world | rev
dlrow olleh
Thus far, we have seen how Unix commands such as tr, sort, uniq, sed and rev can be combined into
pipelines with the ‘‘|,’’ ‘‘<,’’ and ‘‘>’’ operators in simple yet powerful ways. All of the examples were
based on counting words in a text. The flexibility of the pipeline approach was illustrated by showing how
easy it was to modify the counting program to merge the counts for upper and lower case, and to count
sequences of vowels and consonants instead of words.
4. Bigrams
The same basic counting program can be modified to count bigrams (pairs of words), using the algorithm:
1. tokenize by word
2. print word i and word i + 1 on the same line
3. count
The second step makes use of two new Unix commands, tail and paste. Tail is usually used to output the
last few lines of a file, but it can be used to output all but the first few lines with the ‘‘+2’’ option. The
following code uses tail in this way to generate the files genesis.words and genesis.nextwords which
correspond to word i and word i + 1 .
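tr -sc '[A-Z][a-z]' '[\012*]' < genesis > genesis.words
tail +2 genesis.words > genesis.nextwords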
Paste takes two files and prints them side by side on the same line. Pasting genesis.words and
genesis.nextwords together produces the appropriate input for the counting step.
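paste genesis.words genesis.nextwords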
...
And God
God said
said Let
Let there
...
Combining the pieces, we end up with the following four line program for counting bigrams:
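tr -sc '[A-Z][a-z]' '[\012*]' < genesis > genesis.words
tail +2 genesis.words > genesis.nextwords
paste genesis.words genesis.nextwords |
sort | uniq -c > genesis.bigrams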
I have presented this material in quite a number of tutorials over the years. The students almost always
come up with a big ‘‘aha’’ reaction at this point. Counting words seems easy enough, but for some reason,
students are almost always pleasantly surprised to discover that counting ngrams (bigrams, trigrams, 50-
grams) is just as easy as counting words. It is just a matter of how you tokenize the text. We can tokenize
the text into words or into ngrams; it makes little difference as far as the counting step is concerned.
5. Shell Scripts
Suppose that you found that you were often computing trigrams of different things, and you found it
inconvenient to keep typing the same five lines over and over. If you put the following into a file called
‘‘trigram,’’
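# trigram: count trigrams on the standard input
tr -sc '[A-Z][a-z]' '[\012*]' > $$words
tail +2 $$words > $$nextwords
tail +3 $$words > $$nextwords2
paste $$words $$nextwords $$nextwords2 | sort | uniq -c
rm $$words $$nextwords $$nextwords2

(The exact names of the temporary files don’t matter, as long as they include the ‘‘$$’’ discussed below.) Then you can count trigrams in Genesis with a single line:

sh trigram < genesis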
The ‘‘sh’’ command (pronounced ‘‘shell’’) is used to execute a shell script. PCs and DOS have basically
the same concept: shell scripts are called ‘‘bat’’ or batch files, and they are invoked with the ‘‘run’’
command rather than ‘‘sh.’’ (Unix command names tend to be short and rarely contain vowels.)
This example made use of a new command, ‘‘rm,’’ that deletes (removes) files. ‘‘rm’’ is basically the same
as ‘‘del’’ in DOS, but be careful! Don’t expect to be asked for confirmation. (Unix does not believe in
asking lots of interactive questions.)
The shell script also introduced several new symbols. The ‘‘#’’ symbol is used for comments. The ‘‘$$’’
syntax encodes a long number (the process id) into the names of the temporary files. It is a good idea to use
the ‘‘$$’’ syntax in this way so that two users (or two processes) can run the shell script at the same time,
and there won’t be any confusion about which temporary file belongs to which process. (If you don’t know
what a process is, don’t worry about it. A process is a job, an instance of a program that is assigned
resources by the scheduler, the time sharing system.)
6. grep
After completing the trigram exercise, you will have discovered that ‘‘the land of’’ and ‘‘And he said’’ are
the two most frequent trigrams in Genesis. Suppose that you wanted to count trigrams separately for verses
that contain just the phrase ‘‘the land of.’’
Grep (general regular expression pattern matcher) can be used as follows to extract lines containing ‘‘the
land of.’’ (WARNING: Because the ‘‘lines’’ were too long to fit the format required for publication, I have
had to insert additional newlines.)
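grep 'the land of' genesis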
Grep can be combined in straightforward ways with the trigram shell script to count trigrams for lines
matching any regular expression including ‘‘the land of’’ and ‘‘And he said.’’
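grep 'the land of' genesis | sh trigram
grep 'And he said' genesis | sh trigram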
The syntax for regular expressions can be fairly elaborate. The simplest case is like the ‘‘the land of’’
example above, where we are looking for lines containing a string. In more elaborate cases, matches can
be anchored at the beginning or end of lines by using the ‘‘^’’ and ‘‘$’’ symbols.
Example              Explanation
______________________________________
grep gh              find lines containing ‘‘gh’’
grep ’^con’          find lines beginning with ‘‘con’’
grep ’ing$’          find lines ending with ‘‘ing’’
With the -v option, instead of printing the lines that match, grep prints lines that do not match. In other
words, matching lines are filtered out or deleted from the output.
Example              Explanation
__________________________________________
grep -v gh           delete lines containing ‘‘gh’’
grep -v ’^con’       delete lines beginning with ‘‘con’’
grep -v ’ing$’       delete lines ending with ‘‘ing’’
Example                   Explanation
___________________________________________________
grep ’[A-Z]’              lines with an uppercase char
grep ’^[A-Z]’             lines starting with an uppercase char
grep ’[A-Z]$’             lines ending with an uppercase char
grep ’^[A-Z]*$’           lines with all uppercase chars
grep ’[aeiouAEIOU]’       lines with a vowel
grep ’^[aeiouAEIOU]’      lines starting with a vowel
grep ’[aeiouAEIOU]$’      lines ending with a vowel
Warning: the ‘‘^’’ can be confusing. Inside square brackets as in ‘‘[^a-zA-Z],’’ it no longer refers to the
beginning of the line, but rather, it complements the set of characters. Thus, ‘‘[^a-zA-Z]’’ denotes any
non-alphabetic character.
Example                                  Explanation
________________________________________________________
grep -i ’[aeiou].*[aeiou]’               lines with two or more vowels
grep -i ’^[^aeiou]*[aeiou][^aeiou]*$’    lines with exactly one vowel
A quick summary of the syntax of regular expressions is given in the table below. The syntax is slightly
more general for egrep. (grep and egrep are basically the same, though egrep is often more efficient.)
Example              Explanation
_______________________________________________
a                    match the letter ‘‘a’’
[a-z]                match any lowercase letter
[A-Z]                match any uppercase letter
[0-9]                match any digit
[0123456789]         match any digit
[aeiouAEIOU]         match any vowel
Grep Exercises
1. How many uppercase words are there in Genesis? Lowercase? Hint: wc -l or grep -c
2. How many 4-letter words?
3. Are there any words with no vowels?
4. Find ‘‘1-syllable’’ words (words with exactly one vowel)
5. Find ‘‘2-syllable’’ words (words with exactly two vowels)
6. Some words with two orthographic vowels have only one phonological vowel. Delete words ending
with a silent ‘‘e’’ from the 2-syllable list. Delete diphthongs (sequences of two vowels in a row).
7. Find verses in Genesis with the word ‘‘light.’’ How many have two or more instances of ‘‘light’’?
Three or more? Exactly two?
7. sed
We have already seen one example of sed: we used sed 5q < genesis
to print the first 5 lines. Actually, this means quit after the fifth line. We could have also quit after the first
instance of a regular expression.
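For example, the following prints everything up to and including the first line containing ‘‘light’’:

sed '/light/q' < genesis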
Sed is also used to substitute regions matching a regular expression with a second string. For example:
sed ’s/light/dark/g’
will replace all instances of light with dark. The first argument can be a regular expression. For example,
as part of a simple morphology program, we might insert a hyphen into words ending with ‘‘ly.’’
sed ’s/ly$/-ly/g’
The substitution operator can also be used to select the first field on each line by replacing everything after
the first white space with nothing. (Note: the square brackets contain a space followed by a tab.)
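sed 's/[ \t].*//'

(Here ‘‘\t’’ stands for the tab character; as the note says, in many versions of sed you will need to type a literal tab inside the square brackets rather than ‘‘\t.’’)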
sed exercises
2. Count word initial consonant sequences: tokenize by word, delete the vowel and the rest of the word,
and count
8. awk
Awk is a general purpose programming language, though generally intended for shorter programs (1 or 2
lines). The name ‘‘awk’’ is derived from the names of the three authors: Alfred Aho, Peter Weinberger and
Brian Kernighan.
WARNING: There are many obsolete versions of awk still in common use, especially on Sun computers.
You may need to use nawk (new awk) or gawk (Gnu awk) instead of awk, if your system is still using an
obsolete version of awk.
Awk is especially good for manipulating lines and fields in simple ways.
Example                      Explanation
___________________________________________
awk ’{print $1}’             print first field
awk ’{print $2}’             print second field
awk ’{print $NF}’            print last field
awk ’{print $(NF-1)}’        print penultimate field
awk ’{print $0}’             print the entire record (line)
awk ’{print NF}’             print the number of fields
Exercise: sort the words in Genesis by the number of syllables (sequences of vowels)
Awk supports the following predicates (functions that return truth values):
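Predicate        Explanation
______________________________________
==               equal
!=               not equal
<, <=, >, >=     less than, less than or equal, greater than, greater than or equal
~                matches a regular expression
!~               does not match
&&               logical and
||               logical or
!                logical not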
Exercises:
1. find vowel sequences that appear at least 1000 times
2. find bigrams that appear exactly twice
The ‘‘==’’ predicate can also be used on strings as illustrated in the following exercise.
Exercise: Find words in Genesis that can also be spelled both forwards and backwards. The list should
include palindromes like ‘‘deed,’’ as well as words like ‘‘live’’ whose reverse (‘‘evil’’) is also a word,
though not the same one. Hint: make a file of the words in Genesis, and a second file of those words
spelled backwards. Then look for words that are in both files. One way to find words that are in both files
is to sort them together with
sort <file1> <file2>
and count and look for words with a frequency of 2.
Exercise: Compare two files, say exodus and genesis. Find words that are in just the first file, just the
second, and both. Do it with the tools that we have discussed thus far, and then do a man on comm, and
do it again.
Suppose that one wanted to find words ending with ‘‘ed’’ in genesis.hist. Then using the tools we have
thus far, we could say
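grep 'ed$' genesis.hist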
Alternatively, we can do the regular expression matching in awk using the ~ operator.
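awk '$2 ~ /ed$/' genesis.hist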
Counting can also be done using the tools introduced previously (or not). Suppose we wanted to count the
number of words that end in /ed$/ by token or by type.3 Using the tools presented thus far, we could say:
__________________
3. The terms tokens and types are intended to clarify a possible confusion. Do the two instances of ‘‘to’’ in ‘‘to be or not to be’’
count as one word or two? We say that the two instances of ‘‘to’’ are two different tokens, but the same type. So, if we were
counting words in the phrase ‘‘to be or not to be,’’ we would say that there were six tokens and four types.
# by token
tr -sc ’[A-Z][a-z]’ ’[\012*]’ < genesis | grep -c ’ed$’
# by type
tr -sc ’[A-Z][a-z]’ ’[\012*]’ < genesis | sort -u | grep -c ’ed$’
# by token
awk ’$2 ~ /ed$/ {x = x + $1}
END {print x}’ genesis.hist
# by type
awk ’$2 ~ /ed$/ {x = x + 1}
END {print x}’ genesis.hist
‘‘x’’ is a variable. For each line that matches the regular expression, we accumulate into x either the
frequency of the word ($1) or just 1. In either case, at the end of the input, we print out the value of x.
These examples illustrate the use of variables and the assignment operator (‘‘=’’).
It is also possible to combine the two awk programs into a single program.
awk ’/ed$/ {token = token + $1; type = type + 1}
END {print token, type}’ genesis.hist
There is some syntactic sugar for adding a constant (+=) and incrementing (++) since these are such
common operations. The following program is equivalent to the previous one.
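awk '/ed$/ {token += $1; type++}
END {print token, type}' genesis.hist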
Exercises:
1. It is said that English avoids sequences of -ing words. Find bigrams where both words end in -ing.
Do these count as counter-examples to the -ing -ing rule?
2. For comparison’s sake, find bigrams where both words end in -ed. Should there also be a prohibition
against -ed -ed? Are there any examples of -ed -ed in Genesis? If so, how many? Which verse(s)?
The use of the variables, ‘‘x,’’ ‘‘type,’’ and ‘‘token’’ in examples above illustrated the use of memory
across lines.
Exercise: Print out verses containing the phrase ‘‘Let there be light.’’ Also print out the previous verse as
well.
The solution uses a variable, prev, which is used to store the previous line. The AWK program reads a line
at a time. If there is a match, we print out the previous line as well as the current one. In any case, we
reassign prev to the current line so that when the awk program reads the next line, prev will continue to
contain the previous line.
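awk '/Let there be light/ {print prev; print $0}
{prev = $0}' genesis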
Exercise: write a uniq -c program in awk. Hint: the following ‘‘almost’’ works
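awk '$0 == prev { c++ }
$0 != prev { print c, prev
c=1; prev=$0 }'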
The following example illustrates the problem: the AWK program drops the last line. There should be
three output lines:
echo a a b b c c | tr ’[ ]’ ’[\012*]’ | uniq -c
2 a
2 b
2 c
but our program generates only two:
echo a a b b c c | tr ’[ ]’ ’[\012*]’ |
awk ’$0 == prev { c++ }
$0 != prev { print c, prev
c=1; prev=$0 }’
2 a
2 b
The solution is given in the appendix.
8.6 uniq1
uniq1 is a variant of uniq that merges lines sharing the same first field, as illustrated by the following
input and output:
input:
+s goods
+s deeds
+ed failed
+ed attacked
+ing playing
+ing singing
output:
+s goods deeds
+ed failed attacked
+ing playing singing
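One way to write uniq1 in awk (a sketch; it assumes each input line has exactly two fields):

awk '$1 == prev { printf " %s", $2; next }
{ if (NR > 1) printf "\n"
printf "%s %s", $1, $2
prev = $1 }
END { printf "\n" }'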
...
abacus n
abaft av pp
abalone n
abandon vt n
abandoned aj
...
8.7 Arrays
Associative arrays in awk make it easy to go beyond simple counting, for example to compute the mutual
information of each bigram. Mutual information compares the frequency of the pair with the frequencies of
the two words separately:

    I(x;y) = log2 ( Pr(x,y) / ( Pr(x) Pr(y) ) )
           ≈ log2 ( N f(x,y) / ( f(x) f(y) ) )

where f(x) and f(x,y) are the frequencies counted above and N is the size of the corpus.
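A sketch of a program that computes this, assuming unigram counts in genesis.hist and bigram counts in
genesis.bigrams (in the ‘‘count word’’ and ‘‘count word1 word2’’ formats produced earlier):

awk 'NR == FNR { f[$2] = $1; N += $1; next }      # first file: unigram counts
{ print log(N * $1 / (f[$2] * f[$3])) / log(2), $2, $3 }   # second file: bigram counts
' genesis.hist genesis.bigrams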
1. Mutual information is unstable for small bigram counts. Modify the previous program so that it doesn’t
produce any output when the bigram count is less than 5.
2. Another widely used association statistic is the t-score:

    t ≈ ( f(x,y) − (1/N) f(x) f(y) ) / sqrt( f(x,y) )

Compute it instead of (or in addition to) the mutual information.
3. Print the words that appear in both Genesis and wsj.frag, followed by their freqs in the two samples.
Do a man on join and do it again.
4. Repeat the previous exercise, but don’t distinguish uppercase words from lowercase words.
9. KWIC
Input:
All’s well that ends well.
Nature abhors a vacuum.
Every man has a price.
Output:
  Every man has a price.
  Nature abhors a vacuum.
         Nature abhors a vacuum
All’s well that ends well.
      Every man has a price.
          Every man has a price
Every man has a price.
     All’s well that ends well.
Nature abhors a vacuum.
          All’s well that ends
 well that ends well.
awk ’
{for(i=1; i<length($0); i++)
if(substr($0, i, 1) == " ")
printf("%15s%s\n",
substr($0, i-15, i<=15 ? i-1 : 15),
substr($0, i, 15))}’
• substr
• length
• printf
• for(i=1; i<n; i++) { ... }
• pred ? true : false
Exercise: Make a concordance instead of a KWIC index. That is, show only those lines that match the
input word.
awk ’{i=0;
while(m=match(substr($0, i+1), "well")){
i+=m
printf("%15s%s\n",
substr($0, i-15, i<=15 ? i-1 : 15),
substr($0, i, 15))}}’
awk ’{i=0;
while(m=match(substr($0, i+1), re)) {
i+=m
printf("%15s%s\n",
substr($0, i-15, i<=15 ? i-1 : 15),
substr($0, i, 15))}}
’ re=" [^aeiouAEIOU]"
This is a bit of a trick question. What do we mean by a word? Do we count two instances of ‘‘the’’ as two
words or just one? That is, are we counting by token or by type? Either solution is fine. The exercises are
intentionally left vague on certain points to illustrate that the hardest part of these kinds of calculations is often
deciding what you want to compute.
# by token
tr -sc ’[A-Z][a-z]’ ’[\012*]’ < genesis |
grep -c ’^[A-Z]’
5533
# by type
tr -sc ’[A-Z][a-z]’ ’[\012*]’ < genesis |
sort -u | grep -c ’^[A-Z]’
635
# by token
tr -sc ’[A-Z][a-z]’ ’[\012*]’ < genesis |
grep -c ’^....$’
9036
# by type
tr -sc ’[A-Z][a-z]’ ’[\012*]’ < genesis |
sort -u | grep -c ’^....$’
448
The output above illustrates that there are certain technical difficulties with our definition of ‘‘word’’ and
with our definition of ‘‘vowel.’’ Often, though, it is easier to work with imperfect definitions and interpret
the results with a grain of salt than to spend endless amounts of time refining the program to deal with the
tricky cases that hardly ever come up.
Three or more?
Exactly two?
Exercise: Sort the words in Genesis by the number of syllables (sequences of vowels).
Exercise: Find words in Genesis that can also be spelled both forwards and backwards.
A
I
O
a
deed
did
draw
evil
ewe
live
no
noon
on
s
saw
ward
was
Exercise: Compare Exodus and Genesis. Find words that are in just the first file, just the second, and both.
Solution: sort the words in Genesis with two copies of the words in Exodus and count. Words with counts
of 1 appear just in Genesis, words with counts of 2 appear just in Exodus, and words with counts of 3
appear in both.