Intro To NLP: Natural Language Toolkit
NLP got its start around 1950 with Alan Turing’s test for machine intelligence, which evaluates
whether a computer can use language to fool humans into believing it’s human.
But approximating human speech is only one of a wide range of applications for NLP!
Applications from detecting spam emails or bias in tweets to improving accessibility for
people with disabilities all rely heavily on natural language processing techniques.
NLP can be conducted in several programming languages. However, Python has some
of the most extensive open-source NLP libraries, including the Natural Language
Toolkit or NLTK. Because of this, you’ll be using Python to get your first taste of NLP.
Text Preprocessing
"You never know what you have... until you clean your data."
~ Unknown (or possibly made up)
Cleaning and preparation are crucial for many tasks, and NLP is no exception. Text
preprocessing is usually the first step you’ll take when faced with an NLP task.
Stemming is a blunt axe to chop off word prefixes and suffixes. “booing” and
“booed” become “boo”, but “suitcases” may become the non-word “suitcas”, and “sung”
would remain “sung.”
Lemmatization is a scalpel to bring words down to their root forms. For
example, NLTK’s savvy lemmatizer knows “am” and “are” are related to “be.”
Other common tasks include lowercasing, stopword removal, spelling correction,
etc.
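Put together, a minimal preprocessing pass with NLTK might look like the sketch below (assuming the relevant NLTK data packages, such as punkt, stopwords, and wordnet, have already been downloaded; the example sentence and printed results are illustrative):

    import re
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer, WordNetLemmatizer
    from nltk.tokenize import word_tokenize

    text = "The squids jumped out of the suitcases."

    cleaned = re.sub(r"[^\w\s]", "", text).lower()       # strip punctuation, lowercase
    tokens = word_tokenize(cleaned)                       # split into word tokens
    stop_words = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in stop_words]   # remove stopwords

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()
    print([stemmer.stem(t) for t in tokens])          # roughly ['squid', 'jump', 'suitcas']
    print([lemmatizer.lemmatize(t) for t in tokens])  # roughly ['squid', 'jumped', 'suitcase']

Note how the stemmer happily produces the non-word “suitcas”, while the lemmatizer leaves “jumped” alone unless you also tell it the word is a verb.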
Parsing Text
You now have a preprocessed, clean list of words. Now what? It may be helpful to know
how the words relate to each other and the underlying syntax (grammar). Parsing is a
stage of NLP concerned with segmenting text based on syntax.
You probably do not want to be doing any parsing by hand, and NLTK has a few tricks
up its sleeve to help you out:
Named entity recognition (NER) helps identify the proper nouns (e.g., “Natalia” or
“Berlin”) in a text. These can be a clue as to the topic of the text, and NLTK can pick many
of them out for you.
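A quick sketch of NER with NLTK’s built-in chunker (the sentence is made up, and the maxent_ne_chunker, words, averaged_perceptron_tagger, and punkt data packages are assumed to be downloaded):

    from nltk import ne_chunk, pos_tag, word_tokenize

    sentence = "Natalia flew from Berlin to see the squids."
    # ne_chunk layers named-entity labels on top of POS tags.
    tree = ne_chunk(pos_tag(word_tokenize(sentence)))
    print(tree)  # "Natalia" and "Berlin" typically come back labeled PERSON and GPE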
Dependency grammar trees help you understand the relationship between the words
in a sentence. Building them by hand is a tedious task for a human, so the Python library spaCy is at your
service, even if it isn’t always perfect.
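For instance, a dependency parse with spaCy might look like the following sketch (assuming the small English model has been installed with python -m spacy download en_core_web_sm):

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("The squids jumped out of the suitcases.")

    for token in doc:
        # each token reports its syntactic role and the word it depends on
        print(token.text, token.dep_, token.head.text)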
In English we leave a lot of ambiguity, so syntax can be tough, even for a computer
program. Take a look at the following sentence: “I saw a cow under a tree with binoculars.”
Do I have the binoculars? Does the cow have binoculars? Does the tree have binoculars?
Regex parsing, using Python’s re library, allows for a bit more nuance. When coupled
with POS tagging, you can identify specific phrase chunks. On its own, it can find you
addresses, emails, and many other common patterns within large chunks of text.
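Here is a rough sketch of regex-based chunking with NLTK’s RegexpParser (the chunk grammar and sentence are just illustrations, and the punkt and POS tagger data packages are assumed to be downloaded):

    from nltk import RegexpParser, pos_tag, word_tokenize

    # Toy chunk grammar: a noun phrase (NP) is an optional determiner,
    # any number of adjectives, then a noun.
    grammar = "NP: {<DT>?<JJ>*<NN>}"
    chunk_parser = RegexpParser(grammar)

    tagged = pos_tag(word_tokenize("The furious squid jumped out of the old suitcase."))
    print(chunk_parser.parse(tagged))  # groups chunks like "The furious squid" and "the old suitcase"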
Language Models - Bag-of-Words Approach
How can we help a machine make sense of a bunch of word tokens? We can help
computers make predictions about language by training a language model on
a corpus (a bunch of example text).
One of the most common language models is the unigram model, a statistical language
model commonly known as bag-of-words. As its name suggests, bag-of-words does
not have much order to its chaos! What it does have is a tally of how many times each
word appears. Consider the following squid example: “The squids jumped out of the
suitcases. The squids were furious.” After a bit of preprocessing, bag-of-words reduces
this to counts along the lines of {"squid": 2, "jump": 1, "suitcase": 1, "furious": 1}.
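One lightweight way to build that tally yourself is with Python’s Counter; the sketch below skips stemming and stopword removal just to stay short:

    from collections import Counter
    from nltk.tokenize import word_tokenize

    text = "The squids jumped out of the suitcases. The squids were furious."
    tokens = word_tokenize(text.lower())
    bag_of_words = Counter(t for t in tokens if t.isalpha())  # drop punctuation tokens
    print(bag_of_words)  # Counter({'the': 3, 'squids': 2, 'jumped': 1, ...})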
For a model that knows something about word order, you can step up to an n-gram model,
which tallies sequences of n neighboring words instead of single words (bag-of-words is
just the n = 1, or unigram, case). N-gram models come with a couple of problems, though:
1. How can your language model make sense of the sentence “The cat fell asleep in
the mailbox” if it’s never seen the word “mailbox” before? When your model meets
new text, it will probably come across words it never encountered during training
(this issue also pertains to bag-of-words). A tactic known as language
smoothing can help adjust probabilities for unknown words, but it isn’t always
ideal.
2. For a model that more accurately predicts human language patterns, you
want n (your sequence length) to be as large as possible. That way, you will have
more natural sounding language, right? Well, as the sequence length grows, the
number of examples of each sequence within your training corpus shrinks. With
too few examples, you won’t have enough data to make many predictions.
Enter neural language models (NLM)! Much recent work within NLP has involved
developing and training neural networks to approximate the approach our human
brains take towards language. This deep learning approach gives computers a much
more adaptive way of processing human language.
Topic Models
We’ve touched on the idea of finding topics within a body of language. But what if the
text is long and the topics aren’t obvious?
A common technique is to deprioritize the most common words and prioritize less
frequently used terms as topics in a process known as term frequency-inverse
document frequency (tf-idf). Say what?! This may sound counter-intuitive at first. Why
would you want to give more priority to less-used words? Well, when you’re working
with a lot of text, it makes a bit of sense if you don’t want your topics filled with words
like “the” and “is.” The Python libraries gensim and sklearn have modules to handle tf-idf.
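As a rough sketch, scikit-learn’s TfidfVectorizer can produce these weights from a handful of documents (the tiny corpus here is just the squid example, and get_feature_names_out assumes a recent scikit-learn version):

    from sklearn.feature_extraction.text import TfidfVectorizer

    documents = [
        "The squids jumped out of the suitcases.",
        "The squids were furious.",
        "Why are your suitcases full of jumping squids?",
    ]

    vectorizer = TfidfVectorizer(stop_words="english")  # drops words like "the" and "is"
    tfidf_matrix = vectorizer.fit_transform(documents)
    print(vectorizer.get_feature_names_out())  # the remaining vocabulary
    print(tfidf_matrix.toarray())              # one row of tf-idf scores per document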
Whether you use your plain bag of words (which will give you term frequency) or run it
through tf-idf, the next step in your topic modeling journey is often latent Dirichlet
allocation (LDA). LDA is a statistical model that takes your documents and determines
which words keep popping up together in the same contexts (i.e., documents). We’ll
use sklearn to tackle this for us.
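A minimal LDA sketch with sklearn might look like this (the documents, the choice of two topics, and the random seed are all arbitrary; a real corpus would be far larger):

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    documents = [
        "The squids jumped out of the suitcases.",
        "The squids were furious.",
        "Why are your suitcases full of jumping squids?",
    ]

    # LDA works on raw term counts (a bag of words), so start with CountVectorizer.
    vectorizer = CountVectorizer(stop_words="english")
    counts = vectorizer.fit_transform(documents)

    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    lda.fit(counts)

    words = vectorizer.get_feature_names_out()
    for i, topic in enumerate(lda.components_):
        top_words = [words[j] for j in topic.argsort()[-3:]]  # 3 highest-weighted words
        print(f"Topic {i}: {top_words}")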
If you have any interest in visualizing your newly minted topics, word2vec is a great
technique to have up your sleeve. word2vec maps each word to a vector so that words
used in similar contexts end up close together in the resulting space. In the case of a
language sample consisting of “The squids jumped out of the suitcases. The squids were
furious. Why are your suitcases full of jumping squids?”, we might see that “suitcase”,
“jump”, and “squid” were words used within similar contexts. This word-to-vector
mapping is known as a word embedding.
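Training word embeddings on the squid sample with gensim could be sketched as below (gensim 4.x API; with only three sentences the resulting vectors are essentially noise, so treat the output as illustrative):

    from gensim.models import Word2Vec

    # Pre-tokenized, lightly cleaned sentences from the squid example.
    sentences = [
        ["squid", "jump", "out", "suitcase"],
        ["squid", "furious"],
        ["why", "suitcase", "full", "jump", "squid"],
    ]

    model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=100)
    print(model.wv["squid"])                       # the learned embedding vector
    print(model.wv.most_similar("squid", topn=2))  # nearest neighbors in the vector space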
Text Similarity
Most of us have a good autocorrect story. Our phone’s messenger quietly swaps one
letter for another as we type and suddenly the meaning of our message has changed (to
our horror or pleasure). However, addressing text similarity — including spelling
correction — is a major challenge within natural language processing.
Addressing word similarity and misspelling for spellcheck or autocorrect often involves
considering the Levenshtein distance or minimal edit distance between two words. The
distance is calculated through the minimum number of insertions, deletions, and
substitutions that would need to occur for one word to become another. For example,
turning “bees” into “beans” would require one substitution (“a” for “e”) and one insertion
(“n”), so the Levenshtein distance would be two.
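NLTK ships an edit_distance helper that does this arithmetic for you; the sketch below reproduces the example above:

    from nltk import edit_distance

    # Levenshtein (minimum edit) distance between two words.
    print(edit_distance("bees", "beans"))  # 2: substitute "a" for "e", then insert "n"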
It’s also helpful to find out if texts are the same to guard against plagiarism, which we
can identify through lexical similarity (the degree to which texts use the same
vocabulary and phrases). Meanwhile, semantic similarity (the degree to which
documents contain similar meaning or topics) is useful when you want to find (or
recommend) an article or book similar to one you recently finished.
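One simple way to gauge lexical similarity is to compare word-count vectors with cosine similarity, as in the sketch below (the two documents are just the squid sentences; note that “jumped” and “jumping” won’t match without stemming or lemmatization first):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    doc_a = "The squids jumped out of the suitcases."
    doc_b = "Why are your suitcases full of jumping squids?"

    # Lexical similarity: compare the documents' word-count vectors.
    vectors = CountVectorizer(stop_words="english").fit_transform([doc_a, doc_b])
    print(cosine_similarity(vectors[0], vectors[1]))  # closer to 1 means more shared vocabulary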
Language Prediction
Your first step to language prediction is picking a language model. Bag of words alone
is generally not a great model for language prediction; no matter what the preceding
word was, you will just get one of the most commonly used words from your training
corpus.
If you go the n-gram route, you will most likely rely on Markov chains to predict the
statistical likelihood of each following word (or character) based on the training corpus.
Markov chains are memory-less and make statistical predictions based entirely on the
current n-gram on hand.
For example, let’s take a sentence beginning, “I ate so many grilled cheese”. Using a
trigram model (where n is 3), a Markov chain would predict the following word as
“sandwiches” based on the number of times the sequence “grilled cheese sandwiches”
has appeared in the training data out of all the times “grilled cheese” has appeared in
the training data.
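A bare-bones trigram Markov chain along those lines might be sketched like this (the tiny corpus is invented just so that “grilled cheese” has observed continuations):

    import random
    from collections import defaultdict
    from nltk import ngrams, word_tokenize

    corpus = (
        "I ate so many grilled cheese sandwiches. "
        "I ate so many grilled cheese sandwiches that I dreamed of grilled cheese toast."
    )

    # Map each two-word history to the words that followed it in the corpus.
    tokens = word_tokenize(corpus.lower())
    transitions = defaultdict(list)
    for w1, w2, w3 in ngrams(tokens, 3):
        transitions[(w1, w2)].append(w3)

    # Predict the next word after "grilled cheese" by sampling in proportion
    # to how often each continuation appeared.
    print(random.choice(transitions[("grilled", "cheese")]))  # usually "sandwiches"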
A more advanced approach, using a neural language model, is the Long Short Term
Memory (LSTM) model. LSTM uses deep learning with a network of artificial “cells” that
manage memory, making them better suited for text prediction than traditional neural
networks.
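Purely as an illustration, a next-word LSTM could be sketched in Keras (TensorFlow) as below; the vocabulary size, sequence length, and random training data are placeholders, and a real model would be fed word-ID sequences built from an actual corpus:

    import numpy as np
    from tensorflow.keras.layers import LSTM, Dense, Embedding
    from tensorflow.keras.models import Sequential

    vocab_size = 1000  # placeholder vocabulary size
    seq_length = 10    # placeholder length of each input sequence of word IDs

    # Embed the word IDs, pass them through LSTM memory cells, then score
    # every word in the vocabulary as the candidate next word.
    model = Sequential([
        Embedding(vocab_size, 64),
        LSTM(128),
        Dense(vocab_size, activation="softmax"),
    ])
    model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")

    # Placeholder training data: random word-ID sequences and next-word labels.
    X = np.random.randint(0, vocab_size, size=(32, seq_length))
    y = np.random.randint(0, vocab_size, size=(32,))
    model.fit(X, y, epochs=1, verbose=0)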
Challenges and Considerations
Different NLP tasks may be more or less difficult in different languages. Because
so many NLP tools are built by and for English speakers, these tools may lag
behind in processing other languages. The tools may also be programmed with
cultural and linguistic biases specific to English speakers.
What if your Amazon Alexa could only understand wealthy men from coastal
areas of the United States? English itself is not a homogeneous body. English
varies by person, by dialect, and by many sociolinguistic factors. When we build
and train NLP tools, are we only building them for one type of English speaker?
You can have the best intentions and still inadvertently program a bigoted tool.
While NLP can limit bias, it can also propagate bias. As an NLP developer, it’s
important to consider biases, both within your code and within the training
corpus. A machine will learn the same biases you teach it, whether intentionally
or unintentionally.
As you become someone who builds tools with natural language processing, it’s
vital to take into account your users’ privacy. There are many powerful NLP tools
that come head-to-head with privacy concerns. Who is collecting your data? How
much data is being collected and what do those companies plan to do with your
data?
NLP Review
Check out how much you’ve learned about natural language processing!