Writing Code for NLP Research
EMNLP 2018
{joelg,mattg,markn}@allenai.org
Who we are
Matt Gardner (@nlpmattg)
Matt is a research scientist on AllenNLP. He was the original architect of AllenNLP, and he co-hosts the NLP Highlights podcast.
BREAK
What we expect you know already
- Python
- the difference between good science and bad science
What you'll learn today
- how to write code in a way that facilitates good science and reproducible experiments
- how to write code in a way that makes your life easier
The Elephant in the Room: AllenNLP
● This is not a tutorial about AllenNLP
● But (obviously, seeing as we wrote it) AllenNLP represents our experiences and opinions about how best to write research code
● Accordingly, we'll use it in most of our examples
● And we hope you'll come out of this tutorial wanting to give it a try
● But our goal is that you find the tutorial useful even if you never use AllenNLP
Two modes of writing research code
1: prototyping
2: writing components
Prototyping New Models
Main goals during prototyping
- Make sure you can bypass the abstractions when you need to
Writing code quickly - Get a good starting place
- First step: get a baseline running
- Instead: just copy the code; figure out how to share later, if it makes sense
Writing code quickly - Do use good code style
- Meaningful names
- Shape comments on tensors
- Comments describing non-obvious logic
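A minimal sketch of what these three habits look like together (the function and dimensions are invented for illustration, not taken from AllenNLP):

    import torch

    def attention_weighted_sum(vectors: torch.Tensor,
                               query: torch.Tensor,
                               mask: torch.Tensor) -> torch.Tensor:
        """Summarize a passage with respect to a single query vector."""
        # vectors: (batch_size, num_tokens, embedding_dim)
        # query: (batch_size, embedding_dim)
        # mask: (batch_size, num_tokens), 1 for real tokens, 0 for padding
        scores = torch.bmm(vectors, query.unsqueeze(-1)).squeeze(-1)   # (batch_size, num_tokens)
        # Non-obvious step: padding tokens must get zero attention, so we push
        # their scores to -inf before the softmax rather than masking afterwards.
        scores = scores.masked_fill(mask == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1)                        # (batch_size, num_tokens)
        return torch.bmm(weights.unsqueeze(1), vectors).squeeze(1)     # (batch_size, embedding_dim)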
Why so abstract?
Writing code quickly - How much to hard-code?
- Which one should I do?
Possible ablations
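A hedged sketch of the trade-off behind that question (class and argument names are hypothetical): hard-coding the encoder is fastest to write today, while passing the choice in as a constructor argument makes tomorrow's ablation a config change rather than a code change.

    import torch

    # Option A: hard-code the decision (fastest to write, annoying to ablate later)
    class HardCodedTagger(torch.nn.Module):
        def __init__(self, embedding_dim: int, hidden_dim: int):
            super().__init__()
            self.encoder = torch.nn.LSTM(embedding_dim, hidden_dim, batch_first=True)

    # Option B: take the decision as an argument (one extra line now, and the
    # LSTM-vs-GRU ablation no longer requires editing the model)
    class ConfigurableTagger(torch.nn.Module):
        def __init__(self, embedding_dim: int, hidden_dim: int, encoder_type: str = "lstm"):
            super().__init__()
            rnn_class = torch.nn.LSTM if encoder_type == "lstm" else torch.nn.GRU
            self.encoder = rnn_class(embedding_dim, hidden_dim, batch_first=True)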
Running experiments - Controlled experiments
Continuous Integration (+ Build Automation)
a unit test is an automated check that a small part of your code works correctly
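In code, a unit test is just a small function that asserts the expected behavior and runs automatically under a test runner such as pytest; the tokenizer below is a toy example invented for illustration.

    def tokenize(text: str) -> list:
        """Toy whitespace tokenizer, used only to show what a unit test checks."""
        return text.lower().split()

    def test_tokenize_lowercases_and_splits():
        assert tokenize("The dog ate the apple") == ["the", "dog", "ate", "the", "apple"]

    def test_tokenize_handles_empty_string():
        assert tokenize("") == []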
What should I test?
If You're Prototyping, Test the Basics
If You're Writing Reusable Components, Test Everything
but how?
Use Test Fixtures
- create tiny datasets that look like the real thing
- use them to create tiny pretrained models
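For instance, a hedged sketch of a fixture-based test: the file format and reader are invented for this example (and the fixture is built on the fly with pytest's tmp_path, where a real repo would check a tiny file into a fixtures directory), but the idea is a dataset small enough to run in milliseconds while exercising the same code paths as the real one.

    def read_tagging_data(path: str) -> list:
        """Toy reader: each line is 'word###TAG word###TAG ...'."""
        sentences = []
        with open(path) as data_file:
            for line in data_file:
                pairs = [token.rsplit("###", 1) for token in line.split()]
                sentences.append([(word, tag) for word, tag in pairs])
        return sentences

    def test_reader_on_tiny_fixture(tmp_path):
        fixture = tmp_path / "tiny.txt"
        fixture.write_text("The###DET dog###NN ate###V\nit###PRO ran###V\n")
        data = read_tagging_data(str(fixture))
        assert len(data) == 2
        assert data[0][1] == ("dog", "NN")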
Attention is hard to test because it relies on parameters
Use your knowledge to write clever tests
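One hedged sketch of what "clever" can mean here: even when the attention weights depend on randomly initialized parameters, we still know properties they must satisfy, such as summing to one and giving zero weight to padding. The attention function below is a stand-in written for this example, not any particular library's implementation.

    import torch

    def masked_attention(scores: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # scores, mask: (batch_size, num_tokens); mask is 0 at padding positions
        scores = scores.masked_fill(mask == 0, float("-inf"))
        return torch.softmax(scores, dim=-1)

    def test_attention_is_a_distribution_and_ignores_padding():
        torch.manual_seed(0)
        scores = torch.randn(3, 5)          # stands in for parameter-dependent scores
        mask = torch.tensor([[1, 1, 1, 0, 0],
                             [1, 1, 1, 1, 1],
                             [1, 0, 0, 0, 0]])
        weights = masked_attention(scores, mask)
        # The weights form a valid distribution no matter what the parameters were...
        assert torch.allclose(weights.sum(dim=-1), torch.ones(3))
        # ...and padding positions get exactly zero attention.
        assert torch.all(weights[mask == 0] == 0.0)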
● AllenNLP now has more than 20 models in it
  ○ some simple
  ○ some complex
● Some abstractions have consistently proven useful
● (Some haven't)
Things That We Use A Lot
● training a model
● mapping words (or characters, or labels) to indexes
● summarizing a sequence of tensors with a single tensor
Things That Require a Fair Amount of Code
● training a model
● (some ways of) summarizing a sequence of tensors with a single tensor
● some neural network modules
Things That Have Many Variations
● turning a word (or a character, or a label) into a tensor
● summarizing a sequence of tensors with a single tensor
● transforming a sequence of tensors into a sequence of tensors
Things that reflect our higher-level thinking
● we'll have some inputs:
  ○ text, almost certainly
  ○ tags/labels, often
  ○ spans, sometimes
● we need some ways of embedding them as tensors
  ○ one hot encoding
  ○ low-dimensional embeddings
● we need some ways of dealing with sequences of tensors
  ○ sequence in -> sequence out (e.g. all outputs of an LSTM)
  ○ sequence in -> tensor out (e.g. last output of an LSTM)
Along the way, we need to worry about some things that make NLP tricky
Inputs are text, but neural models want tensors
Inputs are sequences of things, and order matters
Inputs can vary in length
Some sentences are short.
Whereas other sentences are so long that by the time you finish reading them you've already forgotten what they started off talking about and you have to go back and read them a second time in order to remember the parts at the beginning.
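A hedged sketch of how these issues are typically handled together (the vocabulary and names below are made up): map tokens to indices with a vocabulary, pad every sentence in the batch to the same length, and keep a mask so downstream code knows which positions are real.

    import torch

    PAD = 0
    vocab = {"@@PAD@@": PAD, "the": 1, "dog": 2, "ate": 3, "apple": 4}

    def batch_to_tensors(sentences):
        """sentences: list of token lists -> (token_ids, mask), both (batch_size, max_len)."""
        max_len = max(len(sentence) for sentence in sentences)
        token_ids = torch.full((len(sentences), max_len), PAD, dtype=torch.long)
        mask = torch.zeros((len(sentences), max_len), dtype=torch.long)
        for i, sentence in enumerate(sentences):
            for j, token in enumerate(sentence):
                token_ids[i, j] = vocab[token.lower()]
                mask[i, j] = 1
        return token_ids, mask

    ids, mask = batch_to_tensors([["The", "dog", "ate", "the", "apple"], ["the", "dog"]])
    # ids[1] is [1, 2, 0, 0, 0] and mask[1] is [1, 1, 0, 0, 0]: the padding is masked out.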
Reusable Components in AllenNLP
AllenNLP is built on PyTorch
*usually on the first try it won't "just work", but usually that's your fault not PyTorch's
TokenEmbedder
● turns ids (the outputs of your TokenIndexers) into tensors
● many options:
○ learned word embeddings
○ pretrained word embeddings
○ contextual embeddings (e.g. ELMo)
○ character embeddings + Seq2VecEncoder
Seq2VecEncoder
● bag of words
● (last output of) LSTM
● CNN + pooling
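A hedged sketch of the abstraction itself (the shape of the idea, not AllenNLP's actual classes): every Seq2VecEncoder takes a (batch, sequence, dim) tensor plus a mask and returns a single (batch, dim) vector, so the strategies above are interchangeable implementations.

    import torch

    class BagOfWordsEncoder(torch.nn.Module):
        """Seq2Vec by mean-pooling the embeddings of the non-padding tokens."""
        def forward(self, embeddings: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
            summed = (embeddings * mask.unsqueeze(-1).float()).sum(dim=1)
            counts = mask.sum(dim=1, keepdim=True).float().clamp(min=1.0)
            return summed / counts      # (batch_size, embedding_dim)

    class LstmLastStateEncoder(torch.nn.Module):
        """Seq2Vec by running an LSTM and keeping the final hidden state."""
        def __init__(self, input_dim: int, hidden_dim: int):
            super().__init__()
            self.lstm = torch.nn.LSTM(input_dim, hidden_dim, batch_first=True)

        def forward(self, embeddings: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
            _, (hidden, _) = self.lstm(embeddings)
            return hidden[-1]           # (batch_size, hidden_dim); ignores the mask for simplicity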
Seq2SeqEncoder
● Registrable
  ○ retrieve a class by its name
● FromParams
  ○ instantiate a class instance from JSON
Registrable
● so now, given a model "type" (specified in the JSON config), we can programmatically retrieve the class
● remaining problem: how do we programmatically call the constructor?
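A hedged, stripped-down sketch of the pattern (this is the idea behind Registrable and FromParams, not AllenNLP's actual implementation): a registry maps names to classes, and the constructor is called programmatically by popping the "type" key and unpacking the rest of the config as keyword arguments.

    class Tokenizer:
        """Base class with a tiny registry: subclasses register themselves under a name."""
        _registry = {}

        @classmethod
        def register(cls, name):
            def add_to_registry(subclass):
                cls._registry[name] = subclass
                return subclass
            return add_to_registry

        @classmethod
        def from_params(cls, params: dict):
            params = dict(params)                       # don't mutate the caller's config
            subclass = cls._registry[params.pop("type")]
            return subclass(**params)                   # programmatically call the constructor

    @Tokenizer.register("whitespace")
    class WhitespaceTokenizer(Tokenizer):
        def __init__(self, lowercase: bool = True):
            self.lowercase = lowercase

        def tokenize(self, text: str):
            return text.lower().split() if self.lowercase else text.split()

    # A JSON config deserializes into exactly this kind of dict:
    tokenizer = Tokenizer.from_params({"type": "whitespace", "lowercase": True})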
○ DatasetReader gives us
○ Labels are optional in the model and dataset reader
○ Model returns an arbitrary dict, so can get and visualize model internals
○ Predictor wraps it all in JSON
○ Archive lets us load a pre-trained model in a server
○ Even better: pre-built UI components (using React) to visualize standard pieces of a model, like attentions, or span labels
We don't have it all figured out!
We're still figuring out some abstractions that we may not have right
[Figure: a simple tagging model: word inputs (The, dog, ate, the, apple) -> embeddings -> LSTM -> encodings -> Linear]
seems reasonable
v1: PyTorch - Define Model
[Code screenshots: the model definition, then the changes needed to add character-level inputs: add a character embedder, use the character embedder, compute char embeddings, concatenate inputs]
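A hedged sketch of what those annotated changes amount to in plain PyTorch (dimensions and names are invented): adding character-level inputs touches the constructor, the forward pass, and everything that feeds them.

    import torch

    class Tagger(torch.nn.Module):
        def __init__(self, vocab_size: int, num_chars: int, num_tags: int,
                     word_dim: int = 100, char_dim: int = 25, hidden_dim: int = 128):
            super().__init__()
            self.word_embedder = torch.nn.Embedding(vocab_size, word_dim)
            # add a character embedder
            self.char_embedder = torch.nn.Embedding(num_chars, char_dim)
            self.encoder = torch.nn.LSTM(word_dim + char_dim, hidden_dim, batch_first=True)
            self.tag_projection = torch.nn.Linear(hidden_dim, num_tags)

        def forward(self, word_ids, char_ids):
            # word_ids: (batch, num_tokens); char_ids: (batch, num_tokens, chars_per_token)
            word_embeddings = self.word_embedder(word_ids)
            # compute char embeddings (here just a mean over each token's characters)
            char_embeddings = self.char_embedder(char_ids).mean(dim=2)
            # concatenate inputs
            inputs = torch.cat([word_embeddings, char_embeddings], dim=-1)
            encodings, _ = self.encoder(inputs)
            return self.tag_projection(encodings)   # (batch, num_tokens, num_tags)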
v3: AllenNLP - config
we can accomplish this with just a couple of minimal config changes
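For example, a hedged sketch of the kind of change involved, shown as the Python dict the JSON config deserializes to (the exact keys and registered names vary across AllenNLP versions): the dataset reader gains a character-level token indexer and the text field embedder gains a matching character encoder, while the model code itself stays untouched.

    config_changes = {
        "dataset_reader": {
            "token_indexers": {
                "tokens": {"type": "single_id"},
                "token_characters": {"type": "characters"},          # new
            }
        },
        "model": {
            "text_field_embedder": {
                "tokens": {"type": "embedding", "embedding_dim": 100},
                "token_characters": {                                # new
                    "type": "character_encoding",
                    "embedding": {"embedding_dim": 25},
                    "encoder": {"type": "cnn", "embedding_dim": 25,
                                "num_filters": 50, "ngram_filter_sizes": [3]},
                },
            }
        },
    }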
Step 1: Write a Dockerfile
Here is a finished Dockerfile.
Do yourself a favour: don't change the names of things during this step.
Step 2: Build your Dockerfile into an Image
Python environments
Stable environments for Python can be tricky
https://www.anaconda.com/
Questions?