Text Classification
Since most of the classification datasets assume that only one label is correct (you will see this right now!), in this lecture we deal with this type of classification, i.e. single-label classification. We mention multi-label classification in a separate section (Multi-Label Classification).

The most popular datasets are for sentiment classification. They consist of reviews of movies, places or restaurants, and products. There are also datasets for question type classification and topic classification.

To better understand typical classification tasks, below you can look at the examples from different datasets.

How to: pick a dataset and look at the examples to get a feeling of the task. Or you can come back to this later!

Datasets to pick from: SST, IMDb Review, Yelp Review, Amazon Review, TREC, Yahoo! Answers, AG's News, Sogou News, DBPedia.
General View
Here we provide a general view on classification and introduce the notation. This section applies to both classical and neural approaches.

We assume that we have a collection of documents with ground-truth labels. The input of a classifier is a document x = (x_1, …, x_n) with tokens x_1, …, x_n; the output is a label y ∈ {1, …, k}. Usually, a classifier estimates a probability distribution over classes, and we want the correct class to receive the highest probability.

A classification model consists of two parts:

feature extractor: a feature extractor can be either manually defined (as in classical approaches) or learned (e.g., with neural networks);

classifier: a classifier has to assign class probabilities given the feature representation of a text. The most common way to do this is using logistic regression, but other variants are also possible (e.g., a Naive Bayes classifier or an SVM).

In this lecture, we'll mostly be looking at different ways to build a feature representation of a text and to use this representation to get class probabilities.
A classification model can be either generative or discriminative.

generative models

Generative models model the joint probability of data and labels, p(x, y) = p(y) ⋅ p(x|y). To make a prediction given an input x, these models pick the class with the highest joint probability: $y = \arg\max_k p(x, y = k)$.

discriminative models

Discriminative models are interested only in the conditional probability p(y|x), i.e. they learn only the border between classes. To make a prediction given an input x, these models pick the class with the highest conditional probability: $y = \arg\max_k p(y = k|x)$.

In this lecture, we will meet both generative and discriminative models.
Classical Methods for Text Classification

In this part, we consider classical approaches to text classification. They were developed long before neural networks became popular, and for small datasets they can still perform comparably to neural models.

Lena: Later in the course, we will learn about transfer learning, which can make neural approaches better even for very small datasets. But let's take this one step at a time: for now, classical approaches are a good baseline for your models.
Naive Bayes

First, let us recall the basic terminology:

prior probability P(y = k): the class probability before looking at the data (i.e., before knowing x);

posterior probability P(y = k|x): the class probability after looking at the data (i.e., after knowing the specific x);

joint probability P(x, y): the joint probability of data (i.e., both examples x and labels y);

maximum a posteriori (MAP) estimate: we pick the class with the highest posterior probability.
P(y = k): count

The prior class probabilities are estimated as the proportion of documents with each label:

$$P(y = k) = \frac{N(y = k)}{N},$$

where N(y = k) is the number of examples (documents) with the label k and N is the total number of training documents.
P(x|y=k): use the "naive" assumptions, then count

Here we assume that a document x is represented as a set of features, e.g., the set of its words (x_1, …, x_n):

$$P(x|y = k) = P(x_1, \dots, x_n|y = k).$$

The "naive" assumption is that the features (words) are conditionally independent given the class:

$$P(x_1, \dots, x_n|y = k) = \prod_{t=1}^{n} P(x_t|y = k).$$

Intuitively, this means that the probability of each word appearing in a document of class k does not depend on context (neither word order nor other words at all). For example, we can say that awesome, brilliant, great are more likely to appear in documents with a positive sentiment and awful, boring, bad are more likely in negative documents, but we know nothing about how these (or other) words influence each other.
The probabilities P(x_i|y = k) are estimated as the proportion of times the word x_i appeared in documents of class k among all tokens in these documents:

$$P(x_i|y = k) = \frac{N(x_i, y = k)}{\sum_{t=1}^{|V|} N(x_t, y = k)},$$

where N(x_i, y = k) is the number of times the token x_i appeared in documents with the label k, and V is the vocabulary (more generally, the set of all possible features).
What if N(x_i, y = k) = 0, i.e. in training we haven't seen the token x_i in documents with class k? This will null out the probability of the whole document, and this is not what we want! For example, if we haven't seen some rare words (e.g., pterodactyl or abracadabra) in training positive examples, it does not mean that a positive document can never contain these words.

To avoid this, we'll use a simple trick: we add a small δ to the counts of all words:

$$P(x_i|y = k) = \frac{\delta + N(x_i, y = k)}{\sum_{t=1}^{|V|} (\delta + N(x_t, y = k))} = \frac{\delta + N(x_i, y = k)}{\delta \cdot |V| + \sum_{t=1}^{|V|} N(x_t, y = k)}.$$

Note: this is Laplace smoothing (aka add-1 smoothing if δ = 1). We'll learn more about smoothing in the next lecture, when talking about Language Modeling.
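As a concrete illustration, here is a minimal sketch (plain Python, standard library only) of how the smoothed counts can be turned into probability estimates; the function and variable names are illustrative, not from the lecture.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels, delta=1.0):
    """Estimate log P(y=k) and log P(x_i|y=k) with add-delta smoothing.

    docs: list of token lists, labels: list of class labels (toy interface).
    """
    class_counts = Counter(labels)                       # N(y = k)
    word_counts = defaultdict(Counter)                   # N(x_i, y = k)
    vocab = set()
    for tokens, y in zip(docs, labels):
        word_counts[y].update(tokens)
        vocab.update(tokens)

    log_prior = {y: math.log(n / len(docs)) for y, n in class_counts.items()}
    log_likelihood = {}
    for y in class_counts:
        denom = delta * len(vocab) + sum(word_counts[y].values())
        log_likelihood[y] = {w: math.log((delta + word_counts[y][w]) / denom)
                             for w in vocab}
    return log_prior, log_likelihood

log_prior, log_likelihood = train_naive_bayes(
    docs=[["awesome", "great"], ["boring", "bad"], ["great", "movie"]],
    labels=["pos", "neg", "pos"])
```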
Making a Prediction

As we already mentioned, Naive Bayes (and, more broadly, generative models) makes a prediction based on the joint probability of data and class:

$$y^{*} = \arg\max_k P(x, y = k) = \arg\max_k P(y = k) \cdot P(x|y = k).$$
Practical Note: Sum of Log-Probabilities Instead of Product of Probabilities

The main expression Naive Bayes uses for classification is a product of a lot of probabilities:

$$P(x, y = k) = P(y = k) \cdot \prod_{t=1}^{n} P(x_t|y = k).$$

A product of many probabilities may be very unstable numerically. Therefore, usually instead of P(x, y) we consider log P(x, y):

$$\log P(x, y = k) = \log P(y = k) + \sum_{t=1}^{n} \log P(x_t|y = k).$$

Since we care only about the argmax, we can consider log P(x, y) instead of P(x, y).

Important! Note that in practice, we will usually deal with log-probabilities and not probabilities.
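And a matching sketch of the prediction step in log-space, reusing the log_prior and log_likelihood tables produced by the training sketch above (again, names are illustrative); unknown words are simply skipped here.

```python
def predict_naive_bayes(tokens, log_prior, log_likelihood):
    """Return argmax_k [ log P(y=k) + sum_i log P(x_i|y=k) ]."""
    best_class, best_score = None, float("-inf")
    for y, prior in log_prior.items():
        score = prior + sum(log_likelihood[y][w] for w in tokens
                            if w in log_likelihood[y])   # skip unseen words
        if score > best_score:
            best_class, best_score = y, score
    return best_class

print(predict_naive_bayes(["great", "movie"], log_prior, log_likelihood))  # likely "pos"
```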
Feature Design
In the standard setting, we used words as features. However, you can use other types of
features: URL, user id, etc.
Even if your data is plain text (without fancy things such as URLs, user ids, etc.), you can still design features in different ways. Learn how to improve Naive Bayes in this exercise in the Research Thinking section.
MaxEnt (Logistic Regression)

Here we also have to define features manually, but we have more freedom: features do not have to be categorical (in Naive Bayes, they had to be!). We can use the BOW representation or come up with something more interesting.
The general classification pipeline here is as follows:

get h = (f_1, f_2, …, f_n) - the feature representation of the input text;

take w^{(k)} = (w_1^{(k)}, …, w_n^{(k)}) - vectors with feature weights for each of the K classes;

for each class, weigh the features, i.e. take the dot product of the feature representation h with the feature weights w^{(k)}:

$$w^{(k)} h = \sum_{i=1}^{n} w_i^{(k)} f_i, \qquad k = 1, \dots, K.$$

To get a bias term in the sum above, we define one of the features to be 1 (e.g., f_0 = 1). Then

$$w^{(k)} h = w_0^{(k)} + \sum_{i=1}^{n} w_i^{(k)} f_i;$$

get class probabilities using softmax:

$$p(y = k|x) = \frac{\exp(w^{(k)} h)}{\sum_{j=1}^{K} \exp(w^{(j)} h)}.$$

Softmax normalizes the K values we got at the previous step to a probability distribution over the output classes (see the sketch below).
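To make the scoring step concrete, here is a small NumPy sketch, assuming a feature vector h, one weight vector per class stacked into a matrix, and a separate bias term (equivalent to the constant feature f_0 = 1); the sizes and values are toy placeholders.

```python
import numpy as np

def class_probabilities(h, W, b):
    """h: (n,) features; W: (K, n) weights, one row per class; b: (K,) biases."""
    scores = W @ h + b                     # w^{(k)} · h for each class k
    scores -= scores.max()                 # subtract max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()   # softmax over the K classes

h = np.array([1.0, 0.0, 2.0])              # toy feature representation of a text
W = np.array([[0.5, -0.1, 0.2],            # 3 classes, 3 features
              [-0.3, 0.4, 0.1],
              [0.0, 0.2, -0.5]])
b = np.zeros(3)
print(class_probabilities(h, W, b))        # three probabilities summing to 1
```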
We learn the feature weights by maximizing the log-likelihood of the training data:

$$w^{*} = \arg\max_w \sum_{i=1}^{N} \log P(y = y^{(i)}|x^{(i)}),$$

where N is the number of training examples. In other words, we choose parameters such that the data is more likely to appear. Therefore, this is called the Maximum Likelihood Estimate (MLE) of the parameters.

To find the parameters maximizing the data log-likelihood, we use gradient ascent: we gradually improve the weights during multiple iterations over the data. At each step, we maximize the probability a model assigns to the correct class.
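In practice, this whole pipeline (bag-of-words features plus a logistic regression classifier fitted by likelihood maximization) is a few lines with scikit-learn; a sketch with made-up toy data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["absolutely amazing, the best ever", "boring and bad",
         "great movie", "awful film"]                         # made-up toy data
labels = ["pos", "neg", "pos", "neg"]

clf = make_pipeline(CountVectorizer(), LogisticRegression())  # BOW features + MaxEnt
clf.fit(texts, labels)
print(clf.predict(["it was great"]))                          # likely ['pos']
```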
Equivalence to minimizing cross-entropy

Note that maximizing the data log-likelihood is equivalent to minimizing the cross-entropy between the target probability distribution p* = (0, …, 0, 1, 0, …) (1 for the target label, 0 for the rest) and the distribution predicted by the model, p = (p_1, …, p_K), p_i = p(i|x):

$$Loss(p^{*}, p) = -p^{*} \log(p) = -\sum_{i=1}^{K} p^{*}_i \log(p_i).$$

Since only one of the p^{*}_i is non-zero (1 for the target label k, 0 for the rest), we will get

$$Loss(p^{*}, p) = -\log(p_k) = -\log(p(k|x)).$$
This equivalence is very important for you to understand: when talking about neural approaches, people usually say that they minimize the cross-entropy loss. Do not forget that this is the same as maximizing the data log-likelihood.
Naive Bayes vs Logistic Regression
Let's finalize this part by discussing the advantages and drawbacks of logistic regression and Naive Bayes.

simplicity: both methods are simple; Naive Bayes is the simplest one.

interpretability: both methods are interpretable: you can look at the features which influenced the predictions most (in Naive Bayes - usually words, in logistic regression - whatever features you defined).

training speed: Naive Bayes is very fast to train - it requires only one pass through the training data to evaluate the counts. For logistic regression, this is not the case: you have to go over the data many times until gradient ascent converges.

independence assumptions: Naive Bayes is too "naive" - it assumes that features (words) are conditionally independent given the class. Logistic regression does not make this assumption - we can hope it is better.

text representation: both methods use a manually defined feature representation (in Naive Bayes, BOW is the standard choice, but you still choose this yourself). While manually defined features are good for interpretability, they may be not so good for performance - you are likely to miss something which could be useful for the task.
Text Classification with Neural Networks

Instead of manually defined features, let a neural network learn useful features.

The main idea of neural-network-based classification is that the feature representation of the input text can be obtained using a neural network. In this setting, we feed the embeddings of the input tokens to a neural network, and this neural network gives us a vector representation of the input text. After that, this vector is used for classification.
When dealing with neural networks, we can think about the classification part (i.e., how to get class probabilities from a vector representation of a text) in a very simple way.

Let us look closer at the neural network classifier. The way we use the vector representation of the input text is exactly the same as in logistic regression: we weigh the features according to the feature weights for each class. The only difference from logistic regression is where the features come from: they are either defined manually (as we did before) or obtained by a neural network.
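A minimal PyTorch sketch of this classification part: whichever network produces the text vector, the final step is a linear layer over that vector followed by softmax. The sizes here are illustrative.

```python
import torch
import torch.nn as nn

hidden_size, num_classes = 128, 5
classifier_head = nn.Linear(hidden_size, num_classes)  # rows of .weight act as class vectors

text_vector = torch.randn(1, hidden_size)   # produced by any network (BOE, RNN, CNN, ...)
logits = classifier_head(text_vector)       # one score per class
probs = torch.softmax(logits, dim=-1)       # probability distribution over classes
```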
Intuition: Text Representation Points in the Direction of Class Representation
If we look at this final linear layer more closely, we will see that the columns of its matrix are vectors w_i. These vectors can be thought of as vector representations of classes. A good neural network will learn to represent input texts in such a way that text vectors will point in the direction of the corresponding class vectors.
Training and the Cross-Entropy Loss

Neural classifiers are trained to predict probability distributions over classes. Intuitively, at each step we maximize the probability a model assigns to the correct class.

The standard loss function is the cross-entropy loss. The cross-entropy loss for the target probability distribution p* = (0, …, 0, 1, 0, …) (1 for the target label, 0 for the rest) and the distribution predicted by the model, p = (p_1, …, p_K), p_i = p(i|x), is

$$Loss(p^{*}, p) = -p^{*} \log(p) = -\sum_{i=1}^{K} p^{*}_i \log(p_i).$$

Since only one of the p^{*}_i is non-zero (1 for the target label k, 0 for the rest), we get

$$Loss(p^{*}, p) = -\log(p_k) = -\log(p(k|x)).$$

In training, we gradually improve the model weights during multiple iterations over the data: we iterate over training examples (or batches of examples) and make gradient updates. At each step, we maximize the probability a model assigns to the correct class. At the same time, we minimize the sum of the probabilities of incorrect classes: since the sum of all probabilities is constant, by increasing one probability we decrease the sum of all the rest (Lena: Here I usually imagine a bunch of kittens eating from the same bowl: one kitten always eats at the expense of the others).
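A sketch of one such training step in PyTorch; torch.nn.CrossEntropyLoss takes raw logits and the index of the correct class and computes -log p(k|x) for us. The model and data below are toy stand-ins.

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 5)                    # stands in for "network + classification head"
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()              # cross-entropy = -log p(correct class | x)

text_vectors = torch.randn(32, 128)          # a batch of text representations
targets = torch.randint(0, 5, (32,))         # indices of the correct classes

optimizer.zero_grad()
logits = model(text_vectors)
loss = loss_fn(logits, targets)              # negative log-likelihood of the batch
loss.backward()
optimizer.step()                             # raises p(correct class), lowers the rest
```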
Recap: This is equivalent to maximizing the data likelihood

Do not forget that when talking about the MaxEnt classifier (logistic regression), we showed that minimizing cross-entropy is equivalent to maximizing the data likelihood. Therefore, here we are also trying to get the Maximum Likelihood Estimate (MLE) of the model parameters.
Models for Text Classification

We need a model that can produce a fixed-sized vector for inputs of different lengths.

In this part, we will look at different ways to get a vector representation of an input text using neural networks. Note that while input texts can have different lengths, the vector representation of a text has to have a fixed size: otherwise, a network will not "work".

Lena: A bit later in the course, you will learn about Transformers and the most recent classification techniques using large pretrained models.

Models: (Weighted) Bag of Embeddings (BOE)

The simplest way to get a fixed-sized vector for a text is to sum (or average) the embeddings of its tokens, possibly with weights - hence "(weighted) bag of embeddings".

Bag of Embeddings (ideally, along with Naive Bayes) should be a baseline for any model with a neural network: if you can't do better than that, it's not worth using NNs at all. This can be the case if you don't have much data.
While Bag of Embeddings (BOE) is sometimes called Bag of Words (BOW), note that these two are very different. BOE is the sum of embeddings and BOW is the sum of one-hot vectors: BOE knows a lot more about language. Pretrained embeddings (e.g., Word2Vec or GloVe) understand similarity between words. For example, awesome, brilliant, great will be represented with unrelated features in BOW but with similar word vectors in BOE.

Note also that to use a weighted sum of embeddings, you need to come up with a way to get the weights. However, this is exactly what we wanted to avoid by using neural networks: we don't want to introduce manual features, but rather let a network learn useful patterns.
Bag of Embeddings as Features for SVM

You can use an SVM on top of BOE! The only difference from SVMs in the classical approaches (on top of bag-of-words and bag-of-ngrams) is the choice of kernel: here the RBF kernel is better.
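A sketch of this combination: average word vectors to get a bag-of-embeddings representation, then fit an RBF-kernel SVM on top with scikit-learn. The embedding table here is random; in practice you would load, e.g., Word2Vec or GloVe vectors.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50)                # stand-in for Word2Vec/GloVe vectors
       for w in "awesome brilliant great awful boring bad movie".split()}

def bag_of_embeddings(text):
    vectors = [emb[w] for w in text.split() if w in emb]
    return np.mean(vectors, axis=0) if vectors else np.zeros(50)

texts = ["awesome great movie", "brilliant movie", "awful boring movie", "bad movie"]
labels = [1, 1, 0, 0]

X = np.stack([bag_of_embeddings(t) for t in texts])
clf = SVC(kernel="rbf").fit(X, labels)       # RBF kernel on top of BOE features
print(clf.predict([bag_of_embeddings("great movie")]))
```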
Models: Recurrent (RNN/LSTM/etc)

Recurrent networks are a natural way to process text in the sense that, similarly to humans, they "read" a sequence of tokens one by one and process the information. Hopefully, at each step the network will "remember" everything it has read before.
Basics: Recurrent Neural Networks

• RNN cell

At each step, a recurrent network receives a new input vector (e.g., a token embedding) and the previous network state (which, hopefully, encodes all previous information). Using this input, the RNN cell computes the new state, which it gives as output. This new state contains information about both the current input and the information from the previous steps.

• Vanilla RNN

$$h_t = \tanh(h_{t-1} W_h + x_t W_x).$$

Vanilla RNNs suffer from the vanishing and exploding gradients problem. To alleviate this problem, more complex recurrent cells (e.g., LSTM, GRU, etc.) perform several operations on the input and use gates. For more details on RNN basics, look at Colah's blog post.
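A few lines of NumPy to make the vanilla RNN update concrete (all sizes are illustrative):

```python
import numpy as np

d, h = 8, 16                                  # embedding size, hidden size
W_h = np.random.randn(h, h) * 0.1             # recurrent weights
W_x = np.random.randn(d, h) * 0.1             # input weights

h_t = np.zeros(h)
for x_t in np.random.randn(5, d):             # 5 token embeddings, read one by one
    h_t = np.tanh(h_t @ W_h + x_t @ W_x)      # h_t = tanh(h_{t-1} W_h + x_t W_x)

text_vector = h_t                             # the final state summarizes the whole text
```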
Here we (finally!) look at how we can use recurrent models for text classification. Everything you will see here applies to all recurrent cells, and by "RNN" in this part I refer to recurrent cells in general (e.g. vanilla RNN, LSTM, GRU, etc).

Let us recall what we need:

We need a model that can produce a fixed-sized vector for inputs of different lengths.
• Simple: read a text, take the final state

The simplest recurrent model is a one-layer RNN network. In this network, we have to take the state which knows the most about the input text. Therefore, we have to use the last state - only this state has seen all input tokens.
• Multiple layers: feed the states from one RNN to the next one

To get a better text representation, you can stack multiple layers. In this case, the inputs for the higher RNN are the representations coming from the previous layer.

The main hypothesis is that with several layers, lower layers will catch local phenomena (e.g., phrases), while higher layers will be able to learn more high-level things (e.g., topic).
• Bidirectional: use final states from forward and backward RNNs

The previous approaches may have a problem: the last state can easily "forget" earlier tokens. Even strong models such as LSTMs can still suffer from this!

To avoid this, we can use two RNNs: a forward one, which reads the input from left to right, and a backward one, which reads the input from right to left. Then we can use the final states from both models: one will better remember the final part of a text, the other - the beginning. These states can be concatenated, or summed, or combined in some other way - it's your choice!

You can combine the ideas above. For example, in a multi-layered network, some layers can go in the opposite direction, etc. (see the sketch below).
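A PyTorch sketch combining these ideas: an embedding layer, a multi-layer bidirectional LSTM, concatenation of the two final states of the last layer, and a linear classifier on top. The vocabulary size and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=100, hidden=128, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_classes)    # forward + backward final states

    def forward(self, token_ids):                        # token_ids: (batch, seq_len)
        emb = self.embedding(token_ids)
        _, (h_n, _) = self.lstm(emb)                     # h_n: (num_layers * 2, batch, hidden)
        text_vector = torch.cat([h_n[-2], h_n[-1]], dim=-1)  # last layer, both directions
        return self.out(text_vector)                     # logits over classes

model = BiLSTMClassifier()
logits = model(torch.randint(0, 10000, (4, 20)))         # a batch of 4 texts, 20 tokens each
```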
Models: Convolutional (CNN)

Convolutional networks were originally developed for computer vision tasks. Therefore, let's first understand the intuition behind convolutional models for images.

Imagine we want to classify an image into several classes, e.g. cat, dog, airplane, etc. In this case, if you find a cat in an image, you don't care where in the image this cat is: you care only that it is there somewhere.

Convolutional networks apply the same operation to small parts of an image: this is how they extract features. Each operation is looking for a match with a pattern, and the network learns which patterns are useful. With a lot of layers, the learned patterns become more and more complicated: from lines in the early layers to very complicated patterns (e.g., a whole cat or dog) in the upper ones. You can look at the examples in the Analysis and Interpretability section.

This property is called translation invariance: translation because we are talking about shifts in space, invariance because we want it to not matter.
Convolutions for Text

Well, for images it's all clear: e.g. we want to be able to move a cat because we don't care where the cat is. But what about texts? At first glance, this is not so straightforward: we cannot move phrases easily - the meaning will change, or we will get something that does not make much sense.

However, there are some applications where we can use the same intuition. Let's imagine that we want to classify texts - not into cats/dogs as with images, but into positive/negative sentiment. Then there are some words and phrases which could be very informative "clues" (e.g. it's been great, bored to death, absolutely amazing, the best ever, etc.), and others which are not important at all. We don't care much where in a text we saw bored to death to understand the sentiment, right?
Following the intuition above, we want to detect some patterns, but we don't care much where exactly these patterns are. This behavior is implemented with two layers:

convolution: finds matches with patterns (such as the cat head we saw above);
pooling: aggregates these matches over positions (either locally or globally).

In a typical convolutional model for text classification, a convolutional layer is applied to the word embeddings to get a vector representation of the input text, followed by a non-linearity (usually ReLU) and a pooling operation. The way this representation is used for classification is the same as in the other networks.
Basics: Convolution Layer for Text

Convolutional Neural Networks were initially developed for computer vision tasks, e.g. classification of images (cats vs dogs, etc). The idea of a convolution is to go over an image with a sliding window and to apply the same operation, the convolution filter, to each window. Since an image has two dimensions (width and height), the convolution is two-dimensional.

Differently from images, texts have only one dimension, so here the convolution is one-dimensional.
Convolution is a Linear Operation Applied to Each Window

Let d be the size of an input token embedding, k (kernel size) the length of a convolution window (e.g., k = 3), and m (output channels) the number of convolution filters (i.e., the number of channels produced by the convolution).

Then a convolution is a linear layer $W \in \mathbb{R}^{(k \cdot d) \times m}$. For a k-sized window (x_i, …, x_{i+k-1}), the convolution takes the concatenation of these vectors,

$$u_i = [x_i, \dots, x_{i+k-1}] \in \mathbb{R}^{k \cdot d},$$

and multiplies it by the convolution matrix:

$$F_i = u_i \times W.$$

A convolution goes over the input with a sliding window and applies the same linear transformation to each window.

The number $F_i^{(f)}$ (the extracted "feature") is the result of applying the filter f to the window (x_i, …, x_{i+k-1}).
• m filters: m feature extractors

One filter extracts a single feature. Usually, we want many features: for this, we have to take several filters. Each filter reads the input text and extracts a different feature. The number of filters is the number of output features you want to get. With m filters instead of one, the size of the convolutional layer we discussed above becomes (k ⋅ d) × m.

This is done in parallel! Note that while I show you how a CNN "reads" a text, in practice these computations are done in parallel.
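In PyTorch terms, this is a 1D convolution: nn.Conv1d with in_channels = d (embedding size), out_channels = m (number of filters), and kernel_size = k. A quick shape check with illustrative sizes:

```python
import torch
import torch.nn as nn

d, m, k = 100, 64, 3                           # embedding size, number of filters, window size
conv = nn.Conv1d(in_channels=d, out_channels=m, kernel_size=k)

x = torch.randn(8, d, 20)                      # a batch of 8 texts, 20 token embeddings each
features = conv(x)                             # an m-dimensional feature vector per window
print(features.shape)                          # torch.Size([8, 64, 18]); 18 = 20 - k + 1
```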
Basics: Pooling Operation

After a convolution has extracted m features from each window, a pooling layer summarises the features in some region. Pooling layers are used to reduce the input dimension and, therefore, to reduce the number of parameters used by the network.

The most popular is max-pooling: it takes the maximum over each dimension, i.e. the maximum value of each feature.

Intuitively, each feature "fires" when it sees some pattern: a visual pattern in an image (a line, a texture, a cat's paw, etc) or a text pattern (e.g., a phrase). After a pooling operation, we have a vector saying which of these patterns occurred in the input.

Mean-pooling works similarly but computes the mean over each feature instead of the maximum.

Similarly to convolution, pooling is applied to windows of several elements. Pooling also has a stride parameter, and the most common approach is to use pooling with non-overlapping windows. For this, you have to set the stride parameter equal to the pool size.
The difference between pooling and global pooling is that pooling is applied over the features in each window independently, while global pooling is performed over the whole input. For texts, global pooling is often used to get a single vector representing the whole text; such global pooling is called max-over-time pooling, where the "time" axis goes from the first input token to the last.
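A small sketch contrasting local max-pooling over non-overlapping windows with global max-over-time pooling, continuing the shape example above:

```python
import torch
import torch.nn as nn

features = torch.randn(8, 64, 18)              # (batch, m filters, windows), e.g. a conv output

local_pool = nn.MaxPool1d(kernel_size=2, stride=2)   # non-overlapping windows of size 2
print(local_pool(features).shape)              # torch.Size([8, 64, 9])

global_pool = features.amax(dim=-1)            # max-over-time: one value per filter
print(global_pool.shape)                       # torch.Size([8, 64]) - one vector per text
```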
Convolutional Neural Networks for Text Classification

Now that we understand how convolution and pooling work, let's move on to the models. First, let us recall what we need: a model that can produce a fixed-sized vector for inputs of different lengths. Therefore, we need to construct a convolutional model that represents a text as a single vector.

The basic convolutional model for text classification is almost the same as what we saw before: the only thing that has changed is that we specified the type of pooling used. Specifically, after the convolution, we use global max-over-time pooling. This is the key operation: it allows us to compress a text into a single vector. The model itself can be different, but at some point it has to use global pooling to compress the input into a single vector.
Instead of picking one kernel size for your convolution, you can use several convolutions with different kernel sizes. The recipe is simple: apply each convolution to the data, add a non-linearity and global pooling after each of them, then concatenate the results. This is how you get the vector representation of the data which is used for classification (see the sketch below).
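A sketch of such a model: several convolutions with different kernel sizes, each followed by ReLU and global max-over-time pooling, with the pooled vectors concatenated and fed to a linear classifier. The hyperparameters are illustrative, not a prescription.

```python
import torch
import torch.nn as nn

class MultiKernelCNN(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=100, num_filters=64,
                 kernel_sizes=(3, 4, 5), num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, num_filters, k) for k in kernel_sizes])
        self.out = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, token_ids):                        # token_ids: (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)    # -> (batch, emb_dim, seq_len)
        pooled = [torch.relu(conv(x)).amax(dim=-1)       # ReLU + global max-over-time
                  for conv in self.convs]
        text_vector = torch.cat(pooled, dim=-1)          # concatenate per-kernel features
        return self.out(text_vector)                     # logits over classes

model = MultiKernelCNN()
logits = model(torch.randint(0, 10000, (4, 30)))         # 4 texts of 30 tokens
```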
This idea was used, among others, in the paper Convolutional Neural Networks for Sentence Classification and many follow-ups.
• Stack Several Blocks Convolution+Pooling

Instead of one layer, you can stack several convolution+pooling blocks on top of each other. After several blocks, you can apply another convolution, but with global pooling this time. Remember: you have to get a single fixed-sized vector - for this, you need global pooling.

Such multi-layered convolutions can be useful when your texts are very long; for example, if your model is character-level (as opposed to word-level).

This idea was used, among others, in the paper Character-level Convolutional Networks for Text Classification.
Multi-Label Classification

Multi-label classification is different from the single-label problems we discussed before in that each input can have several correct labels. For example, a tweet can have several hashtags, a user can have several topics of interest, etc.

For a multi-label problem, we need to change two things in the single-label pipeline we discussed before:

1. the model (how we evaluate class probabilities);
2. the loss function.

For single-label problems, we used softmax: it converts K values into a probability distribution, i.e. the sum of all probabilities is 1. It means that the classes share the same probability mass: if the probability of one class is high, the other classes cannot have large probabilities (Lena: Once again, imagine a bunch of kittens eating from the same bowl: one kitten always eats at the expense of the others).

For multi-label problems, we convert each of the K values into a probability of the corresponding class independently from the others. Specifically, we apply the sigmoid function

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

to each of the K values.
Intuitively, we can think of this as having K independent binary classifiers that use the same text representation.

Loss Function: Binary Cross-Entropy for Each Class

The loss function changes to enable multiple labels: for each class, we use the binary cross-entropy loss.
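A sketch of the multi-label setup in PyTorch: the same text vector, K logits, a sigmoid per class, and binary cross-entropy per class (nn.BCEWithLogitsLoss applies the sigmoid internally). Sizes and target labels are made up.

```python
import torch
import torch.nn as nn

K, hidden = 6, 128                              # 6 possible labels
head = nn.Linear(hidden, K)
loss_fn = nn.BCEWithLogitsLoss()                # sigmoid + binary cross-entropy per class

text_vectors = torch.randn(2, hidden)           # a batch of 2 text representations
targets = torch.tensor([[1., 0., 1., 0., 0., 1.],   # several labels can be 1 at once
                        [0., 0., 0., 1., 0., 0.]])

logits = head(text_vectors)
loss = loss_fn(logits, targets)                 # mean of K binary cross-entropies
probs = torch.sigmoid(logits)                   # independent probability for each label
```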
Practical Tips

Word Embeddings: train from scratch, or use pretrained (frozen or fine-tuned)?

Let's think about these options by looking at the data a model can use. Training data for classification is labeled and task-specific, but labeled data is usually hard to get. Therefore, this corpus is likely to be not huge (at the very least), or not diverse, or both. On the contrary, training data for word embeddings is not labeled - plain texts are enough. Therefore, these datasets can be huge and diverse - a lot to learn from.

Now let us think about what a model will know depending on what we do with the embeddings. If the embeddings are trained from scratch, the model will "know" only the classification data - this may not be enough to learn relationships between words well. But if we use pretrained embeddings, they (and, therefore, the whole model) will know a huge corpus - they will learn a lot about the world. To adapt these embeddings to your task-specific data, you can fine-tune them by training them together with the whole network - this can bring gains in performance (though not huge ones).
When we use pretrained embeddings, this is an example of transfer learning: through the embeddings, we "transfer" the knowledge of their training data to our task-specific model. We will learn more about transfer learning later in the course.

Fine-tune pretrained embeddings or not? Before training models, you can first think about why fine-tuning can be useful, and which types of examples can benefit from it. Learn more from this exercise in the Research Thinking section.

For more details and experiments with different settings for word embeddings, look at this paper summary.
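A sketch of the options in PyTorch: nn.Embedding.from_pretrained with freeze=True keeps pretrained vectors fixed, while freeze=False fine-tunes them together with the rest of the network; the weight matrix below is a random stand-in for, e.g., a loaded GloVe matrix.

```python
import torch
import torch.nn as nn

pretrained = torch.randn(10000, 100)            # stand-in for a loaded GloVe/Word2Vec matrix

frozen_emb = nn.Embedding.from_pretrained(pretrained, freeze=True)    # keep vectors fixed
tuned_emb = nn.Embedding.from_pretrained(pretrained, freeze=False)    # fine-tune with the model
scratch_emb = nn.Embedding(10000, 100)          # train from scratch (random initialization)

print(frozen_emb.weight.requires_grad, tuned_emb.weight.requires_grad)  # False True
```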
Data Augmentation

Data augmentation for images can be done easily: the standard augmentations include flipping an image, geometrical transformations (e.g. rotation or stretching along some direction), and covering parts of an image with different patches. For texts, there are several options of increasing complexity.
Word dropout is the simplest regularization: for each example, you choose some words randomly (say, each word with probability 10%) and replace the chosen words either with the special token UNK or with a random token from the vocabulary.

The motivation here is simple: we teach a model not to over-rely on individual tokens, but to take into consideration the context of the whole text. For example, if we mask great, the model has to understand the sentiment based on the other words.
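A minimal word-dropout sketch in plain Python: each token is replaced with an UNK token (or, alternatively, a random vocabulary token) with some small probability.

```python
import random

def word_dropout(tokens, p=0.1, unk="<unk>"):
    """Replace each token with UNK with probability p."""
    return [unk if random.random() < p else tok for tok in tokens]

print(word_dropout("this movie was absolutely great".split(), p=0.3))
```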
A bit more complicated approach is to replace words or phrases with their synonyms. The tricky part is getting these synonyms: you need external resources, and they are rarely available for languages other than English (for English, you can use e.g. WordNet). Another problem is that for languages with rich morphology (e.g., Russian), you are likely to violate grammatical agreement.
• Use separate models - even more complicated

An even more complicated method is to paraphrase whole sentences using external models. A popular paraphrasing method is to translate a sentence to some other language and back. We will learn how to train a translation model a bit later (in the Seq2seq and Attention lecture), but for now, you can use industrial systems, e.g. Yandex Translate, Google Translate, etc. (Lena: Obviously, personally I'm biased towards Yandex :) ) Note that you can combine translation systems and languages to get several paraphrases.
Note: the last two techniques are the text counterparts of geometric transformations for images: we want to change the text but preserve the meaning. This is different from word dropout, where some parts are lost completely.
Analysis and Interpretability
What do Convolutions Learn? Analyzing Convolutional Filters

Convolutions in Computer Vision: Visual Patterns

Convolutions were originally developed for images, and there's already a pretty good understanding of what the filters capture and how filters from different layers form a hierarchy. While lower layers capture simple visual patterns such as lines or circles, final layers can capture whole pictures, animals, people, etc. (For examples of patterns captured by convolution filters for images, see the Activation Atlas from distill.pub.)

This part is based on the paper Understanding Convolutional Neural Networks for Text Classification.

For images, filters capture local visual patterns which are important for classification. For text, such local patterns are word n-grams. The main findings on how CNNs work for texts are:
filters are not homogeneous, i.e. a single filter can, and often does, detect multiple distinctly different families of n-grams;

max-pooling induces a thresholding behavior: values below a given threshold are ignored when (i.e. are irrelevant to) making a prediction. For example, the paper shows that 40% of the pooled n-grams on average can be dropped with no loss of performance.
The simplest way to understand what a network captures is to look at which patterns activate its neurons. For convolutions, we pick a filter and find those n-grams which activate this filter most.

Examples of the top-1 n-grams for several filters (and, for one filter, the other n-grams which lead to its high activation) show that the n-grams activating the same filter have very similar meanings.
Research Thinking
How to

Read the short description at the beginning - this is our starting point, something known.

Read a question and think: for a minute, a day, a week, ... - give yourself some time! Even if you are not thinking about it constantly, something can still come to mind.

Look at the possible answers - previous attempts to answer/solve this problem.

Important: You are not supposed to come up with something exactly like here - remember, each paper usually takes the authors several months of work. It's the habit of thinking about these things that counts! All the rest a scientist needs is time: to try-fail-think until it works.

It's well known that you will learn something more easily if you are not just given the answer right away, but think about it first. Even if you don't want to be a researcher, this is still a good way to learn things!
Classical Approaches
? What other types of features can you come up with?

? Are all words equally needed for classification? If not, how can we modify the method?

? Imagine we want to use embeddings for sentiment classification. Can you find examples of antonyms such that, if their embeddings are very close, it would hurt sentiment classification? If you can, it means that it might be better to fine-tune!
Related Papers

How to

High-level: look at the key results in short summaries - get an idea of what's going on in the field.

A bit deeper: for topics which interest you more, read longer summaries with illustrations and explanations. Take a walk through the authors' reasoning steps and key observations.

In depth: read the papers you liked. Now that you've got the main idea, this is going to be easier!
What's inside:

Convolutions for Classification: Classics

Convolutional Neural Networks for Sentence Classification. Yoon Kim. EMNLP 2014.

Even a very simple CNN with one layer on top of word embeddings shows very good performance (without features requiring external knowledge!). The paper also shows the importance of using pretrained embeddings (rather than training from scratch) and the gains from fine-tuning.
Have Fun!

Coming soon!

We are still working on this!