Text Classification


Intro and Datasets

Text classification is an extremely popular task. You use text classifiers in your mail agent: they classify letters and filter spam. Other applications include document classification, review classification, etc.

Multi-class classification: many labels, only one correct.

Text classifiers are often used not as an individual task, but as part of bigger pipelines. For example, a voice assistant classifies your utterance to understand what you want (e.g., set the alarm, order a taxi, or just chat) and passes your message to different models depending on the classifier's decision. Another example is a web search engine: it can use classifiers to identify the query language, to predict the type of your query (e.g., informational, navigational, transactional), to understand whether you want to see pictures or video in addition to documents, etc.

Since most of the classification datasets assume that only one label is correct (you will see this right now!), in the lecture we deal with this type of classification, i.e. single-label classification. We mention multi-label classification in a separate section (Multi-Label Classification).

Datasets for Classification

Datasets for text classification are very different in terms of size (both dataset size and examples' size), what is classified, and the number of labels. Look at the statistics below.

| Dataset | Type | Number of labels | Size (train/test) | Avg. length (tokens) |
|---|---|---|---|---|
| SST | sentiment | 5 or 2 | 8.5k / 1.1k | 19 |
| IMDb Review | sentiment | 2 | 25k / 25k | 271 |
| Yelp Review | sentiment | 5 or 2 | 650k / 50k | 179 |
| Amazon Review | sentiment | 5 or 2 | 3m / 650k | 79 |
| TREC | question | 6 | 5.5k / 0.5k | 10 |
| Yahoo! Answers | question | 10 | 1.4m / 60k | 131 |
| AG's News | topic | 4 | 120k / 7.6k | 44 |
| Sogou News | topic | 6 | 54k / 6k | 737 |
| DBPedia | topic | 14 | 560k / 70k | 67 |

Some of the datasets can be downloaded here.

The most popular datasets are for sentiment classification. They consist of reviews of movies, places or restaurants, and products. There are also datasets for question type classification and topic classification.

To better understand typical classification tasks, below you can look at the examples from different datasets.

How to: pick a dataset and look at the examples to get a feeling of the task. Or you can come back to this later!


Dataset description (SST): look how the sentiment of a sentence is composed from its parts. Example - Label: 3. Review: "Makes even the claustrophobic on-board quarters seem fun."
General View
Here we provide a general view on classification and introduce the notation. This section applies to both classical and neural approaches.

We assume that we have a collection of documents with ground-truth labels. The input of a classifier is a document x = (x_1, …, x_n) with tokens x_1, …, x_n; the output is a label y ∈ {1, …, k}. Usually, a classifier estimates a probability distribution over classes, and we want the probability of the correct class to be the highest.

Get Feature Representation and Classify


Text classifiers have the following structure:

feature extractor
A feature extractor can be either manually defined (as in classical approaches) or learned (e.g., with neural networks).

classifier
A classifier has to assign class probabilities given the feature representation of a text. The most common way to do this is using logistic regression, but other variants are also possible (e.g., a Naive Bayes classifier or an SVM).

In this lecture, we'll mostly be looking at different ways to build feature representation of a text
and to use this representation to get class probabilities.

Generative and Discriminative Models


A classification model can be either generative or discriminative.

generative models
Generative models learn the joint probability distribution of data p(x, y) = p(x|y) ⋅ p(y). To make a prediction given an input x, these models pick the class with the highest joint probability: y = arg max_k p(x|y = k) ⋅ p(y = k).

discriminative models
Discriminative models are interested only in the conditional probability p(y|x), i.e. they learn only the border between classes. To make a prediction given an input x, these models pick the class with the highest conditional probability: y = arg max_k p(y = k|x).

In this lecture, we will meet both generative and discriminative models.

Classical Methods for Text Classification
In this part, we consider classical approaches for text classification. They were developed long before neural networks became popular, and for small datasets they can still perform comparably to neural models.

Lena: Later in the course, we will learn about transfer learning, which can make neural approaches better even for very small datasets. But let's take this one step at a time: for now, classical approaches are a good baseline for your models.
Naive Bayes Classifier


A high-level idea of the Naive Bayes approach: we rewrite the conditional class probability P(y = k|x) using Bayes's rule,

P(y = k|x) = P(x|y = k) ⋅ P(y = k) / P(x),

and, since P(x) does not depend on the class, pick the class with the highest P(x|y = k) ⋅ P(y = k).

This is a generative model!

Naive Bayes is a generative model: it models the joint probability of data.

Note also the terminology:

- prior probability P(y = k): class probability before looking at data (i.e., before knowing x);
- posterior probability P(y = k|x): class probability after looking at data (i.e., after knowing the specific x);
- joint probability P(x, y): the joint probability of data (i.e., both examples x and labels y);
- maximum a posteriori (MAP) estimate: we pick the class with the highest posterior probability.


How to define P(x|y=k) and P(y=k)?


P(y=k): count labels
P(y = k) is very easy to get: we can just evaluate the proportion of documents with the label k (this is the maximum likelihood estimate, MLE). Namely,

P(y = k) = N(y = k) / ∑_i N(y = i),

where N(y = k) is the number of examples (documents) with the label k.
• Naive Bayes
P(x|y=k): use the "naive" assumptions, then count
Here we assume that document x is represented as a set of features, e.g., a set of its words (x_1, …, x_n):

P(x|y = k) = P(x_1, …, x_n|y = k).

The Naive Bayes assumptions are:

- Bag of Words assumption: word order does not matter;
- Conditional Independence assumption: features (words) are independent given the class.

Intuitively, we assume that the probability of each word appearing in a document with class k does not depend on context (neither word order nor other words at all). For example, we can say that awesome, brilliant, great are more likely to appear in documents with a positive sentiment and awful, boring, bad are more likely in negative documents, but we know nothing about how these (or other) words influence each other.

With these "naive" assumptions we get:

P(x|y = k) = P(x_1, …, x_n|y = k) = ∏_{t=1}^{n} P(x_t|y = k).

The probabilities P(x_i|y = k) are estimated as the proportion of times the word x_i appeared in documents of class k among all tokens in these documents:

P(x_i|y = k) = N(x_i, y = k) / ∑_{t=1}^{|V|} N(x_t, y = k),

where N(x_i, y = k) is the number of times the token x_i appeared in documents with the label k, and V is the vocabulary (more generally, the set of all possible features).

What if N(x_i, y = k) = 0? Need to avoid this!

What if N(x_i, y = k) = 0, i.e. in training we haven't seen the token x_i in the documents with class k? This will null out the probability of the whole document, and this is not what we want! For example, if we haven't seen some rare words (e.g., pterodactyl or abracadabra) in training positive examples, it does not mean that a positive document can never contain these words.

To avoid this, we'll use a simple trick: we add a small δ to the counts of all words:

P(x_i|y = k) = (δ + N(x_i, y = k)) / ∑_{t=1}^{|V|} (δ + N(x_t, y = k)) = (δ + N(x_i, y = k)) / (δ ⋅ |V| + ∑_{t=1}^{|V|} N(x_t, y = k)),

where δ can be chosen using cross-validation.

Note: this is Laplace smoothing (aka add-1 smoothing if δ = 1). We'll learn more about smoothing in the next lecture when talking about Language Modeling.
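To make the counting concrete, here is a minimal sketch of these estimates with add-δ smoothing; the function name, the tiny toy corpus, and the label names are made up for illustration.

```python
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels, delta=1.0):
    """Estimate P(y=k) and P(x_i|y=k) with add-delta (Laplace) smoothing.

    docs   -- list of token lists, e.g. [["great", "movie"], ["boring", "plot"]]
    labels -- list of class labels, one per document
    """
    label_counts = Counter(labels)                  # N(y=k)
    word_counts = defaultdict(Counter)              # N(x_i, y=k)
    vocab = set()
    for tokens, y in zip(docs, labels):
        word_counts[y].update(tokens)
        vocab.update(tokens)

    priors = {y: n / len(labels) for y, n in label_counts.items()}   # P(y=k)
    likelihoods = {}                                                  # P(x_i|y=k)
    for y, counts in word_counts.items():
        total = sum(counts.values()) + delta * len(vocab)
        likelihoods[y] = {w: (delta + counts[w]) / total for w in vocab}
    return priors, likelihoods, vocab
```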


Making a Prediction
As we already mentioned, Naive Bayes (and, more broadly, generative models) makes a prediction based on the joint probability of data and class:

y = arg max_k P(x, y = k) = arg max_k P(y = k) ⋅ P(x|y = k).

Intuitively, Naive Bayes expects that some words serve as class indicators. For example, for sentiment classification, tokens awesome, brilliant, great will have higher probability given the positive class than the negative one. Similarly, tokens awful, boring, bad will have higher probability given the negative class than the positive one.

Final Notes on Naive Bayes

Practical Note: Sum of Log-Probabilities Instead of Product of Probabilities

The main expression Naive Bayes uses for classification is a product of many probabilities:

P(x, y = k) = P(y = k) ⋅ P(x_1, …, x_n|y = k) = P(y = k) ⋅ ∏_{t=1}^{n} P(x_t|y = k).

A product of many probabilities may be very unstable numerically. Therefore, usually instead of P(x, y) we consider log P(x, y):

log P(x, y = k) = log P(y = k) + ∑_{t=1}^{n} log P(x_t|y = k).

Since we care only about the argmax, we can consider log P(x, y) instead of P(x, y).

Important! Note that in practice, we will usually deal with log-probabilities and not probabilities.
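A minimal sketch of prediction with summed log-probabilities, continuing the hypothetical train_naive_bayes sketch above:

```python
import math

def predict(tokens, priors, likelihoods):
    """Pick the class with the highest log P(y=k) + sum_t log P(x_t|y=k)."""
    best_label, best_score = None, float("-inf")
    for y, prior in priors.items():
        score = math.log(prior)
        for w in tokens:
            if w in likelihoods[y]:          # out-of-vocabulary words are simply skipped here
                score += math.log(likelihoods[y][w])
        if score > best_score:
            best_label, best_score = y, score
    return best_label
```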

View in the General Framework

Remember our general view on the classification task? We obtain a feature representation of the input text using some method, then use this feature representation for classification.

In Naive Bayes, our features are words, and the feature representation is the Bag-of-Words (BOW) representation - a sum of one-hot representations of words. Indeed, to evaluate P(x, y) we only need to count the number of times each token appeared in the text.

Feature Design

In the standard setting, we used words as features. However, you can use other types of
features: URL, user id, etc.

Even if your data is plain text (without fancy things such as URL, user id, etc.), you can still design features in different ways. Learn how to improve Naive Bayes in this exercise in the Research Thinking section.

Maximum Entropy Classifier (aka Logistic Regression)

Differently from Naive Bayes, the MaxEnt classifier is a discriminative model, i.e., we are interested in P(y = k|x) and not in the joint distribution p(x, y). Also, the model will learn how to use the features: this is in contrast to Naive Bayes, where we defined how to use the features ourselves.

Here we also have to define features manually, but we have more freedom: features do not have to be categorical (in Naive Bayes, they had to!). We can use the BOW representation or come up with something more interesting.
• SVM
The general classification pipeline here is as follows:

- get h = (f_1, f_2, …, f_n) - the feature representation of the input text;
- take w^(k) = (w_1^(k), …, w_n^(k)) - vectors with feature weights for each of the classes;
- for each class, weigh the features, i.e. take the dot product of the feature representation h with the feature weights w^(k):

  w^(k) h = w_1^(k) ⋅ f_1 + ⋯ + w_n^(k) ⋅ f_n,   k = 1, …, K.

  To get a bias term in the sum above, we define one of the features to be 1 (e.g., f_0 = 1). Then

  w^(k) h = w_0^(k) + w_1^(k) ⋅ f_1 + ⋯ + w_n^(k) ⋅ f_n,   k = 1, …, K.

- get class probabilities using softmax:

  P(class = k|h) = exp(w^(k) h) / ∑_{i=1}^{K} exp(w^(i) h).

  Softmax normalizes the K values we got at the previous step into a probability distribution over the output classes.

Look at the illustration below (classes are shown in different colors).
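A minimal numpy sketch of this pipeline; the feature values, weights, and sizes below are made-up placeholders:

```python
import numpy as np

def maxent_probs(h, W, b):
    """Class probabilities for a feature vector h.

    h -- feature representation, shape (n,)
    W -- feature weights, one row per class, shape (K, n)
    b -- bias terms, shape (K,)
    """
    scores = W @ h + b                      # w^(k) h for each class k
    scores -= scores.max()                  # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()    # softmax

h = np.array([1.0, 0.0, 2.0])               # e.g., counts of three hand-picked features
W = np.random.randn(4, 3) * 0.1             # 4 classes, 3 features
b = np.zeros(4)
print(maxent_probs(h, W, b))                # sums to 1
```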

Training: Maximum Likelihood Estimate


Given training examples x^1, …, x^N with labels y^1, …, y^N, y^i ∈ {1, …, K}, we pick the weights w^(k), k = 1..K, which maximize the probability of the training data:

w^* = arg max_w ∑_{i=1}^{N} log P(y = y^i | x^i).

In other words, we choose parameters such that the data is more likely to appear. Therefore, this is called the Maximum Likelihood Estimate (MLE) of the parameters.

To find the parameters maximizing the data log-likelihood, we use gradient ascent: we gradually improve the weights during multiple iterations over the data. At each step, we maximize the probability a model assigns to the correct class.

Equivalence to minimizing cross-entropy

Note that maximizing the data log-likelihood is equivalent to minimizing the cross-entropy between the target probability distribution p^* = (0, …, 0, 1, 0, …) (1 for the target label, 0 for the rest) and the distribution p = (p_1, …, p_K), p_i = p(i|x), predicted by the model:

Loss(p^*, p) = - p^* log(p) = - ∑_{i=1}^{K} p_i^* log(p_i).

Since only one of the p_i^* is non-zero (1 for the target label k, 0 for the rest), we will get

Loss(p^*, p) = - log(p_k) = - log(p(k|x)).

This equivalence is very important for you to understand: when talking about neural approaches, people usually say that they minimize the cross-entropy loss. Do not forget that this is the same as maximizing the data log-likelihood.
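A tiny numerical check of this identity (the probabilities here are arbitrary made-up numbers):

```python
import numpy as np

p = np.array([0.1, 0.7, 0.2])            # predicted distribution over 3 classes
p_star = np.array([0.0, 1.0, 0.0])       # one-hot target, correct class k = 1
cross_entropy = -(p_star * np.log(p)).sum()
assert np.isclose(cross_entropy, -np.log(p[1]))   # equals -log p_k
```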
Analysis and Interpretability
Naive Bayes vs Logistic Regression
Let's finalize this part by discussing the advantages and drawbacks of logistic regression and Naive Bayes.

simplicity
Both methods are simple; Naive Bayes is the simplest one.

interpretability
Both methods are interpretable: you can look at the features which influenced the predictions most (in Naive Bayes - usually words, in logistic regression - whatever you defined).

training speed
Naive Bayes is very fast to train - it requires only one pass through the training data to evaluate the counts. For logistic regression, this is not the case: you have to go over the data many times until the gradient ascent converges.

independence assumptions
Naive Bayes is too "naive" - it assumes that features (words) are conditionally independent given the class. Logistic regression does not make this assumption - we can hope it is better.

text representation: manual
Both methods use a manually defined feature representation (in Naive Bayes, BOW is the standard choice, but you still choose this yourself). While manually defined features are good for interpretability, they may be not so good for performance - you are likely to miss something that could be useful for the task.


SVM for Text Classification


One more method for text classification based on manually designed features is SVM. The most basic (and popular) features for SVMs are bag-of-words and bag-of-ngrams (an ngram is a tuple of n words). With these simple features, SVMs with a linear kernel perform better than Naive Bayes (see, for example, the paper Question Classification using Support Vector Machines).
Text Classification with Neural Networks

Instead of manually defined features, let a neural network learn useful features.

The main idea of neural-network-based classification is that the feature representation of the input text can be obtained using a neural network. In this setting, we feed the embeddings of the input tokens to a neural network, and this neural network gives us a vector representation of the input text. After that, this vector is used for classification.
When dealing with neural networks, we can think about the classification part (i.e., how to get class probabilities from a vector representation of a text) in a very simple way.

The vector representation of a text has some dimensionality d, but in the end, we need a vector of size K (probabilities for K classes). To get a K-sized vector from a d-sized one, we can use a linear layer. Once we have a K-sized vector, all that is left is to apply the softmax operation to convert the raw numbers into class probabilities.

Classification Part: This is Logistic Regression!

Let us look closer at the neural network classifier. The way we use the vector representation of the input text is exactly the same as we did with logistic regression: we weigh features according to feature weights for each class. The only difference from logistic regression is where the features come from: they are either defined manually (as we did before) or obtained by a neural network.
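A minimal sketch of this classification head in PyTorch; the dimensionality d, the number of classes K, and the text vector below are placeholders:

```python
import torch
import torch.nn as nn

d, K = 128, 4                              # text vector size and number of classes
head = nn.Linear(d, K)                     # the final linear layer: d-sized vector -> K scores

text_vector = torch.randn(1, d)            # stand-in for a network-produced text representation
logits = head(text_vector)                 # raw scores, shape (1, K)
probs = torch.softmax(logits, dim=-1)      # class probabilities, sum to 1
```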

Intuition: Text Representation Points in the Direction of Class Representation

If we look at this final linear layer more closely, we will see that the columns of its matrix are vectors w_i. These vectors can be thought of as vector representations of classes. A good neural network will learn to represent input texts in such a way that text vectors will point in the direction of the corresponding class vectors.
Training and the Cross-Entropy Loss

Neural classifiers are trained to predict probability distributions over classes. Intuitively, at each step we maximize the probability a model assigns to the correct class.

The standard loss function is the cross-entropy loss. The cross-entropy loss for the target probability distribution p^* = (0, …, 0, 1, 0, …) (1 for the target label, 0 for the rest) and the distribution p = (p_1, …, p_K), p_i = p(i|x), predicted by the model, is

Loss(p^*, p) = - p^* log(p) = - ∑_{i=1}^{K} p_i^* log(p_i).

Since only one of the p_i^* is non-zero (1 for the target label k, 0 for the rest), we will get Loss(p^*, p) = - log(p_k) = - log(p(k|x)). Look at the illustration for one training example.

In training, we gradually improve model weights during multiple iterations over the data: we iterate over training examples (or batches of examples) and make gradient updates. At each step, we maximize the probability a model assigns to the correct class. At the same time, we minimize the sum of the probabilities of the incorrect classes: since the sum of all probabilities is constant, by increasing one probability we decrease the sum of all the rest (Lena: Here I usually imagine a bunch of kittens eating from the same bowl: one kitten always eats at the expense of the others).

Look at the illustration of the training process.
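A minimal sketch of one training step with the cross-entropy loss in PyTorch; the encoder producing text vectors is stubbed out with random tensors, and all sizes are placeholders:

```python
import torch
import torch.nn as nn

d, K = 128, 4
head = nn.Linear(d, K)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()            # expects raw logits, applies softmax internally

text_vectors = torch.randn(32, d)          # a batch of 32 text representations (stub)
labels = torch.randint(0, K, (32,))        # correct classes for the batch (stub)

logits = head(text_vectors)
loss = loss_fn(logits, labels)             # = -log p(correct class), averaged over the batch
loss.backward()                            # the gradient step maximizes the data log-likelihood
optimizer.step()
optimizer.zero_grad()
```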

Recap: This is equivalent to maximizing the data likelihood

Do not forget that when talking about the MaxEnt classifier (logistic regression), we showed that minimizing cross-entropy is equivalent to maximizing the data likelihood. Therefore, here we are also trying to get the Maximum Likelihood Estimate (MLE) of model parameters.
Models for Text Classification
We need a model that can produce a fixed-sized vector for inputs of different lengths.

In this part, we will look at different ways to get a vector representation of an input text using neural networks. Note that while input texts can have different lengths, the vector representation of a text has to have a fixed size: otherwise, a network will not "work".

We begin with the simplest approaches, which use only word embeddings (without adding a model on top of that). Then we look at recurrent and convolutional networks.

Lena: A bit later in the course, you will learn about Transformers and the most recent classification techniques using large pretrained models.

Basics: Bag of Embeddings (BOE) and Weighted BOE


The simplest thing you can do is use only word embeddings without any neural network on top of them. To get a vector representation of a text, we can either sum all token embeddings (Bag of Embeddings) or use a weighted sum of these embeddings (with the weights being, for example, tf-idf or something else).

Bag of Embeddings (ideally, along with Naive Bayes) should be a baseline for any model with a neural network: if you can't do better than that, it's not worth using NNs at all. This can be the case if you don't have much data.

While Bag of Embeddings (BOE) is sometimes called Bag of Words (BOW), note that these two are very different. BOE is the sum of embeddings and BOW is the sum of one-hot vectors: BOE knows a lot more about language. Pretrained embeddings (e.g., Word2Vec or GloVe) understand similarity between words. For example, awesome, brilliant, great will be represented with unrelated features in BOW but similar word vectors in BOE.

Note also that to use a weighted sum of embeddings, you need to come up with a way to get the weights. However, this is exactly what we wanted to avoid by using neural networks: we don't want to introduce manual features, but rather let a network learn useful patterns.
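A minimal sketch of Bag of Embeddings; the vocabulary and the embedding matrix here are random stand-ins for real pretrained vectors such as Word2Vec or GloVe:

```python
import numpy as np

vocab = {"awesome": 0, "boring": 1, "movie": 2}
emb = np.random.randn(len(vocab), 50)            # stand-in for pretrained 50-d embeddings

def bag_of_embeddings(tokens, weights=None):
    """Sum (or weighted sum, e.g. with tf-idf weights) of token embeddings."""
    vectors = [emb[vocab[t]] for t in tokens if t in vocab]
    if weights is None:
        weights = [1.0] * len(vectors)
    return sum(w * v for w, v in zip(weights, vectors))

text_vector = bag_of_embeddings(["awesome", "movie"])   # fixed-size vector, here 50-d
```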
• Training
Bag of Embeddings as Features for SVM

You can use an SVM on top of BOE! The only difference from SVMs in classical approaches (on top of bag-of-words and bag-of-ngrams) is the choice of a kernel: here the RBF kernel is better.
Models: Recurrent (RNN/LSTM/etc)

Recurrent networks are a natural way to process text in the sense that, similar to humans, they "read" a sequence of tokens one by one and process the information. Hopefully, at each step the network will "remember" everything it has read before.

Basics: Recurrent Neural Networks

• RNN cell

At each step, a recurrent network receives a new input vector (e.g., a token embedding) and the previous network state (which, hopefully, encodes all previous information). Using this input, the RNN cell computes the new state, which it gives as output. This new state now contains information about both the current input and the information from previous steps.

• RNN reads a sequence of tokens

Look at the illustration: an RNN reads a text token by token, at each step using a new token embedding and the previous state.

Note that the RNN cell is the same at each step!

• Vanilla RNN

The simplest recurrent network, the Vanilla RNN, transforms h_{t-1} and x_t linearly, then applies a non-linearity (most often, the tanh function):

h_t = tanh(h_{t-1} W_h + x_t W_t).

Vanilla RNNs suffer from the vanishing and exploding gradients problem. To alleviate this problem, more complex recurrent cells (e.g., LSTM, GRU, etc.) perform several operations on the input and use gates. For more details on RNN basics, look at Colah's blog post.

Recurrent Neural Networks for Text Classification

Here we (finally!) look at how we can use recurrent models for text classification. Everything you will see here applies to all recurrent cells, and by "RNN" in this part I refer to recurrent cells in general (e.g. vanilla RNN, LSTM, GRU, etc.).

Let us recall what we need:

We need a model that can produce a fixed-sized vector for inputs of different lengths.

• Simple: read a text, take the final state

The simplest recurrent model is a one-layer RNN. In this network, we have to take the state which knows most about the input text. Therefore, we have to use the last state - only this state saw all input tokens.
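A minimal PyTorch sketch of this simple approach (the vocabulary size, dimensions, class count, and batch below are placeholders):

```python
import torch
import torch.nn as nn

class RNNClassifier(nn.Module):
    def __init__(self, vocab_size=10_000, emb_dim=100, hidden_dim=128, num_classes=4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):                     # token_ids: (batch, seq_len)
        embeddings = self.embedding(token_ids)
        _, (h_n, _) = self.rnn(embeddings)            # h_n: final state, (1, batch, hidden_dim)
        return self.head(h_n[-1])                     # logits, (batch, num_classes)

model = RNNClassifier()
logits = model(torch.randint(0, 10_000, (8, 20)))     # a batch of 8 texts of length 20
```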
• Multiple layers: feed the states from one RNN to the next one

To get a better text representation, you can stack multiple layers. In this case, the inputs for the higher RNN are the representations coming from the previous layer.

The main hypothesis is that with several layers, lower layers will catch local phenomena (e.g., phrases), while higher layers will be able to learn more high-level things (e.g., topic).

• Bidirectional: use final states from forward and backward RNNs

Previous approaches may have a problem: the last state can easily "forget" earlier tokens. Even strong models such as LSTMs can still suffer from that!

To avoid this, we can use two RNNs: a forward one, which reads the input from left to right, and a backward one, which reads the input from right to left. Then we can use the final states from both models: one will better remember the final part of a text, the other - the beginning. These states can be concatenated, or summed, or something else - it's your choice!

• Combinations: do everything you want!

You can combine the ideas above. For example, in a multi-layered network, some layers can go in the opposite direction, etc.

Models: Convolutional (CNN)


The detailed description of convolutional models in general is in Convolutional Models
Supplementary. In this part, we consider only convolutions for text classification.

Convolutions for Images and Translation Invariance

Convolutional networks were originally developed for computer vision tasks. Therefore, let's first understand the intuition behind convolutional models for images.

Imagine we want to classify an image into several classes, e.g. cat, dog, airplane, etc. In this case, if you find a cat in an image, you don't care where in the image this cat is: you care only that it is there somewhere.

Convolutional networks apply the same operation to small parts of an image: this is how they extract features. Each operation is looking for a match with a pattern, and a network learns which patterns are useful. With a lot of layers, the learned patterns become more and more complicated: from lines in the early layers to very complicated patterns (e.g., the whole cat or dog) in the upper ones. You can look at the examples in the Analysis and Interpretability section. (The illustration is adapted from the one taken from this cool repo.)

This property is called translation invariance: translation because we are talking about shifts in space, invariance because we want it to not matter.
• Models: Recurrent
Convolutions for Text
• Models: Convolutional
Well, for images it's all clear: e.g. we want to be able to move a cat because we don't care where the cat is. But what about texts? At first glance, this is not so straightforward: we cannot move phrases easily - the meaning will change or we will get something that does not make much sense.

However, there are some applications where we can think of the same intuition. Let's imagine that we want to classify texts, but not into cats/dogs as with images, but into positive/negative sentiment. Then there are some words and phrases which could be very informative "clues" (e.g. it's been great, bored to death, absolutely amazing, the best ever, etc.), and others which are not important at all. We don't care much where in a text we saw bored to death to understand the sentiment, right?

A Typical Model: Convolution+Pooling Blocks

Following the intuition above, we want to detect some patterns, but we don't care much where exactly these patterns are. This behavior is implemented with two layers:

- convolution: finds matches with patterns (such as the cat head we saw above);
- pooling: aggregates these matches over positions (either locally or globally).

A typical convolutional model for text classification is shown in the figure. To get a vector representation of an input text, a convolutional layer is applied to word embeddings, followed by a non-linearity (usually ReLU) and a pooling operation. The way this representation is used for classification is the same as in the other networks.



In the following, we discuss in detail the main building blocks, convolution and pooling, then consider modeling modifications.

Basics: Convolution Layer for Text

Convolutional Neural Networks were initially developed for computer vision tasks, e.g. classification of images (cats vs dogs, etc.). The idea of a convolution is to go over an image with a sliding window and to apply the same operation, a convolution filter, to each window.

The illustration (taken from this cool repo) shows this process for one filter: the bottom is the input image, the top is the filter output. Since an image has two dimensions (width and height), the convolution is two-dimensional. (Illustration: convolution filter for images.)

Differently from images, texts have only one dimension: here a convolution is one-dimensional - look at the illustration. (Illustration: convolution filter for text.)
Convolution is a Linear Operation Applied to Each Window

A convolution is a linear layer (followed by a non-linearity) applied to each input window.


Formally, let us assume that

- (x_1, …, x_n) - the representations of the input words, x_i ∈ R^d;
- d (input channels) - the size of an input embedding;
- k (kernel size) - the length of a convolution window (on the illustration, k = 3);
- m (output channels) - the number of convolution filters (i.e., the number of channels produced by the convolution).

Then a convolution is a linear layer W ∈ R^{(k⋅d)×m}. For a k-sized window (x_i, …, x_{i+k-1}), the convolution takes the concatenation of these vectors

u_i = [x_i, …, x_{i+k-1}] ∈ R^{k⋅d}

and multiplies it by the convolution matrix:

F_i = u_i × W.

A convolution goes over an input with a sliding window and applies the same linear transformation to each window.
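A minimal numpy sketch of this view of a convolution; the dimensions n, d, k, m and the random inputs are arbitrary placeholders:

```python
import numpy as np

n, d, k, m = 10, 50, 3, 8                    # tokens, embedding size, kernel size, filters
X = np.random.randn(n, d)                    # (x_1, ..., x_n), token representations
W = np.random.randn(k * d, m)                # the convolution as a linear layer

features = []
for i in range(n - k + 1):                   # slide a window over the text
    u_i = X[i:i + k].reshape(-1)             # concatenation [x_i, ..., x_{i+k-1}], size k*d
    features.append(u_i @ W)                 # F_i = u_i W, m features for this window
F = np.stack(features)                       # shape (n - k + 1, m)
```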

Intuition: Each Filter Extracts a Feature

Intuitively, each filter in a convolution extracts a feature.

• One filter - one feature extractor


A filter takes the vector representations in the current window and transforms them linearly into a single feature. Formally, for a window u_i = [x_i, …, x_{i+k-1}] ∈ R^{k⋅d}, a filter f ∈ R^{k⋅d} computes the dot product:

F_i^(f) = (f, u_i).

The number F_i^(f) (the extracted "feature") is the result of applying the filter f to the window (x_i, …, x_{i+k-1}).

• m filters: m feature extractors

One filter extracts a single feature. Usually, we want many features: for this, we have to take several filters. Each filter reads an input text and extracts a different feature - look at the illustration. The number of filters is the number of output features you want to get. With m filters instead of one, the size of the convolutional layer we discussed above will become (k ⋅ d) × m.

This is done in parallel! Note that while I show you how a CNN "reads" a text, in practice these computations are done in parallel.
Basics: Pooling Operation

After a convolution has extracted m features from each window, a pooling layer summarises the features in some region. Pooling layers are used to reduce the input dimension and, therefore, to reduce the number of parameters used by the network.

• Max and Mean Pooling

The most popular is max-pooling: it takes the maximum over each dimension, i.e. the maximum value of each feature.

Intuitively, each feature "fires" when it sees some pattern: a visual pattern in an image (a line, a texture, a cat's paw, etc.) or a text pattern (e.g., a phrase). After a pooling operation, we have a vector saying which of these patterns occurred in the input.

Mean-pooling works similarly but computes the mean over each feature instead of the maximum.

• Pooling and Global Pooling

Similarly to convolution, pooling is applied to windows of several elements. Pooling also has a stride parameter, and the most common approach is to use pooling with non-overlapping windows. For this, you have to set the stride parameter equal to the pool size. Look at the illustration.

The difference between pooling and global pooling is that pooling is applied over the features in each window independently, while global pooling operates over the whole input. For texts, global pooling is often used to get a single vector representing the whole text; such global pooling is called max-over-time pooling, where the "time" axis goes from the first input token to the last.
Convolutional Neural Networks for Text Classification

Now that we understand how convolution and pooling work, let's come to modeling modifications. First, let us recall what we need:

We need a model that can produce a fixed-sized vector for inputs of different lengths.

Therefore, we need to construct a convolutional model that represents a text as a single vector.

The basic convolutional model for text classification is shown in the figure. It is almost the same as we saw before: the only thing that's changed is that we specified the type of pooling used. Specifically, after the convolution, we use global max-over-time pooling. This is the key operation: it allows us to compress a text into a single vector. The model itself can be different, but at some point it has to use global pooling to compress the input into a single vector.

• Several Convolutions with Different Kernel Sizes

Instead of picking one kernel size for your convolution, you can use several convolutions with different kernel sizes. The recipe is simple: apply each convolution to the data, add a non-linearity and global pooling after each of them, then concatenate the results (on the illustration, the non-linearity is omitted for simplicity). This is how you get the vector representation of the data which is used for classification.
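A minimal PyTorch sketch of this multi-kernel recipe, in the spirit of the paper Convolutional Neural Networks for Sentence Classification referenced just below; all sizes and the random batch are placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, vocab_size=10_000, emb_dim=100, num_filters=100,
                 kernel_sizes=(3, 4, 5), num_classes=4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, num_filters, k) for k in kernel_sizes])
        self.head = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, token_ids):                      # (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)  # (batch, emb_dim, seq_len)
        pooled = [F.relu(conv(x)).max(dim=2).values    # ReLU + global max-over-time pooling
                  for conv in self.convs]
        return self.head(torch.cat(pooled, dim=1))     # concatenate, then classify

logits = TextCNN()(torch.randint(0, 10_000, (8, 20)))
```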

This idea was used, among others, in the paper Convolutional Neural Networks for Sentence Classification and in many follow-ups.

• Stack Several Convolution+Pooling Blocks

Instead of one layer, you can stack several convolution+pooling blocks on top of each other. After several blocks, you can apply another convolution, but with global pooling this time. Remember: you have to get a single fixed-sized vector - for this, you need global pooling.

Such multi-layered convolutions can be useful when your texts are very long; for example, if your model is character-level (as opposed to word-level).
This idea was used, among others, in the paper Character-level Convolutional Networks for Text Classification.
Multi-Label Classification
Multi-label classification is different from the single-label problems we discussed before in that each input can have several correct labels. For example, a tweet can have several hashtags, a user can have several topics of interest, etc.

Multi-label classification: many labels, several can be correct.

For a multi-label problem, we need to change two things in the single-label pipeline we discussed before:

1. the model (how we evaluate class probabilities);
2. the loss function.

Model: Softmax → Element-wise Sigmoid

After the last linear layer, we have K values corresponding to the K classes - these are the values we have to convert to class probabilities.

For single-label problems, we used softmax: it converts K values into a probability distribution, i.e. the sum of all probabilities is 1. It means that the classes share the same probability mass: if the probability of one class is high, the other classes cannot have large probability (Lena: Once again, imagine a bunch of kittens eating from the same bowl: one kitten always eats at the expense of the others).

For multi-label problems, we convert each of the K values into a probability of the corresponding class independently from the others. Specifically, we apply the sigmoid function σ(x) = 1 / (1 + e^{-x}) to each of the K values.

Intuitively, we can think of this as having K independent binary classifiers that use the same text representation.

Loss Function: Binary Cross-Entropy for Each Class

The loss function changes to enable multiple labels: for each class, we use the binary cross-entropy loss. Look at the illustration.
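A minimal sketch of the multi-label setup in PyTorch; the sizes and the random targets are placeholders:

```python
import torch
import torch.nn as nn

d, K = 128, 5
head = nn.Linear(d, K)
loss_fn = nn.BCEWithLogitsLoss()               # element-wise sigmoid + binary cross-entropy

text_vectors = torch.randn(8, d)               # a batch of 8 text representations (stub)
targets = torch.randint(0, 2, (8, K)).float()  # several labels per example may be 1

logits = head(text_vectors)
probs = torch.sigmoid(logits)                  # independent per-class probabilities
loss = loss_fn(logits, targets)                # binary cross-entropy for each class, averaged
```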

Practical Tips

Word Embeddings: how to deal with them?

The input for a network is represented by word embeddings. You have three options for how to get these embeddings for your model:

- train them from scratch as part of your model;
- take pretrained embeddings (Word2Vec, GloVe, etc.) and fix them (use them as static vectors);
- initialize with pretrained embeddings and train them with the network ("fine-tune").

Let's think about these options by looking at the data a model can use. Training data for classification is labeled and task-specific, but labeled data is usually hard to get. Therefore, this corpus is likely to be not huge (at the very least), or not diverse, or both. On the contrary, training data for word embeddings is not labeled - plain texts are enough. Therefore, these datasets can be huge and diverse - a lot to learn from.

Now let us think about what a model will know depending on what we do with the embeddings. If the embeddings are trained from scratch, the model will "know" only the classification data - this may not be enough to learn relationships between words well. But if we use pretrained embeddings, they (and, therefore, the whole model) will know a huge corpus - they will learn a lot about the world. To adapt these embeddings to your task-specific data, you can fine-tune them by training them with the whole network - this can bring gains in performance (though not huge ones).
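A minimal PyTorch sketch of the three options; the pretrained matrix here is a random stand-in for loaded Word2Vec/GloVe vectors:

```python
import torch
import torch.nn as nn

vocab_size, emb_dim = 10_000, 100
pretrained = torch.randn(vocab_size, emb_dim)    # stand-in for loaded pretrained vectors

# 1) train from scratch as part of the model
emb_scratch = nn.Embedding(vocab_size, emb_dim)

# 2) use pretrained embeddings as fixed static vectors
emb_static = nn.Embedding.from_pretrained(pretrained, freeze=True)

# 3) initialize with pretrained embeddings and fine-tune them with the network
emb_finetune = nn.Embedding.from_pretrained(pretrained, freeze=False)
```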

When we use pretrained embeddings, this is an example of transfer learning: through the embeddings, we "transfer" the knowledge of their training data to our task-specific model. We will learn more about transfer learning later in the course.

Fine-tune pretrained embeddings or not? Before training models, you can first think about why fine-tuning can be useful, and which types of examples can benefit from it. Learn more from this exercise in the Research Thinking section.

For more details and experiments with different settings for word embeddings, look at this paper summary.

Data Augmentation: Get More Data for Free

Data augmentation alters your dataset in different ways to get alternative versions of the same training example. Data augmentation can increase:

the amount of data
The quality of your model depends a lot on your data. For deep learning models, having large datasets is very (very!) important.

the diversity of data
By giving different versions of training examples, you teach a model to be more robust to real-world data, which can be of lower quality or simply a bit different from your training data. With augmented data, a model is less likely to overfit to specific types of training examples and will rely more on general patterns.

Data augmentation for images can be done easily: look at the examples below. The standard augmentations include flipping an image, geometric transformations (e.g. rotation and stretching along some direction), and covering parts of an image with different patches.

How can we do something similar for texts?

• word dropout - the simplest and most popular

Word dropout is the simplest regularization: for each example, you choose some words randomly (say, each word is chosen with probability 10%) and replace the chosen words either with the special token UNK or with a random token from the vocabulary.

The motivation here is simple: we teach a model not to over-rely on individual tokens, but to take into consideration the context of the whole text. For example, here we masked great, and a model has to understand the sentiment based on the other words.
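A minimal sketch of word dropout; the UNK token and 10% probability follow the description above, and the vocabulary is a placeholder:

```python
import random

def word_dropout(tokens, vocab, p=0.1, unk="UNK", use_random=False):
    """Replace each token with UNK (or a random vocabulary token) with probability p."""
    out = []
    for t in tokens:
        if random.random() < p:
            out.append(random.choice(vocab) if use_random else unk)
        else:
            out.append(t)
    return out

print(word_dropout(["the", "movie", "was", "great"], vocab=["the", "movie", "was", "great"]))
```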

Note: For images, this corresponds to masking out some areas.


By masking out an area of an image, we also want a model not
to over-rely on local features and to make use of a more global
context.

• use external resources (e.g., thesaurus) - a bit more complicated

A bit more complicated approach is to replace words or phrases with their synonyms. The tricky part is getting these synonyms: you need external resources, and they are rarely available for languages other than English (for English, you can use e.g. WordNet). Another problem is that for languages with rich morphology (e.g., Russian) you are likely to violate grammatical agreement.
• use separate models - even more complicated

An even more complicated method is to paraphrase whole sentences using external models. A popular paraphrasing method is to translate a sentence to some language and back. We will learn how to train a translation model a bit later (in the Seq2seq and Attention lecture), but for now, you can use industrial systems, e.g. Yandex Translate, Google Translate, etc. (Lena: Obviously, personally I'm biased towards Yandex :) ) Note that you can combine translation systems and languages to get several paraphrases.

Note: For images, the last two techniques correspond to geometric transformations: we want to change the text, but to preserve the meaning. This is different from word dropout, where some parts are lost completely.
Analysis and Interpretability
What do Convolutions Learn? Analyzing Convolutional Filters
Convolutions in Computer Vision: Visual Patterns

Convolutions were originally developed for images, and there's already a pretty good understanding of what the filters capture and how filters from different layers form a hierarchy. While lower layers capture simple visual patterns such as lines or circles, final layers can capture whole pictures, animals, people, etc.

Examples of patterns captured by convolution filters for images. The examples are from Activation Atlas from
distill.pub.

What About Convolutions in Texts?

This part is based on the paper Understanding Convolutional Neural Networks for Text
Classification.

For images, filters capture local visual patterns which are important for classification. For text,
such local patterns are word n-grams. The main findings on how CNNs work for texts are:

convolving filters are used as ngram detectors
Each filter specializes in one or several families of closely related ngrams. Filters are not homogeneous, i.e. a single filter can, and often does, detect multiple distinctly different families of ngrams.

max-pooling induces a thresholding behavior
Values below a given threshold are ignored when making a prediction (i.e. they are irrelevant to it). For example, the paper shows that 40% of the pooled ngrams on average can be dropped with no loss of performance.
The simplest way to understand what a network captures is to look at which patterns activate its neurons. For convolutions, we pick a filter and find those n-grams which activate this filter most.

Below are examples of the top-1 n-gram for several filters. For one of them, we also show other n-grams which lead to high activation of this filter - you can see that the n-grams have a very similar meaning.

For more details, look at the paper Understanding Convolutional Neural Networks for Text Classification.

How About RNN Classifiers?

How do RNNs trained for classification process text? Learn here.

Research Thinking
How to

Read the short description at the beginning - this is our starting point, something known.
Read a question and think: for a minute, a day, a week, ... - give yourself some time! Even
if you are not thinking about it constantly, something can still come to mind.
Look at the possible answers - previous attempts to answer/solve this problem.
Important: You are not supposed to come up with something exactly like here -
remember, each paper usually takes the authors several months of work. It's a habit of
thinking about these things that counts! All the rest a scientist needs is time: to try-fail-
think until it works.

It's well-known that you will learn something easier if you are not just given the answer right
away, but if you think about it first. Even if you don't want to be a researcher, this is still a good
way to learn things!

Classical Approaches

Improve Naive Bayes

The simplest Naive Bayes implementation uses tokens as features. However, this is not always good: completely different texts can have the same features.

? In the example above we see the main problem of Naive Bayes: it knows nothing about context. Of course, we cannot remove the "naive" assumptions (otherwise, it won't be Naive Bayes anymore). But can we improve the feature extraction part?
Possible answers

? What other types of features can you come up with?
Possible answers

? Are all words equally needed for classification? If not, how can we modify the method?
Possible answers

Neural Approaches

Fine-tuning embeddings: Why and when can this help?

Before training models, you can first think about why fine-tuning can be useful, and which types of examples can benefit from it. Remember how embeddings are trained: words that are used similarly in texts have very close embeddings. Therefore, sometimes antonyms are closest to each other, e.g. descent and ascent.

? Imagine we want to use embeddings for sentiment classification. Can you find examples of antonyms such that, if their embeddings are very close, it would hurt sentiment classification? If you can, it means that it might be better to fine-tune!
Possible answers

More exercises will appear here!

This part will be expanding from time to time.

Related Papers
How to

High-level: look at key results in short summaries - get an idea of what's going on in the field.
A bit deeper: for topics which interest you more, read longer summaries with illustrations and explanations. Take a walk through the authors' reasoning steps and key observations.
In depth: read the papers you liked. Now that you've got the main idea, this is going to be easier!

What's inside:

- Convolutions for Classification: Classics
- Analyzing RNNs for Sentiment Classification
- ... to be updated

Convolutions for Classification: Classics
• Naive Bayes
EMNLP 2014
Convolutional Neural Networks for Sentence Classification
Yoon Kim

Even a very simple CNN with one layer on top of word embeddings shows very good performance (without features requiring external knowledge!). The paper also shows the importance of using pretrained embeddings (rather than training from scratch) and the gains from fine-tuning.

More details
NeurIPS 2015
Character-level Convolutional Networks for Text Classification
Xiang Zhang, Junbo Zhao, Yann LeCun

This is the first paper showing that CNNs operating only on characters can do quite well. This is interesting: classification can be done without any external knowledge, even without text segmentation into words! An important point is that character-level CNNs can do better than classical approaches only for large datasets.

More details

Analyzing RNNs for Sentiment Classification


NeurIPS 2019
Reverse engineering recurrent networks for sentiment classification reveals line attractor dynamics
Niru Maheswaranathan, Alex H. Williams, Matthew D. Golub, Surya Ganguli, David Sussillo

If we take an RNN trained for sentiment analysis and apply PCA to lots of its states, we'll see that almost all variance is explained by only two components. Moreover, when such an RNN reads a text, its states move along a 1D line in either the negative or the positive direction depending on the word it reads.

More details

More papers will appear here!

The papers will be gradually appearing.

Have Fun!

Coming soon!

We are still working on this!

Last updated November 17, 2023.