
End to End Binarized Neural Networks for Text Classification

Kumar Shridhar1∗, Harshil Jain2∗, Akshat Agarwal3∗, Denis Kleyko4,5

1 NeuralSpace, London
2 Computer Science and Engineering, IIT Gandhinagar, Gujarat, India
3 Electrical Engineering, Delhi Technological University, Delhi, India
4 Redwood Center for Theoretical Neuroscience, University of California, Berkeley
5 Intelligent Systems Lab, Research Institutes of Sweden

[email protected], [email protected], [email protected], [email protected]

∗ The authors contributed equally to this research; the work was done at NeuralSpace.

Abstract

Deep neural networks have demonstrated superior performance in almost every Natural Language Processing task; however, their increasing complexity raises concerns. A particular concern is that these networks pose high requirements for computing hardware and training budgets. The state-of-the-art transformer models are a vivid example. Simplifying the computations performed by a network is one way of addressing the issue of the increasing complexity. In this paper, we propose an end to end binarized neural network for the task of intent and text classification. In order to fully utilize the potential of end to end binarization, both the input representations (vector embeddings of token statistics) and the classifier are binarized. We demonstrate the efficiency of such a network on the intent classification of short texts over three datasets and on text classification with a larger dataset. On the considered datasets, the proposed network achieves results comparable to the state-of-the-art while utilizing ∼20-40% less memory and training time compared to the benchmarks.

1 Introduction

In recent years, deep neural networks have achieved great success in a variety of domains, but the networks are becoming more and more computationally expensive due to their ever-growing size. This tendency has been noticed in (Strubell et al., 2019; Schwartz et al., 2019), and it has been recommended that academia and industry researchers draw their attention towards more computationally efficient methods. At the same time, many important application areas such as chatbots, IoT devices, mobile devices, and other types of power-constrained and resource-constrained platforms require solutions that are highly computationally and memory efficient. Such use-cases limit the potential use of the state-of-the-art deep networks. One viable solution is the transformation of these high-performance neural networks into a more computationally efficient architecture. Recently, Binarized Convolutional Neural Networks (BNN) (Hubara et al., 2016) have been developed, in which both weights and activations are restricted to {+1, −1}. A BNN is a highly computationally efficient network with a much lower memory footprint. Tasks like language modeling (Zheng and Tang, 2016) have been performed using binarized neural networks, but, to the best of our knowledge, in the area of text classification, no end to end trainable binarized architectures have been demonstrated yet.

In this paper, we introduce an architecture for the tasks of intent and text classification that fully utilizes the power of binary representations. The input representations are tokenized and embedded into binary high-dimensional (HD) vectors forming distributed representations, using the paradigm known as hyperdimensional computing (Kanerva, 2009). The binary input representations are used for training an end to end BNN classifier for intent classification. Classification performance-wise, the binarized architecture achieves results comparable to the state-of-the-art on several standard intent classification datasets. The efficiency of the proposed architecture is shown in terms of its time and memory complexity relative to non-binarized architectures.

2 Proposed Method

Figure 1 presents a schematic overview of the architecture. Given an input text document D, we first pre-process the document. The pre-processed document is then tokenized into the corresponding tokens <T1, T2, ..., Tn>, which are used as an input to a count-based vectorizer. The representation produced by the vectorizer, which is sparse and localist, is embedded into an HD vector (a distributed representation) using hyperdimensional computing. The HD vector representing the counter's content can be binarized. It is used as an input to a classifier. The primary classifier studied in this work is BNN, but other classifiers are also considered for benchmarking.
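As a concrete illustration of the front end of this pipeline, the minimal Python sketch below (our own toy example, not the authors' code) pre-processes a document, tokenizes it, and collects n-gram counts of the kind a count-based vectorizer produces; whitespace tokenization and n = 2 are arbitrary choices made for illustration, whereas the paper experiments with several tokenizers.

```python
from collections import Counter

def ngram_counts(document: str, n: int = 2):
    """Toy count-based vectorizer: lowercase, whitespace-tokenize,
    then count every n-gram of consecutive tokens <T1, ..., Tn>."""
    tokens = document.lower().split()
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)  # maps n-gram -> frequency f_i

counts = ngram_counts("book a table for two please book early")
print(counts.most_common(3))
```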

Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing, pages 29–34, Online, November 20, 2020. © 2020 Association for Computational Linguistics.
Figure 1: A schematic diagram of the end to end binarized classification architecture for text classification.

2.1 High-Dimensional embedding of vectorized representations

In order to reduce the dimensionality of representations, we use hyperdimensional computing (Kanerva, 2009). First, each unique token Ti is assigned a random d-dimensional bipolar HD vector, where d is a hyperparameter of the method. HD vectors are stored in the item memory, which is a matrix H ∈ [d × n], where n is the number of tokens. Thus, for a token Ti there is an HD vector HTi ∈ {−1, +1}[d×1]. To construct composite representations from the atomic HD vectors stored in H, hyperdimensional computing defines three key operations: permutation (ρ), binding (⊙, implemented via element-wise multiplication), and bundling (+, implemented via element-wise addition) (Kanerva, 2009). The bundling operation allows storing information in HD vectors (Frady et al., 2018). The three operations above allow embedding vectorized representations based on n-gram statistics into an HD vector (Joshi et al., 2016).

We first generate H, which has an HD vector for each token. The permutation operation ρ is applied to HTj j times (denoted ρj(HTj)) to represent the relative position of token Tj in an n-gram. A single HD vector corresponding to an n-gram (denoted as m) is formed using the consecutive binding of the permuted HD vectors ρj(HTj) representing the tokens in each position j of the n-gram. For example, the trigram '#he' will be embedded into an HD vector as follows: ρ1(H#) ⊙ ρ2(Hh) ⊙ ρ3(He). In general, the process of forming the HD vector of an n-gram is

$$m = \prod_{j=1}^{n} \rho^{j}(H_{T_j}),$$

where Tj is the token in the jth position of the n-gram; the consecutive binding operations applied to n HD vectors are denoted by ∏. Once it is known how to form an HD vector for an individual n-gram, embedding the n-gram statistics into an HD vector h is achieved by bundling together all n-grams observed in the document:

$$h = \Big[\sum_{i=1}^{k} f_i m_i\Big] = \Big[\sum_{i=1}^{k} f_i \prod_{j=1}^{n} \rho^{j}(H_{T_j})\Big],$$

where k is the total number of unique n-grams; fi is the frequency of the ith n-gram and mi is the HD vector of the ith n-gram; ∑ denotes the bundling operation when applied to several HD vectors; [∗] denotes the binarization operation, which is implemented via the sign function. The usage of [∗] is optional, so we can obtain either a binarized or a non-binarized h. If h is non-binarized, its components will be integers in the range [−k, k], but these extreme values are highly unlikely since HD vectors for different n-grams are quasi-orthogonal, which means that in the simplest (but not practical) case, when all n-grams have the same probability, the expected value of a component of h is 0. Due to the use of ∑ for representing n-gram statistics, two HD vectors embedding two different n-gram statistics might have very different amplitudes if the frequencies in these statistics are very different. When the HD vectors h are binarized, this issue is addressed. In the case of non-binarized HD vectors, we address it by using the cosine similarity, which is imposed by normalizing each h by its ℓ2 norm; thus, all h have the same norm, and their dot product is equivalent to their cosine similarity.
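The following NumPy sketch (a minimal re-implementation of the formulas above, not the authors' released code) builds a small item memory H, forms an n-gram HD vector via permutation (a cyclic shift stands in for ρ) and element-wise binding, bundles the frequency-weighted n-gram vectors, and optionally binarizes the result with the sign function; the dimensionality, vocabulary, and random seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512                                   # HD dimensionality (hyperparameter)
vocab = {"#": 0, "h": 1, "e": 2}          # token -> row index in the item memory H
H = rng.choice([-1, 1], size=(len(vocab), d))   # random bipolar HD vectors

def ngram_vector(ngram):
    """m = prod_j rho^j(H_Tj): permute each token's HD vector according to
    its position j and bind (element-wise multiply) the permuted vectors."""
    m = np.ones(d, dtype=int)
    for j, token in enumerate(ngram, start=1):
        m *= np.roll(H[vocab[token]], j)  # rho^j modeled as a cyclic shift by j
    return m

def embed(ngram_freqs, binarize=True):
    """h = [ sum_i f_i * m_i ]: bundle the frequency-weighted n-gram vectors
    and (optionally) binarize with the sign function."""
    h = np.zeros(d, dtype=int)
    for ngram, f in ngram_freqs.items():
        h += f * ngram_vector(ngram)
    return np.where(h >= 0, 1, -1) if binarize else h

h = embed({("#", "h", "e"): 2, ("h", "e", "#"): 1})   # two character trigrams
```

If the non-binarized variant is kept, dividing h by its ℓ2 norm reproduces the cosine-similarity normalization described above.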


Figure 2: (a) Memory comparison for all four datasets using HD Text-LeNet, HD BNN, GloVe Text-LeNet, and GloVe BNN; (b) training time per epoch comparison for all four datasets using BNN and Text-LeNet.

Tokenizers       Chatbot              AskUbuntu            WebApplication       20NewsGroups
                 Text-LeNet   BNN     Text-LeNet   BNN     Text-LeNet   BNN     Text-LeNet   BNN
Word             0.80         0.73    0.51         0.79    0.56         0.78    0.54         0.56
SemHash          0.94         0.90    0.87         0.84    0.79         0.83    0.78         0.69
BPE              0.80         0.58    0.54         0.67    0.52         0.75    0.38         0.42
Char BPE         0.92         0.81    0.76         0.76    0.55         0.53    0.55         0.48
SentencePiece    0.80         0.99    0.70         0.72    0.50         0.70    0.41         0.43
BERT             0.89         0.88    0.72         0.71    0.70         0.77    0.60         0.60

Table 1: F1 performance comparison of binarized Text-LeNet (BNN) architecture with non-binarized Text-LeNet
for the task of intent classification on various datasets.

2.2 Binarized Neural Networks

Based on the work of (Hubara et al., 2016), we construct BNNs capable of working with representations of texts. To take full advantage of binarized HD vectors, we constrain the weights and activations of the network layers to be {+1, −1}. This constraint is highly efficient in terms of hardware and memory, as bit-wise operations are used instead of multiply-accumulate operations. For example, a multiplication on binary values can be performed using an XNOR logical operation.
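To make the bit-wise argument concrete, the toy check below (ours, not from the paper) verifies that the dot product of two {+1, −1} vectors equals d minus twice the popcount of the XOR of their bit encodings, which is the identity that XNOR/popcount kernels exploit in place of multiply-accumulate operations.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64
a = rng.choice([-1, 1], size=d)
b = rng.choice([-1, 1], size=d)

# Encode +1 as bit 1 and -1 as bit 0.
a_bits, b_bits = a > 0, b > 0

# Equal bits contribute +1 to the dot product, differing bits contribute -1,
# so a.b = d - 2 * popcount(a XOR b) (equivalently 2 * popcount(XNOR) - d).
dot_via_xor = d - 2 * np.count_nonzero(a_bits ^ b_bits)
assert dot_via_xor == int(a @ b)
```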
The vectorized representations of tokens embedded into HD vectors are binarized with all values in {+1, −1}. In the case of HD vectors, we binarize the result of the bundling operation using the sign function. Similarly, the sign function is used in the BNN for every weight and activation to restrict them to {+1, −1} as follows:

$$b(x) = [x] = \operatorname{sign}(x) = \begin{cases} +1 & \text{if } x \ge 0, \\ -1 & \text{otherwise,} \end{cases} \qquad (1)$$

where x can be any weight or activation value. We further define a convolutional 1D layer that creates a convolution kernel which is convolved with the input HD vector over a single spatial dimension to produce a tensor of outputs. Since gradient descent methods make small changes to the values of the weights, which cannot be done with binary values, we use the straight-through estimator idea, as mentioned in (Yin et al., 2019). We also define a value at which we clip the gradients in the backward pass:

$$\frac{\delta b(x)}{\delta x} = \begin{cases} +1 & \text{if } |x| < \text{clip value}, \\ 0 & \text{otherwise.} \end{cases} \qquad (2)$$

This ensures that the entire architecture is end to end trainable using gradient descent optimization.
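A minimal PyTorch sketch of Equations (1) and (2) follows (our own illustration rather than the paper's implementation; the clip value of 1.0 is an assumption, since the exact setting is not stated here):

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """Sign binarization (Eq. 1) with a clipped straight-through
    estimator in the backward pass (Eq. 2)."""

    CLIP_VALUE = 1.0  # assumed; not specified in the text

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        # b(x) = +1 if x >= 0 else -1
        return torch.where(x >= 0, torch.ones_like(x), -torch.ones_like(x))

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Pass the incoming gradient only where |x| < clip_value, zero it elsewhere.
        return grad_output * (x.abs() < BinarizeSTE.CLIP_VALUE).to(grad_output.dtype)

x = torch.randn(8, requires_grad=True)
y = BinarizeSTE.apply(x)
y.sum().backward()   # x.grad is 1 where |x| < 1 and 0 elsewhere
```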
3 Empirical Analysis

3.1 Datasets

All the experiments are performed on four datasets, namely: the Chatbot Corpus (Chatbot), the Ask Ubuntu Corpus (AskUbuntu), the Web Applications Corpus (WebApplication), and the 20 News Groups Corpus (20NewsGroups) (Braun et al., 2017).

3.2 Results and Discussions

For the CNN-based architecture, 5 hidden layers were used: 3 convolutional 1D layers followed by 2 dense layers. Due to its resemblance to the original LeNet architecture (LeCun et al., 1998), we refer to this architecture as Text-LeNet. We compare binarized HD vectors with the binarized Text-LeNet (BNN) architecture as the classifier against non-binarized HD vectors with non-binarized Text-LeNet.
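As a rough sketch of the kind of network described here (our reading of "3 convolutional 1D layers followed by 2 dense layers"; the channel widths, kernel sizes, and pooling are assumptions, and the binarized variant would additionally pass weights and activations through the sign/STE quantizer of Section 2.2):

```python
import torch
import torch.nn as nn

class TextLeNet(nn.Module):
    """Illustrative Text-LeNet-style classifier over d-dimensional HD-vector
    inputs: three Conv1d layers followed by two dense layers."""

    def __init__(self, d: int = 512, num_classes: int = 5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(16),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 16, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, h):
        # h: (batch, d) HD vectors treated as a one-channel 1D signal.
        return self.classifier(self.features(h.unsqueeze(1).float()))

model = TextLeNet()
hd_batch = torch.randint(0, 2, (4, 512)) * 2 - 1   # a batch of binarized HD vectors
logits = model(hd_batch)                           # shape: (4, num_classes)
```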

Datasets Binarized GloVe Binarized SemHash Binarized HD vectors
Chatbot 0.74 0.91 0.99
AskUbuntu 0.86 0.87 0.84
WebApplication 0.66 0.80 0.83
20NewsGroups 0.62 0.64 0.69

Table 2: F1 performance comparison of Binarized GloVe vectors, Binarized SemHash vectors and Binarized HD
vectors. All vectorizers use the same binarized Text-LeNet architecture as classifier.


Figure 3: (a), (b), (c), and (d) show the F1 score comparison of the MLP and Linear SVC classifiers with HD and non-HD based tokenizers on the Chatbot, AskUbuntu, WebApplication, and 20NewsGroups corpora, respectively.

The F1 scores are compared in Table 1, where BNN performed equally well to the Text-LeNet architecture while being 20% to 40% more memory efficient, as shown in Figure 2 (a). Note that, due to the specifics of the implementation, BNNs use 32-bit float values just as Text-LeNet does. The memory efficiency of BNNs can be further improved by 4x when 8-bit representations are used and up to 32x if single-bit representations are used. However, hardware limitations prevented us from going to that extreme. On the performance side, BNNs outperform Text-LeNet for the AskUbuntu and WebApplication datasets on 4 out of 6 tokenizers. The results reported in Table 1 used 512-dimensional HD vectors for the Chatbot, AskUbuntu, and WebApplication corpora, while 1,024-dimensional HD vectors were used for the 20NewsGroups dataset.
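A quick back-of-the-envelope check of the quoted savings (ours, using NumPy bit packing as one possible single-bit storage scheme): 32-bit floats cost 32 bits per value, 8-bit integers cut that by 4x, and packing one value per bit yields the 32x figure.

```python
import numpy as np

d = 1024
h = np.random.default_rng(2).choice([-1, 1], size=d)   # a binarized HD vector

float32_bytes = h.astype(np.float32).nbytes   # 4096 bytes
int8_bytes = h.astype(np.int8).nbytes         # 1024 bytes (4x smaller)
packed_bytes = np.packbits(h > 0).nbytes      # 128 bytes (32x smaller)
print(float32_bytes / int8_bytes, float32_bytes / packed_bytes)   # 4.0 32.0
```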
One thing to note here is that Text-LeNet also used HD vectors with the mentioned tokenizers, but the HD vectors were non-binarized. HD vectors are in themselves already faster and much more efficient than counter-based representations, as shown in (Alonso et al., 2020). When experimenting with other embedding methods like GloVe, the training was significantly slower; therefore, HD vectors were used for all the experiments. In addition, using the binarized classifier (BNN) further improved the training time by up to 50% per epoch when compared to the non-binarized classifier on all four datasets, as shown in Figure 2 (b).

Platform Chatbot AskUbuntu WebApplication Average
Botfuel 0.98 0.90 0.80 0.89
Luis 0.98 0.90 0.81 0.90
Dialogflow 0.93 0.85 0.80 0.86
Watson 0.97 0.92 0.83 0.91
Rasa 0.98 0.86 0.74 0.86
Snips 0.96 0.83 0.78 0.86
Recast 0.99 0.86 0.75 0.87
TildeCNN 0.99 0.92 0.81 0.91
FastText 0.97 0.91 0.76 0.88
SemHash (Shridhar et al., 2019) 0.96 0.92 0.87 0.92
BPE 0.95 0.93 0.85 0.91
HD vectors (Alonso et al., 2020) 0.97 0.92 0.82 0.90
Binarized HD vectors with the best classifier 0.98 0.93 0.84 0.92
HD Text-LeNet 0.94 0.87 0.79 0.88
HD BNN 0.99 0.84 0.83 0.91

Table 3: F1 score comparison of various platforms on the intent classification datasets of short texts with the methods used in this paper. Some results are taken from (Alonso et al., 2020).

Furthermore, when compared to GloVe embeddings with Text-LeNet, HD BNN used around 20-40% less memory for all the intent classification datasets.

We also benchmarked the binarized HD vectors against binarized 300-dimensional GloVe vectors and the binarized version of the counter-based representation for the SemHash tokenizer (Alonso et al., 2020) for all the datasets. Table 2 summarizes the results of the comparison. All the binarized representations were trained with the same BNN classifier. Binarized HD vectors performed significantly better than the other binarized methods, outperforming binarized GloVe by 4-25% and binarized SemHash by 2-8% on 2 out of 3 smaller intent classification datasets, and achieved comparable results on the AskUbuntu dataset. The trend continued for 20NewsGroups, with binarized HD vectors achieving 5-7% better F1 scores. Note that for the SemHash counter-based vectorizer, we used the sign function sign(x) = +1 for x > 0 and −1 otherwise.

In Figure 3, MLP and Linear SVC classifiers with all the tokenizers and HD vectors as the representation are compared with MLP and Linear SVC classifiers with the SemHash tokenizer and counter-based vectorizer as the representation from (Alonso et al., 2020). The F1 score is comparable to the state-of-the-art for both MLP and SVC. For all small intent classification datasets, binarized HD vectors achieved better results than non-HD vectors. The proposed architecture beats the non-HD baselines by +2% for the AskUbuntu and Chatbot corpora and +5% for the WebApplication corpus. However, for 20NewsGroups, the results of binarized HD vectors are lower than those of non-HD vectors. This is mainly due to the large size of the dataset, as simple classifiers like Linear SVC failed to perform with just binarized values. The results for all the other classifiers are provided in the Appendix.

Table 3 compares the F1 scores of various platforms on the intent classification datasets. We report the results of binarized HD vectors with the best classifier out of the nine classifiers mentioned (Binarized HD vectors with the best classifier), non-binarized HD vectors with Text-LeNet (HD Text-LeNet), and binarized HD vectors with binarized Text-LeNet (HD BNN). Our end to end binarized architecture (HD BNN) achieved the state-of-the-art result for the Chatbot dataset. The approach where only the HD vectors were binarized (binarized HD vectors with the best classifier) achieved the state-of-the-art result for the AskUbuntu dataset. The results on the WebApplication dataset are comparable to the state-of-the-art (0.87 with SemHash): 0.84 for binarized HD vectors with the best classifier and 0.83 for HD BNN. The average performance of both binarized HD vectors with the best classifier (0.92) and HD BNN (0.91) was also comparable to the best non-binarized approach (0.92).

4 Conclusion

In this work, we show that it is possible to achieve results comparable to the state-of-the-art while using binarized representations for all the components of the text classification architecture. This allows exploring the effectiveness of binary representations both for reducing the memory footprint of the architecture and for increasing the energy efficiency of the inference phase due to the effectiveness of binary operations. This work takes a step towards enabling NLP functionality on resource-constrained devices.

References

P. Alonso, K. Shridhar, D. Kleyko, E. Osipov, and M. Liwicki. 2020. HyperEmbed: Tradeoffs between Resources and Performance in NLP Tasks with Hyperdimensional Computing Enabled Embedding of n-gram Statistics. arXiv:2003.01821.

D. Braun, A. Hernandez-Mendez, F. Matthes, and M. Langen. 2017. Evaluating Natural Language Understanding Services for Conversational Question Answering Systems. In Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), pages 174–185.

E. P. Frady, D. Kleyko, and F. T. Sommer. 2018. A Theory of Sequence Indexing and Working Memory in Recurrent Neural Networks. Neural Computation, 30:1449–1513.

L. Geiger and P. Team. 2020. Larq: An Open-Source Library for Training Binarized Neural Networks. Journal of Open Source Software, 5(45):1746.

G. Hinton, N. Srivastava, and K. Swersky. 2012. Neural Networks for Machine Learning, Lecture 6a: Overview of Mini-batch Gradient Descent.

I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. 2016. Binarized Neural Networks. In Advances in Neural Information Processing Systems (NIPS), pages 1–9.

S. Ioffe and C. Szegedy. 2015. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. CoRR, abs/1502.03167.

A. Joshi, J. T. Halseth, and P. Kanerva. 2016. Language Geometry Using Random Indexing. In Quantum Interaction (QI), pages 265–274.

P. Kanerva. 2009. Hyperdimensional Computing: An Introduction to Computing in Distributed Representation with High-Dimensional Random Vectors. Cognitive Computation, 1(2):139–159.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. 1998. Gradient-based Learning Applied to Document Recognition. Proceedings of the IEEE, 86(11):2278–2324.

R. Schwartz, J. Dodge, N. Smith, and O. Etzioni. 2019. Green AI. arXiv:1907.10597.

K. Shridhar, A. Dash, A. Sahu, G. Grund Pihlgren, P. Alonso, V. Pondenkandath, G. Kovacs, F. Simistira, and M. Liwicki. 2019. Subword Semantic Hashing for Intent Classification on Small Datasets. In International Joint Conference on Neural Networks (IJCNN), pages 1–6.

E. Strubell, A. Ganesh, and A. McCallum. 2019. Energy and Policy Considerations for Deep Learning in NLP. In 57th Annual Meeting of the Association for Computational Linguistics (ACL), pages 3645–3650.

P. Yin, J. Lyu, S. Zhang, S. Osher, Y. Qi, and J. Xin. 2019. Understanding Straight-through Estimator in Training Activation Quantized Neural Nets. arXiv:1903.05662.

W. Zheng and Y. Tang. 2016. Binarized Neural Networks for Language Modeling. Technical Report cs224d, Stanford University.
