End To End Binarized Neural Networks For Text Classification
Figure 1: A schematic diagram of the end to end binarized classification architecture for text classification.
In order to reduce the dimensionality of representations, we use hyperdimensional computing (Kanerva, 2009). First, each unique token T_i is assigned a random d-dimensional bipolar HD vector, where d is a hyperparameter of the method. HD vectors are stored in the item memory, which is a matrix H ∈ [d × n], where n is the number of tokens. Thus, for a token T_i there is an HD vector H_{T_i} ∈ {−1, +1}^{d×1}. To construct composite representations from the atomic HD vectors stored in H, hyperdimensional computing defines three key operations: permutation (ρ), binding (⊙, implemented via element-wise multiplication), and bundling (+, implemented via element-wise addition) (Kanerva, 2009). The bundling operation allows storing information in HD vectors (Frady et al., 2018). The three operations above allow embedding vectorized representations based on n-gram statistics into an HD vector (Joshi et al., 2016).
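To make these primitives concrete, below is a minimal NumPy sketch of an item memory of random bipolar HD vectors together with the three operations; the function names and the choice of a cyclic shift to realize the permutation ρ are illustrative assumptions rather than details fixed by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_item_memory(tokens, d=10_000):
    """Item memory H: a random bipolar d-dimensional HD vector per token."""
    return {t: rng.choice([-1, 1], size=d) for t in tokens}

def permute(v, j=1):
    """Permutation rho applied j times (realized here as a cyclic shift)."""
    return np.roll(v, j)

def bind(a, b):
    """Binding: element-wise multiplication of two HD vectors."""
    return a * b

def bundle(vectors):
    """Bundling: element-wise addition of a list of HD vectors."""
    return np.sum(np.stack(vectors), axis=0)

# Random bipolar HD vectors are quasi-orthogonal: the normalized dot product
# of two distinct vectors is close to 0, while a vector with itself gives 1.
H = make_item_memory("abcdefgh")
d = len(H["a"])
print(np.dot(H["a"], H["b"]) / d)  # ~0.0
print(np.dot(H["a"], H["a"]) / d)  # 1.0
```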
We first generate H, which has an HD vector for each token. The permutation operation ρ is applied to H_{T_j} j times (denoted ρ^j(H_{T_j})) to represent the relative position of token T_j in an n-gram. A single HD vector corresponding to an n-gram (denoted as m) is formed using the consecutive binding of the permuted HD vectors ρ^j(H_{T_j}) representing the tokens in each position j of the n-gram. For example, the trigram ‘#he’ will be embedded into an HD vector as follows: ρ^1(H_#) ⊙ ρ^2(H_h) ⊙ ρ^3(H_e). In general, the HD vector of an n-gram is formed as:

m = ρ^1(H_{T_1}) ⊙ ρ^2(H_{T_2}) ⊙ · · · ⊙ ρ^n(H_{T_n}).
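A short sketch of this n-gram encoding, under the same assumptions as above (NumPy, cyclic-shift permutation, item memory given as a dictionary of bipolar vectors):

```python
import numpy as np

def ngram_vector(ngram, H):
    """m = rho^1(H_T1) (*) rho^2(H_T2) (*) ... (*) rho^n(H_Tn):
    bind the permuted HD vectors of the n-gram's tokens."""
    d = len(next(iter(H.values())))
    m = np.ones(d, dtype=np.int64)
    for j, token in enumerate(ngram, start=1):
        m = m * np.roll(H[token], j)  # permute j times, then bind
    return m

# Example: the trigram '#he'
rng = np.random.default_rng(1)
H = {c: rng.choice([-1, 1], size=10_000) for c in "#he"}
m = ngram_vector("#he", H)
print(np.unique(m))  # components remain bipolar: [-1  1]
```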
The HD vector h embedding the n-gram statistics of a text is then formed by bundling together the HD vectors of all its n-grams, weighted by their frequencies:

h = [ Σ_{i=1}^{k} f_i m_i ],

where k is the total number of unique n-grams; f_i is the frequency of the i-th n-gram and m_i is the HD vector of the i-th n-gram; Σ denotes the bundling operation when applied to several HD vectors; [∗] denotes the binarization operation, which is implemented via the sign function. The usage of [∗] is optional, so we can obtain either a binarized or a non-binarized h. If h is non-binarized, its components will be integers in the range [−k, k], but these extreme values are highly unlikely since HD vectors for different n-grams are quasi-orthogonal, which means that in the simplest (but not practical) case when all n-grams have the same probability, the expected value of a component of h is 0. Due to the use of Σ for representing n-gram statistics, two HD vectors embedding two different n-gram statistics might have very different amplitudes if the frequencies in these statistics are very different. When HD vectors h are binarized, this issue is addressed. In the case of non-binarized HD vectors, we address it by using the cosine similarity, which is imposed by normalizing each h by its ℓ2 norm; thus, all h have the same norm, and their dot product is equivalent to their cosine similarity.
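Putting the pieces together, the following hedged sketch embeds the character n-gram statistics of a text into an HD vector, with either the optional sign binarization [∗] or the ℓ2 normalization discussed above; the character-trigram tokenization, the helper names, and the tie-breaking for zero components are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np
from collections import Counter

def embed_text(text, H, n=3, binarize=True):
    """h = [ sum_i f_i * m_i ]: bundle the HD vectors of all n-grams in the
    text, weighted by their frequencies, then optionally binarize via sign."""
    d = len(next(iter(H.values())))
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    h = np.zeros(d, dtype=np.int64)
    for ngram, freq in counts.items():            # f_i and m_i
        m = np.ones(d, dtype=np.int64)
        for j, token in enumerate(ngram, start=1):
            m = m * np.roll(H[token], j)          # rho^j, then binding
        h = h + freq * m                          # bundling
    if binarize:
        h = np.sign(h)                            # [*] via the sign function
        h[h == 0] = 1                             # tie-breaking (assumption)
        return h
    return h / np.linalg.norm(h)                  # l2-normalized: dot = cosine

# Usage: cosine similarity between two text embeddings.
rng = np.random.default_rng(2)
H = {c: rng.choice([-1, 1], size=10_000) for c in "#helo wrdt"}
h1 = embed_text("#hello there#", H)
h2 = embed_text("#hello world#", H)
print(np.dot(h1, h2) / (np.linalg.norm(h1) * np.linalg.norm(h2)))
```

Note that for binarized vectors all embeddings have the same norm √d, so the dot product alone already behaves as a scaled cosine similarity, which is why the explicit ℓ2 normalization is only needed in the non-binarized case.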
2.2 Binarized Neural Networks

Based on the work of Hubara et al. (2016), we construct BNNs capable of working with representations of texts. To take full advantage of binarized HD vectors, we constrain the weights and