Extensions of Recurrent Neural Network Language Model
Tomáš Mikolov¹,², Stefan Kombrink¹, Lukáš Burget¹, Jan "Honza" Černocký¹, Sanjeev Khudanpur²
¹ Brno University of Technology, Speech@FIT, Czech Republic
² Department of Electrical and Computer Engineering, Johns Hopkins University, USA
{imikolov,kombrink,burget,cernocky}@fit.vutbr.cz, [email protected]
ABSTRACT
We present several modifications of the original recurrent neural network language model (RNN LM). While this model has been shown to significantly outperform many competitive language modeling techniques in terms of accuracy, the remaining problem is its computational complexity. In this work, we show approaches that lead to more than 15 times speedup for both the training and testing phases. Next, we show the importance of using the backpropagation through time algorithm. An empirical comparison with feedforward networks is also provided. In the end, we discuss possibilities of reducing the number of parameters in the model. The resulting RNN model can thus be smaller, faster both during training and testing, and more accurate than the basic one.
Index Terms— language modeling, recurrent neural networks,
speech recognition
1. INTRODUCTION

Statistical models of natural language are a key part of many systems today. The most widely known applications are automatic speech recognition (ASR), machine translation (MT) and optical character recognition (OCR). In the past, there was always a struggle between those who follow the statistical way and those who claim that we need to adopt linguistics and expert knowledge to build models of natural language. The most serious criticism of statistical approaches is that there is no true understanding occurring in these models, which are typically limited by the Markov assumption and are represented by n-gram models. Prediction of the next word is often conditioned just on two preceding words, which is clearly insufficient to capture semantics. On the other hand, the criticism of linguistic approaches was even more straightforward: despite all the efforts of linguists, statistical approaches were dominating when performance in real world applications was the measure.

Thus, there has been a lot of research effort in the field of statistical language modeling. Among models of natural language, neural network based models seemed to outperform most of the competition [1][2], and were also showing steady improvements in state of the art speech recognition systems [3]. The main power of neural network based language models seems to be in their simplicity: almost the same model can be used for prediction of many types of signals, not just language. These models perform implicit clustering of words in a low-dimensional space. Prediction based on this compact representation of words is then more robust. No additional smoothing of probabilities is required.

Among many following modifications of the original model, the recurrent neural network based language model [4] provides further generalization: instead of considering just several preceding words, neurons with input from recurrent connections are assumed to represent short term memory. The model learns itself from the data how to represent memory. While shallow feedforward neural networks (those with just one hidden layer) can only cluster similar words, recurrent neural networks (which can be considered a deep architecture [5]) can perform clustering of similar histories. This allows, for instance, an efficient representation of patterns with variable length.

In this work, we show the importance of the backpropagation through time algorithm for learning appropriate short term memory. Then we show how to further improve the original RNN LM by decreasing its computational complexity. In the end, we briefly discuss possibilities of reducing the size of the resulting model.

This work was partly supported by European project DIRAC (FP6-027787), Grant Agency of Czech Republic project No. 102/08/0707, Czech Ministry of Education project No. MSM0021630528 and by BUT FIT grant No. FIT-10-S-2.

2. MODEL DESCRIPTION

Fig. 1. Simple recurrent neural network.

The recurrent neural network described in [4] is also called an Elman network [6]. Its architecture is shown in Figure 1. The vector x(t) is formed by concatenating the vector w(t), which represents the current word using 1-of-N coding (thus its size is equal to the size of the vocabulary), and the vector s(t-1), which represents the output values of the hidden layer from the previous time step. The network is trained using the standard backpropagation algorithm and contains input, hidden and output layers. Values in these layers are computed as follows:

x(t) = [w(t)^T \; s(t-1)^T]^T                        (1)

s_j(t) = f\left( \sum_i x_i(t) \, u_{ji} \right)     (2)

y_k(t) = g\left( \sum_j s_j(t) \, v_{kj} \right)     (3)
where f(z) and g(z) are sigmoid and softmax activation functions (the softmax function in the output layer is used to make sure that the outputs form a valid probability distribution, i.e. all outputs are greater than 0 and their sum is 1):

f(z) = \frac{1}{1+e^{-z}}, \qquad g(z_m) = \frac{e^{z_m}}{\sum_k e^{z_k}}     (4)
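As an illustration, the following minimal Python sketch implements one forward step defined by Equations (1)-(4). The layer sizes, the random initialization and the variable names are assumptions made only for this sketch; it is not the implementation used for the experiments (see the rnnlm toolkit referenced in the conclusion).

import numpy as np

# Illustrative sketch of one forward step of the Elman-style RNN LM of
# Eqs. (1)-(4). Sizes and initialization are assumed, not taken from the paper.

n_vocab, n_hidden = 10000, 200
rng = np.random.default_rng(0)
U = rng.normal(0, 0.1, (n_hidden, n_vocab + n_hidden))  # weights u_ji of Eq. (2)
V = rng.normal(0, 0.1, (n_vocab, n_hidden))             # weights v_kj of Eq. (3)

def sigmoid(z):                       # f(z) of Eq. (4)
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):                       # g(z) of Eq. (4), shifted for numerical stability
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(word_index, s_prev):
    w = np.zeros(n_vocab)
    w[word_index] = 1.0               # 1-of-N coding of the current word w(t)
    x = np.concatenate([w, s_prev])   # Eq. (1): x(t) = [w(t)^T s(t-1)^T]^T
    s = sigmoid(U @ x)                # Eq. (2): hidden layer s(t)
    y = softmax(V @ s)                # Eq. (3): distribution over the next word
    return s, y

s = np.zeros(n_hidden)                # empty history at the start of a text
s, y = forward(42, s)                 # y[k] = P(next word = k | current word, history)

Calling forward repeatedly while feeding s back in implements the recurrence of Figure 1.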
Fig. 2. Linear interpolation of different RNN models trained by BPTT (x-axis: BPTT step; curves shown: average over 4 models, mixture of 4 models, KN5 baseline).

[...] epochs are needed. However, a valid question is whether simple backpropagation (BP) is sufficient to train the network properly - if we assume that the prediction of the next word is influenced by information which was present several time steps back, there is no guarantee that the network will learn to keep this information in the hidden layer.
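To make the difference between plain BP and BPTT concrete, the following Python sketch shows how the error of the current prediction can be unfolded back through tau time steps of the recurrence of Equations (1)-(3). The update scheme, the learning rate and all names are simplifying assumptions of this sketch, not the exact training procedure used for the reported results.

import numpy as np

# Illustrative sketch of truncated backpropagation through time (BPTT) for the
# Elman network of Eqs. (1)-(3), with softmax output and cross-entropy loss.
# Names, learning rate and update scheme are assumed for this sketch only.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def bptt_update(U, V, xs, ss, target, tau, lr=0.1):
    """One gradient step. xs[t] and ss[t] are the stored input vectors x(t)
    and hidden states s(t) of the most recent time steps (most recent last);
    target is the index of the next word. The error of the current prediction
    is propagated back through tau time steps of the recurrence."""
    n_vocab = V.shape[0]
    y = softmax(V @ ss[-1])
    d_out = y.copy()
    d_out[target] -= 1.0                            # gradient at the output layer
    dV = np.outer(d_out, ss[-1])

    dU = np.zeros_like(U)
    d_h = (V.T @ d_out) * ss[-1] * (1.0 - ss[-1])   # error entering the hidden layer
    for step in range(1, tau + 1):                  # unfold the recurrence in time
        t = len(ss) - step
        dU += np.outer(d_h, xs[t])                  # contribution of this unfolded step
        if t == 0:
            break
        W_rec = U[:, n_vocab:]                      # part of U acting on s(t-1) in x(t)
        d_h = (W_rec.T @ d_h) * ss[t - 1] * (1.0 - ss[t - 1])

    U -= lr * dU                                    # simple gradient descent update
    V -= lr * dV

With tau = 1 this reduces to plain BP; larger tau lets the gradient reach the weights that wrote information into the hidden layer several words earlier.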
4. SPEEDUP TECHNIQUES

[...] classes precisely. The classical Brown clustering is usually not very useful, as its computational complexity is too high and it is often faster to estimate the full neural network model.

4.1. Factorization of the output layer

We can go further and assume that the probabilities of words within a certain class do not depend just on the probability of the class itself, but also on the history - in the context of neural networks, that is the hidden layer s(t). We can change Equation 6 to

P(w_i \mid \text{history}) = P(c_i \mid s(t)) \, P(w_i \mid c_i, s(t))     (8)

The corresponding RNN architecture is shown in Figure 4. This idea has already been explored by Morin [13] (and in the context of Maximum Entropy models by Goodman [14]), who extended it further by assuming that the vocabulary can be represented by a hierarchical binary tree. The drawback of Morin's approach was the dependence on WordNet for obtaining word similarity information, which can be unavailable for certain domains or languages.

In our work, we have implemented a simple factorization of the output layer using classes. Words are assigned to classes proportionally, while respecting their frequencies (this is sometimes referred to as 'frequency binning'). The number of classes is a parameter. For example, if we choose 20 classes, words that correspond to the first 5% of the unigram probability distribution would be mapped to class 1 (with the Penn Corpus, this would correspond to token 'the' as [...]

The activation function g for both these distributions is again the softmax (Equation 4). Thus, we have the probability distribution both for classes and for words within a class, and we can evaluate Equation 8. The error vector is computed for both distributions, and then we follow the backpropagation algorithm, so the errors computed in the word-based and the class-based parts of the network are summed together in the hidden layer. The advantage of this approach is that the network still uses the whole hidden layer to estimate a (potentially) full probability distribution over the full vocabulary, while the factorization allows us to evaluate just a subset of the output layer during both the training and the test phases. Based on the results shown in Table 3, we can conclude that fast evaluation of the output layer via classes leads to around 15 times speedup against the model that uses the full vocabulary (10K), at a small cost in accuracy. The non-linear behaviour of the reported time complexity is caused by the constant term (1 + H) × H × τ and also by suboptimal usage of the cache with large matrices. With C = 1 and C = V, the model is equivalent to the full RNN model.

¹ As suggested to us by Y. Bengio, the τ term can practically disappear from the computational complexity, provided that the update of weights is not done at every time step [11].

² After this paper was written, we have found that Emami [18] has proposed a similar technique for reducing computational complexity, by assigning words into statistically derived classes. The novelty of our approach is thus in showing that simple frequency binning is adequate to obtain reasonable performance.
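A minimal Python sketch of the frequency binning and of the factorized evaluation of Equation (8) follows. The function names and the handling of class boundaries are assumptions of this sketch rather than the exact procedure of the rnnlm toolkit.

import numpy as np

# Sketch of frequency-binning class assignment and of the factorized output
# layer of Eq. (8). Names and boundary handling are assumed for this sketch.

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def assign_classes(unigram_counts, n_classes):
    """Frequency binning: sort words by unigram frequency and cut the
    cumulative probability mass into n_classes roughly equal bins."""
    probs = unigram_counts / unigram_counts.sum()
    order = np.argsort(-probs)                     # most frequent words first
    word2class = np.zeros(len(probs), dtype=int)
    running, cls = 0.0, 0
    for w in order:
        word2class[w] = cls
        running += probs[w]
        if running > (cls + 1) / n_classes and cls < n_classes - 1:
            cls += 1                               # move to the next bin of mass
    return word2class

def word_probability(s, word, word2class, V_class, V_word):
    """Eq. (8): P(word | history) = P(class | s(t)) * P(word | class, s(t)).
    Only the class layer and the members of a single class are evaluated."""
    c = word2class[word]
    members = np.flatnonzero(word2class == c)      # words that share class c
    p_class = softmax(V_class @ s)                 # distribution over classes
    p_within = softmax(V_word[members] @ s)        # distribution within the class
    return p_class[c] * p_within[members.tolist().index(word)]

On average, only about C + V/C output activations have to be computed per word instead of V, which is where the reported speedup comes from.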
Table 3. Perplexities on the Penn corpus with factorization of the output layer by the class model. All models have the same basic configuration (200 hidden units and BPTT=5). The Full model is a baseline and does not use classes, but the whole 10K vocabulary.

Classes   RNN   RNN+KN5   Min/epoch   Sec/test
30        134   112       12.8        8.8
50        136   114       9.8         6.7
100       136   114       9.1         5.6
200       136   113       9.5         6.0
400       134   112       10.9        8.1
1000      131   111       16.1        15.7
2000      128   109       25.3        28.7
4000      127   108       44.4        57.8
6000      127   109       70          96.5
8000      124   107       107         148
Full      123   106       154         212

4.2. Compression layer

Alternatively, we can think about the two parts of the original recurrent network separately: first, there is a matrix U responsible for the input and for the recurrent connections that maintain the short term memory, and then a matrix V that is used to obtain the probability distribution in the output layer. Both weight matrices share the same hidden layer; however, while matrix U needs this vector to maintain all the short term memory, storing information for possibly several time steps, matrix V needs only the information contained in the hidden layer that is needed to calculate the probability distribution for the immediately following word³. To reduce the size of the weight matrix V, we can use an additional compression layer between the hidden and output layers. We have used a sigmoid activation function for the compression layer, thus this projection is non-linear.

³ Alternatively, we can ask if the rank of the matrix V is full.

A compression layer not only reduces computational complexity, but also reduces the total number of parameters, which results in more compact models. It is also possible to use a similar compression layer between the input and hidden layers to further reduce the size of the models (such a layer is usually referred to as a projection layer). The empirical results show that with a growing amount of training data, the hidden layer needs to be increased to allow the model to store more information. Thus, the idea of using a compression layer is mostly useful when a large amount of training data is used. We plan to report results with compression layers in the future.
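The following Python sketch illustrates the compression layer between the hidden and output layers. All sizes are chosen only for illustration and do not correspond to the configurations used in the experiments.

import numpy as np

# Sketch of a compression (bottleneck) layer between the hidden and output
# layers, as described in Section 4.2. Sizes and names are assumed; the point
# is the reduction of the output parameters from H*V to H*C + C*V.

n_vocab, n_hidden, n_compress = 10000, 500, 100
rng = np.random.default_rng(0)
H2C = rng.normal(0, 0.1, (n_compress, n_hidden))   # hidden -> compression layer
C2V = rng.normal(0, 0.1, (n_vocab, n_compress))    # compression -> output layer

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def output_distribution(s):
    c = sigmoid(H2C @ s)        # non-linear projection (sigmoid, as in the paper)
    return softmax(C2V @ c)     # distribution over the whole vocabulary

# Output-part parameter count for these assumed sizes: 500*10000 = 5,000,000
# without compression versus 500*100 + 100*10000 = 1,050,000 with it.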
5. CONCLUSION AND FUTURE WORK

We presented, to our knowledge, the first published results for RNN language models trained by BPTT in the context of statistical language modeling. The comparison to standard feedforward neural network based language models, as well as the comparison to BP-trained RNN models, clearly shows the potential of the presented model. Furthermore, we have shown how to obtain significantly better accuracy of RNN models by combining them linearly. The resulting mixture of RNN models attains perplexity 96 on the well-known Penn corpus, which is significantly better than the best previously published result on this setup [10]. In future work, we plan to show how to further improve accuracy by combining statically and dynamically evaluated RNN models [4] and by using complementary language modeling techniques to obtain even lower perplexity. In our ongoing ASR experiments, we have observed a good correlation between perplexity improvements and word error rate reductions.
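For clarity, a small Python sketch of such a linear combination of models follows; the uniform weights are an assumption of this sketch, not the interpolation weights used to obtain the reported perplexity.

import math

# Sketch of linear interpolation of language models: the mixture probability of
# each word is a weighted average of the probabilities assigned by the
# individual models. Uniform weights are assumed here; in general the weights
# can be tuned on held-out data.

def interpolate(prob_streams, weights):
    """prob_streams[m][i] = probability that model m assigns to the i-th word
    of the evaluation text; returns the per-word mixture probabilities."""
    return [sum(w * probs[i] for w, probs in zip(weights, prob_streams))
            for i in range(len(prob_streams[0]))]

def perplexity(word_probs):
    return math.exp(-sum(math.log(p) for p in word_probs) / len(word_probs))

# e.g. four RNN models combined with equal weights:
# mixture = interpolate([p_rnn1, p_rnn2, p_rnn3, p_rnn4], [0.25] * 4)
# ppl = perplexity(mixture)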
Next, we have shown several possibilities of reducing the computational and space complexity by using classes, factorization of the output layer, and compression layers. Combinations of these techniques lead to efficient training on very large corpora - we plan to describe our current experiments that involve models trained on much more than 100M words while using a non-truncated vocabulary.

Finally, we plan to show that the resulting models can be efficiently used in state of the art systems that use very good baseline acoustic and language models based on huge amounts of in-domain data, and that the additional processing cost of using RNN models does not need to be impractically high when exploiting the techniques described in this paper. For that purpose, we published a freely available toolkit for training RNN language models, which is available at http://www.fit.vutbr.cz/~imikolov/rnnlm/.

6. REFERENCES

[1] Yoshua Bengio, Rejean Ducharme, Pascal Vincent. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137-1155, 2003.
[2] Joshua T. Goodman. A bit of progress in language modeling, extended version. Technical report MSR-TR-2001-72, 2001.
[3] Holger Schwenk, Jean-Luc Gauvain. Training neural network language models on very large corpora. In Proc. Joint Conference HLT/EMNLP, October 2005.
[4] Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, Sanjeev Khudanpur. Recurrent neural network based language model. In Proc. INTERSPEECH 2010.
[5] Y. Bengio, Y. LeCun. Scaling learning algorithms towards AI. In Large-Scale Kernel Machines, MIT Press, 2007.
[6] Jeffrey L. Elman. Finding structure in time. Cognitive Science, 14:179-211, 1990.
[7] Mikael Bodén. A guide to recurrent neural networks and backpropagation. In the Dallas project, 2002.
[8] Peng Xu. Random forests and the data sparseness problem in language modeling. Ph.D. thesis, Johns Hopkins University, 2005.
[9] Denis Filimonov, Mary Harper. A joint language model with fine-grain syntactic tags. In EMNLP, 2009.
[10] Ahmad Emami, Frederick Jelinek. Exact training of a neural syntactic language model. In ICASSP 2004.
[11] D. E. Rumelhart, G. E. Hinton, R. J. Williams. Learning internal representations by back-propagating errors. Nature, 323:533-536, 1986.
[12] Tomáš Mikolov, Jiří Kopecký, Lukáš Burget, Ondřej Glembek, Jan Černocký. Neural network based language models for highly inflective languages. In Proc. ICASSP 2009.
[13] F. Morin, Y. Bengio. Hierarchical probabilistic neural network language model. In AISTATS 2005.
[14] J. Goodman. Classes for fast maximum entropy training. In Proc. ICASSP 2001.
[15] A. Alexandrescu, K. Kirchhoff. Factored neural language models. In HLT-NAACL, 2006.
[16] Yoshua Bengio, Patrice Simard, Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5:157-166.
[17] Y. Bengio, J.-S. Senecal. Adaptive importance sampling to accelerate training of a neural probabilistic language model. IEEE Transactions on Neural Networks, 2008.
[18] Ahmad Emami. A neural syntactic language model. Ph.D. thesis, Johns Hopkins University, 2006.