
EXTENSIONS OF RECURRENT NEURAL NETWORK LANGUAGE MODEL

Tomáš Mikolov (1,2), Stefan Kombrink (1), Lukáš Burget (1), Jan “Honza” Černocký (1), Sanjeev Khudanpur (2)
(1) Brno University of Technology, Speech@FIT, Czech Republic
(2) Department of Electrical and Computer Engineering, Johns Hopkins University, USA
{imikolov,kombrink,burget,cernocky}@fit.vutbr.cz, [email protected]

ABSTRACT

We present several modifications of the original recurrent neural network language model (RNN LM). While this model has been shown to significantly outperform many competitive language modeling techniques in terms of accuracy, the remaining problem is its computational complexity. In this work, we show approaches that lead to more than 15 times speedup for both the training and testing phases. Next, we show the importance of using the backpropagation through time algorithm. An empirical comparison with feedforward networks is also provided. Finally, we discuss possibilities for reducing the number of parameters in the model. The resulting RNN model can thus be smaller, faster during both training and testing, and more accurate than the basic one.

Index Terms— language modeling, recurrent neural networks, speech recognition
1. INTRODUCTION

Statistical models of natural language are a key part of many systems today. The most widely known applications are automatic speech recognition (ASR), machine translation (MT) and optical character recognition (OCR). In the past, there was always a struggle between those who follow the statistical way and those who claim that we need to adopt linguistics and expert knowledge to build models of natural language. The most serious criticism of statistical approaches is that there is no true understanding occurring in these models, which are typically limited by the Markov assumption and are represented by n-gram models. Prediction of the next word is often conditioned on just two preceding words, which is clearly insufficient to capture semantics. On the other hand, the criticism of linguistic approaches was even more straightforward: despite all the efforts of linguists, statistical approaches dominated whenever performance in real-world applications was the measure.

Thus, there has been a lot of research effort in the field of statistical language modeling. Among models of natural language, neural network based models seemed to outperform most of the competition [1][2], and were also showing steady improvements in state-of-the-art speech recognition systems [3]. The main power of neural network based language models seems to be in their simplicity: almost the same model can be used for prediction of many types of signals, not just language. These models implicitly perform clustering of words in a low-dimensional space. Prediction based on this compact representation of words is then more robust, and no additional smoothing of probabilities is required.

Among the many subsequent modifications of the original model, the recurrent neural network based language model [4] provides further generalization: instead of considering just several preceding words, neurons with input from recurrent connections are assumed to represent short term memory. The model learns from the data how to represent memory. While shallow feedforward neural networks (those with just one hidden layer) can only cluster similar words, a recurrent neural network (which can be considered a deep architecture [5]) can perform clustering of similar histories. This allows, for instance, an efficient representation of patterns with variable length.

In this work, we show the importance of the backpropagation through time algorithm for learning appropriate short term memory. Then we show how to further improve the original RNN LM by decreasing its computational complexity. In the end, we briefly discuss possibilities of reducing the size of the resulting model.

(This work was partly supported by European project DIRAC (FP6-027787), Grant Agency of Czech Republic project No. 102/08/0707, Czech Ministry of Education project No. MSM0021630528 and by BUT FIT grant No. FIT-10-S-2.)

2. MODEL DESCRIPTION

Fig. 1. Simple recurrent neural network.

The recurrent neural network described in [4] is also called an Elman network [6]. Its architecture is shown in Figure 1. The vector x(t) is formed by concatenating the vector w(t), which represents the current word using 1-of-N coding (thus its size is equal to the size of the vocabulary), and the vector s(t - 1), which holds the output values of the hidden layer from the previous time step. The network is trained using standard backpropagation and contains input, hidden and output layers. Values in these layers are computed as follows:

    x(t) = [w(t)^T  s(t-1)^T]^T                          (1)

    s_j(t) = f( \sum_i x_i(t) u_{ji} )                   (2)

    y_k(t) = g( \sum_j s_j(t) v_{kj} )                   (3)

where f(z) and g(z) are the sigmoid and softmax activation functions (the softmax function in the output layer is used to make sure that the outputs form a valid probability distribution, i.e. all outputs are greater than 0 and their sum is 1):

    f(z) = 1 / (1 + e^{-z}),    g(z_m) = e^{z_m} / \sum_k e^{z_k}          (4)

The cross entropy criterion is used to obtain an error vector in the output layer, which is then backpropagated to the hidden layer. The training algorithm uses validation data for early stopping and to control the learning rate. Training iterates over all the training data in several epochs before convergence is achieved; usually 10-20 epochs are needed. However, a valid question is whether simple backpropagation (BP) is sufficient to train the network properly: if we assume that the prediction of the next word is influenced by information which was present several time steps back, there is no guarantee that the network will learn to keep this information in the hidden layer. While the network can remember such information, it is more by luck than by design.

Table 1. Comparison of different language modeling techniques on Penn Corpus. Models are interpolated with the KN backoff model.

Model                               PPL
KN5                                 141
Random forest (Peng Xu) [8]         132
Structured LM (Filimonov) [9]       125
Syntactic NN LM (Emami) [10]        107
RNN trained by BP                   113
RNN trained by BPTT                 106
4x RNN trained by BPTT (mixture)     98

Fig. 2. Linear interpolation of different RNN models trained by BPTT (perplexity on the Penn corpus versus the number of RNN models, for the RNN mixture alone and for the RNN mixture + KN5).

Fig. 3. Effect of BPTT training on Penn Corpus. BPTT=1 corresponds to standard backpropagation (perplexity versus the BPTT step, for the average over 4 models, the mixture of 4 models, and the KN5 baseline).
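To make Equations (1)-(4) concrete, the following is a minimal sketch of the forward pass of the simple RNN LM in Python/NumPy. The array names (U for the input-plus-recurrent weights, V for the output weights) follow the notation above; the sizes and the random initialization are illustrative assumptions, not the configuration used in the experiments.

    import numpy as np

    def softmax(z):
        z = z - np.max(z)               # shift for numerical stability
        e = np.exp(z)
        return e / e.sum()

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    N, H = 10000, 200                   # vocabulary size and hidden layer size (illustrative)
    rng = np.random.default_rng(0)
    U = rng.normal(0, 0.1, (H, N + H))  # weights from x(t) = [w(t); s(t-1)] to the hidden layer, Eq. (2)
    V = rng.normal(0, 0.1, (N, H))      # weights from the hidden layer to the output layer, Eq. (3)

    def forward(word_index, s_prev):
        """One time step: return (y, s) = (distribution over the next word, new hidden state)."""
        x = np.zeros(N + H)             # Eq. (1): concatenation of the 1-of-N word vector and s(t-1)
        x[word_index] = 1.0
        x[N:] = s_prev
        s = sigmoid(U @ x)              # Eq. (2)
        y = softmax(V @ s)              # Eq. (3), with the softmax g from Eq. (4)
        return y, s

    s = np.zeros(H)                     # initial hidden state
    y, s = forward(42, s)               # y is a valid probability distribution over the vocabulary
    print(y.sum())                      # 1.0 up to rounding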
3. BACKPROPAGATION THROUGH TIME

Backpropagation through time (BPTT) [11] can be seen as an extension of the backpropagation algorithm for recurrent networks. With truncated BPTT, the error is propagated through the recurrent connections back in time for a specific number of time steps (here referred to as τ). Thus, when trained by BPTT, the network learns to keep information in the hidden layer for several time steps. Additional information and practical advice for implementing the BPTT algorithm are given in [7].
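Below is a minimal sketch of one truncated-BPTT update for the network defined in Section 2 (Python/NumPy, reusing softmax from the previous sketch). The unfolding depth tau, the learning rate and the buffer handling are illustrative assumptions; this is not the exact training code used in the experiments.

    import numpy as np

    # One truncated-BPTT update: the output error at time t is pushed back through
    # the recurrent part of U for up to tau previous time steps. `states` holds the
    # hidden vectors [..., s(t-1), s(t)] and `inputs` the word indices [..., w(t-1), w(t)].
    def bptt_step(U, V, states, inputs, target_index, tau=5, lr=0.1):
        N, H = V.shape                              # vocabulary size, hidden layer size
        s_t = states[-1]
        y = softmax(V @ s_t)                        # Eq. (3)
        e_out = -y
        e_out[target_index] += 1.0                  # cross-entropy error vector at the output
        dV = np.outer(e_out, s_t)
        dU = np.zeros_like(U)
        e_h = (V.T @ e_out) * s_t * (1.0 - s_t)     # error at the hidden layer at time t
        steps = min(tau, len(states) - 1)           # cannot unfold past the stored history
        for k in range(steps):
            s_prev = states[-2 - k]                 # s(t-k-1)
            x = np.zeros(N + H)                     # rebuild x(t-k) = [w(t-k); s(t-k-1)], Eq. (1)
            x[inputs[-1 - k]] = 1.0
            x[N:] = s_prev
            dU += np.outer(e_h, x)                  # gradient contribution of unfolded step t-k
            # push the error one step further back through the recurrent weights
            e_h = (U[:, N:].T @ e_h) * s_prev * (1.0 - s_prev)
        U += lr * dU                                # plain gradient step, no regularization
        V += lr * dV

In an actual implementation, the states and inputs buffers only need to keep the last tau + 1 entries, and the weight update does not have to be performed at every time step (see footnote 1 in Section 4).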
The data used in the following experiments were obtained from the Penn Treebank: sections 0-20 were used as training data (about 930K tokens), sections 21-22 as validation data (74K) and sections 23-24 as test data (82K). The vocabulary is limited to 10K words. The processing of the data is exactly the same as that used by [10] and other researchers. For a comparison of techniques, see Table 1. KN5 denotes the baseline: an interpolated 5-gram model with modified Kneser-Ney smoothing and no count cutoffs.
To improve results, it is often better to train several networks (that differ either in the random initialization of the weights or also in the number of parameters) than to have one huge network. The combination of these networks is done by linear interpolation with equal weights assigned to each model (note the similarity to random forests, which are composed of different decision trees [8]). The combination of various numbers of models is shown in Figure 2.
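As an illustration, equal-weight linear interpolation simply averages the per-word probabilities of the individual models before taking the log. A minimal sketch follows, assuming each model exposes a prob(word, history) method; this interface is a hypothetical convenience, not part of the paper.

    import math

    def mixture_log10prob(models, word, history):
        # Equal-weight interpolation of K models: P(w|h) = (1/K) * sum_k P_k(w|h)
        p = sum(m.prob(word, history) for m in models) / len(models)
        return math.log10(p)

    def perplexity(models, test_words):
        # PPL = 10 ** ( -(1/N) * sum_i log10 P(w_i | h_i) )
        logp = 0.0
        for i, w in enumerate(test_words):
            logp += mixture_log10prob(models, w, test_words[:i])
        return 10.0 ** (-logp / len(test_words))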
Figure 3 shows the importance of the number of time steps τ in BPTT. To reduce noise, the results are reported as the average of the perplexities given by four models with different RNN configurations (250, 300, 350 and 400 neurons in the hidden layer). Also, a combination of these models is shown (again, linear interpolation was used). As can be seen, 4-5 steps of BPTT training seem to be sufficient. Note that while the complexity of the training phase increases with the number of steps for which the error is propagated back in time, the complexity of the test phase is constant.

Table 2 shows a comparison of the feedforward [12], simple recurrent [4] and BPTT-trained recurrent neural network language models on two corpora. Perplexity is shown on the test sets for the configurations of the networks that worked best on the development sets. We can see that the simple recurrent neural network already outperforms the standard feedforward network, while BPTT training provides another significant improvement.

Table 2. Comparison of different neural network architectures on Penn Corpus (1M words) and Switchboard (4M words).

                         Penn Corpus         Switchboard
Model                    NN      NN+KN       NN      NN+KN
KN5 (baseline)           -       141         -       92.9
feedforward NN           141     118         85.1    77.5
RNN trained by BP        137     113         81.3    75.4
RNN trained by BPTT      123     106         77.5    72.5

4. SPEEDUP TECHNIQUES

The time complexity of one training step is proportional to

    O = (1 + H) × H × τ + H × V                          (5)

where H is the size of the hidden layer, V the size of the vocabulary and τ the number of steps for which we backpropagate the error back in time (see footnote 1). Usually H << V, so the computational bottleneck is between the hidden and output layers. This has motivated several researchers to investigate possibilities for reducing this huge weight matrix. Originally, Bengio [1] merged all low-frequency words into one special token in the output vocabulary, which usually results in a 2-3 times speedup without significant degradation of performance. This idea was later extended: instead of using a unigram distribution for the words that belong to the special token, Schwenk [3] used probabilities from a backoff model for the rare words.

(Footnote 1: As suggested to us by Y. Bengio, the τ term can practically disappear from the computational complexity, provided that the update of the weights is not done at every time step [11].)
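To see why the H × V term dominates, the two terms of Equation (5) can be evaluated for a typical configuration; the concrete numbers below (H = 200, V = 10000, τ = 5, matching the Penn Treebank setup used here) are only an illustration of the formula.

    H, V, tau = 200, 10000, 5
    recurrent_part = (1 + H) * H * tau    # (1 + H) x H x tau = 201,000 operations
    output_part = H * V                   # H x V = 2,000,000 operations
    print(recurrent_part, output_part)    # the hidden-to-output part is roughly 10x larger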
An even more promising approach was based on the assumption that words can be mapped to classes [13][14]. If we assume that each word belongs to exactly one class, we can first estimate the probability distribution over the classes using the RNN and then compute the probability of a particular word from the desired class while assuming a unigram distribution of words within the class:

    P(w_i | history) = P(c_i | history) P(w_i | c_i)                  (6)

This reduces the computational complexity to

    O = (1 + H) × H × τ + H × C,                         (7)

where C is the number of classes. While this architecture has obvious advantages over the previously mentioned approaches, as C can be an order of magnitude smaller than V without sacrificing much accuracy, the performance depends heavily on our ability to estimate the classes precisely. Classical Brown clustering is usually not very useful here, as its computational complexity is too high and it is often faster to estimate the full neural network model.
4.1. Factorization of the output layer

We can go further and assume that the probabilities of words within a certain class do not depend just on the probability of the class itself, but also on the history - in the context of neural networks, that is the hidden layer s(t). We can change Equation 6 to

    P(w_i | history) = P(c_i | s(t)) P(w_i | c_i, s(t))               (8)

The corresponding RNN architecture is shown in Figure 4. This idea has already been explored by Morin [13] (and, in the context of Maximum Entropy models, by Goodman [14]), who extended it further by assuming that the vocabulary can be represented by a hierarchical binary tree. The drawback of Morin's approach was its dependence on WordNet for obtaining word similarity information, which can be unavailable for certain domains or languages.

Fig. 4. RNN with output layer factorized by class layer.
In our work, we have implemented a simple factorization of the output layer using classes. Words are assigned to classes proportionally, while respecting their frequencies (this is sometimes referred to as 'frequency binning'). The number of classes is a parameter. For example, if we choose 20 classes, the words that correspond to the first 5% of the unigram probability distribution would be mapped to class 1 (with the Penn Corpus, this would correspond to the token 'the', as its unigram probability is about 5%), the words that correspond to the next 5% of the unigram probability mass would be mapped to class 2, etc. Thus, the first classes can hold just single words, while the last classes cover thousands of low-frequency words (see footnote 2).

(Footnote 2: After this paper was written, we found that Emami [18] has proposed a similar technique for reducing computational complexity, by assigning words into statistically derived classes. The novelty of our approach is thus in showing that simple frequency binning is adequate to obtain reasonable performance.)

Instead of computing a probability distribution over all words as specified in (3), we first estimate a probability distribution over the classes and then a distribution over the words from a single class, the one that contains the predicted word:

    c_l(t) = g( \sum_j s_j(t) w_{lj} )                   (9)

    y_c(t) = g( \sum_j s_j(t) v_{cj} )                   (10)

The activation function g for both these distributions is again the softmax (Equation 4). Thus, we have the probability distributions both for the classes and for the words within the class that we are interested in, and we can evaluate Equation 8. The error vector is computed for both distributions, and then we follow the backpropagation algorithm, so the errors computed in the word-based and the class-based parts of the network are summed together in the hidden layer. The advantage of this approach is that the network still uses the whole hidden layer to estimate a (potentially) full probability distribution over the full vocabulary, while the factorization allows us to evaluate just a subset of the output layer during both the training and the test phases. Based on the results shown in Table 3, we can conclude that fast evaluation of the output layer via classes leads to around a 15 times speedup against the model that uses the full vocabulary (10K), at a small cost in accuracy. The non-linear behaviour of the reported time complexity is caused by the constant term (1 + H) × H × τ and also by suboptimal cache usage with large matrices. With C = 1 and C = V, the model is equivalent to the full RNN model.
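A minimal sketch of the frequency binning and of the factorized evaluation of Equation (8) is given below (Python, reusing softmax from the Section 2 sketch). The variable names W for the class weights and V for the word weights, and the helper data structures, are illustrative assumptions rather than the paper's implementation.

    import numpy as np
    from collections import Counter
    # `softmax` is the function from the Section 2 sketch; s, W, V are NumPy arrays.

    def assign_classes(train_word_ids, num_classes):
        """Frequency binning: each class covers roughly 1/num_classes of the unigram
        probability mass, so the first classes contain very few (often single) frequent words."""
        counts = Counter(train_word_ids)
        total = sum(counts.values())
        word_class = {}
        mass, cls = 0.0, 0
        for w, c in counts.most_common():           # most frequent words first
            word_class[w] = cls
            mass += c / total
            if mass > (cls + 1) / num_classes and cls < num_classes - 1:
                cls += 1
        return word_class

    def factorized_prob(s, W, V, word, word_class, class_members):
        """Equation (8): P(w | history) = P(class(w) | s(t)) * P(w | class(w), s(t)).
        W holds the class weights of Eq. (9), V the word weights of Eq. (10); only the
        rows of V that belong to the target word's class are evaluated."""
        c = word_class[word]
        p_class = softmax(W @ s)                    # Eq. (9): distribution over classes
        members = class_members[c]                  # word ids belonging to class c
        p_words = softmax(V[np.asarray(members)] @ s)   # Eq. (10): distribution within class c
        return p_class[c] * p_words[members.index(word)]

    # Building the inverse mapping from the class assignment (hypothetical usage):
    # word_class = assign_classes(train_ids, num_classes=100)
    # class_members = {}
    # for w, c in word_class.items():
    #     class_members.setdefault(c, []).append(w)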

Table 3. Perplexities on Penn corpus with factorization of the output layer by the class model. All models have the same basic configuration (200 hidden units and BPTT=5). The Full model is a baseline and does not use classes, but the whole 10K vocabulary.

Classes   RNN   RNN+KN5   Min/epoch   Sec/test
30        134   112       12.8        8.8
50        136   114       9.8         6.7
100       136   114       9.1         5.6
200       136   113       9.5         6.0
400       134   112       10.9        8.1
1000      131   111       16.1        15.7
2000      128   109       25.3        28.7
4000      127   108       44.4        57.8
6000      127   109       70          96.5
8000      124   107       107         148
Full      123   106       154         212
4.2. Compression layer

Alternatively, we can think about the two parts of the original recurrent network separately: first, there is a matrix U responsible for the input and for the recurrent connections that maintain the short term memory, and then a matrix V that is used to obtain the probability distribution in the output layer. Both weight matrices share the same hidden layer; however, while matrix U needs this vector to maintain all the short term memory, storing information for possibly several time steps, matrix V needs only the information contained in the hidden layer that is required to calculate the probability distribution for the immediately following word (see footnote 3). To reduce the size of the weight matrix V, we can use an additional compression layer between the hidden and output layers. We have used the sigmoid activation function for the compression layer, so this projection is non-linear.

(Footnote 3: Alternatively, we can ask whether the rank of the matrix V is full.)

A compression layer not only reduces the computational complexity, but also reduces the total number of parameters, which results in more compact models. It is also possible to use a similar compression layer between the input and hidden layers to further reduce the size of the models (such a layer is usually referred to as a projection layer). The empirical results show that with a growing amount of training data, the hidden layer needs to be increased to allow the model to store more information. Thus, the idea of using a compression layer is mostly useful when a large amount of training data is used. We plan to report results with compression layers in the future.
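As a rough illustration of the parameter savings (the sizes below are assumptions chosen for the example, not results from the paper): with a hidden layer of size H and a vocabulary of size V, the output matrix alone has H × V weights; inserting a compression layer of size P between the hidden and output layers replaces these with H × P + P × V weights.

    H, V, P = 400, 100000, 100           # hypothetical sizes: hidden, vocabulary, compression layer
    without_compression = H * V          # 40,000,000 output-side weights
    with_compression = H * P + P * V     # 40,000 + 10,000,000 = 10,040,000 weights
    print(without_compression, with_compression)   # roughly a 4x reduction in this example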
5. CONCLUSION AND FUTURE WORK

We presented, to our knowledge, the first published results when using an RNN trained by BPTT in the context of statistical language modeling. The comparison to standard feedforward neural network based language models, as well as the comparison to BP-trained RNN models, clearly shows the potential of the presented model. Furthermore, we have shown how to obtain significantly better accuracy of RNN models by combining them linearly. The resulting mixture of RNN models attains perplexity 96 on the well-known Penn corpus, which is significantly better than the best previously published result on this setup [10]. In future work, we plan to show how to further improve accuracy by combining statically and dynamically evaluated RNN models [4] and by using complementary language modeling techniques to obtain even much lower perplexity. In our ongoing ASR experiments, we have observed a good correlation between perplexity improvements and word error rate reduction.

Next, we have shown several possibilities for reducing the computational and space complexity: using classes, factorization of the output layer and compression layers. Combinations of these techniques lead to efficient training on very large corpora; we plan to describe our current experiments that involve models trained on much more than 100M words while using a non-truncated vocabulary.

Finally, we plan to show that the resulting models can be efficiently used in state of the art systems that use very good baseline acoustic and language models based on huge amounts of in-domain data, and that the additional processing cost of using RNN models does not need to be impractically high when exploiting the techniques described in this paper. For that purpose, we have published a freely available toolkit for training RNN language models at http://www.fit.vutbr.cz/~imikolov/rnnlm/.

6. REFERENCES

[1] Yoshua Bengio, Rejean Ducharme and Pascal Vincent. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137-1155, 2003.
[2] Joshua T. Goodman. A bit of progress in language modeling, extended version. Technical report MSR-TR-2001-72, 2001.
[3] Holger Schwenk, Jean-Luc Gauvain. Training neural network language models on very large corpora. In Proc. Joint Conference HLT/EMNLP, October 2005.
[4] Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, Sanjeev Khudanpur. Recurrent neural network based language model. In Proc. INTERSPEECH 2010.
[5] Y. Bengio, Y. LeCun. Scaling learning algorithms towards AI. In Large-Scale Kernel Machines, MIT Press, 2007.
[6] Jeffrey L. Elman. Finding structure in time. Cognitive Science, 14:179-211, 1990.
[7] Mikael Bodén. A guide to recurrent neural networks and backpropagation. In the Dallas project, 2002.
[8] Peng Xu. Random forests and the data sparseness problem in language modeling. Ph.D. thesis, Johns Hopkins University, 2005.
[9] Denis Filimonov and Mary Harper. A joint language model with fine-grain syntactic tags. In EMNLP, 2009.
[10] Ahmad Emami, Frederick Jelinek. Exact training of a neural syntactic language model. In ICASSP 2004.
[11] D. E. Rumelhart, G. E. Hinton, R. J. Williams. Learning internal representations by back-propagating errors. Nature, 323:533-536, 1986.
[12] Tomáš Mikolov, Jiří Kopecký, Lukáš Burget, Ondřej Glembek and Jan Černocký. Neural network based language models for highly inflective languages. In Proc. ICASSP 2009.
[13] F. Morin, Y. Bengio. Hierarchical probabilistic neural network language model. In AISTATS 2005.
[14] J. Goodman. Classes for fast maximum entropy training. In Proc. ICASSP 2001.
[15] A. Alexandrescu, K. Kirchhoff. Factored neural language models. In HLT-NAACL, 2006.
[16] Yoshua Bengio, Patrice Simard and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5:157-166.
[17] Y. Bengio, J.-S. Senecal. Adaptive importance sampling to accelerate training of a neural probabilistic language model. IEEE Transactions on Neural Networks, 2008.
[18] Ahmad Emami. A neural syntactic language model. Ph.D. thesis, Johns Hopkins University, 2006.
