
EXTENSIONS OF RECURRENT NEURAL NETWORK LANGUAGE MODEL

Tomáš Mikolov (1,2), Stefan Kombrink (1), Lukáš Burget (1), Jan “Honza” Černocký (1), Sanjeev Khudanpur (2)
(1) Brno University of Technology, Speech@FIT, Czech Republic
(2) Department of Electrical and Computer Engineering, Johns Hopkins University, USA
{imikolov,kombrink,burget,cernocky}@fit.vutbr.cz, [email protected]

ABSTRACT

We present several modifications of the original recurrent neural network language model (RNN LM). While this model has been shown to significantly outperform many competitive language modeling techniques in terms of accuracy, the remaining problem is its computational complexity. In this work, we show approaches that lead to more than 15 times speedup for both the training and testing phases. Next, we show the importance of using the backpropagation through time algorithm. An empirical comparison with feedforward networks is also provided. Finally, we discuss possibilities for reducing the number of parameters in the model. The resulting RNN model can thus be smaller, faster during both training and testing, and more accurate than the basic one.

Index Terms— language modeling, recurrent neural networks, speech recognition
1. INTRODUCTION

Statistical models of natural language are a key part of many systems today. The most widely known applications are automatic speech recognition (ASR), machine translation (MT) and optical character recognition (OCR). In the past, there was always a struggle between those who follow the statistical way and those who claim that we need to adopt linguistics and expert knowledge to build models of natural language. The most serious criticism of statistical approaches is that there is no true understanding occurring in these models, which are typically limited by the Markov assumption and are represented by n-gram models. Prediction of the next word is often conditioned on just two preceding words, which is clearly insufficient to capture semantics. On the other hand, the criticism of linguistic approaches was even more straightforward: despite all the efforts of linguists, statistical approaches dominated whenever performance in real-world applications was the measure.

Thus, there has been a lot of research effort in the field of statistical language modeling. Among models of natural language, neural network based models seemed to outperform most of the competition [1][2], and were also showing steady improvements in state-of-the-art speech recognition systems [3]. The main power of neural network based language models seems to be in their simplicity: almost the same model can be used for prediction of many types of signals, not just language. These models implicitly perform clustering of words in a low-dimensional space. Prediction based on this compact representation of words is then more robust, and no additional smoothing of probabilities is required.

Among the many subsequent modifications of the original model, the recurrent neural network based language model [4] provides further generalization: instead of considering just several preceding words, neurons with input from recurrent connections are assumed to represent short term memory. The model learns from the data how to represent memory. While shallow feedforward neural networks (those with just one hidden layer) can only cluster similar words, a recurrent neural network (which can be considered a deep architecture [5]) can perform clustering of similar histories. This allows, for instance, an efficient representation of patterns with variable length.

In this work, we show the importance of the backpropagation through time algorithm for learning appropriate short term memory. Then we show how to further improve the original RNN LM by decreasing its computational complexity. In the end, we briefly discuss possibilities of reducing the size of the resulting model.

(This work was partly supported by European project DIRAC (FP6-027787), Grant Agency of Czech Republic project No. 102/08/0707, Czech Ministry of Education project No. MSM0021630528 and by BUT FIT grant No. FIT-10-S-2.)

2. MODEL DESCRIPTION

Fig. 1. Simple recurrent neural network.

The recurrent neural network described in [4] is also called an Elman network [6]. Its architecture is shown in Figure 1. The vector x(t) is formed by concatenating the vector w(t), which represents the current word using 1-of-N coding (thus its size is equal to the size of the vocabulary), and the vector s(t - 1), which holds the output values of the hidden layer from the previous time step. The network is trained using standard backpropagation and contains input, hidden and output layers. Values in these layers are computed as follows:

    x(t) = [w(t)^T  s(t-1)^T]^T                          (1)

    s_j(t) = f( \sum_i x_i(t) u_{ji} )                   (2)

    y_k(t) = g( \sum_j s_j(t) v_{kj} )                   (3)

where f(z) and g(z) are the sigmoid and softmax activation functions (the softmax function in the output layer is used to make sure that the outputs form a valid probability distribution, i.e. all outputs are greater than 0 and their sum is 1):

    f(z) = 1 / (1 + e^{-z}),    g(z_m) = e^{z_m} / \sum_k e^{z_k}          (4)

The cross entropy criterion is used to obtain an error vector in the output layer, which is then backpropagated to the hidden layer. The training algorithm uses validation data for early stopping and to control the learning rate. Training iterates over all the training data in several epochs before convergence is achieved; usually 10-20 epochs are needed. However, a valid question is whether simple backpropagation (BP) is sufficient to train the network properly: if we assume that the prediction of the next word is influenced by information which was present several time steps back, there is no guarantee that the network will learn to keep this information in the hidden layer. While the network can remember such information, it is more by luck than by design.

Table 1. Comparison of different language modeling techniques on Penn Corpus. Models are interpolated with the KN backoff model.

Model                               PPL
KN5                                 141
Random forest (Peng Xu) [8]         132
Structured LM (Filimonov) [9]       125
Syntactic NN LM (Emami) [10]        107
RNN trained by BP                   113
RNN trained by BPTT                 106
4x RNN trained by BPTT (mixture)     98

Fig. 2. Linear interpolation of different RNN models trained by BPTT (perplexity on the Penn corpus versus the number of RNN models, for the RNN mixture alone and for the RNN mixture + KN5).

Fig. 3. Effect of BPTT training on Penn Corpus. BPTT=1 corresponds to standard backpropagation (perplexity versus the BPTT step, for the average over 4 models, the mixture of 4 models, and the KN5 baseline).
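To make Equations (1)-(4) concrete, the following is a minimal sketch of the forward pass of the simple RNN LM in Python/NumPy. The array names (U for the input-plus-recurrent weights, V for the output weights) follow the notation above; the sizes and the random initialization are illustrative assumptions, not the configuration used in the experiments.

    import numpy as np

    def softmax(z):
        z = z - np.max(z)               # shift for numerical stability
        e = np.exp(z)
        return e / e.sum()

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    N, H = 10000, 200                   # vocabulary size and hidden layer size (illustrative)
    rng = np.random.default_rng(0)
    U = rng.normal(0, 0.1, (H, N + H))  # weights from x(t) = [w(t); s(t-1)] to the hidden layer, Eq. (2)
    V = rng.normal(0, 0.1, (N, H))      # weights from the hidden layer to the output layer, Eq. (3)

    def forward(word_index, s_prev):
        """One time step: return (y, s) = (distribution over the next word, new hidden state)."""
        x = np.zeros(N + H)             # Eq. (1): concatenation of the 1-of-N word vector and s(t-1)
        x[word_index] = 1.0
        x[N:] = s_prev
        s = sigmoid(U @ x)              # Eq. (2)
        y = softmax(V @ s)              # Eq. (3), with the softmax g from Eq. (4)
        return y, s

    s = np.zeros(H)                     # initial hidden state
    y, s = forward(42, s)               # y is a valid probability distribution over the vocabulary
    print(y.sum())                      # 1.0 up to rounding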
3. BACKPROPAGATION THROUGH TIME

Backpropagation through time (BPTT) [11] can be seen as an extension of the backpropagation algorithm for recurrent networks. With truncated BPTT, the error is propagated through the recurrent connections back in time for a specific number of time steps (here referred to as τ). Thus, when trained by BPTT, the network learns to keep information in the hidden layer for several time steps. Additional information and practical advice for implementing the BPTT algorithm are given in [7].
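Below is a minimal sketch of one truncated-BPTT update for the network defined in Section 2 (Python/NumPy, reusing softmax from the previous sketch). The unfolding depth tau, the learning rate and the buffer handling are illustrative assumptions; this is not the exact training code used in the experiments.

    import numpy as np

    # One truncated-BPTT update: the output error at time t is pushed back through
    # the recurrent part of U for up to tau previous time steps. `states` holds the
    # hidden vectors [..., s(t-1), s(t)] and `inputs` the word indices [..., w(t-1), w(t)].
    def bptt_step(U, V, states, inputs, target_index, tau=5, lr=0.1):
        N, H = V.shape                              # vocabulary size, hidden layer size
        s_t = states[-1]
        y = softmax(V @ s_t)                        # Eq. (3)
        e_out = -y
        e_out[target_index] += 1.0                  # cross-entropy error vector at the output
        dV = np.outer(e_out, s_t)
        dU = np.zeros_like(U)
        e_h = (V.T @ e_out) * s_t * (1.0 - s_t)     # error at the hidden layer at time t
        steps = min(tau, len(states) - 1)           # cannot unfold past the stored history
        for k in range(steps):
            s_prev = states[-2 - k]                 # s(t-k-1)
            x = np.zeros(N + H)                     # rebuild x(t-k) = [w(t-k); s(t-k-1)], Eq. (1)
            x[inputs[-1 - k]] = 1.0
            x[N:] = s_prev
            dU += np.outer(e_h, x)                  # gradient contribution of unfolded step t-k
            # push the error one step further back through the recurrent weights
            e_h = (U[:, N:].T @ e_h) * s_prev * (1.0 - s_prev)
        U += lr * dU                                # plain gradient step, no regularization
        V += lr * dV

In an actual implementation, the states and inputs buffers only need to keep the last tau + 1 entries, and the weight update does not have to be performed at every time step (see footnote 1 in Section 4).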
The data used in the following experiments were obtained from the Penn Treebank: sections 0-20 were used as training data (about 930K tokens), sections 21-22 as validation data (74K) and sections 23-24 as test data (82K). The vocabulary is limited to 10K words. The processing of the data is exactly the same as that used by [10] and other researchers. For a comparison of techniques, see Table 1. KN5 denotes the baseline: an interpolated 5-gram model with modified Kneser-Ney smoothing and no count cutoffs.
To improve results, it is often better to train several networks (that differ either in the random initialization of the weights or also in the number of parameters) than to have one huge network. The combination of these networks is done by linear interpolation with equal weights assigned to each model (note the similarity to random forests, which are composed of different decision trees [8]). The combination of various numbers of models is shown in Figure 2.
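As an illustration, equal-weight linear interpolation simply averages the per-word probabilities of the individual models before taking the log. A minimal sketch follows, assuming each model exposes a prob(word, history) method; this interface is a hypothetical convenience, not part of the paper.

    import math

    def mixture_log10prob(models, word, history):
        # Equal-weight interpolation of K models: P(w|h) = (1/K) * sum_k P_k(w|h)
        p = sum(m.prob(word, history) for m in models) / len(models)
        return math.log10(p)

    def perplexity(models, test_words):
        # PPL = 10 ** ( -(1/N) * sum_i log10 P(w_i | h_i) )
        logp = 0.0
        for i, w in enumerate(test_words):
            logp += mixture_log10prob(models, w, test_words[:i])
        return 10.0 ** (-logp / len(test_words))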
Figure 3 shows the importance of the number of time steps τ in BPTT. To reduce noise, the results are reported as the average of the perplexities given by four models with different RNN configurations (250, 300, 350 and 400 neurons in the hidden layer). Also, a combination of these models is shown (again, linear interpolation was used). As can be seen, 4-5 steps of BPTT training seem to be sufficient. Note that while the complexity of the training phase increases with the number of steps for which the error is propagated back in time, the complexity of the test phase is constant.

Table 2 shows a comparison of the feedforward [12], simple recurrent [4] and BPTT-trained recurrent neural network language models on two corpora. Perplexity is shown on the test sets for the configurations of the networks that worked best on the development sets. We can see that the simple recurrent neural network already outperforms the standard feedforward network, while BPTT training provides another significant improvement.

Table 2. Comparison of different neural network architectures on Penn Corpus (1M words) and Switchboard (4M words).

                         Penn Corpus         Switchboard
Model                    NN      NN+KN       NN      NN+KN
KN5 (baseline)           -       141         -       92.9
feedforward NN           141     118         85.1    77.5
RNN trained by BP        137     113         81.3    75.4
RNN trained by BPTT      123     106         77.5    72.5

4. SPEEDUP TECHNIQUES

The time complexity of one training step is proportional to

    O = (1 + H) × H × τ + H × V                          (5)

where H is the size of the hidden layer, V the size of the vocabulary and τ the number of steps for which we backpropagate the error back in time (see footnote 1). Usually H << V, so the computational bottleneck is between the hidden and output layers. This has motivated several researchers to investigate possibilities for reducing this huge weight matrix. Originally, Bengio [1] merged all low-frequency words into one special token in the output vocabulary, which usually results in a 2-3 times speedup without significant degradation of performance. This idea was later extended: instead of using a unigram distribution for the words that belong to the special token, Schwenk [3] used probabilities from a backoff model for the rare words.

(Footnote 1: As suggested to us by Y. Bengio, the τ term can practically disappear from the computational complexity, provided that the update of the weights is not done at every time step [11].)
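To see why the H × V term dominates, the two terms of Equation (5) can be evaluated for a typical configuration; the concrete numbers below (H = 200, V = 10000, τ = 5, matching the Penn Treebank setup used here) are only an illustration of the formula.

    H, V, tau = 200, 10000, 5
    recurrent_part = (1 + H) * H * tau    # (1 + H) x H x tau = 201,000 operations
    output_part = H * V                   # H x V = 2,000,000 operations
    print(recurrent_part, output_part)    # the hidden-to-output part is roughly 10x larger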
An even more promising approach was based on the assumption that words can be mapped to classes [13][14]. If we assume that each word belongs to exactly one class, we can first estimate the probability distribution over the classes using the RNN and then compute the probability of a particular word from the desired class while assuming a unigram distribution of words within the class:

    P(w_i | history) = P(c_i | history) P(w_i | c_i)                  (6)

This reduces the computational complexity to

    O = (1 + H) × H × τ + H × C,                         (7)

where C is the number of classes. While this architecture has obvious advantages over the previously mentioned approaches, as C can be an order of magnitude smaller than V without sacrificing much accuracy, the performance depends heavily on our ability to estimate the classes precisely. Classical Brown clustering is usually not very useful here, as its computational complexity is too high and it is often faster to estimate the full neural network model.
4.1. Factorization of the output layer

We can go further and assume that the probabilities of words within a certain class do not depend just on the probability of the class itself, but also on the history - in the context of neural networks, that is the hidden layer s(t). We can change Equation 6 to

    P(w_i | history) = P(c_i | s(t)) P(w_i | c_i, s(t))               (8)

The corresponding RNN architecture is shown in Figure 4. This idea has already been explored by Morin [13] (and, in the context of Maximum Entropy models, by Goodman [14]), who extended it further by assuming that the vocabulary can be represented by a hierarchical binary tree. The drawback of Morin's approach was its dependence on WordNet for obtaining word similarity information, which can be unavailable for certain domains or languages.

Fig. 4. RNN with output layer factorized by class layer.
In our work, we have implemented a simple factorization of the output layer using classes. Words are assigned to classes proportionally, while respecting their frequencies (this is sometimes referred to as 'frequency binning'). The number of classes is a parameter. For example, if we choose 20 classes, the words that correspond to the first 5% of the unigram probability distribution would be mapped to class 1 (with the Penn Corpus, this would correspond to the token 'the', as its unigram probability is about 5%), the words that correspond to the next 5% of the unigram probability mass would be mapped to class 2, etc. Thus, the first classes can hold just single words, while the last classes cover thousands of low-frequency words (see footnote 2).

(Footnote 2: After this paper was written, we found that Emami [18] has proposed a similar technique for reducing computational complexity, by assigning words into statistically derived classes. The novelty of our approach is thus in showing that simple frequency binning is adequate to obtain reasonable performance.)

Instead of computing a probability distribution over all words as specified in (3), we first estimate a probability distribution over the classes and then a distribution over the words from a single class, the one that contains the predicted word:

    c_l(t) = g( \sum_j s_j(t) w_{lj} )                   (9)

    y_c(t) = g( \sum_j s_j(t) v_{cj} )                   (10)

The activation function g for both these distributions is again the softmax (Equation 4). Thus, we have the probability distributions both for the classes and for the words within the class that we are interested in, and we can evaluate Equation 8. The error vector is computed for both distributions, and then we follow the backpropagation algorithm, so the errors computed in the word-based and the class-based parts of the network are summed together in the hidden layer. The advantage of this approach is that the network still uses the whole hidden layer to estimate a (potentially) full probability distribution over the full vocabulary, while the factorization allows us to evaluate just a subset of the output layer during both the training and the test phases. Based on the results shown in Table 3, we can conclude that fast evaluation of the output layer via classes leads to around a 15 times speedup against the model that uses the full vocabulary (10K), at a small cost in accuracy. The non-linear behaviour of the reported time complexity is caused by the constant term (1 + H) × H × τ and also by suboptimal cache usage with large matrices. With C = 1 and C = V, the model is equivalent to the full RNN model.
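A minimal sketch of the frequency binning and of the factorized evaluation of Equation (8) is given below (Python, reusing softmax from the Section 2 sketch). The variable names W for the class weights and V for the word weights, and the helper data structures, are illustrative assumptions rather than the paper's implementation.

    import numpy as np
    from collections import Counter
    # `softmax` is the function from the Section 2 sketch; s, W, V are NumPy arrays.

    def assign_classes(train_word_ids, num_classes):
        """Frequency binning: each class covers roughly 1/num_classes of the unigram
        probability mass, so the first classes contain very few (often single) frequent words."""
        counts = Counter(train_word_ids)
        total = sum(counts.values())
        word_class = {}
        mass, cls = 0.0, 0
        for w, c in counts.most_common():           # most frequent words first
            word_class[w] = cls
            mass += c / total
            if mass > (cls + 1) / num_classes and cls < num_classes - 1:
                cls += 1
        return word_class

    def factorized_prob(s, W, V, word, word_class, class_members):
        """Equation (8): P(w | history) = P(class(w) | s(t)) * P(w | class(w), s(t)).
        W holds the class weights of Eq. (9), V the word weights of Eq. (10); only the
        rows of V that belong to the target word's class are evaluated."""
        c = word_class[word]
        p_class = softmax(W @ s)                    # Eq. (9): distribution over classes
        members = class_members[c]                  # word ids belonging to class c
        p_words = softmax(V[np.asarray(members)] @ s)   # Eq. (10): distribution within class c
        return p_class[c] * p_words[members.index(word)]

    # Building the inverse mapping from the class assignment (hypothetical usage):
    # word_class = assign_classes(train_ids, num_classes=100)
    # class_members = {}
    # for w, c in word_class.items():
    #     class_members.setdefault(c, []).append(w)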

Table 3. Perplexities on Penn corpus with factorization of the output layer by the class model. All models have the same basic configuration (200 hidden units and BPTT=5). The Full model is a baseline and does not use classes, but the whole 10K vocabulary.

Classes   RNN   RNN+KN5   Min/epoch   Sec/test
30        134   112       12.8        8.8
50        136   114       9.8         6.7
100       136   114       9.1         5.6
200       136   113       9.5         6.0
400       134   112       10.9        8.1
1000      131   111       16.1        15.7
2000      128   109       25.3        28.7
4000      127   108       44.4        57.8
6000      127   109       70          96.5
8000      124   107       107         148
Full      123   106       154         212
4.2. Compression layer

Alternatively, we can think about the two parts of the original recurrent network separately: first, there is a matrix U responsible for the input and for the recurrent connections that maintain the short term memory, and then a matrix V that is used to obtain the probability distribution in the output layer. Both weight matrices share the same hidden layer; however, while matrix U needs this vector to maintain all the short term memory, storing information for possibly several time steps, matrix V needs only the information contained in the hidden layer that is required to calculate the probability distribution for the immediately following word (see footnote 3). To reduce the size of the weight matrix V, we can use an additional compression layer between the hidden and output layers. We have used the sigmoid activation function for the compression layer, so this projection is non-linear.

(Footnote 3: Alternatively, we can ask whether the rank of the matrix V is full.)

A compression layer not only reduces the computational complexity, but also reduces the total number of parameters, which results in more compact models. It is also possible to use a similar compression layer between the input and hidden layers to further reduce the size of the models (such a layer is usually referred to as a projection layer). The empirical results show that with a growing amount of training data, the hidden layer needs to be increased to allow the model to store more information. Thus, the idea of using a compression layer is mostly useful when a large amount of training data is used. We plan to report results with compression layers in the future.
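As a rough illustration of the parameter savings (the sizes below are assumptions chosen for the example, not results from the paper): with a hidden layer of size H and a vocabulary of size V, the output matrix alone has H × V weights; inserting a compression layer of size P between the hidden and output layers replaces these with H × P + P × V weights.

    H, V, P = 400, 100000, 100           # hypothetical sizes: hidden, vocabulary, compression layer
    without_compression = H * V          # 40,000,000 output-side weights
    with_compression = H * P + P * V     # 40,000 + 10,000,000 = 10,040,000 weights
    print(without_compression, with_compression)   # roughly a 4x reduction in this example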
5. CONCLUSION AND FUTURE WORK

We presented, to our knowledge, the first published results when using an RNN trained by BPTT in the context of statistical language modeling. The comparison to standard feedforward neural network based language models, as well as the comparison to BP-trained RNN models, clearly shows the potential of the presented model. Furthermore, we have shown how to obtain significantly better accuracy of RNN models by combining them linearly. The resulting mixture of RNN models attains perplexity 96 on the well-known Penn corpus, which is significantly better than the best previously published result on this setup [10]. In future work, we plan to show how to further improve accuracy by combining statically and dynamically evaluated RNN models [4] and by using complementary language modeling techniques to obtain even much lower perplexity. In our ongoing ASR experiments, we have observed a good correlation between perplexity improvements and word error rate reduction.

Next, we have shown several possibilities for reducing the computational and space complexity: using classes, factorization of the output layer and compression layers. Combinations of these techniques lead to efficient training on very large corpora; we plan to describe our current experiments that involve models trained on much more than 100M words while using a non-truncated vocabulary.

Finally, we plan to show that the resulting models can be efficiently used in state of the art systems that use very good baseline acoustic and language models based on huge amounts of in-domain data, and that the additional processing cost of using RNN models does not need to be impractically high when exploiting the techniques described in this paper. For that purpose, we have published a freely available toolkit for training RNN language models at http://www.fit.vutbr.cz/~imikolov/rnnlm/.

6. REFERENCES

[1] Yoshua Bengio, Rejean Ducharme and Pascal Vincent. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137-1155, 2003.
[2] Joshua T. Goodman. A bit of progress in language modeling, extended version. Technical report MSR-TR-2001-72, 2001.
[3] Holger Schwenk, Jean-Luc Gauvain. Training neural network language models on very large corpora. In Proc. Joint Conference HLT/EMNLP, October 2005.
[4] Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, Sanjeev Khudanpur. Recurrent neural network based language model. In Proc. INTERSPEECH 2010.
[5] Y. Bengio, Y. LeCun. Scaling learning algorithms towards AI. In Large-Scale Kernel Machines, MIT Press, 2007.
[6] Jeffrey L. Elman. Finding structure in time. Cognitive Science, 14:179-211, 1990.
[7] Mikael Bodén. A guide to recurrent neural networks and backpropagation. In the Dallas project, 2002.
[8] Peng Xu. Random forests and the data sparseness problem in language modeling. Ph.D. thesis, Johns Hopkins University, 2005.
[9] Denis Filimonov and Mary Harper. A joint language model with fine-grain syntactic tags. In EMNLP, 2009.
[10] Ahmad Emami, Frederick Jelinek. Exact training of a neural syntactic language model. In ICASSP 2004.
[11] D. E. Rumelhart, G. E. Hinton, R. J. Williams. Learning internal representations by back-propagating errors. Nature, 323:533-536, 1986.
[12] Tomáš Mikolov, Jiří Kopecký, Lukáš Burget, Ondřej Glembek and Jan Černocký. Neural network based language models for highly inflective languages. In Proc. ICASSP 2009.
[13] F. Morin, Y. Bengio. Hierarchical probabilistic neural network language model. In AISTATS 2005.
[14] J. Goodman. Classes for fast maximum entropy training. In Proc. ICASSP 2001.
[15] A. Alexandrescu, K. Kirchhoff. Factored neural language models. In HLT-NAACL, 2006.
[16] Yoshua Bengio, Patrice Simard and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5:157-166.
[17] Y. Bengio, J.-S. Senecal. Adaptive importance sampling to accelerate training of a neural probabilistic language model. IEEE Transactions on Neural Networks, 2008.
[18] Ahmad Emami. A neural syntactic language model. Ph.D. thesis, Johns Hopkins University, 2006.
