Fig. 6: Difference vectors between the average vector representations for word pairs differing by one edit distance in (a) French and (b) German; in the German pairs, the last phoneme differs by the presence or absence of a final 'e'.
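To make the quantity plotted in Fig. 6 concrete: for each word pair, the SA embeddings of all spoken instances of each word are averaged, and the two average vectors are subtracted. A minimal sketch, assuming the instance embeddings are already available as arrays (the function name and array layout are illustrative, not from the paper):

import numpy as np

def difference_vector(emb_a, emb_b):
    # emb_a, emb_b: (n_instances, dim) arrays of SA embeddings for the two
    # words of a pair differing by one edit distance (e.g. a final 'e').
    # If the same edit yields similar difference vectors across many pairs,
    # the phonetic change is encoded consistently in the embedding space.
    return np.mean(emb_a, axis=0) - np.mean(emb_b, axis=0)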
We evaluated SA with different levels of accessibility to the low-resource target language, along with two baseline models: NE, and SA trained purely on the target languages. For each of the four target languages, the total amount of audio word segments available in the training set was 4,000. In Table 1, we took different partitions of the target-language training sets to fine-tune the SA pretrained on the source languages. The partition sizes are 1K, 2K, 3K, and 4K audio word segments, plus 0, which means no fine-tuning.
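As an illustration of this fine-tuning protocol, a minimal PyTorch sketch follows. It is not the authors' code; the class name, the 39-dimensional acoustic features, and the hyperparameters are assumptions for illustration only.

import torch
import torch.nn as nn

class SeqAutoencoder(nn.Module):
    # GRU encoder-decoder: the final encoder state serves as the
    # fixed-length audio embedding, and the decoder reconstructs the
    # input acoustic feature sequence from it.
    def __init__(self, feat_dim=39, hidden_dim=256):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, feat_dim)

    def forward(self, x):                 # x: (batch, time, feat_dim)
        _, z = self.encoder(x)            # z: (1, batch, hidden_dim)
        h, _ = self.decoder(x, z)         # teacher-forced reconstruction
        return self.out(h), z.squeeze(0)  # (reconstruction, embedding)

def fine_tune(model, batches, epochs=10, lr=1e-3):
    # Fine-tune a source-language-pretrained SA on one target-language
    # partition (the 1K/2K/3K/4K subsets above) with reconstruction loss.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for x in batches:                 # padded acoustic feature tensors
            recon, _ = model(x)
            loss = loss_fn(recon, x)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model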
From Table 1, the SA trained on the source languages generally outperforms the SA trained on the limited amount of target-language data ("SA No Transfer"), showing that, given enough audio segments, SA can identify and encode universal phonetic structure. Compared with NE, SA surpasses NE in German and French even without fine-tuning, and in Czech SA also achieves a better score than NE after fine-tuning. However, in Spanish, SA achieved a MAP score of 0.13 with fine-tuning, slightly lower than the 0.17 obtained by NE.

7. CONCLUSION AND FUTURE WORK

In this paper, we verify the capability of language transfer of Audio Word2Vec using the Sequence-to-sequence Autoencoder (SA). We demonstrate that SA can learn the sequential phonetic structure commonly appearing in human language, which makes it possible to apply an Audio Word2Vec model learned from a high-resource language to low-resource languages. This capability of language transfer is beneficial to many real-world applications, for example, the query-by-example STD shown in this work. In future work, we will examine the performance of the transferred system in other application scenarios and explore the performance of Audio Word2Vec under automatic segmentation.
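For the query-by-example STD application mentioned above, detection with transferred embeddings reduces to similarity search in the embedding space. The sketch below, reusing the SeqAutoencoder sketched earlier, ranks candidate segments by cosine similarity to the spoken query's embedding; this is a standard embedding-based QbE-STD setup, not necessarily the exact scoring pipeline of the paper.

import numpy as np
import torch

def embed(model, segment):
    # segment: (time, feat_dim) tensor for one audio word segment.
    with torch.no_grad():
        _, z = model(segment.unsqueeze(0))
    return z.numpy().ravel()

def qbe_std(model, query_seg, doc_segs):
    # Rank document segments by cosine similarity between their SA
    # embeddings and the query embedding; a higher score means the
    # query term is more likely to occur in that segment.
    q = embed(model, query_seg)
    scores = np.array([np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d))
                       for d in (embed(model, s) for s in doc_segs)])
    return np.argsort(scores)[::-1], scores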