
LANGUAGE TRANSFER OF AUDIO WORD2VEC: LEARNING AUDIO SEGMENT REPRESENTATIONS WITHOUT TARGET LANGUAGE DATA

Chia-Hao Shen, Janet Y. Sung, Hung-Yi Lee

Electrical Engineering Department, National Taiwan University
{r04921047, b01901171, hungyilee}@ntu.edu.tw

arXiv:1707.06519v1 [cs.CL] 19 Jul 2017

ABSTRACT

Audio Word2Vec offers vector representations of fixed dimensionality for variable-length audio segments using a Sequence-to-sequence Autoencoder (SA). These vector representations are shown to describe the sequential phonetic structures of the audio segments to a good degree, with real-world applications such as query-by-example Spoken Term Detection (STD). This paper examines the capability of language transfer of Audio Word2Vec. We train an SA on one language (the source language) and use it to extract vector representations of audio segments of another language (the target language). We found that the SA can still capture phonetic structure from audio segments of the target language if the source and target languages are similar. In query-by-example STD, we obtain the vector representations from an SA learned from a large amount of source language data, and find that they surpass the representations from a naive encoder and from an SA trained directly on a small amount of target language data. The results show that it is possible to learn an Audio Word2Vec model from high-resource languages and use it on low-resource languages. This further expands the usability of Audio Word2Vec.

Index Terms— Audio Word2Vec, Spoken Term Detection, Seq2Seq, Autoencoder, Language Transfer

1. INTRODUCTION

Embedding audio word segments into fixed-length vectors has many useful applications in natural language processing, such as speaker identification [1], audio emotion classification [2], and spoken term detection (STD) [3, 4, 5]. In these applications, audio segments are usually represented as feature vectors to be fed to a standard classifier, which determines the speaker's identity, the emotion, or whether the input query term is included. By representing the audio segments as fixed-length vectors instead of using the original variable-length segments, we can reduce the effort for indexing, accelerate the speed of calculation, and improve the efficiency of the retrieval task [6, 7, 8].

Recently, deep learning has been used for encoding acoustic information into vectors [9, 10, 11]. Existing works have shown that it is possible to transform audio word segments into fixed-dimensional vectors. The transformation successfully produces a vector space where word audio segments with similar phonetic structures are closely located. In [11], the authors train a Siamese convolutional neural network with side information to obtain embeddings that separate same-word pairs and different-word pairs. Human-annotated data is required under this supervised learning scenario. Besides supervised approaches [12, 11, 13, 14], unsupervised approaches have also been proposed to reduce the annotation effort [15]. For unsupervised learning of audio embeddings, an LSTM-based sequence-to-sequence autoencoder demonstrates promising results [15]. The model is trained to minimize the reconstruction error of the input audio sequence and then provides the embedding, namely Audio Word2Vec, from its bottleneck layer. This is done without any annotation effort.

Although deep learning approaches have produced satisfactory results, the data-hungry nature of deep models makes it hard to reach the same performance with low-resource data. Both supervised and unsupervised approaches assume that a large amount of audio data of the target language is available. A question arises whether it is possible to transfer an Audio Word2Vec model learned from a high-resource language into a model targeted at a low-resource language. While this problem has not yet been fully examined for Audio Word2Vec, work in neural machine translation (NMT) has successfully transferred models learned on high-resource languages to low-resource languages. In [16, 17], the authors first train a source model with a high-resource language pair. The source model is then used to initialize the target model, which is trained with low-resource language pairs.

For audio, all languages are uttered by human beings with a similar vocal tract structure, and therefore share some common acoustic patterns. This fact implies that knowledge obtained from one spoken language can be transferred onto other languages. This paper verifies that the sequence-to-sequence autoencoder is not only able to transform audio word segments into fixed-length vectors, but is also transferable to languages it has never heard before. We also demonstrate its promising applications with a query-by-example spoken term detection (STD) experiment. In the query-by-example STD experiment, even without tuning with partial low-resource language segments, the autoencoder can still produce high-quality vectors.

2. AUDIO WORD2VEC

The goal of the Audio Word2Vec model is to identify the phonetic patterns in acoustic feature sequences such as MFCCs. Given a sequence x = (x1, x2, ..., xT), where xt is the acoustic feature at time t and T is the length, Audio Word2Vec transforms the features into a fixed-length vector z ∈ R^d with dimension d based on the phonetic structure.

Fig. 1: The Sequence-to-sequence Autoencoder (SA) consists of two RNNs: the RNN Encoder (the left large block) and the RNN Decoder (the right large block). The RNN Encoder reads an audio segment represented as an acoustic feature sequence x = (x1, x2, ..., xT) and maps it into a vector representation of fixed dimensionality z; the RNN Decoder maps the vector z to another sequence y = (y1, y2, ..., yT). The RNN Encoder and Decoder are jointly trained to make the output sequence y as close to the input sequence x as possible, that is, to minimize the reconstruction error.

2.1. RNN Encoder-Decoder Network

Recurrent Neural Networks (RNNs) have shown great success in many NLP tasks with their capability of capturing sequential information. The hidden neurons form a directed cycle and perform the same task for every element in a sequence. Given a sequence x = (x1, x2, ..., xT), an RNN updates its hidden state ht according to the current input xt and the previous state ht−1. The hidden state ht acts as an internal memory at time t that enables the network to capture dynamic temporal information, and also allows the network to process sequences of variable length. In practice, however, RNNs do not seem to learn long-term dependencies due to the vanishing gradient problem [18, 19]. To overcome this difficulty, LSTM [20] and GRU [21, 22] were proposed. While LSTM has achieved many impressive results [23, 24, 25, 26, 27, 21, 28], the relatively new GRU performs just as well with fewer parameters and less training effort [29, 30, 31, 32].

The RNN Encoder-Decoder [27, 33] consists of an Encoder RNN and a Decoder RNN. The Encoder RNN reads the input sequence x = (x1, x2, ..., xT) sequentially, and the hidden state ht of the RNN is updated accordingly. After the last symbol xT is processed, the hidden state hT is interpreted as the learned representation of the whole input sequence. Then, by taking hT as input, the Decoder RNN generates the output sequence y = (y1, y2, ..., yT'), where T and T' can be different, that is, the lengths of x and y can differ. Such an RNN Encoder-Decoder framework is able to handle variable-length input. Although there may exist a considerable time lag between the input symbols and their corresponding output symbols, LSTM and GRU are able to handle such situations well thanks to their strength in modeling long-term dependencies.

2.2. Sequence-to-sequence Autoencoder

Figure 1 depicts the structure of the Sequence-to-sequence Autoencoder (SA), which integrates the RNN Encoder-Decoder framework with an Autoencoder for unsupervised learning of audio segment representations. SA consists of an RNN Encoder (the left part of Figure 1) and an RNN Decoder (the right part). Given an audio segment represented as an acoustic feature sequence x = (x1, x2, ..., xT) of any length T, the RNN Encoder reads each acoustic feature xt sequentially and the hidden state ht is updated accordingly. After the last acoustic feature xT has been read and processed, the hidden state hT of the Encoder RNN is viewed as the learned representation z of the input sequence (the purple block in Figure 1). The RNN Decoder takes hT as the initial state of its RNN cell and generates an output y1. Instead of taking y1 as the input of the next time step, a zero vector is fed in as input to generate y2, and so on. This structure is called the historyless decoder. Based on the principles of the Autoencoder [34, 35], the target of the output sequence y = (y1, y2, ..., yT) is the input sequence x = (x1, x2, ..., xT). In other words, the RNN Encoder and Decoder are jointly trained by minimizing the reconstruction error, measured by the mean squared error Σ_t ||xt − yt||², summed over t = 1, ..., T. Because the input sequence is taken as the learning target, the training process does not need any labeled data. The fixed-length vector representation z will be a meaningful representation of the input audio segment x because the whole input sequence x can be reconstructed from z by the RNN Decoder.

Using a historyless decoder is critical here. We found that with a standard decoder, the performance in the STD experiment was undermined despite the low reconstruction error. This shows that the vector representations learned from SA did not include useful information. This might be caused by a strong decoder: the model focuses less on packing information into the vector representation. We eventually solved the problem by using a historyless decoder. The historyless decoder is a weakened decoder; the input of the decoder is removed, and this forces the model to rely more on the vector representation. The historyless decoder is also used in recent NLP works [36, 37, 38].
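To make the structure concrete, the following is a minimal PyTorch sketch of a sequence-to-sequence autoencoder with a historyless GRU decoder. It is illustrative only: the paper's implementation is in TensorFlow, and the class and variable names here are assumptions, not the authors' code. The feature dimension (39), hidden size (400), and SGD learning rate (1) follow the setup described later in Sections 5 and 6.

```python
import torch
import torch.nn as nn

class SeqAutoencoder(nn.Module):
    """Sketch of SA: GRU encoder + historyless GRU decoder (illustrative)."""
    def __init__(self, feat_dim=39, hidden_dim=400):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.output_layer = nn.Linear(hidden_dim, feat_dim)

    def encode(self, x):
        # x: (batch, T, feat_dim); the final hidden state is the representation z
        _, h_T = self.encoder(x)
        return h_T                                  # (1, batch, hidden_dim)

    def forward(self, x):
        z = self.encode(x)
        # Historyless decoder: zero vectors are fed at every step, so the
        # reconstruction must rely entirely on z (used as the initial state).
        zeros = torch.zeros_like(x)
        dec_out, _ = self.decoder(zeros, z)
        y = self.output_layer(dec_out)              # (batch, T, feat_dim)
        return y, z.squeeze(0)

# Training step sketch: minimize frame-wise MSE between input and reconstruction.
model = SeqAutoencoder()
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)
criterion = nn.MSELoss()
x = torch.randn(32, 50, 39)                         # a dummy batch of 50-frame MFCC segments
y, z = model(x)
loss = criterion(y, x)                              # the input sequence is the learning target
loss.backward()
optimizer.step()
```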
3. LANGUAGE TRANSFER

Fig. 2: The training and testing mechanism for language transfer. In the training phase, an RNN Encoder-Decoder is trained on an abundant amount of audio segments in the source language, as shown in the upper section of the figure (marked blue). During the testing phase, the learned RNN Encoder for the source language (the blue block in the lower section of the figure) is directly used to transform an audio segment in the target language (the orange segment) into a fixed-length vector.

In the study of linguistics, scholars define a set of universal phonetic rules which describe how sounds are commonly organized across different languages. In real life, we often find languages sharing similar phonemes, especially those spoken in nearby regions. These facts imply that when switching target languages, we do not need to learn the new audio patterns from scratch, thanks to the transferability across spoken languages. Language transfer has been shown to be helpful in STD [39, 40, 41, 42, 43, 44, 45, 46]. In this paper, we focus on studying the capability of transfer learning of Audio Word2Vec.

In the proposed approach, we first train an SA on the high-resource source language, as shown in the upper part of Fig. 2, and then the encoder is used to transform the audio segments of a low-resource target language. It is also possible to fine-tune the parameters of the SA with the target language. In the following experiments, we found that in some cases the STD performance of the encoder without fine-tuning on the low-resource target language can be as good as that with fine-tuning.
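A minimal sketch of this transfer procedure, reusing the SeqAutoencoder class from the previous sketch. It is again illustrative: the helper name train_sa, the batch lists (source_language_batches, target_language_batches, small_target_language_batches), and the fine-tuning schedule are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

def train_sa(model, batches, epochs=10, lr=1.0):
    """Train the autoencoder on a list of (batch, T, 39) MFCC tensors."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    mse = nn.MSELoss()
    for _ in range(epochs):
        for x in batches:
            opt.zero_grad()
            y, _ = model(x)
            mse(y, x).backward()
            opt.step()
    return model

# 1) Pre-train on a large amount of source-language (e.g. English) segments.
sa = train_sa(SeqAutoencoder(), source_language_batches)

# 2) Direct transfer: encode target-language segments with the source-trained encoder.
with torch.no_grad():
    target_vectors = [sa.encode(x).squeeze(0) for x in target_language_batches]

# 3) Optional: fine-tune the whole SA with a small target-language set (e.g. 1K-4K segments).
sa_finetuned = train_sa(sa, small_target_language_batches, epochs=3)
```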
4. AN EXAMPLE APPLICATION: QUERY-BY-EXAMPLE STD

Fig. 3: The example application of query-by-example STD. All audio segments in the audio archive are segmented based on word boundaries and represented by fixed-length vectors off-line. When a spoken query is entered, it is also represented as a vector. The similarity between this vector and all vectors for segments in the archive is calculated, and the audio segments are ranked accordingly.

The audio segment representation z learned in the last section can be applied in many possible scenarios. Here, in the preliminary tests, we consider unsupervised query-by-example STD, whose target is to locate the occurrence regions of the input spoken query term in a large spoken archive without speech recognition. Figure 3 shows how the representation z proposed here can be easily used in this task. This approach is inspired by the previous work [7], but is completely different in the way the audio segments are represented. In the upper half of Figure 3, the audio archive is segmented based on word boundaries into variable-length sequences, and then the system exploits the trained RNN encoder in Figure 1 to encode these audio segments into fixed-length vectors. All of this is done off-line. In the lower left corner of Figure 3, when a spoken query is entered, the input spoken query is similarly encoded by the same RNN encoder into a vector. The system then returns a list of audio segments in the archive ranked according to the cosine similarities evaluated between the vector representation of the query and those of all segments in the archive. Note that the computation requirements for the online process here are extremely low.
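The online part of this pipeline reduces to a single cosine-similarity ranking over pre-computed vectors. The sketch below illustrates that step; the function name and the commented usage with the encoder from the previous sketches are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def rank_archive(query_vector, archive_vectors):
    """Rank archive segments by cosine similarity to the query vector.

    query_vector:    (d,) tensor from the RNN encoder
    archive_vectors: (N, d) tensor, encoded off-line
    Returns (indices of archive segments, most similar first; similarity scores).
    """
    sims = F.cosine_similarity(query_vector.unsqueeze(0), archive_vectors, dim=1)
    return torch.argsort(sims, descending=True), sims

# Example usage with the transferred encoder from the earlier sketches (assumed names):
# archive_vectors = torch.stack([sa.encode(x).squeeze() for x in archive_segments])
# ranking, scores = rank_archive(sa.encode(query_segment).squeeze(), archive_vectors)
```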
5. EXPERIMENTAL SETUP

Here we provide the details of our experiments, including the dataset, the model setup, and the baseline model.

5.1. Dataset

Two corpora across five languages were used in the experiments. One of the corpora is the LibriSpeech corpus [47] (English). From this 960-hour English dataset, 2.2 million audio word segments were used for training, while another 250 thousand segments were used as the database to be retrieved in STD and 1 thousand segments as spoken queries. In Section 6.1, we further sampled 20 thousand segments from the 250 thousand segments to form a small database in order to investigate the influence of database size. English served as the high-resource source language for model pre-training.

The other dataset is the GlobalPhone corpus [48], which includes French (FRE), German (GER), Czech (CZE), and Spanish (ESP). The four languages from GlobalPhone were used as the low-resource target languages. In Section 6.2, 20 thousand segments for each language were used to calculate the average cosine similarity. For the STD experiments, the 20 thousand segments served as the database to be retrieved, while another 1 thousand were used as queries and 4 thousand for fine-tuning.

39-dimensional MFCCs were used as the acoustic features. The length of the input sequence was limited to 50 frames. All datasets were segmented according to the word boundaries obtained by forced alignment with respect to the reference transcriptions. Although oracle word boundaries were used here for the query-by-example STD in the preliminary tests, the comparison in the following experiments is fair since all approaches used the same segmentation. Mean average precision (MAP) was used as the evaluation measure for query-by-example STD.

5.2. Proposed Model: Sequence Autoencoder (SA)

Both the proposed model (SA) and the baseline model (NE, described in the next subsection) were implemented with TensorFlow. The network structure and the hyperparameters were set as below:

• Both the RNN Encoder and Decoder consisted of one hidden layer of GRU cells [21, 22]. The number of units in the layer is discussed in the experiments.

• The networks were trained by SGD without momentum. The initial learning rate was 1 and decayed by a factor of 0.95 every 500 batches.

5.3. Baseline: Naive Encoder (NE)

We used a naive encoder (NE) as the baseline approach. In this encoder, the input acoustic feature sequence x = (x1, x2, x3, ..., xT), where xt is the 39-dimensional MFCC feature vector at time t, is divided into m partitions of roughly equal length T/m. Each partition is averaged into a single 39-dimensional vector, and the final vector representation is obtained by concatenating the m average vectors sequentially into a vector of dimensionality 39 × m. Although NE is simple, similar approaches have been used in STD and achieved successful results [3, 4, 5].
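A short NumPy sketch of the naive encoder baseline as described above; how frames are partitioned when T is not divisible by m is an assumption of this illustration.

```python
import numpy as np

def naive_encoder(features, m=6):
    """Average m roughly equal-length partitions of a (T, 39) MFCC sequence
    and concatenate them into a single (39 * m,) vector."""
    T = features.shape[0]
    # Split frame indices into m roughly equal chunks (handles T not divisible by m).
    chunks = np.array_split(np.arange(T), m)
    return np.concatenate([features[idx].mean(axis=0) for idx in chunks])

segment = np.random.randn(50, 39)    # a dummy 50-frame MFCC segment
vector = naive_encoder(segment, m=6)
print(vector.shape)                   # (234,) -- the NE234 setting used in Section 6.1
```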
6. EXPERIMENTS

In this section, we first examine how changing the hidden layer size of the RNN Encoder/Decoder, i.e., the dimension of Audio Word2Vec, affects the MAP performance of query-by-example STD (Section 6.1). After obtaining the best hidden layer size, we analyze the transferability of Audio Word2Vec by comparing the cosine similarity of the learned representations with the phoneme sequence edit distance (Section 6.2). Visualizations of multiple word pairs in different target languages are also provided (Section 6.3). Last but not least, we performed query-by-example STD on the target languages (Section 6.4). Together, these experiments verify that SA is capable of extracting common phonetic structure in human language and is thus transferable to various languages.

6.1. Analysis on the Dimension of Audio Word2Vector

Before evaluating the language transfer results, we first experimented with the primary SA model on the source language (English). The results are shown in Fig. 4. Here we compare the representations of SA and NE. Furthermore, we examined the influence of the dimension of the Audio Word2Vector in terms of MAP. We also compared the MAP results on the large testing database (250K segments) and the small database (20K).

Fig. 4: The retrieval performance in MAP for NE and SA with different dimensions on the large testing database (250K segments) and the small database (20K).

In Fig. 4, we varied the dimension of the Audio Word2Vector over 100, 200, 400, 600, 800, and 1000. To match the dimensionality of SA, we tested NE with dimensionalities 117, 234, 390, 585, 819, and 1014 (m = 3, 6, 10, 15, 21, 26), denoted NEd where d is the dimensionality. SA gets higher MAP values than NE regardless of the vector dimension and the size of the database. The highest MAP score SA achieves is 0.881 (SA800 on the small database), while the highest score of the NE model is 0.490 (NE234 on the small database). The size of the database has a large influence on the results. The MAP scores of the two models both drop on the large database. For example, NE234 drops from 0.490 to 0.158, decaying by 68%, and the performance of SA800 drops from 0.881 to 0.317, decaying by 64%. As shown in Fig. 4, larger dimensionality does not imply better performance in query-by-example STD. The MAP scores gradually improve until reaching a dimensionality of 400 in SA and 234 in NE, and start to decrease as the dimension increases further. In the rest of the experiments, we use 400 GRU units in the SA hidden layer, and set NE = NE234 (m = 6).
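For reference, mean average precision over a set of queries can be computed from the ranked retrieval lists as in the following sketch; this is a generic MAP implementation shown for illustration, not code taken from the paper.

```python
import numpy as np

def average_precision(relevant, ranking):
    """AP for one query: `ranking` is a list of segment ids, best first;
    `relevant` is the set of segment ids matching the query word."""
    hits, precisions = 0, []
    for rank, seg_id in enumerate(ranking, start=1):
        if seg_id in relevant:
            hits += 1
            precisions.append(hits / rank)
    return float(np.mean(precisions)) if precisions else 0.0

def mean_average_precision(queries):
    """`queries` is a list of (relevant_set, ranking) pairs."""
    return float(np.mean([average_precision(r, k) for r, k in queries]))

# Toy example: one query whose relevant segments {3, 7} are retrieved at ranks 1 and 4.
print(mean_average_precision([({3, 7}, [3, 5, 9, 7, 2])]))  # (1/1 + 2/4) / 2 = 0.75
```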
6.2. Analysis of Language Transfer

To evaluate the quality of language transfer, we trained the Audio Word2Vec model with SA on the source language, English, and applied it to the different target languages: French (FRE), German (GER), Czech (CZE), and Spanish (ESP). We computed the average cosine similarity of the vector representations for each pair of audio segments in the retrieval database of the target languages (20K segments for each language), and compared it with the phoneme sequence edit distance (PSED). The average and variance (the length of the black line on each bar) of the cosine similarity, for groups of pairs clustered by the phoneme sequence edit distance (PSED) between the two words, are shown in Fig. 5. For comparison, we also provide the results obtained from the English retrieval database (250K segments), whose segments were not seen by the model during training.

Fig. 5: The average cosine similarity and variance (the length of the black line on each bar) between the vector representations for all the segment pairs in the target-language testing sets, clustered by the phoneme sequence edit distance (PSED).

In Fig. 5, the cosine similarities of the segment pairs get smaller as the edit distances increase, and this trend is observed in all languages. The gap between each pair of edit distance groups, i.e. (0,1), (1,2), (2,3), (3,4), is obvious. This means that the SA learned from English can successfully encode the sequential phonetic structures into fixed-length vectors for the target languages to a good extent, even though it has never seen any audio data of the target languages.

Another interesting fact is the corresponding variance between languages. In the source language, English, the variances of the five edit distance groups stay fixed at 0.030, which means that the cosine similarity within each edit distance group is centralized. However, the variances of the groups in the target languages vary. In French and German, the variance grows from 0.030 to 0.060 as the edit distance increases from 0 to 4. For Czech/Spanish, the variance starts at a larger value of 0.040/0.050 and increases to 0.050/0.073. We suspect that the fluctuating variance is related to the similarity between languages. English, German, and French are more similar to one another than to Czech and Spanish. Among the four target languages, German has the highest lexical similarity with English (0.60) and the second highest is French (0.27), while for Czech and Spanish the lexical similarity score is 0 [49].
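The analysis above pairs two simple quantities: the cosine similarity between segment vectors and the edit distance between their phoneme sequences. A sketch of how such a grouping could be computed follows; the pairing scheme and data structures are assumptions for illustration.

```python
import numpy as np
from collections import defaultdict

def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences (lists of strings)."""
    dp = np.arange(len(b) + 1)
    for i, pa in enumerate(a, start=1):
        prev, dp[0] = dp[0], i
        for j, pb in enumerate(b, start=1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (pa != pb))
    return int(dp[len(b)])

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def similarity_by_psed(pairs):
    """`pairs`: list of ((vec1, phones1), (vec2, phones2)) segment pairs.
    Returns {edit distance: (mean cosine similarity, variance)}."""
    groups = defaultdict(list)
    for (v1, p1), (v2, p2) in pairs:
        groups[edit_distance(p1, p2)].append(cosine(v1, v2))
    return {d: (float(np.mean(s)), float(np.var(s))) for d, s in groups.items()}
```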
6.3. Visualization

In order to further investigate the performance of SA, we visualize the vector representations of two sets of word pairs differing by only one phoneme, from French and German, as below:

1. French word pairs: (parler, parlons), (noter, notons), (rappeler, rappelons), (utiliser, utilisons)

2. German word pairs: (tag, tage), (spiel, spiele), (wenig, wenige), (angriff, angriffe)

To show the vector representations in Fig. 6, we first obtained the mean value of the representations for the audio segments of a specific word, denoted by δ(word). Then the average representation δ was projected from 400 dimensions to 2 dimensions using PCA [50]. The difference vector for each word pair, e.g. δ(parlons) − δ(parler), is shown. Although the representations for the French and German word audio segments were extracted from a model trained only on English audio word segments, which had never heard any French or German, the directions and magnitudes of the difference vectors are coherent. In Fig. 6a, δ(parlons) − δ(parler) is close to δ(utilisons) − δ(utiliser); and δ(tage) − δ(tag) is close to δ(wenige) − δ(wenig) in Fig. 6b.

Fig. 6: Difference vectors between the average vector representations for word pairs differing by one edit distance in (a) French and (b) German. (a) French word pairs: the last phoneme changes from 'er' to 'ons'. (b) German word pairs: the last phoneme differs by the presence of a final 'e'.
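A sketch of this visualization step, assuming scikit-learn's PCA and a dictionary mapping each word to the encoded vectors of its audio segments; the function and variable names are illustrative, not the authors' code.

```python
import numpy as np
from sklearn.decomposition import PCA

def difference_vectors(word_vectors, pairs):
    """word_vectors: {word: (n_segments, 400) array of SA representations}
    pairs: list of (base_word, inflected_word) tuples.
    Returns {(base, inflected): 2-D difference vector} after PCA projection."""
    words = sorted(word_vectors)
    means = np.stack([word_vectors[w].mean(axis=0) for w in words])  # delta(word)
    projected = PCA(n_components=2).fit_transform(means)             # 400-D -> 2-D
    coord = dict(zip(words, projected))
    return {(a, b): coord[b] - coord[a] for a, b in pairs}

# e.g. diffs = difference_vectors(vecs, [("parler", "parlons"), ("utiliser", "utilisons")])
# Similar inflections should yield difference vectors with similar direction and length.
```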
6.4. Language Transfer on STD

Besides analyzing the cosine similarity of the learned representations, we also apply them to the query-by-example STD task. Here we compare the retrieval performance in MAP of SA with different levels of access to the low-resource target language, along with two baseline models: NE and SA trained purely on the target languages. For the four target languages, the total available amount of audio word segments in the training set was 4 thousand for each language. In Table 1, we took different partitions of the target-language training sets to fine-tune the SA pretrained on the source language. The amounts of audio word segments in these partitions are 1K, 2K, 3K, 4K, and 0, which means no fine-tuning.

Table 1: The retrieval performance of NE, SA trained on the target language only (denoted SA No Transfer), and SA of the source language tuned with different amounts of data. The numbers (0, 1K, 2K, 3K, 4K) are the amounts of target-language segments used to tune the original SA trained on the source language. For example, SA 2K means that the SA is first trained on the source language and then tuned with 2K target-language segments.

                    FRE    GER    CZE    ESP
  NE                0.22   0.18   0.09   0.17
  SA No Transfer    0.03   0.01   0.00   0.00
  SA 0              0.26   0.24   0.06   0.04
  SA 1K             0.24   0.20   0.09   0.13
  SA 2K             0.26   0.25   0.10   0.12
  SA 3K             0.22   0.19   0.08   0.11
  SA 4K             0.26   0.20   0.09   0.13

From Table 1, SA trained on the source language generally outperforms the SA trained on the limited amount of target-language data ("SA No Transfer"), showing that with enough audio segments, SA can identify and encode universal phonetic structure. Compared with NE, SA surpasses NE in German and French even without fine-tuning, and in Czech SA also achieves a better score than NE with fine-tuning. However, in Spanish, SA achieved a MAP score of 0.13 with fine-tuning, slightly lower than the 0.17 obtained by NE. Looking back at Fig. 5, the gap between phoneme sequence edit distances 2 and 3 in Spanish is smaller than in the other languages. Also, as discussed earlier in Section 6.2, the variance in Spanish is bigger. The smaller gap and bigger variance together indicate that the model is weaker at identifying audio segments of different words in Spanish, which affects the MAP performance in Spanish.

7. CONCLUSION AND FUTURE WORK

In this paper, we verify the capability of language transfer of Audio Word2Vec using the Sequence-to-sequence Autoencoder (SA). We demonstrate that SA can learn the sequential phonetic structure commonly appearing in human language, which makes it possible to apply an Audio Word2Vec model learned from a high-resource language to low-resource languages. The capability of language transfer in Audio Word2Vec is beneficial to many real-world applications, for example, the query-by-example STD shown in this work. For future work, we are examining the performance of the transferred system in other application scenarios, and exploring the performance of Audio Word2Vec under automatic segmentation.

8. REFERENCES

[1] Najim Dehak, Reda Dehak, Patrick Kenny, Niko Brummer, Pierre Ouellet, and Pierre Dumouchel, "Support vector machines versus fast scoring in the low-dimensional total variability space for speaker verification," in INTERSPEECH, 2009.
[2] Bjorn Schuller, Stefan Steidl, and Anton Batliner, "The INTERSPEECH 2009 emotion challenge," in INTERSPEECH, 2009.

[3] Hung-Yi Lee and Lin-Shan Lee, "Enhanced spoken term detection using support vector machines and weighted pseudo examples," Audio, Speech, and Language Processing, IEEE Transactions on, vol. 21, no. 6, pp. 1272–1284, 2013.

[4] I.-F. Chen and C.-H. Lee, "A hybrid HMM/DNN approach to keyword spotting of short words," in INTERSPEECH, 2013.

[5] A. Norouzian, A. Jansen, R. Rose, and S. Thomas, "Exploiting discriminative point process models for spoken term detection," in INTERSPEECH, 2012.

[6] Keith Levin, Katharine Henry, Anton Jansen, and Karen Livescu, "Fixed-dimensional acoustic embeddings of variable-length segments in low-resource settings," in ASRU, 2013.

[7] Keith Levin, Aren Jansen, and Benjamin Van Durme, "Segmental acoustic indexing for zero resource keyword search," in ICASSP, 2015.

[8] Herman Kamper, Weiran Wang, and Karen Livescu, "Deep convolutional acoustic word embeddings using word-pair side information," in ICASSP, 2016.

[9] Samy Bengio and Georg Heigold, "Word embeddings for speech recognition," in INTERSPEECH, 2014.

[10] Guoguo Chen, Carolina Parada, and Tara N. Sainath, "Query-by-example keyword spotting using long short-term memory networks," in ICASSP, 2015.

[11] Herman Kamper, Weiran Wang, and Karen Livescu, "Deep convolutional acoustic word embeddings using word-pair side information," in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 4950–4954.

[12] Guoguo Chen, Carolina Parada, and Tara N. Sainath, "Query-by-example keyword spotting using long short-term memory networks," in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 5236–5240.

[13] Wanjia He, Weiran Wang, and Karen Livescu, "Multi-view recurrent neural acoustic word embeddings," arXiv preprint arXiv:1611.04496, 2016.

[14] Shane Settle, Keith Levin, Herman Kamper, and Karen Livescu, "Query-by-example search with discriminative neural acoustic word embeddings," arXiv preprint arXiv:1706.03818, 2017.

[15] Yu-An Chung, Chao-Chung Wu, Chia-Hao Shen, Hung-Yi Lee, and Lin-Shan Lee, "Audio word2vec: Unsupervised learning of audio segment representations using sequence-to-sequence autoencoder," in INTERSPEECH, 2016, pp. 765–769.

[16] Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight, "Transfer learning for low-resource neural machine translation," arXiv preprint arXiv:1604.02201, 2016.

[17] Prajit Ramachandran, Peter J. Liu, and Quoc V. Le, "Unsupervised pretraining for sequence to sequence learning," CoRR, vol. abs/1611.02683, 2016.

[18] Yoshua Bengio, Patrice Simard, and Paolo Frasconi, "Learning long-term dependencies with gradient descent is difficult," Neural Networks, IEEE Transactions on, vol. 5, no. 2, pp. 157–166, 1994.

[19] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio, "On the difficulty of training recurrent neural networks," in International Conference on Machine Learning, 2013, pp. 1310–1318.

[20] Sepp Hochreiter and Jürgen Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[21] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," arXiv preprint arXiv:1412.3555, 2014.

[22] Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio, "On the properties of neural machine translation: Encoder-decoder approaches," arXiv preprint arXiv:1409.1259, 2014.

[23] Jürgen Schmidhuber, Daan Wierstra, Matteo Gagliolo, and Faustino Gomez, "Training recurrent networks by evolino," Neural Computation, vol. 19, no. 3, pp. 757–779, 2007.

[24] Justin Bayer, Daan Wierstra, Julian Togelius, and Jürgen Schmidhuber, "Evolving memory cell structures for sequence learning," in ICANN, 2009.

[25] Hasim Sak, Andrew Senior, and Françoise Beaufays, "Long short-term memory recurrent neural network architectures for large scale acoustic modeling," in INTERSPEECH, 2014.

[26] Patrick Doetsch, Michal Kozielski, and Hermann Ney, "Fast and robust training of recurrent neural networks for offline handwriting recognition," in ICFHR, 2014.

[27] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," arXiv preprint arXiv:1406.1078, 2014.

[28] Klaus Greff, Rupesh Kumar Srivastava, Jan Koutník, Bas R. Steunebrink, and Jürgen Schmidhuber, "LSTM: A search space odyssey," arXiv preprint arXiv:1503.04069, 2015.

[29] Zhizheng Wu and Simon King, "Investigating gated recurrent networks for speech synthesis," in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 5140–5144.

[30] Lifeng Shang, Zhengdong Lu, and Hang Li, "Neural responding machine for short-text conversation," arXiv preprint arXiv:1503.02364, 2015.

[31] Ramesh Nallapati, Bowen Zhou, Caglar Gulcehre, Bing Xiang, et al., "Abstractive text summarization using sequence-to-sequence RNNs and beyond," arXiv preprint arXiv:1602.06023, 2016.

[32] Yaodong Tang, Zhiyong Wu, Helen M. Meng, Mingxing Xu, and Lianhong Cai, "Analysis on gated recurrent unit based question detection approach," in INTERSPEECH, 2016, pp. 735–739.

[33] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le, "Sequence to sequence learning with neural networks," in NIPS, 2014.

[34] Geoffrey E. Hinton and Ruslan R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, 2006.

[35] Pierre Baldi, "Autoencoders, unsupervised learning, and deep architectures," Unsupervised and Transfer Learning Challenges in Machine Learning, Volume 7, p. 43, 2012.

[36] Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, and Samy Bengio, "Generating sentences from a continuous space," CoNLL 2016, p. 10, 2016.

[37] Stanislau Semeniuta, Aliaksei Severyn, and Erhardt Barth, "A hybrid convolutional variational autoencoder for text generation," arXiv preprint arXiv:1702.02390, 2017.

[38] Preksha Nema, Mitesh Khapra, Anirban Laha, and Balaraman Ravindran, "Diversity driven attention model for query-based abstractive summarization," arXiv preprint arXiv:1704.08300, 2017.

[39] Luis Javier Rodriguez-Fuentes, Amparo Varona, Mikel Penagarikano, German Bordel, and Mireia Diez, "GTTS systems for the SWS task at MediaEval 2013," in MediaEval, 2013.

[40] Luis Javier Rodriguez-Fuentes, Amparo Varona, Mikel Penagarikano, German Bordel, and Mireia Diez, "GTTS-EHU systems for QUESST at MediaEval 2014," in MediaEval, 2014.

[41] Haipeng Wang, Tan Lee, Cheung-Chi Leung, Bin Ma, and Haizhou Li, "Using parallel tokenizers with DTW matrix combination for low-resource spoken term detection," in ICASSP, 2013.

[42] Peng Yang, Haihua Xu, Xiong Xiao, Lei Xie, Cheung-Chi Leung, Hongjie Chen, Jia Yu, Hang Lv, Lei Wang, Su Jun Leow, Bin Ma, Eng Siong Chng, and Haizhou Li, "The NNI query-by-example system for MediaEval 2014," in MediaEval, 2014.

[43] Haipeng Wang and Tan Lee, "CUHK system for the spoken web search task at MediaEval 2012," in MediaEval, 2012.

[44] Haipeng Wang and Tan Lee, "CUHK system for QUESST task of MediaEval 2014," in MediaEval, 2014.

[45] Andi Buzo, Horia Cucu, and Corneliu Burileanu, "SpeeD @ MediaEval 2014: Spoken term detection with robust multilingual phone recognition," in MediaEval, 2014.

[46] Jorge Proença, Arlindo Veiga, and Fernando Perdigão, "The SPL-IT query by example search on speech system for MediaEval 2014," in MediaEval, 2014.

[47] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 5206–5210.

[48] Tanja Schultz, "GlobalPhone: A multilingual speech and text database developed at Karlsruhe University," in INTERSPEECH, 2002.

[49] M. Paul Lewis, Ed., Ethnologue: Languages of the World, SIL International, Dallas, TX, USA, sixteenth edition, 2009.

[50] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean, "Distributed representations of words and phrases and their compositionality," in NIPS, 2013.
