A Phonotactic Language Model For Spoken Language Identification


Haizhou Li and Bin Ma

Institute for Infocomm Research
Singapore 119613
{hli,mabin}@i2r.a-star.edu.sg

Proceedings of the 43rd Annual Meeting of the ACL, pages 515–522, Ann Arbor, June 2005. © 2005 Association for Computational Linguistics

Abstract

We have established a phonotactic language model as the solution to spoken language identification (LID). In this framework, we define a single set of acoustic tokens to represent the acoustic activities in the world’s spoken languages. A voice tokenizer converts a spoken document into a text-like document of acoustic tokens. Thus a spoken document can be represented by a count vector of acoustic tokens and token n-grams in the vector space. We apply latent semantic analysis to the vectors, in the same way that it is applied in information retrieval, in order to capture salient phonotactics present in spoken documents. The vector space modeling of spoken utterances constitutes a paradigm shift in LID technology and has proven to be very successful. It presents a 12.4% error rate reduction over one of the best reported results on the 1996 NIST Language Recognition Evaluation database.

1 Introduction

Spoken language and written language are similar in many ways. Therefore, much of the research in spoken language identification, LID, has been inspired by text-categorization methodology. Both text and voice are generated from language dependent vocabulary. For example, both can be seen as stochastic time-sequences corrupted by a channel noise. The n-gram language model has achieved equal amounts of success in both tasks, e.g. n-character slice for text categorization by language (Cavnar and Trenkle, 1994) and Phone Recognition followed by n-gram Language Modeling, or PRLM (Zissman, 1996).

Orthographic forms of language, ranging from Latin alphabet to Cyrillic script to Chinese characters, are far more unique to the language than their phonetic counterparts. From the speech production point of view, thousands of spoken languages from all over the world are phonetically articulated using only a few hundred distinctive sounds or phonemes (Hieronymus, 1994). In other words, common sounds are shared considerably across different spoken languages. In addition, spoken documents¹, in the form of digitized wave files, are far less structured than written documents and need to be treated with techniques that go beyond the bounds of written language. All of this makes the identification of spoken language based on phonetic units much more challenging than the identification of written language. In fact, the challenge of LID is inter-disciplinary, involving digital signal processing, speech recognition and natural language processing.

¹ A spoken utterance is regarded as a spoken document in this paper.

In general, a LID system usually has three fundamental components as follows:
1) A voice tokenizer which segments incoming voice feature frames and associates the segments with acoustic or phonetic labels, called tokens;
2) A statistical language model which captures language dependent phonetic and phonotactic information from the sequences of tokens;
3) A language classifier which identifies the language based on discriminatory characteristics of acoustic score from the voice tokenizer and phonotactic score from the language model.
In this paper, we present a novel solution to the three problems, focusing on the second and third problems from a computational linguistic perspective. The paper is organized as follows: In Section 2, we summarize relevant existing approaches to the LID task. We highlight the shortcomings of existing approaches and our attempts to address the
issues. In Section 3 we propose the bag-of-sounds paradigm to turn the LID task into a typical text categorization problem. In Section 4, we study the effects of different settings in experiments on the 1996 NIST Language Recognition Evaluation (LRE) database². In Section 5, we conclude our study and discuss future work.

² http://www.nist.gov/speech/tests/

2 Related Work

Formal evaluations conducted by the National Institute of Standards and Technology (NIST) in recent years demonstrated that the most successful approach to LID used the phonotactic content of the voice signal to discriminate between a set of languages (Singer et al., 2003). We briefly discuss previous work cast in the formalism mentioned above: tokenization, statistical language modeling, and language identification. A typical LID system is illustrated in Figure 1 (Zissman, 1996), where language dependent voice tokenizers (VT) and language models (LM) are deployed in the Parallel PRLM architecture, or P-PRLM.

[Figure 1 diagram: spoken utterance → parallel voice tokenizers VT-1: Chinese, VT-2: English, ..., VT-L: French → language models LM-1 … LM-L after each tokenizer → language classifier → hypothesized language]

Figure 1. L monolingual phoneme recognition front-ends are used in parallel to tokenize the input utterance, which is analyzed by LMs to predict the spoken language

2.1 Voice Tokenization

A voice tokenizer is a speech recognizer that converts a spoken document into a sequence of tokens. As illustrated in Figure 2, a token can be of different sizes, ranging from a speech feature frame, to a phoneme, to a lexical word. A token is defined to describe a distinct acoustic/phonetic activity. In early research, low level spectral frames, which are assumed to be independent of each other, were used as a set of prototypical spectra for each language (Sugiyama, 1991). By adopting hidden Markov models, people moved beyond low-level spectral analysis towards modeling a frame sequence into a larger unit such as a phoneme and even a lexical word.

Since the lexical word is language specific, the phoneme becomes the natural choice when building a language-independent voice tokenization front-end. Previous studies show that parallel language-dependent phoneme tokenizers effectively serve as the tokenization front-ends, with P-PRLM being the typical example. However, a language-independent phoneme set has not yet been explored experimentally. In this paper, we would like to explore the potential of voice tokenization using a unified phoneme set.

[Figure 2 diagram: the same utterance tokenized at three resolutions: word, phoneme, frame]

Figure 2. Tokenization at different resolutions

2.2 n-gram Language Model

With the sequence of tokens, we are able to estimate an n-gram language model (LM) from the statistics. It is generally agreed that phonotactics, i.e. the rules governing the phone/phoneme sequences admissible in a language, carry more language discriminative information than the phonemes themselves. An n-gram LM over the tokens describes the n-local phonotactics among neighboring tokens well. While some systems model the phonotactics at the frame level (Torres-Carrasquillo et al., 2002), others have proposed P-PRLM. The latter has become one of the most promising solutions so far (Zissman, 1996).

A variety of cues can be used by humans and machines to distinguish one language from another. These cues include phonology, prosody, morphology, and syntax in the context of an utterance.
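The n-gram phonotactic scoring described in Section 2.2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the system evaluated in this paper: the two-symbol token inventory, the language labels `lang_a` and `lang_b`, the `<s>` start symbol, the add-one smoothing and all function names are hypothetical choices made for the example. Each language gets its own bigram model, and an utterance is assigned to the language whose model gives its token sequence the highest log-probability.

```python
import math
from collections import Counter

def train_bigram_lm(sequences, vocab_size, alpha=1.0):
    """Return a function giving add-alpha smoothed bigram log-probabilities."""
    bigrams, contexts = Counter(), Counter()
    for seq in sequences:
        padded = ["<s>"] + list(seq)          # "<s>" marks the utterance start (assumption)
        for prev, cur in zip(padded, padded[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1

    def logprob(prev, cur):
        num = bigrams[(prev, cur)] + alpha
        den = contexts[prev] + alpha * vocab_size
        return math.log(num / den)

    return logprob

def score(logprob, seq):
    """Log-probability of a token sequence under one language's bigram LM."""
    padded = ["<s>"] + list(seq)
    return sum(logprob(p, c) for p, c in zip(padded, padded[1:]))

def identify(models, seq):
    """Pick the language whose LM scores the token sequence highest."""
    return max(models, key=lambda lang: score(models[lang], seq))

# Toy training data (hypothetical): lang_a alternates tokens, lang_b repeats them.
training = {
    "lang_a": [["a", "b", "a", "b", "a", "b"], ["b", "a", "b", "a"]],
    "lang_b": [["a", "a", "b", "b", "a", "a"], ["b", "b", "a", "a", "b", "b"]],
}
vocab = {"a", "b"}
models = {lang: train_bigram_lm(seqs, len(vocab)) for lang, seqs in training.items()}

guess_1 = identify(models, ["a", "b", "a", "b"])
guess_2 = identify(models, ["a", "a", "b", "b"])
print(guess_1, guess_2)
```

Both toy languages use the same two tokens with similar unigram frequencies, so only the order of the tokens separates them, which is the sense in which phonotactics carries the discriminative information.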
However, global phonotactic cues at the level of utterance or spoken document remain unexplored in previous work. In this paper, we pay special attention to them. A spoken language always contains a set of high frequency function words, prefixes, and suffixes, which are realized as phonetic token substrings in the spoken document. Individually, those substrings may be shared across languages. However, the pattern of their co-occurrences discriminates one language from another.

Perceptual experiments have shown (Muthusamy et al., 1994) that with adequate training, human listeners’ language identification ability increases when given longer excerpts of speech. Experiments have also shown that increased exposure to each language and longer training sessions improve listeners’ language identification performance. Although it is not entirely clear how human listeners make use of the high-order phonotactic/prosodic cues present in longer spans of a spoken document, strong evidence shows that phonotactics over larger context provides valuable LID cues beyond n-gram, which will be further attested by our experiments in Section 4.

2.3 Language Classifier

The task of a language classifier is to make good use of the LID cues that are encoded in the model $\lambda_l$ to hypothesize $\hat{l}$ from among $L$ languages, $\Lambda$, as the one that is actually spoken in a spoken document $O$. The LID model $\lambda_l$ in P-PRLM refers to extracted information from acoustic model and n-gram LM for language $l$. We have $\lambda_l = \{\lambda_l^{AM}, \lambda_l^{LM}\}$ and $\lambda_l \in \Lambda$ $(l = 1, \ldots, L)$. A maximum-likelihood classifier can be formulated as follows:

$$\hat{l} = \arg\max_{l \in \Lambda} P(O / \lambda_l) \approx \arg\max_{l \in \Lambda} \sum_{T \in \Gamma} P(O / T, \lambda_l^{AM})\, P(T / \lambda_l^{LM}) \tag{1}$$

The exact computation in Eq.(1) involves summing over all possible decoding of token sequences $T \in \Gamma$ given $O$. In many implementations, it is approximated by the maximum over all sequences in the sum by finding the most likely token sequence, $\hat{T}_l$, for each language $l$, using the Viterbi algorithm:

$$\hat{l} \approx \arg\max_{l \in \Lambda} \left[ P(O / \hat{T}_l, \lambda_l^{AM})\, P(\hat{T}_l / \lambda_l^{LM}) \right] \tag{2}$$

Intuitively, individual sounds are heavily shared among different spoken languages due to the common speech production mechanism of humans. Thus, the acoustic score has little language discriminative ability. Many experiments (Yan and Barnard, 1995; Zissman, 1996) have further attested that the n-gram LM score provides more language discriminative information than its acoustic counterpart. In Figure 1, the decoding of voice tokenization is governed by the acoustic model $\lambda_l^{AM}$ to arrive at an acoustic score $P(O / \hat{T}_l, \lambda_l^{AM})$ and a token sequence $\hat{T}_l$. The n-gram LM derives the n-local phonotactic score $P(\hat{T}_l / \lambda_l^{LM})$ from the language model $\lambda_l^{LM}$.

Clearly, the n-gram LM suffers the major shortcoming of having not exploited the global phonotactics in the larger context of a spoken utterance. Speech recognition researchers have so far chosen to only use n-gram local statistics for primarily pragmatic reasons, as this n-gram is easier to attain. In this work, a language independent voice tokenization front-end is proposed that uses a unified acoustic model $\lambda^{AM}$ instead of multiple language dependent acoustic models $\lambda_l^{AM}$. The n-gram LM $\lambda_l^{LM}$ is generalized to model both local and global phonotactics.

3 Bag-of-Sounds Paradigm

The bag-of-sounds concept is analogous to the bag-of-words paradigm originally formulated in the context of information retrieval (IR) and text categorization (TC) (Salton, 1971; Berry et al., 1995; Chu-Carroll and Carpenter, 1999). One focus of IR is to extract informative features for document representation. The bag-of-words paradigm represents a document as a vector of counts. It is believed that it is not just the words, but also the co-occurrence of words that distinguish semantic domains of text documents.

Similarly, it is generally believed in LID that, although the sounds of different spoken languages overlap considerably, the phonotactics differentiates one language from another. Therefore, one can easily draw the analogy between an acoustic token in bag-of-sounds and a word in bag-of-words. Unlike words in a text document, the phonotactic information that distinguishes spoken languages is
concealed in the sound waves of spoken languages. After transcribing a spoken document into a text-like document of tokens, many IR or TC techniques can then be readily applied.

It is beyond the scope of this paper to discuss what would be a good voice tokenizer. We adopt phoneme size language-independent acoustic tokens to form a unified acoustic vocabulary in our voice tokenizer. Readers are referred to (Ma et al., 2005) for details of acoustic modeling.

3.1 Vector Space Modeling

In human languages, some words invariably occur more frequently than others. One of the most common ways of expressing this idea is known as Zipf’s Law (Zipf, 1949). This law states that there is always a set of words which dominates most of the other words of the language in terms of their frequency of use. This is true both of written words and of spoken words. The short-term, or local phonotactics, is devised to describe Zipf’s Law.

The local phonotactic constraints can be typically described by the token n-grams, or phoneme n-grams as in (Ng et al., 2000), which represent short-term statistics such as lexical constraints. Suppose that we have a token sequence, $t_1\ t_2\ t_3\ t_4$. We derive the unigram statistics from the token sequence itself. We derive the bigram statistics from $t_1(t_2)\ t_2(t_3)\ t_3(t_4)\ t_4(\#)$ where the token vocabulary is expanded over the token’s right context. Similarly, we derive the trigram statistics from $t_1(\#,t_2)\ t_2(t_1,t_3)\ t_3(t_2,t_4)\ t_4(t_3,\#)$ to account for left and right contexts. The # sign is a place holder for free context. In the interest of manageability, we propose to use up to token trigram. In this way, for an acoustic system of $Y$ tokens, we have potentially $Y^2$ bigrams and $Y^3$ trigrams in the vocabulary.

Meanwhile, motivated by the ideas of having both short-term and long-term phonotactic statistics, we propose to derive global phonotactics information to account for long-term phonotactics:

The global phonotactic constraint is the high-order statistics of n-grams. It represents document level long-term phonotactics such as co-occurrences of n-grams. By representing a spoken document as a count vector of n-grams, also called bag-of-sounds vector, it is possible to explore the relations and higher-order statistics among the diverse n-grams through latent semantic analysis (LSA).

It is often advantageous to weight the raw counts to refine the contribution of each n-gram to LID. We begin by normalizing the vectors representing the spoken document by making each vector of unit length. Our second weighting is based on the notion that an n-gram that only occurs in a few languages is more discriminative than an n-gram that occurs in nearly every document. We use the inverse-document frequency (idf) weighting scheme (Spark Jones, 1972), in which a word is weighted inversely to the number of documents in which it occurs, by means of $idf(w) = \log D / d(w)$, where $w$ is a word in the vocabulary of $W$ token n-grams. $D$ is the total number of documents in the training corpus from $L$ languages. Since each language has at least one document in the training corpus, we have $D \ge L$. $d(w)$ is the number of documents containing the word $w$. Letting $c_{w,d}$ be the count of word $w$ in document $d$, we have the weighted count as

$$c'_{w,d} = c_{w,d} \times idf(w) \Big/ \Big( \sum_{1 \le w' \le W} c_{w',d}^{2} \Big)^{1/2} \tag{3}$$

and a vector $c_d = \{c'_{1,d}, c'_{2,d}, \ldots, c'_{W,d}\}^T$ to represent document $d$. A corpus is then represented by a term-document matrix $H = \{c_1, c_2, \ldots, c_D\}$ of $W \times D$.

3.2 Latent Semantic Analysis

The fundamental idea in LSA is to reduce the dimension of a document vector, $W$ to $Q$, where $Q \ll W$ and $Q \ll D$, by projecting the problem into the space spanned by the rows of the closest rank-$Q$ matrix to $H$ in the Frobenius norm (Deerwester et al., 1990). Through singular value decomposition (SVD) of $H$, we construct a modified matrix $H_Q$ from the $Q$-largest singular values:

$$H_Q = U_Q S_Q V_Q^T \tag{4}$$

$U_Q$ is a $W \times Q$ left singular matrix with rows $u_w$, $1 \le w \le W$; $S_Q$ is a $Q \times Q$ diagonal matrix of $Q$-largest singular values of $H$; $V_Q$ is a $D \times Q$ right singular matrix with rows $v_d$, $1 \le d \le D$.

With the SVD, we project the $D$ document vectors in $H$ into a reduced space $V_Q$, referred to as Q-space in the rest of this paper. A test document $c_p$ of unknown language ID is mapped to a pseudo-document $v_p$ in the Q-space by matrix $U_Q$
$$c_p \rightarrow v_p = c_p^T U_Q S_Q^{-1} \tag{5}$$

After SVD, it is straightforward to arrive at a natural metric for the closeness between two spoken documents $v_i$ and $v_j$ in Q-space instead of their original W-dimensional space $c_i$ and $c_j$.

$$g(c_i, c_j) \approx \cos(v_i, v_j) = \frac{v_i \cdot v_j^T}{\|v_i\| \cdot \|v_j\|} \tag{6}$$

$g(c_i, c_j)$ indicates the similarity between two vectors, which can be transformed to a distance measure $k(c_i, c_j) = \cos^{-1} g(c_i, c_j)$.

In the forced-choice classification, a test document, supposedly monolingual, is classified into one of the $L$ languages. Note that the test document is unknown to the $H$ matrix. We assume consistency between the test document’s intrinsic phonotactic pattern and one of the $D$ patterns that are extracted from the training data and presented in the $H$ matrix, so that the SVD matrices still apply to the test document, and Eq.(5) still holds for dimension reduction.

3.3 Bag-of-Sounds Language Classifier

The bag-of-sounds phonotactic LM benefits from several properties of vector space modeling and LSA.
1) It allows for representing a spoken document as a vector of n-gram features, such as unigram, bigram, trigram, and the mixture of them;
2) It provides a well-defined distance metric for measurement of phonotactic distance between spoken documents;
3) It processes spoken documents in a lower dimensional Q-space, which makes the bag-of-sounds phonotactic language modeling, $\lambda_l^{LM}$, and classification computationally manageable.

Suppose we have only one prototypical vector $c_l$ and its projection in the Q-space $v_l$ to represent language $l$. Applying LSA to the term-document matrix $H : W \times L$, a minimum distance classifier is formulated:

$$\hat{l} = \arg\min_{l \in \Lambda} k(v_p, v_l) \tag{7}$$

In Eq.(7), $v_p$ is the Q-space projection of $c_p$, a test document.

Apparently, it is very restrictive for each language to have just one prototypical vector, also referred to as a centroid. The pattern of language distribution is inherently multi-modal, so it is unlikely to be well fitted by a single vector. One solution to this problem is to span the language space with multiple vectors. Applying LSA to a term-document matrix $H : W \times L'$, where $L' = L \times M$ assuming each language $l$ is represented by a set of $M$ vectors, $\Phi_l$, a new classifier, using the k-nearest neighbor rule (Duda and Hart, 1973), is formulated, named k-nearest classifier (KNC):

$$\hat{l} = \arg\min_{l \in \Lambda} \sum_{l' \in \phi_l} k(v_p, v_{l'}) \tag{8}$$

where $\phi_l$ is the set of k-nearest neighbors to $v_p$ and $\phi_l \subset \Phi_l$.

Among many ways to derive the $M$ centroid vectors, here is one option. Suppose that we have a set of training documents $D_l$ for language $l$, as subset of corpus $\Omega$, $D_l \subset \Omega$ and $\cup_{l=1}^{L} D_l = \Omega$. To derive the $M$ vectors, we choose to carry out vector quantization (VQ) to partition $D_l$ into $M$ cells $D_{l,m}$ in the Q-space such that $\cup_{m=1}^{M} D_{l,m} = D_l$ using similarity metric Eq.(6). All the documents in each cell $D_{l,m}$ can then be merged to form a super-document, which is further projected into a Q-space vector $v_{l,m}$. This results in $M$ prototypical centroids $v_{l,m} \in \Phi_l$ $(m = 1, \ldots, M)$. Using KNC, a test vector is compared with $M$ vectors to arrive at the k-nearest neighbors for each language, which can be computationally expensive when $M$ is large.

Alternatively, one can account for multi-modal distribution through a finite mixture model. A mixture model represents the $M$ discrete components with soft combination. To extend the KNC into a statistical framework, it is necessary to map our distance metric Eq.(6) into a probability measure. One way is for the distance measure to induce a family of exponential distributions with pertinent marginality constraints. In practice, what we need is a reasonable probability distribution, which sums to one, to act as a lookup table for the distance measure. We here choose to use the empirical multivariate distribution constructed by allocating the total probability mass in proportion to the distances observed with the training data. In short, this reduces the task to a histogram normalization. In this way, we map the distance $k(c_i, c_j)$ to a conditional probability distribution $p(v_i | v_j)$
subject to $\sum_{i=1}^{|\Omega|} p(v_i | v_j) = 1$. Now that we are in the probability domain, techniques such as mixture smoothing can be readily applied to model a language class with finer fitting.

Let’s re-visit the task of $L$ language forced-choice classification. Similar to KNC, suppose we have $M$ centroids $v_{l,m} \in \Phi_l$ $(m = 1, \ldots, M)$ in the Q-space for each language $l$. Each centroid represents a class. The class conditional probability can be described as a linear combination of $p(v_i | v_{l,m})$:

$$p(v_i | \lambda_l^{LM}) = \sum_{m=1}^{M} p(v_{l,m})\, p(v_i | v_{l,m}) \tag{9}$$

where the probability $p(v_{l,m})$ functionally serves as a mixture weight of $p(v_i | v_{l,m})$. Together with a set of centroids $v_{l,m} \in \Phi_l$ $(m = 1, \ldots, M)$, $p(v_i | v_{l,m})$ and $p(v_{l,m})$ define a mixture model $\lambda_l^{LM}$. $p(v_i | v_{l,m})$ is estimated by histogram normalization and $p(v_{l,m})$ is estimated under the maximum likelihood criterion, $p(v_{l,m}) = C_{m,l} / C_l$, where $C_l$ is the total number of documents in $D_l$, of which $C_{m,l}$ documents fall into the cell $m$.

An Expectation-Maximization iterative process can be devised for training of $\lambda_l^{LM}$ to maximize the likelihood Eq.(9) over the entire training corpus:

$$p(\Omega | \Lambda) = \prod_{l=1}^{L} \prod_{d=1}^{|D_l|} p(v_d | \lambda_l^{LM}) \tag{10}$$

Using the phonotactic LM score $P(\hat{T}_l / \lambda_l^{LM})$ for classification, with $\hat{T}_l$ being represented by the bag-of-sounds vector $v_p$, Eq.(2) can be reformulated as Eq.(11), named mixture-model classifier (MMC):

$$\hat{l} = \arg\max_{l \in \Lambda} p(v_p | \lambda_l^{LM}) = \arg\max_{l \in \Lambda} \sum_{m=1}^{M} p(v_{l,m})\, p(v_p | v_{l,m}) \tag{11}$$

To establish fair comparison with P-PRLM, as shown in Figure 3, we devise our bag-of-sounds classifier to solely use the LM score $P(\hat{T}_l / \lambda_l^{LM})$ for classification decision whereas the acoustic score $P(O / \hat{T}_l, \lambda_l^{AM})$ may potentially help as reported in (Singer et al., 2003).

[Figure 3 diagram: spoken utterance → unified VT ($\lambda^{AM}$) → parallel LMs LM-1: Chinese ($\lambda_1^{LM}$), LM-2: English ($\lambda_2^{LM}$), ..., LM-L: French ($\lambda_L^{LM}$) → language classifier → hypothesized language]

Figure 3. A bag-of-sounds classifier. A unified front-end followed by L parallel bag-of-sounds phonotactic LMs.

4 Experiments

This section will experimentally analyze the performance of the proposed bag-of-sounds framework using the 1996 NIST Language Recognition Evaluation (LRE) data. The database was intended to establish a baseline of performance capability for language recognition of conversational telephone speech. The database contains recorded speech of 12 languages: Arabic, English, Farsi, French, German, Hindi, Japanese, Korean, Mandarin, Spanish, Tamil and Vietnamese. We use the training set and development set from LDC CallFriend corpus³ as the training data. Each conversation is segmented into overlapping sessions of about 30 seconds each, resulting in about 12,000 sessions for each language. The evaluation set consists of 1,492 30-sec sessions, each distributed among the various languages of interest. We treat a 30-sec session as a spoken document in both training and testing. We report error rates (ER) of the 1,492 test trials.

³ See http://www.ldc.upenn.edu/. The overlap between 1996 NIST evaluation data and CallFriend database has been removed from training data as suggested in the 2003 NIST LRE website http://www.nist.gov/speech/tests/index.htm

4.1 Effect of Acoustic Vocabulary

The choice of n-gram affects the performance of LID systems. Here we would like to see how a better choice of acoustic vocabulary can help convert a spoken document into a phonotactically discriminative space. There are two parameters that determine the acoustic vocabulary: the choice of acoustic token, and the choice of n-grams. In this paper, the former concerns the size of an acoustic system $Y$ in the unified front-end. It is studied in more detail in (Ma et al., 2005). We set $Y$ to 32 in
this experiment; the latter decides what features are to be included in the vector space. The vector space modeling allows for multiple heterogeneous features in one vector. We introduce three types of acoustic vocabulary (AV) with mixture of token unigram, bigram, and trigram:
a) AV1: 32 broad class phonemes as unigram, selected from 12 languages, also referred to as P-ASM as detailed in (Ma et al., 2005)
b) AV2: AV1 augmented by 32 × 32 bigrams of AV1, amounting to 1,056 tokens
c) AV3: AV2 augmented by 32 × 32 × 32 trigrams of AV1, amounting to 33,824 tokens

        AV1    AV2    AV3
ER %    46.1   32.8   28.3
Table 1. Effect of acoustic vocabulary (KNC)

We carry out experiments with a KNC classifier of 4,800 centroids. Applying the k-nearest-neighbor rule, k is empirically set to 3. The error rates are reported in Table 1 for the experiments over the three AV types. It is found that high-order token n-grams improve LID performance. This reaffirms many previous findings that n-gram phonotactics serves as a valuable cue in LID.

4.2 Effect of Model Size

As discussed in KNC, one would expect to improve the phonotactic model by using more centroids. Let’s examine how the number of centroid vectors M affects the performance of KNC. We set the acoustic system size Y to 128, k-nearest to 3, and only use token bigrams in the bag-of-sounds vector. In Table 2, it is not surprising to find that the performance improves as M increases. However, it is not practical to have large M because L′ = L × M comparisons need to take place in each test trial.

#M      1,200   2,400   4,800   12,000
ER %    17.0    15.7    15.4    14.8
Table 2. Effect of number of centroids (KNC)

To reduce computation, MMC attempts to use fewer mixtures M to represent the phonotactic space. With the smoothing effect of the mixture model, we expect to use less computation to achieve performance similar to KNC. In the experiment reported in Table 3, we find that MMC (M=1,024) achieves 14.9% error rate, which nearly matches the best result in the KNC experiment (14.8% at M=12,000) with much less computation.

#M      4      16     64     256    1,024
ER %    29.6   26.4   19.7   16.0   14.9
Table 3. Effect of number of mixtures (MMC)

4.3 Discussion

The bag-of-sounds approach has achieved equal success in both 1996 and 2003 NIST LRE databases. As more results are published on the 1996 NIST LRE database, we choose it as the platform of comparison. In Table 4, we report the performance across different approaches in terms of error rate for a quick comparison. MMC presents a 12.4% relative ER reduction (from 17.0% to 14.9%) over the best reported result⁴ (Torres-Carrasquillo et al., 2002).

⁴ Previous results are also reported in DCF, DET, and equal error rate (EER). Comprehensive benchmarking for bag-of-sounds phonotactic LM will be reported soon.

It is interesting to note that the bag-of-sounds classifier outperforms its P-PRLM counterpart by a wide margin (14.9% vs 22.0%). This is attributed to the global phonotactic features in $\lambda_l^{LM}$. The performance gain in (Torres-Carrasquillo et al., 2002; Singer et al., 2003) was obtained mainly by fusing scores from several classifiers, namely GMM, P-PRLM and SVM, to benefit from both acoustic and language model scores. Noting that the bag-of-sounds classifier in this work solely relies on the LM score, it is believed that fusing with scores from other classifiers will further boost the LID performance.

                                           ER %
P-PRLM⁵                                    22.0
P-PRLM + GMM acoustic⁵                     19.5
P-PRLM + GMM acoustic + GMM tokenizer⁵     17.0
Bag-of-sounds classifier (MMC)             14.9
Table 4. Benchmark of different approaches

⁵ Results extracted from (Torres-Carrasquillo et al., 2002)

Besides the error rate reduction, the bag-of-sounds approach also simplifies the on-line computing procedure over its P-PRLM counterpart. It would be interesting to estimate the on-line computational need of MMC. The cost incurred has two main components: 1) the construction of the
pseudo document vector, as done via Eq.(5); 2) L′ = L × M vector comparisons. The computing cost is estimated to be $O(Q^2)$ per test trial (Bellegarda, 2000). For typical values of Q, this amounts to less than 0.05 Mflops. While this is more expensive than the usual table look-up in conventional n-gram LM, the performance improvement is able to justify the relatively modest computing overhead.

5 Conclusion

We have proposed a phonotactic LM approach to the LID problem. The concept of bag-of-sounds is introduced, for the first time, to model phonotactics present in a spoken language over a larger context. With bag-of-sounds phonotactic LM, a spoken document can be treated as a text-like document of acoustic tokens. This way, the well-established LSA technique can be readily applied. This novel approach not only suggests a paradigm shift in LID, but also brings 12.4% error rate reduction over one of the best reported results on the 1996 NIST LRE data. It has proven to be very successful.

We would like to extend this approach to other spoken document categorization tasks. In monolingual spoken document categorization, we suggest that the semantic domain can be characterized by latent phonotactic features. Thus it is straightforward to extend the proposed bag-of-sounds framework to spoken document categorization.

Acknowledgement

The authors are grateful to Dr. Alvin F. Martin of the NIST Speech Group for his advice when preparing the 1996 NIST LRE experiments, and to Dr G. M. White and Ms Y. Chen of Institute for Infocomm Research for insightful discussions.

References

Jerome R. Bellegarda. 2000. Exploiting latent semantic information in statistical language modeling, In Proc. of the IEEE, 88(8):1279-1296.

M. W. Berry, S.T. Dumais and G.W. O’Brien. 1995. Using Linear Algebra for intelligent information retrieval, SIAM Review, 37(4):573-595.

William B. Cavnar, and John M. Trenkle. 1994. N-Gram-Based Text Categorization, In Proc. of 3rd Annual Symposium on Document Analysis and Information Retrieval, pp. 161-169.

Jennifer Chu-Carroll, and Bob Carpenter. 1999. Vector-based Natural Language Call Routing, Computational Linguistics, 25(3):361-388.

S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman. 1990. Indexing by latent semantic analysis, Journal of the American Society for Information Science, 41(6):391-407.

Richard O. Duda and Peter E. Hart. 1973. Pattern Classification and Scene Analysis. John Wiley & Sons.

James L. Hieronymus. 1994. ASCII Phonetic Symbols for the World’s Languages: Worldbet. Technical Report AT&T Bell Labs.

Spark Jones, K. 1972. A statistical interpretation of term specificity and its application in retrieval, Journal of Documentation, 28:11-20.

Bin Ma, Haizhou Li and Chin-Hui Lee. 2005. An Acoustic Segment Modeling Approach to Automatic Language Identification, submitted to Interspeech 2005.

Yeshwant K. Muthusamy, Neena Jain, and Ronald A. Cole. 1994. Perceptual benchmarks for automatic language identification, In Proc. of ICASSP.

Corinna Ng, Ross Wilkinson, Justin Zobel. 2000. Experiments in spoken document retrieval using phoneme n-grams, Speech Communication, 32(1-2):61-77.

G. Salton. 1971. The SMART Retrieval System, Prentice-Hall, Englewood Cliffs, NJ.

E. Singer, P.A. Torres-Carrasquillo, T.P. Gleason, W.M. Campbell and D.A. Reynolds. 2003. Acoustic, Phonetic and Discriminative Approaches to Automatic language recognition, In Proc. of Eurospeech.

Masahide Sugiyama. 1991. Automatic language recognition using acoustic features, In Proc. of ICASSP.

Pedro A. Torres-Carrasquillo, Douglas A. Reynolds, and J.R. Deller. Jr. 2002. Language identification using Gaussian Mixture model tokenization, In Proc. of ICASSP.

Yonghong Yan, and Etienne Barnard. 1995. An approach to automatic language identification based on language dependent phone recognition, In Proc. of ICASSP.

George K. Zipf. 1949. Human Behavior and the Principle of Least Effort, an introduction to human ecology. Addison-Wesley, Reading, Mass.

Marc A. Zissman. 1996. Comparison of four approaches to automatic language identification of telephone speech, IEEE Trans. on Speech and Audio Processing, 4(1):31-44.
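As a compact recap of the vector space steps in Sections 3.1 to 3.3, the following Python sketch strings together the idf weighting and unit-length normalization of Eq.(3), the truncated SVD of Eq.(4), the fold-in of a test document by Eq.(5), the angular distance derived from Eq.(6), and the minimum distance classifier of Eq.(7). It is a toy illustration under stated assumptions and not the implementation used in the experiments: NumPy is assumed to be available, the 6 × 6 count matrix, the labels A, B and C, the choice Q = 3 and all function names are hypothetical, and each language centroid is obtained by averaging Q-space document vectors, which is a simplification of the super-document construction described in Section 3.3.

```python
import numpy as np

def idf_weights(counts):
    """idf(w) = log(D / d(w)) for a W x D matrix of raw n-gram counts."""
    n_docs = counts.shape[1]
    doc_freq = np.count_nonzero(counts, axis=1)       # d(w)
    return np.log(n_docs / np.maximum(doc_freq, 1))

def weight(counts, idf):
    """Eq.(3): scale each document (column) to unit length, then apply idf."""
    norms = np.linalg.norm(counts, axis=0, keepdims=True)
    return counts / np.where(norms == 0, 1.0, norms) * idf[:, None]

def fit_lsa(H, Q):
    """Eq.(4): keep the Q largest singular triplets of the W x D matrix H."""
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    return U[:, :Q], s[:Q], Vt[:Q].T                  # U_Q, diag of S_Q, V_Q

def fold_in(c, U_Q, s_Q):
    """Eq.(5): v_p = c_p^T U_Q S_Q^{-1}."""
    return (c @ U_Q) / s_Q

def distance(v_i, v_j):
    """k = arccos of the cosine similarity of Eq.(6)."""
    cos = (v_i @ v_j) / (np.linalg.norm(v_i) * np.linalg.norm(v_j))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def classify(v_p, centroids):
    """Eq.(7): language whose centroid is nearest to v_p."""
    return min(centroids, key=lambda lang: distance(v_p, centroids[lang]))

# Toy corpus (hypothetical): rows are n-grams, columns are training documents.
counts = np.array([
    [8, 7, 1, 0, 0, 0],
    [7, 9, 0, 0, 1, 0],
    [0, 1, 9, 8, 0, 1],
    [1, 0, 8, 9, 0, 0],
    [0, 0, 1, 0, 9, 8],
    [0, 0, 0, 1, 7, 9],
], dtype=float)
labels = ["A", "A", "B", "B", "C", "C"]

idf = idf_weights(counts)
H = weight(counts, idf)
U_Q, s_Q, V_Q = fit_lsa(H, Q=3)

# Simplification (assumption): centroid = mean Q-space vector of a language's documents.
centroids = {
    lang: V_Q[[i for i, l in enumerate(labels) if l == lang]].mean(axis=0)
    for lang in sorted(set(labels))
}

def identify(raw_counts):
    """Weight a test count vector with the training idf, fold it in, classify."""
    column = np.asarray(raw_counts, dtype=float).reshape(-1, 1)
    c = weight(column, idf)[:, 0]
    return classify(fold_in(c, U_Q, s_Q), centroids)

guess_a = identify([6, 5, 0, 1, 0, 0])
guess_c = identify([0, 0, 1, 0, 6, 7])
print(guess_a, guess_c)
```

Folding a training column of H in through `fold_in` reproduces the corresponding row of `V_Q`, which is why test documents and training documents can be compared in the same Q-space.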