
Vector-based Representation and Clustering of Audio Using Onomatopoeia Words

Shiva Sundaram and Shrikanth Narayanan


Speech Analysis and Interpretation Laboratory (SAIL),
Dept. of Electrical Engineering-Systems, University of Southern California,
3740 McClintock Ave, EEB400, Los Angeles, CA 90089. USA
[email protected], [email protected]

Copyright © 2006, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

We present results on the organization of audio data based on their descriptions using onomatopoeia words. Onomatopoeia words are imitative of sounds: they directly describe and represent different types of sound sources through their perceived properties. For instance, the word pop aptly describes the sound of opening a champagne bottle. We first establish this type of audio-to-word relationship by manually tagging a variety of audio clips from a sound effects library with onomatopoeia words. Using principal component analysis (PCA) and a newly proposed distance metric for word-level clustering, we cluster the audio data representing the clips. Due to the distance metric and the audio-to-word relationship, the resulting clusters of clips have similar acoustic properties. We found that, as language-level units, the onomatopoeic descriptions are able to represent perceived properties of audio signals. We believe that this form of description can be useful in relating higher-level descriptions of events in a scene by providing an intermediate perceptual understanding of the acoustic event.

Introduction

Automatic techniques are required to interpret and manage the ever-increasing multimedia data that is acquired, stored and delivered in a wide variety of forms. In interactive environments involving humans and/or robots, data is available in the form of video/images, audio and a variety of sensors, depending on the nature of the application. Each of these represents a different form of communication and a variety of expressions. To utilize and manage them effectively, for example to reason with them in a human-robot interaction, it is desirable to organize, index and label these forms according to their content. A language (textual) description or annotation is a concise representation of an event that is useful in this respect. It makes the audio and video data more presentable and accessible for reasoning and for search/retrieval. It also aids in developing machine listening systems that can use aural information for decision-making tasks. The work we present here mainly deals with the ontological representation and characterization of different audio events. While the recorded data is stored in a signal feature space (in terms of frequency components, energy, etc.) for automatic processing, text annotation represents the audio clip in the semantic space. The underlying representations of an audio clip in the signal feature space and in the semantic space are different. This is because the feature vectors represent signal-level properties (frequency components, energy, etc.), while in the semantic space the definition is based on human perception and context information. This semantic definition is often represented using natural language in textual form, since words directly represent 'meaning'. Therefore, natural language representation of audio properties and events is important for semantic understanding of audio, and it is the focus of the present paper.

In what we call content-based processing, natural language representations are typically established by a naive labeling scheme where the audio data is mapped onto a set of pre-specified classes. The resulting mapped clusters are used to train a pattern classifier that is eventually used to identify the correct class for a given test sample. Examples of such systems are in (Guo & Li 2003; L. Liu & Jiang 2002; T. Zhang 2001). While such approaches yield high classification accuracy, they have limited scope in characterizing generic audio scenes, save for situations where the expected audio classes are known in advance. Other techniques for retrieval that better exploit semantic relations in language are implemented in (P. Cano, Herrera, & Wack 2004). Here the authors have used WordNet (Fellbaum 1998) to generate word tags for a given audio clip using acoustic feature similarities, and also to retrieve clips that are similar to the initial tags. While such semantic relations in language are important in building audio ontologies, they are still sufficiently insulated from the signal-level properties that directly affect the perception of sources.

In our work, however, we present an approach that uses semantic information closer to signal-level properties. This is implemented using onomatopoeia words present in the English language. These are words that are imitative of sounds (as defined by the Oxford English Dictionary). We believe that such a description will help tackle the potential ambiguity in generic linguistic characterizations of audio. The presentation of the idea is as follows. We first represent the onomatopoeia words as vectors in a 'meaning space'. This is implemented using the proposed inter-word distance metric. We then tag (offline) various clips of acoustic sources from a general sound effects library with appropriate onomatopoeia words. These words are the descriptions of the acoustic properties of the corresponding audio clip. Using the tags of each clip and the vector representation of each word, we represent and cluster the audio clips in the meaning space. Using an unsupervised clustering algorithm and a model-fit measure, the clips are then clustered according to their representation in this space. The resulting clusters are both semantically relevant and share similar perceived acoustic properties. We also present some examples of the resulting clusters. Next, we briefly discuss the motivation for this research.
Motivation: Describing sounds with words. Humans are able to express and convey a wide variety of acoustic events using language. This is achieved by using words that express the properties of a particular acoustic event. For example, if one attempts to describe the event "knocking on the door", the words "tap-tap-tap" describe the acoustic properties well. Communicating acoustic events in such a manner is possible because of a two-way mapping between the acoustic space and the language or semantic space. The existence of such a mapping is a result of a common understanding of familiar acoustic events. The person communicating the acoustic aspect of the event "knocking on the door" may use the word "tap" to describe it. That individual is aware of a provision in language (the onomatopoeia word "tap") that would best describe it to another. The person who hears the word is also familiar with the acoustic properties associated with the word "tap". Here, it is important to point out the following issues: (1) There is a difference between the language descriptions "knocking on the door" and "tap-tap". The former is an original lexical description of the event, and the latter is closer to a description of the acoustic properties of the knocking event. (2) Since words such as "tap" describe acoustic properties, they can also represent multiple events (for example, knocking on a door, horse hooves on tarmac, etc.). Other relevant examples of such descriptions using onomatopoeia words for familiar sounds are as follows:

• In the case of bird sounds: a hen clucks, a sparrow tweets, a crow or raven caws, and an owl hoots.

• Examples of sounds from everyday life: a door closing is described as a thud and/or thump. A door can creak or squeak while opening or closing. A clock ticks. A doorbell is described with the words ding and/or dong or even toot.

In general, the onomatopoeic description of such sounds is not restricted to single-word expressions. One usually uses multiple words to paint an appropriate acoustic picture. The above examples also provide the rationale for using onomatopoeic descriptions. For example, by their onomatopoeic descriptions, the sound of a doorbell is closer to an owl hooting, whereas their lexical descriptions (which semantically represent the events using the sound sources "door bell" and "owl") are entirely different. It is also possible to draw a higher level of inference from the onomatopoeic description of an audio event. Given the scene of a thicket or a barn, the acoustic features of a sample clip with hoot as its description are more likely to be an owl than a doorbell. However, given the scene of a living room, the same acoustic features are more likely to represent a doorbell. Based on such ideas, it can be seen that descriptions with onomatopoeia words automatically provide a flexible framework for the recognition or classification of general auditory scenes. In the next sections, we start with the implementation of the analysis in this work.

bang bark bash beep biff blah blare blat bleep
blip boo boom bump burr buzz caw chink chuck
clang clank clap clatter click cluck coo crackle crash
creak cuckoo ding dong fizz flump gabble gurgle hiss
honk hoot huff hum hush meow moo murmur pitapat
plunk pluck pop purr ring rip roar rustle screech
scrunch sizzle splash splat squeak tap-tap thud thump thwack
tick ting toot twang tweet whack wham wheeze whiff
whip whir whiz whomp whoop whoosh wow yak yawp
yip yowl zap zing zip zoom

Table 1: Complete list of onomatopoeia words used in this work.

Implementation

Distance metric in lexical meaning space

The onomatopoeia words are represented as vectors using a semantic word-based similarity/distance metric and principal component analysis (PCA). The details of this method are as follows.

A set {L_i} consisting of l_i words is generated by a thesaurus for each word O_i in the list of onomatopoeia words. Then the similarity between the j-th and k-th words can be defined to be

s(j, k) = c_{j,k} / l^{d}_{j,k},   (1)

resulting in a distance measure

d(j, k) = 1 − s(j, k).   (2)

Here c_{j,k} is the number of common words in the sets {L_j} and {L_k}, and l^{d}_{j,k} is the total number of words in the union of {L_j} and {L_k}. By this definition it can be seen that

0 ≤ d(j, k) ≤ 1,   (3)
d(j, k) = d(k, j),   (4)
d(k, k) = 0.   (5)
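As an illustration, the following minimal sketch (in Python) computes the similarity and distance of equations (1) and (2) for thesaurus-derived word sets. The example sets are made-up placeholders, not the actual thesaurus entries used in this work.

```python
def similarity(L_j: set, L_k: set) -> float:
    """s(j, k): number of common words divided by the size of the union, as in eq. (1)."""
    union = L_j | L_k
    return len(L_j & L_k) / len(union) if union else 0.0

def distance(L_j: set, L_k: set) -> float:
    """d(j, k) = 1 - s(j, k): bounded in [0, 1], symmetric, and zero for identical sets."""
    return 1.0 - similarity(L_j, L_k)

# Hypothetical thesaurus sets for three onomatopoeia words (illustrative only).
L = {
    "clang": {"ring", "chime", "clank", "peal", "clash"},
    "clank": {"ring", "chime", "clang", "clink", "clash"},
    "fizz":  {"hiss", "bubble", "sputter", "effervesce"},
}

print(distance(L["clang"], L["clank"]))  # the sets share several words, so d < 1
print(distance(L["clang"], L["fizz"]))   # disjoint sets, so d = 1.0
```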
Except for the triangle inequality, this is a valid distance metric. It is also semantically relevant because the words in the sets {L_j} and {L_k} generated by the thesaurus have some meaning associated with the words O_j and O_k in the language. The similarity between two words depends on the number of common words (a measure of sameness in meaning). Therefore, for a set of W words, using this distance metric we obtain a symmetric W × W distance matrix whose (j, k)-th element is the distance between the j-th and k-th words. Note that the j-th row of the matrix is a vector representation of the j-th word in terms of the other words present in the set. We perform principal component analysis (PCA) (R. O. Duda & Stork 2000) on this set of feature vectors and represent each word as a point in a smaller-dimensional space O^d with d < W. In our implementation, the squared sum of the first eight ordered eigenvalues covered more than 95% of the total squared sum of all the eigenvalues. Therefore d = 8 was selected for the reduced-dimension representation, with W = 83. These points (or vectors) are the representations of the onomatopoeic words in the meaning space.
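The sketch below outlines this construction: build the W × W distance matrix from the thesaurus sets, then reduce each row to d dimensions with PCA. The tiny thesaurus and the choice d = 2 are illustrative stand-ins (the paper uses W = 83 and d = 8), and the distance helper from the previous sketch is re-defined so the snippet runs on its own.

```python
import numpy as np

# Illustrative thesaurus sets (placeholders, not the entries used in the paper).
thesaurus = {
    "clang":  {"ring", "chime", "clank", "peal"},
    "clank":  {"ring", "chime", "clang", "clink"},
    "fizz":   {"hiss", "bubble", "sputter", "effervesce"},
    "sizzle": {"hiss", "sputter", "crackle", "fry"},
}

def distance(L_j, L_k):
    union = L_j | L_k
    return 1.0 - (len(L_j & L_k) / len(union) if union else 0.0)

words = sorted(thesaurus)
W = len(words)

# Symmetric W x W distance matrix; row j is the vector representation of word j.
D = np.array([[distance(thesaurus[a], thesaurus[b]) for b in words] for a in words])

# PCA on the rows: centre, eigendecompose the covariance, keep the d leading axes.
d = 2                                           # 8 in the paper; 2 for this toy data
X = D - D.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
order = np.argsort(eigvals)[::-1]               # largest eigenvalues first
word_vectors = X @ eigvecs[:, order[:d]]        # one d-dimensional point per word
word_to_vec = dict(zip(words, word_vectors))
```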
Figure 1: Arrangement of some onomatopoeia words in a 2-dimensional 'meaning space'. Note that words such as clang and clank are close to each other, but they are farther away from words such as fizz and sizzle.

Figure 2: Tagging and clustering the audio clips with onomatopoeia words.
Table 1 lists all the onomatopoeia words used in this work. By studying the words it can be seen that many have overlapping meanings (e.g., clang and clank), and some words are 'closer' in meaning to each other than to other words (e.g., fizz is close to sizzle and bark is close to roar, but (fizz/sizzle) and (bark/roar) are far from each other). These observations can also be made from Figure 1, which illustrates the arrangement of the words in a d = 2 dimensional space. Observe that the words growl and twang are close to each other; this is mainly because the words are represented in a low-dimensional space (d = 2) in the figure.

Once we have the tags for each audio clip, the clips can also be represented as vectors in the meaning space. Next, we discuss the tagging procedure that results in onomatopoeic descriptions of each audio clip. Later, the vector representation based on these tags is discussed.

Tagging the Audio Clips with onomatopoeia words

A set of 236 audio clips was selected from the BBC Sound Effects Library (http://www.sound-ideas.com 2006). The clips were chosen to represent a wide variety of recordings belonging to categories such as animals, birds, footsteps, transportation, construction work, fireworks, etc. Four subjects (with English as their first language) volunteered to tag this initial set of clips with onomatopoeia words. A graphical user interface (GUI) based software tool was designed to play each clip in stereo over a pair of headphones. All the clips were edited to be about 10-14 seconds in duration. The GUI also showed the complete list of the words, and the volunteers were asked to choose the words that best described the audio by clicking on them. The clips were randomly divided into 4 sets, so that the volunteers spent only 20-25 minutes at a time tagging the clips in each set. The chosen words were recorded as the onomatopoeia tags for the corresponding clip. The tags of all the volunteers were counted for each clip; tags with a count of two or more were retained and the rest were discarded. This results in tags that are common across the volunteers' responses. The tagging method is illustrated in Figure 2. Note that the resulting tags are essentially onomatopoeic descriptions that best represent the perceived audio signal. The tags for this initial set of clips were then transposed to other clips with similar original lexical descriptions. For example, the clip with the lexical name BRITISH SAANEN GOAT 1 BB received the tags {blah, blat, boo, yip, yowl}, and this same set of words was used to tag the file BRITISH SAANEN GOAT 2 BB. Similarly, the audio clip BIG BEN 10TH STRIKE 12 BB received the tags {clang, ding, dong}, and these tags were also used for the file BIG BEN 2ND STRIKE 12 BB. After transposing the tags, a total of 1014 clips was available. Next, we represent each tagged audio clip in the meaning space.
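The following small sketch illustrates the tag-aggregation rule just described (retain only tags chosen by at least two volunteers). The volunteers' selections shown here are made-up placeholders.

```python
from collections import Counter

# Hypothetical selections by the four volunteers for one clip.
volunteer_tags = [
    {"clang", "ding", "dong"},   # volunteer 1
    {"ding", "dong", "ring"},    # volunteer 2
    {"clang", "ding"},           # volunteer 3
    {"ding", "dong", "toot"},    # volunteer 4
]

# Count how many volunteers chose each word, then keep words with a count of two or more.
counts = Counter(tag for tags in volunteer_tags for tag in tags)
retained = {tag for tag, n in counts.items() if n >= 2}
print(retained)   # {'clang', 'ding', 'dong'}
```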
Vector representation of audio clips in meaning space

The vector representation of the tagged audio clips in two dimensions is illustrated in Figure 3. The vector for each audio clip is simply the sum of the vectors that correspond to its onomatopoeic tags. Let the clip HORSE VARIOUS SURFACES BB have the onomatopoeic description tags {pitapat, clatter}. The tags pitapat and clatter are already represented as vectors in the meaning space. Performing a vector sum of the vectors that correspond to these tags (pitapat, clatter), i.e., the sum of the vectors of points 1 and 2 shown in Figure 3, results in point 3. Therefore, the vector of point 3 is taken to be the vector of the clip HORSE VARIOUS SURFACES BB. The implicit properties of this representation technique are as follows:

• If two or more audio clips have the same tags, then the resulting vectors of the clips are the same.

• If two clips have similar-meaning tags (not the same tags), then the resulting points of the clips' vectors are in close proximity to each other. For example, let clips A and B have tags {sizzle, whiz} and {fizz, whoosh} respectively. Since these tags are already close to each other in the meaning space (refer to Figure 1), because of the vector sum the resulting points of the vectors of clips A and B will also be in close proximity to each other. In contrast, if the clips have tags that are entirely different from each other, then the vector sum results in points that are relatively far from each other. Subsequently, using clustering algorithms in this space, audio clips that have similar acoustic and/or semantic properties can be grouped together.
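A minimal sketch of this vector-sum construction is given below. The 2-dimensional word vectors are made-up placeholders; in practice they come from the PCA step described earlier.

```python
import numpy as np

# Placeholder 2-D meaning-space vectors for a few words (illustrative values only).
word_to_vec = {
    "pitapat": np.array([0.40, -0.10]),
    "clatter": np.array([0.35,  0.05]),
    "fizz":    np.array([-0.60, 0.30]),
    "whoosh":  np.array([-0.55, 0.25]),
}

def clip_vector(tags, word_to_vec):
    """Place a clip in the meaning space as the sum of its tag vectors."""
    return np.sum([word_to_vec[t] for t in tags], axis=0)

# e.g. a clip tagged {pitapat, clatter}, as with HORSE VARIOUS SURFACES BB in the text.
horse_vec = clip_vector({"pitapat", "clatter"}, word_to_vec)

# Clips with similar tags (e.g. {fizz, whoosh}) land near each other;
# clips with unrelated tags land far apart.
```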
Figure 3: Vector representation of the audio clip HORSE VARIOUS SURFACES BB with tags {clatter, pitapat}.

Figure 4: BIC as a function of the number of clusters k in model Mk. The maximum value is obtained for k = 112.

Thus the audio clips can be represented as vectors in the proposed meaning space. This allows us to use conventional pattern recognition algorithms. In this work, we group clips with similar onomatopoeic descriptions (and hence similar acoustic properties) using the unsupervised k-means clustering algorithm. The complete summary of tagging and clustering the clips is illustrated in Figure 2. The clustering procedure is discussed in the next section.

Experiments and Results

Unsupervised Clustering of audio clips in meaning space

The Bayesian Information Criterion (BIC) (Schwarz 1978) has been used as a criterion for model selection in unsupervised learning. It is widely used for choosing the appropriate number of clusters in unsupervised clustering (Zhou & Hansen 2000; Chen & Gopalakrishnan). It works by penalizing a selected model in terms of the complexity of the model fit to the observed data. For a model fit M for an observation set X, it is defined as (Schwarz 1978; Zhou & Hansen 2000):

BIC(M) = log(P(X | M)) − (1/2) · r_M · log(R_X),   (6)

where R_X is the number of observations in the set X and r_M is the number of independent parameters in the model M. For a set of competing models {M_1, M_2, ..., M_i} we choose the model that maximizes the BIC. For the case where each cluster in M_k (with k clusters) is modelled as a multivariate Gaussian distribution, we get the following expression for the BIC:

BIC(M_k) = Σ_{j=1}^{k} [ −(1/2) · n_j · log(|Σ_j|) ] − (1/2) · r_M · log(R_X).   (7)

Here, Σ_j is the sample covariance matrix of the j-th cluster, k is the number of clusters in the model, and n_j is the number of samples in each cluster. We use this criterion to choose k for the k-means algorithm for clustering the audio clips in the meaning space. Figure 4 is a plot of the BIC as a function of the number of clusters k estimated using equation (7). It can be seen that the maximum value is obtained for k = 112.
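As a rough sketch of this model-selection step, the code below runs k-means over a range of k on the clip vectors and scores each k with the BIC of equation (7). The data is a random placeholder for the real clip vectors, and the parameter count r_M (cluster means plus covariances) is one plausible choice, since the paper does not spell it out.

```python
import numpy as np
from sklearn.cluster import KMeans

def bic_for_k(X, k):
    """BIC of eq. (7) for a k-means partition of X, each cluster treated as a Gaussian."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    n, dim = X.shape
    r_M = k * dim + k * dim * (dim + 1) / 2     # assumed: mean + covariance parameters per cluster
    total = 0.0
    for j in range(k):
        cluster = X[labels == j]
        n_j = len(cluster)
        if n_j <= dim:                          # skip degenerate clusters (singular covariance)
            continue
        _, logdet = np.linalg.slogdet(np.cov(cluster, rowvar=False))
        total += -0.5 * n_j * logdet
    return total - 0.5 * r_M * np.log(n)

X = np.random.randn(1014, 8)                    # placeholder for the 1014 clip vectors (d = 8)
scores = {k: bic_for_k(X, k) for k in range(2, 151, 10)}
best_k = max(scores, key=scores.get)
# (on the real clip vectors, the paper reports a maximum at k = 112)
```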
Clustering Results

Some of the resulting clusters obtained using the presented method are shown in Table 2. The table lists some of the significant audio clips in each of the clusters; only five out of the k = 112 clusters are shown for illustration. As mentioned previously, audio clips with similar onomatopoeic descriptions are clustered together. As a result, the clips in a cluster share similar perceived acoustic properties. Consider, for example, the clips SML NAILS DROP ON BENCH B2.wav and DOORBELL DING DING DONG MULTI BB.wav in cluster 5 listed in the table. From their respective onomatopoeic descriptions and an understanding of the properties of the sound generated by a doorbell and by a nail dropping on a bench, a relationship can be made between them. The relationship is established by the vector representation of the audio clips in the meaning space according to their onomatopoeic descriptions.
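As a final step of the pipeline, the cluster memberships can be collected into a listing like Table 2. The sketch below assumes `labels` holds the k-means cluster index of each clip and `clip_names` the corresponding file names; both are short placeholders here (the labels mirror a few rows of Table 2).

```python
from collections import defaultdict

labels = [0, 0, 1, 2, 1]
clip_names = ["CAR FERRY ENGINE ROOM BB", "WASHING MACHINE DRAIN BB",
              "GOLF CHIP SHOT 01 BB.wav", "PARTICLE BEAM DEVICE 01 BB.wav",
              "THUNDERFLASH BANG BB.wav"]

# Group clip names by their cluster index and print one line per cluster.
clusters = defaultdict(list)
for name, label in zip(clip_names, labels):
    clusters[label].append(name)

for label in sorted(clusters):
    print(f"Cluster {label + 1}: {clusters[label]}")
```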
Cluster # : Clip Name & Onomatopoeic Descriptions

Cluster 1: CAR FERRY ENGINE ROOM BB {buzz, fizz, hiss}; WASHING MACHINE DRAIN BB {buzz, hiss, woosh}; PROP AIRLINER LAND TAXI BB {buzz, hiss, whir}

Cluster 2: GOLF CHIP SHOT 01 BB.wav {thump, thwack}; 81MM MED MORTAR FIRING 5 BB.wav {bang, thud, thump}; THUNDERFLASH BANG BB.wav {bang, thud, wham}; TRAIN ELEC DOOR SLAM 01 B2.wav {thud, thump, whomp}

Cluster 3: PARTICLE BEAM DEVICE 01 BB.wav {buzz, hum}; BUILDING SITE AERATOR.wav {burr, hum, murmur, whir}; PULSATING HARMONIC BASS BB.wav {burr, hum, murmur}; ...

Cluster 4: HUNT KENNELS FEED BB.wav {bark, blat, yip, yowl}; PIGS FARROWING PENS 1 BB.wav {blare, boo, screech, squeak, yip}; SMALL DOG THREATENING BB.wav {bark, blare}; ...

Cluster 5: DOORBELL DING DING DONG MULTI BB.wav {ding, dong, ring}; SIGNAL EQUIPMENT WARN B2.wav {ding, ring, ting}; SML NAILS DROP ON BENCH B2.wav {chink, clank}; ...

Table 2: Results of unsupervised clustering of audio clips using the proposed vector representation method.

Conclusion

In this paper we represent descriptions of audio clips with onomatopoeia words and cluster them according to their vector representation in the linguistic (lexical) meaning space. Onomatopoeia words are imitative of sounds and provide a means to represent perceived audio characteristics with language-level units. This form of representation essentially bridges the gap between signal-level acoustic properties and higher-level audio class labels.

First, using the proposed distance/similarity metric, we establish a vector representation of the words in a 'meaning space'. We then provide onomatopoeic descriptions (onomatopoeia words that best describe the sound in an audio clip) by manually tagging the clips with relevant words. Then, the audio clips are represented in the meaning space as the sum of the vectors of their corresponding onomatopoeia words. Using the unsupervised k-means clustering algorithm and the Bayesian Information Criterion (BIC), we cluster the clips into meaningful groups. The clustering results presented in this work indicate that the clips within each cluster are well represented by their onomatopoeic descriptions. These descriptions effectively capture the relationship between the audio clips based on their acoustic properties.

Discussion and Future Work

Onomatopoeia words are useful in representing signal properties of acoustic events. They are a useful provision in language to describe and convey acoustic events. They are especially useful for conveying the underlying audio in media that cannot represent audio. For example, comic books frequently use words such as bang to represent the acoustic properties of an explosion in the illustrations. As mentioned previously, this is a result of a common understanding of the words that convey specific audio properties of acoustic events. This is a desirable trait in language-level units, making them suitable for automatic annotation and processing of audio. This form of representation is useful in developing machine listening systems that can exploit both semantic information and similarities in acoustic properties for aural detection and decision-making tasks. As part of our future work, we wish to explore the clustering and vector representation of audio clips directly based on their lexical labels and then relate it to the underlying properties of the acoustic sources using onomatopoeic descriptions and signal-level features. For this, we would like to develop techniques based on pattern recognition algorithms that can automatically identify acoustic properties and build relationships amongst various audio events.

Acknowledgments

We would like to express our gratitude to the volunteers who took time to listen to each audio clip and tag them. We would especially like to thank Abe Kazemzadeh, Matt Black, Joe Tepperman, and Murtaza Bulut for their time and help. We gratefully acknowledge support from NSF, DARPA, the U.S. Army and ONR.

References

Chen, S. S., and Gopalakrishnan, P. S. "Clustering via the Bayesian Information Criterion with Applications in Speech Recognition". In Proc. of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. 2:12-15.

Fellbaum, C. D., ed. 1998. "WordNet: An Electronic Lexical Database". The MIT Press. ISBN 026206197X.

Guo, G., and Li, S. Z. 2003. "Content-Based Audio Classification and Retrieval by Support Vector Machines". IEEE Trans. on Neural Networks 14(1).

http://www.sound-ideas.com. 2006. "The BBC Sound Effects Library - Original Series".

L. Liu, H. Z., and Jiang, H. 2002. "Content Analysis for Audio Classification and Segmentation". IEEE Trans. on Speech and Audio Processing 10(7).

P. Cano, M. Koppenberger, S. L. G. J. R.; Herrera, P.; and Wack, N. 2004. "Nearest-Neighbor Generic Sound Classification with a WordNet-based Taxonomy". In Proc. 116th Audio Engineering Society (AES) Convention, Berlin, Germany.

R. O. Duda, P. E. H., and Stork, D. 2000. "Pattern Classification". Wiley-Interscience, 2nd edition.

Schwarz, G. 1978. "Estimating the Dimension of a Model". The Annals of Statistics 6(2):461-464.

T. Zhang, J. K. 2001. "Audio Content Analysis for Online Audiovisual Data Segmentation and Classification". IEEE Trans. on Speech and Audio Processing 9(4).

Zhou, B., and Hansen, J. H. L. 2000. "Unsupervised Audio Stream Segmentation and Clustering via the Bayesian Information Criterion". In Proc. of the International Conference on Spoken Language Processing (ICSLP), Beijing, China.
