[Figure 3: Vector representation of the audio clip HORSE VARIOUS SURFACES BB with tags {clatter, pitpat}, plotted against dimension 1 and dimension 2 of the meaning space, with surrounding onomatopoeia words shown at their positions in the space.]

[Figure 4: BIC value as a function of the number of clusters k in model M_k. The maximum value is obtained for k = 112.]
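The representation pictured in Figure 3 treats a clip as the sum of the meaning-space vectors of its onomatopoeia tags. A minimal sketch of this construction follows; the word vectors here are random stand-ins (the paper derives them from its distance/similarity metric, which is not reproduced), and the 8-dimensional space and the `cosine` comparison are illustrative assumptions only:

```python
import numpy as np

# Stand-in meaning-space vectors for a few onomatopoeia words.
# In the paper these come from the proposed distance/similarity
# metric; here they are random placeholders for illustration.
rng = np.random.default_rng(42)
words = ["clatter", "pitpat", "ding", "dong", "ring", "buzz", "hum"]
word_vec = {w: rng.normal(size=8) for w in words}

def clip_vector(tags):
    """A clip is the sum of the vectors of its onomatopoeia tags."""
    return np.sum([word_vec[t] for t in tags], axis=0)

def cosine(u, v):
    """Cosine similarity between two clip vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

horse = clip_vector(["clatter", "pitpat"])          # the clip of Figure 3
doorbell = clip_vector(["ding", "dong", "ring"])
aerator = clip_vector(["buzz", "hum"])
```

Once every clip is a point in this space, any conventional distance-based pattern recognition algorithm (here, k-means) can operate on the clips directly.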
sequently, using clustering algorithms in this space, audio clips that have similar acoustic and/or semantic properties can be grouped together.

Thus the audio clips can be represented as vectors in the proposed meaning space. This allows us to use conventional pattern recognition algorithms. In this work, we group clips with similar onomatopoeic descriptions (and hence similar acoustic properties) using the unsupervised k-means clustering algorithm. A complete summary of tagging and clustering the clips is illustrated in Figure 2. The clustering procedure is discussed in the next section.

Experiments and Results

Unsupervised Clustering of audio clips in meaning space

The Bayesian Information Criterion (BIC) (Schwarz 1978) has been used as a criterion for model selection in unsupervised learning. It is widely used for choosing the appropriate number of clusters in unsupervised clustering (Zhou & Hansen 2000; Chen & Gopalakrishnan). It works by penalizing a selected model in terms of the complexity of the model fit to the observed data. For a model fit M for an observation set X, it is defined as (Schwarz 1978; Zhou & Hansen 2000):

    BIC(M) = log(P(X|M)) − (1/2) · r_M · log(R_X),    (6)

where R_X is the number of observations in the set X and r_M is the number of independent parameters in the model M. For a set of competing models {M_1, M_2, ..., M_i} we choose the model that maximizes the BIC. For the case where each cluster in M_k (with k clusters) is modelled as a multivariate Gaussian distribution, we get the following expression for the BIC:

    BIC(M_k) = − Σ_{j=1}^{k} (1/2) · n_j · log(|Σ_j|) − (1/2) · r_M · log(R_X)    (7)

Here, Σ_j is the sample covariance matrix for the j-th cluster, k is the number of clusters in the model, and n_j is the number of samples in each cluster. We use this criterion to choose k for the k-means algorithm for clustering the audio clips in the meaning space. Figure 4 is a plot of the BIC as a function of the number of clusters k, estimated using equation (7). It can be seen that the maximum value is obtained for k = 112.

Clustering Results

Some of the resulting clusters using the presented method are shown in Table 2. The table lists some of the significant audio clips in each of the clusters. Only five out of k = 112 clusters are shown for illustration. As mentioned previously, audio clips with similar onomatopoeic descriptions are clustered together. As a result, the clips in the clusters share similar perceived acoustic properties. For example, consider the clips SML NAILS DROP ON BENCH B2.wav and DOORBELL DING DING DONG MULTI BB.wav in cluster 5 of the table. From their respective onomatopoeic descriptions, and an understanding of the properties of the sound generated by a doorbell and a nail dropping on a bench, a relationship can be made between them. The relationship is established by the vector representation of the audio clips in meaning space according to their onomatopoeic descriptions.

Conclusion

In this paper we represent descriptions of audio clips with onomatopoeia words and cluster them according to their vector representation in the linguistic (lexical) meaning space. Onomatopoeia words are imitative of sounds and provide a means to represent perceived audio characteristics with language-level units. This form of representation essentially bridges the gap between signal-level acoustic properties and higher-level audio class labels.

First, using the proposed distance/similarity metric, we establish a vector representation of the words in a 'meaning space'. We then provide onomatopoeic descriptions (onomatopoeia words that best describe the sound in an audio clip) by manually tagging the clips with relevant words. Then, the audio clips are represented in the meaning space as the sum of the vectors of their corresponding onomatopoeia words. Using the unsupervised k-means clustering algorithm and the Bayesian Information Criterion (BIC), we cluster
Cluster #   Clip Name & Onomatopoeic Descriptions
------------------------------------------------------------------------
Cluster 1   CAR FERRY ENGINE ROOM BB {buzz, fizz, hiss}
            WASHING MACHINE DRAIN BB {buzz, hiss, woosh}
            PROP AIRLINER LAND TAXI BB {buzz, hiss, whir}
Cluster 2   GOLF CHIP SHOT 01 BB.wav {thump, thwack}
            81MM MED MORTAR FIRING 5 BB.wav {bang, thud, thump}
            THUNDERFLASH BANG BB.wav {bang, thud, wham}
            TRAIN ELEC DOOR SLAM 01 B2.wav {thud, thump, whomp}
Cluster 3   PARTICLE BEAM DEVICE 01 BB.wav {buzz, hum}
            BUILDING SITE AERATOR.wav {burr, hum, murmur, whir}
            PULSATING HARMONIC BASS BB.wav {burr, hum, murmur}
            ...
Cluster 4   HUNT KENNELS FEED BB.wav {bark, blat, yip, yowl}
            PIGS FARROWING PENS 1 BB.wav {blare, boo, screech, squeak, yip}
            SMALL DOG THREATENING BB.wav {bark, blare}
            ...
Cluster 5   DOORBELL DING DING DONG MULTI BB.wav {ding, dong, ring}
            SIGNAL EQUIPMENT WARN B2.wav {ding, ring, ting}
            SML NAILS DROP ON BENCH B2.wav {chink, clank}
            ...

Table 2: Results of unsupervised clustering of audio clips using the proposed vector representation method.
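The BIC-driven choice of k described in the Experiments section can be sketched as below. This is a toy illustration, not the paper's implementation: the k-means routine is a plain re-implementation, the 2-D data stand in for clip vectors in the meaning space, and the free-parameter count r_M (one mean plus one full covariance per cluster) is a plausible assumption the text does not spell out.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd-style k-means; a stand-in for whatever
    implementation the paper used (details are not given)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if np.any(labels == j):          # skip empty clusters
                centers[j] = X[labels == j].mean(0)
    return labels

def gaussian_bic(X, labels, k):
    """Equation (7): each cluster modelled as a multivariate Gaussian.
    r_M = k * (mean + full covariance) is an assumed parameter count."""
    n, d = X.shape
    r_m = k * (d + d * (d + 1) / 2)
    data_term = 0.0
    for j in range(k):
        Xj = X[labels == j]
        if len(Xj) <= d:                     # covariance would be singular
            return -np.inf
        _, logdet = np.linalg.slogdet(np.cov(Xj, rowvar=False))
        data_term -= 0.5 * len(Xj) * logdet  # -(1/2) * n_j * log|Sigma_j|
    return data_term - 0.5 * r_m * np.log(n)

def choose_k(X, k_values):
    """Fit k-means for each candidate k; keep the k maximizing BIC."""
    best_k, best_bic = None, -np.inf
    for k in k_values:
        bic = gaussian_bic(X, kmeans(X, k), k)
        if best_k is None or bic > best_bic:
            best_k, best_bic = k, bic
    return best_k, best_bic

# Toy 2-D data standing in for clip vectors in the meaning space.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(100, 2))
               for c in [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]])
best_k, best_bic = choose_k(X, range(2, 7))
```

Note that equation (7) keeps only the covariance term of the Gaussian log-likelihood, so on real data the BIC curve should be inspected (as in Figure 4) rather than trusted blindly at a single k.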