Using Exact Locality Sensitive Mapping To Group and Detect Audio-Based Cover Songs
Audio documents can be described as time-series feature sequences. Directly computing the distance between audio feature sequences (matching audio documents) is an important task in implementing query-by-content audio information retrieval. Dynamic Programming (DP) [5][19] can be used to match two audio feature sequences and is an essentially exhaustive searching approach (which offers high accuracy), but it lacks scalability and results in a lower retrieval speed as the database gets larger. To quicken the audio feature sequence comparison and obtain scalable content-based retrieval, semantic summary features are required, and the individual features must be weighted differently. Existing retrieval schemes have selected different audio features. We choose several competitive audio features and introduce a scheme based on multivariable regression to determine the weight for each feature. The goal of our approach is to apply linear and non-parametric regression models to investigate the correlation. In the model we use K (K=7) groups of features (218 dimensions in total): mean and std of MFCC (13+13) [7], mean and std of Mel-magnitudes (40+40) [8], mean and std of Chroma (12+12) [21], and the pitch histogram (88) [5].

Let the groups of features of the $m_i$th song be $v_{m_i,1}, v_{m_i,2}, \ldots, v_{m_i,K}$ ($i = 1, 2$). With a different weight $\alpha_k$ assigned to each feature group, the total summary vector is

$V_{m_i} = [\alpha_1 v_{m_i,1}, \alpha_2 v_{m_i,2}, \ldots, \alpha_K v_{m_i,K}]^T$    (3)

The Euclidean distance between two features $V_{m1}$ and $V_{m2}$ is

$d(V_{m1}, V_{m2}) = d\big(\textstyle\sum_{k=1}^{K} \alpha_k v_{m1,k}, \sum_{k=1}^{K} \alpha_k v_{m2,k}\big) = \sum_{k=1}^{K} \alpha_k^2\, d(v_{m1,k}, v_{m2,k})$    (4)

(here $d(\cdot,\cdot)$ is taken as the squared Euclidean distance; since the feature groups occupy disjoint coordinates of the summary vector, the per-group distances add up, giving the right-hand side of Eq.(4)).

To determine the weights in Eq.(3), we apply a multivariable regression process. Consider a training database composed of M pairs of songs $\langle R_{m1}, R_{m2} \rangle$, $m = 1, 2, \ldots, M$, similar to [21], and then calculate the M sequence distances $d_{DP}(R_{m1}, R_{m2})$ via DP. We want the distance between the summary vectors to be as close to the sequence distance $d_{DP}(R_{m1}, R_{m2})$ as possible, i.e., we hope the melody information is contained in the summary. After we determine the distances between the pairs of training data, we get an M×K matrix $D_V$ and an M-dimensional column vector $D_{DP}$. The mth row of $D_V$ holds the K distance values calculated from the independent features, $d(v_{m1,k}, v_{m2,k})$, $k = 1, 2, \ldots, K$, and the mth element of $D_{DP}$ is the normalized distance between the two feature sequences, $d_{DP}(R_{m1}, R_{m2}) \cdot K / (|R_{m1}| \cdot |R_{m2}|)$. Let $A = [\alpha_1^2, \alpha_2^2, \ldots, \alpha_K^2]^T$. According to Eq.(4), $D_V$, $A$ and $D_{DP}$ satisfy the equation $D_V \cdot A = D_{DP}$. Then $A = (D_V^T D_V)^{-1} D_V^T D_{DP}$ and we obtain the weights $\alpha_k$. We are only interested in the absolute value of $\alpha_k$.
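In practice this weight fit is an ordinary least-squares problem. The following NumPy sketch shows one way to estimate the weights and assemble the summary vector of Eq.(3); the function names and the use of a least-squares solver in place of the explicit normal equations are illustrative, not the implementation used here:

```python
import numpy as np

def fit_group_weights(D_V, D_DP):
    """Solve D_V @ A = D_DP in the least-squares sense.

    D_V  : (M, K) matrix of per-group summary distances d(v_m1k, v_m2k)
    D_DP : (M,)   vector of normalized DP sequence distances
    Returns alpha (K,); A holds alpha_k**2 as in Eq.(4), and since only
    |alpha_k| matters, negative least-squares entries are folded by abs().
    """
    A, *_ = np.linalg.lstsq(D_V, D_DP, rcond=None)
    return np.sqrt(np.abs(A))

def build_fu(feature_groups, alpha):
    """Eq.(3): concatenate the K feature groups, each scaled by alpha_k."""
    return np.concatenate([a * g for a, g in zip(alpha, feature_groups)])
```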
The idea behind hash-based indexing is that two similar FU features $V_i$ and $V_j$ with a short distance $d(V_i, V_j)$ have the same integer hash value ($H(V_i) = H(V_j)$) with a high probability. By assigning integer hash values to buckets, the songs located in the same bucket as the query can be found quickly. However, even if two similar FUs $V_i$ and $V_j$ have a short distance $d(V_i, V_j)$, it is not always guaranteed that they have the same hash values, due to mapping and quantization errors. When a vector of N hash values instead of a single hash value is used to locate a bucket, the precision can be improved, but the effect of quantization error also gets more obvious. To find a similar song in the database with a specific query, multiple parallel and independent hash instances are then necessary, which in turn takes more time and requires more space. Our solution to the above problem is to exploit the continuous non-quantized hash values with two schemes, Exact Locality Sensitive Mapping (ELSM) and SoftLSH.
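For contrast with what follows, the conventional scheme just described can be sketched as below: a bucket is addressed by the tuple of N rounded hash values, so a quantization flip in any single coordinate already redirects a similar FU to a different bucket (class and variable names are illustrative):

```python
from collections import defaultdict

class IntegerLSHInstance:
    """One conventional LSH instance: the tuple of N rounded hash
    values addresses a bucket holding song ids."""

    def __init__(self):
        self.buckets = defaultdict(list)

    @staticmethod
    def key(soft_hash):
        # quantize each of the N soft hash values to the nearest integer
        return tuple(int(round(h)) for h in soft_hash)

    def insert(self, soft_hash, song_id):
        self.buckets[self.key(soft_hash)].append(song_id)

    def lookup(self, soft_hash):
        # only the single bucket with an exactly matching key is probed
        return self.buckets.get(self.key(soft_hash), [])
```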
3.2.1 SoftLSH

We assume the search-by-hash system has L parallel hash instances and each hash instance has a group of N locality sensitive mapping functions. In the mth hash instance the function group is $H_m = \{h_{m1}, h_{m2}, \ldots, h_{mN}\}$. After locality sensitive mapping, the hash vector in the mth hash instance corresponding to V is $H_m(V) = \{h_{m1}(V), h_{m2}(V), \ldots, h_{mN}(V)\}$.
Consider the kth dimension of the hash vectors $H_m(V_i)$ and $H_m(V_j)$ corresponding to $V_i$ and $V_j$ respectively. The hash space is divided into non-overlapping squares, and $H_m(V)$ is quantized to the integer hash vector $\bar{H}_m(V)$ at the center of its square. It can happen that neither of the integer hash values of the two FUs is the same: $H_m(V_i)$ is quantized to $\bar{H}_m(V_i) = (2,3)$ while $H_m(V_j)$ is quantized to $\bar{H}_m(V_j) = (1,2)$. By careful observation we can learn that the quantization error usually happens when both $H_m(V_i)$ and $H_m(V_j)$ are near the edge of the squares. Even a little error near the edge can result in an error of up to N between the two integer hash sets $\bar{H}_m(V_i)$ and $\bar{H}_m(V_j)$.
The non-quantized hash values are kept in a separate buffer and utilized as the ELSM feature. The residual part $H_m(V_i) - \bar{H}_m(V_i)$ reflects the uncertainty. This part is usually neglected in all LSH indexing schemes; fully exploiting it facilitates the accurate locating of the buckets that possibly contain the similar features.

Figure 2 Feature organization in the database.

Each song in the database is processed as in Figure 2. Its FU feature V is obtained by the regression model. The mth hash instance has its own set of N hash functions $h_{mk}(V) = (a_{mk} \cdot V + b_{mk}) / w_{mk}$ ($1 \le k \le N$), determined by the random variables $a_{mk}$ and $b_{mk}$ and by the quantization interval $w_{mk}$. Through $w_{mk}$, the standard deviations of the soft hash values in different hash instances are made almost equal, and the distribution of hash vectors roughly spans a square in the Euclidean space. The hash set for the summarized semantic FU feature V in the mth hash instance is $H_m(V) = [h_{m1}(V), h_{m2}(V), \ldots, h_{mN}(V)]$.
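A sketch of generating and applying one hash instance under this definition follows; drawing $a_{mk}$ from a Gaussian and $b_{mk}$ uniformly from $[0, w_{mk})$ is common LSH practice [16] and, like all names here, an assumption rather than the paper's code:

```python
import numpy as np

def make_hash_instance(dim, N, w, rng):
    """One hash instance: N random projections over a dim-dimensional FU.
    a: (N, dim) Gaussian directions, b: (N,) offsets in [0, w)."""
    a = rng.standard_normal((N, dim))
    b = rng.uniform(0.0, w, size=N)
    return a, b

def soft_hash(V, a, b, w):
    """Continuous hash vector H_m(V) = (a_mk . V + b_mk) / w_mk.
    Rounding it gives the integer bucket address; the residual
    soft_hash - round(soft_hash) is the part SoftLSH keeps."""
    return (a @ V + b) / w

# e.g.  rng = np.random.default_rng(0)
#       a, b = make_hash_instance(dim=218, N=2, w=4.0, rng=rng)
#       h = soft_hash(fu_vector, a, b, w=4.0)
```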
At retrieval time the FU of the query, $V_q$, is calculated; in this way its ELSM feature, $H(V_q) = [H_1(V_q), H_2(V_q), \ldots, H_L(V_q)]$, is also obtained. In the mth hash instance the features similar to $V_q$ will be located in the buckets that intersect the neighborhood $C(H_m(V_q), r)$. Due to the quantization effect the buckets are squares: buckets in a hash instance are centered at a vector of integer hash values, and their vertexes are the center plus or minus 0.5. Any vertex of a bucket lying in the neighborhood results in its intersection with the neighborhood. $\bar{H}_m(V_q)$, the integer part of $H_m(V_q)$, locates the bucket holding the query itself; the features in all the intersecting buckets are collected as the candidates to find the features that are actually similar to the query, and a truly similar FU $V_i$ will of course be among the nearest candidates.

The entropy-based LSH scheme [17] considered the distance d(p,q) between the query q and its nearest neighbor p in the query stage of the LSH scheme. By selecting a random point p' at a distance d(p,q) from q and checking the bucket that p' is hashed to, the entropy-based LSH scheme ensures that all the buckets which contain p with a high probability are probed. An improvement of this scheme by multi-probe was proposed in [9], where minor adjustments of the integer hash values are conducted to find the buckets that may contain the point p. According to Eq.(5), when the feature summaries of two tracks are similar to each other, their non-quantized hash values will also be similar to each other. Instead of probing, our SoftLSH scheme utilizes the ELSM feature to accurately locate all the buckets that intersect the neighborhood determined by the query.
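Enumerating the intersecting buckets can then be sketched with the vertex criterion above; scanning a small window of bucket centers around the query's own bucket is a simplification of ours, workable because N is small:

```python
import itertools
import numpy as np

def intersecting_buckets(soft_q, r):
    """Return the integer centers of all buckets [c-0.5, c+0.5]^N
    that have a vertex inside the neighborhood C(soft_q, r)."""
    soft_q = np.asarray(soft_q, dtype=float)
    base = np.round(soft_q).astype(int)        # the query's own bucket
    span = int(np.ceil(r)) + 1                 # window wide enough to cover r
    hits = []
    for offset in itertools.product(range(-span, span + 1), repeat=soft_q.size):
        center = base + np.array(offset)
        vertices = itertools.product(*[(c - 0.5, c + 0.5) for c in center])
        if any(np.linalg.norm(np.array(v) - soft_q) <= r for v in vertices):
            hits.append(tuple(center))
    base_key = tuple(base)
    if base_key not in hits:
        hits.append(base_key)                  # the query's own bucket always qualifies
    return hits
```

The union of the song ids stored in these buckets, taken over the L hash instances, forms the candidate set that is then ranked exhaustively by the FU distance.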
Each track is divided into frames that have 50% overlap. Each frame is weighted by a Hamming window and further appended with 1024 zeros to fit the FFT length (2048 points). From the FFT result the instantaneous frequencies are extracted and the Chroma is calculated; from the amplitude spectrum the pitch, MFCC and Mel-magnitudes are calculated. Then the summary is calculated from all frames.
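This frame-level pipeline could be prototyped as below with librosa; the STFT-based chroma stands in for the instantaneous-frequency chroma described above, and everything beyond the 1024-sample Hamming frames, 50% overlap and 2048-point FFT (sample rate, 30s duration, omission of the 88-bin pitch histogram) is an assumption of this sketch:

```python
import librosa

def summarize_track(path):
    """Mean/std per feature group over all frames, a sketch of the
    summary described above (pitch histogram omitted)."""
    y, sr = librosa.load(path, sr=22050, duration=30.0)
    stft_kw = dict(n_fft=2048, win_length=1024, hop_length=512,
                   window="hamming")   # 50% overlap, zero-padded to 2048
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, **stft_kw)
    mel = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40, **stft_kw))
    chroma = librosa.feature.chroma_stft(y=y, sr=sr, **stft_kw)
    feats = (mfcc, mel, chroma)
    return [f.mean(axis=1) for f in feats] + [f.std(axis=1) for f in feats]
```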
The ground truth is set up according to human perception. We have listened to all the songs and manually labeled them, so that the retrieval results of our algorithms correspond to human perception and support practical application. The Trains80 and Covers79 datasets were divided into groups according to their verse (the main theme represented by the song lyrics) to judge whether tracks belong to the same group or not (one group represents one song, and the different versions of that song are the members of the group). The 30s segments in these two datasets are extracted from the verse sections of the songs.
5. Evaluation

In this section we present the performance evaluation. The relevant set of each query is determined according to the number of audio cover tracks in each group. The average size of a query's relevant set is 12.5 (on average each song in Covers79 has 13.5 covers; when one cover is used as the query, the rest of the covers are in the database). The total number of relevant items can thus be calculated from each group. To evaluate the performance of the algorithms, in our experiment recall and precision are respectively defined as $|S_q \cap K_q| / |S_q|$ and $|S_q \cap K_q| / |K_q|$, and the F-measure is defined as $2 \cdot recall \cdot precision / (precision + recall)$.
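With $S_q$ the relevant set and $K_q$ the retrieved set, these scores are straightforward to compute; a minimal helper (names are ours):

```python
def retrieval_scores(relevant, retrieved):
    """recall = |S∩K|/|S|, precision = |S∩K|/|K|, F = 2PR/(P+R);
    both arguments are sets of track ids."""
    hit = len(relevant & retrieved)
    recall = hit / len(relevant)
    precision = hit / len(retrieved)
    f = 2 * precision * recall / (precision + recall) if hit else 0.0
    return recall, precision, f
```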
Figure 3 Recall under different number of hash instances.

With enough hash instances, ELSM and SoftLSH achieve almost the same performance as an exhaustive search, and the gap between KNN and ELSM/SoftLSH decreases as more hash instances are used. The recall, however, does not increase linearly: its slope approaches 0, and a further increase of hash instances results in diminishing returns. When the number of hash instances is greater than 10, the gap between ELSM/SoftLSH and KNN is almost constant, which means that the information loss due to utilizing a lower-dimension feature cannot be salvaged by an increase of hash instances. When there are 10 hash instances, 0.682×14452 relevant tracks are identified with KNN, 0.633×14452 with ELSM, 0.445×14452 with LSH and 0.625×14452 with SoftLSH.

Figure 4 Average retrieval time (s) under different number of hash instances.

Figure 4 shows the average retrieval time for each query. The exhaustive KNN always takes the longest time (0.542s). Time consumption in the other three schemes gradually increases as the number of hash instances does. The average retrieval time of SoftLSH is about double that of LSH, due to the search in the multiple buckets that intersect the query's neighborhood. From Figure 3 and Figure 4, the tradeoff between accuracy and time indicates that 10 hash instances is a suitable choice: with this setting SoftLSH has a recall close to KNN with a much shorter retrieval time, while the additional time saved by LSH would come with a significant drop of accuracy. Therefore the number of hash instances is set to 10 in the following experiments.
Figure 5 Precision-recall curve (10 hash instances).

Figure 5 shows the precision-recall curve achieved by adjusting the number of system outputs. As expected, at the same recall KNN always has the highest precision and LSH the lowest. Some of the perceptually similar tracks have quite different features, and they can only be retrieved when KNN returns many tracks; therefore the precision of KNN decreases quickly when the recall is around 0.7. ELSM and SoftLSH have a performance approaching that of KNN, but at the same precision they lose about 4% in recall compared with KNN due to utilizing a lower-dimensional feature. The number of hash instances is fixed at 10 in this experiment. Some of the tracks cannot be retrieved by the LSH scheme at all: its recall is upper-bounded at 0.5, and a higher recall would require many more hash instances in LSH compared with SoftLSH.

Figure 6 F-measure at different number of retrieved tracks.

Figure 6 demonstrates the F-measure scores of the four schemes with respect to different numbers of retrieved tracks. It can easily be seen that LSH always performs worst. KNN performs slightly better than ELSM and SoftLSH, at the cost of a much longer time to finish the search, as shown in Figure 4. We would stress that when the number of retrieved tracks is less than the number of the query's covers in the database, an increase of retrieved tracks results in an almost linear increase of recall and a small decrease of precision; therefore the F-measure increases quickly. When the number of retrieved tracks gets larger than the actual number of covers, the slopes of the recall curves in all schemes become steady, while increasing the retrieved tracks always results in a decrease of precision. In this experiment each query has an average of 12.5 covers in the database. Coincidentally, in Figure 6 the curves of KNN, ELSM and SoftLSH reach their maximal F-measure scores when the number of returned songs equals 12. This reflects that the FU feature is very effective in representing the similarity of tracks in each group: the tracks belonging to the same group, which really have a short distance, quickly appear in the returned list, while not-so-similar tracks have a relatively large distance, and too many retrieved tracks only result in a very low precision and F-measure. It also confirms that SoftLSH is a good alternative to KNN.
6. Conclusion

Both the representation and the organization of audio files play important roles in audio content detection. In this paper we have considered both the semantic summarization of audio documents and hash-based approximate retrieval, for the purpose of reducing retrieval time and improving retrieval quality. By a new principle of similarity-invariance, a concise audio feature representation (FU) is generated based on multivariable regression. Associated with the FU, variants of LSH (ELSM and SoftLSH) are proposed. Different from conventional LSH schemes, soft hash values are exploited to accurately locate the searching region
and improve the retrieval quality without requiring many hash instances. The proposed retrieval schemes can be made applicable to other domains (e.g., video and bioinformatics) with a little effort. We experimentally show the efficacy of our algorithms via evaluation on multi-version music cover datasets, adopting human perception as the quality measure. As expected, our results demonstrate that (i) the FU feature is a good summary of an audio sequence and (ii) SoftLSH achieves a better balance between retrieval time and accuracy than conventional LSH and KNN. There remains room for improvement: in the future we will study semantic features that better represent melody information, and other training models that best combine the feature groups.

Acknowledgment

We thank the Initiative Project of Nara Women's University for supporting the first author's visit to IMIRSEL, where this work was partly discussed in the summer of 2007. The second author was supported by the Andrew W. Mellon Foundation and the National Science Foundation (NSF) under Nos. IIS-0340597 and IIS-0327371.
References

[1] B. Cui, J. Shen, G. Cong, H. Shen and C. Yu, "Exploring Composite Acoustic Features for Efficient Music Similarity Query", ACM MM'06, pp.634-642, 2006.

[2] M. Robine, P. Hanna, P. Ferraro and J. Allali, "Adaptation of String Matching Algorithms for Identification of Near-Duplicate Music Documents", SIGIR Workshop on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection, 2007.

[3] J. F. Serrano and J. M. Inesta, "Music Motive Extraction through Hanson Intervallic Analysis", CIC'06, pp.154-160, 2006.

[4] I. Karydis, A. Nanopoulos, A. N. Papadopoulos and Y. Manolopoulos, "Audio Indexing for Efficient Music Information Retrieval", MMM'05, pp.22-29, 2005.

[5] W. H. Tsai, H. M. Yu and H. M. Wang, "A Query-by-Example Technique for Retrieving Cover Versions of Popular Songs with Similar Melodies", ISMIR 2005, pp.183-190, 2005.

[6] C. Yang, "Efficient Acoustic Index for Music Retrieval with Various Degrees of Similarity", ACM Multimedia, pp.584-591, 2002.

[7] T. Pohle, M. Schedl, P. Knees and G. Widmer, "Automatically Adapting the Structure of Audio Similarity Spaces", Proc. 1st Workshop on Learning the Semantics of Audio Signals (LSAS), pp.66-75, 2006.

[8] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, 1993.

[9] Q. Lv, W. Josephson, Z. Wang, M. Charikar and K. Li, "Multi-Probe LSH: Efficient Indexing for High-Dimensional Similarity Search", Proc. VLDB, pp.950-961, 2007.

[10] N. Bertin and A. Cheveigne, "Scalable Metadata and Quick Retrieval of Audio Signals", ISMIR 2005, pp.238-244, 2005.

[11] P. Indyk and R. Motwani, "Approximate Nearest Neighbor: Towards Removing the Curse of Dimensionality", Proc. 30th Annual ACM Symposium on Theory of Computing, pp.604-613, 1998.

[12] M. Henzinger, "Finding Near-Duplicate Web Pages: a Large-Scale Evaluation of Algorithms", Proc. 29th ACM SIGIR Conference on Research and Development in Information Retrieval, 2006.

[13] J. Reiss, J. J. Aucouturier and M. Sandler, "Efficient Multidimensional Searching Routines for Music Information Retrieval", 2nd ISMIR, 2001.

[14] S. Hu, "Efficient Video Retrieval by Locality Sensitive Hashing", ICASSP 2005, pp.449-452, 2005.

[15] P. Indyk and N. Thaper, "Fast Color Image Retrieval via Embeddings", Workshop on Statistical and Computational Theories of Vision (ICCV), 2003.

[16] LSH Algorithm and Implementation (E2LSH), http://web.mit.edu/andoni/www/LSH/index.html.

[17] R. Panigrahy, "Entropy Based Nearest Neighbor Search in High Dimensions", Proc. ACM-SIAM Symposium on Discrete Algorithms (SODA), 2006.

[18] M. Lesaffre and M. Leman, "Using Fuzzy to Handle Semantic Descriptions of Music in a Content-based Retrieval System", Proc. LSAS06, pp.43-45, 2006.

[19] J. P. Bello, "Audio-based Cover Song Retrieval Using Approximate Chord Sequences: Testing Shifts, Gaps, Swaps and Beats", ISMIR 2007, pp.239-244, 2007.

[20] G. Tzanetakis and P. Cook, "Musical Genre Classification of Audio Signals", IEEE Transactions on Speech and Audio Processing, Vol.10, No.5, pp.293-302, 2002.

[21] D. Ellis and G. Poliner, "Identifying Cover Songs with Chroma Features and Dynamic Programming Beat Tracking", Proc. ICASSP-07, Vol.4, pp.1429-1432, 2007.

[22] R. Miotto and N. Orio, "A Methodology for the Segmentation and Identification of Music Works", ISMIR 2007, pp.239-244, 2007.