How Does Chromaprint Work
oxygene.sk/2011/01/how-does-chromaprint-work
I've been meaning to write this post for a long time, but never really finished it. I hope it will help people understand how the Chromaprint algorithm works, where the individual ideas come from and what the fingerprints really represent. It's not meant to be a detailed description, just the basics to get the general idea.
Since Chromaprint is primarily based on the Computer Vision for Music Identification paper, images play an important role in the algorithm. When people "see" audio, they usually see it as waveforms:
This is what most applications display, but it's not really useful for analysis. A more useful representation is the spectrogram, which shows how the intensity of specific frequencies changes over time:
You can get this kind of image by splitting the original audio into many overlapping frames and applying the Fourier transform to them ("short-time Fourier transform"). In the case of Chromaprint, the input audio is converted to a sampling rate of 11025 Hz and the frame size is 4096 samples (0.371 s), with a 2/3 overlap between consecutive frames.
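To make the framing concrete, here is a minimal sketch of that short-time Fourier transform step in Python with NumPy. The Hann window and the exact rounding of the hop size are my assumptions; Chromaprint itself is implemented in C++ and the post doesn't spell out these details.

```python
import numpy as np

SAMPLE_RATE = 11025         # Chromaprint resamples the input to 11025 Hz
FRAME_SIZE = 4096           # ~0.371 s per frame
HOP_SIZE = FRAME_SIZE // 3  # 2/3 overlap between consecutive frames

def spectrogram(samples: np.ndarray) -> np.ndarray:
    """Short-time Fourier transform of a mono signal sampled at 11025 Hz.

    Returns an array of shape (num_frames, FRAME_SIZE // 2 + 1) holding
    one magnitude spectrum per overlapping frame.
    """
    window = np.hanning(FRAME_SIZE)  # window choice is an assumption
    frames = []
    for start in range(0, len(samples) - FRAME_SIZE + 1, HOP_SIZE):
        frame = samples[start:start + FRAME_SIZE] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames)
```

With an 11025 Hz sampling rate and a hop of one third of the frame, consecutive spectra are roughly 0.124 s apart.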
Many fingerprinting algorithms work with this kind of audio representation. Some compare differences across time and frequency, some look for peaks in the image, etc.
Now we have a representation of the audio that is pretty robust to changes caused by lossy codecs and similar things, and it also isn't very hard to compare such images to check how "similar" they are. But if we want to search for them in a database, we need a more compact form. The idea of how to do that again comes from the Computer Vision for Music Identification paper, with some modifications based on the Pairwise Boosted Audio Fingerprint paper. You can imagine having a 16×12 pixel window and moving it over the image from left to right, one pixel at a time. This generates a lot of small subimages. To each of them we apply a pre-defined set of 16 filters that capture intensity differences across musical notes and time. What a filter does is calculate the sums of specific areas of the grayscale subimage and then compare the two sums. There are six possible ways to arrange the areas:
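To give a concrete (and heavily simplified) picture of what such a filter looks like, here is a small Python sketch that applies one filter to a single 16×12 subimage. The two area arrangements shown and the rectangle coordinates are illustrative assumptions; the actual filter configurations used by Chromaprint are not described in this post.

```python
import numpy as np

def filter_response(subimage: np.ndarray, x: int, y: int, w: int, h: int,
                    kind: int) -> float:
    """Compare the sums of two areas inside a 16x12 grayscale subimage.

    Assumes rows are the 12 note bins and columns are the 16 time frames.
    `kind` picks the arrangement of the areas; only two of the six
    arrangements are sketched here, the rest follow the same pattern.
    """
    area = subimage[y:y + h, x:x + w]
    if kind == 0:                          # left half vs right half (time)
        half = w // 2
        return float(area[:, :half].sum() - area[:, half:].sum())
    if kind == 1:                          # top half vs bottom half (notes)
        half = h // 2
        return float(area[:half, :].sum() - area[half:, :].sum())
    raise NotImplementedError("the remaining four arrangements are omitted")
```

In practice the area sums are usually computed from an integral image, so each sum takes constant time regardless of the rectangle size.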
There are 16 filters and each produces a value that is encoded into 2 bits (using the Gray code), so if you combine all the results, you get a 32-bit integer. If you do this for every subimage generated by the sliding window, you get the full audio fingerprint. The simplest way to compare such fingerprints is to calculate bit error rates. I took the last UNKLE album, generated fingerprints for the original FLAC files as well as low-quality 32 kbps MP3 versions, and compared the differences between them:
[Image: Heaven FLAC]
[Image: Differences between Under The Ice FLAC and Under The Ice 32kbps MP3]
You can see that there are very few differences when comparing the FLAC and MP3 versions of the same track, while comparing two different tracks generates a lot of noise. It's not perfect and some parts could be done differently to improve the rates, but overall I'm pretty happy with the results. I especially like how few conflicts the algorithm generates, even on the individual subfingerprints.
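To tie the last two steps together, here is a rough Python sketch of how 16 two-bit, Gray-coded filter results might be packed into a 32-bit subfingerprint and how the bit error rate between two fingerprints can be computed. The quantization thresholds are placeholders; the real ones are part of Chromaprint's filter configuration, which the post doesn't list.

```python
# 2-bit Gray code: consecutive quantization levels differ in a single bit,
# so a small change in a filter output flips at most one bit of the result.
GRAY_CODE_2BIT = [0b00, 0b01, 0b11, 0b10]

def quantize(value: float, thresholds=(-1.0, 0.0, 1.0)) -> int:
    """Map a filter output to one of four levels and Gray-code it.

    The thresholds here are placeholders for illustration only.
    """
    level = sum(value > t for t in thresholds)  # 0..3
    return GRAY_CODE_2BIT[level]

def subfingerprint(filter_outputs) -> int:
    """Pack 16 two-bit Gray-coded filter results into one 32-bit integer."""
    assert len(filter_outputs) == 16
    result = 0
    for i, value in enumerate(filter_outputs):
        result |= quantize(value) << (2 * i)
    return result

def bit_error_rate(fp1, fp2) -> float:
    """Fraction of differing bits between two equally long fingerprints."""
    assert len(fp1) == len(fp2)
    total_bits = 32 * len(fp1)
    differing = sum(bin(a ^ b).count("1") for a, b in zip(fp1, fp2))
    return differing / total_bits
```

Fingerprints of the same track give a low bit error rate, while unrelated tracks come out close to 0.5, which is what shows up as noise in the comparison images above.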