Fingerprint

Audio fingerprinting allows for identification of audio content using small excerpts. It works by computing fingerprints of audio segments and comparing them to a database of previously analyzed content. Fingerprints are computed by analyzing features of the audio like frequency coefficients or MFCCs. Systems aim to generate fingerprints that are robust to distortions while maintaining reliability and efficient search capabilities. Two main components are fingerprint computation and database comparison to identify matching content. Various techniques analyze different frame lengths and features to generate compact fingerprints that can identify audio despite noise or compression.

Audio Fingerprinting

Wes Hatch
MUMT-614
Mar.13, 2003

What is Audio Fingerprinting?

A small, unknown segment of audio data (it can be as short as just a couple of seconds) is used to identify the original audio file from which it came

Applications

Broadcast monitoring
  playlist generation
  royalty collection
  ad verification

Connected Audio
  general term for consumer applications

Other
  Napster: use of fingerprinting systems to prohibit the transmission of copyrighted materials
  finding desired content efficiently in an overwhelming amount of audio material

Benefits
Automated search of illegal content on the Internet
  examines the real audio information rather than just tag information

For the consumer
  make the meta-data of songs in a library consistent, allowing for easy organization
  can guarantee that what is downloaded is actually what it says it is
  will allow consumer to record signatures of sound and music on small handheld devices

Two principal components

Compute the fingerprint
Compare it to a database of previously computed fingerprints
A text example: in a box. I will not eat them with a fox. I

Details to worry about


Robustness (to noise, distortion)
Reliability
Fingerprint size (reduced dimensionality)
Granularity
Search speed and scalability
Computationally efficient
Resulting features must be informative about the audio content
Semantic or non-semantic features?
Hash table or vector representation?

Computing the fingerprint


Compare to hash functions?
  compare computed hash value with that stored in a database
Drawback
  need to worry about perceptual similarity and not mathematical similarity
  PCM audio vs. MP3: both sound alike but mathematically (i.e. spectral content) are quite different
  perceptual similarity is not transitive
  not possible to design a system which computes mathematically equal fingerprints for perceptually similar objects
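
The drawback above can be made concrete with a small sketch. It assumes SHA-256 as the hash function and uses two made-up byte strings as stand-ins for audio samples (neither comes from the slides). A one-sample change that would be inaudible still produces an unrelated hash value, which is why exact hash matching does not suit perceptually similar audio.

```python
import hashlib

# Hypothetical stand-ins for a few audio samples stored as bytes.
original = bytes([10, 200, 31, 7, 90, 145, 66, 2])
# The "same" audio after a tiny change: one sample differs by one step.
perturbed = bytes([10, 200, 31, 7, 91, 145, 66, 2])


def sha256_bits(data: bytes) -> str:
    """Return the SHA-256 digest of data as a 256-character bit string."""
    digest = hashlib.sha256(data).digest()
    return "".join(f"{byte:08b}" for byte in digest)


def differing_bits(a: str, b: str) -> int:
    """Count the positions where two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


bits_a = sha256_bits(original)
bits_b = sha256_bits(perturbed)
print(differing_bits(bits_a, bits_b), "of", len(bits_a), "hash bits differ")
```

A fingerprint, by contrast, is meant to change only slightly when the audio changes slightly, so matching has to tolerate a few bit errors.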

Techniques (general)
Any x number of seconds may be used to compute the fingerprint
Audio gets separated into frames
Features computed for each frame:
  Fourier coefficients
  MFCC, LPC
  Spectral flatness
  sharpness
Features mapped into a more compact representation by using HMM, or quantization
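
The general pipeline above (frames, one feature per frame, compact representation) might be sketched as follows. This is only an illustration under stated assumptions: spectral flatness is picked as the feature, the frame length, hop size, and number of quantization levels are arbitrary, and all function names are hypothetical. A naive DFT keeps the example free of external libraries.

```python
import cmath
import math


def split_into_frames(samples, frame_len, hop):
    """Cut a sample list into overlapping frames, advancing by hop samples."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def power_spectrum(frame):
    """Naive DFT power spectrum for bins 1 .. N/2 - 1 (DC excluded)."""
    n = len(frame)
    powers = []
    for k in range(1, n // 2):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        powers.append(abs(coeff) ** 2 + 1e-12)  # small floor avoids log(0)
    return powers


def spectral_flatness(frame):
    """Geometric mean divided by arithmetic mean of the power spectrum."""
    p = power_spectrum(frame)
    geometric = math.exp(sum(math.log(v) for v in p) / len(p))
    arithmetic = sum(p) / len(p)
    return geometric / arithmetic


def quantize(value, levels=4):
    """Map a value in [0, 1] to one of `levels` integer codes."""
    return min(int(value * levels), levels - 1)


# Toy signal: a pure 1 kHz tone sampled at 8 kHz (peaky spectrum, low flatness).
rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / rate) for t in range(256)]
frames = split_into_frames(tone, frame_len=64, hop=32)
codes = [quantize(spectral_flatness(f)) for f in frames]
print(len(frames), "frames, codes:", codes)
```

A tone yields the lowest code in every frame, while a noise-like or impulsive frame yields a high code; a real system would combine several features per frame before quantizing.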

Techniques (Haitsma, Kalker)


One 32-bit sub-fingerprint every 11.6 ms
A block consists of 256 sub-fingerprints
  corresponds to a granularity of only 3 seconds
Large overlap (31/32), so subsequent sub-fingerprints are similar and vary slowly in time
Worst-case scenario: the frame boundaries used during identification are 5.8 ms off from those in the database
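
The numbers on this slide follow from one another, as the short calculation below shows. It takes the 11.6 ms spacing, the 31/32 overlap, and the 256 sub-fingerprints per block from the slide; the frame length is derived here and is not stated in the slides. The worst case is read as half a hop, since a query can never be further than that from the nearest database frame boundary.

```python
hop_ms = 11.6            # one sub-fingerprint every 11.6 ms (from the slide)
overlap = 31 / 32        # fraction of each frame shared with the next one
subs_per_block = 256     # sub-fingerprints per block (from the slide)

# With 31/32 overlap the hop is 1/32 of a frame, so a frame spans 32 hops.
frame_ms = hop_ms / (1 - overlap)

# Time covered by the 256 hops of one block, in seconds.
block_s = subs_per_block * hop_ms / 1000

# A query frame boundary is at most half a hop away from a database boundary.
worst_case_offset_ms = hop_ms / 2

print(f"frame length: {frame_ms:.1f} ms")
print(f"block duration: {block_s:.2f} s")
print(f"worst-case misalignment: {worst_case_offset_ms:.1f} ms")
```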

Techniques (Haitsma, Kalker)


Data from each frame is sent through a filterbank
  33 filters, logarithmically spaced (to correspond roughly to the Bark scale) between 300 and 2000 Hz
Phase is neglected (perceptual reasons)
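
A minimal sketch of how 33 band energies could become one 32-bit sub-fingerprint. The band edges use plain geometric spacing between 300 and 2000 Hz. The bit rule is an assumption, since the slides do not spell it out: each bit is taken as the sign of the energy difference between adjacent bands, differenced again across two consecutive frames. Function names are hypothetical.

```python
def band_edges(low_hz=300.0, high_hz=2000.0, bands=33):
    """Return bands + 1 logarithmically spaced edge frequencies in Hz."""
    ratio = (high_hz / low_hz) ** (1.0 / bands)
    return [low_hz * ratio ** i for i in range(bands + 1)]


def sub_fingerprint(prev_energies, curr_energies):
    """Pack one bit per adjacent band pair into an integer.

    Assumed rule: bit m is 1 when the band-to-band energy difference
    of the current frame exceeds that of the previous frame.
    """
    bits = 0
    for m in range(len(curr_energies) - 1):
        diff = ((curr_energies[m] - curr_energies[m + 1])
                - (prev_energies[m] - prev_energies[m + 1]))
        bits = (bits << 1) | (1 if diff > 0 else 0)
    return bits


edges = band_edges()
print(f"{len(edges) - 1} bands from {edges[0]:.0f} Hz to {edges[-1]:.0f} Hz")

# Toy energies: a silent previous frame, then energy falling with frequency.
previous = [0.0] * 33
current = [float(33 - m) for m in range(33)]
print(f"sub-fingerprint: {sub_fingerprint(previous, current):032b}")
```

Because only the sign of an energy difference is kept, a uniform gain change or mild filtering leaves most bits unchanged, which is consistent with the robustness goal stated earlier.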

System overview

Techniques (Burges, Platt)


Downsampled to 11.025 kHz, split into frames with overlap of 2
MCLT is then applied to each frame. A 128-sample log spectrum is generated by taking the log modulus of each MCLT coefficient

Techniques (Burges, Platt)


Use prior knowledge to define form of the feature extractor
Features computed by a linear, convolutional neural network
  converts signal into a feature vector
  uses principal component analysis (PCA) to find a set of projections
  generates a vector of 128 values for every 11.6 ms interval
  dimensional-reduction method (i.e. lots of math)

Techniques (Burges, Platt)

3 layers of Oriented PCA (OPCA)
Operates on a frame of 128 values
Layer 1: generates 10 values for each frame
Layer 2: takes 42 layer 1 outputs and produces 20 values
Layer 3: takes 40 layer 2 outputs and produces 64 values (11K inputs --> 64 outputs)

Searching the Database


Look for the most similar (not necessarily exact) fingerprint
10,000 5-min. songs correspond to about 250 million sub-fingerprints
Brute force takes in excess of 20 minutes on a very fast PC
Brute force computes the bit-error rate for every possible position in the database
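
The size estimate and the brute-force procedure above can be sketched as follows. The song count, song length, and 11.6 ms spacing come from the slides; the toy database, the block length of 8, and the function names are illustrative assumptions (a real block would hold 256 sub-fingerprints).

```python
# Database size: 10,000 songs x 5 minutes, one sub-fingerprint per 11.6 ms.
total_sub_fingerprints = 10_000 * 5 * 60 / 0.0116
print(f"about {total_sub_fingerprints / 1e6:.0f} million sub-fingerprints")


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two 32-bit sub-fingerprints."""
    return bin(a ^ b).count("1")


def bit_error_rate(block_a, block_b):
    """Fraction of differing bits between two equally long blocks."""
    errors = sum(hamming(a, b) for a, b in zip(block_a, block_b))
    return errors / (32 * len(block_a))


def brute_force_search(query, database):
    """Compute the bit-error rate at every position and keep the lowest."""
    best_pos, best_ber = None, float("inf")
    for pos in range(len(database) - len(query) + 1):
        ber = bit_error_rate(query, database[pos:pos + len(query)])
        if ber < best_ber:
            best_pos, best_ber = pos, ber
    return best_pos, best_ber


# Hypothetical toy database of 500 distinct 32-bit values.
database = [(i * 2654435761) % 2**32 for i in range(500)]

# Query: 8 sub-fingerprints starting at position 100, with one bit flipped.
query = database[100:108]
query[3] ^= 1

position, ber = brute_force_search(query, database)
print(f"best match at position {position}, bit-error rate {ber:.4f}")
```

The cost grows with the number of database positions times the block length, which is what makes this approach slow at the scale quoted on the slide.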

Searching the Database


Make the assumption that at least 1 (of the 256) sub-fingerprints is error-free
Then use a hash table (as opposed to a more memory-intensive look-up table)
800,000 times faster
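
A minimal sketch of the hash-table idea, assuming a Python dictionary as the table. Each sub-fingerprint value maps to the positions where it occurs; every query sub-fingerprint is looked up, and only the candidate positions found that way are verified by bit-error rate. The 0.35 acceptance threshold, the toy database, and the function names are assumptions for illustration and do not come from the slides.

```python
from collections import defaultdict


def bit_error_rate(block_a, block_b):
    """Fraction of differing bits between two equally long blocks."""
    errors = sum(bin(a ^ b).count("1") for a, b in zip(block_a, block_b))
    return errors / (32 * len(block_a))


def build_index(database):
    """Map each sub-fingerprint value to the positions where it occurs."""
    index = defaultdict(list)
    for pos, sub_fp in enumerate(database):
        index[sub_fp].append(pos)
    return index


def indexed_search(query, database, index, threshold=0.35):
    """Verify only positions suggested by an exactly matching sub-fingerprint."""
    for offset, sub_fp in enumerate(query):
        for pos in index.get(sub_fp, []):
            start = pos - offset
            if start < 0 or start + len(query) > len(database):
                continue
            ber = bit_error_rate(query, database[start:start + len(query)])
            if ber < threshold:
                return start, ber
    return None  # no error-free sub-fingerprint led to a good match


# Hypothetical toy database of 500 distinct 32-bit values.
database = [(i * 2654435761) % 2**32 for i in range(500)]
index = build_index(database)

# Query with two corrupted sub-fingerprints; the other six are error-free.
query = database[100:108]
query[0] ^= 1
query[5] ^= 1
result = indexed_search(query, database, index)
print(result)

# If every sub-fingerprint is corrupted, the assumption fails and nothing is found.
all_corrupted = [value ^ 1 for value in database[100:108]]
print(indexed_search(all_corrupted, database, index))
```

The second lookup shows the price of the speed-up: when no sub-fingerprint in the block survives without errors, the search misses a match that brute force would still find.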

Results
False-positive rate of 3.6x10^-2 (Haitsma, Kalker)
On tests with a large (500,000) set of input traces, has a low false-positive and false-negative rate (Burges, Platt)
Didn't test on time compression, expansion
Can withstand distortions occurring from transmission over mobile phones
