
Automatic audio analysis

for content description & indexing


Dan Ellis
International Computer Science Institute, Berkeley CA
<[email protected]>

Outline

1 Auditory Scene Analysis (ASA)

2 Computational ASA (CASA)

3 Prediction-driven CASA

4 Speech recognition & sound mixtures

5 Implications for content analysis



1 Auditory Scene Analysis
“The organization of complex sound scenes
according to their inferred sources”
• Sounds rarely occur in isolation
- organization required for useful information
• Human audition is very effective
- unexpectedly difficult to model
• ‘Correct’ analysis defined by goal
- source shows independence, continuity
→ ecological constraints enable organization
[Spectrogram: city-street ambience "city22", 0-9 s, 200-4000 Hz, level in dB]



Psychology of ASA
• Extensive experimental research
- organization of ‘simple pieces’
(sinusoids & white noise)
- streaming, pitch perception, ‘double vowels’
• “Auditory Scene Analysis” [Bregman 1990]
→ grouping ‘rules’
- common onset/offset/modulation,
harmonicity, spatial location
• Debated... (Darwin, Carlyon, Moore, Remez)

[Figure from Darwin 1996]
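As a toy illustration of how one such grouping rule might be operationalized (the function and its 30 ms tolerance are my own illustration, not from Bregman or the talk), the Python sketch below groups sinusoidal tracks by common onset:

import numpy as np

def group_by_common_onset(onsets, tol=0.03):
    """Group tracks whose onset times (seconds) fall within `tol` of the
    previous onset in sorted order -- a crude 'common onset' rule."""
    order = np.argsort(onsets)
    groups, current = [], [int(order[0])]
    for i in order[1:]:
        if onsets[i] - onsets[current[-1]] <= tol:
            current.append(int(i))        # near-synchronous: same source
        else:
            groups.append(current)        # large gap: start a new group
            current = [int(i)]
    groups.append(current)
    return groups

# Three harmonics starting together, then an unrelated later tone:
print(group_by_common_onset([0.10, 0.11, 0.12, 0.50]))  # [[0, 1, 2], [3]]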



2 Computational Auditory Scene Analysis
(CASA)
• Automatic sound organization?
- convert an undifferentiated signal into a
description in terms of different sources
[Spectrogram: "city22" annotated with inferred sources: horn, door crash, yell, car noise; 0-9 s, 200-4000 Hz, level in dB]

• Translate psych. rules into programs?


- representations to reveal common onset,
harmonicity ...
• Motivations & Applications
- it’s a puzzle: new processing principles?
- real-world interactive systems (speech, robots)
- hearing prostheses (enhancement, description)
- advanced processing (remixing)
- multimedia indexing...
CASA survey
• Early work on co-channel speech
- listeners benefit from pitch difference
- algorithms for separating periodicities
• Utterance-sized signals need more
- cannot predict number of signals (0, 1, 2 ...)
- birth/death processes
• Ultimately, more constraints needed
- nonperiodic signals
- masked cues
- ambiguous signals



CASA1: Periodic pieces
• Weintraub 1985
- separate male & female voices
- find periodicities in each frequency channel by
auto-coincidence
- number of voices is ‘hidden state’
• Cooke & Brown (1991-3)
- divide time-frequency plane into elements
- apply grouping rules to form sources
- pull single periodic target out of noise
[Spectrograms: "brn1h.aif" (input mixture) and "brn1h.fi.aif" (extracted periodic target); 100-3000 Hz, 0-1.0 s]
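A minimal sketch of the per-channel periodicity idea, with plain autocorrelation standing in for Weintraub's auto-coincidence (the 0.3 peak threshold and pitch range are illustrative):

import numpy as np

def channel_period(x, sr, fmin=80.0, fmax=400.0):
    """Estimate the dominant period (s) in one filterbank channel by
    autocorrelation; return None if no clear periodicity is found."""
    x = x - x.mean()
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]   # lags 0..N-1
    lo, hi = int(sr / fmax), int(sr / fmin)             # plausible pitch lags
    lag = lo + int(np.argmax(ac[lo:hi]))
    return lag / sr if ac[lag] > 0.3 * ac[0] else None  # demand a clear peak

# A 150 Hz pulse train in mild noise is found; white noise alone is not:
sr = 8000
t = np.arange(2000) / sr
x = np.sign(np.sin(2 * np.pi * 150 * t)) + 0.1 * np.random.randn(len(t))
print(channel_period(x, sr))                      # ~0.0066 s (= 1/150)
print(channel_period(np.random.randn(2000), sr))  # None (aperiodic)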



CASA2: Hypothesis systems
• Okuno et al. (1994-)
- ‘tracers’ follow each harmonic + noise ‘agent’
- residue-driven: account for whole signal
• Klassner 1996
- search for a combination of templates
- high-level hypotheses permit front-end tuning
[Figure: Klassner's template search: (a) observed time-frequency tracks, (b) matched templates Buzzer-Alarm, Glass-Clink, Phone-Ring, Siren-Chirp; 420-3760 Hz, 0-4 s]

• Ellis 1996
- model for events perceived in dense scenes
- prediction-driven: reconcile observations with hypotheses



CASA3: Other approaches
• Blind source separation (Bell & Sejnowski)
- find exact separation parameters by maximizing a statistic, e.g. signal independence
• HMM decomposition (RK Moore)
- recover combined source states directly
• Neural models (Malsburg, Wang & Brown)
- avoid implausible AI methods (search, lists)
- oscillators substitute for iteration?
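For the blind-separation entry, a minimal natural-gradient infomax sketch in the spirit of Bell & Sejnowski (instantaneous two-channel mixture; Laplacian stand-ins for speech-like sources; everything beyond the update rule is illustrative):

import numpy as np

def infomax_ica(X, lr=0.01, iters=2000):
    """Unmix an instantaneous mixture X (channels x samples) by
    maximizing independence via the natural-gradient infomax update."""
    n, m = X.shape
    X = X - X.mean(axis=1, keepdims=True)
    W = np.eye(n)
    for _ in range(iters):
        Y = W @ X
        g = np.tanh(Y)                             # score for super-Gaussian sources
        W += lr * (np.eye(n) - (g @ Y.T) / m) @ W  # natural-gradient step
    return W @ X                                   # sources, up to scale & order

# Two super-Gaussian (speech-like) sources, instantaneously mixed:
rng = np.random.default_rng(0)
S = rng.laplace(size=(2, 8000))
A = np.array([[1.0, 0.6], [0.4, 1.0]])
recovered = infomax_ica(A @ S)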



3 Prediction-driven CASA
Perception is not direct
but a search for plausible hypotheses
• Data-driven:
input mixture → Front end (signal features)
→ Object formation (discrete objects)
→ Grouping rules → source groups

vs. Prediction-driven:
input mixture → Front end (observed features)
→ Compare & reconcile → prediction errors
→ Hypothesis management (hypotheses out)
→ Noise & Periodic components → Predict & combine
→ predicted features → back to Compare & reconcile
• Motivations
- detect non-tonal events (noise & clicks)
- support ‘restoration illusions’...
→ hooks for high-level knowledge
+ ‘complete explanation’, multiple hypotheses,
resynthesis
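A toy per-frame version of that reconcile loop (the dict layout and 10% residual threshold are invented for illustration; the real engine maintains structured noise, click, and periodic elements):

import numpy as np

def pdcasa_frame(observed, hypotheses, max_hyps=4):
    """Advance one frame: score each hypothesis by how well the sum of
    its components predicts the observed spectral energies, and spawn
    an extended hypothesis when too much energy is left unexplained."""
    scored = []
    for h in hypotheses:
        pred = sum(h['components'], np.zeros_like(observed))
        err = float(np.abs(observed - pred).sum())
        h2 = {'components': h['components'], 'score': h['score'] + err}
        scored.append(h2)
        residual = np.maximum(observed - pred, 0.0)  # unexplained energy
        if residual.sum() > 0.1 * observed.sum():    # enough to justify a new element
            scored.append({'components': h['components'] + [residual],
                           'score': h2['score']})
    # Prune to the few most plausible (lowest cumulative error) hypotheses:
    return sorted(scored, key=lambda h: h['score'])[:max_hyps]

# Start from the empty explanation and explain a toy two-band frame:
hyps = pdcasa_frame(np.array([1.0, 0.2]), [{'components': [], 'score': 0.0}])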
Analyzing the continuity illusion
• Interrupted tone heard as continuous
- ... if the interruption could be a masker
[Spectrogram: "ptshort", a tone interrupted by a noise burst; 1000-4000 Hz, 0-1.4 s]

• Data-driven just sees gaps

• Prediction-driven analysis can accommodate it

- special case or general principle?
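A minimal version of that masking test (a deliberate simplification: the tone hypothesis survives the gap only if every gap frame carries at least the tone's level in the tone's band):

import numpy as np

def continuity_hypothesis(spec_db, band, gap, margin_db=0.0):
    """spec_db: spectrogram in dB (bins x frames); band: slice of bins
    covering the tone; gap: slice of frames where the tone is absent.
    Returns True if the interruption could have masked the tone."""
    tone_level = spec_db[band, :gap.start].max()  # tone level before the gap
    gap_level = spec_db[band, gap].max(axis=0)    # band energy during the gap
    return bool((gap_level + margin_db >= tone_level).all())

# Tone at 60 dB; a 70 dB burst fills the gap -> heard as continuous:
spec = np.full((4, 10), -80.0)
spec[1, :] = 60.0      # the tone's band
spec[1, 4:6] = 70.0    # noise burst replaces the tone
print(continuity_hypothesis(spec, band=slice(1, 2), gap=slice(4, 6)))  # True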


Phonemic Restoration (Warren 1970)
• Another ‘illusion’ instance
• Inference relies on high-level semantics
[Spectrogram: "nsoffee.aif", speech with a phoneme replaced by noise; 0-3500 Hz, 1.2-1.7 s]

• Incorporating knowledge into models?



Subjective ground-truth in mixtures?
• Listening tests collect ‘perceived events’:

• Consistent answers:
[Spectrogram: "City" example, 0-9 s, with subjects' reported events marked below:]

Horn1 (10/10): S1 "honk, honk"; S2 "first double horn"; S3 "1st horn"; S4 "horn1"; S5 "Honk"; S6 "double horn"; S7 "horn" & "horn2"; S8 "car horns"; S9 "horn 2"; S10 "car horn"

Crash (10/10): S1 "slam"; S2 "crash"; S3 "crash (not car)"; S4 "crash"; S5 "Trash can"; S6 "slam"; S7 "gunshot"; S8 "large object crash"; S9 "door Slam?"; S10 "door slamming"

Horn2 (5/10): S2 "horn during crash"; S6 "doppler horn"; S7 "horn3"; S8 "car horns"; S9 "horn 5"

Truck (7/10): S1 "rev up/passing"; S2 "truck accelerating"; S3 "closeup car"; S5 "Acceleration"; S6 "acceleration"; S8 "truck engine"; S10 "wheels on road"



PDCASA example:
City-street ambience
[Figure: PDCASA analysis of the city-street ambience, 0-9 s. Panels: original spectrogram; periodic elements Wefts 1-12 accounting for Horn1 (10/10), Horn2 (5/10), Horn3 (5/10), Horn4 (8/10), Horn5 (10/10); Noise2 + Click1 accounting for Crash (10/10); Noise1 accounting for Squeal (6/10) and Truck (7/10); level in dB]

• Problems
- error allocation
- rating hypotheses
- source hierarchy
- resynthesis



4 Speech recognition
& sound mixtures
• Conventional speech recognition:

signal → Feature extraction → low-dim. features
→ Phoneme classifier → phoneme probabilities
→ HMM decoder → words

- signal assumed entirely speech


- find a valid labelling in terms of discrete labels
- class models from training data
• Some problems:
- need to ignore lexically-irrelevant variation
(microphone, voice pitch etc.)
- compact feature space → everything speech-like
• Very fragile to nonspeech, background
- scene-analysis methods very attractive...
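To make the decoder stage concrete, a minimal Viterbi search over per-frame log phoneme probabilities (the toy two-state example is illustrative, not the recognizer discussed here):

import numpy as np

def viterbi(log_obs, log_trans):
    """Best state path given log_obs (frames x states) from the phoneme
    classifier and a log transition matrix (states x states)."""
    T, S = log_obs.shape
    delta = log_obs[0].copy()                # best score ending in each state
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans  # score of each (prev, cur) pair
        back[t] = scores.argmax(axis=0)      # best predecessor per state
        delta = scores.max(axis=0) + log_obs[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):            # trace back the best path
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Two states with sticky self-loops:
obs = np.log(np.array([[0.9, 0.1], [0.8, 0.2], [0.05, 0.95]]))
trans = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))
print(viterbi(obs, trans))   # [0, 0, 1]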



CASA for speech recognition
• Data-driven: CASA as preprocessor
- problems with ‘holes’ (but: Cooke, Okuno)
- doesn’t exploit knowledge of speech structure
• Prediction-driven: speech as component
- same ‘reconciliation’ of speech hypotheses
- need to express 'predictions' in the signal domain (see the sketch below the diagram)
input mixture → Front end → Compare & reconcile
→ prediction errors → Hypothesis management
→ Speech, Noise & Periodic components
→ Predict & combine → predicted features
→ back to Compare & reconcile
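One reading of the 'compare & reconcile' box as a signal-domain sketch (quantities in linear spectral energy; the residual/overshoot split is my illustration):

import numpy as np

def reconcile(observed, speech_pred, noise_pred):
    """Compare the combined speech + noise prediction with the observed
    mixture spectrum. The residual (energy nothing explains) argues for
    new noise components; the overshoot (prediction exceeding the
    observation) counts against the current speech hypothesis."""
    combined = speech_pred + noise_pred              # energies add for independent sources
    residual = np.maximum(observed - combined, 0.0)
    overshoot = np.maximum(combined - observed, 0.0)
    return residual, overshoot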


Example of speech & nonspeech
[Figure: speech + nonspeech example, Bark-scale envelopes, 0-1.5 s:
(a) Clap (clap8k-env.pf)
(b) Speech plus clap (223cl-env.pf)
(c) Recognizer output, aligned phone strings:
    "h# w n ay n tcl t uw f ay ah s ay ow h# v s eh v ah n h#"
    "h# n ay n tcl t uw f ay v ow h# s eh v ax n" = <SIL> nine two five oh <SIL> seven
(d) Reconstruction from labels alone (223cl-renvG.pf)
(e) Slowly-varying portion of original (223cl-envg.pf)
(f) Predicted speech element ( = (d)+(e) ) (223cl-renv.pf)
(g) Click5 from nonspeech analysis (justclick.pf)
(h) Spurious elements from nonspeech analysis (nonclicks.pf)]

• Problems:
- undoing classification & normalization
- finding a starting hypothesis
- granularity of integration
5 Implications for content analysis:
Using CASA to index soundtracks
[Spectrogram: "city22" annotated with candidate sound objects: horn, door crash, yell, car noise; 0-9 s, 200-4000 Hz, level in dB]

• What are the ‘objects’ in a soundtrack?


- subjective definition → need auditory model
• Segmentation vs. classification
- low-level cues → locate events
- higher-level ‘learned’ knowledge to give
semantic label (footstep, crash)
... AI-complete?
• But: hard to separate
- illusion phenomena suggest auditory
organization depends on interpretation
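A toy low-level cue of that kind (threshold and layout are invented for illustration): flag frames where band energy jumps well above the background level.

import numpy as np

def locate_events(energy_db, thresh_db=6.0):
    """Return frame indices where energy first rises more than thresh_db
    above the global median level: candidate event onsets (toy cue)."""
    background = np.median(energy_db)   # crude noise-floor estimate
    active = energy_db > background + thresh_db
    return np.flatnonzero(np.diff(active.astype(int)) == 1) + 1

frames = np.array([-60, -61, -59, -45, -44, -60, -40, -60], dtype=float)
print(locate_events(frames))   # [3 6]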



Using speech recognition for indexing
• Active research area:
Access to news broadcast databases
- e.g. Informedia (CMU), ThisL (BBC+...)
- use LVCSR to transcribe,
then text retrieval to find matches
- 30-40% word error rate, still works OK
• Several systems at NIST TREC workshop
• Tricks to ‘ignore’ nonspeech/poor speech
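A minimal sketch of the transcribe-then-retrieve pattern (the data layout is invented for illustration):

from collections import defaultdict

def build_index(transcripts):
    """Invert recognizer output for retrieval: word -> [(doc, time_s)].
    transcripts: {doc_id: [(time_s, word), ...]} from the LVCSR pass.
    Even at 30-40% word error, enough query words survive to retrieve
    and roughly locate the relevant stories."""
    index = defaultdict(list)
    for doc, words in transcripts.items():
        for time_s, word in words:
            index[word.lower()].append((doc, time_s))
    return index

index = build_index({'news1': [(12.3, 'Election'), (13.0, 'results')]})
print(index['election'])   # [('news1', 12.3)]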



Open issues in automatic indexing
• How to do ASA?
• Explanation/description hierarchy
- PDCASA: ‘generic’ primitives
+ constraining hierarchy
- subjective & task-dependent
• Classification
- connecting subjective & objective properties
→ finding subjective invariants, prominence
- representation of sound-object ‘classes’
• Resynthesis?
- a ‘good’ description should be adequate
- provided in PDCASA, but low quality
- requires good knowledge-based constraints



6 Conclusions
• Auditory organization is required in real
environments
• We don’t know how listeners do it!
- plenty of modeling interest
• Prediction-reconciliation can account for
‘illusions’
- use ‘knowledge’ when signal is inadequate
- important in a wider range of circumstances?
• Speech recognizers are a good source of
knowledge
• Automatic indexing implies ‘synthetic listener’
- need to solve a lot of modeling issues

