Lecture 16
ELEC1200
Time-Frequency Analysis
• For many complex signals (like speech, music and other sounds), short
segments are well described by a sinusoidal representation with a
few important frequency components.
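As a rough illustration (not from the lecture; the sampling rate, frequencies, and amplitudes below are arbitrary), the sketch builds a 30 ms frame from three sinusoids and checks with the FFT that only a few frequency components dominate its spectrum.

```python
# Minimal sketch: a short frame that is just a sum of a few sinusoids,
# and its amplitude spectrum, which has only a few dominant components.
import numpy as np

fs = 8000                     # sampling rate in Hz (assumed)
n = 240                       # 240 samples = 30 ms at 8 kHz
t = np.arange(n) / fs

# frame = sum of three sinusoids (frequencies/amplitudes chosen arbitrarily)
frame = (1.0 * np.sin(2*np.pi*300*t) +
         0.6 * np.sin(2*np.pi*900*t) +
         0.3 * np.sin(2*np.pi*2100*t))

spectrum = np.abs(np.fft.rfft(frame)) / n     # amplitude spectrum of the frame
freqs = np.fft.rfftfreq(n, d=1/fs)            # frequency axis in Hz

top3 = np.sort(freqs[np.argsort(spectrum)[-3:]])
print(top3)                                   # approximately [ 300.  900. 2100.]
```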
Spectrogram Example
[Figure: amplitude spectra of three individual frames (amplitude vs. frequency, 0–4000 Hz) and the resulting spectrogram (frequency, 0–4000 Hz, vs. frame number); red = loud (high amplitude), blue = quiet (low amplitude).]
Computation of the Spectrogram
• Divide the signal into a set of frames, typically about 20-50 ms long.
[Figure: a waveform divided into short frames; one frame is highlighted.]
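A minimal sketch of the computation in Python/NumPy (the function name, frame length, and test signal are my own; the lecture's implementation may differ): cut the signal into frames and stack each frame's amplitude spectrum as one column of the spectrogram.

```python
# Sketch: spectrogram = amplitude spectrum of each frame, stacked column by column.
import numpy as np

def spectrogram(x, fs, frame_ms=20):
    """Return (spectrogram matrix, frequency axis); columns correspond to frames."""
    frame_len = int(fs * frame_ms / 1000)           # samples per frame
    num_frames = len(x) // frame_len                # non-overlapping frames
    cols = []
    for k in range(num_frames):
        frame = x[k*frame_len:(k+1)*frame_len]
        cols.append(np.abs(np.fft.rfft(frame)))     # amplitude spectrum of frame k
    freqs = np.fft.rfftfreq(frame_len, d=1/fs)      # frequency axis in Hz
    return np.column_stack(cols), freqs

# usage: a 1 s test signal whose frequency jumps halfway through
fs = 8000
t = np.arange(fs) / fs
x = np.where(t < 0.5, np.sin(2*np.pi*500*t), np.sin(2*np.pi*1500*t))
S, freqs = spectrogram(x, fs)
print(S.shape)    # (81, 50): 81 frequency bins x 50 frames
```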
Speech Spectrogram
[Figure: two speech spectrograms (frequency, 0–3500 Hz, vs. frame number).]
Train Whistle vs. Bird Chirps
Can you figure out which is which?
[Figure: two spectrograms (frequency, 0–4000 Hz, vs. frame number), one of a train whistle and one of bird chirps.]
Speech Data
[Figure: speech spectrogram; red = high amplitude, blue = low amplitude.]
• Characteristics
– Recent measurements are more informative in predicting the future than those from the distant past.
• Phonemes
– Any distinct unit of sound that distinguishes one word from another (e.g., p, b)
ELEC1200: A System View of
Communications: from Signals to Packets
Lecture 16
• Time-Frequency Analysis
– Analyzing sounds as a sequence of frames
– Spectrogram
Source Coding/Data Compression
We have a way to reliably send bits across a complex communications
network
Source Encoding & Decoding
• Encoding (Compression)
– INPUT information is converted to a bit stream with as few bits as possible.
– Example INPUTs: text, music, images, video
• Decoding (Decompression)
– The bit stream is converted back to OUT, which is similar (lossy) or identical (lossless) to INPUT.
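A tiny lossless round-trip as a sketch (Python's standard zlib module stands in for the source encoder; this is only an illustration, not the audio coding discussed later): the bit stream is much shorter than INPUT, and decoding reproduces INPUT exactly.

```python
# Lossless encode/decode: OUT is identical to INPUT.
import zlib

INPUT = b"to be or not to be, that is the question " * 100
bitstream = zlib.compress(INPUT)       # encoding: far fewer bytes than INPUT
OUT = zlib.decompress(bitstream)       # decoding: recover the data

print(len(INPUT), len(bitstream))      # e.g., 4200 bytes vs. roughly 100 bytes
print(OUT == INPUT)                    # True: lossless, OUT identical to INPUT
```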
Lossless vs. Lossy Compression
[Diagram: INPUT → Source Encoding → Store/Retrieve or Transmit/Receive → Source Decoding → OUT]
– A 3-minute song of uncompressed CD audio will require 1.411 Mbit/s × 180 s ≈ 254 Mbit ≈ 31.75 MB
Poor ways to compress an audio file
• Reduce the total number of bits per sample
– e.g., from 16 bits to 8 bits per sample
– Gives you a factor of 2 in compression
– However, introduces noticeable distortion
• Reduce the sampling rate
– e.g., 44 kHz to 22 kHz
– Again only a gain of a factor of 2 in size
– However, leads to a noticeable loss of high frequency information
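The sketch below mimics both crude approaches on a made-up test signal (all parameters are arbitrary): each roughly halves the amount of data, with the drawbacks listed above.

```python
# Two poor compression methods: fewer bits per sample, and a lower sampling rate.
import numpy as np

fs = 44000
t = np.arange(fs) / fs                                  # 1 second of signal
x = 0.5*np.sin(2*np.pi*440*t) + 0.2*np.sin(2*np.pi*8000*t)

# (1) 16-bit -> 8-bit samples: half the bits, but coarser quantization (distortion)
x16 = np.round(x * 32767).astype(np.int16)
x8 = (x16 >> 8).astype(np.int8)                         # keep only the top 8 bits

# (2) 44 kHz -> 22 kHz: half the samples, but content above 11 kHz is lost or aliased
x22 = x[::2]                                            # naive decimation (no anti-alias filter)

print(x16.nbytes, x8.nbytes)        # 88000 vs. 44000 bytes
print(x.size, x22.size)             # 44000 vs. 22000 samples
```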
MP3
• MPEG = Moving Picture Experts Group
– set up by ISO (the International Organization for Standardization)
– issues a standard every few years
• MPEG-1 (1992)
• MPEG-2 (1994), …
Psycho-acoustics
• Psycho-acoustics ~ principles of human perception of sound
• Human hearing covers frequencies in [20 Hz, 20 kHz]
– Most sensitive at 2 to 4 kHz
• Normal voice range is about 500 Hz to 2 kHz
– Low frequencies are vowels such as A, E, I, O, …
– High frequencies are consonants (which can be combined with a vowel to form a syllable) such as B, C, D, F, G, K, L, …
• Hearing is more sensitive to loudness at mid frequencies than at other frequencies (e.g., two tones of equal power but different frequencies will not be equally loud)
– Intermediate frequencies: [500 Hz, 5000 Hz]
– Sensitivity decreases at low and high frequencies
Perceptual Coding
• What matters is how the consumer (e.g. human ears or eyes)
perceives the input.
– Frequency range, amplitude sensitivity, color response, …
– Masking effects
• Identify information that can be removed from bit stream without
perceived effect, e.g.,
– Sounds outside frequency range
– Masked sounds
• Encode remaining information efficiently
– Use frequency-based transformations
– Quantize coefficients of frequency (loss occurs here)
– Add lossless coding (e.g., Huffman coding)
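A toy sketch of these three steps (my own simplification, not the MP3 algorithm; masking is not modeled here, and zlib stands in for Huffman coding):

```python
# Toy perceptual coder for one frame: transform, drop inaudible frequencies,
# coarsely quantize the coefficients (loss here), then pack losslessly.
import numpy as np, zlib

fs = 44100
n = 882                                      # one 20 ms frame at 44.1 kHz
t = np.arange(n) / fs
frame = np.sin(2*np.pi*440*t) + 0.1*np.random.randn(n)   # made-up audio frame

coeffs = np.fft.rfft(frame)                  # frequency-based transformation
freqs = np.fft.rfftfreq(n, d=1/fs)
coeffs[(freqs < 40) | (freqs > 15000)] = 0   # remove sounds outside the kept range

c = coeffs / n                               # scale to per-sample amplitudes
q = np.round(np.concatenate([c.real, c.imag]) * 64).astype(np.int8)  # coarse quantization

packed = zlib.compress(q.tobytes())          # lossless coding of the quantized values
print(frame.nbytes, len(packed))             # stored frame (float64 bytes) vs. packed bytes
```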
Masking
• Masking: If a dominant tone • Definitions
is present, then sounds at – Auditory threshold – minimum
frequencies near it will be signal level at which a pure tone
harder to hear. can be heard
– Masking threshold – minimum
• Coding consequences signal level if a dominant tone is
– Less precision is required to present
store information about nearby
frequencies
– Less precision = coarser
quantization
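A toy illustration of masking (the numbers and the simple spreading rule below are invented; the real psychoacoustic model in MP3 is more elaborate): a weak tone close in frequency to a loud masker falls below the masking threshold, while an equally weak tone far away remains audible.

```python
# Toy masking threshold around a single dominant tone.
import numpy as np

def masking_threshold_db(f, masker_f=1000.0, masker_db=60.0, slope_db_per_octave=15.0):
    """Made-up rule: threshold falls off linearly (in dB) per octave from the masker."""
    octaves = abs(np.log2(f / masker_f))
    return masker_db - slope_db_per_octave * octaves

for f, level_db in [(1100.0, 40.0), (3000.0, 40.0)]:      # two equally weak tones
    status = "audible" if level_db > masking_threshold_db(f) else "masked"
    print(f, status)                                       # 1100 Hz masked, 3000 Hz audible
```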
Quantization
• Real signals typically assume continuous values, e.g., between 0 and
+1 or -1 and +1.
• However, for digital storage, we use binary numbers, which have a
limited number of values
– An n-bit number has 2^n different values.
• Thus, we divide the expected signal range R into 2^n different levels, and quantize the original signal by recording the closest level.
[Figure: original signal and its quantized version over range R; 3 bits → 8 levels.]
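A small sketch of uniform quantization, assuming a signal confined to [-1, +1] (the function and test signal are my own):

```python
# Uniform quantization: n bits -> 2^n levels; each sample snaps to the closest level.
import numpy as np

def quantize(x, n_bits, lo=-1.0, hi=1.0):
    levels = np.linspace(lo, hi, 2**n_bits)                 # 2^n equally spaced levels
    idx = np.argmin(np.abs(x[:, None] - levels[None, :]), axis=1)
    return levels[idx]                                      # closest level per sample

t = np.arange(100) / 100
x = np.sin(2*np.pi*2*t)                                     # original signal in [-1, +1]
xq = quantize(x, n_bits=3)                                  # 3 bits -> 8 levels
print(np.unique(xq).size)                                   # at most 8 distinct values
print(np.max(np.abs(x - xq)))                               # error is at most half the level spacing
```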
Resolution
• Quantization or rounding error is the difference between the actual
signal and the quantized signal
• We reduce quantization error by using more levels
• Resolution is either measured as
– the number of bits (more bits mean finer resolution)
– the difference between levels (smaller is better)
• Higher resolution requires more storage
[Figure: quantization error between the original and quantized signal; range R, 8 levels, 3-bit resolution.]
resolution = R / (2^n − 1)
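Example: with a signal range R = 1 (values between 0 and +1) and n = 3 bits, resolution = 1 / (2^3 − 1) = 1/7 ≈ 0.14; with n = 8 bits it becomes 1/255 ≈ 0.004, i.e., much finer resolution at the cost of more storage per sample.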
Principles of Auditory Coding
• Time frequency decomposition
– Divide the signal into frames
– Obtain the spectrum of each piece
• Use psycho-acoustic model to determine what information to keep
– Don’t store information outside the hearing range
(40 Hz to 15 kHz)
– Stereo info not stored for low frequencies
– Masking
• A component (at a given frequency) masks components at neighboring frequencies
• Store the information in the most compact way possible
– Minimize the bitrate requirement
– Maximize the audible content that is retained
MP3 schematic
Input: 1.411 Mbit/s (16 bit per channel @44.1 kHz - stereo)
Output: Coded audio signal at ~128 kbit/s
Without compression: about 1.4 Mbit of data per second, 84 Mbit per minute, or ~5 Gbit per hour
MP3: 128 kbit of data per second, 7.68 Mbit per minute, or ~0.46 Gbit per hour
[Schematic annotations: frequency analysis similar to a Fourier series; effects of masking determined here; information encoded here.]
Non-uniform quantization
• MP3 compression quantizes the amplitudes of different frequency components
differently, depending upon masking.
• Frequency components near a dominant masker are quantized with fewer bits.
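A sketch of the idea (the allocation rule and all numbers below are made up, not the actual MP3 bit-allocation algorithm): coefficients near the masker get fewer bits, hence a coarser quantization step; coefficients far away keep a fine step.

```python
# Masking-driven non-uniform quantization: fewer bits near the dominant masker.
import numpy as np

def bits_for_coefficient(f, masker_f, near_bits=4, far_bits=12, width_hz=500.0):
    """Made-up rule: coarse quantization within +/- width_hz of the masker."""
    return near_bits if abs(f - masker_f) < width_hz else far_bits

def quantize_value(value, n_bits, max_abs=1.0):
    step = 2 * max_abs / 2**n_bits              # coarser step when n_bits is small
    return step * np.round(value / step)

masker_f = 1000.0                               # dominant tone (assumed known)
for f, value in [(1100.0, 0.0731), (4000.0, 0.0731)]:
    b = bits_for_coefficient(f, masker_f)
    print(f, b, quantize_value(value, b))       # 4 bits near the masker, 12 bits far away
```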
Summary
• Audio waveforms are typically analyzed as a sequence of frames
– Within each frame, the signal can be well approximated by a few frequency
components
– The spectrogram can be used to visualize changes in the frequency content over
time
• Source coding/data compression
– Recode message stream to remove redundant information, aka compression. The
goal is to match data rate to actual information content.
– Two types of compression: Lossless vs lossy
• MP3 audio lossy compression combines framing and frequency analysis
with a non-uniform quantization based on a perceptual model
– By throwing away “unimportant” (imperceptible) information, we can obtain large
compression ratios