Lecture 1
CS 753
Instructor: Preethi Jyothi
Course Plan (I)
• Weighted Finite State Transducers for ASR
• LM: Ngram models (+smoothing), RNNLMs
[Figure: acoustic model, pronunciation model, and grammar (language) model represented as weighted transducers scoring the sentence "good prose is like a windowpane" (e.g., the phones "ih z" mapping to "is"), with example arc weights 0.7, 0.8, 1.2 and an overall SCORE of 2.5.]
• End-to-end Neural Models for ASR

Check www.cse.iitb.ac.in/~pjyothi/cs753 for latest updates
Moodle will be used for assignment/project-related submissions and all announcements

[Figure: Listen, Attend and Spell (LAS) model: the listener is a pyramidal BLSTM encoding the input sequence x1 … xT into high-level features h; the speller is an attention-based decoder generating the outputs y. Image from: Chan et al., "Listen, Attend and Spell: A NN for LVCSR", ICASSP 2016.]
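To make the figure's two components concrete, here is a minimal PyTorch-style sketch (my own illustration, not the authors' code) of a pyramidal BLSTM listener and a single attention-based speller step; all module names, dimensions, and defaults are assumptions for this example.

```python
import torch
import torch.nn as nn

class Listener(nn.Module):
    """Pyramidal BLSTM: each layer halves the time resolution by
    concatenating pairs of consecutive frames before the BLSTM."""
    def __init__(self, input_dim, hidden_dim, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList()
        dim = input_dim
        for _ in range(num_layers):
            # input size doubles because adjacent frames are merged
            self.layers.append(nn.LSTM(2 * dim, hidden_dim,
                                       bidirectional=True, batch_first=True))
            dim = 2 * hidden_dim          # BLSTM output feeds the next layer

    def forward(self, x):                 # x: (batch, T, input_dim)
        for lstm in self.layers:
            B, T, D = x.shape
            x = x[:, : T - T % 2].reshape(B, T // 2, 2 * D)  # merge frame pairs
            x, _ = lstm(x)
        return x                          # h: (batch, T / 2**L, 2 * hidden_dim)

class SpellerStep(nn.Module):
    """One decoding step of an attention-based speller: condition on the
    previous character, attend over h, and emit a distribution over y."""
    def __init__(self, hidden_dim, enc_dim, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.cell = nn.LSTMCell(hidden_dim + enc_dim, hidden_dim)
        self.query = nn.Linear(hidden_dim, enc_dim)
        self.out = nn.Linear(hidden_dim + enc_dim, vocab_size)

    def forward(self, h, y_prev, state, context):
        s, c = self.cell(torch.cat([self.embed(y_prev), context], dim=-1), state)
        scores = torch.bmm(h, self.query(s).unsqueeze(-1)).squeeze(-1)
        attn = torch.softmax(scores, dim=-1)              # weights over frames
        context = torch.bmm(attn.unsqueeze(1), h).squeeze(1)
        logits = self.out(torch.cat([s, context], dim=-1))
        return logits, (s, c), context
```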
Other Course Info
• Teaching Assistants (TAs):
- Vinit Unni (vinit AT cse)
- Saiteja Nalla (saitejan AT cse)
- Naman Jain (namanjain AT cse)
• Readings:
- No fixed textbook. “Speech and Language Processing” by Jurafsky and Martin serves
as a good starting point.
- All further readings will be posted online.
• Participation 5%
• Always cite your sources (be it images, papers or existing code repos).
Follow proper citation guidelines.
• Preliminary Project Evaluation: Short report detailing the project statement (Sep 1-7)
• Excellent Projects:
- Will earn extra credit that counts towards the final grade
- Can be turned into a research paper
#1: Speech-driven Facial Animation
History of ASR
[Timeline figure, 1922-2012: from a single-word frequency detector (1922) and 16-word isolated word recognition to modern voice assistants such as Siri and Cortana.]
How are ASR systems evaluated?
• Error rates computed on an unseen test set by comparing W* (decoded
sentence) against Wref (reference sentence) for each test utterance
- Sentence/Utterance error rate (trivial to compute!)
- Word/Phone error rate
• Word/Phone error rate (ER) uses the Levenshtein distance measure: what is the minimum number of edits (insertions/deletions/substitutions) required to convert W* to Wref?
On a test set with N instances:

$$\mathrm{ER} = \frac{\sum_{j=1}^{N} \left(\mathrm{Ins}_j + \mathrm{Del}_j + \mathrm{Sub}_j\right)}{\sum_{j=1}^{N} \ell_j}$$

where Ins_j, Del_j, Sub_j are the numbers of insertions/deletions/substitutions in the jth ASR output, and ℓ_j is the total number of words/phones in the jth reference.
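As a concrete illustration (my own sketch, not from the course materials), the following Python computes these counts with the standard Levenshtein dynamic program and aggregates them into the error rate; the function names are invented for this example.

```python
def edit_counts(ref, hyp):
    """Levenshtein alignment: counts of insertions, deletions and
    substitutions needed to turn the hypothesis into the reference."""
    m, n = len(ref), len(hyp)
    # d[i][j] = minimum edits aligning ref[:i] with hyp[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i                                   # all deletions
    for j in range(1, n + 1):
        d[0][j] = j                                   # all insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(diag, d[i - 1][j] + 1, d[i][j - 1] + 1)
    ins = dele = sub = 0
    i, j = m, n
    while i > 0 or j > 0:                             # backtrace to label edits
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            sub += ref[i - 1] != hyp[j - 1]
            i, j = i - 1, j - 1
        elif j > 0 and d[i][j] == d[i][j - 1] + 1:
            ins += 1                                  # extra hypothesis word
            j -= 1
        else:
            dele += 1                                 # missed reference word
            i -= 1
    return ins, dele, sub

def error_rate(pairs):
    """ER over a test set; pairs is a list of (reference, hypothesis) strings."""
    edits = total = 0
    for ref, hyp in pairs:
        r, h = ref.split(), hyp.split()
        edits += sum(edit_counts(r, h))
        total += len(r)
    return edits / total

# Example: one substitution (sat -> sit) and one deletion (down),
# over 4 reference words: ER = 2/4 = 0.5
print(error_rate([("the cat sat down", "the cat sit")]))
```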
NIST STT Benchmark Test History
Remarkable progress in ASR in the last decade
[Figure: NIST STT benchmark history, showing WER on a log scale (100% down to 1%) over time through 2018, across tasks including Switchboard, conversational speech, meeting speech (non-English), broadcast news English (1X, 10X, unlimited), noisy news English, and 1k/5k-word read speech.]
Image from: http://www.itl.nist.gov/iad/mig/publications/ASRhistory/
Statistical Speech Recognition
Let W denote a word sequence. An ASR decoder solves the following problem:

$$W^* = \arg\max_W \Pr(W \mid O) = \arg\max_W \Pr(O \mid W)\,\Pr(W)$$

The acoustic model computes Pr(O | W) using sub-word units Q corresponding to the word sequence W, and the language model P(W) provides a prior probability for W.

Acoustic model: The most commonly used acoustic models in ASR systems today are Hidden Markov Models (HMMs). Please refer to Rabiner (1989) for a comprehensive tutorial of HMMs and their applicability to ASR in the 1980s (with ideas that are largely applicable to systems today). HMMs are used to build probabilistic models for linear sequence labeling problems. Since speech is represented in the form of a sequence of acoustic vectors O, it lends itself to being naturally modeled using HMMs.

The HMM is defined by specifying transition probabilities (a_ij) and observation (or emission) probability distributions (b_j(O_i)), along with the number of hidden states in the HMM. An HMM makes a transition from state i to state j with probability a_ij. On reaching state j, the observation vector at that state is emitted according to b_j(·).

[Figure 2.1: Standard topology used to represent a phone HMM: states 0 to 4, forward transitions a01, a12, a23, a34, self-loops a11, a22, a33, and emission distributions b1(·), b2(·), b3(·) over the observation sequence O1 O2 O3 O4 … OT.]

Example: isolated word recognition with one HMM per word.
• Data splits
- Training data: 30 utterances
- Test data: 20 utterances
• One phone HMM per word, giving Pr(O | "down"), Pr(O | "left"), Pr(O | "right"), …
• Compute arg max_w Pr(O | w)

Small tweak
[Figure: the same model viewed as a graph unrolled over time, with transition variables Tr_{t-1}, Tr_t, phone variables Ph_{t-1}, Ph_t, Ph_{t+1}, and observations O_{t-1}, O_t, O_{t+1}.]
Search within this graph
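To make Pr(O | w) concrete: under an HMM it can be computed with the forward algorithm. Below is a minimal numpy sketch (my own illustration, not course code); the function name, log-space formulation, and the toy decoding snippet are assumptions for this example.

```python
import numpy as np

def forward_log_prob(log_A, log_b, log_pi):
    """Forward algorithm: returns log Pr(O | HMM).

    log_A[i, j] : log transition probability a_ij
    log_b[t, j] : log emission density b_j(O_t) for frame t
    log_pi[i]   : log initial-state probability
    """
    T, N = log_b.shape
    alpha = log_pi + log_b[0]              # alpha_1(j) = pi_j * b_j(O_1)
    for t in range(1, T):
        # alpha_t(j) = b_j(O_t) * sum_i alpha_{t-1}(i) * a_ij  (in log space)
        alpha = log_b[t] + np.logaddexp.reduce(alpha[:, None] + log_A, axis=0)
    return np.logaddexp.reduce(alpha)      # sum over final states

# Isolated-word decoding: score O under each word's HMM and take the argmax.
# scores = {w: forward_log_prob(A[w], b[w], pi[w]) for w in vocabulary}
# w_star = max(scores, key=scores.get)
```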
Small vocabulary ASR
• Task: Recognize utterances in which speakers say one of 1000 words, multiple times per recording.
[Figure: conventional ASR pipeline. An acoustic feature generator converts the speech signal into acoustic features O, and a SEARCH module combines the acoustic model (phones) and the pronunciation model to output the word sequence W*.]
[Figure: the same speech signal O fed to a single model in place of the pipeline.]
Single end-to-end model that directly learns a mapping from speech to text
ASR Progress contd.
• AUG '16: https://www.npr.org/sections/alltechconsidered/2016/08/24/491156218/voice-recognition-software-finally-beats-humans-at-typing-study-finds
• AUG '17: https://www.microsoft.com/en-us/research/blog/microsoft-researchers-achieve-new-conversational-speech-recognition-milestone/
• MAR '19: https://venturebeat.com/2019/04/22/amazons-ai-system-could-cut-alexa-speech-recognition-errors-by-15/
What are some unsolved problems related to ASR?