9 Speech Recognition
01/02/23 2
Outline
Speech Processing
Speech Recognition
Speech Recognition Process
Parameters of Speech Recognition Task
Large-Vocabulary Continuous Speech Recognition
Speech Recognition Architecture
Feature Extraction
Speech Processing
Speech Processing …
Speech Analysis/Synthesis:
Speech analysis makes it possible to identify words and to analyze audio
patterns in order to detect emotion and stress in a speaker's voice.
Speech Synthesis is the artificial production of human speech. The
modern task of speech synthesis, also called text-to-speech or TTS, is to
produce speech (acoustic waveforms) from text input.
Application areas:
Human-computer interaction:
While many tasks are better solved with visual or pointing interfaces,
speech has the potential to be a better interface than the keyboard for
tasks where full natural language communication is useful, or for which
keyboards are not appropriate.
This includes hands-busy or eyes-busy applications, such as where the
user has objects to manipulate or equipment to control.
Telephony:
Speech recognition is already used here, for example in spoken
dialogue systems for entering digits, recognizing “yes” to accept collect
calls, finding out airplane or train information, and call-routing
(“Accounting, please”, “Prof. Regier, please”).
In some applications, a multimodal interface combining speech and
pointing can be more efficient than a graphical user interface without
speech.
Speech Recognition…
Application areas…
Dictation
Automatic Speech Recognition (ASR) can also be applied to
dictation, that is, transcription of extended monologue by a single
specific speaker.
Dictation is common in fields such as law and is also important as
part of augmentative communication (interaction between computers
and humans with some disability resulting in the inability to type, or
the inability to speak).
Speech Recognition…
Speech Recognition Process…
Parameters of Speech Recognition Task…
A second dimension of variation is how fluent, natural, or
conversational the speech is.
Isolated word recognition, in which each word is surrounded by
some sort of pause, is much easier than recognizing continuous
speech.
In continuous speech, words run into each other and have to be
segmented. Continuous speech tasks themselves vary greatly in
difficulty.
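One way to see why isolated words are easier: when every word is surrounded by a pause, the pauses themselves mark the word boundaries. The sketch below is only a toy illustration of that idea, not a method taken from these slides. It splits a signal wherever the short-time energy stays low for a few frames. The function name `split_on_pauses` and the parameters `frame_len`, `threshold`, and `min_pause_frames` are hypothetical choices made for this example.

```python
def split_on_pauses(samples, frame_len=4, threshold=0.01, min_pause_frames=2):
    """Return (start, end) sample-index pairs for stretches separated by pauses.

    Toy energy-based segmentation: a frame counts as "voiced" when its mean
    squared amplitude exceeds `threshold`; a run of at least
    `min_pause_frames` unvoiced frames ends the current segment.
    """
    n_frames = len(samples) // frame_len
    voiced = []
    for k in range(n_frames):
        frame = samples[k * frame_len:(k + 1) * frame_len]
        energy = sum(x * x for x in frame) / frame_len
        voiced.append(energy > threshold)

    segments = []
    start = None      # index of the first frame of the current segment
    silent_run = 0    # consecutive unvoiced frames seen inside a segment
    for k, is_voiced in enumerate(voiced):
        if is_voiced:
            if start is None:
                start = k
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_pause_frames:
                end_frame = k - silent_run + 1  # first frame of the pause
                segments.append((start * frame_len, end_frame * frame_len))
                start = None
                silent_run = 0

    if start is not None:
        # Close a segment still open at the end, dropping trailing silence.
        end_frame = n_frames - silent_run
        segments.append((start * frame_len, end_frame * frame_len))
    return segments


# Two bursts of "speech" separated by a pause of two frames.
signal = [0.0] * 8 + [0.5] * 8 + [0.0] * 8 + [0.5] * 4 + [0.0] * 4
print(split_on_pauses(signal))
```

In continuous speech there is often no low-energy gap between words, so a detector like this returns one long segment and the recognizer has to find the word boundaries itself.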
Parameters of Speech Recognition Task…
A third dimension of variation is channel and noise.
The dictation task (and much laboratory research in speech
recognition) is done with high-quality, head-mounted
microphones.
Parameters of Speech Recognition Task…
The table shows the rough percentage of incorrect words (the
word error rate, or WER) from state-of-the-art systems on
different ASR tasks.
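As a reminder of how the word error rate is usually computed: the recognizer output (hypothesis) is aligned with the reference transcript by minimum edit distance over words, and WER = 100 × (substitutions + deletions + insertions) / (number of words in the reference). The following is a minimal sketch under that standard definition. The function name `word_error_rate` is chosen for this example, and splitting on whitespace is a simplifying assumption (real scoring tools also normalize case and punctuation).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate in percent, via word-level minimum edit distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# One substituted word out of four reference words.
print(word_error_rate("call the front desk", "call the front disk"))
```

Because insertions are counted against the length of the reference, the WER can exceed 100% when the hypothesis contains many extra words.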
Parameters of Speech Recognition Task…
A final dimension of variation is accent or speaker-class
characteristics…
Variation due to noise and accent increases error rates
considerably.
Speech Recognition Architecture…
Feature Extraction
Potential Research Areas in NLP