Chapter 8 - Applications of NLP-Part II
Tessfu Geteye (PhD)
2020/2021 - Semester II
Speech Recognition
Speech Recognition Processes
Types of Automatic Speech Recognition
Optical Character Recognition
Difficulties with ASR
Speech Recognition Approaches
Speech Recognition Performance Evaluation
NLP in Speech Recognition
ASR systems are used for translating speech sequences into the corresponding textual
representation.
Speech → Text (a sequence of words)
Components of an ASR system:
Acoustic Model
Language Model
Lexical Model
Decoder
Acoustic Model:
It is used in ASR to represent the relationship between an audio signal and the
phonemes or other linguistic units that make up speech.
The model is learned from a set of audio recordings and their corresponding
transcripts.
It typically deals with the raw audio waveform of human speech, predicting
which phoneme each segment of the waveform corresponds to, typically at the
character or subword level.
It defines the probability that a basic sound unit, or phoneme, has been uttered.
It represents the relationship between the speech signal and the linguistic or
acoustic units in the language.
Language model:
It defines the probability of a sequence of words, i.e., which word sequences
are likely to occur in the language.
Lexical Model:
It (the pronunciation dictionary) maps each word in the vocabulary to its
sequence of phonemes, linking the acoustic model to the language model.
Decoder:
It combines the acoustic, language, and lexical models: given the feature
vector sequence, it scores the hypothesized word sequences and outputs the
word sequence with the highest score as the recognition result.
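As a compact summary (the standard formulation, not written out explicitly in these slides), for an acoustic feature sequence X the decoder searches for

\hat{W} \;=\; \arg\max_{W} P(W \mid X) \;=\; \arg\max_{W} \, P(X \mid W)\, P(W)

where P(X | W) is scored by the acoustic model (with the lexical model expanding each word into its phoneme sequence) and P(W) is scored by the language model.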
Isolated speech
An isolated word recognition system recognizes single utterances, i.e., single
words.
It is suitable for situations where the user is required to give only one-word
responses or commands, but it is very unnatural for multiple-word inputs.
Connected words
Spontaneous speech
Speaker-dependent systems
They are generally more accurate for the particular speaker, but can be less
accurate for other types of speakers.
These systems are usually cheaper, easier to develop, and more accurate.
Speaking Style
Speaker Sex
Dialects
ASR Approaches
Syntax
Pragmatics
The typical approach is based on a "blackboard" architecture:
At each decision point, lay out the possibilities
Pattern classification: Compare speech patterns with a local distance measure and a
global time alignment procedure (DTW).
Decision logic: similarity scores are used to decide which is the best reference pattern.
The pattern recognition approach has two steps, namely training of speech
patterns and recognition of patterns by way of a pattern classifier.
The pattern recognition approach can be template-based or stochastic
(statistical).
This approach contains many techniques such as:
Dynamic Time Warping (DTW)
Polynomial Classifier
Test pattern, T, and reference patterns, {R1, …, Rv}, are represented by sequences of
feature measurements.
Pattern similarity is determined by aligning test pattern, T, with reference pattern, Rv,
with distortion D(T, Rv)
Decision rule chooses reference pattern, R*, with smallest alignment distortion D(T,
R*).
Dynamic time warping (DTW) is used to compute the best possible alignment
(warp) between T and Rv, and the associated distortion D(T, Rv).
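A minimal NumPy sketch of the DTW alignment and decision rule described above; feature extraction, path constraints, and normalization are simplified, and the toy feature values are illustrative assumptions only.

import numpy as np

def dtw_distance(test, ref):
    """Dynamic time warping distortion D(T, R) between two feature sequences.

    test: (n, d) array of feature vectors for the test pattern T
    ref:  (m, d) array of feature vectors for a reference pattern R
    Returns the accumulated distortion of the best alignment warp.
    """
    n, m = len(test), len(ref)
    # Local distance: Euclidean distance between frame feature vectors.
    local = np.linalg.norm(test[:, None, :] - ref[None, :, :], axis=2)

    # Global time alignment by dynamic programming.
    acc = np.full((n, m), np.inf)
    acc[0, 0] = local[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best_prev = min(
                acc[i - 1, j] if i > 0 else np.inf,                 # step in test only
                acc[i, j - 1] if j > 0 else np.inf,                 # step in reference only
                acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf,   # step in both
            )
            acc[i, j] = local[i, j] + best_prev
    return acc[-1, -1]

def recognize(test, references):
    """Decision rule: choose the reference word with the smallest distortion."""
    return min(references, key=lambda word: dtw_distance(test, references[word]))

# Toy usage with random "feature" sequences standing in for real MFCC vectors.
rng = np.random.default_rng(0)
refs = {"yes": rng.normal(size=(12, 13)), "no": rng.normal(size=(9, 13))}
print(recognize(rng.normal(size=(10, 13)), refs))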
SVM
ASR systems developed using the statistical approach commonly combine GMMs and
HMMs in a hybrid GMM-HMM architecture.
This approach is also called the conventional statistical approach.
It has been the widely used approach for ASR for more than four decades.
In GMM-HMM approach:
GMM – is used for modeling the spectral features of the speech signal.
- is used for estimating the emission probabilities or observation
likelihoods of the HMM states via the expectation-maximization
algorithm.
HMM – is used for modeling the temporal features of the speech signal with
respect to the linguistic units in the development of the ASR system for
a particular language.
- is used for computing the probabilities of observation sequences using the
forward algorithm, and for finding the optimal sequence of HMM states (e.g.,
via the Viterbi algorithm).
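A minimal sketch of the forward algorithm mentioned above, using toy transition and emission tables instead of GMM-based emission models; all numeric values are illustrative assumptions.

import numpy as np

# Toy HMM with 3 states: initial probabilities, transition matrix A, and
# emission likelihoods B[state, observation] (in a GMM-HMM these come from GMMs).
pi = np.array([0.6, 0.3, 0.1])
A = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.7, 0.2],
              [0.1, 0.2, 0.7]])
B = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.2, 0.7]])

def forward(observations):
    """P(observation sequence | HMM) computed with the forward algorithm."""
    alpha = pi * B[:, observations[0]]      # initialization
    for obs in observations[1:]:
        alpha = (alpha @ A) * B[:, obs]     # induction: sum over previous states
    return alpha.sum()                      # termination: sum over final states

print(forward([0, 1, 2, 2]))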
The ANN approach attempts to mechanize the recognition procedure according to
the way a person applies intelligence in visualizing, analyzing, and finally
making a decision about the measured acoustic features.
Deep neural network-based acoustic modeling techniques reduce the limitations
of the GMM-HMM ASR system.
Deep neural networks such as feed-forward and recurrent neural networks are applied:
For acoustic modeling in ASR as a feature extractor for GMM-HMM systems
For replacing GMMs to develop hybrid neural network-HMM systems
For developing ASR in an end-to-end approach
In these networks, the information always travels in one direction (from the
input layer to the output layer via the hidden layers) and never goes backward;
a minimal sketch of such a network follows the list below.
Those networks include:
DNN
CNN
TDNN
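A minimal PyTorch sketch of a feed-forward (DNN) acoustic model of the kind listed above; the layer sizes, the spliced 440-dimensional input, and the number of HMM-state targets are illustrative assumptions.

import torch
import torch.nn as nn

class FeedForwardAcousticModel(nn.Module):
    """Maps a window of acoustic feature frames to HMM-state posteriors.
    Information flows only forward: input -> hidden layers -> output."""
    def __init__(self, input_dim=440, hidden_dim=1024, num_states=3000):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_states),   # one score per HMM state
        )

    def forward(self, x):
        return self.net(x)

# One batch of 8 spliced feature vectors (e.g., 11 frames x 40 filterbanks = 440 dims).
model = FeedForwardAcousticModel()
posteriors = torch.softmax(model(torch.randn(8, 440)), dim=-1)
print(posteriors.shape)  # torch.Size([8, 3000])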
Recurrent DL Networks
RNNs are cyclic networks with self-connections: outputs from previous time
steps are fed back as inputs to the current time step.
These networks capture a dynamic history of information about the input
feature sequences and are less influenced by temporal distortion.
Unlike the feed-forward DL networks, RNNs can take a long sequence of input
features and generate a long sequence of output values.
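A minimal PyTorch sketch of a recurrent acoustic model in the same spirit, using an LSTM as one common RNN variant; the layer sizes and number of HMM-state targets are illustrative assumptions.

import torch
import torch.nn as nn

class RecurrentAcousticModel(nn.Module):
    """Reads a whole sequence of feature frames and emits one prediction per frame,
    carrying history from previous time steps through its recurrent state."""
    def __init__(self, feat_dim=40, hidden_dim=320, num_states=3000):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden_dim, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_states)

    def forward(self, features):            # features: (batch, time, feat_dim)
        outputs, _ = self.rnn(features)     # (batch, time, hidden_dim)
        return self.out(outputs)            # (batch, time, num_states)

model = RecurrentAcousticModel()
scores = model(torch.randn(4, 200, 40))     # 4 utterances, 200 frames each
print(scores.shape)                         # torch.Size([4, 200, 3000])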
Data-sharing DL networks are vital for minimizing the overfitting problems of
unilingual feed-forward networks and RNNs in low-resource ASR.
These networks include multitask, multilingual, and weight-transfer learning
techniques.
Multitask learning is used to improve the overall performance of a learning task
by jointly learning multiple associated tasks.
This helps to transfer knowledge between or among tasks if the tasks
are associated with each other and share an internal representation
by joint learning.
Example: Train ASR system for Amharic and Chaha languages.
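A minimal PyTorch sketch of the multitask idea above: shared hidden layers with one output head per language; the Amharic/Chaha head sizes are illustrative assumptions.

import torch
import torch.nn as nn

class MultitaskAcousticModel(nn.Module):
    """Shared hidden layers learn a common internal representation;
    each associated task (language) gets its own output layer."""
    def __init__(self, feat_dim=40, hidden_dim=512,
                 amharic_states=2000, chaha_states=1500):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.heads = nn.ModuleDict({
            "amharic": nn.Linear(hidden_dim, amharic_states),
            "chaha": nn.Linear(hidden_dim, chaha_states),
        })

    def forward(self, x, task):
        return self.heads[task](self.shared(x))

model = MultitaskAcousticModel()
# During training, batches from both languages update the shared layers jointly.
print(model(torch.randn(8, 40), "amharic").shape)  # torch.Size([8, 2000])
print(model(torch.randn(8, 40), "chaha").shape)    # torch.Size([8, 1500])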
Source languages are typically high-resource languages that have a sufficient
training corpus for training the DL models, while the target language is
usually a low-resource language with a limited training corpus that is
insufficient to train DL models.
This technique allows transferring the weights from DL models trained via the
source languages to train the target language.
The hidden layers of the source DL models are trained using either unilingual
or multilingual training corpora; then the output layers are discarded and
replaced with a new target-language output layer, whose weights and biases are
randomly initialized.
Finally, either all the hidden layers are kept fixed and only the added output
layer is trained, or all the hidden layers and the added output layer are
retrained using a small training dataset of the target language (see the
sketch below).
This technique is important for developing ASR systems for languages that have
very limited training datasets, have no known phone sets, and have no
well-defined orthographic systems.
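A minimal PyTorch sketch of the weight-transfer recipe described above; the layer names, sizes, and number of target-language HMM states are illustrative assumptions.

import torch
import torch.nn as nn

# Hidden layers trained on the (high-resource) source language(s).
source_hidden = nn.Sequential(
    nn.Linear(40, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
)
# ... assume source_hidden.load_state_dict(...) restored pretrained weights here.

# Discard the source output layer and attach a randomly initialized
# output layer for the low-resource target language.
target_output = nn.Linear(512, 800)          # 800 target-language HMM states (assumed)
target_model = nn.Sequential(source_hidden, target_output)

# Option 1: keep the transferred hidden layers fixed, train only the new output layer.
for param in source_hidden.parameters():
    param.requires_grad = False
optimizer = torch.optim.Adam(target_output.parameters(), lr=1e-3)

# Option 2 (alternative): retrain all layers on the small target-language dataset.
# optimizer = torch.optim.Adam(target_model.parameters(), lr=1e-4)

print(target_model(torch.randn(8, 40)).shape)   # torch.Size([8, 800])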
WER = (S + D + I) / N, where:
S - is the number of substitutions
D - is the number of deletions
I - is the number of insertions
N - is the number of words in the reference.
Sometimes the word recognition rate (WRR) is used instead of WER when
describing the performance of speech recognition.
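A minimal pure-Python sketch of computing the WER formula above by edit-distance alignment of a hypothesis against the reference; the example sentences are illustrative.

def wer(reference, hypothesis):
    """WER = (S + D + I) / N via Levenshtein alignment over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = minimum edits turning the first i reference words
    # into the first j hypothesis words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dist[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + sub,   # substitution / match
                             dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1)         # insertion
    return dist[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 errors / 6 words ≈ 0.33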
Speed
It involves automatic conversion of text in an image into letter codes which are
usable within computer and text-processing applications.
On-line character recognition, in contrast, deals with a data stream that
comes from a transducer while the user is writing.
When the user writes on the tablet, the successive movements of the pen are
transformed into a series of electronic signals which are memorized and
analyzed by the computer.
OCR is usually referred to as an off-line character recognition process to mean that the
system scans and recognizes static images of the characters.
OCR Phases
Digitization
Preprocessing
Segmentation
Feature extraction
Post processing
Digitization
The paper document is scanned to produce a digital image of the text.
Preprocessing
Binarization converts the gray-scale or color image into a binary image to
reduce the storage space and increase processing speed.
Size normalization normalizes the characters to a standard size to reduce
size variations.
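A minimal Pillow/NumPy sketch of the two preprocessing steps above; the threshold value, the 32x32 normalized size, and the file name are illustrative assumptions.

import numpy as np
from PIL import Image

def preprocess(path, threshold=128, size=(32, 32)):
    """Normalize the size of a scanned character image and binarize it."""
    gray = Image.open(path).convert("L")                        # gray-scale (0-255)
    gray = gray.resize(size)                                    # size normalization
    binary = gray.point(lambda p: 1 if p > threshold else 0)    # binarization to {0, 1}
    return np.array(binary, dtype=np.uint8)

# char = preprocess("character.png")   # hypothetical input file
# print(char.shape)                    # (32, 32)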
Segmentation:
The position of each character in the image is found and the size of the image
is normalized to the template size.
Feature extraction:
For example, diagonal features, intersection and open end point features,
transition features, zoning features, directional features, parabola curve
fitting-based features, and power curve fitting-based features are used to
find the feature set for a given character.
Post-processing:
It is the final stage in an OCR system, and the most important stage.
It checks the text produced by the previous stage and corrects it to make sure
it is free from errors.
A common deep learning approach that is effective for OCR is the CNN model.
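A minimal PyTorch sketch of a CNN character classifier of the kind referred to above; the 1x32x32 input size and the number of character classes are illustrative assumptions.

import torch
import torch.nn as nn

class CharCNN(nn.Module):
    """Classifies a 1x32x32 binarized character image into one of num_classes characters."""
    def __init__(self, num_classes=300):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # -> 16x16
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # -> 8x8
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

model = CharCNN()
print(model(torch.randn(4, 1, 32, 32)).shape)   # torch.Size([4, 300])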
Recognition rate
Rejection rate
Rejected characters can be flagged by the OCR system, and are therefore easily
retraceable for manual correction.
Error rate
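As a compact summary (standard definitions, assumed here since the slides do not write them out), over a test set of characters:

\text{Recognition rate} = \frac{\text{correctly recognized characters}}{\text{total characters}},\quad
\text{Rejection rate} = \frac{\text{rejected characters}}{\text{total characters}},\quad
\text{Error rate} = \frac{\text{misrecognized characters}}{\text{total characters}}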