
Final Year B.E. Project Report

AUDIO VISUAL SPEECH RECOGNITION USING DEEP LEARNING
Submitted in partial fulfillment of the requirement for the award of the
degree of

BACHELOR OF ENGINEERING IN
ELECTRONICS AND COMMUNICATION ENGINEERING

Submitted by

NAME USN
Gayatri V Shetti 01JST18EC034

Surabhi R 01JST18EC118

Subhash H M 01JST18EC094

Nagaraj Naik 01JST18EC058

Under the guidance of


PROF. SHASHIDHAR R
ASSISTANT PROFESSOR
Department of Electronics & Communication Engineering
SJCE, JSS STU, Mysuru.

DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING


2021-2022
CERTIFICATE
This is to certify that the project work entitled ‘Audio Visual Speech Recognition using Deep Learning’ carried out by GAYATRI
V SHETTI (01JST18EC034), SURABHI R (01JST18EC118), SUBHASH H M
(01JST18EC094), NAGARAJ NAIK (01JST18EC058) bonafide students of Sri
Jayachamarajendra College of Engineering, JSS Science and Technology University, Mysuru,
in partial fulfillment for the award of the degree of Bachelor of Engineering in Electronics
and Communication Engineering during the year 2021-22. It is certified that all
corrections/suggestions indicated during internal assessments have been incorporated in the
final report. The project report has been approved as it satisfies the academic requirements in
respect of the project work prescribed for the degree.

Prof. Shashidhar R Dr. U B Mahadevaswamy


Assistant Professor Professor and Head
Dept. of ECE, SJCE Dept. of ECE, SJCE
JSS STU, Mysuru. JSS STU, Mysuru.
Dr. S B Kivade
Principal and Dean (Engineering & Technology)
SJCE, JSS STU, Mysuru.
External Viva

Name of Examiners Signature with Date

1. ……………………. …………………….
2. ……………………. …………………….
3. ……………………. …………………….

DECLARATION

We, Gayatri V Shetti (01JST18EC034), Surabhi R (01JST18EC118), Subhash H M


(01JST18EC094), Nagaraj Naik (01JST18EC058), students of B.E in Electronics and
Communication Engineering, Sri Jayachamarajendra College of Engineering, JSS Science and
Technology University, Mysuru, do hereby declare that this project entitled ‘Audio Visual Speech Recognition using Deep Learning’ was carried out by us under the guidance of Prof. Shashidhar R, Assistant Professor, Department of ECE, SJCE, JSS STU, Mysuru, in partial fulfillment of the requirement for the award of the degree of Bachelor of Engineering in Electronics and Communication Engineering. We also declare that, to the best of our
knowledge and belief, the matter embodied in this project report has not been submitted
previously by us for any other course.

Place: Mysuru
Date:

Gayatri V Shetti (01JST18EC034)
Surabhi R (01JST18EC118)
Subhash H M (01JST18EC094)
Nagaraj Naik (01JST18EC058)

ACKNOWLEDGEMENT

The success and final outcome of this project required a great deal of guidance and assistance, and we are privileged to have received this support throughout the completion of this project.
We express our sincere gratitude to Dr. U B Mahadevaswamy, Professor and Head of the
Department of Electronics and Communication Engineering, JSS Science and Technology
University, SJCE, Mysuru.
We are grateful to our mentor Prof. Shashidhar R, Assistant Professor, Department of
Electronics and Communication Engineering, JSS Science and Technology University, SJCE,
Mysuru for his constant support and encouragement towards the completion of this project.
We also thank our panel members Prof. B A Sujathakumari, Prof. Pavitra D R, and Prof. M S
Praveen for their support and guidance throughout the project duration.

Project Members

Gayatri V Shetti(01JST18EC034)
Surabhi R(01JST18EC118)
Subhash H M(01JST18EC094)
Nagaraj Naik(01JST18EC058)

ABSTRACT
Audio visual speech recognition (AVSR) is a technique that uses image processing
methods in lip-reading to assist the speech recognition systems. It is a combination of both audio
and visual parts that involves the integration of both lip-reading and speech recognition processes.
One recent trend in image processing is speech-to-text conversion. Deep learning has been widely
used to tackle such audio visual speech recognition (AVSR) problems due to its astonishing
achievements in both speech recognition and image recognition. Speech recognition for disabled
people is a difficult task due to the lack of motor control of the speech articulators. Multimodal
speech recognition can be used to enhance the robustness of disordered speech in which an image
provides contextual information for a spoken caption to be decoded. Thus, the AVSR system can be considered a lifeline for people with hearing impairment, helping them understand the words that are conveyed to them through speech. In the proposed
AVSR system, a custom dataset was designed for English and Kannada languages. Mel Frequency
Cepstral Coefficients technique was used for audio feature extraction and Bidirectional LSTM was
used for audio classification. Long Short-Term Memory (LSTM) method was used for visual
speech recognition. Finally, to integrate the audio and visual streams into a single platform, a deep convolutional neural network was used. From the results, it was evident that the accuracy was …..% for audio speech recognition, ….% for visual speech recognition, and ….% for audiovisual speech recognition for the English dataset, and …..% for audio speech recognition, ….% for visual speech recognition, and ….% for audiovisual speech recognition for the Kannada dataset. These results were better than those of the existing approaches.

Keywords: Audio Visual Speech Recognition, Lip Reading, LSTM, Bidirectional LSTM, Custom
Dataset

CONTENTS

ABSTRACT ……………………………………………………………………………………... 5
CONTENTS ……………………………………………………………………………………... 6
LIST OF FIGURES ……………………………………………………………………………… 7
LIST OF TABLES ……………………………………………………………………………….. 8

Chapter 1: INTRODUCTION ……………………………………………………………………. 10
1.1 Overview ……………………………………………………………………………………... 10
1.2 Motivation ……………………………………………………………………………………. 11
1.3 Problem statement ……………………………………………………………………………. 11
1.4 Objectives …………………………………………………………………………………….. 11
1.5 Chapters Overview …………………………………………………………………………… 12

Chapter 2: LITERATURE SURVEY …………………………………………………………….. 13
2.1 Previous Research ……………………………………………………………………………. 13
2.2 Observation from Literature Review …………………………………………………………. 23

Chapter 3: HARDWARE AND SOFTWARE REQUIREMENTS ……………………………… 24
3.1 Hardware Requirements ……………………………………………………………………… 24
3.2 Software Requirements ………………………………………………………………………. 24

Chapter 4: METHODOLOGY …………………………………………………………………… 27
4.1 Block Diagram ………………………………………………………………………………... 27
4.2 Audio Model ………………………………………………………………………………….. 28
4.3 Video Model ………………………………………………………………………………….. 31
4.4 AVSR Model …………………………………………………………………………………. 35

Chapter 5: RESULTS AND DISCUSSIONS …………………………………………………….. 39
5.1 Evaluation of Audio Model …………………………………………………………………... 39
5.2 Evaluation of Visual Model …………………………………………………………………... 45
5.3 Evaluation of Fusion Model ………………………………………………………………….. 51

Chapter 6: CONCLUSION & FUTURE SCOPE ………………………………………………… 58
6.1 Conclusion ……………………………………………………………………………………. 58
6.2 Future Scopes …………………………………………………………………………………. 58
6.3 Advantages and Limitations ………………………………………………………………….. 59

REFERENCES ……………………………………………………………………………………. 60

LIST OF FIGURES
Figure 4.1.1: Proposed block diagram ………………………………………………………….. 28
Figure 4.2.1: MFCC extraction …………………………………………………………………. 29
Figure 4.3.1: Mouth ROI extraction ……………………………………………………………. 32
Figure 4.3.2: Structure of an LSTM cell ……………………………………………………….. 33
Figure 4.3.3: Video model for English dataset …………………………………………………. 34
Figure 4.3.4: Video model for Kannada dataset ………………………………………………… 34
Figure 4.4.1: AVSR model ……………………………………………………………………… 37
Figure 4.4.2: Feed-forward network ……………………………………………………………. 38
Figure 5.1.1: Training English dataset with audio model ………………………………………. 34
Figure 5.1.2: Training English dataset with audio model ………………………………………. 34
Figure 5.1.2: Accuracy curve for audio model ………………………………………………….. 34
Figure 5.3: Loss curve for audio model …………………………………………………………. 34
Figure 5.4: Confusion matrix of audio model …………………………………………………… 40
Figure 5.5: Classification report of audio model ………………………………………………… 41
Figure 5.6: Training English dataset with video model …………………………………………. 34
Figure 5.7: Accuracy curve for video model …………………………………………………….. 34
Figure 5.8: Loss curve for video model ………………………………………………………….. 34
Figure 5.9: Confusion matrix of video model ……………………………………………………. 40
Figure 5.10: Classification report of video model ……………………………………………….. 41
Figure 5.11: Training English dataset with fusion model ………………………………………... 34
Figure 5.12: Accuracy curve for fusion model …………………………………………………… 34
Figure 5.13: Loss curve for fusion model ………………………………………………………… 34
Figure 5.14: Confusion matrix of fusion model ………………………………………………….. 40
Figure 5.15: Classification report of fusion model ………………………………………………. 41

LIST OF TABLES
Table 5.1: Results of the proposed method compared with the existing method for audio-visual speech recognition ………………………………………………………………………………... 36

ACRONYMS

AVSR Audio Visual Speech Recognition

LSTM Long Short Term Memory

Bi-LSTM Bidirectional Long Short Term Memory

MFCC Mel Frequency Cepstral Coefficients

Chapter 1:

INTRODUCTION
This chapter gives an overview of AVSR and its significance. The steps involved in the AVSR process are also discussed here, followed by the motivating factors. The problem statement is then defined and the objectives are set, which define the course of action for the proposed work.
1.1 Overview
Audio visual speech recognition (AVSR) is a technique that uses image processing methods in lip-
reading to assist the speech recognition systems. It is a combination of both audio and visual parts
that involves the integration of both lip-reading and speech recognition processes. Automatic
speech recognition of overlapped speech is a challenging task. The presence of interfering speakers
produces a large mismatch between clean and overlapped speech and hence shows significant
performance degradation. One recent trend in image processing is speech-to-text conversion. The
purpose of speech enhancement is to extract target speech from a mixture of sounds generated
from several sources. Speech enhancement can potentially benefit from the visual information
from the target speaker, such as lip movement and facial expressions since the visual aspect of
speech is essentially unaffected by the acoustic environment. In order to fuse audio and visual
information, an audio-visual fusion strategy is used, which goes beyond simple feature
concatenation and learns to automatically align the two modalities, leading to a more powerful
representation that increases intelligibility in noisy conditions. Advanced audio-only speech
enhancement algorithms make noisy signals more audible, but the deficiency in restoring
intelligibility remains. Consequently, multi-modal speech enhancement algorithms are demanded
that simulate the audio-visual speech processing mechanism in human contexts, amplify the target
speaker, or filter out acoustic clutter. Audio visual speech recognition (AVSR) uses image
processing capabilities in lip reading to aid speech recognition systems in recognizing non-
deterministic phones or giving preponderance among near probability decisions. Each system of
lip reading and speech recognition works separately, then their results are mixed at the stage of
feature fusion. AVSR, as its name suggests, has two parts: the audio part and the visual part. In the audio part, features such as log-mel spectrograms and MFCCs are extracted from the raw audio samples and a model is built to obtain feature vectors from them. For the visual part, a variant of a deep learning algorithm is generally used to compress the image into a feature vector, after which the two vectors (audio and visual) are concatenated and the target word is predicted. The audio and visual features are merged to get the final text result.

1.2 Motivation:

People with hearing loss have difficulty in hearing and understanding speech. Despite significant
advances in hearing aids and cochlear implants, these devices are frequently not enough to enable
users to hear and understand what is being communicated in different settings. AVSR technology
is one of the solutions to this problem. Hence, we aim to develop a model that will be helpful to people with hearing impairment.

1.3 Problem statement:

Primarily, there are two points of interest in the design of an AVSR application:
1. Obtaining an appropriate representation of the visual speech modality.
2. The effective integration of the acoustic and visual speech modalities in the presence of a
variety of degradations.

1.4 Objectives:

The objectives behind AVSR are:

● Collection of the dataset.
● Develop an algorithm for audio feature extraction.
● Develop an algorithm for audio speech processing.
● Develop an algorithm for lip localization.
● Develop an algorithm for visual speech recognition.
● Develop an algorithm for the Integration of audio and video.
● Compare the proposed result with existing results.

1.5 Chapters Overview


There are totally six chapters in the report each explaining different aspects of the project in detail
with relevant figures.
Chapter 1 gives a brief introduction to the project and an overview of the AVSR system and the steps involved in it, followed by the motivation for carrying out this work and the objectives that we aim to achieve so that the proposed system stays on track.
Chapter 2 is the literature survey that contains an overview of the previous research in the
field and the advantages and disadvantages of each research paper.
Chapter 3 explains the hardware and software components required for the proposed
system.
Chapter 4 provides comprehensive details of the design and implementation. It lists the
architectural design of each module used in the project and explains the methodology using which
the project was completed.
Chapter 5 explains the results obtained from the audio model, the visual model and the audio-visual model for both the English and Kannada datasets, along with the confusion matrix and the
classification report.
Chapter 6 presents the conclusions drawn from the results obtained. Future improvements are listed, and the chapter concludes by mentioning the advantages, tasks accomplished, major takeaways, and general observations.

Chapter 2:
LITERATURE SURVEY
The previous research on Audio Visual Speech Recognition systems is reviewed in detail, along with the description, method used, and accuracy of each work.

2.1 Previous Research:

Karel et al. [1] have proposed a method in which they combine deep image representations for object recognition and scene understanding with representations from an audiovisual affect recognition model. To this set, they add content-agnostic audio-visual synchrony representations and mel-frequency cepstral coefficients (MFCC) to capture other intrinsic properties of audio. These features are used in a modular supervised model. They have used the CoView dataset, which consists of 1500 videos. The extracted features come from a GoogLeNet model trained on the Places365 database, multisensory class activation maps (CAMs) obtained per frame of each video, faces, and MFCCs. To better study the task of highlight detection, a pilot experiment with highlight annotations for a small subset of video clips was done and the best model was fine-tuned on it. An accuracy of 80.2% was obtained for the GoogLeNet model, which is the maximum among the features considered.

Zakaria et al. [2] present an introspection of an audio-visual speech enhancement model. It is shown that visual features provide not only high-level information about speech activity, i.e. speech vs. no speech, but also fine-grained visual information about the place of articulation. Visual representations can be used to discriminate visemes during continuous speech, e.g. rounding lips, stretching lips, and visible teeth. The effectiveness of the learned visual representations for classifying visemes (the visual analogue of phonemes) is demonstrated. As a benchmark, an unweighted accuracy of 49.2% was obtained using a separate VGG-M neural network trained from scratch specifically to detect visemes, which suggests that the self-supervised visual features were able to close a large proportion of the performance gap. This demonstrates the efficacy of audio-visual speech enhancement as a self-supervised task for learning strong visual features. The paper also shows that the performance of enhancement models varies depending on what is being articulated, and that the addition of visual cues provides inconsistent gains in performance depending on what is being articulated.

Afouras et al. [3] have proposed a method where two models for lip-reading are compared,
one using a CTC loss, and the other using a sequence-to-sequence loss. Both models are built on
top of the transformer self-attention architecture. The extent to which lip reading is complementary
to audio speech recognition, especially when the audio signal is noisy is investigated. A new
dataset for audio-visual speech recognition, LRS2-BBC, consisting of thousands of natural
sentences from British television is introduced in this model. The best performing network is TM-
seq2seq, which achieves a word error rate of 48.3% on LRS2-BBC when decoded with a language
model. This model also sets a baseline for LRS3-TED at 58.9%. It is finally demonstrated that
visual information helps improve speech recognition performance even when the clean audio
signal is available. Especially in the presence of noise in the audio, combining the two modalities
leads to a significant improvement.

Xinmeng et al. [4] proposed an audio-visual fusion strategy in which they attempt to fuse audio and visual information in a way that goes beyond simple feature concatenation and learns to automatically align the two modalities, leading to a more powerful representation that increases intelligibility in noisy conditions. The proposed model fuses audio-visual features layer by layer and feeds these audio-visual features to each corresponding decoding layer. Experimental results show a relative improvement from 6% to 24% on test sets over the audio modality alone, depending on the audio noise level. Moreover, a significant increase in PESQ from 1.21 to 2.06 was observed in the -15 dB SNR experiment. The proposed model consistently improves the quality and intelligibility of noisy speech, and the experimental results show that MFFCN has better performance than recent audio-only models and also demonstrates an obvious improvement on highly noisy speech enhancement.

Xia et al. [5] have presented a detailed review of recent advances in the audiovisual speech
recognition area. Main emphasis is given on typical audiovisual speech database descriptions in
terms of single view and multi-view, since the public databases for general purpose should be the
first concern for audiovisual speech recognition tasks. Datasets with varying backgrounds are
considered. Using an HMM classifier and MFCC for feature extraction, an accuracy of 90% is obtained for an SNR of 10 dB. For the MFCC classifier with the GAFS feature extraction method, 93% accuracy is
obtained. In this paper several problems that confront the designers of such complex speech
recognition systems have been stated and elaborately discussed.

Gao et al. [6] have come up with a new approach for audio-visual speech separation. Given
a video, the goal is to extract the speech associated with a face in spite of simultaneous back-
ground sounds and/or other human speakers. Existing methods focus on learning the alignment between the speaker's lip movements and the sounds they generate. The authors propose to leverage the speaker's facial appearance as an additional prior to isolate the corresponding vocal qualities the speaker is likely to produce. The dataset contains over 1 million utterances with the associated face tracks extracted from YouTube videos, with 5,994 identities in the training set and 118 identities in the test set. For speech enhancement experiments, speech mixtures are additionally mixed with non-speech audio from AudioSet as background noise during both training and testing. The output of the model consists of voices separated from the original test video in terms of masking the input spectrogram, as opposed to being generated or machine synthesized.

Lalonde et al. [7] have presented a review of the evidence regarding infants' and children's use of temporal and phonetic mechanisms in the audiovisual speech perception benefit.
Sensitivity to the correspondence between auditory and visual speech cues is apparent shortly after
birth and is observed throughout the first year of life. This early sensitivity contrasts with
protracted development—into adolescence—in the ability to use visual speech to compensate for
the noisy nature of our everyday world. Careful experimental design is necessary to determine
what cues are used on any particular audiovisual speech perception task. Stimulus manipulations
that decrease the auditory or visual cues available to observers help to disambiguate what cues are
being used, as do comparisons across tasks differing in complexity.

Simon et al. [8] have proposed a paper based around the well-known hidden Markov model (HMM) classifier framework for modeling speech. The main impetus of practical audio-visual integration is to dampen the independent errors resulting from the mismatch, rather than trying to model any bimodal speech dependencies. It is shown empirically and theoretically that late integration (LI) has benefits over other integration strategies in terms of classifier flexibility and its ability
to dampen independent errors coming from either modality. The benefits of a hybrid combination
scheme have been highlighted for addressing another mode of AVSP operation. In the paper it has
been shown that the sum rule naturally results as an approximation to the product rule when
confidence errors are present, provided the true a posteriori probabilities are conditionally
independent. Future work is to address the use of 2-D state histograms as a possible avenue for
further identification and verification performance in an AVSP application.

Liu et al. [9] have proposed a lip graph assisted AVSR method with bidirectional
synchronous fusion. The experimental results on the LRW-BBC dataset show that the method
outperforms the end-to-end AVSR baseline method in both clean and noisy conditions. An audio
stream is used to extract audio features from the acoustic signal and a hybrid visual stream to
extract visual features from visual signals. A bidirectional synchronous fusion is applied to fuse
the audio feature and visual feature. Compared to the baseline method this method shows
significant improvement. A graph branch is proposed to extract additional shape-based features,
and then combined with the image branch to extract more discriminative visual features. An
attention-based bidirectional sync block is proposed to achieve more reliable audio and visual
synchronization and boost the ability to explore the correlation between the two modalities. The
proposed model has got 84.25% of accuracy compared to different visual models on the LRW
dataset.

Weijiang et al. [10] have proposed a multimodal recurrent neural network (multimodal RNN) model to take into account the sequential characteristics of both audio and visual modalities for AVSR. In particular, the multimodal RNN includes three components, i.e., an audio part, a visual part, and a fusion part, where the audio part and visual part capture the sequential characteristics of the audio and visual modalities, respectively, and the fusion part combines the outputs of both modalities. Here they modeled the audio modality using an LSTM RNN, modeled the visual modality using a convolutional neural network (CNN) plus an LSTM RNN, and combined both models with a multimodal layer in the fusion part, which resulted in an accuracy of 87.7%. They validated the effectiveness of the proposed multimodal RNN model on a multi-speaker AVSR benchmark dataset termed AVletters. The experimental results show performance improvements compared to the known highest audio-visual recognition accuracies on AVletters, and confirm the robustness of their multimodal RNN model.

Soonkyu et al. [11] describe audio-to-visual conversion techniques where the audio signals
are automatically converted to visual images of mouth shape. The visual speech can be represented
as a sequence of visemes, which are the generic face images corresponding to particular
sounds. They use HMMs (hidden Markov models) to convert audio signals to a sequence of visemes. In this paper, they compared two approaches to using HMMs. In the first approach, an HMM is trained for each viseme, and the audio signals are directly recognized as a sequence of visemes. In the second approach, each phoneme is modeled with an HMM, and a general phoneme recognizer is utilized to produce a phoneme sequence from the audio signals. The phoneme sequence is then converted to a viseme sequence. The two approaches were tested on the TIMIT speech corpus. The viseme recognizer shows a 33.9% error rate, and the phoneme-based approach exhibits a 29.7% viseme recognition error rate.

Aggelos et al. [12] proposed an article that reviews recent results in audiovisual fusion and discusses the main challenges in the area, with a focus on desynchronization of the two modalities and the issue of training and testing where one of the modalities might be absent from testing. They describe the feature extraction step, which includes spectrum-based features such as mel-frequency cepstral coefficients (MFCCs) and linear predictive coding (LPC), phoneme posterior features, and prosodic features. Fusion can be performed at different levels, i.e. early integration, intermediate integration, and late integration. They discuss some of the dominant fusion techniques: support vector machines (SVMs), dynamic Bayesian networks (DBNs), hidden Markov models (HMMs), and Kalman filters. They then describe some of the challenges in fusing audio and video streams. Lastly, they review the approaches followed in addressing some of the challenges in AV fusion and present two recent approaches toward it, namely deep learning (multimodal fusion learning, cross-modality learning, and shared-representation learning) and multiview learning.

Dana et al. [13] presented a paper which promotes various challenges in multimodal data fusion at the conceptual level, without focusing on any specific model, method, or application. They discuss the need for multimodality and the different challenges, i.e. challenges imposed by the data, by data fusion, by the model (choosing an analytical model that faithfully represents the link between modalities and yields a meaningful combination), and lastly limitations on theoretical validation. The paper focuses on understanding and identifying the particularities of multimodal data, as opposed to other types of aggregated datasets. A second message is that the encountered challenges are ubiquitous, hence the incentive that both challenges and solutions be discussed at a level that brings together all involved communities.

Hui et al. [14] proposed WaveNet with a cross-attention mechanism for audio-visual automatic speech recognition (AV-ASR) to address the multimodal feature fusion and frame alignment problems between the two data streams. WaveNet is usually used for speech generation and speech recognition; in this paper, however, it is extended to audio-visual speech recognition, and the cross-attention mechanism is introduced at different places in WaveNet for feature fusion. The proposed cross-attention mechanism tries to explore the frames of the visual feature that are correlated with each acoustic feature frame. The experimental results show that WaveNet with cross-attention can reduce the Tibetan single-syllable error by about 4.5% and the English word error by about 39.8% relative to audio-only speech recognition, and reduce the Tibetan single-syllable error by about 35.1% and the English word error by about 21.6% relative to the conventional feature concatenation method for AV-ASR.

Bertrand et al. [15] proposed an article that provides an overview of the key methodologies
in AV speech source separation building from early methods that simply use the visual modality
to identify speech activity to sophisticated techniques which synthesize a full AV model. The LPC
method is used to model noisy speech. The audio feature based on the LPC inverse filtered
spectrum is fused with the visual features such as the lip width (LW) and height (LH) for enhancing
the LPC spectrum of the noisy speech. The enhanced speech signal can therefore be obtained based
on the LPC enhanced filter and the residual signal is obtained from the inverse filtering of the
noisy speech. AV methods for speech enhancement/separation: methods based on visual voice
activity are Spectral subtraction, AV post processing of audio ICA, Extraction based on temporal
voice activity. Methods based on visual scene analysis are AV beamforming and AV T-F masking.
Methods based on a full joint AV model are Maximization of AV likelihood, AV Regularization
of ICA, AV post processing of audio ICA, AVDL + T-F masking.

Themos et al. [16] proposed an end-to-end deep learning architecture for word-level visual speech recognition. The system is a combination of spatiotemporal convolutional, residual and bidirectional Long Short-Term Memory networks. A challenging database with a 500-word target vocabulary, consisting of 1.28-second video excerpts from BBC TV broadcasts, was considered. The network consists of a 3D convolutional front-end, a ResNet and an LSTM-based back-end, and is trained using an aggregated per-time-step loss; it was evaluated on the LRW database. The authors demonstrated the importance of each building block of the network as well as the gain in performance attained by training the network end-to-end. The proposed network yielded 83.0% word accuracy, which corresponds to less than half the error rate of the baseline VGG-M network and a 6.8% absolute improvement over the state-of-the-art 76.2% accuracy attained by an attentional encoder-decoder network.

Alexandros et al. [17] studied the problem of multiview lip-reading. They used the View2View approach with the OuluVS2 dataset. A deep-learning based image mapping
approach was proposed as a solution, transforming a non-frontal view of the speaker’s mouth
region to a frontal view. Their approach made no assumptions on the nature of the frontal visual
front-end system it was applied to, and thus can be used on already existent lipreading systems.
They performed a number of experiments under different setups to compare against other
solutions. Their results showed that the view-mapping approach works well in practice for visual-
only speech recognition, audio-visual speech recognition, and for a realistic scenario where the
view is not known. The encoder-decoder convolutional model used is a generic model that does
not take into account the distinctive geometry of the mouth region. Incorporating constraints on
the models’ output could potentially improve convergence time and the quality of the generated
image.

Daniel et al. [18] presented an extensive analysis of the impact of the Lombard effect on audio, visual and audio-visual speech enhancement systems based on deep learning. They conducted several experiments using a database consisting of 54 speakers and showed the general benefit of training a system with Lombard speech. They first trained systems with Lombard or non-Lombard speech and evaluated them on Lombard speech, adopting a cross-validation setup. They then trained systems using Lombard and non-Lombard speech and compared them with systems trained only on non-Lombard speech. They also performed subjective listening tests with audio-visual stimuli, in order to evaluate the systems in a situation closer to the real-world scenario, where the listener can see the face of the talker.

Onno et al. [19] proposed a fusion method, based on deep neural networks, to predict
personality traits from audio, language and appearance. They have seen that each of the three
modalities contains a signal relevant for personality prediction, that using all three modalities
combined greatly outperforms using individual modalities, and that the channels interact with each
other in a non-trivial fashion. By combining the last network layers and fine-tuning the parameters
they obtained the best result, average among all traits, of 0.0938 Mean Square Error, which is 9.4%
better than the performance of the best individual modality (visual). Out of all modalities, language
or speech pattern seems to be the least relevant. Video frames (appearance) are slightly more
relevant than audio information (i.e. non-verbal parts of speech). Accuracy of 88.47% was
obtained.

Kaun et al. [20] investigate new ways to combine multimodal data that account for the heterogeneity of signal strength across modalities, both in general and at a per-sample level. The paper focuses on addressing the challenge of "weak modalities": some modalities may provide better predictors on average, but worse ones for a given instance. To exploit these facts, the authors propose multiplicative combination techniques to tolerate errors from the weak modalities and help combat overfitting. The paper further proposes a multiplicative combination of modality mixtures to combine the strengths of the proposed multiplicative combination and the existing additive combination.

Bertrand et al. [21] proposed a new statistical AV model expressing the complex relationship between two basic lip video parameters and acoustic speech parameters, which consist of the log-modulus of the coefficients of the short-time Fourier transform. The series of experiments presented in the paper, using the new AV model, confirm the interest of using AV processing to solve these ambiguities. The method required integration over a large number of frames to obtain a good estimation of the permutations. It is very efficient at estimating large blocks of consecutive permuted frequencies (20 to 40 frames are enough in this case); however, over 100 frames are generally necessary to find isolated permuted frequencies. The presented method uses criteria which only consider one source of interest.

Shiyang et al. [22] presented a method for pose-invariant lip-reading by constructing large-pose synthetic data. The proposed approach is based on a 3DMM, which allows a frontal facial image to be taken and the face rendered in any arbitrary pose. Augmenting the training set with this method results in improved performance when training on the mostly frontal LRW database and testing on the LRS2 database, which contains a variety of poses. The highest accuracy was obtained by a model trained on the combined training set of the LRW and LRS2 datasets.

Mohammad et al. [23] investigated three different types of visual features, from both the image-based and model-based categories, for a lip-reading task. The simple raw gray-level information of the lips region of interest (ROI), the geometric representation of the lip shape, and the deep bottleneck features (DBNFs) extracted from a 6-layer deep autoencoder neural network (DANN) are the three feature sets compared for the lip-reading purpose. Two different recognition systems, the conventional GMM-HMM and the state-of-the-art DNN-HMM hybrid, are utilized to perform an isolated and
connected digit recognition task. The individual section of the database is divided into two parts:
86% of the database is devoted to the training and the other 14% to the test set. DBNFs showed a
relative improvement with an average of 15.4% in comparison to the shape features and the shape
features showed a relative improvement with an average of 20.4% in comparison to the ROI
features over the test data.

Prajwal et al. [24] explore the task of lip-to-speech synthesis, learning to generate natural speech given only the lip movements of a speaker. They collect and publicly release a 120-hour video dataset of 5 speakers uttering natural speech in unconstrained settings. The Lip2Wav dataset contains 800× more data per speaker than the current multi-speaker datasets to facilitate accurate modeling of speaker-specific audiovisual cues. The sequence-to-sequence modeling approach produces speech that is almost 4× more intelligible in unconstrained environments compared to previous works. They also train the Lip2Wav model on the GRID corpus and the TCD-TIMIT lip speaker corpus. Next, they train on all five speakers of their newly collected speaker-specific Lip2Wav dataset. For training on small datasets like GRID and TIMIT, they halve the hidden dimension to prevent overfitting. They set the training batch size to 32 and train until the mel reconstruction loss plateaus for at least 30K iterations. They experiment with replacing the encoder module while keeping the speech decoder module intact, and find that the best performance is obtained with a 3D-CNN encoder, which captures both the spatial and temporal information in unconstrained settings.

Zhiyong et al. [25] explore the fusion of audio and visual evidence through a multi-level hybrid fusion architecture based on a dynamic Bayesian network (DBN), which combines model-level and decision-level fusion to achieve higher performance. The DBN offers a
flexible and extensible means of modeling the feature-based and temporal correlations between
audio and visual cues for speaker identification. The CMU database includes 10 subjects (7 males
and 3 females) speaking 78 isolated words repeated 10 times. These words include numbers,
weekdays, months, and others that are commonly used for scheduling applications. Artificial white
Gaussian noise was added to the original audio data (SNR=30dB) to simulate various SNR levels.
The models were trained at 30dB SNR and tested under SNR levels ranging from 0dB to 30dB at
10 dB intervals. They applied cross-validation for every subject's data, i.e. 90% of all the data are used as a training set and the remaining 10% as a testing set. This partitioning was repeated until
all the data had been covered in the testing set. The experiments on the audio-visual bimodal
speaker identification demonstrate that the AVCM model improves the identification accuracies
compared to the previous methods.

Wentao et al. [26] note that for many small- and medium-vocabulary tasks, audio-visual speech recognition can significantly improve the recognition rates compared to audio-only systems. The training set contains 45,839 spoken sentences and 17,660 words, with a test set of 1,243 sentences and 1,698 words. To analyze the performance in different acoustic noise conditions, they artificially created noisy versions of the LRS2 database. The audio model uses 13-dimensional MFCCs as features. MFCCs are extracted with a 25 ms frame size and a 10 ms frame shift. The video frame is 40 ms long without overlap. The mouth region is detected via OpenFace. The audio-only model (AO) has much better performance than the video shape (VS) and video-appearance (VA) models alone. Early integration (EI) can already improve the WER at lower SNR conditions (≤ 0 dB), but there is no improvement when comparing the average WER over all SNRs. Improving the performance of large-vocabulary speech recognition through the inclusion of video data has remained challenging despite much progress in deep learning models for speech recognition and image processing. The paper addresses this issue by learning an explicit stream integration network for audio-visual speech recognition.

Petridis et al. [27] have used a hybrid CTC/attention architecture for audio-visual recognition of speech in the wild, using the LRS2 database. The proposed audio-visual model leads to a 1.3% absolute decrease in word error rate compared to the audio-only model, with an overall accuracy of 83%. This audio-visual model significantly outperforms the audio-based model (up to a 32.9% absolute improvement in word error rate) for several different types of noise as the signal-to-noise ratio decreases.

A Recurrent Neural Network (RNN) based AVSR is proposed in this research. Here the audio feature mechanism is modelled by Mel-frequency Cepstrum Coefficients (MFCC) and further processed by an RNN system, whereas the visual feature mechanism is modelled by Haar-Cascade detection with OpenCV and, again, further processed by an RNN system [20]. Then, both of these extracted features were integrated by a multimodal RNN-based feature-integration mechanism.

2.2 Observation from Literature Review:


After performing an extensive literature survey, it was observed that AVSR can be done in two ways: early fusion and late fusion [8]. Acoustic and visual data are taken separately from the dataset and are then fused together to get the final result. We intend to make use of late fusion techniques, wherein the audio data classification and visual data classification are first done and then the results are integrated to get the final text output. It is also seen that visual frames are more relevant than audio frames; we are considering both for the classification. For extracting audio features, Mel frequency cepstral coefficients (MFCCs) are best suited [1][12].
feature sequences can be processed with a conventional hidden Markov model (HMM) with a
Gaussian mixture observation model (GMM-HMM) to conduct an isolated word recognition task.
To perform an AVSR task by integrating both audio and visual features into a single model, a
multi-stream hidden Markov model (MSHMM) can be used. The main advantage of the MSHMM
is that the observation information source can be explicitly selected (i.e., from audio input to visual
input) by controlling the stream weights of the MSHMM depending on the reliability of
multimodal inputs. When the reliability of the audio information is degraded, the isolated word recognition performance can be improved by utilizing visual information. When audio and visual features are utilized separately for isolated word recognition tasks, the accuracy that can be obtained is lower than the accuracy achieved using multimodal recognition.
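To make the late-fusion idea concrete, the following is a minimal illustrative sketch (not code from this project) of combining the class probabilities of separately trained audio and visual classifiers with a reliability weight, analogous to the stream weights of an MSHMM; the weight and the three-class arrays are arbitrary examples.

import numpy as np

def late_fusion(audio_probs, visual_probs, audio_weight=0.6):
    """Weighted combination of per-class probabilities from the two modalities."""
    fused = audio_weight * audio_probs + (1.0 - audio_weight) * visual_probs
    return int(np.argmax(fused))            # index of the predicted word class

# Example: softmax outputs of hypothetical audio and visual models for 3 classes.
audio_probs = np.array([0.10, 0.70, 0.20])
visual_probs = np.array([0.20, 0.40, 0.40])
print(late_fusion(audio_probs, visual_probs))   # -> 1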

Chapter 3:
HARDWARE AND SOFTWARE REQUIREMENTS
The hardware and software components utilized to complete the project are listed along with their specifications.

3.1 Hardware Requirements

The proposed methodology includes training on large dataset files followed by testing and validation of test samples, all of which require robust and efficient processing units. We make use of Windows 10 PCs. For handling advanced Machine Learning and Neural Network techniques,
this system requires a PC with at least an Intel Core i5 processor, and at least 8GB RAM. To store
the massive quantity of datasets used for training and testing the models, a minimum of 25GB of
storage space is required.

3.2 Software Requirements

1. Python: Python is an interpreted, high-level, general-purpose programming language. Its design philosophy emphasizes code readability, notably through its significant use of indentation. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for both small-scale and large-scale projects. Python is dynamically typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly procedural), functional, and object-oriented programming. Python is often described as a "batteries included" language because of its comprehensive standard library.
2. Tensorflow: TensorFlow is a very popular open-source library for high-performance numerical computation developed by the Google Brain team at Google. As the name
suggests, Tensorflow is a framework that involves defining and running computations
involving tensors. It can train and run deep neural networks that can be used to develop
several AI applications. TensorFlow is widely used in the field of deep learning research
and application.
3. Numpy: NumPy is a very popular Python library for large multi-dimensional array and
matrix processing, with the help of a large collection of high-level mathematical functions.
It is very useful for fundamental scientific computations in Machine Learning. It is
particularly useful for linear algebra, Fourier transform, and random number capabilities.
High-end libraries like TensorFlow use NumPy internally for the manipulation of tensors.
4. SciPy: SciPy is a very popular library in Deep Learning as it contains different modules
for optimization, linear algebra, integration, and statistics. There is a difference between
the SciPy library and the SciPy stack. The SciPy library is one of the core packages that make up
the SciPy stack. SciPy is also very useful for image manipulation.
5. Dlib: Dlib is a versatile and widely used facial recognition library, with perhaps an ideal balance of resource usage, accuracy and latency, suited for real-time face recognition. Dlib
offers a wide range of functionality across a number of machine learning sectors, including
classification and regression, numerical algorithms such as quadratic program solvers, an
array of image processing tools, and diverse networking functionality, among many other
facets. Dlib also features robust tools for object pose estimation, object tracking, face
detection (classifying a perceived object as a face) and face recognition (identifying a
perceived face).
6. Speech Recognition: A library for performing speech recognition, with support for several engines and APIs, online and offline. It offers easy audio processing and microphone accessibility.
7. Scikit-learn: Scikit-learn is one of the most popular ML libraries for classical ML
algorithms. It is built on top of two basic Python libraries, viz., NumPy and SciPy. Scikit-
learn supports most of the supervised and unsupervised learning algorithms. Scikit-learn
can also be used for data mining and data analysis, which makes it a great tool for anyone who is starting out with ML.
8. Keras: Keras is a very popular deep learning library for Python. It is a high-level neural
networks API capable of running on top of TensorFlow, CNTK, or Theano. It can run
seamlessly on both CPU and GPU. Keras is used to build and design a Neural Network.
One of the best things about Keras is that it allows for easy and fast prototyping.
9. Pandas: Pandas is a popular Python library for data analysis. It is not directly related to
Machine Learning. As we know, the dataset must be prepared before training. In this case, Pandas comes in handy, as it was developed specifically for data extraction and preparation. It provides high-level data structures and a wide variety of tools for data analysis, and it offers many inbuilt methods for grouping, combining, and filtering data.
10. Matplotlib: Matplotlib is a very popular Python library for data visualization. Like Pandas,
it is not directly related to Machine Learning. It particularly comes in handy when a
programmer wants to visualize the patterns in the data. It is a 2D plotting library used for
creating 2D graphs and plots. A module named pyplot makes it easy for programmers for
plotting as it provides features to control line styles, font properties, formatting axes, etc.
It provides various kinds of graphs and plots for data visualization, viz., histograms, error charts, bar charts, etc.
11. Librosa: Librosa is a Python package for music and audio analysis. Librosa is basically used for working with audio data, as in music generation and automatic speech recognition. It provides the building blocks necessary to create music information retrieval systems. Librosa helps to visualize audio signals and also to perform feature extraction on them using different signal processing techniques.

Chapter 4:
METHODOLOGY
This chapter describes the architecture and methods used in this project along with the implementation steps, and explains the basic deep learning terminology used in the proposed methodology and its significance.

4.1 Block Diagram


The research is carried out in four steps: dataset creation, followed by building the three models listed below. The block diagram of Audio-Visual Speech Recognition is shown in Figure 4.1.1.
● Audio Model
● Visual Model
● Fusion Model
First, a dataset is created and the audio and video features are extracted. Then, using deep learning algorithms, models are created for classification using only audio, using only video, and using both audio and video.

4.1.1 Dataset Creation


To implement AVSR, a custom database was created in two languages, English and Kannada, comprising nine and eight different words respectively. The nine English words are 'About', 'Bad', 'Bottle', 'Come', 'Cow', 'Good', 'Pencil', 'Read' and 'Where'. The eight Kannada words are 'Avanu', 'Bagge', 'Bari', 'Guruthu', 'Helidha', 'Hodhu', 'Hogu' and 'Howdu'. As the first step, the dataset is created with this specification. Each video in the dataset has a resolution of 1920 x 1080. These videos are then clipped so that each video contains the 1-second duration in which the word is uttered, and the frame rate is adjusted to 30 fps. The dataset is then divided in two (train and validation) in such a way that the validation set is nearly equal to 25% of the data and the training set contains the remaining 75%.
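As an illustration of this 75%/25% split, the following is a minimal sketch; it is not the authors' actual script, and the directory layout data/<word>/<clip>.mp4 and the output folder names are assumptions.

import os
import random
import shutil

random.seed(42)
SRC, DST = "data", "dataset_split"           # hypothetical input and output folders

for word in os.listdir(SRC):                  # one folder per uttered word
    clips = [f for f in os.listdir(os.path.join(SRC, word)) if f.endswith(".mp4")]
    random.shuffle(clips)
    n_val = max(1, int(0.25 * len(clips)))    # roughly 25% held out for validation
    for split, subset in (("validation", clips[:n_val]), ("train", clips[n_val:])):
        out_dir = os.path.join(DST, split, word)
        os.makedirs(out_dir, exist_ok=True)
        for clip in subset:
            shutil.copy(os.path.join(SRC, word, clip), os.path.join(out_dir, clip))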

Figure 4.1.1: Proposed Block diagram
The proposed model takes audio-visual data as input. The input audio-video signal is split into audio and video channels; the audio data is used for audio speech processing and the video data for visual speech processing. The important steps involved for both the audio and visual data are preprocessing, feature extraction, training, testing and validation. Image frames are extracted from the visual data and feature extraction is done on them. Feature extraction is performed on the noise-free audio signal, and the features are then fed to the classifier to obtain the audio-only output in text form. For the visual data, region-of-interest extraction and lip localisation are performed; the extracted features are then fed into the classifier and the visual-only output in the form of text is obtained. The acoustic features and visual features are then integrated and fed to a model to get the final text output. The model used here involves an early fusion technique wherein the audio and visual data are processed separately and, after feature extraction, they are fused together to get the final output.
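The first stage of this pipeline, splitting an input clip into its audio channel and its video frames, can be sketched as follows. The report does not state which tools were used for this step, so ffmpeg and OpenCV are assumptions here, and the clip path is only an example.

import subprocess
import cv2

clip = "dataset_split/train/about/about_01.mp4"     # hypothetical example clip

# Audio channel: extract a 16 kHz mono .wav track with ffmpeg.
subprocess.run(["ffmpeg", "-y", "-i", clip, "-ac", "1", "-ar", "16000", "audio.wav"],
               check=True)

# Video channel: read the individual frames for visual speech processing.
frames = []
cap = cv2.VideoCapture(clip)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    frames.append(frame)
cap.release()
print(f"Extracted {len(frames)} frames and audio.wav from {clip}")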

4.2 Audio Model


First, audio files are created from the video dataset and saved in .wav format. Then the features are extracted from the audio using MFCC (Mel-frequency cepstral coefficients) with Librosa, an open-source module available in Python.
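A minimal sketch of this extraction step with Librosa is given below; the number of coefficients (13), the 16 kHz sampling rate and the fixed-length padding are assumptions, since the report does not state the exact settings used.

import numpy as np
import librosa

def wav_to_mfcc(path, n_mfcc=13, max_frames=100):
    signal, sr = librosa.load(path, sr=16000)                        # load and resample the clip
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc).T    # (time_frames, n_mfcc)
    # Pad or truncate to a fixed number of frames so clips can be batched.
    if mfcc.shape[0] < max_frames:
        mfcc = np.pad(mfcc, ((0, max_frames - mfcc.shape[0]), (0, 0)))
    return mfcc[:max_frames]

features = wav_to_mfcc("audio.wav")     # hypothetical .wav clip from the previous step
print(features.shape)                   # (100, 13)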
Mel-frequency cepstrum coefficients (MFCC)
In sound processing, the mel-frequency cepstrum (MFC) is a representation of the short-term
power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a non-
linear mel scale of frequency. Mel-frequency cepstral coefficients (MFCCs) are coefficients that
collectively make up an MFCC. They are derived from a type of cepstral representation of the
audio clip (a nonlinear "spectrum-of-a-spectrum").

Figure 4.2.1: MFCC Extraction

The Mel Frequency Cepstral Coefficients of a signal are a small set of features, usually about 10-20 values, which concisely describe the overall shape of the spectral envelope. The envelope of the time power spectrum of a speech signal is representative of the vocal tract, and MFCC accurately represents this envelope.
● Pre-emphasis: Pre-emphasis refers to filtering that emphasizes the higher frequencies. Its
purpose is to balance the spectrum of voiced sounds that have a steep roll-off in the high-
frequency region.
● Frame blocking and windowing: The speech signal is a slowly time-varying
or quasi-stationary signal. For stable acoustic characteristics, speech needs to be
examined over a sufficiently short period of time. Therefore, speech analysis must
always be carried out on short segments across which the speech signal is assumed
to be stationary.
● DFT spectrum: Each windowed frame is converted into magnitude spectrum by
applying DFT.
$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j2\pi nk/N}, \qquad 0 \le k \le N-1 \tag{4.2.1}$$
where N is the number of points used to compute the DFT.

● Mel spectrum: Mel spectrum is computed by passing the Fourier transformed signal
through a set of band-pass filters known as a mel filter bank. A mel is a unit of measure based on the human ear's perceived frequency.
● Discrete cosine transform (DCT): Since the vocal tract is smooth, the energy levels in
adjacent bands tend to be correlated. Applying the DCT to the transformed mel frequency
coefficients produces a set of cepstral coefficients. Prior to computing the DCT, the mel
spectrum is usually represented on a log scale. This results in a signal in the cepstral domain
with a quefrency peak corresponding to the pitch of the signal and a number of formants
representing low-frequency peaks. Since most of the signal information is represented by
the first few MFCC coefficients, the system can be made robust by extracting only those
coefficients, ignoring or truncating the higher-order DCT components. Finally, the MFCC is
calculated as

$$c(n) = \sum_{m=1}^{M} \log_{10}\!\big(s(m)\big)\,\cos\!\left(\frac{\pi n\,(m-0.5)}{M}\right), \qquad n = 0, 1, 2, \dots, C-1 \tag{4.2.2}$$

where c(n) are the cepstral coefficients, s(m) is the output of the m-th filter in the mel filter bank, M is the number of filters, and C is the number of MFCCs. Traditional
MFCC systems use only 8–13 cepstral coefficients. The zeroth coefficient is often excluded
since it represents the average log-energy of the input signal, which only carries little
speaker-specific information.
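The steps listed above can be combined in a compact NumPy sketch of the MFCC computation; the frame length, hop size, FFT size and filter count below are illustrative choices rather than the project's documented settings.

import numpy as np
import librosa
from scipy.fftpack import dct

def mfcc_from_scratch(signal, sr=16000, frame_len=400, hop=160, n_filters=26, n_ceps=13):
    # Pre-emphasis: boost the higher frequencies.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Frame blocking and windowing: short quasi-stationary segments with a Hamming window.
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(frame_len)
    # DFT spectrum: magnitude spectrum of each windowed frame (Eq. 4.2.1).
    mag = np.abs(np.fft.rfft(frames, n=512, axis=1))
    # Mel spectrum: pass the power spectrum through a mel filter bank.
    mel_fb = librosa.filters.mel(sr=sr, n_fft=512, n_mels=n_filters)
    mel_spec = (mag ** 2) @ mel_fb.T
    # Log compression followed by the DCT (Eq. 4.2.2); keep only the first few coefficients.
    return dct(np.log10(mel_spec + 1e-10), type=2, axis=1, norm="ortho")[:, :n_ceps]

signal, sr = librosa.load("audio.wav", sr=16000)   # hypothetical 1-second clip
print(mfcc_from_scratch(signal, sr).shape)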

Next, a bidirectional LSTM model is created, consisting of a Bidirectional LSTM layer followed by a MaxPooling1D layer, a batch normalization layer, a dropout layer, and finally dense layers. Bidirectional LSTMs tend to be useful in the analysis of fixed-length signals and work well for audio signals. In a bidirectional LSTM the input flows in two directions, which makes a Bi-LSTM different from a regular LSTM. With a regular LSTM, the input flows in only one direction, either backward or forward; in a bidirectional LSTM, the input flows in both directions, preserving both past and future information. Bi-LSTMs are usually employed where sequence-to-sequence tasks are needed and can be used in text classification, speech recognition, and forecasting models. A minimal sketch of such a model is given below.
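The following is a hedged Keras sketch of this layer stack; the unit counts, pooling size, dropout rate, number of output classes and the added Flatten layer (needed before the final dense layers) are illustrative assumptions rather than the exact configuration used in this work.

# Hedged sketch of the audio model: Bidirectional LSTM -> MaxPooling1D ->
# BatchNormalization -> Dropout -> Dense layers, ending in a softmax classifier.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Bidirectional, LSTM, MaxPooling1D,
                                     BatchNormalization, Dropout, Flatten, Dense)

def build_audio_model(time_steps=100, n_mfcc=13, n_classes=10):
    model = Sequential([
        Bidirectional(LSTM(128, return_sequences=True),
                      input_shape=(time_steps, n_mfcc)),
        MaxPooling1D(pool_size=2),
        BatchNormalization(),
        Dropout(0.3),
        Flatten(),                                  # assumed, to flatten the time steps
        Dense(64, activation="relu"),
        Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model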

A bidirectional LSTM network is similar to an RNN, but the hidden layer updating process is replaced by a special unit called a memory cell. In a Bi-LSTM there are two distinct hidden layers, called the forward hidden layer and the backward hidden layer. The forward hidden layer h_t^f considers the input in ascending order, i.e., t = 1, 2, 3, ..., T. On the other hand, the backward hidden layer h_t^b considers the input in descending order, i.e., t = T, ..., 2, 1. The Bi-LSTM model is implemented with the following equations:

h_t^f = \tanh(W_{xh}^{f} x_t + W_{hh}^{f} h_{t-1}^{f} + b_h^{f})    (4.2.3)

h_t^b = \tanh(W_{xh}^{b} x_t + W_{hh}^{b} h_{t+1}^{b} + b_h^{b})    (4.2.4)

y_t = W_{hy}^{f} h_t^{f} + W_{hy}^{b} h_t^{b} + b_y    (4.2.5)

Figure 4.2.2: Overview of the bidirectional long short-term memory (LSTM) model.

4.3 Video Model


First, the mouth region is extracted from the video using the dlib library, which is available for Python 3. The videos are converted into frames, and every frame is cropped to extract the ROI (region of interest), i.e. the mouth region, as shown in Fig 4.3.1; this is done with the shape_predictor function available in the dlib library. In parallel, the images are converted to grey scale to further compress the dataset. The shape predictor maps a human face to 68 landmark points, of which the range 48-68 covers the mouth region. The positions of the outer-lip coordinates are then extracted and saved in the feature vector; a sketch of this step is given below. Visual speech is recognised using Long Short-Term Memory (LSTM): a model with a network of LSTMs and dense layers (a deep LSTM network) is created. A Long Short-Term Memory (LSTM) network is a type of recurrent neural network that can learn order dependence in a sequence prediction problem.
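A sketch of this ROI-extraction step is given below, assuming OpenCV for frame handling and dlib's standard 68-landmark predictor file ("shape_predictor_68_face_landmarks.dat"); the crop margin and output size are illustrative assumptions.

# Hedged sketch: crop the mouth region (landmarks 48-67) from a video frame.
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def mouth_roi(frame, margin=10, size=(64, 32)):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)     # grey scale, as in the text
    faces = detector(gray)
    if len(faces) == 0:
        return None                                    # no face found in this frame
    shape = predictor(gray, faces[0])
    # landmark indices 48-67 cover the mouth; collect their (x, y) coordinates
    pts = np.array([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)],
                   dtype=np.int32)
    x, y, w, h = cv2.boundingRect(pts)
    crop = gray[max(0, y - margin): y + h + margin,
                max(0, x - margin): x + w + margin]
    return cv2.resize(crop, size), pts                 # cropped ROI and lip landmark points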

Fig 4.3.1 Mouth ROI extraction

An LSTM cell contains three gates:


1. Forget Gate
2. Input Gate
3. Output Gate

Figure 4.3.2: Structure of an LSTM cell

The forget gate decides whether to keep or forget the information from the previous timestamps, the input gate quantifies the importance of the data coming in as input, and the output gate determines the most relevant output that has to be generated. The model which is created contains an LSTM layer with 128 hidden units and 8 time stamps as the first layer. The operations inside the LSTM cell are as follows:

\tilde{c}^{<t>} = \tanh(W_c [a^{<t-1>}, x^{<t>}] + b_c)    (4.3.1)

\Gamma_u = \sigma(W_u [a^{<t-1>}, x^{<t>}] + b_u)    (4.3.2)

\Gamma_f = \sigma(W_f [a^{<t-1>}, x^{<t>}] + b_f)    (4.3.3)

\Gamma_o = \sigma(W_o [a^{<t-1>}, x^{<t>}] + b_o)    (4.3.4)

c^{<t>} = \Gamma_u * \tilde{c}^{<t>} + \Gamma_f * c^{<t-1>}    (4.3.5)

a^{<t>} = \Gamma_o * \tanh(c^{<t>})    (4.3.6)

where c represents the memory cell, t the time stamp, \Gamma_u the update gate, \Gamma_f the forget gate, \Gamma_o the output gate, \sigma the sigmoid function, W_u the weights of the update gate, W_f the weights of the forget gate, W_o the weights of the output gate, b the bias, \tilde{c}^{<t>} the candidate cell value, x^{<t>} the input at time t, and a^{<t>} the activation (output) of the cell at time t.

Fig 4.3.3: Video model for English dataset    Fig 4.3.4: Video model for Kannada dataset
Next, one more LSTM layer is introduced, so the resulting output from the second layer will be

\tilde{c}_1^{<t>} = \tanh(W_{1c} [a_1^{<t-1>}, a^{<t>}] + b_{1c})    (4.3.7)

\Gamma_{1u} = \sigma(W_{1u} [a_1^{<t-1>}, a^{<t>}] + b_{1u})    (4.3.8)

\Gamma_{1f} = \sigma(W_{1f} [a_1^{<t-1>}, a^{<t>}] + b_{1f})    (4.3.9)

c_1^{<t>} = \Gamma_{1u} * \tilde{c}_1^{<t>} + \Gamma_{1f} * c_1^{<t-1>}    (4.3.10)

a_1^{<t>} = \Gamma_{1o} * \tanh(c_1^{<t>})    (4.3.11)

where the subscript 1 denotes the second LSTM layer and a^{<t>}, the output of the first LSTM layer, is its input at time t.
This is followed by three dense layers. Hence, the output equations of these three dense layers will be

y_2 = R(W_2 * a_1 + b_2)    (4.3.12)

y_3 = R(W_3 * y_2 + b_3)    (4.3.13)

y_4 = R(W_4 * y_3 + b_4)    (4.3.14)

where R denotes the activation function of the dense layers. Finally, a softmax layer is attached to this network for the classification. A hedged sketch of this video model is given below.
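The following is a hedged Keras sketch of the video model just described: two LSTM layers (the first with 128 hidden units over 8 time steps), three dense layers and a softmax classifier. The per-frame feature length, the widths of the second LSTM and the dense layers, and the number of classes are assumptions.

# Hedged sketch of the video (lip-reading) model.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

def build_video_model(time_steps=8, n_features=40, n_classes=10):
    model = Sequential([
        LSTM(128, return_sequences=True, input_shape=(time_steps, n_features)),
        LSTM(64),                       # second LSTM layer (eqs. 4.3.7-4.3.11)
        Dense(64, activation="relu"),   # three dense layers (eqs. 4.3.12-4.3.14)
        Dense(32, activation="relu"),
        Dense(16, activation="relu"),
        Dense(n_classes, activation="softmax"),   # softmax classification layer
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model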

4.4 AVSR Model


The fusion model consists of an audio-only part, a visual-only part, and a combination of both the audio and visual parts. For the audio-only part, one-dimensional CNNs are used and a deep convolutional neural network is created.

For the video-only model, video features are extracted in the same way as in the video model. Then a deep LSTM network is created. The equations of the LSTM network are as below:

\tilde{c}^{<t>} = \tanh(W_c [a^{<t-1>}, x^{<t>}] + b_c)    (4.4.1)

\Gamma_u = \sigma(W_u [a^{<t-1>}, x^{<t>}] + b_u)    (4.4.2)

\Gamma_f = \sigma(W_f [a^{<t-1>}, x^{<t>}] + b_f)    (4.4.3)

\Gamma_o = \sigma(W_o [a^{<t-1>}, x^{<t>}] + b_o)    (4.4.4)

c^{<t>} = \Gamma_u * \tilde{c}^{<t>} + \Gamma_f * c^{<t-1>}    (4.4.5)

a^{<t>} = \Gamma_o * \tanh(c^{<t>})    (4.4.6)

where c represents the memory cell, t the time stamp, \Gamma_u the update gate, \Gamma_f the forget gate, \Gamma_o the output gate, \sigma the sigmoid function, W_u, W_f and W_o the weights of the update, forget and output gates, b the bias, and \tilde{c}^{<t>} the candidate cell value.

In the combination of the audio-only and visual-only parts, the feature map from the first dense layer of the audio-only part is concatenated with the feature map from the first LSTM layer of the visual-only part:

a_c = [a_1^{<t>}, y_2]    (4.4.7)

The resulting feature map is passed as input to a deep feedforward neural network which contains three dense layers, where the first two dense layers are followed by a batch normalization layer and a dropout layer respectively. A feedforward neural network is the basic neural network in which the connections do not form any loops. The architecture of the feedforward neural network is shown in Figure 4.4.2.

In early integration, the correlation between modalities can be found at the feature level, and only one modeling process is needed, which results in lower cost and complexity compared to the other fusion techniques, which need more modeling processes.

Fig 4.4.1 AVSR model

y_{d1} = R(W_{d1} * a_c + b_{d1})    (4.4.8)

y_{d2} = R(W_{d2} * y_{d1} + b_{d2})    (4.4.9)

y_{d3} = R(W_{d3} * y_{d2} + b_{d3})    (4.4.10)

As the final step, all the above three parts are combined, so the vector formed will be a combination of the output vectors of all the three parts:

a_{c2} = [y_5, y_v, y_{d3}]    (4.4.11)

(the output vectors of the audio-only, visual-only and combined parts respectively). This is then passed on to a deep neural network which contains three dense layers followed by a batch normalization layer and a dropout layer:

y_{c1} = R(W_{c1} * a_{c2} + b_{c1})    (4.4.12)

y_{c2} = R(W_{c2} * y_{c1} + b_{c2})    (4.4.13)

y_{c3} = R(W_{c3} * y_{c2} + b_{c3})    (4.4.14)
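The following functional-API sketch illustrates how the three parts described above could be wired together in Keras: an audio-only 1-D CNN branch, a visual-only LSTM branch, a combination branch that concatenates the audio part's first dense feature map with the visual part's first LSTM feature map, and a final fusion of all three outputs. All layer widths, kernel sizes, dropout rates and the Flatten layers are illustrative assumptions, not the exact configuration of this work.

# Hedged sketch of the AVSR fusion model.
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import (Conv1D, MaxPooling1D, Flatten, Dense,
                                     LSTM, Dropout, BatchNormalization, concatenate)

def build_avsr_model(audio_shape=(100, 13), video_shape=(8, 40), n_classes=10):
    # audio-only part: one-dimensional convolutions followed by dense layers
    a_in = Input(shape=audio_shape)
    a = Conv1D(64, 3, activation="relu")(a_in)
    a = MaxPooling1D(2)(a)
    a = Conv1D(128, 3, activation="relu")(a)
    a = Flatten()(a)
    a_dense1 = Dense(128, activation="relu")(a)        # first dense layer (audio)
    a_out = Dense(64, activation="relu")(a_dense1)     # audio-only output features

    # visual-only part: deep LSTM network
    v_in = Input(shape=video_shape)
    v_lstm1 = LSTM(128, return_sequences=True)(v_in)   # first LSTM layer (video)
    v = LSTM(64)(v_lstm1)
    v_out = Dense(64, activation="relu")(v)            # visual-only output features

    # combination branch: concatenate audio dense-1 and video LSTM-1 features, eq. (4.4.7)
    ac = concatenate([a_dense1, Flatten()(v_lstm1)])
    d = Dense(128, activation="relu")(ac)
    d = BatchNormalization()(d)
    d = Dropout(0.3)(d)
    d = Dense(64, activation="relu")(d)
    d = BatchNormalization()(d)
    d = Dropout(0.3)(d)
    d_out = Dense(32, activation="relu")(d)            # corresponds to y_d3

    # final fusion of the three parts, eq. (4.4.11), then the last dense block
    ac2 = concatenate([a_out, v_out, d_out])
    c = Dense(128, activation="relu")(ac2)
    c = BatchNormalization()(c)
    c = Dropout(0.3)(c)
    c = Dense(64, activation="relu")(c)
    c = Dense(32, activation="relu")(c)
    out = Dense(n_classes, activation="softmax")(c)

    model = Model(inputs=[a_in, v_in], outputs=out)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model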

Fig 4.4.2 Feed forward network

Chapter-5

RESULTS AND DISCUSSIONS


This chapter describes the results of the proposed audio-visual speech recognition system, which was evaluated in different phases (audio-only, visual-only and fusion).

5.1 Evaluation of Audio Model


● For audio speech recognition, the MFCC features extracted from the custom dataset are loaded into the model.
● The model is trained for 100 epochs (a minimal training sketch is given below).
● With this model, an accuracy of 93% for training and 88% for testing is obtained on the English dataset.
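The training sketch below is hedged: it reuses the build_audio_model sketch from Section 4.2, and the names X_train, y_train, X_test and y_test for the MFCC feature arrays and integer word labels are assumptions.

# Hedged sketch: training and evaluating the audio model for 100 epochs.
model = build_audio_model(time_steps=100, n_mfcc=13, n_classes=10)
history = model.fit(X_train, y_train,
                    validation_data=(X_test, y_test),
                    epochs=100, batch_size=32)
loss, acc = model.evaluate(X_test, y_test)
print(f"test accuracy: {acc:.2%}")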

Figure 5.1.1: Training English dataset with audio model

Figure 5.1.2: Training Kannada dataset with audio model

Figure 5.1.3: Accuracy curve of audio model for English dataset

Figure 5.1.4: Loss curve of audio model for English dataset

Figure 5.1.5: Accuracy curve of audio model for Kannada dataset

Figure 5.1.6: Loss curve of audio model for Kannada dataset

A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known. The confusion matrix of the audio model for the English dataset is shown in Figure 5.1.7.

Figure 5.1.7: Confusion matrix of audio model for English dataset

Figure 5.1.8: Confusion matrix of audio model for Kannada dataset

The classification report gives the precision, recall and F1 score for every label, where

\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}    (5.1)

\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}    (5.2)

\text{F1 score} = \frac{2 \times (\text{Recall} \times \text{Precision})}{\text{Recall} + \text{Precision}}    (5.3)
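The report does not state which tooling produced these tables; the following sketch shows how the confusion matrix and the per-label classification report might be generated with scikit-learn, assuming the trained model and the test arrays from the previous steps.

# Hedged sketch: confusion matrix and per-label precision/recall/F1 report.
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

y_pred = np.argmax(model.predict(X_test), axis=1)   # predicted class indices
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))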

The classification report of the audio model for the English dataset is shown below:

Figure 5.1.9: Classification report of audio model for English dataset

Figure 5.1.10: Classification report of audio model for Kannada dataset

5.2 Evaluation of Visual Model
● The dataset is trained for 200 epochs.
● Training accuracy of 92% and testing accuracy of 90% are obtained for the English dataset.
● The corresponding training run and curves for the Kannada dataset are shown in Fig 5.2.2, Fig 5.2.5 and Fig 5.2.6.

Fig 5.2.1: Training English dataset with video model

Fig 5.2.2: Training Kannada dataset with video model

Model accuracy and model loss curves are shown below:

Fig 5.2.3: Accuracy curve of video model for English dataset

Fig 5.2.4: Loss curve of video model for English dataset

Fig 5.2.5: Accuracy curve of video model for Kannada dataset

Fig 5.2.6: Loss curve of video model for Kannada dataset

The confusion matrices of the video model are shown below:

Fig 5.2.7: Confusion matrix of video model for English dataset

Fig 5.2.8: Confusion matrix of video model for Kannada dataset

The classification reports of the video model are shown below:

Fig 5.2.9: Classification report of video model for English dataset

Fig 5.2.10: Classification report of video model for Kannada dataset
5.3 Evaluation of Fusion Model
● The English dataset is trained for 200 epochs.
● Training accuracy of 94% and testing accuracy of 87% are obtained for the English dataset.
● Training accuracy of 95% and testing accuracy of 89% are obtained for the Kannada dataset.

Fig 5.3.1: Training the fusion model on the English dataset

Fig 5.3.2: Training the fusion model on the Kannada dataset

Fig 5.3.3: Accuracy curve for fusion model of English dataset

Fig 5.3.4: Loss curve for fusion model of English dataset

Fig 5.3.5: Accuracy curve for fusion model of Kannada dataset

Fig 5.3.6: Loss curve for fusion model of Kannada dataset

The confusion matrices for the fusion model are shown below:

Fig 5.3.7: Confusion matrix for fusion model of English dataset

Fig 5.3.8: Confusion matrix for fusion model of Kannada dataset

Fig 5.3.9: Classification report for fusion model of English dataset

Fig 5.3.10: Classification report for fusion model of Kannada dataset

Table 5.3.1: Results of the proposed method compared with existing methods for audio-visual speech recognition

Method                                              Dataset         Overall Accuracy
Audio: LSTM; Video: CNN; Fusion: RNN [10]           AV Letters      87%
Audio: Bi-LSTM; Video: Residual Networks;           LRW Database    76%
Fusion: CNN
Audio: Bidirectional LSTM; Video: LSTM;             Custom          95%
Fusion: DCNN (proposed method)
Chapter-6

CONCLUSION & FUTURE SCOPE


This chapter describes the conclusion and future work of the proposed system.

6.1 Conclusion

The audio-visual speech recognition system based on deep learning architectures was accomplished with the proposed objective for the custom dataset and resources. The performance can still be increased by using hybrid models individually in the audio, visual and integration processing stages. Regarding the performance of the audio-visual speech recognition system: for audio recognition with the bidirectional LSTM, an accuracy of 93% for English and 92% for Kannada has been achieved; for visual recognition with the LSTM algorithm, an overall accuracy of 92% for English and 89% for Kannada has been achieved; and for the integration with the help of deep convolutional neural networks, an accuracy of 95% for both the English and Kannada languages has been achieved.

6.2 Future Scopes

● In future work, more datasets can be used for training and testing, and different neural networks can be explored.
● A database can be created with camera angles other than facing the speaker straight on.
● Real-time data inputs can be used for processing.
● Different multimodal and hybrid models can be used to achieve better accuracy.
● With the use of effective algorithms and a compact, effective dataset, better performance can be achieved.

6.3 Advantages and Limitations

The following are some of the advantages of the AVSR model:


● Every module and library used in the AVSR model to recognize the word spoken by the speaker is open source and available to everyone.
● The AVSR model designed in this work supports both the native English language and the native Kannada language.
● The integration of the video model along with the audio model helps in recognizing the word considerably better.
The following are some of the limitations of the AVSR model:
● The proposed AVSR model can only recognize a single word at a time.
● The model cannot recognize sentences.

REFERENCES
1. Karel Mundnich, Alexandra Fenster, Aparna Khare, and Shiva Sundaram, "Audiovisual Highlight Detection in Videos," ICASSP, 2021.
2. Zakaria Aldeneh, Anushree Prasanna Kumar, Barry-John Theobald, Erik Marchi, Sachin Kajarekar, Devang Naik, and Ahmed Hussen Abdelaziz, "Self-supervised Learning of Visual Speech Features with Audiovisual Speech Enhancement," University of Michigan, Ann Arbor, MI, USA and Apple, Cupertino, CA, USA, 2020.
3. Triantafyllos Afouras, Joon Son Chung, Andrew Senior, Oriol Vinyals, and Andrew Zisserman, "Deep Audio-Visual Speech Recognition," 2018.
4. Xinmeng Xu, Yang Wang, Dongxiang Xu, Yiyuan Peng, Cong Zhang, Jie Jia, and Binbin Chen, "MFFCN: Multi-layer Feature Fusion Convolution Network for Audio-Visual Speech Enhancement," Trinity College Dublin, Ireland and vivo AI Lab, P.R. China, 2021.
5. Linlin Xia, Gang Chen, Xun Xu, Jiashuo Cui, and Yiping Gao, "Audiovisual Speech Recognition: A Review and Forecast," International Journal of Advanced Robotic Systems, 2020.
6. Ruohan Gao and Kristen Grauman, "VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency," The University of Texas at Austin and Facebook AI Research, 2021.
7. Kaylah Lalonde and Lynne A. Werner, "Development of the Mechanisms Underlying Audiovisual Speech Perception Benefit," 2021.
8. Simon Lucey, Tsuhan Chen, Sridha Sridharan, and Vinod Chandran, "Integration Strategies for Audio-Visual Speech Processing: Applied to Text-Dependent Speaker Recognition," Speech Research Laboratory, RCSAVT, Australia and Advanced Multimedia Processing Laboratory, USA, 2005.
9. Hong Liu, Zhan Chen, and Bing Yang, "Lip Graph Assisted Audio-Visual Speech Recognition Using Bidirectional Synchronous Fusion," Key Laboratory of Machine Perception, Shenzhen Graduate School, Peking University, China, 2020.
10. Weijiang Feng, Naiyang Guan, Yuan Li, Xiang Zhang, and Zhigang Luo, "Audio Visual Speech Recognition with Multimodal Recurrent Neural Networks," National University of Defense Technology, Changsha, Hunan, P.R. China, 2017.
11. Soonkyu Lee and DongSuk Yook, "Audio-to-Visual Conversion Using Hidden Markov Model," Department of Computer Science and Engineering, Korea University, Seoul, Korea, 2002.
12. Aggelos K. Katsaggelos, Sara Bahaadini, and Rafael Molina, "Audiovisual Fusion: Challenges and New Approaches," 2015.
13. Dana Lahat, Tülay Adalı, and Christian Jutten, "Challenges in Multimodal Data Fusion," GIPSA-Lab and University of Maryland, Baltimore County, MD, USA, 2014.
14. Hui Wang, Fei Gao, Yue Zhao, and Licheng Wu, "WaveNet With Cross-Attention for Audiovisual Speech Recognition," School of Information Engineering, Minzu University of China, Beijing, China, 2020.
15. Bertrand Rivet, Wenwu Wang, Syed Mohsen Naqvi, and Jonathon A. Chambers, "Audiovisual Speech Source Separation: An Overview of Key Methodologies," 2014.
16. Themos Stafylakis and Georgios Tzimiropoulos, "Combining Residual Networks with LSTMs for Lipreading," Computer Vision Laboratory, University of Nottingham, UK, 2017.
17. Alexandros Koumparoulis and Gerasimos Potamianos, "Deep View2View Mapping for View-Invariant Lipreading," 2018.
18. Daniel Michelsanti, Zheng-Hua Tan, Sigurdur Sigurdsson, and Jesper Jensen, "Deep-Learning-Based Audio-Visual Speech Enhancement in Presence of Lombard Effect," 2019.
19. Onno Kampman, Elham J. Barezi, Dario Bertero, and Pascale Fung, "Investigating Audio, Video, and Text Fusion Methods for End-to-End Automatic Personality Prediction," 2018.
20. Kuan Liu, Yanen Li, Ning Xu, and Prem Natarajan, "Learn to Combine Modalities in Multimodal Deep Learning," 2018.
21. Bertrand Rivet, Laurent Girin, and Christian Jutten, "Mixing Audiovisual Speech Processing and Blind Source Separation for the Extraction of Speech Signals From Convolutive Mixtures," 2006.
22. Shiyang Cheng, Pingchuan Ma, Georgios Tzimiropoulos, Stavros Petridis, Adrian Bulat, Jie Shen, and Maja Pantic, "Towards Pose-Invariant Lip-Reading," 2019.
23. Mohammad Hasan Rahmani and Farshad Almasganj, "Lip-Reading via a DNN-HMM Hybrid System Using Combination of the Image-Based and Model-Based Features," Biomedical Engineering Department, Amirkabir University of Technology, Tehran, Iran, 2017.
24. K R Prajwal, Rudrabha Mukhopadhyay, Vinay P. Namboodiri, and C V Jawahar, "Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis," IIIT Hyderabad and IIT Kanpur, 2020.
25. Zhiyong Wu, Lianhong Cai, and Helen Meng, "Multi-Level Fusion of Audio and Visual Features for Speaker Identification," Department of Computer Science and Technology, Tsinghua University, Beijing, China, 2006.
26. Wentao Yu, Steffen Zeiler, and Dorothea Kolossa, "Multimodal Integration for Large-Vocabulary Audio-Visual Speech Recognition," Institute of Communication Acoustics, Ruhr University Bochum, Germany, 2020.
27. Y. Goh, K. Lau, and Y. Lee, "Audio-Visual Speech Recognition System Using Recurrent Neural Network," 2019 4th International Conference on Information Technology (InCIT), Bangkok, Thailand, 2019, pp. 38-43, doi: 10.1109/INCIT.2019.8912049.
