Lip Reading Using External Viseme Decoding

Javad Peymanfard
School of Computer Engineering, Iran University of Science and Technology, Tehran, Iran
[email protected]

Mohammad Reza Mohammadi
School of Computer Engineering, Iran University of Science and Technology, Tehran, Iran
[email protected]

Hossein Zeinali
Department of Computer Engineering, Amirkabir University of Technology, Tehran, Iran
[email protected]
Abstract—Lip-reading is the task of recognizing speech from lip movements. It is difficult because the lip movements of many words look alike when pronounced. The term viseme is used to describe lip movements during a conversation. This paper shows how to use external text data (for viseme-to-character mapping) by dividing video-to-character conversion into two stages, namely converting video to visemes and then converting visemes to characters, with a separate model for each stage. Our proposed method improves the word error rate by an absolute 4% compared to a standard sequence-to-sequence lip-reading model on the BBC-Oxford Lip Reading Sentences 2 (LRS2) dataset.

Index Terms—lip-reading, visual speech recognition, viseme

I. INTRODUCTION

Lip-reading is commonly used to understand human speech without hearing the sound, using only visual features. This ability is more common in people with hearing loss or hearing problems. Over the past years, several methods have been proposed to help a person lip-read [1], but there is an important difference between these methods and the lip-reading methods proposed in AI. The purpose of machine lip-reading methods is to convert visual information into words, and this conversion takes place at two levels, described below. The main purpose of lip-reading by humans, however, is to understand the meaning of speech, not to recognize every single word. Obviously, visemes are the main challenge in lip-reading. Visemes are the visual equivalent of phonemes [2]: each viseme refers to a group of phonemes for which the movement of the lips is the same, such as /b/, /p/, and /m/.

Traditional approaches to automatic lip-reading used hand-crafted features [1], [3]–[5], and Hidden Markov Models were also applied for modeling [6]–[8]. However, with the large public datasets introduced in this area in recent years and the deep learning methods applied to them, more capable methods have been proposed that are more accurate than even a professional lip reader. These datasets are divided into two groups, namely word-level and sentence-level lip-reading. In the first group, the lip-reading problem is a classification task: there is a fixed vocabulary (set of classes), and the model aims at assigning an input video to a class. In the second group, which is a sequence learning task, each sample (video) has a sentence-level transcription. This is a more challenging task than word-level prediction.

In this paper, we propose a method in which textual data independent of lip-reading datasets can be utilized to achieve higher modeling accuracy. In this method, an external viseme decoder can be modeled using only textual data, so it is efficient to build. We employed a sequence-to-sequence model for viseme-to-character modeling to predict characters from a large amount of text data. A two-layer GRU with an attention mechanism is used in this model. The use of this external model increases the accuracy of the entire lip-reading process.

The paper is organized as follows: In Section II, we review the most important related works and discuss the advantages and disadvantages of the available methods. In Section III, we present the proposed method and describe it in detail. In Section IV, we discuss the experiments. Finally, we conclude the paper in Section V.

II. RELATED WORK

There are a variety of methods for lip-reading, and they fall into two categories: word-level and sentence-level lip-reading. In word-level methods, lip-reading is a classification task, whereas in sentence-level methods, the problem is sequence prediction. There are many pre-deep-learning methods, which are reviewed in [9]–[11]. In one of these methods [12], modeling is performed at the viseme level using an HMM (Hidden Markov Model). After the visemes are obtained, another phoneme HMM is trained to convert each viseme to a specific phoneme. This is a non-deep approach and was examined on little data.

The first deep-learning method proposed for sentence-level lip-reading was LipNet [13]. The LipNet architecture has 3 layers of STCNN (Spatio-Temporal CNN) followed by 2 Bi-GRUs (Bidirectional Gated Recurrent Units). As an end-to-end model, LipNet is trained with the CTC loss [14]. This method was tested on the GRID dataset.
Fig. 1. Traditional sequence-to-sequence methods.
Fig. 2. Our proposed lip-reading method using an external viseme-to-character model.
The next method, called WAS (Watch, Attend and Spell), uses an attention mechanism and performs lip-reading on the LRS2 dataset, which contains real-world data [2]. This model is based on LAS (Listen, Attend and Spell), which was developed for speech recognition [15].

Deep learning architectures for lip-reading are compared in [16]. In this comparison, three neural network architectures (fully convolutional, bidirectional LSTM, and Transformer [17]) are evaluated, and the best performing network with respect to word error rate is the Transformer, with a WER (Word Error Rate) of 50% on LRS2. The fully convolutional network has the best training and inference time.

In another recent work, an effective strategy for training a lip-reading model was proposed that uses speech recognition directly [18]. This method, which is based on knowledge distillation, does not require manually annotated lip-reading data, and the videos are unlabeled. It predicts speech at the sentence level and obtains state-of-the-art results on the LRS2 and LRS3 datasets.

In still another work, the main focus is multilingual synergized lip-reading [19]. In this method, a model with higher accuracy in both languages can be achieved using data from two different languages. The main idea is that common patterns in lip movement exist across languages because human vocal organs are the same. This method obtains state-of-the-art performance on two challenging word-level lip-reading benchmarks, namely LRW (English) and LRW-1000 (Mandarin).

Finally, the authors of [20] proposed a variable-length augmentation and used temporal convolutional networks, suggesting another improvement to word-level lip-reading.

III. PROPOSED METHOD

In this section, we propose a method in which external textual data can improve the accuracy of a lip-reading model. The section consists of two parts. In the first part, we describe the highest accuracy that can be achieved in word-level lip-reading, and in the second part, we explain our proposed method.

A. Word-level lower bound error for greedy algorithm

First, we used the available lip-reading text data to find the lower-bound error of word-level viseme-to-character modeling. In this case, we first extract the vocabulary of the text data and the frequency of each word. Then, using the pronunciation list of the words and one of the suggested phoneme-to-viseme mappings [21], the viseme sequence for each word is obtained. Obviously, some words have the same viseme sequence (like "art" and "heart"). We then group such words and use a greedy algorithm to get the minimum error: whenever more than one word shares a viseme sequence, the best choice is the word that occurs most frequently. In the experiment we performed on the LRS3 dataset [22], the lowest WER was 24.29%. We also ran this experiment on the LRS2 dataset, where the lowest WER was 27.16%. However, this holds only when context is not taken into account. In the following, we propose a method that can achieve higher accuracy for viseme decoding by exploiting context.
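A minimal sketch of this greedy lower-bound computation is shown below. It assumes that a word-frequency table, a pronunciation dictionary, and a phoneme-to-viseme map are already available; the function and variable names, as well as the toy mapping in which /HH/ has no visible viseme, are illustrative assumptions rather than the authors' code.

```python
from collections import Counter, defaultdict

def greedy_lower_bound_wer(word_counts, word_to_phonemes, phoneme_to_viseme):
    """Best achievable word-level WER when each word must be predicted from
    its viseme sequence alone, i.e. without any sentence context."""
    groups = defaultdict(list)
    for word, count in word_counts.items():
        # Map the word's phonemes to visemes; phonemes mapped to None
        # (no visible lip shape) are dropped.
        visemes = tuple(v for v in (phoneme_to_viseme.get(p)
                                    for p in word_to_phonemes[word]) if v is not None)
        groups[visemes].append(count)

    total = sum(word_counts.values())
    # Greedy choice: for every viseme sequence always output its most frequent
    # word, so all occurrences of the other words in the group are errors.
    errors = sum(sum(counts) - max(counts) for counts in groups.values())
    return errors / total

# Toy example: "art" and "heart" collapse to the same viseme sequence
# because /HH/ is assumed to have no visible viseme here.
counts = Counter({"art": 30, "heart": 10})
prons = {"art": ("AA", "R", "T"), "heart": ("HH", "AA", "R", "T")}
vmap = {"AA": "open", "R": "rounded", "T": "alveolar", "HH": None}
print(greedy_lower_bound_wer(counts, prons, vmap))  # 10 errors / 40 words = 0.25
```

Applying the same computation to the LRS2 and LRS3 vocabularies with the mapping from [21] is what yields the 27.16% and 24.29% lower bounds reported above.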
TABLE I
Examples of viseme decoding results.
B. Lip reading using external viseme decoding

In recent years, with the advances in deep learning, significant progress has been made on many computer vision problems. One of the most difficult tasks in this field is lip reading and, in particular, viseme decoding. Usually only small datasets are available for this task in a given language, and collecting data for the available methods is very costly: because these methods require curriculum learning, word-level annotation is needed.

In this paper, we intend to solve this problem separately using available sequence-to-sequence methods. This allows us to use raw textual data in a language directly for viseme decoding. With respect to the lower-bound error discussed in the previous section, we expect the error of this model to be much lower, because here the context is considered and each word is not decoded separately.

The lip-reading methods mentioned in the previous section perform modeling at the sentence level. Since the main challenge in lip-reading is viseme decoding, we expect the video-to-viseme conversion itself to be done with greater accuracy. Lip-reading can then be performed at the sentence level using the two proposed models together.

In fact, both sub-models have their own advantages, which improve accuracy. We describe the two models in order of use. For the first model, which converts video to visemes, existing methods can be used exactly as they are; we need neither more data nor any change in the network structure. Nevertheless, we expect higher accuracy due to the smaller number of classes. The second model, moreover, does not need a lip-reading dataset at all. To train it, we need raw textual data in the target language, and training data can be generated as needed by obtaining a phoneme sequence for each word and applying the phoneme-to-viseme mapping. The only challenge when constructing this dataset is obtaining the phoneme sequence for an utterance, a problem known as G2P (grapheme-to-phoneme) conversion, for which several solutions exist.

As shown in Figure 1, in traditional sequence-to-sequence methods, the first step is determining the mouth area using facial landmarks and cropping the ROI (region of interest). This frame sequence is then modeled using a 3D visual front-end (usually a 3D-CNN) followed by a sequence processing model. In other words, in these methods the lip-movement features are extracted by a 3D convolutional network, and the output is a set of probabilities for each character. In the proposed method, shown in Figure 2, another network is trained with independent data. In fact, two networks are trained: one converts video to visemes, and the other predicts characters from the viseme sequence.
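To make the first stage concrete, the following is a minimal PyTorch sketch of a video-to-viseme network of the kind described above: a 3D convolutional front-end over the cropped mouth region followed by a recurrent sequence model that emits per-frame viseme probabilities. The layer sizes, the number of viseme classes, and the per-frame output are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class VideoToViseme(nn.Module):
    """Simplified video-to-viseme front-end: 3D-CNN + bidirectional GRU.

    Input:  (batch, 1, frames, height, width) grayscale mouth crops.
    Output: (batch, frames, num_visemes) per-frame log-probabilities.
    """
    def __init__(self, num_visemes=14):  # number of viseme classes is an assumption
        super().__init__()
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=(3, 3, 3), padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 4, 4)),  # keep the time axis, pool space
        )
        self.gru = nn.GRU(64 * 4 * 4, 256, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * 256, num_visemes)

    def forward(self, video):
        feats = self.frontend(video)                      # (B, C, T, 4, 4)
        b, c, t, h, w = feats.shape
        feats = feats.permute(0, 2, 1, 3, 4).reshape(b, t, c * h * w)
        seq, _ = self.gru(feats)                          # (B, T, 512)
        return self.classifier(seq).log_softmax(dim=-1)   # (B, T, num_visemes)

# Shape check with a dummy clip: 16 frames of 64x64 mouth crops.
model = VideoToViseme()
dummy = torch.randn(2, 1, 16, 64, 64)
print(model(dummy).shape)  # torch.Size([2, 16, 14])
```

Such per-frame outputs could be trained with a CTC-style loss against viseme transcriptions, or the same front-end could feed an encoder-decoder model, which is closer to the simple sequence-to-sequence setup used in the experiments below.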
IV. EXPERIMENTS

In this section, we describe the experiments we performed and their results. The experiments are divided into two parts. In the first part, we describe the conversion of visemes to characters and compare the accuracy obtained when using raw external text data with the accuracy obtained when using only the existing lip-reading text data. In the second part, we perform lip-reading at the character level using the obtained model, together with a viseme-level lip-reading model.

Note that our goal was not to achieve the best reported results; due to lack of time, we only intended to show that the proposed method can improve a baseline. We believe that this improvement can be obtained with any other method as well. In the future, we will first replicate the results reported in [2] and then incorporate the proposed method into them to make a better comparison with the state-of-the-art results.

A. Viseme decoding

As explained earlier, there is usually little data for training a lip-reading model. In this experiment, we want to show how increasing the amount of unlabeled text data affects the accuracy of viseme decoding. We first trained the model using only the textual data of the LRS2 dataset. Given that LRS2 is not large in terms of textual data, in the second case we used the OpenSubtitles corpus [23] for viseme-to-character modeling and measured the effect of this change.

We selected 6 million samples from the OpenSubtitles corpus for this purpose. Since only the word sequence is available in this corpus, we first used the CMU Pronouncing Dictionary to convert each word sequence into a phoneme sequence and removed the sentences containing words that were not in the dictionary. Given that the main purpose here is a kind of language modeling (i.e., what matters to us is the probability of occurrence of different consecutive visemes), we are not very sensitive to choosing an accurate transcription for every word. Consequently, we used a simple rule: for a word with multiple transcriptions in the dictionary, we select only the first one.
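The data preparation just described can be sketched as follows, using NLTK's copy of the CMU Pronouncing Dictionary and keeping only the first listed pronunciation of each word. The small phoneme-to-viseme table is a made-up fragment for illustration only; the actual experiments use a published mapping [21].

```python
import re
from nltk.corpus import cmudict  # requires a one-time nltk.download("cmudict")

# Made-up fragment of a phoneme-to-viseme table (NOT the full mapping from [21]).
PHONEME_TO_VISEME = {
    "P": "bilabial", "B": "bilabial", "M": "bilabial",
    "F": "labiodental", "V": "labiodental",
    "AA": "open", "AE": "open", "AH": "open",
    "IY": "spread", "IH": "spread",
    "T": "alveolar", "D": "alveolar", "N": "alveolar", "S": "alveolar", "Z": "alveolar",
    "R": "rounded", "W": "rounded", "UW": "rounded",
}

PRONUNCIATIONS = cmudict.dict()  # word -> list of ARPAbet pronunciations

def sentence_to_visemes(sentence):
    """Convert a word sequence to a viseme sequence, or return None if any
    word is missing from the dictionary (such sentences are discarded)."""
    visemes = []
    for word in sentence.lower().split():
        prons = PRONUNCIATIONS.get(word)
        if not prons:
            return None
        for phone in prons[0]:                # first pronunciation only, as in the paper
            phone = re.sub(r"\d", "", phone)  # strip stress digits, e.g. AH0 -> AH
            if phone in PHONEME_TO_VISEME:    # phones outside this fragment are skipped
                visemes.append(PHONEME_TO_VISEME[phone])
    return visemes

print(sentence_to_visemes("read my lips"))
```

Sentences for which the function returns None correspond to the out-of-dictionary sentences that were removed above.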
Subsequently, the viseme sequence for each sample was obtained using the phoneme-to-viseme mapping. After preparing these data, the model was trained, and some of the results are shown in Table I. As mentioned above, we used a sequence-to-sequence network with a two-layer GRU with a cell size of 1024, and we also used an attention mechanism [24] to achieve better results.
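For reference, here is a compact PyTorch sketch of such a viseme-to-character model: a two-layer GRU encoder over viseme tokens, an additive attention over the encoder states, and a GRU decoder trained with teacher forcing. The vocabulary sizes, the shared embedding width, and the decoding loop are simplifying assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class VisemeToCharSeq2Seq(nn.Module):
    """Viseme-to-character seq2seq: 2-layer GRU encoder, additive attention,
    GRU decoder predicting one character per step (teacher forcing here)."""
    def __init__(self, num_visemes=15, num_chars=30, hidden=1024):
        super().__init__()
        self.vis_emb = nn.Embedding(num_visemes, hidden)
        self.chr_emb = nn.Embedding(num_chars, hidden)
        self.encoder = nn.GRU(hidden, hidden, num_layers=2, batch_first=True)
        self.decoder = nn.GRU(2 * hidden, hidden, num_layers=2, batch_first=True)
        # Additive attention: score(enc_t, dec_state) = v^T tanh(W [enc_t; dec_state])
        self.attn_w = nn.Linear(2 * hidden, hidden)
        self.attn_v = nn.Linear(hidden, 1)
        self.out = nn.Linear(hidden, num_chars)

    def forward(self, visemes, chars):
        enc_out, enc_h = self.encoder(self.vis_emb(visemes))     # (B, Tv, H)
        dec_h = enc_h
        logits = []
        for t in range(chars.size(1)):                           # teacher forcing
            query = dec_h[-1].unsqueeze(1).expand(-1, enc_out.size(1), -1)
            scores = self.attn_v(torch.tanh(self.attn_w(torch.cat([enc_out, query], dim=-1))))
            weights = torch.softmax(scores, dim=1)               # attention over Tv
            context = (weights * enc_out).sum(dim=1, keepdim=True)
            step_in = torch.cat([self.chr_emb(chars[:, t:t + 1]), context], dim=-1)
            dec_out, dec_h = self.decoder(step_in, dec_h)
            logits.append(self.out(dec_out))
        return torch.cat(logits, dim=1)                          # (B, Tc, num_chars)

# Dummy batch: 2 sequences of 12 viseme ids, targets of 8 character ids.
model = VisemeToCharSeq2Seq()
vis = torch.randint(0, 15, (2, 12))
tgt = torch.randint(0, 30, (2, 8))
print(model(vis, tgt).shape)  # torch.Size([2, 8, 30])
```

At inference time, the teacher-forced character input would be replaced by the previously predicted character (greedy or beam-search decoding).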
The results in Table II show that using the OpenSubtitles corpus reduced the relative CER (Character Error Rate) by approximately 62% (from 26% to 10%) and the relative WER by around 57% (from 37% to 16%). These results indicate that with more training data for viseme decoding, a better language model can be obtained, which improves the accuracy of this decoding. Of course, there is certainly an upper bound to this improvement, and it cannot be claimed that the error can be reduced to zero simply by adding data. Nevertheless, it shows that accuracy is relatively easy to improve in this way, since a lot of unlabeled textual data is available in any language.

TABLE II
WER and CER for viseme-to-character modeling.

Dataset         CER    WER
LRS2            26%    37%
OpenSubtitles   10%    16%

B. Sentence-level lip-reading

In this step, we use the proposed models to perform character-level lip-reading via viseme-level lip-reading. For this purpose, we first had to train the video-to-viseme model, which requires a viseme sequence for each video sample. Here again, we used a dictionary in the same way as in the previous experiment: after the phoneme sequence for each training sample was obtained, the phoneme-to-viseme mapping was used to convert it into a viseme sequence. The video-to-viseme model was then trained using a simple sequence-to-sequence model. The results of this experiment are shown in Table III.

The first row of this table shows that the video-to-viseme model performs this task with an acceptable degree of accuracy, even though a simple model is used. At this point, we have only a viseme sequence as output. The second part of the table gives the result of combining this model with the model prepared in the previous step, along with a comparison with the result obtained by the method of [2]. Considering that a simple architecture is used in both trained models, the improvement in accuracy compared to [2] is considerable. As shown in Table III, in the case of viseme-level modeling, the character error rate is 33.9%, which is 16% (absolute) better than predicting characters directly from the video (i.e., the second row). Also, when external textual data is used for viseme decoding, we achieve higher accuracy than when the network has to learn the language model probabilities implicitly.
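The combination evaluated in Table III amounts to a simple composition of the two models at inference time; the snippet below is only a schematic of that chaining, with placeholder functions standing in for the trained networks.

```python
from typing import Callable, List, Sequence

def two_stage_lipreading(
    mouth_frames: Sequence,
    video_to_visemes: Callable[[Sequence], List[str]],
    visemes_to_text: Callable[[List[str]], str],
) -> str:
    """Chain the two independently trained models: the first maps the cropped
    mouth frames to a viseme sequence, the second decodes it into characters."""
    viseme_sequence = video_to_visemes(mouth_frames)
    return visemes_to_text(viseme_sequence)

# Placeholder stand-ins for the two trained networks, for demonstration only.
fake_video_model = lambda frames: ["bilabial", "open", "alveolar"]
fake_text_model = lambda visemes: "bat"
print(two_stage_lipreading([None] * 16, fake_video_model, fake_text_model))  # bat
```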
TABLE III
Performance on the LRS2 dataset.

Method            WER      CER
Video to Viseme   62.3%    33.9%
WAS [2]           73.9%    49.9%
Proposed method   69.5%    46.1%

V. CONCLUSION

Lip reading is one of the most challenging tasks in the field of computer vision, and for many languages there is scant data available for it. We introduced a new method that uses external text data for lip-reading. Higher lip-reading accuracy can be achieved by utilizing the raw text data of a specific language, a grapheme-to-phoneme converter, and a phoneme-to-viseme mapping. The experimental results indicated that the proposed method improves the accuracy of viseme decoding and outperforms, by a wide margin, the case where only the lip-reading text data is used for language modeling. We also combined this model with the viseme-level lip-reading model and achieved higher accuracy than the case where only video data was used for training. One limitation of our work is the use of a single phoneme sequence for words with more than one correct pronunciation: such words can be pronounced in several ways, and we do not know which pronunciation is used in the video. Using the output of an automatic speech recognition system to resolve this can be considered as future work. Furthermore, due to resource limitations, we had to use a simple sequence-to-sequence model for both tasks. Therefore, as another direction for future work, we will incorporate our proposed method into state-of-the-art systems to show how it can improve the overall performance of a lip-reading system.

VI. ACKNOWLEDGEMENT

The authors would like to extend their gratitude to the Speech Laboratory of the Brno University of Technology for providing access to computational servers and the LRS datasets.

REFERENCES

[1] I. Matthews, T. F. Cootes, J. A. Bangham, S. Cox, and R. Harvey, “Extraction of visual features for lipreading,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 2, pp. 198–213, 2002.
[2] J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman, “Lip reading sentences in the wild,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017, pp. 3444–3453.
[3] Z. Zhou, G. Zhao, and M. Pietikäinen, “Towards a practical lipreading system,” in CVPR 2011. IEEE, 2011, pp. 137–144.
[4] G. Potamianos and C. Neti, “Improved ROI and within frame discriminant features for lipreading,” in Proceedings 2001 International Conference on Image Processing (Cat. No. 01CH37205), vol. 3. IEEE, 2001, pp. 250–253.
[5] Y. Lan, R. Harvey, B. Theobald, E.-J. Ong, and R. Bowden, “Comparing visual features for lipreading,” in International Conference on Auditory-Visual Speech Processing 2009, 2009, pp. 102–106.
[6] G. Potamianos, H. P. Graf, and E. Cosatto, “An image transform approach for HMM based automatic lipreading,” in Proceedings 1998 International Conference on Image Processing. ICIP98 (Cat. No. 98CB36269). IEEE, 1998, pp. 173–177.
[7] Y. Lan, B.-J. Theobald, R. Harvey, E.-J. Ong, and R. Bowden, “Improving visual features for lip-reading,” in Auditory-Visual Speech Processing 2010, 2010.
[8] S. S. Morade and S. Patnaik, “A novel lip reading algorithm by using
localized ACM and HMM: Tested for digit recognition,” optik, vol. 125,
no. 18, pp. 5181–5186, 2014.
[9] Z. Zhou, G. Zhao, X. Hong, and M. Pietikäinen, “A review of recent
advances in visual speech decoding,” Image and vision computing, vol. 32,
no. 9, pp. 590–605, 2014.
[10] L. Lombardi et al., “A survey of automatic lip reading approaches,” in
Eighth International Conference on Digital Information Management
(ICDIM 2013). IEEE, 2013, pp. 299–302.
[11] S. Mathulaprangsan, C.-Y. Wang, A. Z. Kusum, T.-C. Tai, and J.-C.
Wang, “A survey of visual lip reading and lip-password verification,” in
2015 International Conference on Orange Technologies (ICOT). IEEE,
2015, pp. 22–25.
[12] H. L. Bear and R. Harvey, “Decoding visemes: Improving machine lip-
reading,” in 2016 IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP). IEEE, 2016, pp. 2009–2013.
[13] Y. M. Assael, B. Shillingford, S. Whiteson, and N. de Freitas, “LipNet:
Sentence-level lipreading,” arXiv preprint arXiv:1611.01599, vol. 2, no. 4,
2016.
[14] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist
temporal classification: labelling unsegmented sequence data with
recurrent neural networks,” in Proceedings of the 23rd international
conference on Machine learning, 2006, pp. 369–376.
[15] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A
neural network for large vocabulary conversational speech recognition,”
in 2016 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP). IEEE, 2016, pp. 4960–4964.
[16] T. Afouras, J. S. Chung, and A. Zisserman, “Deep lip reading: a
comparison of models and an online application,” arXiv preprint
arXiv:1806.06053, 2018.
[17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,
Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in
neural information processing systems, 2017, pp. 5998–6008.
[18] T. Afouras, J. S. Chung, and A. Zisserman, “Asr is all you need:
Cross-modal distillation for lip reading,” in ICASSP 2020-2020 IEEE
International Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, 2020, pp. 2143–2147.
[19] M. Luo, S. Yang, X. Chen, Z. Liu, and S. Shan, “Synchronous
bidirectional learning for multilingual lip reading,” arXiv preprint
arXiv:2005.03846, 2020.
[20] B. Martinez, P. Ma, S. Petridis, and M. Pantic, “Lipreading using temporal
convolutional networks,” in ICASSP 2020-2020 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE,
2020, pp. 6319–6323.
[21] N. Harte and E. Gillen, “TCD-TIMIT: An audio-visual corpus of
continuous speech,” IEEE Transactions on Multimedia, vol. 17, no. 5,
pp. 603–615, 2015.
[22] T. Afouras, J. S. Chung, and A. Zisserman, “LRS3-TED: a large-scale
dataset for visual speech recognition,” arXiv preprint arXiv:1809.00496,
2018.
[23] J. Tiedemann and L. Nygaard, “The OPUS corpus - parallel and free: http://logos.uio.no/opus,” in LREC. Citeseer, 2004.
[24] J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” arXiv preprint
arXiv:1506.07503, 2015.