Full Thesis
(CSE) Thesis
Submitted by
Chowdhury Rafeed Rahman
1018052013
Supervised by
Dr. Mohammed Eunus Ali
Submitted to
Department of Computer Science and Engineering
Bangladesh University of Engineering and Technology
Dhaka, Bangladesh
August 2021
Acknowledgement
I am thankful to all the people who have made this research possible through their continuous
assistance and motivation. I would first like to thank my respected supervisor, Professor Mohammed Eunus Ali, who supported the research from start to end through his wise suggestions, corrective measures and intriguing insights. His strong and constructive criticism helped this research reach a high standard.
I would also like to acknowledge Swakkhar Shatabda, Associate Professor at United International University (UIU), and the IT team of UIU, who facilitated a dedicated GPU server that was essential for conducting the large number of experiments in this research.
Finally, I would like to thank my parents and my wife for providing me with mental support and
wise counseling during the tough times of the research period.
Contents
Candidate’s Declaration i
Board of Examiners ii
Acknowledgement iii
List of Tables ix
Abstract x
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 Bangla Language Modeling Problem . . . . . . . . . . . . . . . . . . . 3
1.2.2 Spelling Correction Problem . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Overview of Related Works and Limitations . . . . . . . . . . . . . . . . . . . 4
1.4 Proposed Method Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5 Research Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.6 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2 Background 10
2.1 Unidirection and Bidirection in Many to One LSTM . . . . . . . . . . . . . . . 10
2.2 One Dimensional CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 Attention Mechanism in LSTM . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4 Self Attention Based Transformer Encoders . . . . . . . . . . . . . . . . . . . 14
2.5 BERT Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.6 Word Representation Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.7 Overview of Synthetic Spelling Error Generation . . . . . . . . . . . . . . . . 19
3 Literature Review 21
3.1 Language Modeling Related Researches . . . . . . . . . . . . . . . . . . . . . 21
3.1.1 LSTM Based Language Modeling . . . . . . . . . . . . . . . . . . . . 21
3.1.2 CNN Based Language Modeling . . . . . . . . . . . . . . . . . . . . . 23
3.1.3 Self Attention Based Language Modeling . . . . . . . . . . . . . . . . 24
3.2 Spell Checker Related Researches . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2.1 Existing Bangla Spell Checkers . . . . . . . . . . . . . . . . . . . . . 25
3.2.2 State-of-the-art Neural Network Based Chinese Spell Checkers . . . . . 27
3.2.3 Researches Indirectly Connected to Spell Checking . . . . . . . . . . . 28
7.1 Dataset Specifications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
7.2 Model Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
7.2.1 Performance Metric . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
7.2.2 Hyperparameter Specifications . . . . . . . . . . . . . . . . . . . . . . 56
7.3 Comparing BSpell with Contemporary BERT Variants . . . . . . . . . . . . . . 56
7.4 Comparing Different Pretraining Schemes on BSpell . . . . . . . . . . . . . . . 58
7.5 Comparing LSTM Variants with BSpell . . . . . . . . . . . . . . . . . . . . . 59
7.6 Comparing Existing Bangla Spell Checkers with BSpell . . . . . . . . . . . . . 60
7.7 Effectiveness of SemanticNet Sub-Model . . . . . . . . . . . . . . . . . . . . . 61
7.8 BSpell Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
8 Conclusion 64
8.1 Future Research Direction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
8.2 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
References 66
List of Figures
4.3 Beam search simulation with 3 time stamps and a beam width of 2 . . . . . . . 35
7.1 The bold words in the sentences are the error words. The red marked words
are the ones where only BSpell succeeds in correcting among the approaches
provided in Table 7.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
7.2 The bold words in the sentences are the error words. The red marked words are
the ones where only hybrid pretrained BSpell succeeds in correcting among the
approaches provided in Table 7.3 . . . . . . . . . . . . . . . . . . . . . . . . . 58
7.3 10 popular Bangla words and their real life error variations. Cluster visual is
provided in Figure 7.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
7.4 (a) 10 Bangla popular words and three error variations of each of those
words (provided in Figure 7.3) have been plotted based on their SemanticVec
representation. Each popular word along with its three error variations form a
separate cluster marked with unique color. (b) Two of the popular word clusters
zoomed in to have a closer look. . . . . . . . . . . . . . . . . . . . . . . . . . 62
7.5 Three example scenarios where BSpell fails . . . . . . . . . . . . . . . . . . . 62
List of Tables
Abstract
Though there has been a large body of recent work on language modeling for
high resource languages such as English and Chinese, the area is still largely
unexplored for low resource languages like Bangla and Hindi. We propose an
end-to-end trainable, memory-efficient convolutional neural network (CNN)
architecture, CoCNN, to handle specific characteristics of Bangla and Hindi such
as high inflection, morphological richness, flexible word order and phonetic
spelling errors. In particular, we introduce two learnable convolutional
sub-models at the word and sentence level. We show that state-of-the-art
Transformer models do not necessarily yield the best performance for Bangla and
Hindi. CoCNN outperforms pretrained BERT with 16X fewer parameters and 10X less
training time, while it achieves much better performance than state-of-the-art
long short term memory (LSTM) models on multiple real-world datasets. The word
level CNN sub-model SemanticNet of the CoCNN architecture has shown its
potential as an effective Bangla spell checker. We explore this potential and
develop a state-of-the-art Bangla spell checker. Bangla typing is mostly
performed using an English keyboard and can be highly erroneous due to the
presence of compound and similarly pronounced letters. Spelling correction of a
misspelled word requires understanding of the word typing pattern as well as
the context of the word usage. We propose a specialized BERT model, BSpell,
targeted towards word-for-word correction at sentence level. BSpell contains
the CNN sub-model SemanticNet, motivated by CoCNN, along with a specialized
auxiliary loss. This allows BSpell to specialize in the highly inflected Bangla
vocabulary in the presence of spelling errors. We further propose a hybrid
pretraining scheme for BSpell combining word level and character level masking.
Utilizing this pretraining scheme, BSpell achieves 91.5% accuracy on a real
life Bangla spelling correction validation set.
Chapter 1
Introduction
Natural language processing (NLP) has flourished in recent years in various domains starting
from writing automation tools to robot communication modules. We can see the applications of
NLP in mobile phone writing pads during automatic next word suggestion, in emailing software
during sentence auto completion, in writing and editing tools during automatic spelling correction
and even in spam message detection. NLP research has flourished in high resource languages
such as English and Chinese in recent times more than ever before [2–7]. Although Bangla inspired
International Mother Language Day, it is still considered a low resource language, especially when
we look at Bangla research from an NLP perspective. Languages such as Bangla and Hindi
originated from Sanskrit and have very different characteristics from English and Chinese. Hence,
we focus on Bangla language modeling and propose a model suitable for Bangla NLP tasks.
1.1 Motivation
[Figure 1.1: Examples of high inflection (Bangla root words and their inflected variations) and morphological richness (compound characters and their component characters)]
Bangla is the native language of 228 million speakers, making it the fourth most spoken language
in the world. This language originated from Sanskrit [8] and displays some unique characteristics.
First of all, it is a highly inflected language. Each root word of Bangla may have many
variations due to the addition of different suffixes and prefixes. Second, Bangla as a language is
morphologically rich. There are 11 vowels, 39 consonants, 11 modified vowels and more
than 170 compound characters found in Bangla scripts [9]. Third, there is a vast difference between
Bangla grapheme representation and phonetic utterance for many commonly used words.
As a result, fast typing of Bangla yields frequent spelling mistakes. Finally, the word order
in Bangla sentences is flexible; words and their positions in Bangla sentences are only loosely
bound. Some relevant examples regarding these unique aspects have been shown in Figure 1.1.
Many other languages such as Hindi, Nepali, Gujarati, Marathi, Kannada, Punjabi and Telugu
also share these characteristics because of the same Sanskrit origin.
Figure 1.2: Examples of context and word pattern dependency for spelling correction. The bold red
marked word is the misspelled word, the correction of which depends on the italicized blue words.
In the provided English translation, (error) has been placed as a representative of a meaningless
misspelled Bangla word.
Because of the above-mentioned unique characteristics, Bangla language modeling needs special
attention that differs from high resource languages such as English and Chinese. High
inflection, morphological richness and the deviation between grapheme representation and phonetic
utterance make the Bangla language spelling error prone. So, spell checking as a major
downstream task related to Bangla language modeling can be quite interesting. Moreover, most
Bangla native speakers type using an English QWERTY layout keyboard [10], which makes it
difficult to correctly type Bangla compound characters, phonetically similar single characters and
similarly pronounced modified vowels. Thus, if correct typing is desired, Bangla typing becomes
really slow. Moreover, a major portion of Bangla native speakers are semi-literate and have
no clear idea of correct Bangla spelling. So, necessary focus should be given to solving this
problem for the Bangla language.
Neural language models have shown great promise recently in solving several key NLP tasks
such as word prediction and sentence completion in major languages such as English and
Chinese [2–7]. To the best of our knowledge, none of the existing studies investigates the efficacy
of recent language models in the context of languages like Bangla and Hindi. In this research,
we conduct an in-depth analysis of major deep learning architectures for language modeling
and propose an end-to-end trainable, memory-efficient CNN architecture to address the unique
characteristics of Bangla and Hindi. A sub-component of this architecture has shown great
promise as a word level Bangla spell checker. We argue that in order to correct a misspelled
word, we need to consider the word pattern, its prior context and its post context. Figure
1.2 shows three examples, in each of which the dominant aspect among these three for the word
correction process has been indicated. So, our proposed method for Bangla spell checking takes
the necessary intuition from our proposed CNN based architecture for Bangla language modeling
and incorporates necessary changes for learning context information as well.
1.2 Problem Statement

1.2.1 Bangla Language Modeling Problem

Next word prediction is the primary language modeling task for any language. Given
$Word_1, Word_2, \ldots, Word_i$, we predict the next word $Word_{i+1}$ at sentence level. We also
include other many-to-one language modeling tasks such as text classification and sentiment
analysis as auxiliary tasks of this research.
1.2.2 Spelling Correction Problem

We focus on Bangla sentence level spelling correction. Suppose an input sentence consists of
$n$ words: $Word_1, Word_2, \ldots, Word_n$. For each $Word_i$, we have to predict the right spelling
if $Word_i$ exists in the top word list of our corpus. By top words we mean frequently occurring
words in our Bangla corpus. If $Word_i$ is a rare word (a proper noun in most cases), we predict
the $UNK$ token, denoting that we do not make any correction of such words. Think of a user who is
writing the name of his friend in a sentence. We obviously have no way of knowing what the
correct spelling of this friend's name is. For correcting a particular $Word_i$ in a paragraph, we
only consider other words of the same sentence for context information. Figure 1.3 illustrates
the problem of spelling correction. The input is a sentence containing misspelled words, while the
output is the correctly spelled version of the sentence.
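To make the formulation concrete, the following is a minimal Python sketch (not the thesis implementation) of how word level correction targets can be framed; the TOP_WORDS set, the placeholder tokens and the UNK symbol are all hypothetical illustrations:

TOP_WORDS = {"word_a", "word_b", "word_c"}          # hypothetical frequent-word list of the corpus
UNK = "<UNK>"                                       # label for rare words we do not attempt to correct

def build_targets(misspelled_sentence, correct_sentence):
    """Pair each input word with its correct spelling, or with UNK if the correct word is rare."""
    targets = []
    for wrong, right in zip(misspelled_sentence.split(), correct_sentence.split()):
        targets.append((wrong, right if right in TOP_WORDS else UNK))
    return targets

# Example: a rare proper noun in the sentence is mapped to <UNK> instead of a correction.
print(build_targets("wrd_a_typo name_xyz", "word_a name_xyz"))
# [('wrd_a_typo', 'word_a'), ('name_xyz', '<UNK>')]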
1.3 Overview of Related Works and Limitations

A state-of-the-art BERT based spell checker offers an assumption free architecture where error
detection network based soft-masking is included [27]. This model takes all $N$ characters of a
sentence as input and produces the correct version of these $N$ characters as output in a parallel
manner. In this model, the number of input and output characters has to be the same.
Figure 1.4: Heterogeneous character number between error spelling and correct spelling
We now point out the limitations of the research performed on spell checking with respect to the
Bangla language. One of the limitations in developing a Bangla spell checker using the state-of-the-art
BERT based implementation [27] is that the number of input and output characters in BERT has to
be exactly the same. This scenario is problematic for Bangla due to the following reasons. First
of all, such a scheme is only capable of correcting substitution type errors. Deletion or insertion
will certainly alter word length. As compound characters (made up of three or more simple
characters) are commonly used in Bangla words, an error made due to the substitution of such
characters also changes word length. We show this scenario in Figure 1.4, where three real life
cases are shown. Here, the character number has decreased in the misspelled version. The correct
version of the first word of this figure contains 7 characters, whereas the misspelled version contains
only 5 characters due to a typing mistake in place of the only compound character of this word.
Word length has changed without any deletion or insertion. As a result, character based
BERT correction is not feasible for Bangla. Hence, we introduce word level prediction in our
proposed BERT based model.
Figure 1.5: Some sample words that are correctly spelled accidentally but are context-wise
incorrect. (English translations from the figure: correct sentence "Soldier went to war riding a
horse."; incorrect sentence "Soldier went to war riding a visit.")
Most existing Bangla spell checkers focus only on the word to be corrected, while the
importance of context in this regard is often ignored. Figure 1.5 illustrates the importance of
context in Bangla spell checking. Although the red marked words of this figure are misspelled
versions of the corresponding green marked correct words, these red words are valid Bangla
words. But if we check these red words against the semantic context of the sentence, we can
realize that they have been produced accidentally because of spelling errors. In the first sentence,
the red marked word means 'to visit'. But the sentence context hints at a soldier riding towards
the battleground. A slight modification of this red word makes it correct (written in green),
and it then has the meaning 'horse'. These are all real life Bangla spelling error examples which
demonstrate the paramount importance of context in spell checking.
Figure 1.6: Necessity of understanding semantic meaning of existing erroneous words for
spelling correction of an individual misspelled word.
Furthermore, real life Bangla spelling errors made while typing fast often span multiple words in
a sentence. Such a real life example has been provided in Figure 1.6, where all words have been
misspelled during speed typing. The table in the figure shows, for each misspelled word, the
words to look at in order to correct it; these include the correct versions of a few other misspelled
words in the sentence. Our proposed solution should therefore be able to capture the approximate
semantic meaning of a word even if it is misspelled.
To this end, we employ a self attention based module capable of carrying out many-to-many tasks.
We modify our loss function to focus more on word level spelling for better spell checking performance.
1.4 Proposed Method Overview

We design a novel CNN architecture, namely Coordinated CNN (CoCNN), that achieves state-
of-the-art performance in Bangla language modeling with low training time. In particular,
our proposed architecture consists of two learnable convolutional sub-models: word level
(SemanticNet) and sentence level (Terminal Coordinator). SemanticNet is well suited for syllable
pattern learning, whereas Terminal Coordinator serves the purpose of word coordination learning
while maintaining positional independence, which suits the flexible word order of Bangla and
Hindi languages. It is important to note that CoCNN does not explicitly incorporate any self
attention mechanism like Transformers; rather it relies on its Terminal Coordinator sub-model for
emphasizing on important word patterns. We validate the effectiveness of CoCNN on a number
of tasks that include next word prediction in erroneous setting, text classification, sentiment
analysis and sentence completion. CoCNN achieves slightly better performance than pretrained
BERT model with 16X fewer parameters and 10X less training time. Also, CoCNN shows far
superior performance to the contemporary LSTM models.
CoCNN is an architecture suitable for many-to-one type tasks. As sentence level spelling
correction is a many-to-many type NLP task, we propose a different architecture for this task in
line with the intuition of CNN sub-model usage. We propose a word level BERT [28] based
architecture, namely BSpell. This architecture is capable of learning prior and post context
dependency through the use of the multi-head attention mechanism of stacked Transformer
encoders [29]. The proposed architecture has the following three major features.
SemanticNet: In this architecture, each input word in a sentence has to pass through learnable
CNN sub-layers so that the model can capture semantic meaning even if the words are misspelled
to a certain extent. These learnable sub-layers form SemanticNet and the word representation
obtained is named SemanticVec.
Auxiliary loss: We also introduce a specialized auxiliary loss to facilitate SemanticNet sub-model
learning in order to emphasize the importance of user writing patterns. This loss has been placed
at an intermediate level of the model instead of the final level, thus mitigating the vanishing
gradient problem prevalent in very deep architectures.
Hybrid pretraining: An important aspect of existing BERT models is the pretraining scheme.
Existing pretraining schemes [1] are designed mainly to understand language context. In order
to capture both context and word error patterns, we introduce a new pretraining scheme, namely
hybrid pretraining. Besides the word masking based fill-in-the-gaps task of contemporary
pretraining methods, hybrid pretraining incorporates a word prediction task after blurring a few
characters of some words. Such a scheme provides the BERT model with an idea of how to
retrieve a word from its noisy version.
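The following minimal Python sketch illustrates the idea behind hybrid pretraining under stated assumptions: a whitespace tokenized corpus, an illustrative [MASK] symbol and illustrative masking and blurring probabilities. It is not the exact corruption procedure used for BSpell.

import random

MASK = "[MASK]"

def hybrid_corrupt(sentence, word_mask_prob=0.15, char_blur_prob=0.15):
    """Corrupt a sentence for hybrid pretraining: some words are fully masked
    (word level masking) and some words get one character blurred (character
    level masking), so the model also learns to recover a word from a noisy version."""
    corrupted = []
    for word in sentence.split():
        r = random.random()
        if r < word_mask_prob:                        # word level masking
            corrupted.append(MASK)
        elif r < word_mask_prob + char_blur_prob:     # character level blurring
            chars = list(word)
            chars[random.randrange(len(chars))] = "*" # replace one character with a blur symbol
            corrupted.append("".join(chars))
        else:
            corrupted.append(word)
    return corrupted                                  # the prediction targets remain the original words

# Example: hybrid_corrupt("we propose a hybrid pretraining scheme")
# might return ['we', '[MASK]', 'a', 'hyb*id', 'pretraining', 'scheme']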
1.5 Research Contributions
• We propose an end to end trainable CNN architecture CoCNN based on the coordination
of two CNN sub-models.
• We introduce hybrid pretraining scheme for BSpell architecture which facilitates sentence
level spelling correction by combining word level and character level masking. Detailed
evaluation on three error datasets, including a real life Bangla error dataset, shows the
effectiveness of our approach. Our evaluation also includes detailed analysis on possible
LSTM based spelling correction architectures.
languages, existing Bangla spell checkers and LSTM based possible Bangla spell checkers.
Chapter 8, the final chapter of this thesis, provides our concluding remarks regarding this
research and some possible future directions for research in Bangla NLP.
Chapter 2
Background
This chapter focuses on the components that need to be understood before going through
our proposed method and our conducted experiments. We start this chapter with some learning
based model components such as bidirection in LSTMs, one dimensional convolution, the attention
mechanism in LSTMs and self attention in Transformers. Then we move on to describing
commonly used pretraining and word representation schemes. Finally, we provide an overview of
a synthetic spelling error generation algorithm used to prepare some of our datasets.
2.1 Unidirection and Bidirection in Many to One LSTM

[Figure 2.1: An LSTM unrolled over time, with inputs $X_0, X_1, X_2, \ldots, X_T$ and outputs $Y_0, Y_1, Y_2, \ldots, Y_T$]
In a bidirectional many-to-one LSTM, one unidirectional LSTM processes the input sequence from
$X_0$ to $X_T$, while another unidirectional LSTM does the same thing from $X_T$ to $X_0$. Outputs of
both unidirectional LSTMs are combined to get the final output. In order to get output $Y_i$, we
maximize the probability
$$P_i = P_{F_i} \times P_{B_i}$$
where $P_{F_i}$ and $P_{B_i}$ are the probabilities produced by the forward and backward LSTMs, respectively.
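As a concrete illustration, the following is a minimal PyTorch sketch (all sizes are illustrative assumptions) of a bidirectional many-to-one LSTM in which the final forward and backward hidden states are combined before the output layer:

import torch
import torch.nn as nn

class BiLSTMManyToOne(nn.Module):
    """Many-to-one model: a forward LSTM reads X_0..X_T, a backward LSTM reads
    X_T..X_0, and their final hidden states are combined for the output."""
    def __init__(self, vocab_size=5000, emb_dim=64, hidden=128, num_classes=10):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_classes)

    def forward(self, token_ids):                     # token_ids: (batch, T)
        x = self.emb(token_ids)                       # (batch, T, emb_dim)
        _, (h_n, _) = self.lstm(x)                    # h_n: (2, batch, hidden)
        h = torch.cat([h_n[0], h_n[1]], dim=-1)       # combine forward and backward states
        return torch.softmax(self.out(h), dim=-1)     # output probabilities

probs = BiLSTMManyToOne()(torch.randint(0, 5000, (4, 12)))   # shape (4, 10)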
2.2 One Dimensional CNN

[Figure 2.2: 1D convolution (Conv1D) applied over a sequence of 8 entities $X_1, X_2, \ldots, X_8$]
We demonstrate the 1D convolution operation in Figure 2.2. There are 8 entities in the input
sequence, $X_1, X_2, \ldots, X_8$, each represented with a vector of size 6. We have a single 1D
convolution filter of size 2 performing the convolution operation on this sequence containing 8
entities. This input sequence can be represented as an 8 × 6 two dimensional matrix. Since our
filter size is 2, the convolution operation is performed on two entities at a time. With a stride of 1,
the first convolution operation will be performed on $X_1, X_2$; the second convolution operation
will be performed on $X_2, X_3$; the final operation will be on $X_7, X_8$. Each convolution operation
will provide us with a scalar value as output. Thus we have a vector of 7 scalar values as a
result of 1D convolution on the sequence matrix using a 1D filter of size 2.
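The shape arithmetic above can be verified with a short PyTorch sketch; PyTorch's Conv1d expects the feature dimension as channels, so the 8 × 6 matrix is fed as 6 channels of length 8 (the random values merely stand in for real entity vectors):

import torch
import torch.nn as nn

x = torch.randn(1, 6, 8)     # the 8 x 6 sequence matrix of Figure 2.2: 8 entities, 6 features as channels

conv = nn.Conv1d(in_channels=6, out_channels=1, kernel_size=2, stride=1)
print(conv(x).shape)         # torch.Size([1, 1, 7]): one scalar per position, 7 positions for one filter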
We use the same input sequence matrix in order to explain the 1D max pooling operation in Figure
2.3. We demonstrate a pooling operation of size 2 with a stride of 2. Unlike convolution, 1D
pooling is performed column-wise. It starts with the leftmost column $C_1$. 1D pooling of size 2
with stride 2 takes the maximum of every two consecutive values of a column, so each column of
the 8 × 6 sequence matrix is reduced from length 8 to length 4.
Figure 2.3: 1D pooling operation of size 2 and stride 2 on an 8 length sequence matrix
Figure 2.4: One dimensional GlobalMaxPooling operation applied on a sequence feature matrix
Finally, we demonstrate 1D global max pooling in Figure 2.4 using the same input sequence
matrix. This operation is performed column-wise as well. We take the maximum value from
each column $C_1, C_2, \ldots, C_6$. The maximum value of each column has been marked in deep blue
in the figure. These column-wise maximum values form the output vector of 1D global max pooling.
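The same sequence matrix can be used to check both pooling operations in PyTorch (a small sketch; the random values stand in for learned features):

import torch
import torch.nn as nn

x = torch.randn(1, 6, 8)                  # 8 x 6 sequence matrix, the 6 feature columns as channels

pool = nn.MaxPool1d(kernel_size=2, stride=2)
print(pool(x).shape)                      # torch.Size([1, 6, 4]): each column reduced from length 8 to 4

global_max = torch.amax(x, dim=-1)        # maximum of each feature column, as in Figure 2.4
print(global_max.shape)                   # torch.Size([1, 6]): one value per column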
2.3 Attention Mechanism in LSTM
Figure 2.5: Attention mechanism in an encoder decoder based LSTM. Here, we assume the number
of time steps in both encoder and decoder to be 3.
The attention mechanism is popularly incorporated into encoder decoder based LSTM architectures.
Such architectures are used in sequence to sequence type tasks such as machine translation and
sentence correction. In such tasks, the number and positions of sequence entities vary greatly
between the input and output sequences. Figure 2.5 demonstrates the attention mechanism in an
encoder decoder LSTM. The encoder and decoder are two different LSTMs having different weights.
The encoder LSTM encodes the whole input sequence into a hidden state $S_0$. The output vectors
obtained from each time step of encoding are concatenated into an encoder output matrix. The
initial hidden state of the decoder LSTM is $S_0$, the summarized state of the input sequence, while
the input is $X_0$, typically a default start symbol. Suppose the output hidden state and output
probability vector from time step 0 are $S_1$ and $Y_0$, respectively. In typical encoder decoder
LSTMs, we would feed the vector $X_1$ corresponding to the maximum probability index of output
vector $Y_0$ to the decoder at time step 1. But in case of the attention mechanism, we first merge
$S_1$ and the encoder output matrix into a context vector through the use of dot product like
operations. We concatenate this context vector with $X_1$ and then feed the resulting vector to the
decoder LSTM at time step 1. Such a context vector helps the decoder get an idea of which portion
of the input sequence to focus on most in order to produce output vector $Y_1$. At time step 2, we
follow a similar procedure of context vector formation using output hidden state $S_2$ and the
encoder output matrix, then concatenate with vector $X_2$, the vector corresponding to the maximum
probability index of output $Y_1$. This process continues until we receive a stop symbol as output.
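A minimal numpy sketch of one decoder step with dot-product scoring may clarify how the context vector is formed; the sizes are illustrative and the merging operation is simplified to a plain dot product:

import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

T_enc, d = 3, 4                                # 3 encoder time steps, hidden size 4 (illustrative)
encoder_outputs = np.random.randn(T_enc, d)    # concatenated encoder output matrix
s1 = np.random.randn(d)                        # decoder hidden state produced at time step 0

scores = encoder_outputs @ s1                  # how relevant each encoder position is to this step
weights = softmax(scores)                      # attention distribution over encoder time steps
context = weights @ encoder_outputs            # context vector of size d

x1 = np.random.randn(d)                        # vector of the previously predicted word
decoder_input_step1 = np.concatenate([context, x1])   # concatenated vector fed to the decoder LSTM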
2.4 Self Attention Based Transformer Encoders
Figure 2.6: High level overview of a Transformer based many to one model

[Figure 2.7: Inside a Transformer encoder block]
We now take a peek inside the Transformer encoder block in Figure 2.7. The unique feature of this
encoder is the self attention mechanism based on the multi-head self attention block shown in the
figure. Such attention provides a representative vector for each input entity based on its relationship
with all other input entities and itself. These vectors are then layer normalized and passed to a
dense layer. It is to be noted that the dense layer applied on each vector is the same layer. The final
self attention based vector representations of the input entities come out of these dense layers.
[Figure 2.8: Multi-head self attention with three heads. The input matrix $X$ (rows $X_1$ and $X_2$, size 2 × 4) is multiplied with query, key and value weight matrices $QW_i$, $KW_i$, $VW_i$ (each 4 × 3) to form $Q_i$, $K_i$, $V_i$; each head produces a matrix $Z_i$ of size 2 × 3; the heads are concatenated into a 2 × 9 matrix and passed through a 9 × 4 dense layer]
Now let us peek inside the multi-head self attention block. We demonstrate this block in Figure
2.8 with two input vectors $X_1$ and $X_2$. The multi-head self attention block, as its name suggests,
consists of multiple branches for applying the attention mechanism on the inputs. In the provided
figure, there are 3 attention heads; there obviously can be more. Let us describe the first attention
head in detail, as all of them follow the same principle. We have three types of weight matrices
in an attention head: the query ($QW_1$), key ($KW_1$) and value ($VW_1$) weight matrices. These are
all learnable parameters updated during training. In the figure, we have three 4 × 3 weight matrices
for demonstration purposes. We take the dot product of our input matrix $X$ of size 2 × 4 with
$QW_1$, $KW_1$ and $VW_1$ separately to obtain three matrices $Q_1$, $K_1$ and $V_1$, respectively, each of
size 2 × 3. These three obtained matrices are merged into a single matrix $Z_1$ of size 2 × 3 (the
details of this process are described in the next paragraph). We obtain $Z_2$ and $Z_3$ matrices of the
same size in the same manner from the other two attention heads. These three matrices are now
concatenated to obtain a single matrix of size 2 × 9. This concatenated matrix is then passed
through a dense layer, where the node number of the dense layer is the same as the vector size of
$X_1$ and $X_2$, that is 4 in our case. Finally, after the application of this dense layer, we obtain two
modified vectors for $X_1$ and $X_2$, which are of the same size as their original vector representations.
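The shapes described above (and the per-head computation detailed next) can be reproduced with a small numpy sketch; the random matrices stand in for learned weights, and the usual scaling by the square root of the key dimension is omitted to mirror the description:

import numpy as np

def softmax(m):
    e = np.exp(m - m.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

X = np.random.randn(2, 4)                       # two input vectors X1 and X2, each of size 4

def attention_head(X, d_head=3):
    QW, KW, VW = (np.random.randn(4, d_head) for _ in range(3))   # query/key/value weight matrices
    Q, K, V = X @ QW, X @ KW, X @ VW            # each of size 2 x 3
    scores = Q @ K.T                            # 2 x 2: relevance of every input to every input
    return softmax(scores) @ V                  # Z matrix of size 2 x 3

Z = np.concatenate([attention_head(X) for _ in range(3)], axis=1)  # three heads -> 2 x 9
W_dense = np.random.randn(9, 4)                 # dense layer back to the original vector size
output = Z @ W_dense                            # modified vectors for X1 and X2
print(output.shape)                             # (2, 4)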
Figure 2.9: Looking at self attention mechanism prevalent in a single head of Transformer
multi-head attention
Now let us try to understand the intuition behind the dot products that we mentioned involving the
query, key and value matrices, and the merging procedure in a single Transformer attention head. A
detailed demonstration has been provided in Figure 2.9. We already know from the previous
discussion that we obtain three vectors $Q_{11}$ (row 1 of matrix $Q_1$ of Figure 2.8), $K_{11}$ and $V_{11}$ from
input vector $X_1$ using the three weight matrices of attention head one. Similarly, from $X_2$ we
obtain $Q_{12}$, $K_{12}$ and $V_{12}$. We want to obtain $Z_{11}$, the modified vector of $X_1$ produced by
applying attention head one on $X_1$. We take the dot product of query vector $Q_{11}$ of $X_1$ with the
key vectors of both $X_1$ and $X_2$. We get 20 and 5, respectively. Now we turn these values into a
probability distribution using Softmax activation on them, obtaining, say, 0.8 and 0.2, respectively.
Vector $T_{11}$ is obtained by multiplying $V_{11}$ with 0.8, while vector $T_{12}$ is obtained by multiplying
$V_{12}$ with 0.2. These two modified vectors are now added to obtain $Z_{11}$ (the first row of matrix $Z_1$
in Figure 2.8). Similarly, if we want to obtain $Z_{12}$, the modified vector of $X_2$ using attention
head one, we must follow the same procedure except that the dot products now have to be
performed using $Q_{12}$ instead of $Q_{11}$. So, this is how self attention works: modifying each input
vector with respect to the other input vectors using the query, key and value weight matrices.
2.5 BERT Architecture

[Figure 2.10: Mask based pretraining with a stack of 12 Transformer encoders (Transformer Encoder 1 through Transformer Encoder 12)]
Earlier pretraining schemes based on conventional language models were not sample efficient. For
each word in a sentence, it was required to generate a new sample for training, which ultimately
resulted in an outburst of the pretraining sample number. The BERT pretraining scheme for the
very first time provided a masking based pretraining scheme that does not increase the pretraining
sample number beyond the original corpus sentence number and at the same time provides
bidirectional pretraining to the model. A sample mask based pretraining has been shown in Figure
2.10. There are $N$ words in the input sentence. Among them, $Word_2$ has been randomly replaced
by a default mask word. At the output side, BERT needs to predict that this masked word is indeed
$Word_2$ in addition to keeping the other words intact. It is a sort of fill-in-the-gaps task through
which our language model learns the context and grammatical structure of the language. Generally,
15% of the words of a sentence are masked randomly. The BERT Base model consists of 12
Transformer encoders, while the BERT Large model consists of 24 encoders. Each Transformer
encoder has 12 attention heads and 768 hidden units in its dense layers.
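A minimal Python sketch of this masking idea follows; the mask symbol and sampling strategy are simplified illustrations rather than BERT's exact recipe, which also occasionally replaces selected words with random words or keeps them unchanged:

import random

MASK = "[MASK]"

def mask_for_pretraining(words, mask_prob=0.15):
    """Build a pretraining sample: roughly 15% of the words are replaced by [MASK];
    the model must predict the original words while keeping the rest intact."""
    inputs = []
    for w in words:
        inputs.append(MASK if random.random() < mask_prob else w)
    labels = list(words)                  # the targets are always the original words
    return inputs, labels

# Example with the sentence of Figure 2.10: Word2 may end up masked.
print(mask_for_pretraining(["Word1", "Word2", "Word3", "WordN"]))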
2.6 Word Representation Schemes

[Figure 2.11: Neural network structure for Word2vec formation with a window size of 3: a V-dimensional one-hot input layer, a D-dimensional hidden layer and two V-dimensional output layers]
The learning scheme of FastText is similar to that of Skip-gram Word2vec. Word2vec formation
requires specifying a window size. The word representation should be such that words of similar
meaning have close (small Euclidean distance) vector representations. Words of similar meaning
are generally used to express similar contexts; that means their neighbouring words are more or
less similar in a large corpus. The window size denotes the neighbours to consider in order to
form the Word2vec of a word. Figure 2.11 shows the neural network structure of Word2vec
formation for a corpus containing $V$ words. We consider a window size of 3 (the word of interest,
its right neighbour word and its left neighbour word). We want our Word2vec vector size to be $D$.
In that case, our input is a $V$-sized one hot vector of the word we are interested in. We use a
hidden layer containing $D$ nodes, which means our hidden layer weight matrix is of size
$V \times D$. The output layer consists of two parallel branches: one for the left neighbour and the
other for the right neighbour word. We provide the one hot vectors of the neighbour words in a
sentence as the ground truth output during training. After the end of training, the $i$th row of our
hidden layer weight matrix will be the Word2vec representation of the $i$th word in the corpus.
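A minimal PyTorch sketch of this network under assumed sizes follows; V, D and the use of a full output layer are illustrative, since practical Word2vec training uses more efficient objectives such as negative sampling:

import torch
import torch.nn as nn

V, D = 10000, 100                                # vocabulary size and Word2vec dimension (illustrative)

class SkipGramWindow3(nn.Module):
    """One-hot word in, D-dimensional hidden layer, and two parallel V-dimensional
    outputs scoring the left and right neighbour words (window size 3)."""
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(V, D, bias=False)    # hidden weight matrix whose entries become Word2vec
        self.left = nn.Linear(D, V)                  # branch predicting the left neighbour
        self.right = nn.Linear(D, V)                 # branch predicting the right neighbour

    def forward(self, one_hot):                      # one_hot: (batch, V)
        h = self.hidden(one_hot)                     # (batch, D)
        return self.left(h), self.right(h)           # neighbour word scores

model = SkipGramWindow3()
# nn.Linear stores the weight as (D, V), so the vector of word i is the i-th column.
word2vec_of_word_i = model.hidden.weight[:, 42]      # D-dimensional representation of word 42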
The idea of FastText is based on character N-gram representation learning instead of full word
representation learning. There are millions of words in a corpus, where many of them are
rarely used. Most of those rarely or uncommonly used words are actually a combination of
some common sub-words which we call character N-grams (N denotes number of characters
in the sub-word).

[Figure 2.12: Character 3-gram breakdown of the word <artificial> into <ar, art, rti, tif, ifi, fic, ici, cia, ial, al>, considered with a window size of 5]

The concept of training and model architecture is the same as Word2vec,
only this time we are considering common sub-words. Figure 2.12 demonstrates character
3-gram based FastText formation where we consider a window size of 5. We first show the
break down of the word artificial into size-3 sub-words. With a window size of 5, we consider
the immediate couple of sub-words to the left and to the right of our sub-word of interest. In
FastText, we consider different values of N and take the common character N-grams for FastText
representation formation in a way similar to Word2vec formation. Ultimately, the FastText
representation of a full word is formed from the FastText representations of its constituent common
sub-words through addition, averaging or some other operation.
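The character n-gram breakdown of Figure 2.12 can be reproduced with a few lines of Python (a sketch; FastText additionally hashes the n-grams and uses several values of N):

def char_ngrams(word, n=3):
    """Break a word into character n-grams, using '<' and '>' as boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("artificial"))
# ['<ar', 'art', 'rti', 'tif', 'ifi', 'fic', 'ici', 'cia', 'ial', 'al>']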
2.7 Overview of Synthetic Spelling Error Generation
Figure 2.13: Bangla synthetic error generation algorithm high level overview
Resources for real life labeled Bangla spelling errors are limited. Native Bangla speakers most
commonly use an English QWERTY keyboard to write Bangla with the help of writing assistant
software. Because of the phonetic similarity among different Bangla letters, the presence of
compound letters in Bangla words and the heavy use of modified vowels, Bangla typing is
inherently error prone. For producing synthetic but realistic Bangla spelling errors in words, a
probabilistic error generation algorithm was proposed in [30] in light of popular Bangla
misspellings. Figure 2.13 provides a
high level overview of this algorithm. Given a word, the algorithm first detects all compound
and simple/ base letters, as it handles these two types of letters differently. For compound letters,
there are some position dependent probabilistic error generation rules that take a generalized
approach for error generation in many Bangla compound letters. For the detected base letters, the
algorithm introduces two types of errors. The first type of error generation utilizes phonetically
similar clusters of Bangla base letters formed through detailed investigation. The main idea is that
a letter can be replaced with one of the other letters of the same cluster with some probability.
The second type of error includes errors produced from accidental mispress of keyboard. This
type of error contains accidentally missed modified vowels as a form of deletion. As a form
of insertion, this second type of error produces those letters that result from pressing keys
neighbouring the intended key. A pair of realistic spelling errors for each of 10 Bangla words has
been provided in Figure 2.14.
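The sketch below conveys the spirit of the two base-letter error types in a heavily simplified, romanized form; the clusters, keyboard map and probabilities are purely illustrative placeholders and not the actual resources of the algorithm in [30]:

import random

PHONETIC_CLUSTERS = [{"s", "sh"}, {"j", "z"}]         # illustrative stand-ins for Bangla phonetic clusters
KEYBOARD_NEIGHBOURS = {"a": "qsz", "o": "ip"}         # illustrative QWERTY neighbour map

def corrupt_letter(ch, p_phonetic=0.2, p_keyboard=0.1):
    """Probabilistically replace a base letter with a phonetically similar one
    or with a neighbouring key, mimicking the two base-letter error types."""
    r = random.random()
    if r < p_phonetic:                                # phonetic-cluster substitution
        for cluster in PHONETIC_CLUSTERS:
            if ch in cluster:
                return random.choice([c for c in cluster if c != ch])
    elif r < p_phonetic + p_keyboard and ch in KEYBOARD_NEIGHBOURS:
        return random.choice(KEYBOARD_NEIGHBOURS[ch]) # accidental press of a neighbouring key
    return ch

def make_error_word(word):
    return "".join(corrupt_letter(ch) for ch in word)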
Figure 2.14: Some sample error words generated by error generation algorithm
Chapter 3
Literature Review
Our research focuses on Bangla language modeling and spelling correction. This chapter includes
researches related to next word prediction, sentence completion, effective word representation,
word and sentence level spelling correction and text classification in Bangla, English and Chinese.
State-of-the-art natural language processing related researches are being conducted in English
and Chinese, hence the inclusion of these two languages.
3.1 Language Modeling Related Researches

3.1.1 LSTM Based Language Modeling

Sequence order information based statistical RNN models such as LSTM and GRU are popular
for language modeling tasks [31]. The research in [31] proposes a modified version of the original RNN
model. Their proposed approach led to significant speed up during training and testing. The paper
also discussed several aspects of backpropagation through time prevalent in RNN based models.
One study showed the effectiveness of LSTM for English and French language modeling [32].
The research analyzed the deficiency of RNN models in terms of training inefficiency and
in terms of previous context learning. They also showed the perplexity performance metric
improvement over RNN while employing long short term memory (LSTM). The regularizing
effect on LSTM was investigated in [33]. They proposed a variant of dropout, that is weight-
dropped LSTM which utilized DropConnect on hidden layer weight matrices as a variant of
regularization in recurrent type neural networks. In addition, they introduced NT-ASGD as a
regularization oriented optimizer. This proposed optimizer is a variant of the averaged stochastic
gradient method, where the average calculation is triggered based on a specific non-monotonic
condition.
State-of-the-art LSTM models learn sub-word information at each time step. A technique
for learning sub-word level information from data was proposed in [34]. Popular word vector
representations use an entire word as the smallest unit of representation learning. Such approach
shows limitations for languages with large and inflected vocabulary and a lot of rare words. Rare
words have little presence in the corpus which makes their vector representation inappropriate.
Moreover, word level language models could not deal with new unseen words before the advent
of this research. These words were called out-of-vocabulary (OOV) words by the authors.
They analyzed various techniques on character level modeling tasks such as smoothed N-gram
models, different types of neural networks including RNN based architectures and a maximum
entropy model. They found character level modeling to be superior to smoothing techniques.
A morphological information oriented character N-gram based word vector representation was
proposed in [35] later on. This research proposed a Skip-gram based learning scheme where word
representation consists of a group of character N-grams. They learned the vector representation
of these character N-grams instead of the whole words in the corpus. The vector representation
of a word was obtained using sum of the character N-grams found in that word. The proposed
method was run time efficient even in large corpus. The proposed algorithm can provide decent
vector representation for unseen words in training samples as well, which was not possible
before this research. This word representation was further improved and given the name
Probabilistic FastText [36]. This scheme could express multiple meanings of a word, sub-word
structure and uncertainty related information during misspelling. Each word was represented
using a Gaussian mixture density. Each mixture component of a word was responsible for
a particular meaning of that word along with some uncertainty. This was the first scheme to
achieve multi-sense representation of words and showed better performance than vanilla FastText.
A detailed evaluation on sub-word aware model was performed in [15]. Their focus was 50
morphologically rich languages which display high type-to-token ratios. The authors proposed a
new method for injecting subword level information into word semantics representation vector.
They integrated this representation into neural language modeling to facilitate word level next
word prediction task. The experiments showed significant perplexity gains in all 50 diverse and
morphologically rich languages.
Character aware learning based word representation scheme was integrated in LSTM models for
the very first time in [14]. They provided a simple neural language model which required only
character level inputs.
3.1.2 CNN Based Language Modeling

1D versions of CNNs have become popular recently for text classification oriented tasks [11, 38, 39].
A densely connected CNN armed with multi-scale feature attention was proposed in [11] for text
classification. Contemporary CNN models use convolution filters of fixed kernel size. Hence
they have limitations in learning variable length N-gram features and lack flexibility. The authors
of [11] presented a dense CNN assisted by multi-scale feature attention for text classification.
Their proposed CNN showed performance comparable to contemporary state-of-the-art baselines
on six benchmark datasets. Detailed evaluation and investigation were performed on deep CNN
models in [38]. Although CNN based text classification gained high performance, the power and
efficacy of deep CNNs had not been investigated. Often, a large number of labeled samples related
to real world text classification problems is not available for training deep CNNs. Though
transfer learning is popular for NLP related tasks such as next word prediction and document
classification in LSTM and Transformer based models, the power of transfer learning was not
well studied on 1D CNNs before the advent of the currently discussed research. The importance
of depth in CNN for text classification was further studied in [39] both for character level and
for word level input. The authors used 5 standard text classification and sentiment analysis task
datasets for their experiments. They showed that deep models perform better than shallow models
when the text input is given as a sequence of characters. On the contrary, when the input is fed as
a sequence of words, simple shallow but wide networks such as DenseNet outperformed
deep architectures.
Although the application of CNNs to computer vision tasks has been popular for a long time,
CNN application in language modeling has been studied only recently, after demonstrations of the
ability of CNNs to extract language model features at a high level of abstraction [4]. The authors
analyzed the application of CNNs in language modeling, which is a sequence oriented prediction
task requiring extraction of both short and long range information. The authors showed that CNN
based architectures outperformed classic recurrent neural networks (RNN), but their performance
was below that of state-of-the-art LSTMs. They found out that CNNs were capable of looking at
up to 16 previous words for predicting the next target word. Furthermore, dilated convolution
was employed in Bangla language modeling with a view to solving the long range dependency
problem [40]. In this work, the authors first analyzed the long range dependency and vanishing
gradient problems of RNNs. Then they proposed a dilated CNN which enforces separation of
time steps during convolution, the main way of feature extraction for 1D CNNs. In this way,
they improved the performance of Bangla language modeling.
3.1.3 Self Attention Based Language Modeling

Self attention based Transformers have become the state-of-the-art mechanism for sequence to
sequence modeling in recent years. A simple architecture entirely based on self attention and
devoid of all recurrence and convolution type layers was proposed in [17], which came to be
known as Transformer for the very first time. This architecture was based on multi-head self
attention based encoders and decoders. The authors implemented their proposed Transformer
architecture on machine translation task where they achieved state-of-the-art performance.
A recent research investigated deep autoregressive Transformers for the speech recognition
task [18]. The research focused on two topics. First of all, they analyzed the configuration of
Transformer architecture. They optimized the architecture for both word-level speech recognition
and for byte-pair encoding (BPE) level end-to-end speech recognition. Second, they showed that
positional encoding was not a necessity in Transformers. A deep Transformer architecture was
proposed in [16] for character level language modeling. The authors identified a few challenges
in character level language modeling such as requirement of learning a large vocabulary of words
from scratch, longer time steps bringing in longer dependency learning requirement and costlier
computation because of longer sequences. This research employs a deep Transformer model
consisting of 64 Transformer modules for character level language modeling. In order to speed
up convergence, the authors imposed auxiliary loss in three specific positions - (i) at intermediate
sequence positions, (ii) from intermediate hidden representations, and (iii) at target positions.
Their approach achieved state-of-the-art performance on two character level language modeling
datasets. A recent study has focused on effective parameter reduction in Transformers [19].
They pointed out that the multi-head self attention mechanism of Transformer encoders and
decoders is the limiting factor when it comes to resource constrained deployment of the model.
The authors based their ideas on tensor decomposition and parameter sharing, and proposed a
novel self-attention model namely Multi-linear attention.
A new language representation model was proposed in [41] based on Transformer encoders.
The authors named it as Bidirectional Encoder Representations from Transformers or BERT in
brief. BERT was designed for pretraining Transformer encoders in an efficient manner where
both prior and post context could be present. The pretrained BERT model can be fine tuned
with the addition of only one output layer specific to the intended downstream task. Another
research focused on the appropriate pretraining of BERT [42]. As pretraining is computationally
expensive and offers significant performance improvement, the authors performed a detailed analysis
of appropriate hyperparameter choices for BERT in this regard. They also investigated
training specific choices such as batch size and proper masking. Their pretraining scheme
on BERT achieved state-of-the-art results on GLUE, RACE and SQuAD. As an alternative
to masked language modeling based BERT pretraining, a more sample-efficient pre-training
scheme was proposed in [43] known as replaced token detection. Instead of masking input words
randomly, this pretraining method corrupted the input by replacing some words with alternative
words generated from a small generator network. A discriminative model had to predict whether
each word in the sentence is original or corrupted by generator model. The discriminator was
trained using the predictions made. In conditions where only a small amount of data is available
for pretraining, this generator discriminator based pretraining scheme outperformed BERT by a
wide margin.
3.2 Spell Checker Related Researches

3.2.1 Existing Bangla Spell Checkers

Several studies on Bangla spell checker algorithm development have been conducted in spite of
Bangla being a low resource language. A phonetic encoding was proposed for Bangla in [20]
that had the potential to be used by spell checker algorithms. Word encoding was performed
using Soundex algorithm. This algorithm was modified according to Bangla phonetics. The
same was done for Bangla compound characters as well. Finally, the authors demonstrated a
prototype of Bangla spell checker which utilized the proposed phonetic encoding to provide
correct version of misspelled Bangla words. Another potential spell checker was proposed
in [21] where the main goal was to search and match named entities in Bangla. The authors
identified two major challenges in this research that include the large gap between script and
pronunciation in Bangla letters and different origins of Bangla names rooted in religious and
indigenous background. They presented a Double Metaphone encoding for Bangla names that
was a modification of the method proposed in [20] where major context-sensitive rules and
consonant clusters for Bangla language were taken into account. This research proposed an
algorithm that used name encoding and edit distance to provide promising suggestions of similar
names. Another recent research proposed a clustering based approach for Bangla spelling error
detection and correction. The proposed algorithm reduced both search space and search time.
The spell checker demonstrated was able to handle both typographical errors and phonetic errors.
The authors compared their proposed spell checker with Avro and Puspa, two renowned Bangla
writing tools where the proposed algorithm showed relatively better performance. An N-gram
based method was proposed in [23] for sentence level Bangla word correctness detection. They
experimented with unigram, bigram and trigram. Such probabilistic approach required a large
corpus for defining the probabilistic inference scheme. The advantage was that the proposed
approach was language independent and did not require any explicit knowledge of Bangla
grammar for detecting semantic or syntactic correctness.
A recent study on low resource languages has included Hindi and Telugu spell checker
development, where mistakes were assumed to be made at character level [24]. We include
this research in this section as the two mentioned Indic languages are Sanskrit originated just
like Bangla and have similar sentence and word structure. The authors used attention based
encoder decoder modeling as their approach. The input is a sequence of characters that make up
a sentence containing misspelled words, whereas the output is another sequence of characters
which is supposed to hold all the correct versions of the misspelled words of the input. The
correctly spelled words are kept intact. During the decoding phase, each word production is
supposed to focus on a single word or a couple of words in the worst case (during accidental
word splitting type spelling error). Hence attention mechanism comes in handy during character
level spelling correction. The authors also proposed a synthetic error generation algorithm for
Hindi and Telugu language as part of this research.
3.2.2 State-of-the-art Neural Network Based Chinese Spell Checkers

A year later, another research on Chinese spell checking conducted experiments on these same
three human annotated spelling error datasets and achieved better results [26]. Contemporary
methods were focusing on incorporating Chinese character pair similarity knowledge through
external input or through heuristic based algorithms. On the contrary, this research incorporated
phonological and visual similarity knowledge into their learning based language model using
a specialized graph convolutional network namely SpellGCN. SpellGCN built a graph using
the Chinese character knowledge domain. This graph learned to map a character to its correct
version acting as an inter-dependent character classifier. Such classifier was applied to character
representation extracted from a BERT based network where both the networks were trained
end to end. The latest Chinese spell checker was BERT based and did not incorporate any
external knowledge or confusion set related to Chinese characters [27]. The authors proposed a
soft masking scheme based on Gated Recurrent Unit (GRU) integrated with BERT. BERT was
pretrained using masking based scheme at character level. While performing spelling correction
on error dataset, the GRU sub-model first provided a confidence score for each character on
how likely it was to be a noisy character. Based on this likelihood some masking was added to
this character. BERT based prediction was then made on this modified character. Both GRU
and BERT were trained end to end. Detailed evaluation on two benchmark datasets showed the
superiority of this soft masked BERT as Chinese spell checker.
3.2.3 Researches Indirectly Connected to Spell Checking

Although not directly connected to this research, a few other researches conducted on English
language can be mentioned. A preprocessing step was proposed in [46] that aimed to change
English text at the word, sentence and paragraph level in such a way that it would have greater
resemblance to the distribution of standard English while preserving most of the original semantic meaning.
Such change would help in different NLP downstream tasks performed on large data obtained
from the wild. So, the input was a corrupt text, while the output was a projection to the target
language domain. The authors used a sequence to sequence attention based language model [47]
in this research and showed that their transformation was indeed effective. Another research
developed a spelling correction system for medical text [48]. The authors pointed out the
importance of such specialized spell checker to ensure the correct interpretation of medical
documents as health records are getting digitized day by day. Their algorithm was based on
Shannon’s noisy channel model which utilized an enriched dictionary compiled from variety of
resources. This research also used named entity recognition to avoid the correction of patient
names. The spell checker was applied to three different types of free texts such as clinical notes,
allergy entries, and medication orders. The proposed algorithm achieved high performance in
all three of these domains. A deep learning based approach for Bangla sentence correction was
proposed in [49]. Here sentence correction was performed as a form of word rearrangement to
better express the semantic meaning. They used encoder decoder based sequence to sequence
modeling for correcting Bangla sentences. Their model was LSTM based. Such models have
gained popularity for tasks such as machine translation and grammar correction. As part of
this research, the authors also constructed a benchmark dataset of misarranged words which is
expected to faciliate future research in similar domain. A research on Chinese language focused
on errors made by people learning Chinese as a second language [50]. The authors claimed that
such errors originate from language transfer and pointed out four major types of errors in the paper
- lexical choice error, redundancy, omission, and word order. Their main focus was word order
type error as it was identified as the most prominent source of error. The research proposed a
relative position language model and a parse template language model for error correction.
Chapter 4
This chapter describes the different components of our proposed Coordinated CNN (CoCNN)
architecture aimed at Bangla language modeling tasks such as word prediction, sentence
completion and text classification. The chapter also explains the theory and intuition behind the
inclusion of these novel components.
[Figure 4.1: Overview of the CoCNN architecture. At the word level, the character embeddings $C_1, C_2, \ldots, C_m$ (each of length $lenC$) of a word $Word_i$ are concatenated and passed through the SemanticNet sub-model, a sequential 1D CNN, whose flattened output is the SemanticVec $F_i$ of length $lenF$. At the sentence level, the SemanticVecs $F_1, F_2, \ldots, F_n$ are concatenated into matrix $M$ and passed through the Terminal Coordinator sub-model, a sequential 1D CNN with convolutions of kernel size 3, whose flattened coordination vector feeds a dense and softmax layer that produces the output]

4.2 SemanticNet Sub-Model
A 1D convolution operation starting from vector $\vec{C}_j$ of $W_i$ using a convolution filter of kernel
size $K$ can be expressed as the function $1D\_Conv([\vec{C}_j, \vec{C}_{j+1}, \ldots, \vec{C}_{j+K-1}], filter_K)$. The
output of this function is a scalar value. While using a stride of $s$, the next convolution operation
will start from vector $\vec{C}_{j+s}$ of $W_i$ and will provide us with another scalar value. Such
convolution operations are performed until we reach the end of the word. Thus we get a vector as
a result of applying one convolution filter on matrix $W_i$. It is to be noted that we have performed
same type convolution in all our convolution layers, which employs extra padding in order to
produce an output of the same shape as that of the input matrix passed through the convolution
layer. A convolution layer has multiple such convolution filters. After passing matrix $W_i$ through
the first convolution layer we obtain feature matrix $W_i^1$. Passing $W_i^1$ through the second
convolution layer provides us with feature matrix $W_i^2$. So, the $L$th convolution layer provides us
with feature matrix $W_i^L$.
SemanticNet sub-model consists of such 1D convolution layers standing sequentially one after
the other. Such design helps in learning both short and long range intra-word level features.
Convolution layers near matrix $W_i$ are responsible for identifying key sub-word patterns of $Word_i$, while convolution layers further away focus on different combinations of these key sub-word patterns. Such word level local pattern recognition plays a key role in identifying the semantic meaning of a word irrespective of inflection or the presence of spelling errors. Each subsequent
convolution layer has non-decreasing filter size and increasing number of filters. As we move
to deeper convolution layers, our extracted features become more and more high level and
abstract. That means combining and finding important patterns from them becomes more and
more difficult with the increase of depth. This is the reason behind the increasing number of
filters with increasing depth. Each intermediate layer output is batch normalized, keeping the values passing from one layer to another within bounds. The final convolution layer output matrix $W_i^L$ is flattened into a vector $F_i$ of size $len_F$; $F_i$ is the SemanticVec representation of $Word_i$. We could instead apply a one dimensional global max pooling operation to matrix $W_i^L$, but global max pooling only keeps the maximum value from each column of $W_i^L$; such feature reduction may ultimately cause information loss. We obtain a SemanticVec representation from each of our input words in a similar fashion by applying the same
SemanticNet sub-model. That means, there are many words in a sentence, but there is only one
SemanticNet sub-model used on all the input words. It is to note that SemanticVec representation
obtained from each input word does not require any extra representative vector storage unlike
word2vec and FastText. Simply storing the trained SemanticNet sub-model is enough irrespective
of the total number of words in the vocabulary.
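The following minimal Keras sketch illustrates how such a SemanticNet-style sub-model could be assembled; the character vocabulary size, embedding width $len_C$, word length and the per-layer kernel/filter settings used here are illustrative assumptions rather than the exact values of our implementation.

import tensorflow as tf
from tensorflow.keras import layers, models

MAX_CHARS = 20      # assumed maximum number of characters per word (padded)
CHAR_VOCAB = 80     # assumed Bangla character vocabulary size
LEN_C = 32          # assumed character embedding width (len_C)

def build_semantic_net():
    chars = layers.Input(shape=(MAX_CHARS,), dtype="int32")
    # Matrix W_i: concatenated character embeddings of one word
    x = layers.Embedding(CHAR_VOCAB, LEN_C)(chars)
    # Sequential 'same' 1D convolutions; deeper layers use more filters
    for kernel, filters in [(2, 32), (3, 64), (3, 64), (4, 96)]:
        x = layers.Conv1D(filters, kernel, padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)   # keep intermediate values within bounds
    # Flatten the final feature matrix W_i^L into the SemanticVec F_i
    semantic_vec = layers.Flatten()(x)
    return models.Model(chars, semantic_vec, name="semantic_net")

Since the same instance is applied to every input word, storing this single trained sub-model suffices regardless of vocabulary size.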
The purpose of the Terminal Coordinator sub-model is to find out important patterns formed by word clusters. So, this sub-model acts at the inter word level, unlike the SemanticNet sub-model that works at the intra word level. For $n$ words $Word_1, Word_2, \ldots, Word_n$, we obtain $n$ SemanticVecs $\vec{F}_1, \vec{F}_2, \ldots, \vec{F}_n$, respectively, using the SemanticNet sub-model. Each SemanticVec is of size $len_F$. Concatenating these SemanticVecs provides us with matrix $M$ (details shown in the middle right portion of Figure
4.1). Applying 1D convolution on matrix M facilitates the derivation of key local patterns
found in the input sentence/paragraph, which is crucial for output prediction. A 1D convolution operation starting from $\vec{F}_i$ using a convolution filter of kernel size $K$ can be expressed as the function $1D\_Conv([\vec{F}_i, \vec{F}_{i+1}, \ldots, \vec{F}_{i+K-1}], filter_K)$. The output of this function is a scalar value. Such convolution operations are performed until we reach $\vec{F}_n$, the last SemanticVec of matrix $M$. Thus we get a vector as a result of applying one convolution filter on matrix $M$. Here as well, we use 'same' convolution in all convolution layers. Each of these convolution layers has multiple such convolution filters. After passing matrix $M$ through the first convolution layer we obtain feature matrix $M^1$. Passing $M^1$ through the second convolution layer provides us with feature matrix $M^2$. So, the $L$th convolution layer provides us with feature matrix $M^L$. Thus, the Terminal Coordinator is a sequential 1D CNN sub-model with a design similar to the SemanticNet sub-model but with a different set of weights. The difference is that this sub-model is employed on only one matrix $M$, unlike SemanticNet which has to be employed on each individual word matrix. Convolution layers near $M$ are responsible for identifying key
word clusters, while convolution layers further away focus on different combinations of these
key word clusters important for sentence or paragraph level local pattern recognition. The final
output feature matrix obtained from the 1D CNN sub-model of Terminal Coordinator is flattened
to obtain the Coordination vector, a summary of important information obtained from the input
word sequence in order to predict the correct output. This approach of information summarization has proved to be very effective in our experiments. However, CoCNN has one limitation: the fixed maximum input length problem. Both self attention oriented
Transformer based architectures and our proposed CoCNN require a maximum input length
to be specified for many to one tasks such as next word prediction and text classification. In
fact, each input sequence no matter how small, has to be padded to the length of maximum
sequence as CoCNN requires fixed length input. Now let us understand the reason behind this
fixed length requirement. For many to one tasks, the size of the Coordination vector obtained
from the Terminal Coordinator sub-model depends on the number of SemanticVecs, which in
turn depends on number of input words. Now, at least one output dense layer has to be applied
on the Coordination vector to get the final output. The problem is that the number of rows of
dense layer weight matrix has to be equal to Coordination vector length. As the dense layer
weight matrix size has to be fixed, Coordination vector length (row number of dense layer weight
matrix) also has to be fixed, which in turn requires the number of SemanticVecs to be fixed, which in turn requires the number of input words per sample to be fixed.
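Building on the previous sketch, the following illustrative Keras code shows how the shared SemanticNet and the Terminal Coordinator could be wired into a CoCNN-style many to one model. The kernel sizes and filter counts of the Terminal Coordinator follow the values reported in the hyperparameter specification of Chapter 5, while the fixed input length, word length and vocabulary size are assumptions.

import tensorflow as tf
from tensorflow.keras import layers, models

MAX_WORDS = 15      # assumed fixed maximum input length (shorter inputs are padded)
MAX_CHARS = 20      # must match the SemanticNet sketch above
TOP_WORDS = 13000   # assumed output vocabulary size (top words + UNK)

def build_cocnn(semantic_net):
    # A sentence enters as a (words x characters) integer matrix
    sentence = layers.Input(shape=(MAX_WORDS, MAX_CHARS), dtype="int32")
    # The single shared SemanticNet is applied to every word;
    # stacking the resulting SemanticVecs forms matrix M
    m = layers.TimeDistributed(semantic_net)(sentence)
    # Terminal Coordinator: a sequential 1D CNN over the SemanticVecs
    x = m
    for kernel, filters in [(2, 32), (3, 64), (3, 64), (3, 96), (4, 128), (4, 196)]:
        x = layers.Conv1D(filters, kernel, padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)
    coordination_vec = layers.Flatten()(x)    # the Coordination vector
    output = layers.Dense(TOP_WORDS, activation="softmax")(coordination_vec)
    return models.Model(sentence, output, name="cocnn")

The dense output layer acting on the flattened Coordination vector is exactly why the input word count has to be fixed, as explained above.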
4.4 Attention to Patterns of Significance
[Figure 4.2: A 1D convolution filter of kernel size 3 applied to a sequence of 7 entities $S_1, S_2, \ldots, S_7$ produces outputs $O_1, O_2, \ldots, O_5$; in the left panel the pattern of interest ($S_5, S_6, S_7$) sits at the end of the sequence, while in the right panel the same pattern sits at the beginning.]
There is no explicit attention mechanism in CoCNN unlike self attention based Transformers
[17, 18] or weighted attention based LSTMs [47, 51]. In self attention, each input word
representative vector is modified using a weighted sum of all input word representative vectors in accordance with its relationship to the other words. On the contrary, in weighted attention based
LSTMs the outputs obtained in each time step are weighted and added, which is in turn used
during output prediction as appropriate context information. In brief, attention mechanism is
important for obtaining the importance of each input word in terms of output prediction. Figure
4.2 demonstrates an over simplified notion of how 1D convolution implicitly imposes attention
on a sequence containing 7 entities - S1 , S2 , . . . S7 . Each of these entities is a character for
SemanticNet sub-model, while each entity is a word for Terminal Coordinator sub-model. After
employing a 1D convolution filter of kernel size 3 on this sequence representation, we obtain
a vector containing values O1 , O2 , . . . O5 , where we assume O5 and O1 to be the maximum
for the left and the right figure, respectively. O5 (left figure) is obtained by convolving input
entity S5 , S6 and S7 situated at the end of input sequence. We can say that these three input
entities have been paid more attention to by our convolution filter than the other 4 entities. In the
right figure, S5 , S6 and S7 are situated at the beginning of the input sequence. Our convolution
filter gives similar importance to this pattern irrespective of its position change. Such positional
independence of important patterns helps in Bengali and Hindi language modeling where input
words are loosely bound and words themselves are highly inflected. In CoCNN, such attention
is imposed on characters of each word during SemanticVec representation generation of that
word using SemanticNet, while a similar type of attention is imposed on the words of our input sentence/paragraph while obtaining the Coordination vector using the Terminal Coordinator. In brief, our proposed CoCNN imposes a position independent attention mechanism at both the intra and inter word levels.
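The following toy NumPy check illustrates the position independence discussed above: the same convolution filter produces the same maximum response whether an assumed pattern sits at the end or at the beginning of the sequence.

import numpy as np

def conv1d_valid(seq, kernel):
    k = len(kernel)
    return np.array([np.dot(seq[i:i + k], kernel) for i in range(len(seq) - k + 1)])

kernel = np.array([1.0, 2.0, 1.0])            # one assumed filter of kernel size 3
seq_end   = np.array([0, 0, 0, 0, 2, 3, 2])   # pattern at the end   (S5, S6, S7)
seq_start = np.array([2, 3, 2, 0, 0, 0, 0])   # the same pattern at the start

print(conv1d_valid(seq_end, kernel).max())    # 10.0
print(conv1d_valid(seq_start, kernel).max())  # 10.0, same attention to the pattern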
4.5 Positive Effect of Batch Normalization
$$ y_i = \frac{out_i - mean_B}{\sqrt{var_B + \epsilon}} \qquad (4.1) $$
$$ out_i' = w \times y_i + b \qquad (4.2) $$
Here, $var_B$ and $mean_B$ denote the variance and mean, respectively, of this output value for the current batch of samples, and $\epsilon$ is a small constant added for numerical stability. $var_B$ and $mean_B$ may vary drastically among different batches, which can lead to unstable weight updates of our model. As a result, we introduce $w$ and $b$, learnable parameters that reduce internal covariate shift of this output value across all batches.
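As an illustration of Equations 4.1 and 4.2, the following NumPy sketch normalizes one intermediate output value across a toy batch; the values of $w$ and $b$ (which would be learned) and the sample outputs are assumptions.

import numpy as np

def batch_norm(out, w=1.0, b=0.0, eps=1e-5):
    mean_b = out.mean()
    var_b = out.var()
    y = (out - mean_b) / np.sqrt(var_b + eps)   # Eq. 4.1: normalize over the batch
    return w * y + b                            # Eq. 4.2: learnable scale and shift

batch_outputs = np.array([0.4, 2.1, -1.3, 0.9])  # toy intermediate outputs
print(batch_norm(batch_outputs))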
If we think about next word prediction, our model training is an unsupervised task where we can use large amounts of raw corpus data. As a result, we have a wide variety of words and various types of semantically meaningful sentences at our disposal as training samples. Batch normalized output values provide us with three benefits over raw output values: (1) faster training, (2) less bias towards initial weights and (3) an automatic regularization effect [52, 53].
Figure 4.3: Beam search simulation with 3 time stamps and a beam width of 2
Suppose the two highest probability product values among these eight values are those of $C_1A_2$ and $C_1D_2$. In a similar fashion, we feed $<Start>, C_1, A_2$ and $<Start>, C_1, D_2$ as input to CoCNN and get eight probability product values as output. Sequences $C_1A_2B_3$ and $C_1D_2C_3$ receive the two highest probability products. If
we decide to stop here, then we need to detect the highest probability valued sequence. Suppose,
it is C1 A2 B3 . So, the full sentence will be < Start > C1 A2 B3 in this case. This process can be
generalized to any beam width, sentence end marker words and top word list.
Choosing a larger beam width allows us to explore a larger search space for our completion task.
When our search space is larger, there is a higher probability of getting a more accurate and
sensible sentence completion. But increasing the beam width also increases the sentence completion run time considerably. Considering these two opposing factors, we choose 5 as our beam width.
There is one more aspect to consider. Let us think of two sentence completions for the same
input partial sentence. The first completion has predicted 5 words whereas the second one
has completed the sentence by predicting only 2 words. Beam search tends to prefer smaller
completions as the choice of completion is made using probability product of the predicted
consecutive words. We know that probability values are in the range of [0, 1]. Multiplying
probability values that are less than 1 makes the product smaller and smaller as we keep multiplying in more probability values. The completion task predicting 5 words will have the
product of 5 probability values, whereas the completion task predicting 2 words will only have 2
such values. This bias towards shorter completions is undesirable because, in real life writing tools, people usually want a multi-word completion rather than a single word. A simple approach can take care of this bias: one can impose a restriction on the minimum number of words that must be predicted for a sequence to be considered as an output of the beam search. Penalty product terms can also be imposed on shorter completions. On the contrary, we can impose a reward
on larger completions.
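The following Python sketch outlines beam search based sentence completion with a minimum completion length and an optional reward for longer completions. The predict_next function is a hypothetical stand-in for the trained language model, and the length handling shown is one possible realization of the ideas above rather than our exact scheme.

import math

def beam_search(prefix, predict_next, beam_width=5, max_steps=10,
                min_words=2, length_reward=0.0, end_token="<END>"):
    # Each beam entry: (generated word list, sum of log probabilities)
    beams = [(list(prefix), 0.0)]
    completed = []
    for _ in range(max_steps):
        candidates = []
        for words, logp in beams:
            # predict_next returns the top_k (word, probability) pairs
            for word, prob in predict_next(words, top_k=beam_width):
                candidates.append((words + [word], logp + math.log(prob)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for words, logp in candidates[:beam_width]:
            generated = len(words) - len(prefix)
            if words[-1] == end_token and generated >= min_words:
                # optional reward for longer completions counters the bias
                # toward short, high-probability products
                completed.append((words, logp + length_reward * generated))
            else:
                beams.append((words, logp))
        if not beams:
            break
    pool = completed if completed else beams
    return max(pool, key=lambda c: c[1])[0]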
Chapter 5
We start this chapter by specifying the details of the datasets used for language modeling experiments. Then we move on to the detailed specifications of our proposed CoCNN model so that readers can reproduce the reported results by developing the model appropriately. We also specify the hardware on which the experiments have been conducted so that the reported runtimes make sense. Then we proceed to compare the CoCNN architecture with state-of-the-art architectures of the CNN, LSTM and Transformer paradigms on the classic language modeling task for Bangla and Hindi - next word prediction. We identify the BERT Pre model of the Transformer paradigm as the closest competitor of CoCNN. We compare these two architectures further on multiple datasets. We also compare these architectures on downstream tasks such as sentence completion and text classification. Finally, we demonstrate the potential of the SemanticNet sub-model of the CoCNN architecture as a Bangla spell checker.
We have collected Bangla articles from online public news portals such as Prothom-Alo [54],
BDNews24 [55] and Nayadiganta [56] through web scraping. The articles encompass domains
such as politics, entertainment, lifestyle, sports, technology and literature. The datasets have
been cleaned by removing punctuation marks, English letters, numeric characters and incomplete
sentences. The Hindi dataset consists of Hindinews [57], Livehindustan [58] and Patrika [59]
newspaper articles available as open source on Kaggle, encompassing similar domains. The Nayadiganta (Bangla) and Patrika (Hindi) datasets have been used only as test sets. Apart from next word prediction, we have produced a sentence completion test set for both Bangla and Hindi from their corresponding test sets, each of which contains 10K samples.
We summarize different attributes of our datasets in Table 5.1. Top words have been selected such that they cover at least 90% of the tokens of the corresponding dataset. Due to the popular usage of the QWERTY keyboard for Bangla typing, phonetic spelling errors are common in this language. A recent paper has introduced a probability based algorithm for realistic spelling error generation in Bangla, which covers aspects such as phonetic similarity, compound characters and keyboard mispress [30]. We use this algorithm to generate similar errors in our Bangla sample input words.
We use perplexity (PPL) to assess the performance of our model for next word prediction and
sentence completion task. Suppose, we have sample inputs I1 , I2 , . . . , In and our model provides
probability values P1 , P2 , . . . , Pn , respectively for their ground truth output tokens. Then the
PPL score of our model for these samples can be computed as:
$$ PPL = \exp\left(-\frac{1}{n}\sum_{i=1}^{n}\ln(P_i)\right) $$
When a model assigns large probability to the ground truth next word outputs for different
samples, PPL score decreases. That means, when our model shows confidence in ground truth
word predictions, it gets a small PPL score. Smaller PPL score means more realistic model
prediction of next word. So, we would want to make our validation PPL score as small as
possible through various optimizations.
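The following short sketch computes the PPL score exactly as defined above; the probability values used are illustrative.

import numpy as np

def perplexity(ground_truth_probs):
    p = np.asarray(ground_truth_probs)
    return float(np.exp(-np.mean(np.log(p))))

print(perplexity([0.20, 0.05, 0.10]))  # lower is better; confident models score low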
We use accuracy and F1 score as our performance metric for text classification related tasks. To
calculate F1 score, we need to determine precision and recall. All these performance metrics are
based on detecting true positive (TP), true negative (TN), false positive (FP) and false negative
(FN) values.
$$ Accuracy = \frac{TP + TN}{TP + TN + FP + FN} $$
$$ Precision = \frac{TP}{TP + FP} $$
$$ Recall = \frac{TP}{TP + FN} $$
$$ F1 = \frac{2 \times Precision \times Recall}{Precision + Recall} $$
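The following sketch computes these four metrics from assumed TP, TN, FP and FN counts.

def classification_metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

print(classification_metrics(tp=80, tn=90, fp=10, fn=20))  # toy counts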
For model optimization, we use the Stochastic Gradient Descent (SGD) optimizer [60] with a learning rate of 0.001 while constraining the norm of the gradients to below 5 to eliminate the exploding gradient problem. We use categorical cross-entropy loss for model weight updates, and dropout [61] with probability 0.3 between the dense layers of our models for regularization.
SemanticNet Details
The Terminal Coordinator sub-model used in CoCNN architecture uses six (6) convolution
layers having kernel size of 2, 3, 3, 3, 4 and 4, and filter number of 32, 64, 64, 96, 128 and 196,
respectively. Its design is similar to that of SemanticNet sub-model. We use Softmax activation
function at the final output layer of this sub-model. We use a batch size of 64 for our models,
while applying batch normalization on CNN intermediate outputs of both SemanticNet and
Terminal Coordinator.
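The following Keras sketch shows how the reported optimization settings could be applied to a model such as the CoCNN sketch given earlier; the model construction call and the training data names are hypothetical.

import tensorflow as tf

# SGD with learning rate 0.001 and gradient norm clipped below 5
optimizer = tf.keras.optimizers.SGD(learning_rate=0.001, clipnorm=5.0)

# model = build_cocnn(build_semantic_net())   # from the earlier sketches (hypothetical)
# model.compile(optimizer=optimizer, loss="categorical_crossentropy")
# model.fit(train_x, train_y, batch_size=64,  # batch size of 64, as reported
#           validation_data=(val_x, val_y))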
5.3 Hardware Specifications
• 24 GB Nvidia Tesla K80 GPU, Intel(R) Xeon(R) CPU (2.30GHz) processor model, 12
GB RAM with 2 cores
• 8 GB Nvidia RTX 2070 GPU, Intel(R) Core(TM) CPU (2.20GHz) processor model, 32
GB RAM with 7 cores
Figure 5.1: Comparing CoCNN with state-of-the-art architectures of CNN paradigm on Prothom-
Alo validation set. The score shown beside each model name in bracket denotes that model’s PPL
score on Prothom-Alo validation set after 15 epochs of training. Note that this dataset contains
synthetically generated spelling errors.
5.4 Comparing CoCNN with State-of-the-art Architectures
We compare CoCNN with three other CNN-based baselines (see Figure 5.1). CNN Van is a simple
sequential 1D CNN model of moderate depth [4]. It considers the full input sentence/ paragraph
as a matrix. The matrix consists of character representation vectors. CNN Dl uses dilated
convolution in its CNN layers which allows the model to have a larger field of view [40]. Such
change in convolution strategy shows slight performance improvement (PPL score reduced to 952
from 985). CNN Bn has the same setting as CNN Van, but uses batch normalization (details
in Section 4.5) on intermediate convolution layer outputs. Such measure shows significant
performance improvement in terms of loss and PPL score (PPL reduced to 544). Proposed
CoCNN surpasses the performance of CNN Bn by a wide margin (PPL reduced to only 203). We
believe that the ability of CoCNN to consider each word of a sentence as a separate meaningful
entity is the reason behind this drastic improvement. The other CNN models consider the input
sentence as a sequence of characters which is not similar to what we humans do when we infer
something from a sentence. Rather we understand the meaning of a sentence using a bunch of
word coordination. From Table 5.2, we can see that the training time and parameter number
of CoCNN architecture are comparable to the other CNN models, although the performance of
CoCNN is better than the other CNN models by magnitudes.
We compare CoCNN with four LSTM-based models (see Figure 5.2). Two LSTM layers
are stacked on top of each other in all four of these models. We do not compare with
LSTM models that use the Word2vec [62] representation, as this representation requires a fixed size vocabulary. In a spelling error prone setting, the vocabulary size is theoretically infinite. We start with
LSTM FT, an architecture using sub-word based FastText representation [2, 12]. Character aware
learnable layers per LSTM time stamp form the new generation of SOTA LSTMs [13–15, 37].
LSTM CA acts as their representative by introducing variable size parallel convolution filter
output concatenation as word representation. LSTM CA almost halves the PPL score of LSTM FT (1198 compared to 2005). Instead of a unidirectional
many to one LSTM, we introduce a bidirectional LSTM in LSTM CA to form BiLSTM CA, which shows a slight performance improvement (PPL score reduced from 1198 to 1113).
Figure 5.2: Comparing CoCNN with state-of-the-art LSTM architectures in terms of loss plots and PPL score on the Prothom-Alo validation set (PPL scores: LSTM_FT 2005, LSTM_CA 1198, BiLSTM_CA 1113, BiLSTM_CA_Attn 771, CoCNN 203)
We introduce Bahdanau attention [51] on BiLSTM CA to form the BiLSTM CA Attn architecture. Such
measure shows further performance boost (PPL reduced to 771). Proposed CoCNN shows almost
four times improvement in PPL score compared to BiLSTM CA Attn (PPL score of 203). If we
compare Figure 5.2 and Figure 5.1, we can see that CNNs perform relatively better than LSTMs
in general for Bangla LM. For example, the worst performing CNN model CNN Van performs
better than LSTM FT, LSTM CA and BiLSTM CA. LSTMs have a tendency of learning sequence
order information which imposes positional dependency. Such characteristic is unsuitable for
languages such as Bangla and Hindi with flexible word order.
We compare CoCNN with four Transformer-based models (see Figure 5.3). We use popular
FastText word representation with all compared Transformers. Our comparison starts with
Vanilla Tr, a single Transformer encoder (similar to the Transformer designed in [17]) that
utilizes global maxpooling and dense layer at the end for next word prediction. In BERT, we
stack 12 Transformers on top of each other where each Transformer encoder has more parameters
than the Transformer of Vanilla Tr [18, 41]. BERT, with its large depth and enhanced encoders, markedly improves on Vanilla Tr (PPL score reduced from 798 to 469).
We do not pretrain this BERT architecture. We follow the Transformer architecture designed
in [16] and introduce auxiliary loss after the Transformer encoders situated near the bottom of
the Transformer stack of BERT to form BERT Aux. The introduction of such auxiliary losses shows a moderate performance improvement (PPL decreases slightly from 469 to 402). BERT Pre is the pretrained version of BERT. We follow the word masking based pretraining scheme provided in [42].
Figure 5.3: Comparing CoCNN with state-of-the-art Transformer architectures in terms of loss plots and PPL score on the Prothom-Alo validation set (PPL scores: Vanilla_Tr 798, BERT 469, BERT_Aux 402, BERT_Pre 210, CoCNN 203)
The Bangla pretraining corpus consists of Prothom Alo [54] news articles dated from 2014-2017 and BDNews24 [55] news articles dated from 2015-2017. The performance of BERT improves dramatically when such pretraining is applied (PPL more than halved, from 469 to 210). CoCNN, without utilizing any pretraining, achieves marginally better performance than
BERT Pre (PPL score of 203 compared to 210). Unlike Transformer encoders, convolution
imposes attention with a view to extracting important patterns from the input to provide the
correct output (see Section 4.4). Furthermore, SemanticNet of CoCNN is suitable for deriving
semantic meaning of each input word in highly inflected and error prone settings.
BERT Pre is the only model showing performance close to CoCNN in terms of validation loss
and PPL score (see Figure 5.3). In order to compare these two models further, we train both models for 30 epochs on several Bangla and Hindi datasets and obtain their PPL scores on the corresponding validation sets.
Figure 5.4: Comparing CoCNN with BERT Pre on the Bangla (Prothom-Alo) and Hindi (Hindinews and Livehindustan merged) validation sets. The score shown beside each model name denotes that model's PPL score after 30 epochs of training on the corresponding training set (Bangla: BERT_Pre 152, CoCNN 147; Hindi: BERT_Pre 65, CoCNN 52).
The Bangla datasets include Prothom-Alo and BDNews24, while the Hindi datasets include Hindinews and Livehindustan. We use the Nayadiganta and Patrika datasets as independent test sets for Bangla and Hindi, respectively. The Hindi pretraining corpus consists of the Hindi
Oscar Corpus [63], preprocessed Wikipedia articles [64], HindiEnCorp05 dataset [65] and WMT
Hindi News Crawl data [66]. From the graphs of Figure 5.4 and the PPL score comparison in Table 5.3, it is evident that CoCNN marginally outperforms its closest competitor BERT Pre in all cases. There are 8 pairs of PPL scores in Table 5.3. We use these scores to perform a one tailed paired t-test in
order to determine whether the reduction of PPL score seen in CoCNN compared to BERT Pre
is statistically significant when P-value threshold is set to 0.05. The statistical test shows that the
improvement is significant and the null hypothesis can be rejected.
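The following SciPy sketch illustrates such a one tailed paired t-test; the PPL values listed are placeholders, not the scores reported in Table 5.3.

from scipy import stats

bert_pre_ppl = [210, 152, 198, 175, 65, 70, 180, 160]   # placeholder PPL values
cocnn_ppl    = [203, 147, 190, 168, 52, 60, 172, 150]   # placeholder PPL values

t_stat, p_two_sided = stats.ttest_rel(bert_pre_ppl, cocnn_ppl)
p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
print(p_one_sided < 0.05)   # True would mean the PPL reduction is significant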
Model      | Parameters | Training time / epoch
BERT Pre   | 74 M       | 402 min
CoCNN      | 4.5 M      | 40 min
Table 5.4: Comparing BERT Pre and CoCNN in terms of efficiency
More importantly, the memory requirement of CoCNN is more than 16X lower and its training
speed is over 10X faster than those of BERT Pre (see Table 5.4). Thus, in terms of efficiency, CoCNN beats BERT Pre significantly. The hardware specification used for training both models is as follows: 12 GB Nvidia Titan Xp GPU, Intel(R) Core(TM) i7-7700 CPU (3.60 GHz) processor, 32 GB RAM with 8 cores.
5.5 Comparative Analysis on Downstream Tasks
We form our sentence completion dataset for Hindi and Bangla language from their corresponding
test datasets for evaluating our proposed CoCNN model against its close competitor BERT Pre.
For each sentence $S_i$ of length $n_i$, we take the first $n_i/2$ words as input and the remaining words as ground truth outputs. We use typical beam search [67] with a beam width of 10. We keep the 10
most probable partial sentences for each time stamp output of our model according to probability
product. Our final output contains the sentence with maximum probability product. Sentence
completion is strictly more difficult than word prediction, since wrong prediction of one output
word can adversely affect the prediction of all subsequent words of a particular sentence. In spite
of these challenges, the CoCNN model achieves a reasonable PPL score in this task, as shown in Table 5.5. Its scores are lower than those of BERT Pre in all three cases.
Dataset                             | Class No | Class Names                                                  | Training Sample No | Validation Sample No
Bangla Question Classification      | 6        | Entity, Numeric, Human, Location, Description, Abbreviation | 2700               | 650
Hindi Product Review Classification | 2        | Positive Review, Negative Review                             | 1700               | 170
Table 5.6: Bangla and Hindi text classification dataset details
We have compared BERT Pre and CoCNN on Bangla question classification dataset (entity,
numeric, human, location, description and abbreviation type question) used in [68] and a Hindi
product review classification dataset (positive and negative review) [69]. Details of both the
datasets have been provided in Table 5.6. We use five fold cross validation while performing
comparison on these datasets (see mean results in Table 5.7) in terms of accuracy and F1 score.
A one tailed independent t-test with a P-value threshold of 0.05 has been performed on the 5 pairs of validation F1 scores obtained from five fold cross validation of the two models. Our statistical
test results validate the significance of the improvement shown by CoCNN.
CoCNN model uses SemanticNet for producing SemanticVec representation from each input
word. We extract the CNN sub-model of SemanticNet from our trained (on Prothom-Alo dataset)
CoCNN model. We produce SemanticVecs for all 13K top words of the Prothom-Alo dataset. The spell checker (SC) suggests the correct version of a misspelled word. For any error word $W_e$, we can generate its SemanticVec $V_e$ using SemanticNet. We can then calculate the cosine similarity $Cos_i$ between $V_e$ and the SemanticVec $V_i$ of each top word $W_i$. Higher cosine similarity means a greater probability of being the correct version of $W_e$. We have found such an approach to be effective for correct
word generation. Recently, a phonetic rule based approach has been proposed in [70], where a
hybrid of Soundex [71] and Metaphone [72] algorithm has been used for Bangla word level SC.
Another SC proposed in recent time has taken a clustering based approach [73]. We show the
comparison of our proposed SemanticNet based SC with these two existing SCs in Table 5.8.
Both the real and synthetic error dataset consist of 20k error words formed from the top 13k
words of Prothom-Alo dataset. The real error dataset has been collected from a wide range of
Bangla native speakers using an easy to use web app. The web app randomly shows the user three top words to type quickly and stores the user typed versions of these correct words in the database (shown in Figure 5.5). The results show the superiority of our proposed spell
checker over existing approaches.
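The following Python sketch outlines this cosine similarity based correction procedure. The semantic_net callable and the encode function that maps a word to its padded character id sequence are hypothetical stand-ins for the extracted sub-model and our preprocessing.

import numpy as np

def suggest_correction(error_word, top_words, semantic_net, encode):
    # encode(word) maps a word to its padded character id sequence (assumed helper)
    v_e = np.asarray(semantic_net(encode(error_word)))            # SemanticVec of W_e
    top_vecs = np.stack([np.asarray(semantic_net(encode(w))) for w in top_words])
    sims = top_vecs @ v_e / (np.linalg.norm(top_vecs, axis=1) * np.linalg.norm(v_e))
    return top_words[int(np.argmax(sims))]   # top word with the highest cosine similarity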
Chapter 6
This chapter describes BERT derived BSpell architecture proposed for sentence level Bangla
spell checking. The architecture has a main branch and an auxiliary branch, both of which are described in detail in this chapter. Furthermore, this chapter provides the details of
our proposed hybrid pretraining scheme required for pretraining BSpell model.
[Figure 6.1: The proposed BSpell architecture. The concatenated character embeddings (width $len_C$) of each word $W_i$ are passed through the SemanticNet sequential 1D CNN sub-model to produce the SemanticVec $F_i$. In the auxiliary branch, the globalmaxpooled vector of each $F_i$ is passed through a dense layer ($D_i$) followed by a Softmax layer. In the main branch, the SemanticVecs $F_1, F_2, \ldots, F_n$ are passed through BERT_Base followed by a Softmax layer that produces the final output.]
We want our model to have a similar semantic level understanding of the words. We employ SemanticNet, a sequential 1D CNN sub-model, at each individual word level with a view to learning intra word syllable patterns. The details of this sub-model have been provided in Section 4.2. We believe that
inclusion of such sub-model in any learning based language model can play a vital role in all
types of Bangla NLP tasks.
Each SemanticVec is passed through a dense layer before being fed to the bottom-most Transformer encoder. The intuition behind such dense layer based modification is to prepare the SemanticVecs
for Transformer based context information extraction. 12 such Transformer encoders are
stacked on top of each other. Each Transformer employs multi head attention mechanism
to all its input vectors. The details of this process have been described in Section 2.4, and a detailed demonstration has been provided in Figure 2.8. The vectors modified through the attention mechanism are passed on for layer normalization. A neural network dense layer provides a matrix as output during mini batch training; this matrix is normalized based on its mean and variance, which is known as layer normalization. The layer normalized vectors are added to a
dense layer modified version of these vectors through a skip connection for incorporating both
low level and high level features. The n vectors from n words of the input sentence keep passing
on from one Transformer to another with attention oriented modification based on the interaction
with one another. Such interaction is vital for understanding the context of each input word.
Often context of a word can be just as important as its error pattern for the spelling correction of
that word (see Figure 1.5). We pass the final Transformer output vectors to a Softmax layer that
employs the Softmax activation function on each vector individually. So, now we have n
probability vectors. Each probability vector contains lenP values, where lenP is one more than
the total number of top words considered. This one additional word represents rare words, also
known as UNK. The top word corresponding to the index of the maximum probability value
of ith probability vector represents the correct word for W ordi of the input sentence. We use
such word level BERT based approach so that we can tackle the issue of heterogeneous character
number between error input word and correct output word, commonly found in Bengali (see
Figure 1.4).
We use the following formula to describe the impact of our auxiliary loss on weight update:
$$ L_{Total} = L_{Final} + \lambda \times L_{Auxiliary} $$
$L_{Total}$ is the weighted total loss of the BSpell model. We want our final loss $L_{Final}$ to have a greater impact on model weight update than our auxiliary loss $L_{Auxiliary}$, as $L_{Final}$ is the loss associated with the final prediction made by BSpell. Hence, we impose the constraint $0 < \lambda < 1$. As $\lambda$ gets closer to 0, the influence of $L_{Auxiliary}$ on model weight update diminishes.
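The following sketch computes such a weighted total loss for a two-output model; the value of lambda used here is an assumption, as the only stated requirement is $0 < \lambda < 1$.

import tensorflow as tf

LAMBDA = 0.3   # assumed value; the only requirement stated is 0 < lambda < 1

def total_loss(y_true, y_pred_final, y_pred_aux, lam=LAMBDA):
    cce = tf.keras.losses.CategoricalCrossentropy()
    l_final = cce(y_true, y_pred_final)   # loss of the main branch prediction
    l_aux = cce(y_true, y_pred_aux)       # loss of the auxiliary branch prediction
    return l_final + lam * l_aux          # L_Total = L_Final + lambda * L_Auxiliary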
Gradient vanishing is a serious problem in deep BERT based architectures, especially for layers
that are at the bottom near the input (SemanticNet sub-model in our case). Let us take the
scenario of Figure 6.2 for instance. This is an over simplified architecture, where the primary
branch contains 49 learnable layers, while the secondary branch only contains one. Each layer is
assumed to have only one weight for demonstration purpose. Here, W1 is the weight nearest to
the input. $W_1$ gets updated using the following formula:
$$ W_1 = W_1 - \left( \frac{\partial L_1}{\partial W_1} + \frac{\partial L_2}{\partial W_1} \right) $$
The two partial derivatives of the above equation can be expanded using the chain rule. There are two gradient calculations involved. The $\frac{\partial L_1}{\partial W_1}$ calculation takes the product of 50 derivative terms. If each of these terms is slightly less than 1.0, then we get a product value which is almost equal to 0. In the absence of the secondary branch, this is all we have to update $W_1$. On the other hand, $\frac{\partial L_2}{\partial W_1}$ contains only two product terms. This term does not have any problem
even if each product term is significantly below 1.0. The product term is still big enough. Thus
introducing auxiliary loss based shallow secondary branch is an effective measure to take when
it comes to updating weights of shallow layers in deep neural networks.
The secondary branch does not have any Transformer encoders or any such means through which
the input words can interact to produce context information. The prediction made from this
branch is dependent solely on misspelled word pattern information extracted by the SemanticNet
sub-model. This enables the SemanticNet sub-model to learn more meaningful word representations. Such ability is crucial, as an error word's pattern is generally more important than its context for correcting that word, although there are a few exceptional cases.
6.5 BERT Hybrid Pretraining Scheme
In contemporary BERT pretraining methods, each input word $Word_i$ may be kept intact or may be replaced by a default mask word in a probabilistic manner [1, 28]. The BERT architecture then has to predict the masked words while keeping the unmasked words intact at the output side. Mistakes made by BERT contribute to the loss value, driving backpropagation based weight update. Basically, in this process, BERT is learning to fill in the gaps. This
task is equivalent to predicting the appropriate word based on prior and post context of the
sentence or the document sample. So, contemporary word masking based pretraining focuses
on teaching the model context of the language. This is good enough for tasks such as next
word prediction, sentence auto completion, sentiment analysis and text classification. But in
spelling correction, recognizing noisy word pattern is equally important. Unfortunately, there is
no provision for that in contemporary word masking based pretraining [1, 28]. Either we have a
correct word or we have a fully masked word.
To mitigate the above limitations, we propose hybrid masking. Among our $n$ input words in a sentence, we randomly replace $n_W$ words with a mask word $Mask_W$, where $n > n_W$. So, now we have $n - n_W$ words remaining intact. Among these words, we choose $n_C$ words for character masking, where $(n - n_W) > n_C$. Let us now see how we implement character masking on these $n_C$ chosen words by focusing on a single one of them. Suppose our word of interest consists of $m$ characters. We choose $m_C$ characters at random from this word to be replaced by a mask character $Mask_C$, where $m > m_C$. Such masked characters introduce noise into words and help BERT understand the probable semantic meaning of noisy words. Misspelled words are actually similar to such noisy words and, as a result, hybrid pretraining boosts spelling correction specific fine tuning performance. Figure 6.3 shows a toy example of hybrid masking on a sentence containing 5 words. Here, $Word_2$ (character no. 2) and $Word_5$ (characters no. 1 and 4) are masked at character level, while $Word_4$ is chosen for word level masking. $Word_1$ and $Word_3$ remain intact.
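The following Python sketch implements hybrid masking on a single tokenized sentence using the masking ratios listed later in Section 7.2.2; the mask word and mask character tokens are assumptions.

import random

MASK_WORD, MASK_CHAR = "<MASKW>", "#"   # assumed mask tokens

def hybrid_mask(words, word_ratio=0.15, char_word_ratio=0.40, char_ratio=0.25):
    words = list(words)
    n = len(words)
    word_idx = set(random.sample(range(n), max(1, int(word_ratio * n))))
    remaining = [i for i in range(n) if i not in word_idx]
    char_idx = set(random.sample(remaining, int(char_word_ratio * len(remaining))))
    masked = []
    for i, w in enumerate(words):
        if i in word_idx:
            masked.append(MASK_WORD)                       # word level masking
        elif i in char_idx:
            chars = list(w)
            for j in random.sample(range(len(chars)), max(1, int(char_ratio * len(chars)))):
                chars[j] = MASK_CHAR                       # character level masking
            masked.append("".join(chars))
        else:
            masked.append(w)
    return masked

print(hybrid_mask(["word1", "word2", "word3", "word4", "word5"]))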
Chapter 7
We start this chapter by specifying the details of our spell checking experimental dataset. Then
we move on to specifying the hyperparameters of our proposed BSpell model along with our
spell checker performance metric. The hardware specifications for the spell checking experiments are the same as for the Bangla language modeling experiments, so we do not repeat them here. We
proceed to show the effectiveness of our proposed BSpell model in spelling correction over
existing BERT variants. Then we show the effectiveness of proposed hybrid pretraining scheme
on BSpell architecture over other possible pretraining methods. Finally, we compare hybrid
pretrained BSpell model with possible LSTM based spell checker variants and some recently
proposed Bangla spell checkers. We also shed light on the importance of SemanticNet sub-model
in BSpell architecture. This chapter ends with the limitations of our proposed method.
We now move on towards describing our Bangla spelling correction dataset in Table 7.1. We
have used one Bengali and one Hindi corpus with over 5 million (5 M) sentences for BERT
pretraining (see Table 7.1). There is no validation sample used while implementing pretraining.
The Bengali pretraining corpus consists of Prothom Alo [75] articles dated from 2014-2017 and
BDnews24 [76] articles dated from 2015-2017. The Hindi pretraining corpus consists of Hindi
Oscar Corpus [63], preprocessed Wikipedia articles [64], HindiEnCorp05 dataset [65] and WMT
Hindi News Crawl data [66]. We apply our masking based model pretraining schemes on these
datasets.
We have used the Prothom-Alo 2017 online newspaper dataset for Bengali spelling correction model training and validation purposes. The errors in this corpus have been produced synthetically using the probabilistic algorithm described in [9]. The errors approximate realistic spelling errors, as the algorithm considers phonetic similarity based confusions, compound letter typing mistakes, and keyboard mispress insertions and deletions. We further validate our baselines and our
proposed method on Hindi open source spelling correction dataset, namely ToolsForIL [24].
This synthetically generated dataset was used for character level sequence to sequence Hindi
spell checker development.
We have also manually collected real error datasets of typing from nine Bengali writers. We have
collected a total of 6300 correct sentences from Nayadiganta [77] online newspaper. Then we
have distributed the entire dataset among nine participants. They have typed (in regular speed)
each correct sentence using English QWERTY keyboard producing natural spelling errors. It
has taken 40 days to finish the labeling, where each participant has labeled fewer than 20 sentences per day in order to ensure labeling quality. As this real error dataset has mainly
been used as a test set after training and validating on synthetic error dataset Prothom-Alo, we
have considered its top words to be the same as Prothom-Alo dataset. Table 7.1 contains all
dataset details. For example, Prothom-Alo (synthetic error) dataset has 262 K unique words, 35
K top words, 1 M training samples, 200 K validation samples, 450 K unique error words and
52% of the existing words in this corpus contain one or more errors. Top words which are to
be corrected by our proposed models have been taken such that they cover at least 95% of the
corresponding corpus.
We use accuracy and F1 score as our performance metric for spelling correction. It is to note
that during spelling correction, we only correct top words. We do not correct rare UNK words.
Merely detecting rare words correctly is considered sufficient, as users do not necessarily want
to correct names and similar proper nouns.
Optimizer details and SemanticNet sub-model details for BSpell are the same as CoCNN. So,
naturally we do not repeat them here.
The main branch of BSpell architecture is similar to BERT Base [74] in terms of stacking 12
Transformer encoders on top of each other. We pass each SemanticVec through a 768 hidden
unit dense layer. The modified output obtained from this dense layer is then passed on to the
bottom-most Transformer of the stack. Each Transformer consists of 12 attention heads and 768
hidden unit feed forward neural networks. Attention outputs from each Transformer are passed through a dropout layer [78] with a dropout rate of 0.3 and then layer normalized [79].
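For reference, the following hedged sketch shows one way a comparable Transformer encoder stack could be configured with the HuggingFace transformers library; our own implementation does not necessarily use this library, and the vocabulary size is an assumption.

from transformers import BertConfig, BertModel

config = BertConfig(
    vocab_size=35000,                 # assumed top-word vocabulary size
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=768,            # 768 hidden unit feed forward networks
    hidden_dropout_prob=0.3,
    attention_probs_dropout_prob=0.3,
)
encoder_stack = BertModel(config)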
Pretraining schemes of BERT based architectures differ in terms of how masking is implemented
on the words of each sentence of the pretraining corpus. We have experimented with three types
of masking schemes which are as follows:
• Word Masking: For each sentence, we randomly select 15% words of that sentence and
replace those words with a fixed mask word filled with mask characters.
• Character Masking: For each sentence, we randomly select 50% words of that sentence.
For each selected word, we randomly mask 30% of its characters by replacing each of
them with our special mask character.
• Hybrid Masking: For each sentence, we randomly select 15% words of the sentence
and replace those words with a fixed mask word. Again from the remaining words, we
randomly select 40% of the words. For these selected words, we randomly mask 25% of
their characters.
Figure 7.1: The bold words in the sentences are the error words. The red marked words are the
ones where only BSpell succeeds in correcting among the approaches provided in Table 7.2
The correct version of the $i$th word of the input sentence has to be the $i$th word in the output sentence as well. Such a constraint is difficult to satisfy with such an architecture design. Hence we get poor results using BERT Seq2seq operating at character level (see Table 7.2). The BERT Base architecture is inspired
from [26], where we use only the stack of Transformer encoder portion. Our implemented
BERT Base has two differences from the design proposed in [26] - (i) We make predictions
at word level instead of character level. For character level prediction, character number of a
misspelled sentence and its correct version have to be the same; which is an uncommon scenario
in Bangla (see Figure 1.4 for details). (ii) We do not incorporate any external knowledge about
Bangla spell checking through the use of graph convolutional network since such knowledge is
not well established in Bangla spell checking domain. In BERT Base, the correct version for
the ith word of input sentence can always be found at the ith position of the corrected output
sentence. This approach achieves good performance in all four cases. Soft Masked BERT learns
to apply specialized synthetic masking on error prone words in order to push the error correction
performance of BERT Base further. The error prone words are detected using a Gated recurrent
unit (GRU) based sub-model and the whole architecture is trained end to end. Although the
corresponding research [27] implemented this architecture to make corrections at character level,
our implementation does everything at word level for the reasons mentioned before.
We see a slight performance improvement in all cases after incorporating this new design. It is
to note that we have used popular sub-word based FastText [80] word representation for both
BERT Base and Soft Masked BERT. Our proposed BSpell model shows decent performance
improvement in all cases. The SemanticNet sub-model, with the help of our designed auxiliary loss, is able to specialize on word error patterns. The reason behind our performance gain becomes clearer from the examples of Figure 7.1. The red marked error words have a high edit distance from their corresponding correct words, though these spelling error patterns are common while typing Bangla using an English QWERTY keyboard. A human with decent knowledge of Bangla writing patterns should be able to identify the correct versions of the red marked words given
slight contextual information. SemanticNet sub-model with the help of auxiliary loss based
gradient update provides BSpell with similar capability. As part of ablation study, we remove
the specialized auxiliary loss from the BSpell model (BSpell (No Aux Loss) model) and redo the
experiments. The performance decreases in all cases.
Figure 7.2: The bold words in the sentences are the error words. The red marked words are the
ones where only hybrid pretrained BSpell succeeds in correcting among the approaches provided
in Table 7.3
We have implemented three different pretraining schemes (details provided in Section 7.2.2) on
our proposed BSpell model before fine tuning on the corresponding spelling correction dataset. Word
masking mainly teaches BSpell context of a language through large scale pretraining. It takes a
fill in the gaps sort of approach. Spelling correction is not all about filling in the gaps. It is also
about what the writer wants to say, i.e. being able to predict a word even if some of its characters
are blank (masked). A writer will seldom write the most appropriate word according to context,
rather he will write what is appropriate to express his opinion. Character masking takes a more
drastic approach by completely eliminating the fill in the gap task. This approach masks a few
of the characters residing in some of the input words of the sentence and asks BSpell to predict
these noisy words' original correct versions. The lack of context in such a pretraining scheme has a negative effect on performance in the real error dataset experiments, where harsh errors exist and context is the only feasible way of correcting them (shown in Table 7.3). The hybrid masking approach focuses both on filling in word gaps and on filling in character gaps through prediction of the correct word, and helps BSpell achieve state-of-the-art performance. A couple of examples have been provided in Figure 7.2, where the errors introduced in the red marked words are slightly less common, have a large edit distance from their correct versions, and require both error pattern understanding and word context for correction. Only hybrid pretrained BSpell has been successful in correcting these words.
We start our experiments with Pre LSTM, a generalized form of N-gram based word prediction
[23]. Given word1 , word2 , . . . wordi−1 of a sentence, this model predicts wordi . We assume
this predicted word to be the correct version of the given wordi without taking its own spelling
pattern into consideration. Quite understandably, the result is poor on all datasets (see Table
7.4). Pre LSTM + Word CNN architecture coordinates between two neural network branches for
correcting wordi of a sentence. Pre LSTM branch takes in word1 , word2 , . . . wordi−1 as input,
whereas 1D convolution based Word CNN branch takes in wordi representation matrix as input.
Pre LSTM provides a prior context vector as output, whereas Word CNN provides wordi spelling
pattern summary vector as output. Both these outputs are merged and passed to a coordinating
neural network module to get the correct version of wordi . Performance jumps up by a large
margin with the addition of Word CNN branch as misspelled word spelling pattern is crucial for
spell checking that word. Pre LSTM + Word CNN + Post LSTM is similar to the previous model
except for the presence of one extra LSTM based neural network branch Post LSTM which
takes in wordn , wordn−1 , wordn−2 , . . . wordi+1 as input, where n is the number of words in the
sentence. It produces a post context vector for wordi as output. The merging and prediction
procedure is similar to Pre LSTM + Word CNN architecture. The performance sees another boost
after the addition of Post LSTM branch, because the correction of a word requires the word’s
error pattern, previous context and post context (see Figure 1.2). All three of the discussed LSTM
based models correct one word at a time in a sentence. BiLSTM is a many to many bidirectional
LSTM that takes in all n words of a sentence at once and predicts their correct version as output
(wordi is provided as input in time stamp i) [2]. BiLSTM shows almost similar performance as
Pre LSTM + Word CNN + Post LSTM, because it also takes in both previous and post context
into consideration besides the writing pattern of each word. We have stacked two LSTM layers
on top of each other in the discussed four models. However, in Stacked BiLSTM, we stack
twelve many to many bidirectional LSTMs on top of one another instead of just two. Spelling
correction performance improves by only 1-1.5% in spite of such large increase in parameter
number. Attn Seq2seq is a sequence to sequence LSTM based model utilizing Bahdanau attention
mechanism at decoder side [12]. This model takes in misspelled sentence characters as input and
provides the correct sequence of characters as output [24]. Due to word level spelling correction
evaluation, this model faces the same problems as BERT Seq2seq model discussed in Section 7.3.
Such architecture shows poor performance in Bangla and Hindi spelling correction. Although
Stacked BiLSTM shows superior performance in LSTM paradigm, our proposed BSpell (Hybrid)
model outperforms this model by a large margin.
The phonetic rule based and clustering based spell checkers perform poorly on our spelling correction datasets because of the presence of such a wide variety of errors (see Table 7.5). Since these two spell checkers are not learning based, fine tuning is not applicable to them. 1D CNN is a simple sequential 1D convolution based model that takes a word's matrix
representation as input and predicts that word’s correct version as output [82]. Such learning
based scheme brings in substantial performance improvement. None of these three spell checkers
take the misspelled word's context into consideration while correcting that word. As a result, the performance is poor, especially on the Bangla real error dataset. Our proposed hybrid pretrained
BSpell model outperforms these existing Bangla spell checkers by a wide margin.
Figure 7.3: 10 popular Bangla words and their real life error variations. Cluster visual is provided
in Figure 7.4
The main motivation behind the inclusion of SemanticNet sub-model in BSpell is to obtain
vector representations of error words as close as possible to their corresponding correct words.
Such ability imposes the correct semantic meaning on misspelled input words in a sentence. Such a representation comes in handy in cases where there are many error words in a single sentence (see Figure 1.6).
Figure 7.4: (a) 10 Bangla popular words and three error variations of each of those words
(provided in Figure 7.3) have been plotted based on their SemanticVec representation. Each
popular word along with its three error variations form a separate cluster marked with unique
color. (b) Two of the popular word clusters zoomed in to have a closer look.
We take 10 frequently occurring Bangla words and collect three real life error
variations of each of these words (see Figure 7.3). We produce SemanticVec representation of all
40 of these words (including correct word and error variations) using SemanticNet sub-model
(extracted from the trained BSpell model). We apply principal component analysis (PCA) [83] to these SemanticVecs and plot them in two dimensions. Finally, we apply the K-Means clustering algorithm with careful initialization and K = 10 [84]. Figure 7.4a shows the 10
clusters obtained from this algorithm. Each cluster consists of a popular word and its three error
variations. In all cases, the correct word and its three error versions are so close in the graph plot
that they almost form a single point. In Figure 7.4b, we zoom in to have a closer look at cluster
W6 and W8 . Error versions of a particular word are close to that word but are far from other
correct words.
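The following sketch reproduces this analysis pipeline with scikit-learn on a placeholder array standing in for the real SemanticVecs.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

semantic_vecs = np.random.rand(40, 640)   # placeholder for the 40 real SemanticVecs
points_2d = PCA(n_components=2).fit_transform(semantic_vecs)
# Clustering is applied to the 2-D projection here purely for illustration
labels = KMeans(n_clusters=10, n_init=10).fit_predict(points_2d)
print(labels)   # ideally each correct word shares a cluster with its 3 error variants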
[Figure 7.5: example error words and their correct versions illustrating the cases where BSpell fails.]
There are three cases where the BSpell model fails (see Figure 7.5). The first and second cases involve accidental merging of two words into one or accidental splitting of one word into two. Our proposed model provides a word for word correction, i.e., the number of input words and the number of output words of BSpell have to be exactly the same. Unfortunately, during word merging or word splitting, the number of input and output words changes. The third case involves errors in rare words, words that are mostly proper nouns and are rarely found in a Bangla corpus. Our model simply labels such words as UNK and does not try to correct errors prevalent in them. The introduction of encoder decoder based spelling correction at character level can theoretically eliminate these limitations. However, in almost all sentence level spelling corrections, the correct version of input word $i$ should be found at output word $i$, and sequence to sequence modeling finds such a constraint hard to follow. Moreover, the loss function involved in character level modeling emphasizes correcting characters: if a word contains 10 characters and 9 of them are rendered correctly, the word as a whole still remains incorrect. The poor spelling correction performance of BERT Seq2seq (see Table 7.2) and Attn Seq2seq (see Table 7.4) bears testimony to this claim.
Chapter 8
Conclusion
We start this chapter by elaborating on the possible future directions of this research. Finally, we conclude with some closing remarks regarding our research summary and contributions.
Our proposed BSpell architecture is aimed at offline spell checking where the model makes
corrections after the full sentence has been provided to it. A possible future research direction is an online version of spell checking, where the proper word is suggested while a word is being typed, based on the partially typed word pattern and its previous context. Three such
examples have been provided in Figure 8.1. The green marked words are the correctly predicted
versions of the corresponding red marked words. In the second and the third example, the
partially typed words are typed incorrectly. The designed system has to predict the correct and
complete version in spite of incorrectly typed partial words. Furthermore, since this spell checker
will be used online, it has to be runtime efficient as well. This is in essence a more difficult
problem than offline spell checking.
References
[1] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and
V. Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” arXiv preprint
arXiv:1907.11692, 2019.
[2] M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural networks,” IEEE transactions
on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
[3] S. Takase, J. Suzuki, and M. Nagata, “Character n-gram embeddings to improve rnn
language models,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33,
pp. 5074–5082, 2019.
[4] N.-Q. Pham, G. Kruszewski, and G. Boleda, “Convolutional neural network language
models,” in Proceedings of the 2016 Conference on Empirical Methods in Natural Language
Processing, pp. 1153–1162, 2016.
[5] J. Gao, J. Goodman, M. Li, and K.-F. Lee, “Toward a unified approach to statistical language
modeling for chinese,” ACM Transactions on Asian Language Information Processing
(TALIP), vol. 1, no. 1, pp. 3–33, 2002.
[6] D. Cai and H. Zhao, “Neural word segmentation learning for chinese,” arXiv preprint
arXiv:1606.04300, 2016.
[7] T.-H. Yang, T.-H. Tseng, and C.-P. Chen, “Recurrent neural network-based language models
with variation in net topology, language, and granularity,” in 2016 International Conference
on Asian Language Processing (IALP), pp. 71–74, IEEE, 2016.
[8] J. F. Staal, “Sanskrit and sanskritization,” The Journal of Asian Studies, vol. 22, no. 3,
pp. 261–275, 1963.
[10] J. Noyes, “The qwerty keyboard: A review,” International Journal of Man-Machine Studies,
vol. 18, no. 3, pp. 265–281, 1983.
[11] S. Wang, M. Huang, and Z. Deng, “Densely connected cnn with multi-scale feature attention
for text classification.,” in IJCAI, pp. 4468–4474, 2018.
[12] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to
align and translate,” arXiv preprint arXiv:1409.0473, 2014.
[13] M.-T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural
machine translation,” arXiv preprint arXiv:1508.04025, 2015.
[14] Y. Kim, Y. Jernite, D. Sontag, and A. M. Rush, “Character-aware neural language models,”
arXiv preprint arXiv:1508.06615, 2015.
[18] K. Irie, A. Zeyer, R. Schlüter, and H. Ney, “Language modeling with deep transformers,”
arXiv preprint arXiv:1905.04226, 2019.
[19] X. Ma, P. Zhang, S. Zhang, N. Duan, Y. Hou, M. Zhou, and D. Song, “A tensorized
transformer for language modeling,” in Advances in Neural Information Processing Systems,
pp. 2232–2242, 2019.
[20] N. UzZaman and M. Khan, “A bangla phonetic encoding for better spelling suggestions,”
tech. rep., BRAC University, 2004.
[21] N. UzZaman and M. Khan, “A double metaphone encoding for approximate name searching
and matching in bangla,” 2005.
[22] P. Mandal and B. M. Hossain, “Clustering-based bangla spell checker,” in 2017 IEEE
International Conference on Imaging, Vision & Pattern Recognition (icIVPR), pp. 1–6,
IEEE, 2017.
[23] N. H. Khan, G. C. Saha, B. Sarker, and M. H. Rahman, “Checking the correctness of bangla
words using n-gram,” International Journal of Computer Application, vol. 89, no. 11, 2014.
[24] P. Etoori, M. Chinnakotla, and R. Mamidi, “Automatic spelling correction for resource-
scarce languages using deep learning,” in Proceedings of ACL 2018, Student Research
Workshop, (Melbourne, Australia), pp. 146–152, Association for Computational Linguistics,
July 2018.
[25] D. Wang, Y. Tay, and L. Zhong, “Confusionset-guided pointer networks for Chinese spelling
check,” in Proceedings of the 57th Annual Meeting of the Association for Computational
Linguistics, (Florence, Italy), pp. 5780–5785, Association for Computational Linguistics,
July 2019.
[26] X. Cheng, W. Xu, K. Chen, S. Jiang, F. Wang, T. Wang, W. Chu, and Y. Qi, “Spellgcn:
Incorporating phonological and visual similarities into language models for chinese spelling
check,” arXiv preprint arXiv:2004.14166, 2020.
[27] S. Zhang, H. Huang, J. Liu, and H. Li, “Spelling error correction with soft-masked bert,”
arXiv preprint arXiv:2005.07421, 2020.
[28] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional
transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
[30] M. Sifat, H. Rahman, C. R. Rahman, M. Rafsan, M. Rahman, et al., “Synthetic error dataset
generation mimicking bengali writing pattern,” arXiv preprint arXiv:2003.03484, 2020.
[32] M. Sundermeyer, R. Schlüter, and H. Ney, “Lstm neural networks for language modeling,”
in Thirteenth annual conference of the international speech communication association,
2012.
[33] S. Merity, N. S. Keskar, and R. Socher, “Regularizing and optimizing lstm language models,”
arXiv preprint arXiv:1708.02182, 2017.
[35] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enriching word vectors with subword
information,” Transactions of the Association for Computational Linguistics, vol. 5, pp. 135–
146, 2017.
[38] S. Moriya and C. Shibata, “Transfer learning method for very deep cnn for text classification
and methods for its evaluation,” in 2018 IEEE 42nd Annual Computer Software and
Applications Conference (COMPSAC), vol. 2, pp. 153–158, IEEE, 2018.
[39] H. T. Le, C. Cerisara, and A. Denis, “Do convolutional networks need to be deep for
text classification?,” in Workshops at the Thirty-Second AAAI Conference on Artificial
Intelligence, 2018.
[40] S. Roy, “Improved bangla language modeling with convolution,” in 2019 1st International
Conference on Advances in Science, Engineering and Robotics Technology (ICASERT),
pp. 1–4, IEEE, 2019.
[41] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional
transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
[42] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and
V. Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” arXiv preprint
arXiv:1907.11692, 2019.
[43] K. Clark, M.-T. Luong, Q. V. Le, and C. D. Manning, “Electra: Pre-training text encoders
as discriminators rather than generators,” arXiv preprint arXiv:2003.10555, 2020.
[44] J. Xiong, Q. Zhang, S. Zhang, J. Hou, and X. Cheng, “Hanspeller: a unified framework
for chinese spelling correction,” in International Journal of Computational Linguistics &
Chinese Language Processing, Volume 20, Number 1, June 2015-Special Issue on Chinese
as a Foreign Language, 2015.
[45] Y. Hong, X. Yu, N. He, N. Liu, and J. Liu, “Faspell: A fast, adaptable, simple, powerful
chinese spell checker based on dae-decoder paradigm,” in Proceedings of the 5th Workshop
on Noisy User-generated Text (W-NUT 2019), pp. 160–169, 2019.
[46] G. Lewis, “Sentence correction using recurrent neural networks,” Department of Computer
Science, Stanford University, p. 7, 2016.
REFERENCES 70
[47] M.-T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural
machine translation,” arXiv preprint arXiv:1508.04025, 2015.
[48] K. H. Lai, M. Topaz, F. R. Goss, and L. Zhou, “Automated misspelling detection and
correction in clinical free-text records,” Journal of biomedical informatics, vol. 55, pp. 188–
195, 2015.
[50] C.-H. Wu, C.-H. Liu, M. Harris, and L.-C. Yu, “Sentence correction incorporating relative
position and parse template language models,” IEEE Transactions on Audio, Speech, and
Language Processing, vol. 18, no. 6, pp. 1170–1181, 2009.
[51] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to
align and translate,” arXiv preprint arXiv:1409.0473, 2014.
[52] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by
reducing internal covariate shift,” in International conference on machine learning, pp. 448–
456, PMLR, 2015.
[53] S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry, “How does batch normalization help
optimization?,” arXiv preprint arXiv:1805.11604, 2018.
[62] X. Rong, “word2vec parameter learning explained,” arXiv preprint arXiv:1411.2738, 2014.
[65] O. Bojar, V. Diatka, P. Straňák, A. Tamchyna, and D. Zeman, “HindEnCorp 0.5,” 2014.
LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics
(ÚFAL), Faculty of Mathematics and Physics, Charles University.
[67] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,”
in Advances in neural information processing systems, pp. 3104–3112, 2014.
[70] S. Saha, F. Tabassum, K. Saha, and M. Akter, BANGLA SPELL CHECKER AND
SUGGESTION GENERATOR. PhD thesis, United International University, 2019.
[71] N. UzZaman and M. Khan, “A bangla phonetic encoding for better spelling suggesions,”
tech. rep., BRAC University, 2004.
[72] N. UzZaman and M. Khan, “A double metaphone encoding for bangla and its application
in spelling checker,” in 2005 International Conference on Natural Language Processing
and Knowledge Engineering, pp. 705–710, IEEE, 2005.
[73] P. Mandal and B. M. Hossain, “Clustering-based bangla spell checker,” in 2017 IEEE
International Conference on Imaging, Vision & Pattern Recognition (icIVPR), pp. 1–6,
IEEE, 2017.
[74] L. Gong, D. He, Z. Li, T. Qin, L. Wang, and T. Liu, “Efficient training of bert by
progressively stacking,” in International Conference on Machine Learning, pp. 2337–2346,
PMLR, 2019.
REFERENCES 72
[81] S. Saha, F. Tabassum, K. Saha, and M. Akter, BANGLA SPELL CHECKER AND
SUGGESTION GENERATOR. PhD thesis, United International University, 2019.
[82] S. Roy, “Improved bangla language modeling with convolution,” in 2019 1st International
Conference on Advances in Science, Engineering and Robotics Technology (ICASERT),
pp. 1–4, IEEE, 2019.
[84] Z. Chen and S. Xia, “K-means clustering algorithm with improved initial center,” in 2009
Second International Workshop on Knowledge Discovery and Data Mining, pp. 790–792,
IEEE, 2009.
Generated using Postgraduate Thesis LaTeX Template, Version 1.03. Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh.