
M.Sc. Engg. (CSE) Thesis

Neural Language Modeling for Context Based Word Suggestion, Sentence Completion and Spelling Correction in Bangla

Submitted by
Chowdhury Rafeed Rahman
1018052013

Supervised by
Dr. Mohammed Eunus Ali

Submitted to
Department of Computer Science and Engineering
Bangladesh University of Engineering and Technology
Dhaka, Bangladesh

in partial fulfillment of the requirements for the degree of


Master of Science in Computer Science and Engineering

August 2021
Acknowledgement
I am thankful to all the people who have made this research possible through their continuous assistance and motivation. I would first like to thank my respected supervisor, Professor Mohammed Eunus Ali, who supported the research from start to end through his wise suggestions, corrective measures and intriguing insights. His strong and constructive criticism helped this research reach a high standard.
I would also like to acknowledge Swakkhar Shatabda, Associate Professor at United International University (UIU), and the IT team of UIU, who provided a dedicated GPU server that was essential for conducting the large number of experiments in this research.
Finally, I would like to thank my parents and my wife for providing me with mental support and wise counsel during the tough times of the research period.

Dhaka Chowdhury Rafeed Rahman


August 5, 2021 1018052013

Contents

Candidate’s Declaration i

Board of Examiners ii

Acknowledgement iii

List of Figures vii

List of Tables ix

Abstract x

1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 Bangla Language Modeling Problem . . . . . . . . . . . . . . . . . . . 3
1.2.2 Spelling Correction Problem . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Overview of Related Works and Limitations . . . . . . . . . . . . . . . . . . . 4
1.4 Proposed Method Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5 Research Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.6 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2 Background 10
2.1 Unidirectional and Bidirectional Many to One LSTM . . . . . . . . . . . . . . 10
2.2 One Dimensional CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 Attention Mechanism in LSTM . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4 Self Attention Based Transformer Encoders . . . . . . . . . . . . . . . . . . . 14
2.5 BERT Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.6 Word Representation Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.7 Overview of Synthetic Spelling Error Generation . . . . . . . . . . . . . . . . 19

3 Literature Review 21
3.1 Language Modeling Related Researches . . . . . . . . . . . . . . . . . . . . . 21

3.1.1 LSTM Based Language Modeling . . . . . . . . . . . . . . . . . . . . 21
3.1.2 CNN Based Language Modeling . . . . . . . . . . . . . . . . . . . . . 23
3.1.3 Self Attention Based Language Modeling . . . . . . . . . . . . . . . . 24
3.2 Spell Checker Related Researches . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2.1 Existing Bangla Spell Checkers . . . . . . . . . . . . . . . . . . . . . 25
3.2.2 State-of-the-art Neural Network Based Chinese Spell Checkers . . . . . 27
3.2.3 Researches Indirectly Connected to Spell Checking . . . . . . . . . . . 28

4 Bangla Language Modeling Approach 29


4.1 Solution Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2 SemanticNet Sub-Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.3 Terminal Coordinator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.4 Attention to Patterns of Significance . . . . . . . . . . . . . . . . . . . . . . . 33
4.5 Positive Effect of Batch Normalization . . . . . . . . . . . . . . . . . . . . . . 34
4.6 Beam Search Based Sentence Completion . . . . . . . . . . . . . . . . . . . . 34

5 Language Modeling Performance Analysis 37


5.1 Dataset Specifications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.2 Model Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.2.1 Performance Metric . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.2.2 Hyperparameter Specifications . . . . . . . . . . . . . . . . . . . . . . 39
5.3 Hardware Specifications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.4 Comparing CoCNN with State-of-the-art Architectures . . . . . . . . . . . . . 40
5.4.1 Comparing CoCNN with Other CNNs . . . . . . . . . . . . . . . . . . 40
5.4.2 Comparing CoCNN with LSTMs . . . . . . . . . . . . . . . . . . . . . 41
5.4.3 Comparing CoCNN with Transformers . . . . . . . . . . . . . . . . . . 42
5.4.4 Comparing CoCNN with BERT Pre . . . . . . . . . . . . . . . . . . . 43
5.5 Comparative Analysis on Downstream Tasks . . . . . . . . . . . . . . . . . . . 45
5.5.1 Sentence Completion . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.5.2 Text Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.5.3 SemanticNet as Potential Spell Checker . . . . . . . . . . . . . . . . . 46

6 Bangla Spell Checking Approach 48


6.1 Solution Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
6.2 SemanticNet Sub-Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
6.3 BERT Base as Main Branch . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
6.4 Auxiliary Loss Based Secondary Branch . . . . . . . . . . . . . . . . . . . . . 50
6.5 BERT Hybrid Pretraining Scheme . . . . . . . . . . . . . . . . . . . . . . . . 52

7 Spell Checking Performance Analysis 54

7.1 Dataset Specifications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
7.2 Model Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
7.2.1 Performance Metric . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
7.2.2 Hyperparameter Specifications . . . . . . . . . . . . . . . . . . . . . . 56
7.3 Comparing BSpell with Contemporary BERT Variants . . . . . . . . . . . . . . 56
7.4 Comparing Different Pretraining Schemes on BSpell . . . . . . . . . . . . . . . 58
7.5 Comparing LSTM Variants with BSpell . . . . . . . . . . . . . . . . . . . . . 59
7.6 Comparing Existing Bangla Spell Checkers with BSpell . . . . . . . . . . . . . 60
7.7 Effectiveness of SemanticNet Sub-Model . . . . . . . . . . . . . . . . . . . . . 61
7.8 BSpell Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

8 Conclusion 64
8.1 Future Research Direction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
8.2 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

References 66

List of Figures

1.1 Bangla language unique characteristics . . . . . . . . . . . . . . . . . . . . . . 1


1.2 Examples of context and word pattern dependency for spelling correction. Bold
red marked word is the misspelled word, the correction of which depends on
the italicized blue words. In the provided English translation, (error) has been
placed as a representative of meaningless misspelled Bangla words. . . . . . . . 2
1.3 Spelling correction problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Heterogeneous character number between error spelling and correct spelling . . 5
1.5 Some sample words that are correctly spelled accidentally, but are context-wise
incorrect. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.6 Necessity of understanding semantic meaning of existing erroneous words for
spelling correction of an individual misspelled word. . . . . . . . . . . . . . . 6

2.1 Forward and backward LSTM comprising bidirectional LSTM . . . . . . . . . 10


2.2 1D convolution operation of filter size 2 on an 8 length sequence matrix . . . . 11
2.3 1D Pooling operation of size 2 and stride 2 on an 8 length sequence matrix . . . 12
2.4 One dimensional GlobalMaxPooling operation applied on a sequence feature
matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.5 Attention mechanism in encoder decoder based LSTM. Here, we assume the
number of time steps in both encoder and decoder to be 3. . . . . . . . . . . . . 13
2.6 High level overview of a Transformer based many to one model . . . . . . . . . 14
2.7 Peeking inside Transformer encoder block . . . . . . . . . . . . . . . . . . . . 14
2.8 Peeking inside Transformer multi-head self attention layer . . . . . . . . . . . 15
2.9 Looking at self attention mechanism prevalent in a single head of Transformer
multi-head attention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.10 BERT architecture and pretraining scheme overview . . . . . . . . . . . . . . . 17
2.11 Skip-gram Word2Vec model under window size 3 . . . . . . . . . . . . . . . . 18
2.12 FastText representation details of a sample word . . . . . . . . . . . . . . . . . 19
2.13 Bangla synthetic error generation algorithm high level overview . . . . . . . . 19
2.14 Some sample error words generated by error generation algorithm . . . . . . . 20

4.1 1D CNN based CoCNN architecture . . . . . . . . . . . . . . . . . . . . . . . 30


4.2 1D convolution as a form of attention . . . . . . . . . . . . . . . . . . . . . . . 33

4.3 Beam search simulation with 3 time stamps and a beam width of 2 . . . . . . . 35

5.1 Comparing CoCNN with state-of-the-art architectures of CNN paradigm on


Prothom-Alo validation set. The score shown beside each model name in bracket
denotes that model’s PPL score on Prothom-Alo validation set after 15 epochs of
training. Note that this dataset contains synthetically generated spelling errors. . 40
5.2 Comparing CoCNN with state-of-the-art LSTM architectures in terms of loss
plots and PPL score on Prothom-Alo validation set . . . . . . . . . . . . . . . 42
5.3 Comparing CoCNN with state-of-the-art Transformer architectures in terms of
loss plots and PPL score on Prothom-Alo validation set . . . . . . . . . . . . . 43
5.4 Comparing CoCNN with BERT Pre on Bangla (Prothom-Alo) and Hindi
(Hindinews and Livehindustan merged) validation set. The score shown beside
each model name denotes that model’s PPL score after 30 epochs of training on
corresponding training set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.5 Spelling error collection app overview . . . . . . . . . . . . . . . . . . . . . . 47

6.1 BSpell architecture details . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49


6.2 Gradient vanishing problem elimination using auxiliary loss . . . . . . . . . . 51
6.3 Proposed hybrid pretraining scheme . . . . . . . . . . . . . . . . . . . . . . . 52

7.1 The bold words in the sentences are the error words. The red marked words
are the ones where only BSpell succeeds in correcting among the approaches
provided in Table 7.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
7.2 The bold words in the sentences are the error words. The red marked words are
the ones where only hybrid pretrained BSpell succeeds in correcting among the
approaches provided in Table 7.3 . . . . . . . . . . . . . . . . . . . . . . . . . 58
7.3 10 popular Bangla words and their real life error variations. Cluster visual is
provided in Figure 7.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
7.4 (a) 10 Bangla popular words and three error variations of each of those
words (provided in Figure 7.3) have been plotted based on their SemanticVec
representation. Each popular word along with its three error variations form a
separate cluster marked with unique color. (b) Two of the popular word clusters
zoomed in to have a closer look. . . . . . . . . . . . . . . . . . . . . . . . . . 62
7.5 Three example scenarios where BSpell fails . . . . . . . . . . . . . . . . . . . 62

8.1 Examples of online spell checking . . . . . . . . . . . . . . . . . . . . . . . . 64

List of Tables

5.1 Dataset specific details for language modeling experiments . . . . . . . . . . . 37


5.2 Parameter number and training time per epoch on Prothom-Alo training set of
analyzed CNN architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.3 PPL comparison between BERT Pre and CoCNN . . . . . . . . . . . . . . . . 43
5.4 Comparing BERT Pre and CoCNN in terms of efficiency . . . . . . . . . . . . 44
5.5 PPL score obtained by CoCNN and BERT Pre. Each model was trained for
70 epochs on Prothom-Alo dataset for Bangla sentence completion task, and
70 epochs on Hindinews and Livehindustan merged dataset for Hindi sentence
completion task. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.6 Bangla and Hindi Text classification dataset details . . . . . . . . . . . . . . . 45
5.7 Text classification performance comparison between BERT Pre and CoCNN . . 46
5.8 Bangla spelling correction comparison in terms of accuracy . . . . . . . . . . . 46

7.1 Dataset specification details for spell checker model development . . . . . . . . 54


7.2 Comparing BERT based variants. Typical word masking based pretraining has
been used on all these variants [1]. Real-Error (Fine Tuned) denotes fine tuning
of the Bangla synthetic error dataset trained model on real error dataset, while
Real-Error (No Fine Tune) means directly validating synthetic error dataset
trained model on real error dataset without any further fine tuning on real error
data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
7.3 Comparing BSpell exposed to various pretraining schemes . . . . . . . . . . . 58
7.4 Comparing LSTM based variants with hybrid pretrained BSpell. FastText word
representation has been used with LSTM portion of each architecture. . . . . . 59
7.5 Comparing existing Bangla spell checkers with hybrid pretrained BSpell model 61

Abstract
Though there has been a large body of recent work in language modeling for high resource languages such as English and Chinese, the area is still largely unexplored for low resource languages like Bangla and Hindi. We propose an end to end trainable, memory efficient convolutional neural network (CNN) architecture, CoCNN, to handle specific characteristics of Bangla and Hindi such as high inflection, morphological richness, flexible word order and phonetic spelling errors. In particular, we introduce two learnable convolutional sub-models at word and sentence level. We show that state-of-the-art Transformer models do not necessarily yield the best performance for Bangla and Hindi. CoCNN outperforms pretrained BERT with 16X fewer parameters and 10X less training time, while achieving much better performance than state-of-the-art long short term memory (LSTM) models on multiple real-world datasets. The word level CNN sub-model of CoCNN, SemanticNet, has shown its potential as an effective Bangla spell checker. We explore this potential and develop a state-of-the-art Bangla spell checker. Bangla typing is mostly performed using an English keyboard and can be highly erroneous due to the presence of compound and similarly pronounced letters. Spelling correction of a misspelled word requires understanding of the word typing pattern as well as the context of the word usage. We propose a specialized BERT model, BSpell, targeted towards word-for-word correction at sentence level. BSpell contains the CNN sub-model SemanticNet, motivated by CoCNN, along with a specialized auxiliary loss. This allows BSpell to specialize in the highly inflected Bangla vocabulary in the presence of spelling errors. We further propose a hybrid pretraining scheme for BSpell that combines word level and character level masking. Utilizing this pretraining scheme, BSpell achieves 91.5% accuracy on a real life Bangla spelling correction validation set.

Chapter 1

Introduction

Natural language processing (NLP) has flourished in recent years in various domains, from writing automation tools to robot communication modules. We can see applications of NLP in mobile phone writing pads during automatic next word suggestion, in emailing software during sentence auto completion, in writing and editing tools during automatic spelling correction, and even in spam message detection. NLP research has advanced in high resource languages such as English and Chinese more than ever before [2–7]. Although Bangla is the language commemorated by International Mother Language Day, it is still considered a low resource language, especially from an NLP research perspective. Languages such as Bangla and Hindi originated from Sanskrit and have very different characteristics from English and Chinese. Hence, we focus on Bangla language modeling and propose a model suitable for Bangla NLP tasks.

1.1 Motivation

[Figure 1.1 content: Bangla root words with their inflected variations (high inflection), compound characters with their component characters (morphological richness), and valid sentence samples whose words appear in different orders (flexible word order); Bangla script not reproduced here.]

Figure 1.1: Bangla language unique characteristics

Bangla is the native language of about 228 million speakers, making it the fourth most spoken language in the world. The language originated from Sanskrit [8] and displays some unique characteristics. First of all, it is a highly inflected language: each Bangla root word may have many variations due to the addition of different suffixes and prefixes. Second, Bangla is morphologically rich; there are 11 vowels, 39 consonants, 11 modified vowels and more than 170 compound characters in Bangla script [9]. Third, there is a vast difference between grapheme representation and phonetic utterance for many commonly used Bangla words. As a result, fast typing of Bangla yields frequent spelling mistakes. Finally, the word order in Bangla sentences is flexible; word order and word position are only loosely constrained. Some relevant examples of these unique aspects are shown in Figure 1.1. Many other languages such as Hindi, Nepali, Gujarati, Marathi, Kannada, Punjabi and Telugu share these characteristics because of the same Sanskrit origin.

[Figure 1.2 content, English glosses of the incorrect/correct Bangla sentence pairs and their main correction dependency: Prior Context - "We have final trench at school tomorrow." vs. "We have final exam at school tomorrow."; Post Context - "We have final exam at (error) tomorrow." vs. "We have final exam at school tomorrow."; Error Word Pattern - "We have final (error) at school tomorrow." vs. "We have final ceremony at school tomorrow."]

Figure 1.2: Examples of context and word pattern dependency for spelling correction. Bold red
marked word is the misspelled word, the correction of which depends on the italicized blue words.
In the provided English translation, (error) has been placed as a representative of meaningless
misspelled Bangla words.

Because of the above mentioned unique characteristics, Bangla language modeling needs special attention, different from that given to high resource languages such as English and Chinese. High inflection, morphological richness and the deviation between grapheme representation and phonetic utterance make Bangla prone to spelling errors. Spell checking is therefore an interesting major downstream task related to Bangla language modeling. Moreover, most Bangla native speakers type using an English QWERTY layout keyboard [10], which makes it difficult to correctly type Bangla compound characters, phonetically similar single characters and similarly pronounced modified vowels. Thus, Bangla typing is slow if correct spelling is desired. Furthermore, a major portion of Bangla native speakers are semi-literate and have no clear idea of correct Bangla spelling. So, proper focus should be given to solving this problem for the Bangla language.
Neural language models have recently shown great promise in solving several key NLP tasks such as word prediction and sentence completion in major languages such as English and Chinese [2–7]. To the best of our knowledge, none of the existing studies investigate the efficacy of recent language models in the context of languages like Bangla and Hindi. In this research, we conduct an in-depth analysis of major deep learning architectures for language modeling and propose an end to end trainable, memory efficient CNN architecture to address the unique characteristics of Bangla and Hindi. A sub-component of this architecture has shown great promise as a word level Bangla spell checker. We argue that in order to correct a misspelled word, we need to consider the word pattern, its prior context and its post context. Figure 1.2 shows three examples where the dominant aspect of the word correction process among these three is indicated. Our proposed method for Bangla spell checking therefore takes its intuition from our proposed CNN based architecture for Bangla language modeling and incorporates the necessary changes for learning context information as well.

1.2 Problem Statement


This section formally defines the problems of Bangla language modeling and spelling correction that we deal with in this research. While language modeling covers a general class of NLP problems, spelling correction is a more specific NLP problem.

1.2.1 Bangla Language Modeling Problem

Next word prediction is the primary language modeling task for any language. Given $Word_1, Word_2, \ldots, Word_i$ of a sentence, we predict the next word $Word_{i+1}$. We also include other many to one language modeling tasks such as text classification and sentiment analysis as auxiliary tasks of this research.

1.2.2 Spelling Correction Problem

Figure 1.3: Spelling correction problem

We focus on Bangla sentence level spelling correction. Suppose an input sentence consists of $n$ words: $Word_1, Word_2, \ldots, Word_n$. For each $Word_i$, we have to predict the right spelling, provided that $Word_i$ exists in the top word list of our corpus. By top words we mean frequently occurring words in our Bangla corpus. If $Word_i$ is a rare word (a proper noun in most cases), we predict a UNK token, denoting that we do not make any correction for such words. Think of a user writing the name of a friend in a sentence: we obviously have no way of knowing the correct spelling of that friend's name. For correcting a particular $Word_i$ in a paragraph, we only consider the other words of the same sentence as context information. Figure 1.3 illustrates the spelling correction problem. The input is a sentence containing misspelled words, while the output is the correctly spelled version of the sentence.
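As a minimal illustration of this formulation (an assumed sketch, not the thesis implementation), the correction target for each word position can be built by mapping rare words to the UNK label:

```python
# Illustrative sketch: word-level correction targets where any word outside the
# frequent ("top") word list is mapped to a UNK label, as described above.
def build_targets(correct_words, top_words, unk_token="UNK"):
    return [w if w in top_words else unk_token for w in correct_words]

# Example: a rare proper noun stays uncorrected and becomes UNK.
top_words = {"school", "exam", "final", "tomorrow"}
print(build_targets(["final", "exam", "Rahim"], top_words))  # ['final', 'exam', 'UNK']
```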

1.3 Overview of Related Works and Limitations


State-of-the-art techniques for language modeling can be categorized into three sub-domains of deep learning: (i) convolutional neural networks (CNN) [4, 11], (ii) recurrent neural networks [12–15], and (iii) Transformer attention networks [16–19]. Long Short Term Memory (LSTM) based models are suitable for learning sequence and word order information, but they are not effective for modeling languages such as Bangla and Hindi due to the flexible word order of these languages. Transformers, on the other hand, use a dense layer based multi-head attention mechanism. They lack the ability to learn local patterns at sentence level, which in turn negatively affects the modeling of languages with loosely bound word order. Neither LSTMs nor Transformers take any suitable measure to learn intra-word local patterns, which are necessary for modeling highly inflected and morphologically rich languages.
In contrast, contemporary CNN based architectures are good at local pattern learning. In fact, the convolution operation is similar to an attention mechanism, as it puts more importance on certain patterns than on others when producing the correct output. A CNN applied at word level can extract meaningful semantic information and morphological patterns from highly inflected, error prone Bangla words. Furthermore, a CNN applied at word coordination level can pay attention to the important word clusters needed to provide a suitable output, whether that output is the possible next word, a document class, a sentiment analysis result or something else.
A few studies have been performed on Bangla spell checker development as well. Existing Bangla spell checkers include phonetic rule based [20, 21] and clustering based methods [22]. These methods are rule based and do not take the context of the misspelled word into consideration. Another N-gram based Bangla spell checker [23] takes only short range previous context into consideration. With a view to correcting spelling mistakes, sequence to sequence (seq2seq) attention modeling was performed at character level for Hindi (a Sanskrit originated language similar to Bangla) [24]. Such a character level decoding scheme is also error prone, because a mistake in decoding a single character of a multi-character word renders the whole word misspelled. Recent state-of-the-art spell checkers have been developed for the Chinese language. A character level confusion set (similar characters) guided seq2seq model has been proposed in [25]. Another study used a similarity mapping graph convolutional network to guide BERT (Bidirectional Encoder Representations from Transformers) based character by character parallel correction [26]. Both these methods [25, 26] require external knowledge and assumptions about the confusing character pairs existing in the language. The most recent Chinese spell checker offers an assumption free BERT architecture that includes error detection network based soft-masking [27]. This model takes all $N$ characters of a sentence as input and produces the correct version of these $N$ characters as output in a parallel manner. In this model, the number of input and output characters has to be the same.

[Figure 1.4 content: three wrong/correct Bangla word pairs shown with their character decompositions; in each pair the misspelled form contains fewer characters than the correct form.]

Figure 1.4: Heterogeneous character number between error spelling and correct spelling

We now point out the limitations of existing spell checking research with respect to the Bangla language. One limitation in developing a Bangla spell checker using the state-of-the-art BERT based implementation [27] is that the number of input and output characters has to be exactly the same. This is problematic for Bangla for the following reasons. First of all, such a scheme is only capable of correcting substitution type errors; deletion or insertion will certainly alter word length. Moreover, as compound characters (made up of three or more simple characters) are commonly used in Bangla words, an error made by substituting such a character also changes the word length. We show this scenario in Figure 1.4, where three real life cases are presented. In each case, the character count has decreased in the misspelled version. The correct version of the first word in this figure contains 7 characters, whereas the misspelled version contains only 5 characters due to a typing mistake in place of the word's only compound character. The word length has changed without any deletion or insertion. As a result, character based BERT correction is not feasible for Bangla. Hence, we introduce word level prediction in our proposed BERT based model.

[Figure 1.5 content, English glosses of the correct/incorrect Bangla sentence pairs: "Soldier went to war riding a horse." vs. "Soldier went to war riding a visit."; "The criminal confessed crime." vs. "The criminal hunted crime."; "Tomorrow is our final exam." vs. "Tomorrow is our final trench."]

Figure 1.5: Some sample words that are correctly spelled accidentally, but are context-wise
incorrect.

Most existing Bangla spell checkers focus only on the word to be corrected, ignoring the importance of context. Figure 1.5 illustrates the importance of context in Bangla spell checking. Although the red marked words in this figure are misspelled versions of the corresponding green marked correct words, these red words are themselves valid Bangla words. But if we check these red words against the semantic context of the sentence, we realize that they have been produced accidentally through spelling errors. In the first sentence, the red marked word means to visit, while the sentence context hints at a soldier riding towards the battleground. A slight modification of this red word makes it correct (written in green), now carrying the meaning of horse. These are all real life Bangla spelling error examples, which demonstrate the paramount importance of context in spell checking.

[Figure 1.6 content: a fully misspelled Bangla sentence and its correct form, meaning "Village is dependent on agriculture", followed by a table (Misspelled Word, Correction Dependency, Correct Word) whose English glosses read: "village" depends on "agriculture"; "agriculture" depends on "village" and "dependent"; "on" depends on "dependent"; "dependent" depends on "on".]

Figure 1.6: Necessity of understanding semantic meaning of existing erroneous words for
spelling correction of an individual misspelled word.

Furthermore, real life Bangla spelling errors made while typing fast often span multiple words of a sentence. Such a real life example is provided in Figure 1.6, where all words have been misspelled during speed typing. The table in the figure shows, for each misspelled word, the words to look at in order to correct it; these include the correct versions of other misspelled words in the same sentence. Our proposed solution should therefore be able to capture the approximate semantic meaning of a word even if it is misspelled.

1.4 Proposed Method Overview


We observe that learning inter-word (flexible word order) and intra-word (high inflection and morphological richness) local patterns is of paramount importance for modeling languages such as Bangla and Hindi. Moreover, these languages require a model design that does not impose any strict sequence order on the words of a sentence. Our language model design therefore includes an individual word level sub-model and a word coordination oriented sentence level sub-model for extracting useful information at intra- and inter-word level. Moreover, our design focuses on local pattern learning rather than strict sequence order information extraction. As context is an important issue for spell checking in Bangla, we replace our word coordination oriented neural network sub-model with a context sensitive self attention based module capable of carrying out a many to many task. We also modify our loss function to focus more on word level spelling for better spell checking performance.
We design a novel CNN architecture, namely Coordinated CNN (CoCNN), that achieves state-of-the-art performance in Bangla language modeling with low training time. In particular, our proposed architecture consists of two learnable convolutional sub-models: a word level one (SemanticNet) and a sentence level one (Terminal Coordinator). SemanticNet is well suited for syllable pattern learning, whereas Terminal Coordinator serves the purpose of word coordination learning while maintaining positional independence, which suits the flexible word order of the Bangla and Hindi languages. It is important to note that CoCNN does not explicitly incorporate any self attention mechanism like Transformers do; rather, it relies on its Terminal Coordinator sub-model for emphasizing important word patterns. We validate the effectiveness of CoCNN on a number of tasks that include next word prediction in an erroneous setting, text classification, sentiment analysis and sentence completion. CoCNN achieves slightly better performance than a pretrained BERT model with 16X fewer parameters and 10X less training time. CoCNN also shows far superior performance to contemporary LSTM models.
CoCNN is an architecture suitable for many to one tasks, whereas sentence level spelling correction is a many to many NLP task. We therefore propose a different architecture for this task, in line with the intuition behind the CNN sub-model usage. We propose a word level BERT [28] based architecture, namely BSpell. This architecture is capable of learning prior and post context dependency through the multi-head attention mechanism of stacked Transformer encoders [29]. The proposed architecture has the following three major features.
SemanticNet: In this architecture, each input word of a sentence passes through learnable CNN sub-layers so that the model can capture semantic meaning even if the word is misspelled to a certain extent. These learnable sub-layers form SemanticNet, and the word representation obtained from them is named SemanticVec.
Auxiliary loss: We also introduce a specialized auxiliary loss to facilitate SemanticNet sub-model learning and to emphasize the importance of the user's writing pattern. This loss is placed at an intermediate level of the model instead of the final level, thus mitigating the gradient vanishing problem prevalent in very deep architectures.
Hybrid pretraining: An important aspect of existing BERT models is the pretraining scheme. Existing pretraining schemes [1] are designed mainly to understand language context. In order to capture both context and word error patterns, we introduce a new pretraining scheme, namely hybrid pretraining. Besides the word masking based fill-in-the-gaps task of contemporary pretraining methods, hybrid pretraining incorporates a word prediction task after blurring a few characters of some words. Such a scheme teaches the BERT model to recover a word from its noisy version.
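As an illustration only (the corruption function, mask token and rates here are assumptions, not the BSpell recipe), a hybrid pretraining example could be assembled by masking some words and character-blurring a few others while keeping the clean sentence as the target:

```python
# Illustrative sketch of hybrid pretraining sample construction. `corrupt_chars`
# stands in for any character-level noising function, e.g. the synthetic error
# generator described in Section 2.7.
import random

def hybrid_example(words, corrupt_chars, p_mask=0.15, p_blur=0.15):
    inputs = []
    for w in words:
        r = random.random()
        if r < p_mask:
            inputs.append("[MASK]")            # word level masking
        elif r < p_mask + p_blur:
            inputs.append(corrupt_chars(w))    # character level blurring
        else:
            inputs.append(w)
    return inputs, list(words)                 # targets are the clean words

words = "we have our final exam tomorrow".split()
print(hybrid_example(words, corrupt_chars=lambda w: w[:-1] or w))  # crude stand-in corruption
```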

1.5 Research Contributions


This research has the following contributions to Bangla NLP community:

• We propose an end to end trainable CNN architecture CoCNN based on the coordination
of two CNN sub-models.

• We perform an in-depth analysis and comparison of different state-of-the-art language models in three paradigms: CNN, LSTM, and Transformer. We demonstrate the effectiveness of CoCNN with an extensive set of experiments on next word prediction, sentence completion, text classification and sentiment analysis, and show that CoCNN outperforms LSTM and Transformer based models in spite of being memory efficient.

• We propose the BSpell architecture, in which we introduce a learnable CNN sub-model named SemanticNet at individual word level in order to obtain a semantically meaningful SemanticVec representation of each word. We introduce a specialized auxiliary loss function for effective training of the SemanticNet sub-model.

• We introduce a hybrid pretraining scheme for the BSpell architecture, which facilitates sentence level spelling correction by combining word level and character level masking. Detailed evaluation on three error datasets, including a real life Bangla error dataset, shows the effectiveness of our approach. Our evaluation also includes a detailed analysis of possible LSTM based spelling correction architectures.

1.6 Thesis Organization


This thesis is organized as follows. Chapter 2 provides a detailed description of LSTM, CNN, self attention, Transformer and word representation topics that need to be understood in order to appreciate the proposed methods and experimental evaluation of this research. Chapter 3 discusses state-of-the-art research, its motivations and its contributions related to language modeling and spell checking in Bangla as well as in high resource languages. The problems of Bangla language modeling and spell checking have already been formally defined in Section 1.2. Our core algorithms start from Chapter 4, which describes the details of our proposed Coordinated CNN (CoCNN) architecture along with the intuition behind its components' design. Next, in Chapter 5, we compare CoCNN with state-of-the-art architectures from the LSTM, CNN and Transformer paradigms in terms of next word prediction, sentence completion and text classification tasks. Chapter 6 provides the BSpell architecture details and our proposed hybrid pretraining scheme targeted towards Bangla spell checking. Chapter 7, on the other hand, compares the BSpell model with state-of-the-art spell checkers used in high resource languages, existing Bangla spell checkers and possible LSTM based Bangla spell checkers. Chapter 8, the final chapter of this thesis, provides our concluding remarks regarding this research and some possible future directions for research in Bangla NLP.
Chapter 2

Background

This chapter describes the components that need to be understood before going through our proposed methods and the conducted experiments. We start with some learning based model components such as bidirectionality in LSTM, one dimensional convolution, the attention mechanism in LSTMs and self attention in Transformers. Then we move on to describing the BERT architecture, its commonly used pretraining scheme and word representation schemes. Finally, we provide an overview of a synthetic spelling error generation algorithm used to prepare some of our datasets.

2.1 Unidirectional and Bidirectional Many to One LSTM


Figure 2.1: Forward and backward LSTM comprising bidirectional LSTM

Unidirectional LSTM can be described as a sub-module of bidirectional LSTM. Figure 2.1


shows the general architecture of a bidirectional LSTM. It is a combination of two unidirectional
LSTMs learning sequence order information in opposite directions. Suppose we have $T$ time steps, with input entity vector $X_i$ at time step $i$. Each input goes into both the forward LSTM (FLSTM) and the backward LSTM (BLSTM). $FS_0$ and $BS_0$ are the initial hidden states of FLSTM and BLSTM, respectively. FLSTM extracts information from $X_0$ to $X_T$, while BLSTM does the same from $X_T$ to $X_0$. The outputs of both unidirectional LSTMs are combined to get the final output. In order to get output $Y_i$, we maximize the probability $P_i$:

$P_i = P_{Fi} \times P_{Bi}$

$P_{Fi} = P(Y_i \mid X_0, X_1, \ldots, X_{i-1})$

$P_{Bi} = P(Y_i \mid X_T, X_{T-1}, \ldots, X_{i+1})$
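As a concrete illustration, a many to one bidirectional LSTM can be written in a few lines of Keras; this is a minimal sketch under assumed layer sizes, not the configuration used in the thesis:

```python
# Minimal Keras sketch of a many to one bidirectional LSTM (sizes are illustrative).
import tensorflow as tf

T, D, num_classes = 20, 128, 10000           # time steps, input dim, output vocabulary
inputs = tf.keras.Input(shape=(T, D))
# Forward and backward LSTMs run over the sequence; their final states are
# concatenated into one summary vector, which feeds the output layer.
summary = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(256))(inputs)
outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(summary)
model = tf.keras.Model(inputs, outputs)
```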

2.2 One Dimensional CNN


A one dimensional (1D) CNN is a combination of two types of neural networks with a bridge in between. The initial part consists of the convolution portion (convolution and pooling layers), while the latter part consists of dense layers. The convolution portion expects a two dimensional matrix as input per sample, whereas dense layers expect a vector per sample. In a 1D CNN, a 1D global max pooling layer acts as the converter from matrix to vector.


Figure 2.2: 1D convolution operation of filter size 2 on an 8 length sequence matrix

We demonstrate the 1D convolution operation in Figure 2.2. There are 8 entities in the input sequence, $X_1, X_2, \ldots, X_8$, each represented with a vector of size 6. We have a single 1D convolution filter of size 2 performing the convolution operation on this sequence of 8 entities. The input sequence can be represented as an 8 × 6 two dimensional matrix. Since our filter size is 2, the convolution operation is performed on two entities at a time. With a stride of 1, the first convolution operation is performed on $X_1, X_2$; the second on $X_2, X_3$; and the final one on $X_7, X_8$. Each convolution operation provides a scalar value as output. Thus we obtain a vector of 7 scalar values as the result of 1D convolution on the sequence matrix using a 1D filter of size 2.
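The same operation can be reproduced in a few lines of NumPy (a sketch mirroring the 8 × 6 example above; the weights are random placeholders):

```python
# 1D convolution with a single filter of size 2, stride 1, no padding.
import numpy as np

X = np.random.randn(8, 6)        # 8 sequence entities, each a vector of size 6
W = np.random.randn(2, 6)        # one 1D filter spanning 2 consecutive entities
out = np.array([np.sum(X[i:i + 2] * W) for i in range(8 - 2 + 1)])
print(out.shape)                 # (7,) -- one scalar per filter position
```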
We use the same input sequence matrix to explain the 1D max pooling operation in Figure 2.3. We demonstrate a pooling operation of size 2 with a stride of 2. Unlike convolution, 1D pooling is performed column-wise, starting with the leftmost column $C_1$.

Figure 2.3: 1D Pooling operation of size 2 and stride 2 on an 8 length sequence matrix

1D pooling of size 2 is first performed on $C_1X_1$ and $C_1X_2$, providing a scalar value (the maximum of $C_1X_1$ and $C_1X_2$ for max pooling) as output. With a stride of 2, the next pooling operation is performed on $C_1X_3$ and $C_1X_4$ of the same column, providing another scalar value. The final pooling operation in column $C_1$ is performed on $C_1X_7$ and $C_1X_8$. Thus, after performing the pooling operation on column $C_1$, we get output column $O_1$. Similar operations are performed on columns $C_2, C_3, \ldots, C_6$, providing vectors $O_2, O_3, \ldots, O_6$ as output. Our final output after performing 1D pooling on the input sequence is therefore a 2D matrix of size 4 × 6.


Figure 2.4: One dimensional GlobalMaxPooling operation applied on a sequence feature matrix

Finally, we demonstrate 1D global max pooling in Figure 2.4 using the same input sequence matrix. This operation is also performed column-wise. We take the maximum value from each column $C_1, C_2, \ldots, C_6$; the maximum value of each column has been marked in deep blue in the figure. These column-wise maximum values form the output vector of 1D global max pooling.
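Both column-wise pooling operations can likewise be sketched in NumPy for the same 8 × 6 matrix (illustrative code, not the thesis implementation):

```python
import numpy as np

X = np.random.randn(8, 6)

# 1D max pooling of size 2 with stride 2: each column shrinks from 8 to 4 values.
pooled = np.max(X.reshape(4, 2, 6), axis=1)
print(pooled.shape)              # (4, 6)

# 1D global max pooling: the maximum of each column, i.e. one vector of size 6.
global_pooled = np.max(X, axis=0)
print(global_pooled.shape)       # (6,)
```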

2.3 Attention Mechanism in LSTM


When we have a sequence of entities as input, the attention mechanism works by paying more attention to those entities that are important for producing the desired output. The attention mechanism is popularly incorporated into encoder decoder based LSTM architectures.


Figure 2.5: Attention mechanism in encoder decoder based LSTM. Here, we assume the number of time steps in both encoder and decoder to be 3.

Such architectures are used in sequence to sequence tasks such as machine translation and sentence correction. In such tasks, the number and positions of the sequence entities vary greatly between the input and output sequences. Figure 2.5 demonstrates the attention mechanism in an encoder decoder
LSTM. The encoder and decoder are two different LSTMs with different weights. The encoder LSTM encodes the whole input sequence into a hidden state $S_0$. The output vectors obtained at each time step of encoding are concatenated into an encoder output matrix. The initial hidden state of the decoder LSTM is $S_0$, the summarized state of the input sequence, while its input is $X_0$, typically a default start symbol. Suppose the output hidden state and output probability vector of time step 0 are $S_1$ and $Y_0$, respectively. In a typical encoder decoder LSTM, we would feed the vector $X_1$ corresponding to the maximum probability index of output vector $Y_0$ to the decoder at time step 1. With the attention mechanism, however, we first merge $S_1$ and the encoder output matrix through dot product like operations into a context vector. We concatenate this context vector with $X_1$ and then feed the resulting vector to the decoder LSTM at time step 1. Such a context vector helps the decoder get an idea of which portion of the input sequence to focus on most in order to produce output $Y_1$. At time step 2, we follow a similar procedure of context vector formation using output hidden state $S_2$ and the encoder output matrix, then concatenate with vector $X_2$, the vector corresponding to the maximum probability index of output $Y_1$. This process continues until we receive a stop symbol as output.
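A single decoder step of this dot product style attention can be sketched as follows (the shapes and the plain dot product scoring are assumptions for illustration; actual attention variants differ in how scores are computed):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

enc_outputs = np.random.randn(3, 64)   # encoder output matrix: 3 time steps, dim 64
s_t = np.random.randn(64)              # current decoder hidden state

scores = enc_outputs @ s_t             # one alignment score per encoder time step
weights = softmax(scores)              # attention distribution over encoder steps
context = weights @ enc_outputs        # context vector: weighted sum of encoder outputs

# The context vector is concatenated with the next decoder input before the
# decoder LSTM step, as described in the text.
```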

Figure 2.6: High level overview of a Transformer based many to one model

2.4 Self Attention Based Transformer Encoders


Transformers take in all sequence entities (as vector representations) at the same time as input. The output can contain the same number of entities as the input, or it can be reduced to a single output for many to one problems. Figure 2.6 gives a high level overview of a Transformer that takes two input sequence entities $X_1$ and $X_2$. There are $N$ Transformer encoder modules stacked on top of each other, each modifying the vectors $X_1$ and $X_2$ while passing them on to the next encoder. The $N$th Transformer encoder provides two modified vectors for $X_1$ and $X_2$, which in many to one problems are then converted into a single vector through schemes such as flattening or global pooling. Finally, this vector is passed on to dense layers and an output layer to obtain the final output.


Figure 2.7: Peeking inside Transformer encoder block



We now take a peek inside the Transformer encoder block in Figure 2.7. The unique feature of this encoder is its self attention mechanism, based on the multi-head self attention block shown in the figure. Such attention provides a representative vector for each input entity based on its relationship with all other input entities and with itself. These vectors are then layer normalized and passed to a dense layer. Note that the dense layer applied to each vector is the same layer. The final self attention based vector representations of the input entities come out of these dense layers.


Figure 2.8: Peeking inside Transformer multi-head self attention layer

Now let us peek inside the multi-head self attention block. We demonstrate this block in Figure 2.8 with two input vectors $X_1$ and $X_2$. The multi-head self attention block, as its name suggests, consists of multiple branches that apply the attention mechanism on the inputs. In the provided figure, there are 3 attention heads; there can obviously be more. Let us describe the first attention head in detail, as all of them follow the same principle. An attention head has three types of weight matrices: the query ($QW_1$), key ($KW_1$) and value ($VW_1$) weight matrices. These are all learnable parameters updated during training. In the figure, we use three 4 × 3 weight matrices for demonstration purposes. We take the dot product of our input matrix $X$ of size 2 × 4 with $QW_1$, $KW_1$ and $VW_1$ separately to obtain three matrices $Q_1$, $K_1$ and $V_1$, each of size 2 × 3. These three matrices are merged into a single matrix $Z_1$ of size 2 × 3 (the details of this process are described in the next paragraph). We obtain matrices $Z_2$ and $Z_3$ of the same size in the same manner from the other two attention heads. These three matrices are then concatenated row-wise to obtain a single matrix of size 2 × 9. This concatenated matrix is passed through a dense layer whose number of nodes equals the vector size of $X_1$ and $X_2$, that is 4 in our case. Finally, after this dense layer, we obtain two modified vectors for $X_1$ and $X_2$, which are of the same size as their original vector representations.

Figure 2.9: Looking at self attention mechanism prevalent in a single head of Transformer
multi-head attention

Now let us try to understand the intuition behind the query, key and value dot products and the merging procedure inside a single Transformer attention head. A detailed demonstration is provided in Figure 2.9. From the previous discussion, we obtain three vectors $Q_{11}$ (row 1 of matrix $Q_1$ in Figure 2.8), $K_{11}$ and $V_{11}$ from input vector $X_1$ using the three weight matrices of attention head one. Similarly, from $X_2$ we obtain $Q_{12}$, $K_{12}$ and $V_{12}$. Suppose we want to obtain $Z_{11}$, the modified vector of $X_1$ produced by applying attention head one to $X_1$. We take the dot product of the query vector $Q_{11}$ of $X_1$ with the key vectors of both $X_1$ and $X_2$, getting 20 and 5, respectively. We then turn these scores into a probability distribution using the Softmax activation, obtaining attention weights of 0.8 and 0.2, respectively (the figure uses these values for illustration). Vector $T_{11}$ is obtained by multiplying $V_{11}$ by 0.8, while vector $T_{12}$ is obtained by multiplying $V_{12}$ by 0.2. These two weighted vectors are added to obtain $Z_{11}$ (the first row of matrix $Z_1$ in Figure 2.8). Similarly, to obtain $Z_{12}$, the modified vector of $X_2$ under attention head one, we follow the same procedure except that the dot products are now performed using $Q_{12}$ instead of $Q_{11}$. This is how self attention works: modifying each input vector with respect to the other input vectors using the query, key and value weight matrices.
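A minimal NumPy sketch of one self attention head, following the shapes of Figure 2.8 (two input vectors of size 4, head dimension 3), is given below; the weights are random placeholders, and note that the actual Transformer additionally scales the scores by the square root of the key dimension before the softmax:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

X = np.random.randn(2, 4)                       # input entities X1, X2
QW1, KW1, VW1 = (np.random.randn(4, 3) for _ in range(3))

Q1, K1, V1 = X @ QW1, X @ KW1, X @ VW1          # each of size 2 x 3
scores = Q1 @ K1.T                              # 2 x 2 dot products (e.g. Q11.K11, Q11.K12)
weights = softmax(scores, axis=-1)              # per-entity attention distribution
Z1 = weights @ V1                               # 2 x 3 head output (rows Z11, Z12)

# Multi-head attention concatenates the head outputs (2 x 9 for three heads)
# and applies a shared dense layer mapping back to the input size (2 x 4).
```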

2.5 BERT Architecture


Bidirectional Encoder Representations from Transformers (BERT) is a stack of Transformer
encoders that can be used for natural language specific tasks such as text classification, sentiment
analysis, parts of speech tagging, spell checking and many more. The speciality of BERT is the pretraining scheme introduced in it.

Figure 2.10: BERT architecture and pretraining scheme overview

Before the advent of BERT, pretraining schemes in language models were not sample efficient: for each word of a sentence, a separate training sample had to be generated, which ultimately resulted in an explosion of the number of pretraining samples. The BERT pretraining scheme for the first time provided a masking based scheme that does not increase the number of pretraining samples beyond the number of sentences in the original corpus, and at the same time provides bidirectional pretraining to the model. A sample of mask based pretraining is shown in Figure 2.10. There are $N$ words in the input sentence; among them, $Word_2$ has been randomly replaced by a default mask token. At the output side, BERT needs to predict that this masked word is indeed $Word_2$, in addition to keeping the other words intact. It is a sort of fill-in-the-gaps task through which the language model learns the context and grammatical structure of the language. Generally, 15% of the words of a sentence are masked randomly. The BERT Base model consists of 12 Transformer encoders, while the BERT Large model consists of 24. In BERT Base, each Transformer encoder has 12 attention heads and a hidden size of 768 in its dense layers.
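A small sketch of this word masking scheme is shown below (illustrative only; the exact BERT recipe additionally replaces some selected words with random words or keeps them unchanged):

```python
# Roughly 15% of the words are replaced by a [MASK] token; the model is trained
# to recover the original words.
import random

def mask_sentence(words, mask_rate=0.15, mask_token="[MASK]"):
    inputs, targets = [], []
    for w in words:
        if random.random() < mask_rate:
            inputs.append(mask_token)   # masked position: must be predicted
        else:
            inputs.append(w)            # unmasked words are kept intact
        targets.append(w)
    return inputs, targets

print(mask_sentence("we have our final exam at school tomorrow".split()))
```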

2.6 Word Representation Schemes


For any learning based NLP task, we need to represent each word of the input sentence or document with an appropriate vector that captures the semantic meaning of that word. Words that are close in meaning and usage should have similar representative vectors. In many of our experiments, we have used FastText as our word representation. In order to understand FastText, we first need to understand the Skip-gram Word2vec representation of words.

Figure 2.11: Skip-gram Word2Vec model under window size 3

The learning scheme of FastText is similar to that of Skip-gram Word2vec. Word2vec formation requires specifying a window size. The word representation should be such that words of similar meaning have close (small Euclidean distance) vector representations. Words of similar meaning are generally used to express similar contexts, which means their neighbouring words are more or less similar across a large corpus. The window size denotes the neighbours to consider in order to form the Word2vec of a word. Figure 2.11 shows the neural network structure of Word2vec formation for a corpus containing $V$ words. We consider a window size of 3 (the word of interest, its right neighbour and its left neighbour) and want our Word2vec vector size to be $D$. In this case, our input is a $V$ size one hot vector of the word we are interested in. We use a hidden layer containing $D$ nodes, which means our hidden layer weight matrix is of size $V \times D$. The output layer consists of two parallel branches, one for the left neighbour and the other for the right neighbour word. We provide the one hot vectors of the neighbouring words in a sentence as the ground truth output during training. At the end of training, the $i$th row of our hidden layer weight matrix is the Word2vec representation of the $i$th word of the corpus.
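For reference, a skip-gram Word2vec model of this kind can be trained with gensim (an assumed library, not necessarily the one used in the thesis; note that gensim's window parameter counts neighbours on each side):

```python
from gensim.models import Word2Vec

sentences = [["village", "depends", "on", "agriculture"],
             ["we", "have", "final", "exam", "tomorrow"]]
model = Word2Vec(sentences, vector_size=100, window=1, sg=1, min_count=1)  # sg=1 -> skip-gram
vector = model.wv["exam"]        # 100-dimensional word representation
```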
The idea of FastText is based on character N-gram representation learning instead of full word representation learning. There are millions of words in a corpus, many of which are rarely used. Most of those rare or uncommon words are actually combinations of common sub-words, which we call character N-grams (N denotes the number of characters in the sub-word).

[Figure 2.12 content: the word <artificial> broken into character 3-grams <ar, art, rti, tif, ifi, fic, ici, cia, ial, al>, with a window of size 5 over the sub-word sequence.]

Figure 2.12: FastText representation details of a sample word

The concept of training and the model architecture are the same as for Word2vec, only this time we consider common sub-words. Figure 2.12 demonstrates character 3-gram based FastText formation. We first show the breakdown of the word artificial into sub-words of size 3. With a window size of 5, we consider the immediate couple of sub-words to the left and to the right of our sub-word of interest. In FastText, we consider different values of N and take the common character N-grams for forming the FastText representation in a way similar to Word2vec formation. Ultimately, the FastText representation of a full word is formed from the FastText representations of its constituent common sub-words through addition, averaging or some other operation.
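A corresponding FastText model with character 3- to 6-grams can be sketched the same way (again an assumed gensim usage; the thesis setup may differ):

```python
from gensim.models import FastText

sentences = [["village", "depends", "on", "agriculture"],
             ["we", "have", "final", "exam", "tomorrow"]]
model = FastText(sentences, vector_size=100, window=1, min_count=1,
                 min_n=3, max_n=6, sg=1)
vector = model.wv["exams"]       # composed from character n-grams, even for unseen words
```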

2.7 Overview of Synthetic Spelling Error Generation

[Figure 2.13 content: an input word is split into detected compound letters and detected base letters; compound letter error rules, phonetically similar letter error rules and accidental mispress type errors are then applied to produce the error word.]

Figure 2.13: Bangla synthetic error generation algorithm high level overview

Resources for real life labeled Bangla spelling errors are limited. Native Bangla speakers most commonly use an English QWERTY keyboard to write Bangla with the help of writing assistant software. Because of the phonetic similarity among different Bangla letters, the presence of compound letters in Bangla words and the extensive use of modified vowels, Bangla typing is inherently error prone. For producing synthetic but realistic Bangla spelling errors in words, a probabilistic error generation algorithm was proposed in [30] in light of popular Bangla misspellings. Figure 2.13 provides a high level overview of this algorithm. Given a word, the algorithm first detects all compound and simple (base) letters, as it handles these two types of letters differently. For compound letters, there are position dependent probabilistic error generation rules that take a generalized approach to error generation for many Bangla compound letters. For the detected base letters, the algorithm introduces two types of errors. The first type utilizes phonetically similar clusters of Bangla base letters formed through detailed investigation; the main idea is that a letter can be replaced with one of the other letters of the same cluster with some probability. The second type covers errors produced by accidental key mispresses. As a form of deletion, it drops accidentally missed modified vowels; as a form of insertion, it produces letters obtained by pressing keys neighbouring the intended key. A pair of realistic spelling errors for each of 10 Bangla words is provided in Figure 2.14.

Figure 2.14: Sample error words generated by the error generation algorithm. Ten correct Bangla words are shown, each paired with two algorithm-generated sample error words.
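To make the flavour of such a probabilistic error generator concrete, the sketch below corrupts a word character by character. The phonetic clusters, keyboard-neighbour map and probabilities are placeholder assumptions for illustration only and do not reproduce the actual rules of [30].

    import random

    # Placeholder resources (assumptions for illustration; the algorithm in [30]
    # uses carefully investigated Bangla letter clusters and probabilities).
    PHONETIC_CLUSTERS = [{"শ", "স", "ষ"}, {"ন", "ণ"}, {"জ", "য"}]
    KEY_NEIGHBOURS = {"ি": ["ী"], "ু": ["ূ"]}  # modified vowels and their neighbouring keys

    def corrupt_word(word, p_phonetic=0.15, p_mispress=0.1, p_drop_vowel=0.05):
        """Return a synthetically misspelled version of `word` (illustrative sketch)."""
        out = []
        for ch in word:
            r = random.random()
            cluster = next((c for c in PHONETIC_CLUSTERS if ch in c), None)
            if cluster and r < p_phonetic:
                # phonetically similar replacement within the same cluster
                out.append(random.choice(sorted(cluster - {ch})))
            elif ch in KEY_NEIGHBOURS and r < p_phonetic + p_mispress:
                # accidental press of a neighbouring key (insertion-style error)
                out.append(random.choice(KEY_NEIGHBOURS[ch]))
            elif ch in KEY_NEIGHBOURS and r < p_phonetic + p_mispress + p_drop_vowel:
                # accidentally missed modified vowel (deletion-style error)
                continue
            else:
                out.append(ch)
        return "".join(out)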
Chapter 3

Literature Review

Our research focuses on Bangla language modeling and spelling correction. This chapter covers
research related to next word prediction, sentence completion, effective word representation,
word and sentence level spelling correction, and text classification in Bangla, English and Chinese.
State-of-the-art natural language processing research is conducted mainly in English and Chinese,
hence the inclusion of these two languages.

3.1 Language Modeling Related Researches


In this research, we explore the efficacy of recent LSTM, CNN and self attention based
architectures for Bangla language modeling. Although a significant number of works on
language modeling for high resource languages like English and Chinese are available, very
little significant research exists on language modeling for low resource languages like Bangla
and Hindi. We mainly summarize prominent language modeling works for high resource
languages in this section.

3.1.1 LSTM Based Language Modeling

Sequence order information based statistical RNN models such as LSTM and GRU are popular
for language modeling tasks [31]. The work in [31] proposed a modified version of the original RNN
model, which led to a significant speed up during training and testing. The paper
also discussed several aspects of backpropagation through time prevalent in RNN based models.
One study showed the effectiveness of LSTM for English and French language modeling [32].
The research analyzed the deficiency of RNN models in terms of training inefficiency and
in terms of previous context learning. They also showed the perplexity performance metric
improvement over RNN while employing long short term memory (LSTM). The regularizing
effect on LSTM was investigated in [33]. They proposed a dropout variant, weight-dropped LSTM,
which utilized DropConnect on hidden layer weight matrices as a form of
regularization in recurrent neural networks. In addition, they introduced NT-ASGD as a
regularization oriented optimizer. This proposed optimizer is a variant of the averaged stochastic
gradient method, where the average calculation is triggered based on a specific non-monotonic
condition.

State-of-the-art LSTM models learn sub-word information at each time stamp. A technique
for learning sub-word level information from data was proposed in [34]. Popular word vector
representations use an entire word as the smallest unit of representation learning. Such an approach
shows limitations for languages with a large, inflected vocabulary and many rare words. Rare
words have little presence in the corpus, which makes their vector representations unreliable.
Moreover, word level language models could not deal with new unseen words before the advent
of this research. These words were called out-of-vocabulary (OOV) words by the authors.
They analyzed various techniques on character level modeling tasks such as smoothed N-gram
models, different types of neural networks including RNN based architectures and a maximum
entropy model. They found character level modeling to be superior to smoothing techniques.
A morphological information oriented character N-gram based word vector representation was
proposed in [35] later on. This research proposed a Skip-gram based learning scheme where word
representation consists of a group of character N-grams. They learned the vector representation
of these character N-grams instead of the whole words in the corpus. The vector representation
of a word was obtained as the sum of the character N-gram vectors found in that word. The proposed
method was run time efficient even on large corpora. The proposed algorithm can also provide decent
vector representations for words unseen in the training samples, which was not possible
before this research. This word representation was further improved and given the name
Probabilistic FastText [36]. This scheme could express multiple meanings of a word, sub-word
structure and uncertainty related information during misspelling. Each word was represented
using a Gaussian mixture density. Each mixture component of a word was responsible for
a particular meaning of that word along with some uncertainty. This was the first scheme to
achieve multi-sense representation of words and showed better performance than vanilla FastText.
A detailed evaluation on sub-word aware model was performed in [15]. Their focus was 50
morphologically rich languages which display high type-to-token ratios. The authors proposed a
new method for injecting subword level information into word semantics representation vector.
They integrated this representation into neural language modeling to facilitate word level next
word prediction task. The experiments showed significant perplexity gains in all 50 diverse and
morphologically rich languages.

Character aware learning based word representation scheme was integrated in LSTM models for
the very first time in [14]. They provided a simple neural language model which required only
character-level inputs, with no word or subword representative vectors whatsoever. Many to many or
many to one type predictions were still made at word level. The proposed scheme employed a
CNN and a highway network (a special gated, selective type of neural network) over the characters of input
words. The output of the highway layer was then provided to the LSTM that was responsible
for the outputs. The model outperformed word level and morpheme level LSTM based models
with 60% fewer parameters on next word prediction task provided in the famous Penn Treebank
dataset. A detailed investigation has shown that character-aware neural LMs outperform syllable-
aware ones [37]. Syllable level break down of words is the natural way of word representation if
we think in terms of linguistics. This research focused on determining whether syllable aware
neural language models have any advantage over character aware ones. They could not find
any evidence of syllable aware models outperforming character aware ones after trying with
different types of models. Word representation was further improved by combining ordinary
word level and character-aware embedding [3]. The research focused on a novel RNN based
language model that took advantage of character information from character N-gram embeddings
and semantic information from ordinary word embeddings. They showed that their proposed
method achieved state-of-the-art perplexity on various benchmark language modeling datasets
such as - Penn Treebank, WikiText-2, and WikiText-103.

3.1.2 CNN Based Language Modeling

The 1D version of CNNs has recently become popular for text classification oriented tasks [11, 38, 39].
A densely connected CNN armed with multi-scale feature attention was proposed in [11] for text
classification. Contemporary CNN models use convolution filters of fixed kernel size; hence
they are limited in learning variable length N-gram features and lack flexibility. The authors
of [11] presented a dense CNN assisted by multi-scale feature attention for text classification.
Their proposed CNN showed performance comparable to contemporary state-of-the-art baselines
on six benchmark datasets. Detailed evaluation and investigation were performed on deep CNN
models in [38]. Although CNN based text classification gained high performance, the power and
efficacy of deep CNNs had not been investigated. Often, a large number of labeled samples related
to real world text classification problems is not available for training deep CNNs. Though
transfer learning is popular for NLP related tasks such as next word prediction and document
classification in LSTM and Transformer based models, the power of transfer learning on 1D CNNs
had not been well studied before the advent of the discussed research. The importance
of depth in CNN for text classification was further studied in [39] both for character level and
for word level input. The authors used 5 standard text classification and sentiment analysis task
datasets for their experiments. They showed that deep models showed better performances than
shallow models when the text input is given as a sequence of characters. On the contrary, when the input
is fed as a sequence of words, a simple shallow but wide network such as DenseNet outperformed
deep architectures.

Although the application of CNNs to computer vision tasks has been popular for a long time,
their application to language modeling has only recently been studied, showing the ability of
CNNs to extract language modeling features at a high level of abstraction [4]. The authors analyzed
the application of CNNs in language modeling, which is a sequence oriented prediction task
requiring extraction of both short and long range information. The authors showed that CNN
based architectures outperformed classic recurrent neural networks (RNN), but their performance
was below that of state-of-the-art LSTMs. They found that CNNs were capable of looking at
up to 16 previous words for predicting the next target word. Furthermore, dilated convolution
was employed in Bangla language modeling with a view to solving the long range dependency
problem [40]. In this work, the authors first analyzed the long range dependency and vanishing
gradient problems of RNNs. Then they proposed a dilated CNN which enforces separation of
time stamps during convolution, the main mode of feature extraction for 1D CNNs. In this way,
they improved the performance of Bangla language modeling.

3.1.3 Self Attention Based Language Modeling

Self attention based Transformers have become the state-of-the-art mechanism for sequence to
sequence modeling in recent years. A simple architecture entirely based on self attention and
devoid of all recurrence and convolution type layers was proposed in [17], which came to be
known as Transformer for the very first time. This architecture was based on multi-head self
attention based encoders and decoders. The authors implemented their proposed Transformer
architecture on machine translation task where they achieved state-of-the-art performance.
A recent study investigated deep auto regressive Transformers for the speech recognition
task [18]. The study focused on two topics. First, they analyzed the configuration of
Transformer architecture. They optimized the architecture for both word-level speech recognition
and for byte-pair encoding (BPE) level end-to-end speech recognition. Second, they showed that
positional encoding was not a necessity in Transformers. A deep Transformer architecture was
proposed in [16] for character level language modeling. The authors identified a few challenges
in character level language modeling such as requirement of learning a large vocabulary of words
from scratch, longer time steps bringing in longer dependency learning requirement and costlier
computation because of longer sequences. This research employed a deep Transformer model
consisting of 64 Transformer modules for character level language modeling. In order to speed
up convergence, the authors imposed auxiliary losses at three specific positions: (i) at intermediate
sequence positions, (ii) from intermediate hidden representations, and (iii) at target positions.
Their approach achieved state-of-the-art performance on two character level language modeling
datasets. A recent study has focused on effective parameter reduction in Transformers [19].
They pointed out that the multi-head self attention mechanism of Transformer encoders and
decoders is the limiting factor when it comes to resource constrained deployment of the model.
The authors based their ideas on tensor decomposition and parameter sharing, and proposed a
novel self-attention model namely Multi-linear attention.

A new language representation model was proposed in [41] based on Transformer encoders.
The authors named it Bidirectional Encoder Representations from Transformers, or BERT in
brief. BERT was designed for pretraining Transformer encoders in an efficient manner where
both prior and post context could be present. The pretrained BERT model can be fine tuned
with the addition of only one output layer specific to the intended downstream task. Another
research focused on the appropriate pretraining of BERT [42]. As pretraining is computationally
expensive and offers significant performance improvement, the authors performed a detailed analysis
of appropriate hyperparameter choices for BERT in this regard. They also investigated
training-specific choices such as batch size and proper masking. Their pretraining scheme
on BERT achieved state-of-the-art results on GLUE, RACE and SQuAD. As an alternative
to masked language modeling based BERT pretraining, a more sample-efficient pre-training
scheme was proposed in [43] known as replaced token detection. Instead of masking input words
randomly, this pretraining method corrupted the input by replacing some words with alternative
words generated from a small generator network. A discriminative model had to predict whether
each word in the sentence was original or corrupted by the generator model. The discriminator was
trained using the predictions made. In conditions where only a small amount of data is available
for pretraining, this generator discriminator based pretraining scheme outperformed BERT by a
wide margin.

3.2 Spell Checker Related Researches


Most research conducted on Bangla spell checker development is rule based, with a few
exceptions of learning based sequence to sequence modeling. Spell checkers for the Chinese language
are far more developed, and the relevant research shows in-depth analysis in this field. In spite of
English being an international language, spell checking has received comparatively little research
attention in this language due to its relative simplicity of spelling and pronunciation. We describe
the details of these studies in this section.

3.2.1 Existing Bangla Spell Checkers

Several studies on Bangla spell checker algorithm development have been conducted in spite of
Bangla being a low resource language. A phonetic encoding was proposed for Bangla in [20]
that had the potential to be used by spell checker algorithms. Word encoding was performed
using the Soundex algorithm. This algorithm was modified according to Bangla phonetics. The
same was done for Bangla compound characters as well. Finally, the authors demonstrated a
prototype Bangla spell checker which utilized the proposed phonetic encoding to provide
correct versions of misspelled Bangla words. Another potential spell checker was proposed
in [21], where the main goal was to search and match named entities in Bangla. The authors
identified two major challenges in this research: the large gap between script and
pronunciation of Bangla letters, and the different origins of Bangla names rooted in religious and
indigenous backgrounds. They presented a Double Metaphone encoding for Bangla names that
was a modification of the method proposed in [20] where major context-sensitive rules and
consonant clusters for Bangla language were taken into account. This research proposed an
algorithm that used name encoding and edit distance to provide promising suggestions of similar
names. Another recent research proposed a clustering based approach for Bangla spelling error
detection and correction. The proposed algorithm reduced both search space and search time.
The spell checker demonstrated was able to handle both typographical errors and phonetic errors.
The authors compared their proposed spell checker with Avro and Puspa, two renowned Bangla
writing tools where the proposed algorithm showed relatively better performance. An N-gram
based method was proposed in [23] for sentence level Bangla word correctness detection. They
experimented with unigrams, bigrams and trigrams. Such a probabilistic approach required a large
corpus for defining the probabilistic inference scheme. The advantage was that the proposed
approach was language independent and did not require any explicit knowledge of Bangla
grammar for detecting semantic or syntactic correctness.

A recent study on low resource languages included Hindi and Telugu spell checker
development, where mistakes were assumed to be made at character level [24]. We include
this research in this section as the two mentioned Indic languages are Sanskrit originated just
like Bangla and have similar sentence and word structures. The authors used attention based
encoder decoder modeling as their approach. The input is a sequence of characters that make up
a sentence containing misspelled words, whereas the output is another sequence of characters
which is supposed to hold the correct versions of all the misspelled words of the input. The
correctly spelled words are kept intact. During the decoding phase, each produced word is
supposed to focus on a single input word or, in the worst case, a couple of words (in case of an
accidental word-splitting spelling error). Hence the attention mechanism comes in handy during
character level spelling correction. The authors also proposed a synthetic error generation
algorithm for the Hindi and Telugu languages as part of this research.

3.2.2 State-of-the-art Neural Network Based Chinese Spell Checkers

State-of-the-art research in this domain involves Chinese spell checkers, as Chinese is an error
prone language due to its confusing word segmentation and its phonetically and visually similar
but semantically different characters. One of the most significant initial works on Chinese spell
checking was conducted in [44]. They focused on the mistakes made by people learning Chinese
as a foreign language and proposed a unified framework for spell checking named HanSpeller++.
Keeping rule and statistics based methods aside, this spell checker is based on extended hidden
Markov model and ranker based models. This approach provides its final correction decision
assisted by a rule based model. SIGHAN 2014 is a popular test set for foreigner’s Chinese essays
where this approach showed decent performance. A confusion set guided Pointer network was
proposed for Chinese spell checking in [25]. The confusion set was used as a guide for correct
character generation. A sequence to sequence model learnt either to copy correct character from
input to output side using the Pointer network or to generate the correct version of an incorrect
character using the confusion set rather than using the entire language vocabulary. The authors
conducted experiments on three human annotated datasets where they achieved state-of-the-art
Chinese spell checking performance. Another contemporary research proposed a Chinese spell
checker named FASPell utilizing a new idea based on a denoising autoencoder (DAE) and a
decoder [45]. This new idea allowed FASPell to be computationally faster, to be adaptable to
all types of Chinese text based errors and to be powerful with simple architecture design. DAE
was able to learn from a very limited amount of supervised learning data through the use of the
unsupervised pretraining scheme seen in models such as BERT and XLNet. The decoder part
of the proposed model eliminated the need of using a rigid character confusion set and instead
introduced a learning based approach.

A year later, another research on Chinese spell checking conducted experiments on these same
three human annotated spelling error datasets and achieved better results [26]. Contemporary
methods were focusing on incorporating Chinese character pair similarity knowledge through
external input or through heuristic based algorithms. On the contrary, this research incorporated
phonological and visual similarity knowledge into their learning based language model using
a specialized graph convolutional network namely SpellGCN. SpellGCN built a graph using
the Chinese character knowledge domain. This graph learned to map a character to its correct
version, acting as an inter-dependent character classifier. Such a classifier was applied to character
representations extracted from a BERT based network, where both networks were trained
end to end. The latest Chinese spell checker was BERT based and did not incorporate any
external knowledge or confusion set related to Chinese characters [27]. The authors proposed a
soft masking scheme based on Gated Recurrent Unit (GRU) integrated with BERT. BERT was
pretrained using masking based scheme at character level. While performing spelling correction
on error dataset, the GRU sub-model first provided a confidence score for each character on
how likely it was to be a noisy character. Based on this likelihood some masking was added to
this character. BERT based prediction was then made on this modified character. Both GRU
and BERT were trained end to end. Detailed evaluation on two benchmark datasets showed the
superiority of this soft masked BERT as a Chinese spell checker.

3.2.3 Researches Indirectly Connected to Spell Checking

Although not directly connected to this research, a few other studies conducted on the English
language can be mentioned. A preprocessing step was proposed in [46] that aimed to change
English words, sentences and paragraphs in such a way that they better resemble the distribution
of standard English while preserving most of the original semantic meaning.
Such change would help in different NLP downstream tasks performed on large data obtained
from the wild. So, the input was a corrupt text, while the output was a projection to the target
language domain. The authors used a sequence to sequence attention based language model [47]
in this research and showed that their transformation was indeed effective. Another research
developed a spelling correction system for medical text [48]. The authors pointed out the
importance of such a specialized spell checker to ensure the correct interpretation of medical
documents as health records are getting digitized day by day. Their algorithm was based on
Shannon's noisy channel model, which utilized an enriched dictionary compiled from a variety of
resources. This research also used named entity recognition to avoid the correction of patient
names. The spell checker was applied to three different types of free texts such as clinical notes,
allergy entries, and medication orders. The proposed algorithm achieved high performance in
all three of these domains. A deep learning based approach for Bangla sentence correction was
proposed in [49]. Here sentence correction was performed as a form of word rearrangement to
better express the semantic meaning. They used encoder decoder based sequence to sequence
modeling for correcting Bangla sentences. Their model was LSTM based. Such models have
gained popularity for tasks such as machine translation and grammar correction. As part of
this research, the authors also constructed a benchmark dataset of misarranged words, which is
expected to facilitate future research in similar domains. Another study on the Chinese language focused
on errors made by people learning Chinese as a second language [50]. The authors claimed that
such errors stem from language transfer and pointed out four major types of errors in the paper
- lexical choice error, redundancy, omission, and word order. Their main focus was word order
type error as it was identified as the most prominent source of error. The research proposed a
relative position language model and a parse template language model for error correction.
Chapter 4

Bangla Language Modeling Approach

This chapter describes the different components of our proposed Coordinated CNN (CoCNN)
architecture aimed at Bangla language modeling tasks such as word prediction, sentence
completion and text classification. The chapter also explains the theory and intuition behind the
inclusion of these novel components.

4.1 Solution Overview


Traditional CNN based approaches [4] represent the entire input sentence/ paragraph using a
matrix of size S_N × S_V, where S_N and S_V denote the number of characters in the sentence/
paragraph and the character representation vector size, respectively. In such a character based
approach, the model does not have the ability to consider each word in the sentence as a separate
entity. A character by itself does not convey meaning. When we work with a sequence of characters,
finding the character clusters that form the constituent words of the sentence is itself a difficult task.
But the words are already given, and it is important to understand the contextual meaning of
each word and to find relationships among those words in order to understand sentence semantics.
Relationships among words include meaningful patterns, phrases, grammatical constraints, etc.
We propose the Coordinated CNN (CoCNN) architecture, which aims to achieve both of these feats.
Figure 4.1 illustrates CoCNN, which has two major components. The SemanticNet component works
at word level, while the Terminal Coordinator component works at sentence/ paragraph level. Both
components are 1D CNN based sub-models at their core and are trained end-to-end. While LSTM
based models put too much emphasis on word sequence order, CoCNN puts emphasis on local
patterns formed of word syllables and of words. While Transformers employ an explicit self
attention mechanism among the words of a sentence, CoCNN employs position independent,
specialized and localized attention at word level and at sentence level. The SemanticNet sub-model
of CoCNN is effective in handling the high inflection and morphological richness inherent in the
Bangla language. The Terminal Coordinator, on the other
hand deals with the flexible word order in Bangla.

Figure 4.1: 1D CNN based CoCNN architecture. The word level sequential 1D CNN sub-model (SemanticNet) converts the concatenated character embeddings of each word Word_i (a matrix W_i of width len_C) into a flattened SemanticVec F_i. The SemanticVecs F_1, F_2, ..., F_n are concatenated into matrix M (each row of length len_F) and passed to the sentence level sequential 1D CNN sub-model (Terminal Coordinator), whose flattened Coordination vector feeds a dense and softmax output layer.

4.2 SemanticNet Sub-Model


SemanticNet is used to transform each input word into a vector representation called SemanticVec.
We represent each input word Word_i by a matrix W_i. W_i consists of m vectors, each of size len_C.
These m vectors are the one-hot vectors of the characters C_1, C_2, ..., C_m of Word_i, respectively.
Each word in the figure is assumed to have equal length, because we pad every word up to the
maximum word length. The representation details are depicted in the bottom right corner of
Figure 4.1. We could have used a Char2vec type semantically meaningful representation instead
of simple one-hot vectors, but in practice such a representation is not reliable, as characters relate
more to the spelling patterns of a language than to semantic meaning of any sort. Applying 1D
convolution layers on matrix W_i helps in deriving key local patterns and sub-word information of Word_i.
A 1D convolution operation starting from vector C_j of W_i using a convolution filter of kernel size K
can be expressed as the function 1D_Conv([C_j, C_{j+1}, ..., C_{j+K-1}], filter_K). The output of this
function is a scalar value. With a stride of s, the next convolution operation starts from vector
C_{j+s} of W_i and provides another scalar value. Such convolution operations are performed until
we reach the end of the word. Thus we obtain a vector as the result of applying one convolution
filter on matrix W_i. Note that we use 'same' convolution in all our convolution layers, which
employs extra padding so that the output has the same shape as the input matrix passed through
the layer. A convolution layer has multiple such filters. After passing matrix W_i through the first
convolution layer, we obtain feature matrix W_i^1. Passing W_i^1 through the second convolution
layer provides feature matrix W_i^2, and in general the L-th convolution layer provides feature
matrix W_i^L. The SemanticNet sub-model consists of such 1D convolution layers arranged
sequentially one after the other. Such a design helps in learning both short and long range
intra-word features. Convolution layers near matrix W_i are responsible for identifying key
sub-word patterns of Word_i, while convolution layers further away focus on different combinations
of these key sub-word patterns. Such word level local pattern recognition plays a key role in
identifying the semantic meaning of a word irrespective of inflection or the presence of spelling
errors. Each subsequent convolution layer has non-decreasing kernel size and an increasing number
of filters. As we move to deeper convolution layers, the extracted features become more high level
and abstract; combining them and finding important patterns among them becomes more difficult
with increasing depth. This is the reason behind the increasing number of filters with depth. Each
intermediate layer output is batch normalized, keeping the values passed from one layer to another
within bound. The final convolution layer output matrix W_i^L is flattened into a vector F_i of size
len_F. F_i is the SemanticVec representation of Word_i. We could instead apply a one dimensional
global max pooling operation to matrix W_i^L, but global max pooling only takes the maximum
value from each column of W_i^L; such feature reduction may ultimately result in loss of useful
information. We obtain the SemanticVec representation of each input word in the same fashion
by applying the same SemanticNet sub-model. That is, although there are many words in a
sentence, only one SemanticNet sub-model is applied to all the input words. Note that the
SemanticVec representation does not require storing any extra representative vector per word,
unlike Word2vec and FastText: simply storing the trained SemanticNet sub-model is enough,
irrespective of the total number of words in the vocabulary.
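A minimal Keras sketch of the SemanticNet sub-model is given below. The maximum word length and character vocabulary size are illustrative assumptions; a trainable character embedding is used in place of raw one-hot vectors, matching the specification later given in Section 5.2.2, and the final flattening step follows the description above (a global max pooling variant can be swapped in at the same position).

    import tensorflow as tf
    from tensorflow.keras import layers

    # Illustrative constants (assumptions): maximum word length and character vocabulary size
    MAX_WORD_LEN = 20
    CHAR_VOCAB = 80

    def build_semantic_net(char_emb_dim=40):
        """A sketch of the SemanticNet sub-model: character sequence of one word -> SemanticVec."""
        char_ids = layers.Input(shape=(MAX_WORD_LEN,), dtype="int32")
        x = layers.Embedding(CHAR_VOCAB, char_emb_dim)(char_ids)
        # (filters, kernel_size) pairs as specified in Section 5.2.2
        for filters, kernel in [(64, 2), (64, 3), (128, 3), (128, 3), (256, 4)]:
            x = layers.Conv1D(filters, kernel, padding="same")(x)
            x = layers.BatchNormalization()(x)
            x = layers.ReLU()(x)
        semantic_vec = layers.Flatten()(x)  # flattening W_i^L into the SemanticVec F_i
        return tf.keras.Model(char_ids, semantic_vec, name="SemanticNet")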

4.3 Terminal Coordinator


Terminal Coordinator takes the SemanticVecs obtained from SemanticNet as input and returns
a single Coordination vector as output which is used for final prediction. Our goal here
is to find important patterns formed by word clusters. So, this sub-model acts at the inter-word
level, unlike the SemanticNet sub-model, which works at the intra-word level. For n words
Word_1, Word_2, ..., Word_n, we obtain n SemanticVecs F_1, F_2, ..., F_n, respectively, using the
SemanticNet sub-model. Each SemanticVec is of size len_F. Concatenating these SemanticVecs
provides us with matrix M (details shown in the middle right portion of Figure 4.1). Applying
1D convolution on matrix M facilitates the derivation of key local patterns found in the input
sentence/ paragraph, which is crucial for output prediction. A 1D convolution operation starting
from F_i using a convolution filter of kernel size K can be expressed as the function
1D_Conv([F_i, F_{i+1}, ..., F_{i+K-1}], filter_K). The output of this function is a scalar value.
Such convolution operations are performed until we reach F_n, the last SemanticVec of matrix M.
Thus we get a vector as the result of applying one convolution filter on matrix M. Note that we
perform 'same' convolution in all convolution layers here as well. Each of these convolution layers
has multiple such convolution filters. After passing matrix M through the first convolution layer
we obtain feature matrix M^1. Passing M^1 through the second convolution layer provides feature
matrix M^2, and the L-th convolution layer provides feature matrix M^L. So, the Terminal
Coordinator is a sequential 1D CNN sub-model with a design similar to the SemanticNet sub-model
but with a different set of weights. The difference is that this sub-model is applied to only one
matrix M, unlike SemanticNet, which has to be applied to each individual word matrix. Convolution
layers near M are responsible for identifying key word clusters, while convolution layers further
away focus on different combinations of these key word clusters that are important for sentence or
paragraph level local pattern recognition. The final output feature matrix obtained from the 1D CNN
sub-model of the Terminal Coordinator is flattened to obtain the Coordination vector, a summary of
the important information in the input word sequence needed to predict the correct output. This
approach of information summarization has proved to be very effective in our experiments. However,
CoCNN has one limitation: the fixed maximum input length problem. Both self attention oriented
Transformer based architectures and our proposed CoCNN require a maximum input length to be
specified for many to one tasks such as next word prediction and text classification. In fact, each
input sequence, no matter how small, has to be padded to the maximum sequence length, as CoCNN
requires fixed length input. Let us understand the reason behind this fixed length requirement. For
many to one tasks, the size of the Coordination vector obtained from the Terminal Coordinator
sub-model depends on the number of SemanticVecs, which in turn depends on the number of input
words. At least one output dense layer has to be applied on the Coordination vector to get the final
output. The problem is that the number of rows of the dense layer weight matrix has to be equal to
the Coordination vector length. As the dense layer weight matrix size has to be fixed, the
Coordination vector length (the row number of the dense layer weight matrix) also has to be fixed,
which in turn requires the number of SemanticVecs to be fixed, which in turn requires the number
of input words per sample to be fixed.
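The following sketch wires the shared SemanticNet into the Terminal Coordinator for a many to one next word prediction task. The number of words per input and the top-word vocabulary size are illustrative assumptions, and build_semantic_net refers to the sketch in Section 4.2; the flattened Coordination vector feeding a fixed-size dense layer also makes the fixed input length requirement discussed above explicit.

    import tensorflow as tf
    from tensorflow.keras import layers

    # Illustrative constants (assumptions): words per input sample and top-word vocabulary size
    MAX_WORDS = 12
    TOP_WORDS = 13000

    def build_cocnn(semantic_net):
        """A sketch wiring one shared SemanticNet and the Terminal Coordinator into CoCNN."""
        words = layers.Input(shape=(MAX_WORDS, semantic_net.input_shape[1]), dtype="int32")
        # Apply the single shared SemanticNet to every word; rows of M are the SemanticVecs F_1..F_n
        m = layers.TimeDistributed(semantic_net)(words)
        x = m
        # Terminal Coordinator: (filters, kernel_size) pairs as specified in Section 5.2.2
        for filters, kernel in [(32, 2), (64, 3), (64, 3), (96, 3), (128, 4), (196, 4)]:
            x = layers.Conv1D(filters, kernel, padding="same")(x)
            x = layers.BatchNormalization()(x)
            x = layers.ReLU()(x)
        coordination_vec = layers.Flatten()(x)   # fixed length, hence the fixed input length requirement
        out = layers.Dense(TOP_WORDS, activation="softmax")(coordination_vec)
        return tf.keras.Model(words, out, name="CoCNN")

    # cocnn = build_cocnn(build_semantic_net())  # using the SemanticNet sketch from Section 4.2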

4.4 Attention to Patterns of Significance

Figure 4.2: 1D convolution as a form of attention. A convolution filter of kernel size 3 applied to a sequence of 7 entities S_1, ..., S_7 produces outputs O_1, ..., O_5; in the left panel the pattern S_5, S_6, S_7 appears at the end of the sequence and in the right panel at the beginning.

There is no explicit attention mechanism in CoCNN, unlike self attention based Transformers
[17, 18] or weighted attention based LSTMs [47, 51]. In self attention, each input word
representative vector is modified using a weighted sum of all input word representative vectors in
accordance with its relationship with the other words. In weighted attention based LSTMs, on the
other hand, the outputs obtained at each time step are weighted and added, and the result is used
during output prediction as the appropriate context information. In brief, the attention mechanism
is important for capturing the importance of each input word with respect to output prediction.
Figure 4.2 demonstrates an oversimplified notion of how 1D convolution implicitly imposes
attention on a sequence containing 7 entities S_1, S_2, ..., S_7. Each of these entities is a character
for the SemanticNet sub-model, while each entity is a word for the Terminal Coordinator sub-model.
After employing a 1D convolution filter of kernel size 3 on this sequence representation, we obtain
a vector containing values O_1, O_2, ..., O_5, where we assume O_5 and O_1 to be the maximum
for the left and the right figure, respectively. O_5 (left figure) is obtained by convolving input
entities S_5, S_6 and S_7 situated at the end of the input sequence. We can say that these three
input entities have been paid more attention by our convolution filter than the other four entities.
In the right figure, S_5, S_6 and S_7 are situated at the beginning of the input sequence. Our
convolution filter gives similar importance to this pattern irrespective of its position change. Such
positional independence of important patterns helps in Bangla and Hindi language modeling, where
input words are loosely bound and the words themselves are highly inflected. In CoCNN, such
attention is imposed on the characters of each word during the SemanticVec representation
generation of that word using SemanticNet, while a similar type of attention is imposed on the
words of our input sentence/ paragraph while obtaining the Coordination vector using the Terminal
Coordinator. In brief, our proposed CoCNN imposes a position independent attention mechanism
at both the intra-word and the inter-word level.
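A tiny numerical illustration of this position independence (all values are arbitrary assumptions): the same kernel produces the same maximum response whether the important pattern sits at the end or at the beginning of the sequence.

    import numpy as np

    def conv1d_valid(seq, kernel):
        """Plain 'valid' 1D convolution (cross-correlation) of a 1D sequence with a kernel."""
        k = len(kernel)
        return np.array([np.dot(seq[i:i + k], kernel) for i in range(len(seq) - k + 1)])

    kernel = np.array([1.0, 2.0, 1.0])            # a filter tuned to some local pattern
    pattern = [5.0, 6.0, 7.0]                     # the "important" entities S_5, S_6, S_7
    left = np.array([0, 0, 0, 0] + pattern)       # pattern at the end of the sequence
    right = np.array(pattern + [0, 0, 0, 0])      # pattern at the beginning of the sequence

    print(conv1d_valid(left, kernel).max())       # 24.0
    print(conv1d_valid(right, kernel).max())      # 24.0 -> same maximum response, independent of position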

4.5 Positive Effect of Batch Normalization


We apply batch normalization between each pair of convolution layers of both SemanticNet
and the Terminal Coordinator. Our weight optimization is based on the mini batch stochastic gradient
descent algorithm. Batch normalizing the intermediate output feature matrices has shown a positive
effect on our model's performance. Let us see how we apply batch normalization. Take a
single convolution output value as an example. Suppose, after passing sample no. i of a mini batch,
we receive out'_i as the preliminary version of this output value. We modify out'_i to get the final
version out_i as follows:

y_i = \frac{out'_i - mean_B}{\sqrt{var_B + \epsilon}}    (4.1)

out_i = w \times y_i + b    (4.2)

Here, var_B and mean_B denote the variance and mean, respectively, of this output value over the
current batch of samples. var_B and mean_B may vary drastically among different batches, which
can lead to unstable weight updates of our model. As a result, we introduce w and b, which are
learnable parameters kept for reducing internal covariate shift in terms of this output value over
all batches.
If we think about next word prediction, our model training is an unsupervised task where we can
use bulk raw corpus data. As a result, we have a wide variety of words and various types of
semantically meaningful sentences at our disposal as training samples. Batch normalized output
values provide us with three benefits over raw output values: (1) faster training, (2) less bias
towards the initial weights and (3) an automatic regularization effect [52, 53].
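A small NumPy sketch of Equations (4.1) and (4.2) applied to one convolution output value observed over a mini batch; the batch values, w and b below are arbitrary illustration values.

    import numpy as np

    def batch_norm(out_prime, w=1.0, b=0.0, eps=1e-5):
        """Mini-batch normalization of one convolution output value, following Eq. (4.1)-(4.2)."""
        mean_b = out_prime.mean()
        var_b = out_prime.var()
        y = (out_prime - mean_b) / np.sqrt(var_b + eps)   # Eq. (4.1)
        return w * y + b                                  # Eq. (4.2)

    # One convolution output value observed over a mini batch of 8 samples (arbitrary numbers)
    batch_values = np.array([0.4, 1.2, -0.3, 0.9, 0.0, 2.1, -1.0, 0.5])
    print(batch_norm(batch_values))   # zero mean, unit variance when w = 1 and b = 0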

4.6 Beam Search Based Sentence Completion


Sentence completion is a task achieved through consecutive prediction of words until an end
marker punctuation appears. We use beam search for sentence completion with our proposed
CoCNN model. Figure 4.3 shows an example of beam search where the beam width is 2, a
hyperparameter. The top word list consists of only four words: A, B, C and D. Suppose we have
Word_1, Word_2, ..., Word_i at the start of the sentence to be completed. Using these starting
words as input (denoted as <Start>), we get the probability of A, B, C and D being the next word.
The subscript denotes the time stamp number. In the first time stamp, A_1 and C_1 are the two
highest probability words. Providing <Start>, A_1 as input to CoCNN gives us probability values
for A_2, B_2, C_2 and D_2. Similarly, providing <Start>, C_1 as input to CoCNN gives us
probability values for the same words. The two highest probability

Figure 4.3: Beam search simulation with 3 time stamps and a beam width of 2. Starting from <Start>, the two most probable first words A_1 and C_1 are expanded; of their continuations, C_1 A_2 and C_1 D_2 survive, and expanding these gives third-word candidates of which C_1 A_2 B_3 and C_1 D_2 C_3 receive the two highest probability products.

product product values among these eight belong to C_1 A_2 and C_1 D_2. In a similar fashion, we feed
<Start>, C_1, A_2 and <Start>, C_1, D_2 as input to CoCNN and get eight probability product
values as output. The sequences C_1 A_2 B_3 and C_1 D_2 C_3 receive the two highest probability
products. If we decide to stop here, we need to pick the highest probability sequence. Suppose it is
C_1 A_2 B_3; then the full sentence is <Start> C_1 A_2 B_3. This process can be generalized to any
beam width, set of sentence end marker words and top word list.

Choosing a larger beam width allows us to explore a larger search space for the completion task.
When the search space is larger, there is a higher probability of obtaining a more accurate and
sensible sentence completion. But increasing the beam width also increases sentence completion
run time exponentially. Considering these two opposing factors, we choose 5 as our beam width.
There is one more aspect to consider. Let us think of two sentence completions for the same
input partial sentence. The first completion predicts 5 words, whereas the second one completes
the sentence by predicting only 2 words. Beam search tends to prefer shorter completions, as the
choice of completion is made using the probability product of the predicted consecutive words.
Probability values lie in the range [0, 1], so multiplying probabilities that are less than 1 makes the
product smaller and smaller as more terms are added. The completion predicting 5 words has a
product of 5 probability values, whereas the completion predicting 2 words has only 2 such values.
A simple approach can take care of this bias towards shorter completions, because in real life
writing tools people prefer to get a phrase of several words rather than a single word for sentence
completion. One can impose a restriction on the minimum number of words to be predicted in
order for a candidate to be considered as an output of the beam search. Penalty terms can also be
imposed on shorter completions or, conversely, rewards on longer completions.
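A sketch of the beam search completion procedure is given below. It assumes a hypothetical predict_next wrapper that returns the CoCNN next-word probabilities for a given word sequence, uses sums of log probabilities instead of raw products for numerical stability, and includes the minimum-length restriction discussed above.

    import heapq
    import math

    def beam_search_complete(predict_next, start_words, beam_width=5, max_new_words=10,
                             end_token="<end>", min_new_words=2):
        """Sketch of beam search sentence completion on top of a next-word model.

        predict_next(words) is assumed to return a dict {word: probability}
        (a hypothetical wrapper around the trained CoCNN model).
        """
        beams = [(0.0, list(start_words), False)]            # (log probability, words, finished)
        for _ in range(max_new_words):
            candidates = []
            for logp, words, done in beams:
                if done:
                    candidates.append((logp, words, True))   # keep finished sentences as they are
                    continue
                for word, p in predict_next(words).items():
                    finished = (word == end_token) and \
                               (len(words) - len(start_words) + 1 >= min_new_words)
                    candidates.append((logp + math.log(p + 1e-12), words + [word], finished))
            # keep only the beam_width most probable partial sentences
            beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
            if all(done for _, _, done in beams):
                break
        return max(beams, key=lambda c: c[0])[1]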
Chapter 5

Language Modeling Performance Analysis

We start this chapter by specifying the details of the datasets used for the language modeling
experiments. Then we move on to the detailed specifications of our proposed CoCNN model so
that readers can reproduce the reported results by implementing the model appropriately. We also
specify the hardware on which the experiments have been conducted so that the reported runtimes
make sense. We then compare the CoCNN architecture with state-of-the-art architectures from the
CNN, LSTM and Transformer paradigms on the classic language modeling task for Bangla and
Hindi - next word prediction. We identify the BERT_Pre model of the Transformer paradigm as the
closest competitor of CoCNN and compare these two architectures further on multiple datasets.
We also compare these architectures on downstream tasks such as sentence completion and text
classification. Finally, we show the potential of the SemanticNet sub-model of the CoCNN
architecture as a Bangla spell checker.

5.1 Dataset Specifications

Dataset         Unique Word No.   Unique Character No.   Top Word No.   Training Sample No.   Validation Sample No.
Prothom-Alo     260 K             75                     13 K           5.9 M                 740 K
BDNews24        170 K             72                     14 K           2.9 M                 330 K
Nayadiganta     44 K              73                     -              -                     280 K
Hindinews       37 K              74                     5.5 K          87 K                  10 K
Livehindustan   60 K              73                     4.5 K          210 K                 20 K
Patrika         28 K              73                     -              -                     307 K

Table 5.1: Dataset specific details for language modeling experiments

We have collected Bangla articles from online public news portals such as Prothom-Alo [54],
BDNews24 [55] and Nayadiganta [56] through web scraping. The articles encompass domains
such as politics, entertainment, lifestyle, sports, technology and literature. The datasets have

been cleaned by removing punctuation marks, English letters, numeric characters and incomplete
sentences. The Hindi dataset consists of Hindinews [57], Livehindustan [58] and Patrika [59]
newspaper articles, available open source on Kaggle and encompassing similar domains. The
Nayadiganta (Bangla) and Patrika (Hindi) datasets have been used only as test sets. Apart from next
word prediction, we have produced a sentence completion test set for both Bangla and Hindi from
their corresponding test sets, each of which contains 10 K samples.

We summarize different attributes of our datasets in Table 5.1. Top words have been selected
such that they cover at least 90% of the tokens of the corresponding dataset. Due to the popular
usage of the Qwerty keyboard for Bangla typing, phonetic spelling errors are common in this
language. A recent paper has introduced a probability based algorithm for realistic spelling error
generation in Bangla, which covers aspects such as phonetic similarity, compound characters and
keyboard mispresses [30]. We use this algorithm to generate similar errors in the input words of
our Bangla samples.

5.2 Model Optimization

5.2.1 Performance Metric

We use perplexity (PPL) to assess the performance of our models on the next word prediction and
sentence completion tasks. Suppose we have sample inputs I_1, I_2, ..., I_n and our model provides
probability values P_1, P_2, ..., P_n, respectively, for their ground truth output tokens. Then the
PPL score of our model for these samples can be computed as:

PPL = \exp\left( -\frac{1}{n} \sum_{i=1}^{n} \ln(P_i) \right)

When a model assigns large probabilities to the ground truth next word outputs of different
samples, the PPL score decreases. That is, when our model shows confidence in the ground truth
word predictions, it gets a small PPL score, and a smaller PPL score means more realistic next
word prediction. So, we want to make our validation PPL score as small as possible through
various optimizations.
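A direct implementation of the PPL definition above is shown below; the probability values in the example are arbitrary.

    import math

    def perplexity(ground_truth_probs):
        """PPL = exp(-(1/n) * sum(ln(P_i))) over the ground truth token probabilities."""
        n = len(ground_truth_probs)
        return math.exp(-sum(math.log(p) for p in ground_truth_probs) / n)

    print(perplexity([0.2, 0.05, 0.5, 0.1]))  # lower values indicate more confident, realistic predictions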
We use accuracy and F1 score as our performance metrics for text classification related tasks. To
calculate the F1 score, we need to determine precision and recall. All of these performance metrics
are based on counting true positive (TP), true negative (TN), false positive (FP) and false negative
(FN) predictions.

Accuracy = \frac{TP + TN}{TP + TN + FP + FN}

Precision = \frac{TP}{TP + FP}

Recall = \frac{TP}{TP + FN}

F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}

5.2.2 Hyperparameter Specifications


Optimizer Details

For model optimization, we use the Stochastic Gradient Descent (SGD) optimizer [60] with a
learning rate of 0.001, while constraining the norm of the gradients to below 5 to eliminate the
exploding gradient problem. We use categorical cross-entropy loss for model weight updates and
dropout [61] with probability 0.3 between the dense layers of our models for regularization.
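A sketch of this optimization setup in Keras is given below; the placeholder model stands in for CoCNN (see the Chapter 4 sketch) so that the snippet is self-contained.

    import tensorflow as tf

    # Placeholder model standing in for CoCNN, used only so this snippet runs on its own
    model = tf.keras.Sequential([tf.keras.layers.Dense(10, activation="softmax", input_shape=(4,))])

    # SGD with learning rate 0.001 and the gradient norm constrained to below 5
    optimizer = tf.keras.optimizers.SGD(learning_rate=0.001, clipnorm=5.0)
    model.compile(optimizer=optimizer,
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])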

SemanticNet Details

The SemanticNet sub-model consists of a character level embedding layer producing a size-40
vector for each character, then five consecutive blocks each consisting of a 1D convolution (with
batch normalization and ReLU activation between each pair of convolution layers) and, finally, a
1D global max pooling in order to obtain the SemanticVec representation of each input word.
The five 1D convolution layers use (64, 2), (64, 3), (128, 3), (128, 3) and (256, 4) convolutions,
respectively, where the first and second element of each tuple denote the number of convolution
filters and the kernel size, respectively. As we can see, the kernel size and number of filters of the
convolution layers are non-decreasing as the architecture depth increases. This is because deep
convolution layers need to learn combinations of various low level features, which is a more
difficult task compared to the task of the shallow layers, namely extraction of the low level
features themselves. We provide a weight of 0.3 (the λ value of the loss function) to the auxiliary
loss in order to limit the impact of this loss on the gradient update compared to the final loss.

Terminal Coordinator Details

The Terminal Coordinator sub-model used in the CoCNN architecture has six convolution layers
with kernel sizes of 2, 3, 3, 3, 4 and 4 and filter counts of 32, 64, 64, 96, 128 and 196, respectively.
Its design is similar to that of the SemanticNet sub-model. We use the softmax activation function
at the final output layer of this sub-model. We use a batch size of 64 for our models, while applying
batch normalization on the intermediate CNN outputs of both SemanticNet and the Terminal
Coordinator.

5.3 Hardware Specifications


We use Python 3.7 and the TensorFlow 2.3.0 package for our implementation. We use three GPU
servers for training our models. Their specifications are as follows:

• 12 GB Nvidia Titan Xp GPU, Intel(R) Core(TM) i7-7700 CPU (3.60GHz) processor model, 32 GB RAM with 8 cores

• 24 GB Nvidia Tesla K80 GPU, Intel(R) Xeon(R) CPU (2.30GHz) processor model, 12 GB RAM with 2 cores

• 8 GB Nvidia RTX 2070 GPU, Intel(R) Core(TM) CPU (2.20GHz) processor model, 32 GB RAM with 7 cores

5.4 Comparing CoCNN with State-of-the-art Architectures


In this section, we compare CoCNN with state-of-the-art architectures from three different
paradigms: CNN, LSTM and Transformer. We provide an intuitive explanation of the superior
performance of the proposed CoCNN.

5.4.1 Comparing CoCNN with Other CNNs

Figure 5.1: Loss-vs-epoch curves comparing CoCNN with state-of-the-art architectures of the CNN paradigm on the Prothom-Alo validation set. The score shown in brackets beside each model name denotes that model's PPL score on the Prothom-Alo validation set after 15 epochs of training: CNN_Van (985), CNN_Dl (952), CNN_Bn (544), CoCNN (203). Note that this dataset contains synthetically generated spelling errors.

We compare CoCNN with three other CNN-based baselines (see Figure 5.1). CNN_Van is a simple
sequential 1D CNN model of moderate depth [4]. It considers the full input sentence/ paragraph
as a matrix, where the matrix consists of character representation vectors. CNN_Dl uses dilated
convolution in its CNN layers, which allows the model to have a larger field of view [40]. Such a
change in convolution strategy shows a slight performance improvement (PPL score reduced to 952
from 985). CNN_Bn has the same setting as CNN_Van but uses batch normalization (details
in Section 4.5) on the intermediate convolution layer outputs. This measure shows a significant
performance improvement in terms of loss and PPL score (PPL reduced to 544). The proposed
CoCNN surpasses the performance of CNN_Bn by a wide margin (PPL reduced to only 203). We
believe that the ability of CoCNN to consider each word of a sentence as a separate meaningful
entity is the reason behind this drastic improvement. The other CNN models consider the input
sentence as a sequence of characters, which is not how we humans infer something from a sentence;
rather, we understand the meaning of a sentence through the coordination of its words. From
Table 5.2, we can see that the training time and parameter count of the CoCNN architecture are
comparable to those of the other CNN models, although the performance of CoCNN is better by a
large margin.

CNN Architecture   Parameter No. (million)   Training Time (minute/ epoch)
CNN_Bn             4                         30
CNN_Van            4                         33
CNN_Dl             2                         25
CoCNN              4.5                       40

Table 5.2: Parameter number and training time per epoch on the Prothom-Alo training set for the analyzed CNN architectures

5.4.2 Comparing CoCNN with LSTMs

We compare CoCNN with four LSTM-based models (see Figure 5.2). Two LSTM layers are stacked
on top of each other in all four of these models. We do not compare with LSTM models that use
the Word2vec [62] representation, as this representation requires a fixed size vocabulary; in a
spelling error prone setting, the vocabulary size is theoretically infinite. We start with LSTM_FT,
an architecture using sub-word based FastText representation [2, 12]. Character aware learnable
layers per LSTM time stamp form the new generation of state-of-the-art LSTMs [13–15, 37].
LSTM_CA acts as their representative by introducing the concatenation of variable size parallel
convolution filter outputs as the word representation. It almost halves the PPL score of LSTM_FT
(1198 compared to 2005). Instead of a unidirectional many to one LSTM, we introduce a
bidirectional LSTM in LSTM_CA to form BiLSTM_CA, which shows a slight performance
improvement (a reduction in PPL score to 1113 from 1198). We

Figure 5.2: Loss-vs-epoch curves comparing CoCNN with state-of-the-art LSTM architectures on the Prothom-Alo validation set. PPL scores after 15 epochs: LSTM_FT (2005), LSTM_CA (1198), BiLSTM_CA (1113), BiLSTM_CA_Attn (771), CoCNN (203).

introduce Bahdanau attention [51] on BiLSTM_CA to form the BiLSTM_CA_Attn architecture. This
measure shows a further performance boost (PPL reduced to 771). The proposed CoCNN shows an
almost four times improvement in PPL score compared to BiLSTM_CA_Attn (PPL score of 203). If
we compare Figure 5.2 and Figure 5.1, we can see that CNNs generally perform better than LSTMs
for Bangla language modeling. For example, the worst performing CNN model, CNN_Van, performs
better than LSTM_FT, LSTM_CA and BiLSTM_CA. LSTMs have a tendency to learn sequence
order information, which imposes positional dependency. Such a characteristic is unsuitable for
languages such as Bangla and Hindi with flexible word order.

5.4.3 Comparing CoCNN with Transformers

We compare CoCNN with four Transformer-based models (see Figure 5.3). We use the popular
FastText word representation with all compared Transformers. Our comparison starts with
Vanilla_Tr, a single Transformer encoder (similar to the Transformer designed in [17]) that
utilizes global max pooling and a dense layer at the end for next word prediction. In BERT, we
stack 12 Transformer encoders on top of each other, where each encoder has more parameters
than the encoder of Vanilla_Tr [18, 41]. BERT, with its large depth and enhanced encoders,
almost doubles the performance shown by Vanilla_Tr (PPL score reduced to 469 from 798).
We do not pretrain this BERT architecture. We follow the Transformer architecture designed
in [16] and introduce auxiliary losses after the Transformer encoders situated near the bottom of
the Transformer stack of BERT to form BERT_Aux. The introduction of such auxiliary losses shows
a moderate performance improvement (PPL decrease to 402 from 469). BERT_Pre is the
pretrained version of BERT. We follow the word masking based pretraining scheme provided

Figure 5.3: Loss-vs-epoch curves comparing CoCNN with state-of-the-art Transformer architectures on the Prothom-Alo validation set. PPL scores after 15 epochs: Vanilla_Tr (798), BERT (469), BERT_Aux (402), BERT_Pre (210), CoCNN (203).

in [42]. The Bangla pretraining corpus consists of Prothom Alo [54] news articles dated from
2014-2017 and BDNews24 [55] news articles dated from 2015-2017. The performance of
BERT jumps up more than double when such pretraining is applied (PPL halved to 210 from
469). CoCNN without utilizing any pretraining achieves marginally better performance than
BERT Pre (PPL score of 203 compared to 210). Unlike Transformer encoders, convolution
imposes attention with a view to extracting important patterns from the input to provide the
correct output (see Section 4.4). Furthermore, SemanticNet of CoCNN is suitable for deriving
semantic meaning of each input word in highly inflected and error prone settings.

5.4.4 Comparing CoCNN with BERT_Pre

Dataset                     Error?   BERT_Pre   CoCNN
Prothom-Alo                 Yes      152        147
Prothom-Alo                 No       117        114
BDNews24                    Yes      201        193
BDNews24                    No       147        141
Hindinews + Livehindustan   No       65         52
Nayadiganta                 Yes      169        162
Nayadiganta                 No       136        133
Patrika                     No       67         57

Table 5.3: PPL comparison between BERT_Pre and CoCNN

BERT_Pre is the only model showing performance close to CoCNN in terms of validation loss
and PPL score (see Figure 5.3). In order to compare these two models further, we train both

Figure 5.4: Loss-vs-epoch curves comparing CoCNN with BERT_Pre on (a) the Bangla (Prothom-Alo) validation set and (b) the Hindi (Hindinews and Livehindustan merged) validation set. The score shown beside each model name denotes that model's PPL score after 30 epochs of training on the corresponding training set: (a) BERT_Pre (152), CoCNN (147); (b) BERT_Pre (65), CoCNN (52).

models for 30 epochs on several Bangla and Hindi datasets and obtain their PPL scores on the
corresponding validation sets. The Bangla datasets include Prothom-Alo and BDNews24, while the Hindi
datasets include Hindinews and Livehindustan. We use the Nayadiganta and Patrika datasets as the Bangla
and Hindi independent test sets, respectively. The Hindi pretraining corpus consists of the Hindi
Oscar Corpus [63], preprocessed Wikipedia articles [64], the HindiEnCorp05 dataset [65] and WMT
Hindi News Crawl data [66]. From the plots of Figure 5.4 and the PPL score comparison in Table
5.3, it is evident that CoCNN marginally outperforms its closest competitor BERT Pre in all cases. There
are 8 pairs of PPL scores in Table 5.3. We use these scores to perform a one tailed paired t-test in
order to determine whether the reduction in PPL score achieved by CoCNN compared to BERT Pre
is statistically significant, with the P-value threshold set to 0.05. The statistical test shows that the
improvement is significant and the null hypothesis can be rejected.
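
As an illustration, this test is straightforward to reproduce; the following is a minimal sketch (assuming SciPy is available) over the eight PPL pairs of Table 5.3, where the one-tailed P-value is obtained by halving SciPy's two-sided value.

from scipy import stats

# The eight PPL pairs of Table 5.3 (Prothom-Alo Yes/No, BDNews24 Yes/No,
# Hindinews + Livehindustan, Nayadiganta Yes/No, Patrika).
bert_pre = [152, 117, 201, 147, 65, 169, 136, 67]
cocnn    = [147, 114, 193, 141, 52, 162, 133, 57]

t_stat, p_two_sided = stats.ttest_rel(bert_pre, cocnn)
# One-tailed alternative: CoCNN PPL is lower, i.e. mean(bert_pre - cocnn) > 0.
p_one_tailed = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
print(f"t = {t_stat:.3f}, one-tailed p = {p_one_tailed:.4f}")
print("reject null hypothesis" if p_one_tailed < 0.05 else "fail to reject null hypothesis")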

Model      Parameters   Training time / epoch
BERT Pre   74 M         402 min
CoCNN      4.5 M        40 min
Table 5.4: Comparing BERT Pre and CoCNN in terms of efficiency

More importantly, the memory requirement of CoCNN is more than 16X lower and its training
speed is over 10X faster than that of BERT Pre (see Table 5.4). Thus, in terms of efficiency,
CoCNN beats BERT Pre significantly. The hardware specification for training both models is as
follows: a 12 GB Nvidia Titan Xp GPU, an Intel(R) Core(TM) i7-7700 CPU (3.60 GHz, 8 cores)
and 32 GB RAM.

5.5 Comparative Analysis on Downstream Tasks


This section analyzes the performance of CoCNN and BERT Pre in downstream tasks such as
Bangla and Hindi text classification, sentence completion and spell checking. The section also
provides statistical significance analysis while comparing CoCNN and BERT Pre.

5.5.1 Sentence Completion

Dataset                 Is Error?   CoCNN   BERT Pre
Nayadiganta (Bangla)    Yes         319     327
Nayadiganta (Bangla)    No          259     270
Patrika (Hindi)         No          212     221
Table 5.5: PPL score obtained by CoCNN and BERT Pre. Each model was trained for 70 epochs
on Prothom-Alo dataset for Bangla sentence completion task, and 70 epochs on Hindinews and
Livehindustan merged dataset for Hindi sentence completion task.

We form our sentence completion datasets for Hindi and Bangla from their corresponding
test datasets in order to evaluate our proposed CoCNN model against its close competitor BERT Pre.
For each sentence S_i of length n_i, we take the first ⌈n_i/2⌉ words as input and the remaining words
as ground truth outputs. We use typical beam search [67] with a beam width of 10: at each output
time step of our model we keep the 10 most probable partial sentences according to their probability
product, and our final output is the sentence with the maximum probability product. Sentence
completion is strictly more difficult than word prediction, since a wrong prediction of one output
word can adversely affect the prediction of all subsequent words of a particular sentence. In spite
of these challenges, the CoCNN model achieves reasonable PPL scores in this task, as shown in Table
5.5. Its scores are lower than those of BERT Pre in all three cases.
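
For clarity, the completion procedure can be sketched as follows; this is a minimal illustration in which predict_next is a hypothetical helper that returns a probability distribution over candidate next words from the trained model, not a function defined in this thesis.

def complete_sentence(prefix_words, predict_next, max_new_words=10, beam_width=10):
    # Each beam entry is (word list, probability product).
    beams = [(list(prefix_words), 1.0)]
    for _ in range(max_new_words):
        candidates = []
        for words, prob in beams:
            for next_word, p in predict_next(words).items():
                candidates.append((words + [next_word], prob * p))
        # Keep the beam_width most probable partial sentences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    # Final output: the completed sentence with the maximum probability product.
    best_words, _ = max(beams, key=lambda c: c[1])
    return best_words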

5.5.2 Text Classification

Dataset                               Class No   Class Names                                                    Training Sample No   Validation Sample No
Bangla Question Classification        6          Entity, Numeric, Human, Location, Description, Abbreviation   2700                 650
Hindi Product Review Classification   2          Positive Review, Negative Review                               1700                 170
Table 5.6: Bangla and Hindi Text classification dataset details

We have compared BERT Pre and CoCNN on a Bangla question classification dataset (entity,
numeric, human, location, description and abbreviation type questions) used in [68] and a Hindi
product review classification dataset (positive and negative reviews) [69]. Details of both
datasets are provided in Table 5.6. We use five fold cross validation while performing the
comparison on these datasets (see mean results in Table 5.7) in terms of accuracy and F1 score.
A one tailed independent t-test with a P-value threshold of 0.05 has been performed on the 5 pairs
of validation F1 scores obtained from the five fold cross validation of the two models. Our statistical
test results validate the significance of the improvement shown by CoCNN.

Dataset             Metric   BERT Pre   CoCNN
Question Classify   Acc      91.3%      92.7%
Question Classify   F1       0.915      0.926
Product Review      Acc      84.4%      85.2%
Product Review      F1       0.841      0.85
Table 5.7: Text classification performance comparison between BERT Pre and CoCNN

5.5.3 SemanticNet as Potential Spell Checker

Spell Checker Algorithm   Synthetic Error   Real Error
SemanticNet               70.1%             60.1%
Phonetic Rule             61.5%             32.5%
Clustering Rule           51.8%             43.8%
Table 5.8: Bangla spelling correction comparison in terms of accuracy

The CoCNN model uses SemanticNet for producing a SemanticVec representation from each input
word. We extract the CNN sub-model of SemanticNet from our trained (on the Prothom-Alo dataset)
CoCNN model and produce SemanticVecs for all 13 K top words of the Prothom-Alo dataset. A
spell checker (SC) suggests the correct version of a misspelled word. For any error word W_e, we can
generate its SemanticVec V_e using SemanticNet. We can then calculate the cosine similarity Cos_i between
V_e and the SemanticVec V_i of each top word W_i. A higher cosine similarity means a greater probability of
W_i being the correct version of W_e. We have found such an approach to be effective for correct
word generation. Recently, a phonetic rule based approach has been proposed in [70], where a
hybrid of the Soundex [71] and Metaphone [72] algorithms has been used for Bangla word level SC.
Another recently proposed SC has taken a clustering based approach [73]. We show the
comparison of our proposed SemanticNet based SC with these two existing SCs in Table 5.8.
Both the real and synthetic error datasets consist of 20 K error words formed from the top 13 K
words of the Prothom-Alo dataset. The real error dataset has been collected from a wide range of
Bangla native speakers using an easy to use web app. The web app randomly offers three top
words to the user for fast typing and stores the user typed versions of the corresponding correct
words in a database (shown in Figure 5.5). The results show the superiority of our proposed spell
checker over the existing approaches.
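
The correction step itself reduces to a nearest neighbour search in SemanticVec space; the snippet below is a minimal sketch of that idea, where semantic_net, top_words and top_vecs are illustrative stand-ins for the trained SemanticNet sub-model, the list of top words and their precomputed SemanticVecs.

import numpy as np

def correct_word(error_word, semantic_net, top_words, top_vecs):
    v_e = semantic_net(error_word)                 # SemanticVec of the misspelled word
    # Cosine similarity between v_e and the SemanticVec of every top word.
    sims = (top_vecs @ v_e) / (np.linalg.norm(top_vecs, axis=1) * np.linalg.norm(v_e) + 1e-9)
    # The most similar top word is suggested as the correct version.
    return top_words[int(np.argmax(sims))]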

Figure 5.5: Spelling error collection app overview


Chapter 6

Bangla Spell Checking Approach

This chapter describes the BERT derived BSpell architecture proposed for sentence level Bangla
spell checking. The architecture has a main branch and an auxiliary branch, both of which are
described in detail in this chapter. Furthermore, this chapter provides the details of
our proposed hybrid pretraining scheme required for pretraining the BSpell model.

6.1 Solution Overview


Figure 6.1 shows the details of our proposed BSpell architecture for Bangla spelling correction.
Each input word is represented by a matrix, which in turn is passed through our proposed
SemanticNet sub-model. This process yields a SemanticVec vector representation for
each input word. These vectors are then passed to two branches simultaneously. The main
branch is similar to the BERT Base architecture [74], which is basically a stack of Transformer
encoders. The main goal of this branch is to focus on the input sentence context for spell
checking. The secondary branch consists of an output dense layer used for the sole purpose
of imposing an auxiliary loss that facilitates SemanticNet sub-model learning. The main goal of
the secondary branch is to facilitate word level local and global pattern learning. The whole
architecture shown in Figure 6.1 is trained end to end. The secondary branch is only kept
during training; during validation and testing, this branch is removed, as it adds a significant
number of extra parameters to the model and increases application runtime.

6.2 SemanticNet Sub-Model


Correcting a particular word requires the understanding of other relevant words in the same
sentence. Unfortunately, those relevant words may also be misspelled (see Figure 1.6). As
humans, we can understand the meaning of a word even if it is misspelled because of our deep
understanding at word syllable level and our knowledge of usual spelling error pattern. We want


[Figure 6.1 diagram: each input word Word_i is represented by the concatenation of its character embeddings C_1 ... C_m, which the SemanticNet sequential 1D CNN sub-model (1D convolutions of kernel size 3 followed by global maxpooling) turns into a SemanticVec F_i; the SemanticVecs feed the BERT_Base main branch (dense layer, Transformer encoders 1 to 12 and a Softmax layer producing the final output words Word_1 ... Word_n) and, in parallel, a Softmax output layer that produces the auxiliary output words]

Figure 6.1: BSpell architecture details

our model to have a similar semantic level understanding of the words. We employ SemanticNet, a
sequential 1D CNN sub-model applied at each individual word level, with a view to learning intra
word syllable patterns. The details of this sub-model have been provided in Section 4.2. We believe that
the inclusion of such a sub-model in any learning based language model can play a vital role in all
types of Bangla NLP tasks.

6.3 BERT Base as Main Branch


Each of the n SemanticVecs obtained from the n input words is passed individually through a
dense layer. These modified vectors are then passed in parallel to our first Transformer
encoder. The intuition behind this dense layer based modification is to prepare the SemanticVecs
for Transformer based context information extraction. 12 such Transformer encoders are
stacked on top of each other. Each Transformer applies a multi head attention mechanism
to all of its input vectors. The details of this process have been described in Section 2.4, while a
detailed demonstration has been provided in Figure 2.8. The vectors modified through the attention
mechanism are then layer normalized. A neural network dense layer provides a
matrix as output during mini batch training; normalizing this matrix based on its mean and
variance is known as layer normalization. The layer normalized vectors are added to a
dense layer modified version of these vectors through a skip connection in order to incorporate both
low level and high level features. The n vectors from the n words of the input sentence keep passing
from one Transformer to the next, with attention oriented modification based on their interaction
with one another. Such interaction is vital for understanding the context of each input word.
Often the context of a word can be just as important as its error pattern for the spelling correction of
that word (see Figure 1.5). We pass the final Transformer output vectors to a Softmax layer that
applies the Softmax activation function to each vector individually. So now we have n
probability vectors. Each probability vector contains lenP values, where lenP is one more than
the total number of top words considered; the one additional entry represents rare words, also
known as UNK. The top word corresponding to the index of the maximum probability value
of the ith probability vector represents the correct word for Word_i of the input sentence. We use
such a word level BERT based approach so that we can tackle the issue of differing character
counts between an error input word and its correct output word, which is common in Bangla (see
Figure 1.4).
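
A rough sketch of this branch, written with PyTorch's built-in encoder (whose internal block layout differs in detail from the one just described), is given below. The 768-dimensional sizes, 12 heads and 0.3 dropout follow Section 7.2.2, and the vocabulary size follows Table 7.1; the assumption that SemanticVecs are themselves 768-dimensional, as well as the toy usage at the end, are illustrative rather than the exact implementation.

import torch
from torch import nn

class MainBranch(nn.Module):
    def __init__(self, d_model=768, n_layers=12, n_heads=12, lenP=35001):
        super().__init__()
        self.pre_dense = nn.Linear(d_model, d_model)      # prepares SemanticVecs for the encoders
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           dim_feedforward=d_model, dropout=0.3,
                                           batch_first=True)
        self.encoders = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.out = nn.Linear(d_model, lenP)               # lenP = top words + one UNK entry

    def forward(self, semantic_vecs):                     # shape: (batch, n words, d_model)
        x = self.pre_dense(semantic_vecs)
        x = self.encoders(x)
        return torch.softmax(self.out(x), dim=-1)         # one probability vector per input word

# toy-sized usage: 2 sentences of 8 words, reduced dimensions for speed
probs = MainBranch(d_model=64, n_layers=2, n_heads=4, lenP=101)(torch.randn(2, 8, 64))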

6.4 Auxiliary Loss Based Secondary Branch


The gradient vanishing problem is a common phenomenon in deep neural networks, where the weights
of the shallow layers are not updated sufficiently during backpropagation. This results in sub-
optimal performance of the entire model. With 12 Transformer encoders on top
of the SemanticNet sub-model, the layers of this sub-model certainly lie in a shallow position
of a very deep neural network. Although the SemanticNet sub-model constitutes only a small initial
portion of the BSpell network, this portion of the network is responsible for learning user written word
patterns, which is the most important task during spelling correction. In order to eliminate the
possible gradient vanishing problem of the SemanticNet sub-model and to turn this sub-model into
an effective pattern based spell checker, we introduce an auxiliary loss based secondary branch
in BSpell.
Each of the n SemanticVecs obtained from the n input words is passed in parallel to a Softmax
layer without any further modification. The outputs obtained from this branch are called auxiliary
outputs; they are probability vectors exactly like the main branch outputs. We use
the following formula to describe the impact of our auxiliary loss on weight update:

L_Total = L_Final + λ × L_Auxiliary

L_Total is the weighted total loss of the BSpell model. We want our final loss L_Final to have a greater
impact on the model weight update than our auxiliary loss L_Auxiliary, as L_Final is the loss associated
with the final prediction made by BSpell. Hence, we impose the constraint 0 < λ < 1. As λ gets
closer to 0, the influence of L_Auxiliary on the model weight update diminishes more and more.
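
In a typical implementation the two losses are simply summed with the weight λ before backpropagation; the following is a minimal sketch of that step with toy tensors and an illustrative λ = 0.3 (the actual value is a tunable hyperparameter, not one fixed by this thesis).

import torch
import torch.nn.functional as F

lenP, n_words, batch = 5, 4, 2                    # toy sizes for illustration only
main_logits = torch.randn(batch, n_words, lenP, requires_grad=True)   # main branch outputs
aux_logits  = torch.randn(batch, n_words, lenP, requires_grad=True)   # auxiliary branch outputs
targets     = torch.randint(0, lenP, (batch, n_words))                # correct word indices

lam = 0.3                                         # 0 < lambda < 1 keeps L_Final dominant
loss_final = F.cross_entropy(main_logits.reshape(-1, lenP), targets.reshape(-1))
loss_aux   = F.cross_entropy(aux_logits.reshape(-1, lenP), targets.reshape(-1))
loss_total = loss_final + lam * loss_aux          # L_Total = L_Final + lambda * L_Auxiliary
loss_total.backward()                             # gradients reach shallow layers via both terms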

[Figure 6.2 diagram: the input passes through weights W_1 ... W_49 producing intermediate outputs O_1 ... O_49; W_L1 yields the final output with loss L_1, while a one-layer auxiliary branch W_L2 taken directly from O_1 yields the auxiliary output with loss L_2]

Figure 6.2: Gradient vanishing problem elimination using auxiliary loss

Gradient vanishing is a serious problem in deep BERT based architectures, especially for layers
that are near the input (the SemanticNet sub-model in our case). Let us take the
scenario of Figure 6.2 for instance. This is an oversimplified architecture, where the primary
branch contains 49 learnable layers, while the secondary branch contains only one. Each layer is
assumed to have only one weight for demonstration purposes. Here, W_1 is the weight nearest to
the input. W_1 gets updated using the following formula:

W_1 = W_1 - (∂L_1/∂W_1 + ∂L_2/∂W_1)
We can calculate the two partial derivatives of the above equation as follows:

∂L_1/∂W_1 = ∂L_1/∂O_49 × ∂O_49/∂O_48 × ... × ∂O_2/∂O_1 × ∂O_1/∂W_1

∂L_2/∂W_1 = ∂L_2/∂O_1 × ∂O_1/∂W_1
Here, O_1, O_2, ..., O_49 are intermediate outputs; W_1, W_2, ..., W_49 are intermediate neural network
(NN) weights; W_L1 and W_L2 are output layer weights; and L_1 and L_2 are the losses at the final and
auxiliary outputs, respectively.

There are two gradient calculations involved. The ∂L_1/∂W_1 calculation takes the product of 50
derivative terms. If each of these terms is slightly less than 1.0, the product becomes almost
equal to 0 (for instance, 0.9^50 ≈ 0.005). In the absence of the secondary branch, this is all we
have to update W_1. On the other hand, ∂L_2/∂W_1 contains only two product terms. This term does
not face any such problem even if each of its factors is significantly below 1.0; the product remains
large enough (0.9^2 = 0.81). Thus, introducing an auxiliary loss based shallow secondary branch is
an effective measure when it comes to updating the weights of shallow layers in deep neural networks.
The secondary branch does not have any Transformer encoders or any other means through which
the input words can interact to produce context information. The prediction made from this
branch depends solely on the misspelled word pattern information extracted by the SemanticNet
sub-model. This enables the SemanticNet sub-model to learn more meaningful word
representations. Such ability is crucial, as an error word's pattern is in general more important than
its context for correcting that word, although there are a few exceptional cases.

6.5 BERT Hybrid Pretraining Scheme

[Figure 6.3 diagram: a five word sentence Word_1 ... Word_5 fed to BSpell; Word_2 and Word_5 undergo character masking, Word_4 undergoes word masking, and the five original words are the prediction targets]
Figure 6.3: Proposed hybrid pretraining scheme

In contemporary BERT pretraining methods, each input word Word_i may be kept intact or may be
replaced by a default mask word in a probabilistic manner [1, 28]. The BERT architecture then
has to predict the masked words while keeping the unmasked words intact at the output
side. Mistakes made by BERT contribute to the loss value, driving backpropagation
based BERT weight updates. Essentially, in this process BERT is learning to fill in the gaps. This
task is equivalent to predicting the appropriate word based on the prior and post context of the
sentence or document sample. So, contemporary word masking based pretraining focuses
on teaching the model the context of the language. This is good enough for tasks such as next
word prediction, sentence auto completion, sentiment analysis and text classification. But in
spelling correction, recognizing noisy word patterns is equally important. Unfortunately, there is
no provision for that in contemporary word masking based pretraining [1, 28]: either we have a
correct word or we have a fully masked word.

To mitigate the above limitations, we propose hybrid masking. Among the n input words of a
sentence, we randomly replace n_W words with a mask word Mask_W, where n > n_W. So now
we have n - n_W words remaining intact. Among these words, we choose n_C words for character
masking, where (n - n_W) > n_C. Let us now see how we implement character masking on these
n_C chosen words by focusing on a single word among them. Suppose our
word of interest consists of m characters. We choose m_C characters at random from this word to
be replaced by a mask character Mask_C, where m > m_C. Such masked characters introduce
noise into words and help BERT to understand the probable semantic meaning of noisy words.
Misspelled words are actually similar to such noisy words and, as a result, hybrid pretraining
boosts spelling correction specific fine tuning performance. Figure 6.3 shows a toy example of
hybrid masking on a sentence containing 5 words. Here, Word_2 (character no. 2) and Word_5
(characters no. 1 and 4) are masked at the character level, while Word_4 is chosen for word level
masking. Word_1 and Word_3 remain intact.
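
For concreteness, the following minimal sketch applies hybrid masking to a tokenized sentence, using the masking ratios later specified in Section 7.2.2 and hypothetical mask tokens; it illustrates the idea rather than reproducing the exact pretraining pipeline.

import random

MASK_WORD = "[MASK]"        # hypothetical whole-word mask token
MASK_CHAR = "#"             # hypothetical mask character

def hybrid_mask(words, word_rate=0.15, char_word_rate=0.40, char_rate=0.25):
    n = len(words)
    word_masked = set(random.sample(range(n), max(1, int(word_rate * n))))
    remaining = [i for i in range(n) if i not in word_masked]
    char_masked = set(random.sample(remaining, int(char_word_rate * len(remaining))))
    masked = []
    for i, w in enumerate(words):
        if i in word_masked:                       # whole-word masking (the n_W words)
            masked.append(MASK_WORD)
        elif i in char_masked:                     # character masking (m_C characters of an n_C word)
            chars = list(w)
            for j in random.sample(range(len(chars)), max(1, int(char_rate * len(chars)))):
                chars[j] = MASK_CHAR
            masked.append("".join(chars))
        else:
            masked.append(w)
    return masked                                  # the unmasked original words act as targets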
Chapter 7

Spell Checking Performance Analysis

We start this chapter by specifying the details of our spell checking experimental dataset. Then
we move on to specifying the hyperparameters of our proposed BSpell model along with our
spell checker performance metric. The hardware specifications for spell checking experiments
are the same as Bangla language modeling experiments. So, we do not repeat them here. We
proceed to show the effectiveness of our proposed BSpell model in spelling correction over
existing BERT variants. Then we show the effectiveness of proposed hybrid pretraining scheme
on BSpell architecture over other possible pretraining methods. Finally, we compare hybrid
pretrained BSpell model with possible LSTM based spell checker variants and some recently
proposed Bangla spell checkers. We also shed light on the importance of SemanticNet sub-model
in BSpell architecture. This chapter ends with the limitations of our proposed method.

7.1 Dataset Specifications

Datasets                             Unique Word   Unique Char   Top Word   Training Sample   Validation Sample   Unique Error Word   Error Word %
Prothom-Alo (Synthetic Error)        262 K         73            35 K       1 M               200 K               450 K               52%
Real Error (Bengali)                 14.5 K        73            --         4.3 K             2 K                 10 K                36%
Pretrain Corpus (Bengali)            513 K         73            40 K       5.5 M             --                  --                  --
ToolsForIL (Hindi Synthetic Error)   20.5 K        77            15 K       75 K              16 K                5 K                 10%
Pretrain Corpus (Hindi)              370 K         77            40 K       5.5 M             --                  --                  --
Table 7.1: Dataset specification details for spell checker model development

We now move on towards describing our Bangla spelling correction dataset in Table 7.1. We


have used one Bengali and one Hindi corpus with over 5 million (5 M) sentences for BERT
pretraining (see Table 7.1). There is no validation sample used while implementing pretraining.
The Bengali pretraining corpus consists of Prothom Alo [75] articles dated from 2014-2017 and
BDnews24 [76] articles dated from 2015-2017. The Hindi pretraining corpus consists of Hindi
Oscar Corpus [63], preprocessed Wikipedia articles [64], HindiEnCorp05 dataset [65] and WMT
Hindi News Crawl data [66]. We apply our masking based model pretraining schemes on these
datasets.
We have used Prothom-Alo 2017 online newspaper dataset for Bengali spelling correction model
training and validation purpose. Our errors in this corpus have been produced synthetically
using the probabilistic algorithm described in [9]. The errors approximate realistic spelling
errors, as the algorithm considers phonetic similarity based confusions, compound letter typing
mistakes, and keyboard mispress insertions and deletions. We further validate our baselines and our
proposed method on Hindi open source spelling correction dataset, namely ToolsForIL [24].
This synthetically generated dataset was used for character level sequence to sequence Hindi
spell checker development.
We have also manually collected real error datasets of typing from nine Bengali writers. We have
collected a total of 6300 correct sentences from Nayadiganta [77] online newspaper. Then we
have distributed the entire dataset among nine participants. They have typed (at regular speed)
each correct sentence using an English QWERTY keyboard, producing natural spelling errors. It
has taken 40 days to finish the labeling, where each participant has labeled fewer than 20
sentences per day in order to ensure labeling quality. As this real error dataset has mainly
been used as a test set after training and validating on synthetic error dataset Prothom-Alo, we
have considered its top words to be the same as Prothom-Alo dataset. Table 7.1 contains all
dataset details. For example, Prothom-Alo (synthetic error) dataset has 262 K unique words, 35
K top words, 1 M training samples, 200 K validation samples, 450 K unique error words and
52% of the existing words in this corpus contain one or more errors. Top words which are to
be corrected by our proposed models have been taken such that they cover at least 95% of the
corresponding corpus.

7.2 Model Optimization

7.2.1 Performance Metric

We use accuracy and F1 score as our performance metrics for spelling correction. Note
that during spelling correction, we only correct top words; we do not correct rare UNK words.
Merely detecting rare words correctly is considered sufficient, as users do not necessarily want
to correct names and similar proper nouns.
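
As an interpretation of this metric (not the exact evaluation script), word level accuracy and F1 can be computed roughly as follows, assuming scikit-learn, treating a rare target as handled correctly whenever the model labels it UNK; the averaging mode of the F1 score is an assumption.

from sklearn.metrics import f1_score

def word_level_scores(pred_words, true_words):
    # pred_words / true_words: model output and ground truth, word for word, with rare words as "UNK".
    acc = sum(p == t for p, t in zip(pred_words, true_words)) / len(true_words)
    f1 = f1_score(true_words, pred_words, average="weighted")   # averaging choice is an assumption
    return acc, f1

acc, f1 = word_level_scores(["word_a", "UNK", "word_b", "word_a"],
                            ["word_a", "UNK", "word_b", "word_b"])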

7.2.2 Hyperparameter Specifications

Optimizer details and SemanticNet sub-model details for BSpell are the same as CoCNN. So,
naturally we do not repeat them here.

BSpell Main Branch Details

The main branch of the BSpell architecture is similar to BERT Base [74] in that it stacks 12
Transformer encoders on top of each other. We pass each SemanticVec through a 768 hidden
unit dense layer. The modified output obtained from this dense layer is then passed on to the
bottom-most Transformer of the stack. Each Transformer consists of 12 attention heads and a 768
hidden unit feed forward neural network. The attention outputs from each Transformer are passed
through a dropout layer [78] with a dropout rate of 0.3 and then layer normalized [79].

Implemented Pretraining Scheme Details

Pretraining schemes of BERT based architectures differ in terms of how masking is implemented
on the words of each sentence of the pretraining corpus. We have experimented with three types
of masking schemes which are as follows:

• Word Masking: For each sentence, we randomly select 15% words of that sentence and
replace those words with a fixed mask word filled with mask characters.

• Character Masking: For each sentence, we randomly select 50% words of that sentence.
For each selected word, we randomly mask 30% of its characters by replacing each of
them with our special mask character.

• Hybrid Masking: For each sentence, we randomly select 15% words of the sentence
and replace those words with a fixed mask word. Again from the remaining words, we
randomly select 40% of the words. For these selected words, we randomly mask 25% of
their characters.

7.3 Comparing BSpell with Contemporary BERT Variants


We start our experiments with a sequence to sequence BERT model in which the encoder and the decoder
each consist of 12 Transformers [29] stacked on top of one another [28]. Predictions are
made at the character level. A similar architecture has been used by FASpell [45] for Chinese spelling
correction. A word containing m characters is considered wrong if even one of its characters
is predicted incorrectly; hence it is difficult to achieve good results with character level seq2seq
modeling. Moreover, in most cases during sentence level spell checking, the correct spelling

Spell Checker          Synthetic Error (Prothom-Alo)   Real-Error (No Fine Tune)   Real-Error (Fine Tuned)   Synthetic Error (Hindi)
Architecture           ACC      F1                     ACC      F1                  ACC      F1                ACC      F1
BERT Seq2seq           31.6%    30.5%                  24.5%    22.4%               29.3%    27.8%             22.8%    20.9%
BERT Base              91.1%    90.2%                  83%      82.3%               87.6%    85.5%             93.8%    92.3%
Soft Masked BERT       92%      91.9%                  84.2%    83.2%               88.1%    86.2%             94%      93.3%
BSpell                 94.7%    93.4%                  86.1%    85.9%               90.1%    89.8%             96.2%    96%
BSpell (No Aux Loss)   93.4%    92.3%                  85.2%    85.1%               89.1%    88.3%             94.4%    93.9%
Table 7.2: Comparing BERT based variants. Typical word masking based pretraining has been
used on all these variants [1]. Real-Error (Fine Tuned) denotes fine tuning of the Bangla synthetic
error dataset trained model on real error dataset, while Real-Error (No Fine Tune) means directly
validating synthetic error dataset trained model on real error dataset without any further fine
tuning on real error data.

[Figure 7.1 examples: two misspelled Bangla sentences and their correctly spelled versions, glossed in English as "Today you will know about Aslam." and "He properly helped Butler."]

Figure 7.1: The bold words in the sentences are the error words. The red marked words are the
ones where only BSpell succeeds in correcting among the approaches provided in Table 7.2

of the ith word of the input sentence has to be the ith word of the output sentence as well. Such
a constraint is difficult to satisfy with this architecture design. Hence we get poor results using
BERT Seq2seq operating at the character level (see Table 7.2). The BERT Base architecture is inspired
by [26], where we use only the stack of Transformer encoders. Our implemented
BERT Base has two differences from the design proposed in [26]: (i) we make predictions
at the word level instead of the character level, since character level prediction requires the number of
characters in a misspelled sentence and in its correct version to be the same, which is an uncommon scenario
in Bangla (see Figure 1.4 for details); (ii) we do not incorporate any external knowledge about
Bangla spell checking through the use of a graph convolutional network, since such knowledge is
not well established in the Bangla spell checking domain. In BERT Base, the correct version of
the ith word of the input sentence can always be found at the ith position of the corrected output
sentence. This approach achieves good performance in all four cases. Soft Masked BERT learns
to apply specialized synthetic masking on error prone words in order to push the error correction
performance of BERT Base further. The error prone words are detected using a gated recurrent
unit (GRU) based sub-model, and the whole architecture is trained end to end. Although the
corresponding research [27] implemented this architecture to make corrections at the character level,

our implementation does everything at the word level for the reasons mentioned before.
We see a slight performance improvement in all cases after incorporating this new design. Note
that we have used the popular sub-word based FastText [80] word representation for both
BERT Base and Soft Masked BERT. Our proposed BSpell model shows a decent performance
improvement in all cases. The SemanticNet sub-model, with the help of our designed auxiliary loss,
is able to specialize on word error patterns. The reason behind our performance gain becomes
clearer from the examples of Figure 7.1. The red marked error words have a high edit distance
to their corresponding correct words, although these spelling error patterns are common while
typing Bangla using an English QWERTY keyboard. A human with decent knowledge of Bangla
writing patterns should be able to identify the correct versions of the red marked words given
slight contextual information. The SemanticNet sub-model, with the help of auxiliary loss based
gradient updates, provides BSpell with a similar capability. As part of an ablation study, we remove
the specialized auxiliary loss from the BSpell model (BSpell (No Aux Loss)) and redo the
experiments. The performance decreases in all cases.

7.4 Comparing Different Pretraining Schemes on BSpell

Pretraining Scheme   Synthetic Error (Prothom-Alo)   Real-Error (No Fine Tune)   Real-Error (Fine Tuned)   Synthetic Error (Hindi)
                     ACC      F1                     ACC      F1                  ACC      F1                ACC      F1
Word Masking         94.7%    93.4%                  86.1%    85.9%               90.1%    89.8%             96.2%    96%
Character Masking    95.6%    95.2%                  85.3%    85.1%               89.2%    88.9%             96.4%    96.3%
Hybrid Masking       97.6%    97.1%                  87.8%    87.3%               91.5%    91.1%             97.2%    97%
Table 7.3: Comparing BSpell exposed to various pretraining schemes

[Figure 7.2 examples: two misspelled Bangla sentences and their correctly spelled versions, glossed in English as "They are a small nation." and "There is no limitation in gaining knowledge."]

Figure 7.2: The bold words in the sentences are the error words. The red marked words are the
ones where only hybrid pretrained BSpell succeeds in correcting among the approaches provided
in Table 7.3

We have applied three different pretraining schemes (details provided in Section 7.2.2) to
our proposed BSpell model before fine tuning on the corresponding spelling correction dataset. Word
masking mainly teaches BSpell the context of a language through large scale pretraining; it takes a
fill in the gaps sort of approach. Spelling correction is not all about filling in gaps. It is also
about what the writer wants to say, i.e., being able to predict a word even if some of its characters
are blank (masked). A writer will seldom write the most appropriate word according to context;
rather, he will write what is appropriate to express his opinion. Character masking takes a more
drastic approach by completely eliminating the fill in the gap task. This approach masks a few
of the characters residing in some of the input words of the sentence and asks BSpell to predict
these noisy words' original correct versions. The lack of context in such a pretraining scheme has a
negative effect on performance in the real error dataset experiments, where harsh errors exist
and context is the only feasible way of correcting them (shown in Table 7.3). The hybrid
masking approach focuses both on filling in word gaps and on filling in character gaps through
prediction of the correct word, and helps BSpell achieve state-of-the-art performance. A couple of
examples are provided in Figure 7.2, where the errors introduced in the red marked words
are slightly less common, have large edit distances to their correct versions, and require both
error pattern understanding and word context for correction. Only the hybrid pretrained BSpell has
been successful in correcting these words.

7.5 Comparing LSTM Variants with BSpell

Spell Checker Architecture        Synthetic Error (Prothom-Alo)   Real-Error (No Fine Tune)   Real-Error (Fine Tuned)   Synthetic Error (Hindi)
                                  ACC      F1                     ACC      F1                  ACC      F1                ACC      F1
Pre LSTM                          15.5%    13.3%                  12.3%    10.9%               14.2%    13.1%             17.4%    16.5%
Pre LSTM + Word CNN               79.3%    78.2%                  73.5%    71.4%               78.7%    77.1%             78.2%    77.1%
Pre LSTM + Word CNN + Post LSTM   82.2%    82.1%                  78.8%    78.6%               81.2%    81%               81.4%    81.1%
BiLSTM                            81.9%    81.8%                  78.3%    78.1%               81.1%    80.9%             81.2%    80.9%
Stacked BiLSTM                    83.5%    83.2%                  80.1%    80%                 82.4%    82.2%             82.7%    82.4%
Attn Seq2seq (Char level)         20.5%    17.8%                  15.4%    12.9%               17.3%    15.2%             22.7%    21.6%
BSpell (Hybrid)                   97.6%    97.1%                  87.8%    87.3%               91.5%    91.1%             97.2%    97%
Table 7.4: Comparing LSTM based variants with hybrid pretrained BSpell. FastText word
representation has been used with LSTM portion of each architecture.

We start our experiments with Pre LSTM, a generalized form of N-gram based word prediction
[23]. Given word1 , word2 , . . . wordi−1 of a sentence, this model predicts wordi . We assume
this predicted word to be the correct version of the given wordi without taking its own spelling

pattern into consideration. Quite understandably, the result is poor on all datasets (see Table
7.4). Pre LSTM + Word CNN architecture coordinates between two neural network branches for
correcting wordi of a sentence. Pre LSTM branch takes in word1 , word2 , . . . wordi−1 as input,
whereas 1D convolution based Word CNN branch takes in wordi representation matrix as input.
Pre LSTM provides a prior context vector as output, whereas Word CNN provides wordi spelling
pattern summary vector as output. Both these outputs are merged and passed to a coordinating
neural network module to get the correct version of wordi . Performance jumps up by a large
margin with the addition of Word CNN branch as misspelled word spelling pattern is crucial for
spell checking that word. Pre LSTM + Word CNN + Post LSTM is similar to the previous model
except for the presence of one extra LSTM based neural network branch Post LSTM which
takes in wordn , wordn−1 , wordn−2 , . . . wordi+1 as input, where n is the number of words in the
sentence. It produces a post context vector for wordi as output. The merging and prediction
procedure is similar to Pre LSTM + Word CNN architecture. The performance sees another boost
after the addition of Post LSTM branch, because the correction of a word requires the word’s
error pattern, previous context and post context (see Figure 1.2). All three of the discussed LSTM
based models correct one word at a time in a sentence. BiLSTM is a many to many bidirectional
LSTM that takes in all n words of a sentence at once and predicts their correct version as output
(wordi is provided as input at time step i) [2]. BiLSTM shows almost the same performance as
Pre LSTM + Word CNN + Post LSTM, because it also takes in both previous and post context
into consideration besides the writing pattern of each word. We have stacked two LSTM layers
on top of each other in the discussed four models. However, in Stacked BiLSTM, we stack
twelve many to many bidirectional LSTMs on top of one another instead of just two. Spelling
correction performance improves by only 1-1.5% in spite of such a large increase in parameter
count. Attn Seq2seq is a sequence to sequence LSTM based model utilizing Bahdanau attention
mechanism at decoder side [12]. This model takes in misspelled sentence characters as input and
provides the correct sequence of characters as output [24]. Due to word level spelling correction
evaluation, this model faces the same problems as BERT Seq2seq model discussed in Section 7.3.
Such architecture shows poor performance in Bangla and Hindi spelling correction. Although
Stacked BiLSTM shows superior performance in LSTM paradigm, our proposed BSpell (Hybrid)
model outperforms this model by a large margin.

7.6 Comparing Existing Bangla Spell Checkers with BSpell


The Phonetic Rule spell checker takes a Bangla phonetic rule based hard coded approach [81], where
a hybrid of the Soundex [20] and Metaphone [21] algorithms has been used for spell checking.
The Clustering Method, on the other hand, is a clustering rule based spell checking scheme built on
predefined rules for word cluster formation, cluster based distance measures and correct
word suggestion [22]. These rule based spell checkers do not show decent performance on Bangla

Spell Checker Algorithm   Synthetic Error (Prothom-Alo)   Real-Error (No Fine Tune)   Real-Error (Fine Tuned)
                          ACC      F1                     ACC      F1                  ACC      F1
Phonetic Rule             61.2%    58.2%                  43.5%    40.1%               --       --
Clustering Method         52.3%    50.1%                  44.2%    41.2%               --       --
1D CNN                    77.5%    74.5%                  68.5%    67.8%               72.3%    70.3%
BSpell (Hybrid)           97.6%    97.1%                  87.8%    87.3%               91.5%    91.1%
Table 7.5: Comparing existing Bangla spell checkers with hybrid pretrained BSpell model

spelling correction datasets because of the wide variety of errors present (see Table
7.5). Since these two spell checkers are not learning based, fine tuning is not applicable to
them. 1D CNN is a simple sequential 1D convolution based model that takes a word's matrix
representation as input and predicts that word's correct version as output [82]. Such a learning
based scheme brings in a substantial performance improvement. None of these three spell checkers
takes the context of a misspelled word into consideration while correcting that word. As a result, the
performance is poor, especially on the Bangla real error dataset. Our proposed hybrid pretrained
BSpell model outperforms these existing Bangla spell checkers by a wide margin.

7.7 Effectiveness of SemanticNet Sub-Model

[Figure 7.3 table: ten clusters W_1 ... W_10, each pairing a popular Bangla word (glossed as he, their, complain, done, possess, without, thousand, end, plan, situation) with three of its real life error variations]
Figure 7.3: 10 popular Bangla words and their real life error variations. Cluster visual is provided
in Figure 7.4

The main motivation behind the inclusion of the SemanticNet sub-model in BSpell is to obtain
vector representations of error words that are as close as possible to those of their corresponding correct words.
Such ability imposes the correct semantic meaning on misspelled input words in a sentence. Such
representations come in handy in cases where there are many error words in a single sentence

[Figure 7.4 plots: two scatter plots of SemanticVecs projected onto principal components PC1 and PC2; (a) all ten word clusters, (b) a zoomed-in view of clusters W_6 and W_8]

Figure 7.4: (a) 10 Bangla popular words and three error variations of each of those words
(provided in Figure 7.3) have been plotted based on their SemanticVec representation. Each
popular word along with its three error variations form a separate cluster marked with unique
color. (b) Two of the popular word clusters zoomed in to have a closer look.

(see Figure 1.6). We take 10 frequently occurring Bangla words and collect three real life error
variations of each of these words (see Figure 7.3). We produce SemanticVec representations of all
40 of these words (correct words and error variations) using the SemanticNet sub-model
(extracted from the trained BSpell model). We apply principal component analysis (PCA) [83] to
these SemanticVecs and plot them in two dimensions. Finally, we apply the K-Means
clustering algorithm with careful initialization and K = 10 [84]. Figure 7.4a shows the 10
clusters obtained from this algorithm. Each cluster consists of a popular word and its three error
variations. In all cases, the correct word and its three error versions are so close in the plot
that they almost form a single point. In Figure 7.4b, we zoom in for a closer look at clusters
W_6 and W_8. The error versions of a particular word are close to that word but far from the other
correct words.
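
The visualization procedure can be reproduced along the following lines; this is a minimal sketch with randomly generated vectors standing in for the 40 actual SemanticVecs (whose dimensionality is not fixed here), and scikit-learn's default k-means++ seeding standing in for the careful initialization mentioned above.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

vecs = np.random.rand(40, 128)                    # placeholder for the 40 SemanticVecs (10 words x 4 variants)
points = PCA(n_components=2).fit_transform(vecs)  # project to two dimensions for plotting
labels = KMeans(n_clusters=10, n_init=10).fit_predict(points)   # K = 10 clusters
print(labels.reshape(10, 4))                      # with real SemanticVecs, each row (a word and its
                                                  # three error variants) should share a single label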

7.8 BSpell Limitations

[Figure 7.5 examples: three error / correct Bangla sentence pairs, glossed in English as "Wild elephant entered the village.", "Such news is rarely heard these days." and "He helped Butler."]

Figure 7.5: Three example scenarios where BSpell fails



There are three cases where the BSpell model fails (see Figure 7.5). The first and second cases involve
accidental merging of two words into one or accidental splitting of one word into two. Our
proposed model provides a word for word correction, i.e., the number of input words and the number
of output words of BSpell have to be exactly the same. Unfortunately, during word merging or
word splitting, the number of input and output words changes. The third case involves errors in rare
words, words that are mostly proper nouns and are rarely found in a Bangla corpus. Our model
simply labels such words as UNK and does not try to correct errors prevalent in such words. The
introduction of encoder decoder based spelling correction at the character level can theoretically
eliminate these limitations. However, in almost all sentence level spelling corrections, the correct version
of input word i should be found at output word i, and sequence to sequence modeling finds such a
constraint hard to follow. Moreover, the loss function involved in character level modeling emphasizes
correcting characters: if a word contains 10 characters and 9 out of those 10 characters are
rendered correctly, that word still remains incorrect. The poor spelling correction performance of
BERT Seq2seq (see Table 7.2) and Attn Seq2seq (see Table 7.4) bears testimony to our claim.
Chapter 8

Conclusion

We start this chapter by elaborating on possible future directions of this research. We then
conclude with some remarks summarizing our research and its contributions.

8.1 Future Research Direction


Our current study focuses on architectures based on one dimensional convolution. For image
classification, object detection and segmentation, different state-of-the-art two dimensional
CNN architecture designs have been proposed, such as VGGNet, ResNet, Inception, DenseNet
and XceptionNet. Such architectures are deep and have specific design choices such as
convolution and pooling combinations, skip connections, parallel branching, reusing early stage
features in later neural network stages and depthwise separable convolution. Each has its own
intuition behind its performance improvement. Experimenting with these design choices for Bangla
language modeling can be a very good route for future researchers to take.
Our proposed Bangla spell checker is based on learning patterns and context from a large spelling
error corpus, where user specific behaviour has not been modeled in any way. Future studies may
investigate a personalized Bangla spell checker that has the ability to take a user's personal
preference and writing behaviour into account. A reinforcement learning based approach that takes
the limited availability of user specific data into consideration can be a good research direction as well.

[Figure 8.1 examples: three partially typed Bangla phrases, each followed in parentheses by the complete word the system should suggest]

Figure 8.1: Examples of online spell checking

Our proposed BSpell architecture is aimed at offline spell checking, where the model makes
corrections after the full sentence has been provided to it. A possible future research direction is an
online version of spell checking, where the proper word is suggested while a word is being typed,
based on the partially typed word pattern and its previous context. Three such
examples are provided in Figure 8.1. The green marked words are the correctly predicted
versions of the corresponding red marked words. In the second and third examples, the
partially typed words are typed incorrectly; the designed system has to predict the correct and
complete version in spite of the incorrectly typed partial words. Furthermore, since this spell checker
will be used online, it has to be runtime efficient as well. This is in essence a more difficult
problem than offline spell checking.

8.2 Concluding Remarks


We have proposed a CNN based architecture, Coordinated CNN (CoCNN), that introduces two
1D CNN based key concepts for Bangla language modeling: the word level SemanticNet and the
sentence level Terminal Coordinator. A detailed investigation across three deep learning paradigms
(CNN, LSTM and Transformer) shows the effectiveness of CoCNN in Bangla and Hindi LM. We
further show the effectiveness of our proposed model on two text classification tasks. Further
analysis of SemanticNet indicates its ability to work as an effective Bangla spell checker. Future
research may incorporate interesting ideas from state-of-the-art two dimensional CNNs, such
as efficient increase of architecture depth, parallel branching and skip connections, into CoCNN;
these are expected to increase its language modeling performance, although such inclusion may
also increase model memory, training time and validation time. Using Bangla and Hindi spelling
correction datasets, we have also investigated the limitations of existing Bangla spell checkers as
well as of other state-of-the-art spell checkers proposed for high resource languages. CoCNN is a
many to one architecture. In keeping with the necessity of a many to many type architecture for
sentence level Bangla spelling correction, we have proposed the BSpell architecture, which exploits
the concept of hybrid masking based pretraining and, mimicking CoCNN, uses the SemanticVec
representation of misspelled input words. Such word representation enhances spelling correction
performance in Bangla through the introduction of a word level learnable sub-model. We also
introduce a specialized auxiliary loss to push the learning ability of SemanticNet further. Furthermore,
we perform a detailed analysis of possible LSTM based spell checker models on Bangla and Hindi
datasets. Detailed experimental evaluation shows the effectiveness of our proposed method over
existing ones.
References

[1] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and
V. Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” arXiv preprint
arXiv:1907.11692, 2019.

[2] M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural networks,” IEEE transactions
on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.

[3] S. Takase, J. Suzuki, and M. Nagata, “Character n-gram embeddings to improve rnn
language models,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33,
pp. 5074–5082, 2019.

[4] N.-Q. Pham, G. Kruszewski, and G. Boleda, “Convolutional neural network language
models,” in Proceedings of the 2016 Conference on Empirical Methods in Natural Language
Processing, pp. 1153–1162, 2016.

[5] J. Gao, J. Goodman, M. Li, and K.-F. Lee, “Toward a unified approach to statistical language
modeling for chinese,” ACM Transactions on Asian Language Information Processing
(TALIP), vol. 1, no. 1, pp. 3–33, 2002.

[6] D. Cai and H. Zhao, “Neural word segmentation learning for chinese,” arXiv preprint
arXiv:1606.04300, 2016.

[7] T.-H. Yang, T.-H. Tseng, and C.-P. Chen, “Recurrent neural network-based language models
with variation in net topology, language, and granularity,” in 2016 International Conference
on Asian Language Processing (IALP), pp. 71–74, IEEE, 2016.

[8] J. F. Staal, “Sanskrit and sanskritization,” The Journal of Asian Studies, vol. 22, no. 3,
pp. 261–275, 1963.

[9] M. H. R. Sifat, C. R. Rahman, M. Rafsan, and H. Rahman, “Synthetic error dataset


generation mimicking bengali writing pattern,” in 2020 IEEE Region 10 Symposium
(TENSYMP), pp. 1363–1366, IEEE, 2020.

[10] J. Noyes, “The qwerty keyboard: A review,” International Journal of Man-Machine Studies,
vol. 18, no. 3, pp. 265–281, 1983.


[11] S. Wang, M. Huang, and Z. Deng, “Densely connected cnn with multi-scale feature attention
for text classification.,” in IJCAI, pp. 4468–4474, 2018.

[12] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to
align and translate,” arXiv preprint arXiv:1409.0473, 2014.

[13] M.-T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural
machine translation,” arXiv preprint arXiv:1508.04025, 2015.

[14] Y. Kim, Y. Jernite, D. Sontag, and A. M. Rush, “Character-aware neural language models,”
arXiv preprint arXiv:1508.06615, 2015.

[15] D. Gerz, I. Vulić, E. Ponti, J. Naradowsky, R. Reichart, and A. Korhonen, “Language


modeling for morphologically rich languages: Character-aware modeling for word-level
prediction,” Transactions of the Association for Computational Linguistics, vol. 6, pp. 451–
465, 2018.

[16] R. Al-Rfou, D. Choe, N. Constant, M. Guo, and L. Jones, “Character-level language


modeling with deeper self-attention,” in Proceedings of the AAAI Conference on Artificial
Intelligence, vol. 33, pp. 3159–3166, 2019.

[17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and


I. Polosukhin, “Attention is all you need,” in Advances in neural information processing
systems, pp. 5998–6008, 2017.

[18] K. Irie, A. Zeyer, R. Schlüter, and H. Ney, “Language modeling with deep transformers,”
arXiv preprint arXiv:1905.04226, 2019.

[19] X. Ma, P. Zhang, S. Zhang, N. Duan, Y. Hou, M. Zhou, and D. Song, “A tensorized
transformer for language modeling,” in Advances in Neural Information Processing Systems,
pp. 2232–2242, 2019.

[20] N. UzZaman and M. Khan, “A bangla phonetic encoding for better spelling suggesions,”
tech. rep., BRAC University, 2004.

[21] N. UzZaman and M. Khan, “A double metaphone encoding for approximate name searching
and matching in bangla,” 2005.

[22] P. Mandal and B. M. Hossain, “Clustering-based bangla spell checker,” in 2017 IEEE
International Conference on Imaging, Vision & Pattern Recognition (icIVPR), pp. 1–6,
IEEE, 2017.

[23] N. H. Khan, G. C. Saha, B. Sarker, and M. H. Rahman, “Checking the correctness of bangla
words using n-gram,” International Journal of Computer Application, vol. 89, no. 11, 2014.

[24] P. Etoori, M. Chinnakotla, and R. Mamidi, “Automatic spelling correction for resource-
scarce languages using deep learning,” in Proceedings of ACL 2018, Student Research
Workshop, (Melbourne, Australia), pp. 146–152, Association for Computational Linguistics,
July 2018.

[25] D. Wang, Y. Tay, and L. Zhong, “Confusionset-guided pointer networks for Chinese spelling
check,” in Proceedings of the 57th Annual Meeting of the Association for Computational
Linguistics, (Florence, Italy), pp. 5780–5785, Association for Computational Linguistics,
July 2019.

[26] X. Cheng, W. Xu, K. Chen, S. Jiang, F. Wang, T. Wang, W. Chu, and Y. Qi, “Spellgcn:
Incorporating phonological and visual similarities into language models for chinese spelling
check,” arXiv preprint arXiv:2004.14166, 2020.

[27] S. Zhang, H. Huang, J. Liu, and H. Li, “Spelling error correction with soft-masked bert,”
arXiv preprint arXiv:2005.07421, 2020.

[28] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional
transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.

[29] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and


I. Polosukhin, “Attention is all you need,” in Advances in neural information processing
systems, pp. 5998–6008, 2017.

[30] M. Sifat, H. Rahman, C. R. Rahman, M. Rafsan, M. Rahman, et al., “Synthetic error dataset
generation mimicking bengali writing pattern,” arXiv preprint arXiv:2003.03484, 2020.

[31] T. Mikolov, S. Kombrink, L. Burget, J. Černockỳ, and S. Khudanpur, “Extensions of


recurrent neural network language model,” in 2011 IEEE international conference on
acoustics, speech and signal processing (ICASSP), pp. 5528–5531, IEEE, 2011.

[32] M. Sundermeyer, R. Schlüter, and H. Ney, “Lstm neural networks for language modeling,”
in Thirteenth annual conference of the international speech communication association,
2012.

[33] S. Merity, N. S. Keskar, and R. Socher, “Regularizing and optimizing lstm language models,”
arXiv preprint arXiv:1708.02182, 2017.

[34] T. Mikolov, I. Sutskever, A. Deoras, H.-S. Le, S. Kombrink, and J. Cernocky, “Subword
language modeling with neural networks,” preprint (http://www.fit.vutbr.cz/imikolov/rnnlm/char.pdf),
vol. 8, p. 67, 2012.

[35] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enriching word vectors with subword
information,” Transactions of the Association for Computational Linguistics, vol. 5, pp. 135–
146, 2017.

[36] B. Athiwaratkun, A. G. Wilson, and A. Anandkumar, “Probabilistic fasttext for multi-sense


word embeddings,” arXiv preprint arXiv:1806.02901, 2018.

[37] Z. Assylbekov, R. Takhanov, B. Myrzakhmetov, and J. N. Washington, “Syllable-


aware neural language models: A failure to beat character-aware ones,” arXiv preprint
arXiv:1707.06480, 2017.

[38] S. Moriya and C. Shibata, “Transfer learning method for very deep cnn for text classification
and methods for its evaluation,” in 2018 IEEE 42nd Annual Computer Software and
Applications Conference (COMPSAC), vol. 2, pp. 153–158, IEEE, 2018.

[39] H. T. Le, C. Cerisara, and A. Denis, “Do convolutional networks need to be deep for
text classification?,” in Workshops at the Thirty-Second AAAI Conference on Artificial
Intelligence, 2018.

[40] S. Roy, “Improved bangla language modeling with convolution,” in 2019 1st International
Conference on Advances in Science, Engineering and Robotics Technology (ICASERT),
pp. 1–4, IEEE, 2019.

[41] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional
transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.

[42] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and
V. Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” arXiv preprint
arXiv:1907.11692, 2019.

[43] K. Clark, M.-T. Luong, Q. V. Le, and C. D. Manning, “Electra: Pre-training text encoders
as discriminators rather than generators,” arXiv preprint arXiv:2003.10555, 2020.

[44] J. Xiong, Q. Zhang, S. Zhang, J. Hou, and X. Cheng, “Hanspeller: a unified framework
for chinese spelling correction,” in International Journal of Computational Linguistics &
Chinese Language Processing, Volume 20, Number 1, June 2015-Special Issue on Chinese
as a Foreign Language, 2015.

[45] Y. Hong, X. Yu, N. He, N. Liu, and J. Liu, “Faspell: A fast, adaptable, simple, powerful
chinese spell checker based on dae-decoder paradigm,” in Proceedings of the 5th Workshop
on Noisy User-generated Text (W-NUT 2019), pp. 160–169, 2019.

[46] G. Lewis, “Sentence correction using recurrent neural networks,” Department of Computer
Science, Stanford University, p. 7, 2016.

[47] M.-T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural
machine translation,” arXiv preprint arXiv:1508.04025, 2015.

[48] K. H. Lai, M. Topaz, F. R. Goss, and L. Zhou, “Automated misspelling detection and
correction in clinical free-text records,” Journal of biomedical informatics, vol. 55, pp. 188–
195, 2015.

[49] S. Islam, M. F. Sarkar, T. Hussain, M. M. Hasan, D. M. Farid, and S. Shatabda, “Bangla


sentence correction using deep neural network based sequence to sequence learning,” in
2018 21st International Conference of Computer and Information Technology (ICCIT),
pp. 1–6, IEEE, 2018.

[50] C.-H. Wu, C.-H. Liu, M. Harris, and L.-C. Yu, “Sentence correction incorporating relative
position and parse template language models,” IEEE Transactions on Audio, Speech, and
Language Processing, vol. 18, no. 6, pp. 1170–1181, 2009.

[51] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to
align and translate,” arXiv preprint arXiv:1409.0473, 2014.

[52] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by
reducing internal covariate shift,” in International conference on machine learning, pp. 448–
456, PMLR, 2015.

[53] S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry, “How does batch normalization help
optimization?,” arXiv preprint arXiv:1805.11604, 2018.

[54] M. Rahman, “Prothom-alo.” https://www.prothomalo.com/, 2017.

[55] T. I. Khalidi, “Bdnews24.” https://bdnews24.com/, 2015.

[56] A. Mohiuddin, “Nayadiganta.” https://www.dailynayadiganta.com/, 2019.

[57] S. Pandey, “Hindinews.” https://www.dailyhindinews.com/, 2018.

[58] S. Shekhar, “Livehindustan.” https://www.livehindustan.com/, 2018.

[59] B. Jain, “Patrika.” https://epaper.patrika.com/, 2018.

[60] N. S. Keskar and R. Socher, “Improving generalization performance by switching from


adam to sgd,” arXiv preprint arXiv:1712.07628, 2017.

[61] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors,” arXiv preprint arXiv:1207.0580, 2012.

[62] X. Rong, “word2vec parameter learning explained,” arXiv preprint arXiv:1411.2738, 2014.

[63] A. Thakur, “Hindi OSCAR corpus.” https://www.kaggle.com/abhishek/hindi-oscar-corpus.

[64] Gaurav, “Wikipedia.” https://www.kaggle.com/disisbig/hindi-wikipedia-articles-172k.

[65] O. Bojar, V. Diatka, P. Straňák, A. Tamchyna, and D. Zeman, “HindEnCorp 0.5,” 2014.
LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics
(ÚFAL), Faculty of Mathematics and Physics, Charles University.

[66] L. Barrault, O. Bojar, M. R. Costa-jussà, C. Federmann, M. Fishel, Y. Graham, B. Haddow, M. Huck, P. Koehn, S. Malmasi, C. Monz, M. Müller, S. Pal, M. Post, and M. Zampieri, “Findings of the 2019 conference on machine translation (WMT19),” in Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), (Florence, Italy), pp. 1–61, Association for Computational Linguistics, Aug. 2019.

[67] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,”
in Advances in neural information processing systems, pp. 3104–3112, 2014.

[68] M. A. Islam, M. F. Kabir, K. Abdullah-Al-Mamun, and M. N. Huda, “Word/phrase based answer type classification for Bengali question answering system,” in 2016 5th International Conference on Informatics, Electronics and Vision (ICIEV), pp. 445–448, IEEE, 2016.

[69] D. Kakwani, “AI4Bharat.” https://github.com/ai4bharat, 2020.

[70] S. Saha, F. Tabassum, K. Saha, and M. Akter, Bangla Spell Checker and Suggestion Generator. PhD thesis, United International University, 2019.

[71] N. UzZaman and M. Khan, “A Bangla phonetic encoding for better spelling suggestions,” tech. rep., BRAC University, 2004.

[72] N. UzZaman and M. Khan, “A double metaphone encoding for Bangla and its application in spelling checker,” in 2005 International Conference on Natural Language Processing and Knowledge Engineering, pp. 705–710, IEEE, 2005.

[73] P. Mandal and B. M. Hossain, “Clustering-based Bangla spell checker,” in 2017 IEEE International Conference on Imaging, Vision & Pattern Recognition (icIVPR), pp. 1–6, IEEE, 2017.

[74] L. Gong, D. He, Z. Li, T. Qin, L. Wang, and T. Liu, “Efficient training of BERT by progressively stacking,” in International Conference on Machine Learning, pp. 2337–2346, PMLR, 2019.

[75] M. Rahman, “Prothom-alo.” https://www.prothomalo.com/.

[76] T. I. Khalidi, “Bdnews24.” https://bangla.bdnews24.com/.

[77] A. Mohiuddin, “Nayadiganta.” https://www.dailynayadiganta.com/.

[78] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.

[79] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.

[80] B. Athiwaratkun, A. G. Wilson, and A. Anandkumar, “Probabilistic FastText for multi-sense word embeddings,” arXiv preprint arXiv:1806.02901, 2018.

[81] S. Saha, F. Tabassum, K. Saha, and M. Akter, Bangla Spell Checker and Suggestion Generator. PhD thesis, United International University, 2019.

[82] S. Roy, “Improved Bangla language modeling with convolution,” in 2019 1st International Conference on Advances in Science, Engineering and Robotics Technology (ICASERT), pp. 1–4, IEEE, 2019.

[83] J. Shlens, “A tutorial on principal component analysis,” arXiv preprint arXiv:1404.1100, 2014.

[84] Z. Chen and S. Xia, “K-means clustering algorithm with improved initial center,” in 2009
Second International Workshop on Knowledge Discovery and Data Mining, pp. 790–792,
IEEE, 2009.
Generated using the Postgraduate Thesis LaTeX Template, Version 1.03. Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh.

This thesis was generated on Thursday 5th August, 2021 at 5:47am.
