Revised Final Draft
06/05/2022
Ambo, Ethiopia
AMBO UNIVERSITY
HUNDESSA CAMPUS
SCHOOL OF POST GRADUATE STUDIES
P.O.BOX: 19, AMBO,
OROMIA, ETHIOPIA
Title: Deep Learning Based Sentence Boundary Detection for Afan Oromo Text
Documents
Submitted by:
Approved by:
1. _____________________________ ________________ ______________
List of Figures
List of Acronyms
Abstract
1 Introduction
2 Literature Review
3 Research Methodology
3.1 Introduction
References
List of Tables
Table 1 Tools and Materials
List of Figures
Figure 1 Research design diagram
List of Acronyms
API Application Programming Interface
Abstract
The goal of this study is to develop a deep learning-based method for accurately detecting
sentence boundaries in Afan Oromo text. Sentence boundary detection is the task of determining
where one sentence ends and the next begins. It is an essential preprocessing step for many natural
language processing applications, such as parsing, automatic summarization, information retrieval,
and machine translation. A vast number of text documents are generated from both online and
offline sources, and their unstructured format makes detecting sentence boundaries difficult. Most
researchers have addressed this problem for foreign languages; to the best of our knowledge, this
is the first such study for Afan Oromo. Sentence-ending punctuation in Afan Oromo text can be
ambiguous. A full stop, for example, is sometimes part of an abbreviation, and it can also represent
a decimal point in a number. In these cases the full stop does not end a sentence, so it cannot be
used on its own to delimit sentences, which causes confusion in sentence breaking. To tackle this
problem, we propose a deep learning model for Afan Oromo sentence boundary detection, motivated
by the effectiveness of deep learning algorithms. However, there is no openly accessible Afan
Oromo text dataset for this task. The dataset will therefore be collected from comments and posts
on the Facebook pages of BBC Afan Oromo, OBN Afan Oromo, and FBC Afan Oromo using the
Facepager data scraping tool, and then annotated. In this work, we use several natural language
processing techniques, including text preprocessing: text cleaning, normalization, tokenization,
stop word removal, part-of-speech tagging, stemming, and lemmatization. FastText is used for
feature extraction on the Afan Oromo sentence boundary detection dataset. The dataset will be
divided into training and testing sets. Different deep learning algorithms will be trained in order
to find the best-performing one; bidirectional long short-term memory (Bi-LSTM) and
convolutional neural network (CNN) models are considered for comparison. The performance of
the proposed model will be evaluated using different evaluation metrics.
Key Words: sentence boundary detection, Afan Oromo, social media, CNN, Bi-LSTM, deep
learning
1 Introduction
1.1 Background of the Study
Language is a system of conventional spoken words, in which ideas are expressed through speech
sounds that are joined to form words. Words are then combined to form sentences that convey a
meaningful idea. A sentence is commonly used as the input unit in many natural language
processing (NLP) systems. Segmenting text into sentences is therefore one of the most frequently
needed steps, because the sentence is the basic textual unit immediately above the word and
phrase. Different natural language applications, such as POS tagging, syntactic, semantic, and
discourse parsing, information extraction, and machine translation, use sentences as input.
Sentence Boundary Detection (SBD) is also needed for chatbots, named entity recognition, and
coreference resolution (Sadvilkar & Neumann, 2020).
All of these language processing systems require accurate sentence boundaries in their input text.
Sentence End Detection, also known as Sentence Boundary Disambiguation (SBD) or boundary
detection, is the Natural Language Processing (NLP) task of recognizing where a sentence begins
and ends (Tuggener & Aghaebrahimian, 2021). Linguistic computation studies are expanding in
tandem with the growing and widely available amount of text data (Raharjo et al., 2018).
Sentence boundary detection can aid in the preprocessing step and increase performance results.
In a language, the use of sentence boundaries is determined by the nature of the language in
general and the grammatical context in particular. Sentence Boundary Detection has been studied
in other languages like English, Portuguese, French, Vietnamese, Chinese, Japanese, and
Marathi. In Afan Oromo, sentence boundary detection has not been studied yet. Sentence
Boundary Detection is crucial as a preprocessing phase of many natural language processing
tasks.
Many algorithms have been used to recognize sentence boundaries in various languages. Recent
SBD research focuses on the use of machine learning techniques such as decision trees, neural
networks, hidden Markov models, maximum entropy, and conditional random fields (Wong et
al., 2014). Afan Oromo is one of the major indigenous African languages that is widely spoken
and used in most parts of Ethiopia and some parts of neighbouring countries like Kenya and
Somalia (Beekan, n.d.).
In this work, we focus on resolving the ambiguity of sentence end markers in Afan Oromo text
and will carry out several experiments to detect sentence boundaries using deep learning-based
strategies. Punctuation marks are also often ambiguous in these types of texts, in particular
between being used as actual punctuation and being used for emphasis, which creates great
challenges for sentence boundary detection.
1.2 Motivation of the Study
The main motivation for this research is to develop models that detect sentence boundaries in
Afan Oromo text, in order to improve the performance of NLP applications that take sentences
as their basic input, such as POS tagging, syntactic, semantic, and discourse parsing,
information extraction, and machine translation. A full stop is often treated as an indicator of
the end of a sentence, but it is also used to denote an abbreviation, an acronym, a decimal point,
or an ellipsis, which makes the disambiguation problem more complicated. To the best of our
knowledge, no sentence boundary detection research to date has addressed this problem for Afan
Oromo. Despite its importance in natural language processing, sentence boundary detection for
local languages has received little attention. A sentence boundary detector can therefore help
scholars, especially those working on Afan Oromo NLP tasks that take sentences as input, to
improve the performance of their applications.
1.3 Statement of the Problem
Sentence boundary detection is used to split a document into its sentences (Santoso et al., 2021).
Many natural language processing systems take a sentence as their input unit. Sentence
boundary detection can be a preprocessing task for POS tagging (Manning, 2011) and machine
translation (Liu et al., 2004). Sentence-ending punctuation marks such as the full stop,
exclamation mark, and question mark often appear inside a sentence (Rehman & Anwar, 2012),
so it is difficult to determine whether a given punctuation mark is a sentence marker or not.
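The difficulty can be illustrated with a minimal Python sketch of a naive splitter that treats every full stop, exclamation mark, and question mark as a sentence end. The function name `naive_split` and the sample text are illustrative assumptions only; they are not taken from the study's dataset or method.

```python
import re

def naive_split(text):
    """Split after every '.', '!' or '?', with no disambiguation.

    Text after the last end marker is ignored in this sketch.
    """
    pieces = re.findall(r"[^.!?]+[.!?]", text)
    return [p.strip() for p in pieces]

# Illustrative sample: two real sentences, containing an abbreviation
# ("Dr.") and a decimal number ("3.5").
sample = "Dr. Tola paid 3.5 birr. He left early."
pieces = naive_split(sample)
print(pieces)
# ['Dr.', 'Tola paid 3.', '5 birr.', 'He left early.']
```

The naive rule returns four pieces for a text that contains only two sentences, which is the kind of error a learned boundary detector is expected to avoid.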
There have been various studies on sentence boundary detection, but they have all been done for
other languages, such as English, French, Italian, Portuguese, and Swedish. These studies
demonstrate that SBD is a necessary and fundamental component for the development of
most NLP applications. The unavailability of an Afan Oromo SBD system is the key issue that
needs to be addressed. As a result, developing such a system for Afan Oromo is critical for
enabling the development of Afan Oromo-related NLP applications.
Therefore, since to the best of our knowledge no work has yet addressed Afan Oromo sentence
boundary detection, we set out to study and address sentence boundary detection for
Afan Oromo texts.
1.4 Research Questions
This research work intends to answer the following research questions:
RQ1: How can the dataset be prepared to obtain better word representations for Afan Oromo SBD?
RQ2: How can a model for Afan Oromo SBD be developed using deep learning?
RQ3: Which deep learning algorithm performs best for Afan Oromo SBD?
RQ4: How can the overall performance be enhanced?
1.5 General and Specific Objectives
1.5.1 General Objective
The general objective of this study is to develop an Afan Oromo sentence boundary detection
model using a deep learning approach.
To compare the performance of deep learning algorithms
To evaluate the performance using a test set.
2 Literature Review
2.1 Related Works
Santoso et al. (2021) propose Indonesian sentence boundary detection using deep learning
approaches. Their work uses a deep learning approach, Bi-LSTM, to split an Indonesian news
document into sentences. The corpus was built by crawling the news sites Detik and Kompas.
The architecture consists of three layers: an input layer, bidirectional LSTM cells, and an output
layer. The input layer is a token embedding that converts tokens into vectors, using Skip-Gram
Word2Vec as the embedding model. This study achieved an F1-score of 98.49%.
Sirirattanajakarin et al. (2020) studied a bidirectional LSTM-CNN model for Thai sentence
segmentation. They proposed the BoydCut models, which consist of a deep learning-based model
and Word2Vec. Word2Vec is used to build the word embedding, and the NLP framework
identifies sentence boundaries with a bidirectional LSTM-CNN model that learns word-level
sequences in sentences and features extracted at the character level. They used two datasets.
The first is the Orchid dataset, which is used in various studies in the Thai NLP field. The second
is scb-mt-en-th-2020, a collection of Thai-English translation sentence pairs.
This study achieved an F1-score of 79.2%.
Mekki et al. (2021) studied “Sentence boundary detection of various forms of Tunisian Arabic”.
The study addressed the problem of SBD for dialectal Arabic, especially the Tunisian dialect.
They prepared a corpus from varied sources and used deep neural networks, support vector
machines, and conditional random fields to detect the boundaries of sentences written in
different types of dialect. These achieved F1-scores of 82.78%, 80.21%, and 84.37%, respectively.
Rehman and Anwar (2012) propose a hybrid approach for Urdu sentence boundary
disambiguation, comprising a unigram statistical model and a rule-based algorithm. Tagged data
was used to train the unigram model. They obtained 99.36% precision, 96.45% recall, and
97.89% F1-measure. They used the Europarl corpus as a data source, which consists of
transcripts of the sessions of the European Parliament in multiple languages.
Author | Title | Method | Dataset | Result | Remark
Santoso et al., 2021 | Indonesian Sentence Boundary Detection using Deep Learning Approach | Bi-LSTM | Detik and Kompas news sites | F1-score of 98.49% | -
Sirirattanajakarin et al., 2020 | Bidirectional LSTM-CNN Model for Thai Sentence Segmenter | CNN | Orchid dataset | F1-score of 79.2% | Character feature from CNN performs lower than character dictionary feature from BiLSTM
Mekki et al., 2021 | Sentence boundary detection of various forms of Tunisian Arabic | DNN, SVM, CRF | Varied sources | F1-scores of 82.78%, 80.21%, and 84.37% for DNN, SVM, and CRF respectively | -
Rehman & Anwar, 2012 | A hybrid approach for Urdu sentence boundary disambiguation | Unigram statistical model and rule-based algorithm | Europarl | F1-score of 97.89% | Error rates produced; data size
3 Research methodology
3.1 Introduction
In this section, dataset preparation, feature representation, and the materials to be employed are
discussed.
video, and audio because we only need plain text. The dataset will be split into training and
testing sets, which are used to train and test the learning algorithm.
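A minimal sketch of such a split is given here, using only the Python standard library. The 80/20 ratio, the fixed seed, and the function name `train_test_split` are assumptions made for illustration; the proposal does not state the ratio that will be used.

```python
import random

def train_test_split(examples, test_ratio=0.2, seed=42):
    """Shuffle the examples reproducibly and split them into two lists."""
    items = list(examples)
    random.Random(seed).shuffle(items)   # fixed seed gives a repeatable split
    n_test = int(round(len(items) * test_ratio))
    return items[n_test:], items[:n_test]

# Ten placeholder examples stand in for annotated posts or comments.
train, test = train_test_split(range(10))
print(len(train), len(test))  # 8 2
```

Shuffling before splitting avoids a test set made up only of the most recently collected posts.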
3.2.2 Pre-processing
Preprocessing is the first step in sentence boundary detection. The preprocessing stage includes
text cleaning, which removes unnecessary content and makes the data more representative,
followed by tokenization, stop word removal, stemming, and lemmatization.
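The cleaning, tokenization, and stop word removal steps could look roughly like the following sketch. The regular expressions, the function names, the sample text, and the one-word stop list are all illustrative assumptions rather than the study's actual resources; stemming and lemmatization are omitted because they would need Afan Oromo-specific tools.

```python
import re

URL_RE = re.compile(r"https?://\S+")
# Words (with apostrophes), numbers with an optional decimal part, and
# punctuation marks are kept as separate tokens.
TOKEN_RE = re.compile(r"[A-Za-z']+|\d+(?:\.\d+)?|[.!?,;:]")

def clean(text):
    """Remove URLs and collapse repeated whitespace."""
    text = URL_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

def tokenize(text):
    return TOKEN_RE.findall(text)

def preprocess(text, stop_words=frozenset()):
    """Clean, tokenize, and drop tokens found in the stop word set."""
    return [t for t in tokenize(clean(text)) if t not in stop_words]

# Placeholder stop list: a real list would come from a vetted resource.
tokens = preprocess("Akkam jirta?   https://example.com  Gatiin isaa 3.5 dha.",
                    stop_words={"dha"})
print(tokens)
# ['Akkam', 'jirta', '?', 'Gatiin', 'isaa', '3.5', '.']
```

Punctuation marks are deliberately kept as tokens in this sketch, since they are the candidate sentence boundaries, and the decimal number stays in one piece.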
3.3.3 Data Annotation
A human annotator adds categories, labels, and other contextual elements to a raw dataset
so that machines can read and act upon the information. We will assign several annotators,
selected based on their background in and understanding of the Afan Oromo language and their
willingness to perform the task.
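One possible way to turn annotated sentences into training labels is sketched here: every token receives a label of 1 if it is the last token of a sentence and 0 otherwise. This binary labeling scheme, the function name `label_tokens`, and the sample tokens are assumptions for illustration; the proposal does not fix the annotation format.

```python
def label_tokens(sentences):
    """Flatten annotated sentences into parallel token and label lists.

    `sentences` is a list of token lists, one per annotated sentence.
    A label of 1 marks the token that ends a sentence, 0 any other token.
    """
    tokens, labels = [], []
    for sent in sentences:
        for i, tok in enumerate(sent):
            tokens.append(tok)
            labels.append(1 if i == len(sent) - 1 else 0)
    return tokens, labels

# Two hypothetical annotated sentences.
annotated = [["Gatiin", "isaa", "3.5", "dha", "."], ["Akkam", "jirta", "?"]]
tokens, labels = label_tokens(annotated)
print(labels)  # [0, 0, 0, 0, 1, 0, 0, 1]
```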
Feature extraction is a step in the dimensionality reduction process, which reduces a large set of
raw data to more manageable groups of features and so limits redundancy. Dimensionality
reduction is the technique of lowering the number of features under consideration by obtaining a
set of main or relevant features. Since computers and machine learning models operate only on
numeric data, we need to convert text data into numeric data. To analyze the text we want to
process, we must therefore come up with a numerical representation of each word.
3.4 Feature representation
3.4.1 Word Embedding
Document classification involves the transformation of documents into feature vectors. Word
embedding is a distributed representation in which words with similar meanings receive similar
vectors, so that they can be used by machine learning models. The most common techniques for
building word embeddings are Word2Vec, GloVe, and FastText.
3.4.1.1 FastText
FastText considers not only the word itself but also groups of characters from that word, that is,
sub-word information such as character unigrams and bigrams, when learning word
representations (Diriba Gichile, 2021). For this study, FastText is considered as the word
representation.
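The sub-word idea can be sketched in a few lines of Python. The helper `char_ngrams` is a hypothetical illustration of how a word is broken into character n-grams after being wrapped in the boundary markers `<` and `>`; it is not the fastText library itself. The default range of 3 to 6 characters follows the fastText library's documented defaults and is an assumption here, not a setting stated in this proposal.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in '<' and '>'."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of length n over the wrapped word.
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

print(char_ngrams("afaan", 3, 3))
# ['<af', 'afa', 'faa', 'aan', 'an>']
```

Because a word vector is built from the vectors of its n-grams, a word that never appeared in training can still receive a representation from the n-grams it shares with known words.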
3.5 Feature modeling
After the feature set is complete, we must create a model that can be trained to learn from the
data. This model is deep learning-based.
3.6 Performance Evaluation Metrics
Several metrics can be used to evaluate the detection performance of an SBD system. For
performance measurement, we use the following three measures, which are based on counts of
true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).
Recall: the proportion of all true boundaries that are detected by the evaluated system. It is
computed as the number of true positives divided by the number of true positives plus false
negatives.
$$\text{Recall} = \frac{TP}{TP + FN}$$
Precision: computed as the number of true positives divided by the number of true positives plus
false positives.
$$\text{Precision} = \frac{TP}{TP + FP}$$
F1 score: a measure of a model's accuracy on a dataset, defined as the harmonic mean of
precision and recall. It is used to evaluate binary classification systems, which classify examples
as positive or negative.
$$F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
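The three measures can be computed directly from the positions of the gold and predicted boundaries, as in the following sketch. The function name `prf1` and the sample positions are illustrative assumptions; returning 0.0 when a denominator is zero is a convention chosen for this example.

```python
def prf1(gold, predicted):
    """Precision, recall, and F1 for two collections of boundary positions."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)    # boundaries found correctly
    fp = len(predicted - gold)    # predicted boundaries that are wrong
    fn = len(gold - predicted)    # true boundaries that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical token positions of sentence ends.
gold = [4, 9, 15]
predicted = [4, 9, 12, 20]
p, r, f = prf1(gold, predicted)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.5 0.667 0.571
```

Here TP = 2, FP = 2, and FN = 1, so precision is 2/4, recall is 2/3, and F1 is 4/7.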
3.7 Programming Language and Development Tools
This section describes how the models will be implemented and the tools needed for
experimentation.
Facepager: an open-source tool for collecting public data from social media platforms such
as Facebook, Twitter, or YouTube (Ermias, 2021).
LibreOffice: a powerful, feature-rich office suite with a clean interface, used in designing the
research.
Anaconda Navigator: helps with creating separate environments, so that models can be trained
with different Python and package versions.
3.8 Tool Selection
No. | Tools | Purpose
1 | Laptop computer | Proposal writing and implementation
2 | Anaconda Python | Modeling
3 | Pen and paper | Draft idea writing
4 | Microsoft Office | Document editing
Table 1 Tools and Materials
Item | Unit | Quantity | Unit cost | Total cost | Remarks
Binding and Printing | 4 | - | 356 | 4*356 = 1424 | -
Per diem for the researcher | 7 days | - | 468 | 7*468 = 3276 | -
Contingency | - | - | - | 300 | -
Laptop computer | 1 | - | 2500 | 2500 | -
Total | - | - | - | 29700 | -
Table 2 Budget Plan
References
Beekan, E. (n.d.). Beekan Erena. Retrieved January 19, 2022,
from https://scholar.harvard.edu/erena/oromo-language-afaan-oromoo
Diriba Gichile. (2021). Afaan Oromo MultiLabel News Text Classification Using Convolutional
Neural Network (CNN). 4(1), 18.
Ermias, N. (2021). Fake News Detection for Amharic Language Using Deep Learning. January.
Liu, Z., Wang, H., & Wu, H. (2004). Example-Based Machine Translation Based on
Tree-String Correspondence and Statistical Generation. 1, 1–15.
Mandefiro, L. (2010). Named Entity Recognition for Afan Oromo. October.
Manning, C. D. (2011). Part-of-speech tagging from 97% to 100%: Is it time for some
linguistics? Lecture Notes in Computer Science (Including Subseries Lecture Notes in
Artificial Intelligence and Lecture Notes in Bioinformatics), 6608 LNCS, 171–189.
https://doi.org/10.1007/978-3-642-19400-9_14
Mekki, A., Zribi, I., Ellouze, M., & Belguith, L. H. (2021). Sentence boundary detection of
various forms of Tunisian Arabic. Language Resources and Evaluation.
https://doi.org/10.1007/s10579-021-09538-4
Raharjo, S., Wardoyo, R., & Putra, A. E. (2018). Rule based sentence segmentation of
Indonesian language. Journal of Engineering and Applied Sciences, 13(21), 8986–8992.
https://doi.org/10.3923/jeasci.2018.8986.8992
Rehman, Z., & Anwar, W. (2012). A hybrid approach for Urdu sentence boundary
disambiguation. International Arab Journal of Information Technology, 9(3), 250–255.
Sadvilkar, N., & Neumann, M. (2020). PySBD: Pragmatic Sentence Boundary Disambiguation.
110–114. https://doi.org/10.18653/v1/2020.nlposs-1.15
Santoso, J., Setiawan, E. I., Purwanto, C. N., & Kurniawan, F. (2021). Indonesian Sentence
Boundary Detection using Deep Learning Approaches. Knowledge Engineering and Data
Science, 4(1), 38. https://doi.org/10.17977/um018v4i12021p38-48
Tuggener, D., & Aghaebrahimian, A. (2021). The sentence end and punctuation prediction in
NLG text (SEPP-NLG) shared task 2021. CEUR Workshop Proceedings, 2957(2016).
Wong, D. F., Chao, L. S., & Zeng, X. (2014). iSentenizer-μ: Multilingual sentence boundary
detection model. The Scientific World Journal, 2014. https://doi.org/10.1155/2014/196574