
Journal of King Saud University – Computer and Information Sciences 35 (2023) 101709


DAQAS: Deep Arabic Question Answering System based on duplicate question detection and machine reading comprehension

Hamza Alami (a,*), Abdelkader El Mahdaouy (a), Abdessamad Benlahbib (b), Noureddine En-Nahnahi (b), Ismail Berrada (a), Said El Alaoui Ouatik (c)

(a) School of Computer Science, Mohammed VI Polytechnic University, Ben Guerir 43150, Morocco
(b) Laboratory of Informatics, Signals, Automatics, and Cognitivism (LISAC), Faculty of Sciences Dhar El Mahraz, Sidi Mohammed Ben Abdellah University, Fez 30003, Morocco
(c) Laboratory of Engineering Sciences, National School of Applied Sciences, Ibn Tofail University, Kenitra, Morocco

* Corresponding author.

Article history:
Received 14 February 2023
Revised 23 July 2023
Accepted 6 August 2023
Available online 12 August 2023

Keywords:
Question answering systems
Neural networks
Transformers
Natural language processing
Duplicate question detection

Abstract

As of late, various deep learning techniques and methods have shown their superiority to feature-based and shallow learning techniques in the field of open-domain question-answering systems (OpenQAS). However, only a few works have adopted these techniques to build Arabic OpenQAS that can extract exact answers from large information sources (e.g., Wikipedia). In addition, no available Arabic OpenQAS has integrated a module that identifies duplicate questions to accelerate response time and reduce computation cost. In this paper, we propose an Arabic OpenQAS (named DAQAS) based on deep learning methods. It consists of three components: (1) Dense Duplicate Question Detection, which returns answers to questions that have already been answered; (2) a Retriever based on BM25 and query expansion by neural text generation; and (3) a Reader able to extract exact answers given a question and the retrieved passages that probably contain the answer. All components of our system integrate deep learning models, especially transformer-based techniques, which have achieved state-of-the-art results in different NLP fields. We performed several experiments with publicly available question answering datasets to show the effectiveness of our system. DAQAS obtained promising results, scoring 21.77% Exact Match and 54.71% F1 when using only the top 5 retrieved passages.

© 2023 The Author(s). Published by Elsevier B.V. on behalf of King Saud University. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

1. Introduction

Question Answering (QA) systems aim to provide precise answers to users' questions. Building a QA system is challenging since it needs to find answers in a huge amount of both structured and unstructured data. Approaches to QA can be divided into two broad classes according to the answers' source: Knowledge-Based Question Answering (KBQA) and Document-Based Question Answering (DBQA). KBQA systems apply specific techniques able to extract answers from structured data such as knowledge bases, while DBQA systems aim at extracting answers from unstructured data such as raw textual documents. Typically, a DBQA system can be designed to answer restricted- or open-domain questions. Restricted-domain QA systems concentrate on a specialized field and employ particular linguistic resources to enhance the system's performance. Open-domain QA systems rely on huge amounts of text from the web or collections of documents like Wikipedia to answer questions from varied areas. Traditional open-domain DBQA systems were commonly built as pipelines comprising various modules such as question processing, passage retrieval, and answer processing (Abouenour et al., 2012; Kurdi et al., 2014; Bekhti and Al-Harbi, 2013; Hamza et al., 2021). With the recent advances in neural network techniques, modern open-domain DBQA systems (OpenQAS) follow a new structure, which combines traditional Information Retrieval (IR) methods with machine reading comprehension (MRC) models. The objective of MRC is to construct models able to read a passage of text and answer comprehension questions (Zeng et al., 2020; Mozannar et al., 2019). Fig. 1 illustrates the new structure of OpenQAS.

The advancements in OpenQAS in English and some Latin-based languages are highly promising. This is due to the introduction and usage of deep learning techniques, namely transformers (Vaswani et al., 2017), BERT (Devlin et al., 2018), GPT-X (Radford et al., 2018, 2020), and T5 (Raffel et al., 2020), in the field of natural language processing. Despite that, few works related to Arabic OpenQAS have integrated neural network techniques in their pipeline to extract exact answers to natural Arabic questions (Mozannar et al., 2019; Malhas and Elsayed, 2022). In this paper, we propose an Arabic OpenQAS based on neural network approaches. It comprises three main modules: 1) Dense Duplicate Question Detection, 2) Retriever, and 3) Reader. First, we build a Dense Duplicate Question Detection module to answer previously answered questions as quickly as possible (Hamza et al., 2020). This component is also designed to index a large number of previously answered questions, i.e., it is able to store a huge number of exact answers. As far as we know, we are the first to propose a dense duplicate question detection module in an Arabic OpenQAS. Second, following the architecture of neural OpenQAS (Retriever + Reader), we build a Retriever that aims at retrieving passages relevant to a given question, using Arabic Wikipedia as the information source. At its core, this component is an IR system: queries are constructed from questions, then each query is expanded with a context generated by neural generation techniques (AraGPT2-base, AraGPT2-large (Antoun et al., 2021), and mT5-small (Xue et al., 2020)). The new query is then passed to a BM25 retrieval model to retrieve the top k relevant passages. Third and last, an extractive model is trained to predict the answer span (i.e., the start and end of an answer) from the retrieved passages. We fine-tuned different BERT-based encoders proposed for the Arabic language (Devlin et al., 2018; Antoun et al., 2020; Lan et al., 2020; Abdul-Mageed et al., 2021) with the objective of answer span prediction. The final answer is then chosen according to the retriever and reader scores. To evaluate our proposed system, we built a large dataset for Arabic OpenQA by combining existing datasets, namely ARCD (Mozannar et al., 2019), Arabic SQuAD (Mozannar et al., 2019), MLQA (Lewis et al., 2020), XQuAD (Artetxe et al., 2020), and TyDi QA (Clark et al., 2020). We performed various evaluations on each module to show the effectiveness of our system. The proposed system (DAQAS) obtained promising results and scored 21.77% Exact Match and 54.71% F1 when using only the top 5 retrieved passages.

Our contributions can be summarized as follows:

• We propose DAQAS, an open-domain question answering system for the Arabic language based on duplicate question detection and machine reading comprehension techniques.
• We integrated a duplicate question detection component. This component leverages question embeddings and FAISS to provide exact answers to previously known questions. To our knowledge, no available Arabic OpenQAS had integrated a duplicate question detection component in its pipeline.
• We applied text generation techniques to improve the performance of our retriever module, thus improving the quality of our system. We build queries using the generated contexts and the questions, and retrieve relevant passages based on the formulated queries and BM25.
• We built a dataset for Arabic OpenQAS by fusing Arabic question and passage pairs from various datasets, including ARCD, Arabic SQuAD, MLQA, XQuAD, and TyDi QA.
• We performed extensive experiments to show the effectiveness of each component of DAQAS.

Fig. 1. New structure of open-domain DBQA.

The rest of the paper is organized as follows: Section 2 discusses related work in the field of Arabic question answering systems. Section 3 describes DAQAS, our proposed Arabic open-domain question answering system. Section 4 presents the obtained results and the performance evaluations of the proposed system. Section 5 concludes and outlines future work.

2. Related work

Various Arabic QAS have been proposed in the literature. In the next paragraphs, we describe various deep learning text representations that reshaped the NLP field, along with existing end-to-end Arabic QAS.

2.1. Deep learning text representation

Recently, impressive progress has been made in various natural language processing tasks such as machine translation, text classification, and OpenQAS. Numerous factors contributed to these successes, including 1) the advances in computing resources and hardware; 2) the development of neural network-based methods that significantly surpass the performance of previous techniques; 3) the availability of large volumes of data designed to train these systems; and 4) the advancement in neural word representations, which allow mapping words from their textual form into a continuous and distributed vector representation. Considering that word representation is one of the main factors in these advancements, we discuss the most well-known word representations, including Word2Vec (Mikolov et al., 2013), ELMo (Peters et al., 2018), BERT (Devlin et al., 2018), GPT-X models (Radford et al., 2019; Brown et al., 2020), and T5 (Raffel et al., 2020), in the following:

2.1.1. Word2Vec

These representations capture a large number of syntactic and semantic relationships between words, and the size of the generated vectors is generally between 100 and 300. The idea behind this model is that each word representation is extracted from its context (past words, future words). The authors (Mikolov et al., 2013) proposed two architectures: (1) CBOW; (2) Continuous Skip-gram.

1. Continuous Bag Of Words (CBOW). The idea behind this model is to predict a word w given its context window, that is, a number of words before and after w. The model is a neural network with input, projection, and output layers. The word representation is learned by maximizing the log probability of predicting the current word given its context. The vectors of the context words are averaged to get projected into the same position, so the order of past and future words does not affect the quality of the projection. For a given word $w_t$ in the corpus and its context $\{w_{t-c}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+c}\}$, the model maximizes the following expression:

$$\frac{1}{V} \sum_{t=1}^{V} \log P(w_t \mid w_{t-c}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+c}) \qquad (1)$$

where V and c are respectively the number of words in the corpus and the context size of $w_t$.

2. Continuous Skip-gram Model. The aim is to predict the context words of a word w given the word itself. The model is slightly different from the CBOW model: instead of predicting the current word given its context, it is trained to maximize the log probability of predicting the context for a given word. Note that increasing the size of the context should enhance the quality of the obtained word vectors, but it may also raise the computational complexity. More formally, given a sequence of training words $w_1, w_2, \ldots, w_V$, the Skip-gram model maximizes the average log probability:

$$\frac{1}{V} \sum_{t=1}^{V} \sum_{-c \le j \le c,\ j \ne 0} \log P(w_{t+j} \mid w_t) \qquad (2)$$

where V and c are respectively the number of words in the corpus and the context size of $w_t$.
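To make the two objectives concrete, the following is a minimal sketch (not from the paper) of training both Word2Vec variants with the gensim library; the toy corpus and hyperparameters are purely illustrative.

```python
# Minimal sketch: training CBOW vs. Skip-gram with gensim on a toy corpus.
from gensim.models import Word2Vec

corpus = [
    ["question", "answering", "systems", "extract", "answers"],
    ["the", "retriever", "returns", "relevant", "passages"],
]

# sg=0 -> CBOW (predict the current word from its averaged context, Eq. 1);
# sg=1 -> continuous Skip-gram (predict context words from the word, Eq. 2).
cbow = Word2Vec(corpus, vector_size=100, window=2, sg=0, min_count=1)
skip = Word2Vec(corpus, vector_size=100, window=2, sg=1, min_count=1)

print(cbow.wv["answers"].shape)               # (100,) - a dense word vector
print(skip.wv.most_similar("retriever", topn=2))
```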
2.1.2. Embeddings from language models

The ELMo representation was introduced in (Peters et al., 2018). All the words of a sentence are passed to a neural network composed of two main layers: 1) a one-dimensional convolutional neural network with different filter sizes that computes word embeddings from character-level embeddings; 2) a stack of two BiLSTM layers. The network is trained on large textual corpora to optimize a language model objective. The ELMo representation of a word k, denoted $ELMo_k$, is calculated by the following equation:

$$ELMo_k = \frac{1}{3} \sum_{j=0}^{2} h_{k,j}^{(LM)} \qquad (3)$$

where $h_{k,j}^{(LM)}$ is the output of hidden layer j of the neural network. Fig. 2 depicts the overall design of the ELMo technique.

Fig. 2. The overall architecture of Embeddings from Language Models.

A sentence S, that is a succession of l words, is modelled by the matrix:

$$S = [ELMo_{word_1}, \ldots, ELMo_{word_l}] \qquad (4)$$
2.1.3. Bidirectional encoder representations from transformers

BERT and its kin have achieved state-of-the-art performance on several downstream NLP tasks. Fig. 3 illustrates the architecture of BERT. The model is based on the well-known transformer encoder (Vaswani et al., 2017), which we will not survey in detail. The input of BERT is a sequence of tokens generated by the WordPiece model, with the special token [CLS] added at the beginning of every sequence. The hidden layers are a set of bidirectional transformer layers (Vaswani et al., 2017), and the pre-training uses two objectives: 1) MLM: inspired by the cloze procedure (Taylor, 1953), a random sample of 15% of the input tokens is selected for possible substitution; of the chosen tokens, 80% are replaced with the special token [MASK], 10% are replaced with a random WordPiece token, and 10% are left unaltered. The MLM objective function is a cross-entropy loss on predicting the masked tokens. 2) NSP: the model learns, as a binary classification task, whether two segments follow each other in the original corpora.

Fig. 3. The architecture of BERT model.
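As a rough illustration of the MLM corruption scheme, the following sketch applies the 15% selection and the 80/10/10 replacement rule to a list of token ids; the ids, vocabulary size, and function name are made up for the example and are not BERT's actual implementation.

```python
# Sketch of BERT-style MLM corruption: 15% of tokens are chosen; of those,
# 80% become [MASK], 10% a random token, 10% stay unchanged.
import random

def mlm_corrupt(token_ids, mask_id, vocab_size, select_prob=0.15):
    inputs, labels = [], []
    for tok in token_ids:
        if random.random() < select_prob:
            labels.append(tok)          # the model must recover this token
            r = random.random()
            if r < 0.8:
                inputs.append(mask_id)                        # 80%: [MASK]
            elif r < 0.9:
                inputs.append(random.randrange(vocab_size))   # 10%: random
            else:
                inputs.append(tok)                            # 10%: unchanged
        else:
            inputs.append(tok)
            labels.append(-100)         # ignored by the cross-entropy loss
    return inputs, labels

print(mlm_corrupt([12, 57, 301, 88, 964], mask_id=103, vocab_size=30522))
```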
2.1.4. Generative pre-training models (GPT-1, 2, 3, 4)

In general, Generative Pre-Training (GPT-X) models are based on the decoder part of the transformer (Vaswani et al., 2017) due to its autoregressive nature. These models are trained on the causal language modeling (CLM) objective. Fig. 4 depicts the architecture of GPT-based models. Several versions of this model exist: 1) GPT-1 (Radford et al., 2019) is the first model that showed that pre-training on the CLM objective enhances a model's generalization abilities. 2) GPT-2 revealed that the model's size and the dataset size can increase performance; for instance, a larger model trained with a larger dataset surpasses the state of the art on various tasks in a zero-shot setting (i.e., the model solves a task without receiving any training on that task). 3) GPT-3 and GPT-4 (Brown et al., 2020; OpenAI, 2023) embraced the idea of few-shot learning: the model is trained on the CLM objective and then fine-tuned on a specific task. GPT-3 contains 175 billion parameters and was trained on terabytes of raw text; some of its uses include website code generation, SQL query generation, code completion, etc. However, scaling model sizes and training datasets comes at the cost of a high computational budget: training GPT-2 and GPT-3 costs about $43K and $4.6M, respectively.

Fig. 4. The architecture of GPT model.

2.1.5. Text-to-text transfer transformer

Text-to-Text Transfer Transformer (T5) (Raffel et al., 2020) is a unified approach that treats every text processing problem as a "text-to-text" problem, i.e., taking text as input and producing new text as output. T5 employs a modified version of the encoder-decoder transformer architecture (Vaswani et al., 2017). The model is trained on a denoising objective, i.e., it is trained to predict missing or otherwise corrupted tokens in the input. In the fine-tuning phase, a prefix is added to the input to select the downstream task. For instance, given the input "What is the type of the question: When was Fez built?", the T5 prediction should be the class "Time", while given the input "Answer the question: When was Fez built?", it should generate the answer "9th century". Fig. 5 shows a diagram of the T5 framework. The model generates text for various tasks, including question answering, text classification, and machine translation.
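As an illustration of the text-to-text interface, here is a hedged sketch using the HuggingFace Transformers library with the public t5-small checkpoint; the prompt is one of T5's standard task prefixes, not an example from the paper.

```python
# Sketch: T5's prefix-based text-to-text interface via HuggingFace.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# A task-selecting prefix is prepended to the input text.
inputs = tokenizer("translate English to German: The house is big.",
                   return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```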

2.2. Arabic QAS

QARAB (Hammo et al., 2002) is the first Arabic QAS that applied IR and NLP methods to extract answers to Arabic questions. The following steps are executed to answer a question: 1) processing the question; 2) retrieving the candidate documents that probably contain the answer, using an IR system; 3) extracting sentences that may contain the answer from the retrieved documents. The IR system is based on Salton's vector space model (Salton, 1971). QARAB does not handle two question types, ماذا and كيف (How and Why). A set of unstructured documents extracted from the newspaper Al-Raya (a newspaper published in Qatar) is used as the information source. The system was evaluated on 113 questions and a set of documents from the Al-Raya newspaper. The authors did not report any information regarding accuracy, precision, or recall.

ArabiQA (Benajiba et al., 2007b) is an Arabic OpenQAS that relies on the Arabic NER proposed in (Benajiba et al., 2007a). The system consists of four main components: 1) Question Analysis Component: given a question, it extracts the question type, builds a query, and detects named entities appearing in the question; 2) Passage Retrieval Component: uses JIRS (https://sourceforge.net/projects/jirs/) to retrieve a list of passages ranked by their relevance to the query generated by the question analysis component; 3) Answer Extraction Component: considers the question type, named entities, and the relevant passages to extract candidate answers; this module handles only Arabic factoid questions; 4) Answers Validation Component: ranks all the candidate answers according to their probability of containing the correct answer. To test ArabiQA, a set of 200 questions was used, with 11,000 Arabic Wikipedia articles as the information source. The ArabiQA system achieved about 83.3% precision.

QASAL (Brini et al., 2009) is designed to answer factoid and definition questions. The system is based on the NooJ platform (Silberztein et al., 2012) and is composed of three main components: a Question Analysis module, a Passage Retrieval module, and an Answer Extraction module. Contrary to previous works, this system includes NooJ's linguistic engine in all stages. QASAL was tested with 43 definition questions and the Google search engine as an information source. It achieved 100% recall and 94% precision.

AQUASYS (Bekhti and Al-Harbi, 2013) is designed to answer Arabic questions related to named entities of types person, location, organization, time, quantity, etc. The authors assumed that the answer is a short passage. The system consists of four components: 1) question analysis; 2) sentence filtering; 3) candidate answer finding; 4) candidate answer scoring and ranking. To evaluate the system, a corpus containing 150,000 tagged tokens (ANERcorp) and a few gazetteers (ANERgazet, http://users.dsic.upv.es/ybenajiba/downloads.html) were used, and 80 questions of different types were asked to measure the performance. AQUASYS obtained an overall recall score of 97.5% and a precision score of 66.25%.

IDRAAQ (Abouenour et al., 2012) is an OpenQAS designed to enhance the quality of retrieved passages w.r.t. a given question. The system comprises three main stages: question analysis and classification, passage retrieval, and question validation. The authors focused heavily on the passage retrieval module, which consists of two main levels: a keyword-based level and a structure-based level. The former generates multiple queries using QE based on Arabic WordNet; the latter's objective is to extract passages relevant to the question, and it is built with JIRS, which is based on the distance n-gram density (Soriano et al., 2005). In order to evaluate the performance of the system, the authors participated in the QA4MRE@CLEF 2012 task. The test set consists of 16 test documents as an information source, 160 questions, and 800 choices.

Fig. 5. A diagram of the T5 framework. For every task we consider (text classification, question answering, and machine translation), the model generates text, treating every text processing problem as a text-to-text problem.

IDRAAQ achieved a precision of 0.13 and a c@1 of 0.21; c@1 is a simple measure for assessing non-response introduced in (Peñas and Rodrigo, 2011).

JAWEB (Kurdi et al., 2014) is based on AQUASYS (Bekhti and Al-Harbi, 2013). It consists of four modules: 1) User Interface; 2) Question Analyzer; 3) Passage Retrieval; and 4) Answer Extractor. The system handles questions that expect named-entity answers (person, location, time, etc.). As an extension to AQUASYS, JAWEB provides a web interface to the system, which is additional support for Arabic language presentation in web browsers. The system was tested with an expansion of the corpus presented in (Bekhti and Al-Harbi, 2013). JAWEB scored 100% recall and 80% precision.

LEMAZA (Azmi and Alshenaifi, 2017) is built to answer Arabic why-questions. It is composed of four components: transforming the input question into a query; preprocessing the document collection with the same method used for why-questions; retrieving candidate passages related to the input question; and extracting the answer. The system applies Rhetorical Structure Theory to extract answers. The experiments were conducted on 110 why-questions using 700 documents compiled from open-source Arabic corpora. LEMAZA achieved about 72.7% recall, 79.2% precision, and 78.7% c@1.

SOQAL (Mozannar et al., 2019) is the first Arabic OpenQAS that adopted a neural approach. It is composed of two main components: a retriever and a reader. The retriever aims at retrieving spans of text that are most related to the user's questions; it uses hierarchical TF-IDF to first retrieve a set of documents related to the question and then extract the passages that most probably contain the answer. The reader is a neural reading comprehension model based on BERT, trained to extract the answer given the question and the passage that likely contains it. To evaluate the system, Arabic Wikipedia was used as the information source. In addition, the authors proposed two new datasets: 1) ARCD, which contains 1,395 open-domain questions constructed from Arabic Wikipedia articles; 2) Arabic SQuAD, which is based on the translation of the SQuAD dataset proposed in (Rajpurkar et al., 2016). Considering the top 5 answers, SOQAL achieved 20.7%, 42.5%, and 51.7% in exact match, F1, and sentence match scores, respectively.

The authors in (Malhas and Elsayed, 2022) built the first machine reading comprehension system for the Holy Qur'an. It is a restricted-domain system that aims to extract an answer given a Qur'anic passage and a question in Modern Standard Arabic. They introduced CL-AraBERT, an AraBERT model further pre-trained on a large Classical Arabic dataset. They leveraged cross-lingual transfer learning by fine-tuning CL-AraBERT with Arabic SQuAD and ARCD (Mozannar et al., 2019) prior to fine-tuning the model with QRCD, a new dataset proposed by the authors for extractive machine reading comprehension on the Holy Qur'an. Furthermore, they introduced a new metric, partial average precision, which integrates partial matching to evaluate performance over both multi-answer and single-answer questions. To evaluate the proposed system, they used two experimental setups with the QRCD dataset: 1) splitting the dataset into a 75% training set and a 25% testing set; 2) performing 5-fold cross-validation (CV). The proposed system outperformed the baseline fine-tuned AraBERT reader by 6.12 and 3.75 points in partial average precision in the train-test split and CV setups, respectively.

We summarize the characteristics of the aforementioned systems in Table 1. It is worth noting that only one Arabic OpenQAS adopted a neural approach.

Table 1
A summary of existing Arabic OpenQAS. The table presents, for each system, its name, adopted approach, number of test questions, and the information source used.

System | Approach | # Test Questions | Information Source
QARAB (Hammo et al., 2002) | Traditional | 113 | Al-Raya newspaper
ArabiQA (Benajiba et al., 2007b) | Traditional | 200 | 11,000 Arabic Wikipedia articles
QASAL (Brini et al., 2009) | Traditional | 43 | Google search engine
AQUASYS (Bekhti and Al-Harbi, 2013) | Traditional | 80 | 150,000 tagged tokens (ANERcorp + ANERgazet)
IDRAAQ (Abouenour et al., 2012) | Traditional | 160 | QA4MRE@CLEF 2012 test documents
JAWEB (Kurdi et al., 2014) | Traditional | - | Extension of (ANERcorp + ANERgazet)
LEMAZA (Azmi and Alshenaifi, 2017) | Traditional | 110 | 700 documents from open-source Arabic corpora
SOQAL (Mozannar et al., 2019) | Neural | 702 | Arabic Wikipedia
(Malhas and Elsayed, 2022) | Neural | 348 or 5-fold CV | Qur'an

3. Proposed system

In this section, we present the main modules of the DAQAS system: the Dense Duplicate Question Detection module, the Retriever module, and the Reader module. The dense duplicate question detection module searches for a duplicate of the input question. If a duplicate exists, the system returns the answer of the duplicate question; otherwise, the system passes the question to the retriever module, which retrieves passages relevant to the question using Arabic Wikipedia as the information source. Finally, the reader module extracts the answer from the relevant passages. The flowchart of our Arabic OpenQAS is depicted in Fig. 6.

Fig. 6. The flowchart of our Arabic OpenQAS architecture.

3.1. Dense duplicate question detection module

We propose a new Dense Duplicate Question Detection module inspired by the dense passage retrieval method (Karpukhin et al., 2020). The aim of this module is twofold. First, it reduces the response time for questions that already have answers stored in an offline data source. Second, it provides trusted exact answers to previously known questions. To train the model, two dense encoders, $E_{Input}$ and $E_{Known}$, are used to map input questions and known questions into d-dimensional vectors. These vectors are the representations of the special token [CLS] computed by a BERT-based model, such as mBERT (Devlin et al., 2018), AraBERT (Antoun et al., 2020), GigaBERT (Lan et al., 2020), MARBERT, or ARBERT (Abdul-Mageed et al., 2021). The similarity between the input and known questions, $q_1$ and $q_2$, is defined by the following equation:

$$sim(q_1, q_2) = \exp\left(E_{Input}(q_1) \cdot E_{Known}(q_2)\right) \qquad (5)$$

This similarity is then fed to a softmax layer to predict whether the question pair is duplicate or not. The model is trained to optimize the cross-entropy loss.

After the training phase, we compute the embeddings of previously answered questions using the $E_{Known}$ encoder. These embeddings are indexed offline with FAISS (Johnson et al., 2019). FAISS (https://github.com/facebookresearch/faiss) is an extremely efficient, open-source library for similarity search and clustering of dense vectors, which can easily be applied to billions of vectors. Thus, the data source that contains known questions can be scaled to billions of questions. Given an input question $q_{Input}$, we compute its embedding using $E_{Input}$ and use a threshold to retrieve the top k duplicate question candidates that probably have the same answer as the input question. Finally, these retrieved questions and the posed question are passed to a BERT-based duplicate question classifier to determine a list of scores. The candidate with the highest score is considered the duplicate of the input question, and its known answer is returned as the final answer. If no duplicate question is detected, the system passes the question to the retriever module. Fig. 7 illustrates the process of selecting the final duplicate question.
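The following is a minimal sketch of this index-then-filter flow with FAISS; the encode helper is a hypothetical stand-in for the trained E_Input/E_Known encoders, and the 33.45 threshold is the value selected later in Section 4.3.

```python
# Sketch: offline indexing of known questions and thresholded candidate
# retrieval with FAISS.
import numpy as np
import faiss

rng = np.random.default_rng(0)

def encode(questions):
    # Placeholder for a trained BERT encoder: in the real system this would
    # return the [CLS] embedding of each question.
    return rng.standard_normal((len(questions), 768)).astype("float32")

known_questions = ["known question 1", "known question 2", "known question 3"]
index = faiss.IndexFlatL2(768)        # exact Euclidean-distance index
index.add(encode(known_questions))    # built offline, once

scores, ids = index.search(encode(["incoming user question"]), k=5)

THRESHOLD = 33.45                     # score threshold chosen in Section 4.3
candidates = [(known_questions[i], float(s))
              for s, i in zip(scores[0], ids[0])
              if i != -1 and s <= THRESHOLD]
# Surviving candidates are re-scored by the BERT-based duplicate classifier;
# if the list is empty, the question is forwarded to the retriever module.
print(candidates)
```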
3.2. Retriever module

This module aims at extracting from the source documents the most relevant passages given an input question. It consists of four stages: source documents preparation, query formulation, query expansion, and relevant passage retrieval.

3.2.1. Source documents preparation

We used the Arabic Wikipedia dump from Sept. 20, 2020 (https://dumps.wikimedia.org/arwiki/20201120/) as the source documents where the system can find answers to questions. We divided each document into several disjoint passages, which serve as our elementary retrieval units. Each passage contains a text block of 100 words, following (Wang et al., 2019; Karpukhin et al., 2020). The total number of obtained passages is 2,189,238. We prefixed every passage with the title of the Wikipedia document that contains the passage and the special token [SEP].

Fig. 7. Illustration of the process of selecting the final duplicate question.

3.2.2. Query formulation and passage preprocessing

We present the main preprocessing steps we applied for both query formulation and passage preprocessing. The preprocessing pipeline includes the following steps: 1) Diacritics removal: indexing text with diacritical marks is computationally expensive, since a large number of word forms must be considered, so removing them is highly effective; noting that retrieval is usually tolerant of ambiguity, we removed all diacritics from the text (Sanderson, 1994; Darwish and Magdy, 2014). 2) Kashida removal: kashidas are simple word elongation characters, so they are typically removed. 3) Letter normalization: we apply letter normalization since it decreases the vocabulary size and is thus computationally efficient. 4) Segmentation and tokenization: Arabic word segmentation consists of breaking words into prefix(es), stem, and suffix(es); tokenization splits sentences into tokens based on spaces and punctuation. We used the Farasa segmenter (Abdelali et al., 2016) to segment and tokenize questions and passages. 5) Stopword removal: stopwords carry out various functions in sentences but are ineffective for retrieval. Arabic stopwords can be attached to prefixes and suffixes; therefore, we apply segmentation before stopword removal.
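A small sketch of steps 1-3 of this pipeline is shown below; the specific normalization mapping (unifying alef variants and alef maqsura) is one common scheme and an assumption here, since the paper does not spell out its exact mapping, and Farasa segmentation and stopword removal are not reproduced.

```python
# Sketch of preprocessing steps 1-3: diacritics removal, kashida removal,
# and a common letter-normalization scheme.
import re

DIACRITICS = re.compile(r"[\u064B-\u0652]")   # fathatan ... sukun
KASHIDA = "\u0640"                            # tatweel / elongation character

def normalize(text: str) -> str:
    text = DIACRITICS.sub("", text)           # 1) diacritics removal
    text = text.replace(KASHIDA, "")          # 2) kashida removal
    # 3) letter normalization (assumed mapping): alef variants and alef maqsura
    text = re.sub("[\u0622\u0623\u0625]", "\u0627", text)  # آ/أ/إ -> ا
    text = text.replace("\u0649", "\u064A")                # ى -> ي
    return text

print(normalize("مُدَرِّسَــة"))   # -> "مدرسة"
```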
3.2.3. Query expansion

In order to improve the performance of the retriever, a pre-trained language model was fine-tuned to generate new contexts relevant to a question (Mao, 2020). The generated contexts are appended to the initial query to add semantic information and thus facilitate retrieving passages relevant to the query. Three contexts were taken as generation targets: 1) the Wikipedia page title containing the answer; queries augmented by a valid generated title have a better chance of retrieving relevant passages; 2) the answer to the question, which is naturally helpful for retrieving passages that contain the answer itself; still, generating the answer to a question directly is challenging, so the performance of the retriever may diminish with the performance of the answer generator; 3) the windowed passages, which are the context of an answer; they are extracted by taking a window of 10 words before and after the answer in the full context.

Three pre-trained language models were fine-tuned to generate the three contexts discussed above: mT5-small (Xue et al., 2020), AraGPT2-base, and AraGPT2-large (Antoun et al., 2021). The AraGPT2 architecture follows the GPT-2 architecture (Radford et al., 2019) but is pre-trained on large Arabic corpora. The model optimizes the CLM objective, i.e., it maximizes the probability of a word given the previous words in a sentence. Eq. (6) presents the CLM objective:

$$p(s) = \prod_{i=1}^{n} p(w_i \mid w_1, \ldots, w_{i-1}) \qquad (6)$$

where s is a sentence, n is the length of s, and $w_i$ is a word within s.

The mT5 model is a multilingual variant of the Text-to-Text Transfer Transformer (T5) (Raffel et al., 2020), the unified approach that treats every text processing problem as a text-to-text problem. mT5 employs the simple encoder-decoder transformer architecture initially proposed by Vaswani et al. (2017) and is pre-trained to optimize a masked language modeling objective. mT5 comes in different sizes; we utilize the small size due to resource limitations. Table 2 shows an example of title, answer, and windowed passage generation for a given Arabic question.

Table 2
An example of title, answer, and windowed passage generation.
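As an illustration of CLM-based context generation, here is a hedged sketch that prompts an AraGPT2 checkpoint through HuggingFace Transformers; the checkpoint id and prompt format are assumptions for illustration, not the paper's exact fine-tuning setup.

```python
# Sketch: generating a query context with a causal LM (AraGPT2-style).
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "aubmindlab/aragpt2-base"   # assumed Hub id of an AraGPT2 checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

prompt = "متى بنيت فاس؟"           # "When was Fez built?"; a fine-tuned model
inputs = tokenizer(prompt, return_tensors="pt")  # would map question -> title
output = model.generate(**inputs, max_new_tokens=16, do_sample=False,
                        pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```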
3.3. Relevant passages retrieval

A generation-augmented query is built by extending the original query with the contexts generated in the previous phase. We retrieve relevant passages using queries with the different generated contexts separately (e.g., query + generated title), since the performance of a one-time retrieval with all the contexts appended is significantly worse. Finally, we apply the BM25 model to retrieve the top k relevant passages according to the generation-augmented query.
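A minimal sketch of this step is shown below, using the rank_bm25 package as one possible BM25 implementation (the paper does not name one); the passages, query, and generated title are toy values.

```python
# Sketch: BM25 retrieval over preprocessed passages with a
# generation-augmented query.
from rank_bm25 import BM25Okapi

passages = [
    "fez [SEP] fez was founded in the 9th century".split(),
    "rabat [SEP] rabat is the capital of morocco".split(),
]
bm25 = BM25Okapi(passages)

query = "when was fez built".split()
generated_title = "fez".split()              # context from the generator
augmented_query = query + generated_title    # query expansion

top_passages = bm25.get_top_n(augmented_query, passages, n=1)
print(top_passages)
```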
3.4. Reader module

The main objective of this module is to extract answers from the retrieved passages. Let $P = [p_1, p_2, \ldots, p_k]$ denote the list of top k retrieved passages with relevance scores $R = [r_1, r_2, \ldots, r_k]$. The reader uses BERT-based models to estimate the start S and end E of the answer within a retrieved passage. The input of the model is a question and paragraph pair, represented as a single sequence separated by the special token [SEP]. For each token $t_j$ in a retrieved passage $p_i$, we compute the probabilities $P^{j}_{start}(p_i)$ and $P^{j}_{end}(p_i)$ that $t_j$ is the start or the end of the answer. The candidate answer from $p_i$ is the text span delimited by the tokens with the highest probabilities $P^{j}_{start}$ and $P^{l}_{end}$. The final answer is selected from the candidate answers, combining the relevance score of the retrieved passage with the span probabilities, so as to maximize the following score:

$$r_i \times \max_{j,l} \left( P^{j}_{start}(p_i) \times P^{l}_{end}(p_i) \right), \quad i \in [1, k],\; j, l \in [1, |p_i|] \qquad (7)$$

where $|p_i|$ denotes the number of tokens in passage $p_i$.
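The following sketch illustrates the selection rule of Eq. (7): for each passage, find the span maximizing the product of start and end probabilities, then weight the result by the passage's retrieval score. The probabilities and scores here are random toy values.

```python
# Sketch of Eq. (7): pick the answer span and passage maximizing
# r_i * P_start^j * P_end^l.
import numpy as np

def best_span(p_start, p_end, max_len=30):
    best_j, best_l, best_p = 0, 0, 0.0
    for j in range(len(p_start)):
        for l in range(j, min(j + max_len, len(p_end))):
            p = p_start[j] * p_end[l]
            if p > best_p:
                best_j, best_l, best_p = j, l, p
    return best_j, best_l, best_p

rng = np.random.default_rng(0)
passages = [{"r": r, "p_start": rng.random(40), "p_end": rng.random(40)}
            for r in (12.3, 9.8)]          # toy retrieval scores

scored = []
for i, p in enumerate(passages):
    j, l, prob = best_span(p["p_start"], p["p_end"])
    scored.append((p["r"] * prob, i, j, l))

score, i, j, l = max(scored)   # final answer: tokens j..l of passage i
print(score, i, j, l)
```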
corpus of information-seeking question–answer pairs. The authors
applied a simple-yet-novel data collection procedure that is
4. Experimental results model-free and translation-free. The main objective of building
this dataset is twofold: 1) enabling researchers to design and build
This section describes and presents all the necessary informa- high-quality OpenQAS in the world’s top 100 languages; 2) propos-
tion about our experiments, including the datasets used, experi- ing models that handle linguistic phenomena and data scenarios of
mental settings, and evaluations. the world’s language. TydiQA contains 25,893 Arabic question–an-
swer pairs.
4.1. Datasets We build an Arabic OpenQAS dataset by fusing samples from
ARCD train set (Mozannar et al., 2019), Arabic SQuAD (Mozannar
To evaluate the performance of our system, we used two main et al., 2019), MLQA (Lewis et al., 2020), XQuAD (Artetxe et al.,
datasets: duplicate question dataset and Arabic OpenQAS datasets. 2020), TyDi QA (Clark et al., 2020). We used ARCD train set (693
The duplicate question dataset is proposed by Mawdoo3 (Seelawi samples), Arabic SQuAD (48344 samples), MLQA (5852), XQuAD
et al., 2019). It contains 11,997 labeled question pairs, 45% of these (1190 samples) and TyDi QA train set (15726 samples) for training.
pairs are duplicates while 55% are not. We leverage the ARCD test set (702) provided by Mozannar et al.
8
H. Alami, A. El Mahdaouy, A. Benlahbib et al. Journal of King Saud University – Computer and Information Sciences 35 (2023) 101709

(2019) for testing. Fig. 8 shows the size of Arabic samples from dif-
ferent datasets.

4.2. Experimental settings

We used the Google Colaboratory environment (https://colab.research.google.com/) to execute and run all our experiments. This environment provides free graphical processing units for researchers. We make use of the well-known Python machine learning libraries, including TensorFlow 2.0 (Abadi et al., 2016), Keras, PyTorch (Paszke et al., 2019), HuggingFace (Wolf et al., 2020), and Scikit-learn, to construct, train, and test our models.

We provide the evaluation metrics we used to measure the performance of the various stages of DAQAS:

$$\text{Accuracy} = \frac{\text{No. of Correct Predictions}}{\text{Total No. of Predictions}} \qquad (8)$$

$$\text{Precision} = \frac{\text{No. of Correct Positive Predictions}}{\text{Total No. of Predicted Positives}} \qquad (9)$$

$$\text{Recall} = \frac{\text{No. of Correct Positive Predictions}}{\text{Total No. of Actual Positives}} \qquad (10)$$

$$\text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (11)$$

In the passage retrieval stage, these metrics are given by:

$$\text{Precision-IR} = \frac{\text{No. of Relevant Retrieved Passages}}{\text{Total No. of Retrieved Passages}} \qquad (12)$$

$$\text{Recall-IR} = \frac{\text{No. of Relevant Retrieved Passages}}{\text{Total No. of Relevant Passages}} \qquad (13)$$

$$\text{F1-IR} = \frac{2 \times \text{Precision-IR} \times \text{Recall-IR}}{\text{Precision-IR} + \text{Recall-IR}} \qquad (14)$$

ROUGE measures are used to quantify the performance of text generation. ROUGE-N measures the overlap between the generated text and the reference text at the N-gram level; for instance, ROUGE-2 measures the overlap at the bigram level. ROUGE-L computes the longest common subsequence (LCS) between the system-generated text and the reference text.

4.3. Dense duplicate question detection evaluation

The evaluation of duplicate question detection consists of two main points: first, measuring the performance of different classification methods based on various question encoders; then, measuring the performance of retrieving duplicate questions given an input question.

First, we tested the duplicate question classification model. The main objective is to build encoders for input questions and previously known questions. The parameters of these encoders are learned by training a model that integrates them to classify duplicate questions. The test sets are presented in (Seelawi et al., 2019). Since this was a competition previously hosted on Kaggle (https://www.kaggle.com/), two test sets are available, a private one and a public one. The evaluation metric for this competition is the F1-score. Table 3 shows the obtained results.

Table 3
Obtained results of different duplicate question classifiers with different encoders.

Encoder | Private Score | Public Score
mBERT (Devlin et al., 2018) | 82.12% | 81.87%
AraBERT (Antoun et al., 2020) | 87.62% | 87.79%
GigaBERT (Lan et al., 2020) | 90.81% | 90.66%
MARBERT (Abdul-Mageed et al., 2021) | 85.97% | 87.25%
ARBERT (Abdul-Mageed et al., 2021) | 88.70% | 90.13%

To measure the performance of retrieving duplicate questions given an input question, we index the entire training set offline using FAISS and the encoder of known questions. All questions that have a duplicate were considered input questions. The embeddings of the input questions are computed with the encoder of input questions, and then FAISS retrieves the top k similar known questions. A threshold is fixed to determine whether a question pair is duplicate or not.

In order to choose the best threshold, first, we retrieved the top 5 similar questions and found that 71.94% of the retrieved questions contain the expected duplicates. Fig. 9 shows the number of retrieved duplicate questions according to their rank. We notice that 81.64% of the duplicate questions are ranked in the first or second position.

Second, we analyzed the similarity scores between the input questions and the expected duplicate questions. The similarity is computed with the FAISS IndexFlatL2 index (https://github.com/facebookresearch/faiss/wiki/Faiss-indexes), which is based on the Euclidean distance. Fig. 10 illustrates the box-plots of the scores according to their ranks. The scores are close even when the positions differ; however, some outliers are detected, for instance, some duplicate pairs reach a similarity score over 100.

Third, we filter the retrieved questions given a threshold: retrieved questions with a score less than or equal to the threshold are kept, and retrieved questions with a larger score are removed. If all the retrieved questions have a score greater than the threshold, the input question is passed to the retriever module. In order to choose the threshold, we compute precision, recall, and the filtered ratio (i.e., the number of filtered questions divided by the total number of items). We chose the threshold that minimizes ranks and maximizes the precision score, since precision is the number of retrieved duplicate questions divided by the number of retrieved questions. Figs. 11-13 present the scores obtained using the median, the third quartile (Q3), and the upper limit of the box-plots (Fig. 10) for various ranks. The chosen threshold is a similarity score of 33.45, which is the upper limit of expected duplicate questions ranked third among the retrieved questions, since it achieved 41.43% precision. One remaining problem needs to be addressed: if the system retrieves multiple duplicate questions, which one is most probably the duplicate?

Finally, a multilingual BERT-based model is used to compute the probability of the input question being a duplicate of each retrieved question. The retrieved question with the highest probability is taken as the duplicate, and its answer is returned. The model achieved a 58.87% accuracy score (i.e., for 58.87% of inputs, the question with the highest probability is a duplicate).
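The threshold selection just described can be sketched as follows; the function and the score/label pairs are toy values for illustration, not the system's actual data.

```python
# Sketch of the threshold selection of Section 4.3: precision, recall, and
# filtered ratio for one candidate threshold over retrieved question pairs.
def threshold_stats(pairs, threshold):
    # pairs: list of (faiss_score, is_duplicate) for retrieved questions
    kept = [(s, dup) for s, dup in pairs if s <= threshold]
    filtered_ratio = 1 - len(kept) / len(pairs)
    tp = sum(dup for _, dup in kept)
    precision = tp / len(kept) if kept else 0.0
    recall = tp / sum(dup for _, dup in pairs)
    return precision, recall, filtered_ratio

pairs = [(12.0, True), (28.5, False), (33.0, True), (47.2, False), (90.1, False)]
print(threshold_stats(pairs, threshold=33.45))   # (0.667, 1.0, 0.4)
```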
Fig. 9. Illustration of the positions of duplicate questions in the retrieved top 5 similar questions.

Fig. 10. Box-plots of similarity scores between input questions and their duplicate questions according to their positions in the top 5 retrieved questions.

Fig. 11. Precision, recall, and filtered ratio obtained using the medians of various retrieved duplicate question ranks as threshold.

Fig. 12. Precision, recall, and filtered ratio obtained using the third quartile of various retrieved duplicate question ranks as threshold.

Fig. 13. Precision, recall, and filtered ratio obtained using the upper limit of various retrieved duplicate question ranks as threshold.

4.4. Retriever module evaluation

We examine the performance of our retriever module on the full ARCD dataset. The evaluation is performed in two stages: query expansion by generation and the retrieval of relevant passages.

4.4.1. Query expansion by generation

The quality of the generated query contexts is measured by the well-known ROUGE metric (Lin, 2004). We trained different models, namely mT5-small, AraGPT2-base, and AraGPT2-medium, to generate the title, answer, and windowed passage for a given query. Figs. 14-16 report the ROUGE-1 F1, ROUGE-2 F1, and ROUGE-L F1 scores of the trained models, respectively. We notice that title generation achieves the best scores, and that the AraGPT2 models surpass the other encoders. However, the obtained results are still relatively low compared to English and other Latin-based languages and need to be improved further.

4.4.2. Retrieval of relevant passages

We now present the results of passage retrieval using query expansion by generation. The passages built by splitting Arabic Wikipedia into chunks of 100 words are considered the information source. It is worth mentioning that we matched the contexts of the ARCD dataset samples with the 100-word passages to test the retrieval module. Table 4 shows the precision and recall obtained by the retrieval module. Augmenting the query with the title improves the performance of the retrieval system. However, the precision is low because only one passage per question is relevant (i.e., when retrieving the top 5 passages, the maximum precision we can get is 1/5 = 25%). The recall score shows that the retriever can retrieve up to 25% of the passages that contain the answer in the top 5 retrieved passages. We also evaluated augmenting the query with the generated title, answer, and windowed passage combined; however, the results obtained are very mediocre and not promising.
Fig. 14. ROUGE-1 scores of the AraGPT2-base, AraGPT2-medium, and mT5-small models fine-tuned to generate question contexts (title, answer, and windowed passage).

Fig. 15. ROUGE-2 scores of the AraGPT2-base, AraGPT2-medium, and mT5-small models fine-tuned to generate question contexts (title, answer, and windowed passage).

Fig. 16. ROUGE-L scores of the AraGPT2-base, AraGPT2-medium, and mT5-small models fine-tuned to generate question contexts (title, answer, and windowed passage).

Table 4
Precision and recall scores of the passage retrieval module with different query augmentation contexts.

                        | Precision (%)                     | Recall (%)
                        | Top 5 | Top 15 | Top 50 | Top 100  | Top 5 | Top 15 | Top 50 | Top 100
Without Augmentation    | 4.52  | 2.14   | 0.85   | 0.49     | 22.61 | 32.16  | 42.25  | 48.56
Title Augmentation      | 5.03  | 2.46   | 0.99   | 0.58     | 25.14 | 36.85  | 49.46  | 58.02
Answer Augmentation     | 4.00  | 1.99   | 0.83   | 0.48     | 20.00 | 29.91  | 41.53  | 47.75
Windowed Passage Aug.   | 4.34  | 2.13   | 0.83   | 0.47     | 21.71 | 31.89  | 41.62  | 47.39

In addition, Table 5 shows the ratio of passages that contain the exact answers to the given queries. The results show that we can obtain answers from other passages that are retrieved by the system. The queries augmented by the generated titles achieved the best results. Hence, we will use the passages retrieved with title augmentation to extract exact answers with the neural readers.

Table 5
Ratio of exact answers within retrieved passages.

                        | % of exact answers
                        | Top 5 | Top 10 | Top 15
Without Aug.            | 25.27 | 31.52  | 35.88
Title Augmentation      | 27.74 | 36.02  | 41.25
Answer Augmentation     | 22.88 | 29.77  | 33.48
Windowed Passage Aug.   | 23.02 | 29.05  | 33.04

Table 6
Obtained results with the full ARCD dataset as a test set according to Exact Match, F1, and sentence match scores.

Model | Exact Match | F1 Score | Sentence Match
mBERT | 22.80 | 57.64 | 92.62
AraBERT without preprocessing | 21.65 | 56.55 | 91.76
AraBERT | 21.00 | 63.64 | 89.39
GigaBERT | 23.80 | 58.83 | 92.69
ARBERT | 21.29 | 57.25 | 91.11
MARBERT | 15.34 | 45.17 | 84.44

Table 7
Obtained results with the ARCD test set according to Exact Match, F1, and sentence match scores.

Model | Exact Match | F1 Score | Sentence Match
mBERT | 26.35 | 59.00 | 92.59
AraBERT without preprocessing | 23.50 | 57.04 | 91.59
AraBERT | 19.82 | 49.04 | 88.22
GigaBERT | 26.35 | 60.42 | 93.01
ARBERT | 24.07 | 58.43 | 91.45
MARBERT | 17.95 | 47.64 | 86.18

Table 8
DAQAS performance evaluation on the ARCD test set.

Model | Exact Match | F1 Score | Sentence Match
Top 5 retrieved passages | 19.44 | 49.71 | 52.88
Top 10 retrieved passages | 20.51 | 50.39 | 55.49
Top 15 retrieved passages | 22.36 | 55.13 | 59.21

Table 9
Impact of the Duplicate Question Detection module on DAQAS performance.

Model | Exact Match | F1 Score | Sentence Match
Top 5 retrieved passages | 21.77 | 54.71 | 58.88
Top 10 retrieved passages | 23.22 | 59.45 | 61.86
Top 15 retrieved passages | 25.02 | 62.39 | 64.07
H. Alami, A. El Mahdaouy, A. Benlahbib et al. Journal of King Saud University – Computer and Information Sciences 35 (2023) 101709

Table 10
Step by step example of different modules of DAQAS.

4.5. Reader module evaluation 4.6. Evaluation of the end-to-end system

We evaluated the reader module with three main measures, We evaluate the performance of the overall DAQAS system with
namely exact match, F1 score, and sentence match. We trained var- ARCD test set, the query is expanded with the generated title only
ious BERT-based models to predict answer span with Arabic SQuAD using the AraGPT2-medium model, and the reader is based on the
(Mozannar et al., 2019), MLQA (Lewis et al., 2020), XQuAD (Artetxe GigaBERT model. The performances are evaluated with the top
et al., 2020), and TyDi QA (Clark et al., 2020) datasets. Then we 5,10, and 15 retrieved passages with the retrieved module. The
tested the models on the full ARCD (Mozannar et al., 2019) dataset results are presented in Table 8.
and the ARCD test set. Tables 6 and 7 present the obtained results. To evaluate the impact of the duplicate questions detection
These latter show that GigaBERT model achieves the best perfor- module, we randomly select 10% of selected questions from the
mances on the test set. Hence, this is the model that we will be ARCD test set. Then, we encode these questions using the known
using in our system. question encoder and index the resulting encoding with FAISS.

Fig. 17. Illustration of the wikipedia article from where the answer was extracted.

12
H. Alami, A. El Mahdaouy, A. Benlahbib et al. Journal of King Saud University – Computer and Information Sciences 35 (2023) 101709

Table 9 shows the results obtained. It is obvious that incorporating Benajiba, Y., Rosso, P., Lyhyaoui, A., 2007b. Implementation of the arabiqa question
answering system’s components. In: Proc. Workshop on Arabic Natural
a duplicate question detection module improves the performance
Language Processing, 2nd Information Communication Technologies Int.
of the overall system. Symposium, ICTIS-2007, Fez, Morroco, April, pp. 3–5.
A step by step example of different modules outputs is pre- Brini, W., Ellouze, M., Trigui, O., Mesfar, S., Belguith, L.H., Rosso, P., 2009. Factoid and
sented in Table 10. In addition, in Fig. 17 we provide an illustration definitional arabic question answering system. Post-Proc. NOOJ-2009, Tozeur,
Tunisia, June, 8–10.
of the wikipedia article from where the answer was extracted. Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan,
5. Conclusion and perspectives

In this paper, we proposed DAQAS, a new Arabic OpenQAS. The system is composed of three main components: (1) the duplicate question detection component, which searches for and returns answers to duplicate input questions; (2) the retriever component, which performs information retrieval to find passages relevant to the input question; and (3) the reader component, which extracts the answer span from the relevant passages. To the best of our knowledge, this is the first Arabic OpenQAS to integrate a duplicate question detection module into its pipeline. All components apply deep learning techniques, such as BERT, GPT, and T5, to improve the system's performance, and the system combines various techniques, including text classification, information retrieval, and information extraction. We performed various experiments to show the effectiveness of our system, which scored about 54.71% F1 when retrieving the top 5 relevant passages. Future work will aim to improve the retrieval and answer extraction components, hence improving the performance of the overall DAQAS system.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mane, D., Monga, R., Moore, S., Murray, D., Olah, C., 2016. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. URL: https://arxiv.org/pdf/1603.04467.pdf, arXiv:1603.04467.

Abdelali, A., Darwish, K., Durrani, N., Mubarak, H., 2016. Farasa: A fast and furious segmenter for Arabic. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pp. 11–16.

Abdul-Mageed, M., Elmadany, A.A., Nagoudi, E.M.B., 2021. ARBERT & MARBERT: Deep bidirectional transformers for Arabic. CoRR abs/2101.01785. URL: https://arxiv.org/abs/2101.01785, arXiv:2101.01785.

Abouenour, L., Bouzoubaa, K., Rosso, P., 2012. IDRAAQ: New Arabic question answering system based on query expansion and passage retrieval. In: CLEF (Online Working Notes/Labs/Workshop).

Antoun, W., Baly, F., Hajj, H., 2020. AraBERT: Transformer-based model for Arabic language understanding. In: LREC 2020 Workshop Language Resources and Evaluation Conference, 11–16 May, p. 9.

Antoun, W., Baly, F., Hajj, H., 2021. AraGPT2: Pre-trained transformer for Arabic language generation. In: Proceedings of the Sixth Arabic Natural Language Processing Workshop, Association for Computational Linguistics, Kyiv, Ukraine (Virtual), pp. 196–207. URL: https://www.aclweb.org/anthology/2021.wanlp-1.21.

Artetxe, M., Ruder, S., Yogatama, D., 2020. On the cross-lingual transferability of monolingual representations. In: Jurafsky, D., Chai, J., Schluter, N., Tetreault, J.R. (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, pp. 4623–4637. https://doi.org/10.18653/v1/2020.acl-main.421.

Azmi, A.M., Alshenaifi, N.A., 2017. Lemaza: An Arabic why-question answering system. Natural Lang. Eng. 23, 877–903. https://doi.org/10.1017/S1351324917000304.

Bekhti, S., Al-Harbi, M., 2013. AQuASys: A question-answering system for Arabic. In: WSEAS International Conference Proceedings, Recent Advances in Computer Engineering Series, WSEAS, pp. 19–27.

Benajiba, Y., Rosso, P., Benedíruiz, J.M., 2007a. ANERsys: An Arabic named entity recognition system based on maximum entropy. In: International Conference on Intelligent Text Processing and Computational Linguistics. Springer, pp. 143–153.

Benajiba, Y., Rosso, P., Lyhyaoui, A., 2007b. Implementation of the ArabiQA question answering system's components. In: Proc. Workshop on Arabic Natural Language Processing, 2nd Information Communication Technologies Int. Symposium, ICTIS-2007, Fez, Morocco, April, pp. 3–5.

Brini, W., Ellouze, M., Trigui, O., Mesfar, S., Belguith, L.H., Rosso, P., 2009. Factoid and definitional Arabic question answering system. Post-Proc. NOOJ-2009, Tozeur, Tunisia, June, 8–10.

Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D., 2020. Language models are few-shot learners. arXiv:2005.14165.

Clark, J.H., Palomaki, J., Nikolaev, V., Choi, E., Garrette, D., Collins, M., Kwiatkowski, T., 2020. TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages. Trans. Assoc. Comput. Linguistics 8, 454–470. URL: https://transacl.org/ojs/index.php/tacl/article/view/1929.

Darwish, K., Magdy, W., 2014. Arabic information retrieval. Found. Trends Inf. Retr. 7, 239–342. https://doi.org/10.1561/1500000031.

Devlin, J., Chang, M.W., Lee, K., Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Hammo, B., Abu-Salem, H., Lytinen, S.L., Evens, M., 2002. QARAB: A question answering system to support the Arabic language. In: Proceedings of the Workshop on Computational Approaches to Semitic Languages, SEMITIC@ACL 2002, Philadelphia, PA, USA, July 11, 2002, Association for Computational Linguistics. URL: https://www.aclweb.org/anthology/W02-0507/, https://doi.org/10.3115/1118637.1118644.

Hamza, A., Alaoui Ouatik, S.E., Zidani, K.A., En-Nahnahi, N., 2020. Arabic duplicate questions detection based on contextual representation, class label matching, and structured self attention. J. King Saud Univ. – Comput. Inf. Sci. https://doi.org/10.1016/j.jksuci.2020.11.032. URL: https://www.sciencedirect.com/science/article/pii/S1319157820305735.

Hamza, A., En-Nahnahi, N., Zidani, K.A., El Alaoui Ouatik, S., 2021. An Arabic question classification method based on new taxonomy and continuous distributed representation of words. J. King Saud Univ. – Comput. Inf. Sci. 33, 218–224. URL: https://www.sciencedirect.com/science/article/pii/S1319157818308401, https://doi.org/10.1016/j.jksuci.2019.01.001.

Johnson, J., Douze, M., Jégou, H., 2019. Billion-scale similarity search with GPUs. IEEE Trans. Big Data, 1–1. https://doi.org/10.1109/TBDATA.2019.2921572.

Karpukhin, V., Oguz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., Yih, W.t., 2020. Dense passage retrieval for open-domain question answering. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, pp. 6769–6781. URL: https://www.aclweb.org/anthology/2020.emnlp-main.550, https://doi.org/10.18653/v1/2020.emnlp-main.550.

Kurdi, H., Alkhaider, S., Alfaifi, N., 2014. Development and evaluation of a web based question answering system for Arabic language. Comput. Sci. Inf. Technol. (CS & IT) 4, 187–202.

Lan, W., Chen, Y., Xu, W., Ritter, A., 2020. GigaBERT: Zero-shot transfer learning from English to Arabic. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Lewis, P.S.H., Oguz, B., Rinott, R., Riedel, S., Schwenk, H., 2020. MLQA: Evaluating cross-lingual extractive question answering. In: Jurafsky, D., Chai, J., Schluter, N., Tetreault, J.R. (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, pp. 7315–7330. https://doi.org/10.18653/v1/2020.acl-main.653.

Lin, C.Y., 2004. ROUGE: A package for automatic evaluation of summaries. In: Text Summarization Branches Out, Association for Computational Linguistics, Barcelona, Spain, pp. 74–81. URL: https://www.aclweb.org/anthology/W04-1013.

Malhas, R., Elsayed, T., 2022. Arabic machine reading comprehension on the Holy Qur'an using CL-AraBERT. Inf. Process. Manage. 59, 103068. URL: https://www.sciencedirect.com/science/article/pii/S0306457322001704, https://doi.org/10.1016/j.ipm.2022.103068.

Mao, Y., 2020. Generation-augmented retrieval for open-domain question answering. CoRR abs/2009.08553. URL: https://arxiv.org/abs/2009.08553.

Mikolov, T., Chen, K., Corrado, G., Dean, J., 2013a. Efficient estimation of word representations in vector space. In: Bengio, Y., LeCun, Y. (Eds.), 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings. URL: http://arxiv.org/abs/1301.3781.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J., 2013b. Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems (NIPS), pp. 3111–3119.

Mozannar, H., Maamary, E., Hajal, K.E., Hajj, H.M., 2019. Neural Arabic question answering. In: El-Hajj, W., Belguith, L.H., Bougares, F., Magdy, W., Zitouni, I. (Eds.), Proceedings of the Fourth Arabic Natural Language Processing Workshop, WANLP@ACL 2019, Florence, Italy, August 1, 2019, Association for Computational Linguistics, pp. 108–118. https://doi.org/10.18653/v1/w19-4612.

OpenAI, 2023. GPT-4 technical report. arXiv:2303.08774.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Köpf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S., 2019. PyTorch: An imperative style, high-performance deep learning library. In: Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E.B., Garnett, R. (Eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pp. 8024–8035. URL: http://papers.nips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.

Peñas, A., Rodrigo, Á., 2011. A simple measure to assess non-response. In: Lin, D., Matsumoto, Y., Mihalcea, R. (Eds.), The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference, 19–24 June, 2011, Portland, Oregon, USA, The Association for Computer Linguistics, pp. 1415–1424. URL: https://www.aclweb.org/anthology/P11-1142/.

Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L., 2018. Deep contextualized word representations. In: North American Association for Computational Linguistics (NAACL), pp. 2227–2237.

Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., 2018. Improving language understanding by generative pre-training. Technical Report. OpenAI.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., 2019. Language models are unsupervised multitask learners. OpenAI Blog 1, 9.

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J., 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21, 140:1–140:67. URL: http://jmlr.org/papers/v21/20-074.html.

Rajpurkar, P., Zhang, J., Lopyrev, K., Liang, P., 2016. SQuAD: 100,000+ questions for machine comprehension of text. In: Empirical Methods in Natural Language Processing (EMNLP), pp. 2383–2392.

Salton, G., 1971. The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice-Hall, Englewood Cliffs.

Sanderson, M., 1994. Word sense disambiguation and information retrieval. In: Croft, W.B., van Rijsbergen, C.J. (Eds.), Proceedings of the 17th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, Dublin, Ireland, 3–6 July 1994 (Special Issue of the SIGIR Forum), ACM/Springer, pp. 142–151. https://doi.org/10.1007/978-1-4471-2099-5_15.

Seelawi, H., Mustafa, A., Al-Bataineh, H., Farhan, W., Al-Natsheh, H.T., 2019. NSURL-2019 shared task 8: Semantic question similarity in Arabic. CoRR abs/1909.09691. URL: http://arxiv.org/abs/1909.09691, arXiv:1909.09691.

Silberztein, M., Váradi, T., Tadić, M., 2012. Open source multi-platform NooJ for NLP. In: Proceedings of COLING 2012: Demonstration Papers, The COLING 2012 Organizing Committee, Mumbai, India, pp. 401–408. URL: https://www.aclweb.org/anthology/C12-3050.

Soriano, J.M.G., Montes-y-Gómez, M., Arnal, E.S., Pineda, L.V., Rosso, P., 2005. Language independent passage retrieval for question answering. In: Gelbukh, A.F., de Albornoz, A., Terashima-Marín, H. (Eds.), MICAI 2005: Advances in Artificial Intelligence, 4th Mexican International Conference on Artificial Intelligence. Springer, pp. 816–823. https://doi.org/10.1007/11579427_83.

Taylor, W.L., 1953. "Cloze procedure": A new tool for measuring readability. Journalism Quart. 30, 415–433. https://doi.org/10.1177/107769905303000401.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I., 2017. Attention is all you need. In: Advances in Neural Information Processing Systems (NIPS), pp. 5998–6008.

Wang, Z., Ng, P., Ma, X., Nallapati, R., Xiang, B., 2019. Multi-passage BERT: A globally normalized BERT model for open-domain question answering. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, pp. 5878–5882. https://doi.org/10.18653/v1/D19-1599. URL: https://www.aclweb.org/anthology/D19-1599.

Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Le Scao, T., Gugger, S., Drame, M., Lhoest, Q., Rush, A., 2020. Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online, pp. 38–45. URL: https://www.aclweb.org/anthology/2020.emnlp-demos.6, https://doi.org/10.18653/v1/2020.emnlp-demos.6.

Xue, L., Constant, N., Roberts, A., Kale, M., Al-Rfou, R., Siddhant, A., Barua, A., Raffel, C., 2020. mT5: A massively multilingual pre-trained text-to-text transformer. CoRR abs/2010.11934. URL: https://arxiv.org/abs/2010.11934, arXiv:2010.11934.

Zeng, C., Li, S., Li, Q., Hu, J., Hu, J., 2020. A survey on machine reading comprehension: Tasks, evaluation metrics and benchmark datasets. Appl. Sci. 10. URL: https://www.mdpi.com/2076-3417/10/21/7640, https://doi.org/10.3390/app10217640.