Recent Trends in Deep Learning Based Open-Domain Textual Question Answering Systems
June 2, 2020.
Digital Object Identifier 10.1109/ACCESS.2020.2988903
ABSTRACT Open-domain textual question answering (QA), which aims to answer questions from large data sources like Wikipedia or the web, has gained wide attention in recent years. Recent advancements in open-domain textual QA are mainly due to significant developments in deep learning techniques, especially machine reading comprehension and neural-network-based information retrieval, which allow the models to continuously refresh state-of-the-art performance. However, a comprehensive review of existing approaches and recent trends is lacking in this field. To address this issue, we present a thorough survey that explicitly delimits the task scope of open-domain textual QA, overviews recent key advancements in deep learning based open-domain textual QA, illustrates the models and acceleration methods in detail, and introduces open-domain textual QA datasets and evaluation metrics. Finally, we summarize the models, discuss the limitations of existing works, and point out potential future research directions.
INDEX TERMS Open-domain textual question answering, deep learning, machine reading comprehension,
information retrieval.
I. INTRODUCTION
A. BACKGROUND
Question answering (QA) systems have long drawn attention from both academia and industry [1]–[3], and the concept of a QA system can be traced back to the emergence of artificial intelligence, namely the famous Turing test [4]. Technologies related to QA have been constantly evolving over almost the last 60 years in the field of Natural Language Processing (NLP) [5]. Early works on QA mainly relied on manually-designed syntactic rules to answer simple questions due to constrained computing resources [6], such as Baseball in 1961, Lunar in 1977, Janus in 1989, and so on [5]. Around 2000, several conferences such as TREC QA [1] and QA@CLEF [7] greatly promoted the development of QA, and a large number of systems that utilize information retrieval (IR) techniques were proposed at that time. Then around 2007, with the development of knowledge bases (KBs) such as Freebase [8] and DBpedia [9], and especially with the emergence of the open-domain datasets WebQuestions [10] and SimpleQuestions [11], KBQA technologies evolved quickly. In 2011, IBM Watson [12] won the Jeopardy! game show, which received a great deal of attention. Recently, due to the release of several large-scale benchmark datasets [13]–[15] and the fast development of deep learning techniques, large advancements have been made in the QA field. In particular, recent years have witnessed a research renaissance in deep learning based open-domain textual QA, an important QA branch that focuses on answering questions from large knowledge sources like Wikipedia and the web.

B. MOTIVATION
Despite the flourishing research on open-domain textual QA, there remains a lack of a comprehensive survey that summarizes existing approaches and datasets as well as systematically analyzes the trends behind these successes. Although several surveys [16]–[19] discuss the broad
picture of QA, none of them focuses on the specific deep learning based open-domain textual QA branch. Moreover, there are several surveys [20]–[23] that illustrate recent advancements in machine reading comprehension (MRC) by introducing several classic neural MRC models. However, they only report approaches in closed-domain single-paragraph settings and fail to present the latest achievements in open-domain scenarios. We therefore write this paper to summarize the recent literature on deep learning based open-domain textual QA for the researchers, practitioners, and educators who are interested in this area.

TABLE 1. Question-answer pairs with sample excerpts from TriviaQA [14], which requires reasoning from multiple paragraphs.
C. TASK SCOPE
In this paper, we conduct a thorough literature review on recent progress in open-domain textual QA. To achieve this goal, we first categorize previous works based on the five characteristics described below, then give an exact definition of open-domain textual QA that explicitly constrains its scope.
1) Source: In terms of data sources, QA systems can be classified into structured, semi-structured, and unstructured categories. On the one hand, structured data are mainly organized in the form of a knowledge graph (KG) [9], [24], [25], while semi-structured data are usually viewed as lists or tables [26]–[28]. On the other hand, unstructured data are typically plain text composed of natural language.
2) Question: The question type is defined as a certain semantic category characterized by some common properties. The major types include factoid, list, definition, hypothetical, causal, relationship, procedural, and confirmation questions [17]. Typically, a factoid question is a question that starts with a Wh-interrogative word (What, When, Where, etc.) and requires an answer expressed as a fact in the text [17]. The form of a question can be a full question [14], a keyword/phrase [15], or an (item, property, answer) triple [29].
3) Answer: Based on how the answer is produced, QA systems can be roughly classified into extractive-based QA and generative-based QA. Extractive-based QA selects a span of text [13], [15], [30], a word [31], [32], or an entity [10], [11] as the answer. Generative-based QA may rewrite the answer if it does not (i) include proper grammar to make it a full sentence, (ii) make sense without the context of either the query or the passage, or (iii) have a high overlap with exact portions of the context [33], [34].
4) Domain: A closed-domain QA system deals with questions under a specific field [35], [36] (e.g., law, education, and medicine), and can exploit domain-specific knowledge frequently formalized in ontologies. Besides, closed-domain QA usually refers to a situation where only a limited type of question is asked and a small amount of context is provided. An open-domain QA system, on the other hand, deals with questions from a broad range of domains, and relies only on general text and knowledge bases. Moreover, such systems are usually required to find answers from large open-domain knowledge sources (e.g., Wikipedia, the web), instead of a given document [37], [38].
5) Methodology: As for the involved methodologies, QA systems can be categorized into IR based [39]–[41], NLP based [31], and KB based [42] approaches [5]. IR based models mainly return the final answer as a text snippet that is most relevant to the question. NLP based models aim to extract candidate answer strings from the context document and re-rank them by semantic matching. KBQA systems build a semantic representation of the query and transform it into a full predicate calculus statement for the knowledge graph.

Following the above categories, open-domain textual QA can be defined as: (1) unstructured textual data sources, (2) factoid questions or keywords/phrases as inputs, (3) extractive-based answers, (4) open-domain, and (5) NLP based technologies with auxiliary IR technologies. Table 1 shows an example of deep learning based open-domain textual QA.

D. CONTRIBUTIONS
The purpose of this survey is to review the recent research progress of open-domain textual QA based on deep learning. It provides the reader with a panoramic view that allows the reader to establish a general understanding of open-domain textual QA and to know how to build a QA model with deep learning techniques. In conclusion, the main contributions of this survey are as follows: (1) we conducted a systematic review of open-domain textual QA systems based on deep learning techniques; (2) we introduced the recent models, discussed the pros and cons of each method, summarized the methods used in each component of the models, and compared the models' performance on each dataset; (3) we discussed the current challenges and problems to be
solved, and explored new trends and future directions in the research on open-domain textual QA systems based on deep learning.

E. ORGANIZATION
After making the definition clear, we further give an overview of open-domain textual QA systems, including presenting a brief history, explaining the motivation for using deep learning techniques, and introducing a general open-domain textual QA architecture (Section II). Next, we illustrate several key components of open-domain textual QA including the ranking module, answer extraction, and answer selection, and summarize recent trends in acceleration techniques as well as public datasets and metrics (Section III). Last, we conclude the work with discussions on the limitations of existing works and some future research directions (Section IV).

II. OVERVIEW OF OPEN-DOMAIN TEXTUAL QA SYSTEMS
Before we dive into the details of this survey, we start with an introduction to the history of open-domain textual QA systems, the reasons why deep learning based methods emerged, and the technical architecture of deep learning based open-domain textual QA.

A. HISTORY OF OPEN-DOMAIN TEXTUAL QA
In 1993, START became the first knowledge-based question-answering system on the Web [43], and it has since answered millions of questions from Web users all over the world. In 1999, the 8th TREC competition [44] began to run the QA track. In the following year, at the 38th ACL conference, a special discussion topic ''Open-domain Question Answering'' was opened up. Since then, open-domain QA systems have become a hot topic in the research community. With the development of structured KBs like Freebase [8], many works have proposed to construct QA systems with KBs, driven by datasets such as WebQuestions [10] and SimpleQuestions [11]. These approaches usually achieve high precision and nearly solve the task on simple questions [45], but their scope is limited to the ontology of the KBs. There are also some pipelined QA approaches that use a large number of data resources, including unstructured text collections and structured KBs. The landmark approaches are ASKMSR [3], DEEPQA [12], and YODAQA [2]. A landmark event in this field is the success of IBM Watson [12], which won the Jeopardy! game show in 2011. This complicated system adopted a hybrid scheme including technologies brought from IR, NLP, and KBs. In recent years, with the development of deep learning, NLP based QA systems have emerged, which can directly carry out end-to-end processing of unstructured text sequences at the semantic level through neural network models [46]. Specifically, DrQA [37] was the first neural-network-based model for the task of open-domain textual QA. Based on this framework, some end-to-end textual QA models have been proposed, such as R3 [47], DS-QA [48], DocumentQA [49], and RE3QA [38].

B. WHY DEEP LEARNING FOR OPEN-DOMAIN TEXTUAL QA
It is beneficial to understand the motivation behind these approaches for open-domain textual QA. Specifically, why do we need to use deep learning techniques to build open-domain textual QA systems? What are the advantages of neural-network-based architectures? In this section, we would like to answer the above questions to show the strengths of deep learning based QA models, which are listed as below:
1) Automatically learn complex representations: Using neural networks to learn representations has two advantages: (1) it reduces the effort spent on hand-crafted feature design. Feature engineering is labor-intensive work, while deep learning enables automatic feature learning from raw data in unsupervised or supervised ways [50]. (2) Contrary to linear models, neural networks are capable of modeling the non-linearity in data with activation functions such as ReLU, Sigmoid, Tanh, etc. This property makes it possible to capture complex and intricate interaction patterns in the data [50].
2) End-to-end processing: Many early years' QA systems heavily relied on question and answer templates, which were mostly manually constructed and time-consuming. Later, most QA research adopted a pipeline of conventional linguistically-based NLP techniques, such as semantic parsing, part-of-speech tagging, and coreference resolution, which could cause error propagation through the entire process. On the other hand, neural networks have the advantage that multiple building blocks can be composed into a single (gigantic) differentiable function and trained end-to-end. Besides, models of different stages can share learned representations and benefit from multi-task learning [51].
3) Data-driven paradigm: Deep learning is essentially a science based on statistics; one intrinsic property of deep learning is that it follows a data-driven paradigm. That is, neural networks can learn statistical distributions of features from massive data, and the performance of a model can be constantly improved as more data are used [52]. This is important for open-domain textual QA as it usually involves a wide range of domains and large text corpora.

C. DEEP LEARNING BASED TECHNICAL ARCHITECTURE OF OPEN-DOMAIN TEXTUAL QA SYSTEMS
As shown in Table 1, given a question, the QA system needs to retrieve several relevant documents, read and gather information across multiple text snippets, then extract the answer from the raw text. Notably, not all given paragraphs contain the correct answer, and the exact location of the ground-truth answer is unknown. Such a setting is usually referred to as distant supervision, which brings difficulties in designing supervised training signals. In summary, open-domain textual QA poses great challenges as it requires to: 1) filter out irrelevant noise context, 2) reason across
multiple evidence snippets, and 3) train with distantly-supervised objectives.

FIGURE 1. The technical architecture of deep learning based open-domain textual QA systems. The paragraph index&ranking module first retrieves several related documents and then selects a few top-ranked paragraphs relevant to the question, from which the extractive reading comprehension module extracts multiple candidate answers. Finally, the system picks the most promising prediction as the answer. Besides, to boost the processing speed while ensuring accuracy, several acceleration techniques are adopted.
In recent years, with the rapid development of deep learning technologies, significant technical advancements have been made in the field of open-domain textual QA. Specifically, Chen et al. proposed the DrQA system [37], which splits the task into two subtasks: paragraph retrieval and answer extraction. The paragraph retrieval module selects and ranks the candidate paragraphs according to the relevance between paragraph and question, while the answer extraction module predicts the start and end positions of candidate answers in the context. Later, Clark and Gardner [49] proposed a shared-normalization mechanism to deal with the distant-supervision problem in open-domain textual QA. Wang et al. [47] adopted reinforcement learning to jointly train the ranker and the answer-extraction reader. Based on this work, Wang et al. [53] further proposed evidence aggregation for answer re-ranking. Recently, Hu et al. [38] presented an end-to-end open-domain textual QA architecture to jointly perform context retrieval, reading comprehension, and answer re-ranking.

To summarize these works, we propose a general technical architecture of open-domain textual QA systems in Fig. 1. The architecture mainly consists of three modules: paragraph index&ranking, candidate answer extraction, and final answer selection. Specifically, the paragraph index&ranking module first retrieves the top-k paragraphs related to the question. Then these paragraphs are sent into the answer extraction module to locate multiple candidate answers. Finally, the answer selection module predicts the final answer. Moreover, in order to improve the efficiency of QA systems, some acceleration techniques, such as jump reading [54] and skim reading [55], can be applied in the system.
III. MODELS AND HOT TOPICS
In this section, we illustrate the individual components of the generalized open-domain textual QA system described in Fig. 1. Specifically, we introduce: (i) the paragraph index&ranking module in subsection III-A, (ii) the candidate answer extraction module in subsection III-B, (iii) the final answer selection module in subsection III-C, and (iv) the acceleration techniques in subsection III-D. Finally, we give a brief introduction of recent open-domain textual QA datasets in subsection III-E, as well as experimental evaluation and model performance in subsection III-F.

A. PARAGRAPH INDEX AND RANKING
The first step of open-domain textual QA is to retrieve several top-ranked paragraphs that are relevant to the question. There are two sub-stages here: retrieving documents through indexing, and ranking the context fragments (paragraphs) in these documents. The paragraph-index module builds a light-weight index for the original documents. During processing, the index dictionary is loaded into memory, while the original documents are stored in file systems. This method can effectively reduce memory overhead, as well as accelerate the retrieval process. The paragraph-ranking module analyzes the relevance between the query and paragraphs and selects top-ranked paragraphs to feed into the reading comprehension module. In recent years, along with the development of information retrieval and NLP, a large number of new technologies for indexing and ranking have been proposed. Here we mainly focus on the deep learning based approaches.
1) PARAGRAPH INDEX
Paragraph indices can be classified into query-dependent indices and query-independent indices. The query-dependent index mainly includes the dependence model and pseudo relevance feedback (PRF) [56], [57], which considers approximation between query and document terms. However, due to the index's dependence on queries, the corresponding ranking models are difficult to scale and generalize. The query-independent index mainly includes TF-IDF, BM25, and language modeling [56], [57], which rely on relatively simple index features with low computational complexity for matching. IBM Watson adopted a search method that combines the query-dependent similarity score with the query-independent score to determine the overall search score for each passage [58]. Although those index features are relatively efficient and scalable in processing, they are mainly based on terms without contextual semantic information.

Recently, several deep learning based methods have been proposed. These approaches usually embed terms or phrases into dense vectors and use them as indices. Kato et al. [59] constructed a demo to compare the efficiency and effectiveness of LSTM and BM25. Seo et al. proposed Phrase-indexed Question Answering (PIQA) [60], which employed bi-directional LSTMs and a self-attention mechanism to obtain the representation vectors for both query and paragraph. Lee et al. leveraged a BERT encoder [61] to pre-train the retrieval module [62]; unlike previous works that retrieve candidate paragraphs, the evidence passage retrieved from Wikipedia was treated as a latent variable.
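As a concrete reference point for the query-independent features above, the sketch below ranks paragraphs by TF-IDF cosine similarity with scikit-learn; the toy corpus is ours, and the sketch illustrates the classical sparse baseline that the dense, learned indices (PIQA, BERT-based retrievers) aim to improve on.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

paragraphs = [
    "IBM Watson won the Jeopardy! game show in 2011.",
    "BM25 is a classical query-independent ranking function.",
    "GloVe embeds words using a co-occurrence matrix.",
]
question = ["Which system won Jeopardy! in 2011?"]

# Build a term-based (query-independent) index over the paragraphs.
vectorizer = TfidfVectorizer()
index = vectorizer.fit_transform(paragraphs)        # sparse TF-IDF matrix
scores = cosine_similarity(vectorizer.transform(question), index)[0]

# Retrieve the top-ranked paragraphs for the question.
for rank, i in enumerate(scores.argsort()[::-1], 1):
    print(rank, round(float(scores[i]), 3), paragraphs[i])
```

A dense index replaces the sparse term rows above with learned phrase or passage vectors, searched by approximate nearest-neighbor tools such as aLSH [73] or Faiss [74].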
2) PARAGRAPH RANKING
Traditional ranking technologies are based on manually-designed features [63], but in recent years, learning to rank (L2R) approaches have become a hot spot. L2R refers to ranking methods based on supervised learning; it can be classified into Pointwise, Pairwise, and Listwise [64]. Pointwise (e.g., McRank [65], Prank [66]) converts the document into feature vectors, then gives out relevance scores according to the classification or regression function learned from the training data, from which the ranking results are derived. Pointwise focuses on the relevance between the query and documents, ignoring the information interaction inside the documents. Hence, Pairwise (e.g., RankNet [67], FRank [68]) estimates whether the order of document pairs is reasonable. However, the number of relevant documents varies greatly between queries, so the generalization ability of Pairwise is difficult to estimate. Unlike the above two methods, Listwise (e.g., LambdaRank [69], SoftRank [70]) trains the optimization scoring function with a list of all search results for each query as a training sample. Since the aim of paragraph ranking is to filter out irrelevant paragraphs, Pointwise seems to be adequate in most cases. However, the scores between queries and paragraphs can also be helpful for predictions on the final answer, as we discuss in subsection III-C. Consequently, Listwise ranking methods are also important to the open-domain textual QA task.

Moreover, paragraph ranking models trained with deep neural networks mainly fall into four categories [56]: (i) learning the ranking model through manual features, and only using the neural network to match the query and document; (ii) estimating relevance based on the query-document exact matching pattern; (iii) learning the embedded representations of queries and documents, and evaluating them by a simple function, such as cosine similarity or dot-product; (iv) conducting query expansion with neural network embeddings, and calculating the query expectation.

Similar to (ii), Wang et al. [47] proposed the Reinforced Ranker-Reader (R3) model, which is also a kind of Pointwise method. It consisted of: (1) a Ranker to select the paragraph most relevant to the query, and (2) a Reader to extract the answer from the paragraph selected by the Ranker. The deep learning based Ranker model was trained using reinforcement learning, where the accuracy of the answer extracted by the Reader determined the reward. Both the Ranker and Reader leveraged the Match-LSTM [71] model to match the query and passages. Similar to (iii), Tan et al. [72] studied several representation learning models and found that an attentive LSTM can be very effective in the Pairwise training mode. And PIQA [60] employed similarity clustering to retrieve the indexed phrase vector nearest to the query vector by asymmetric locality-sensitive hashing (aLSH) [73] or Faiss [74].

There are also combinations of the above categories. Htut et al. [75] combined (i) and (iii), taking the embedded representations to train the ranking model, and proposed two kinds of ranking models: the InferSent ranker and the Relation-Networks ranker. The rankers leveraged the Listwise ranking method and were trained by minimizing the margin ranking loss, so as to obtain the optimal score:
$$\sum_{i=1}^{k} \max\left(0,\; 1 - f(q, p_{pos}) + f(q, p_{neg}^{i})\right) \tag{1}$$
Here $f$ is the scoring function, $p_{pos}$ is a paragraph that contains the ground-truth answer, $p_{neg}$ is a negative paragraph that does not contain the ground-truth answer, and $k$ is the number of negative samples. The InferSent ranker leveraged sentence-embedded representations [76] and evaluated the semantic similarity in ranking for QA, employing a feed-forward neural network as the scoring function:
$$x_{classifier} = \left[\,q;\; p;\; q - p;\; q \odot p\,\right] \tag{2}$$
$$z = W^{(1)} x_{classifier} + b^{(1)} \tag{3}$$
$$score = W^{(2)}\,\mathrm{ReLU}(z) + b^{(2)} \tag{4}$$
The Relation-Networks ranker focused on measuring the relevance between words in the question and words in the paragraph, where the word pairs were the inputs of the Relation-Network, which is formulated as follows:
$$RN(q, p) = f_{\phi}\left(\sum_{i,j} g_{\theta}\left([E(q_i);\, E(p_j)]\right)\right) \tag{5}$$
Here $E(\cdot)$ is a 300-dimensional GloVe embedding [77], and $f_{\phi}$ and $g_{\theta}$ are 3-layer feed-forward neural networks with ReLU activation functions. The experimental results showed that the performance of the QA part [75] even exceeded that of the reinforcing feedback ranking model [47].
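A minimal PyTorch sketch of the InferSent-style scorer and margin ranking loss of Eqs. (1)–(4); the sentence encodings q and p are assumed to be precomputed, and the layer sizes are illustrative rather than those of [75].

```python
import torch
import torch.nn as nn

class InferSentRanker(nn.Module):
    """Feed-forward scorer over [q; p; q - p; q * p], as in Eqs. (2)-(4)."""
    def __init__(self, dim: int = 128, hidden: int = 256):
        super().__init__()
        self.w1 = nn.Linear(4 * dim, hidden)   # W^(1), b^(1)
        self.w2 = nn.Linear(hidden, 1)         # W^(2), b^(2)

    def forward(self, q: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
        x = torch.cat([q, p, q - p, q * p], dim=-1)   # x_classifier
        return self.w2(torch.relu(self.w1(x))).squeeze(-1)

def margin_ranking_loss(f_pos, f_neg, margin: float = 1.0):
    """Eq. (1): sum over k negatives of max(0, 1 - f(q,p_pos) + f(q,p_neg^i))."""
    return torch.clamp(margin - f_pos.unsqueeze(-1) + f_neg, min=0).sum(-1).mean()

ranker = InferSentRanker()
q = torch.randn(8, 128)            # batch of question encodings
p_pos = torch.randn(8, 128)        # positive paragraph encodings
p_neg = torch.randn(8, 5, 128)     # k = 5 negative paragraphs per question
f_pos = ranker(q, p_pos)
f_neg = ranker(q.unsqueeze(1).expand_as(p_neg), p_neg)
margin_ranking_loss(f_pos, f_neg).backward()
```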
FIGURE 2. Differences between BERT, GPT, and ELMo. BERT uses a bi-directional Transformer. OpenAI GPT uses a left-to-right Transformer. ELMo uses the concatenation of independently trained left-to-right and right-to-left LSTMs to generate features for downstream tasks. (Figure source: Devlin et al. [61])

B. CANDIDATE ANSWER EXTRACTION
With the candidate paragraphs filtered by the index&ranking module, QA systems can locate candidate answers (the start and end positions of answer spans in the document or paragraph) through the reading comprehension model. With the release of datasets and test standards [13]–[15], [30], many works have been proposed in the past three years, attracting great attention from academia and industry. In this subsection, we illustrate the reading extraction model from three hierarchies: (i) word embeddings and pre-training models for feature encoding in subsection III-B1, (ii) interaction of questions and paragraphs using attention mechanisms in subsection III-B2, and (iii) feature aggregation for predicting the candidate answers in subsection III-B3.

1) FEATURE ENCODING LAYER
In this layer, the original text tokens are transformed into vectors that can be computed by the deep neural networks, through word embeddings or manual features. Word embeddings can be obtained through a dictionary or by fine-tuning pre-trained language models, while manual textual features are usually implemented by part-of-speech (POS) tagging and named entity recognition (NER). Manual features can be constructed by tools such as CoreNLP [78], AllenNLP [79], and NLTK [80]. Generally, the features mentioned above will be fused with the embedding vectors.

Embedding vectors can be constructed by pre-trained language models. GloVe [77] transfers word-level information to word vectors through the co-occurrence matrix, but cannot distinguish polysemous words. ELMo [81] leveraged a deep bi-directional language model, concatenated from two unidirectional language models, to yield word embeddings that can vary across different context sentences. OpenAI GPT [82] used the left-to-right Transformer decoder [83], whereas BERT [61] used the bi-directional Transformer encoder [83] for pre-training; both of them adapt to downstream tasks through fine-tuning. Fig. 2 shows the differences between ELMo, GPT, and BERT. Specifically, the pre-trained BERT model has been proven to be a powerful context-dependent representation and has made significant improvements on open-domain textual QA tasks; some works based on BERT, such as RE3QA [38], ORQA [62], and DFGN [84], have achieved state-of-the-art results.
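For the contextual encoders above, a few lines with the HuggingFace transformers library yield the token-level features; a minimal sketch, where the checkpoint name and the joint question-paragraph packing are our choices rather than something prescribed by the surveyed models:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

question = "Who won the Jeopardy! game show in 2011?"
paragraph = "IBM Watson won the Jeopardy! game show in 2011."

# BERT encodes the question/paragraph pair jointly, so every token
# vector is already conditioned on both sequences.
inputs = tokenizer(question, paragraph, return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**inputs)

token_features = outputs.last_hidden_state   # (1, seq_len, 768)
print(token_features.shape)
```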
2) INTERACTIVE ATTENTION LAYER
The interactive attention layer constructs representations on top of the original features of the question or paragraph by using attention mechanisms. It can be mainly divided into two types:
(i) Interactive alignment between the question and paragraph, namely co-attention, which allows the model to focus on the most relevant question features with respect to paragraph words, and breaks through the limited encoding-extraction ability of a single model. Wang and Jiang [71] leveraged a textual entailment model, Match-LSTM [85], to construct the attention processing. Xiong et al. [86] used a co-attention encoder to build co-dependent representations of the question and the document, and a dynamic pointer decoder to predict the answer span. Seo et al. proposed a six-layer model, BiDAF [87], along with a memory-less attention mechanism to yield representations of the context paragraph at the character level, word level, and contextual level. Gong and Bowman [88] added a multi-hop attention mechanism to BiDAF to solve the problem that a single-pass model cannot reflect on what it has already read.
(ii) Self alignment inside the paragraph to generate self-aware features, namely self-attention, which allows non-adjacent words in the same paragraph to attend to each other, thus alleviating the long-term dependency problem. For example, Wang et al. [89] proposed a self-attention mechanism to refine the question-aware passage representation by matching the passage against itself.

We can find two trends in recent works: (1) the combination of co-attention and self-attention; e.g., DCN+ [90] improved DCN by extending the deep residual co-attention encoder with self-attention, and Yu et al. leveraged the combination of convolutions and self-attention in the embedding and modeling encoders, with a context-query attention layer after the embedding encoder layer [91]. (2) The fusion of features at different levels; e.g., Huang et al. adopted a three-layer fully-aware-attention mechanism to further enhance the feature representation ability of the models [92]. Wang et al. combined the co-attention and self-attention mechanisms, and applied a fusion function to incorporate different levels of features [93]. Hu et al. proposed a re-attending mechanism inside a multi-layer attention architecture, where prior co-attention and self-attention were both considered to fine-tune the current attention [94].
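The core of the co-attention mechanisms above is a question-paragraph similarity matrix that is normalized in both directions. Below is a minimal, BiDAF-flavored sketch; it uses plain dot-product similarity instead of the trilinear function of [87], and the dimensions are illustrative.

```python
import torch
import torch.nn.functional as F

def co_attention(P: torch.Tensor, Q: torch.Tensor) -> torch.Tensor:
    """P: (batch, m, d) paragraph states; Q: (batch, n, d) question states.
    Returns question-aware paragraph representations."""
    S = torch.bmm(P, Q.transpose(1, 2))          # (batch, m, n) similarities
    # Paragraph-to-question: each paragraph word attends over question words.
    p2q = torch.bmm(F.softmax(S, dim=-1), Q)     # (batch, m, d)
    # Question-to-paragraph: attend over paragraph words via max similarity.
    q2p = torch.bmm(F.softmax(S.max(dim=-1).values, dim=-1).unsqueeze(1), P)
    q2p = q2p.expand_as(P)                       # broadcast to every position
    # Fuse, following BiDAF's G = [P; p2q; P * p2q; P * q2p].
    return torch.cat([P, p2q, P * p2q, P * q2p], dim=-1)  # (batch, m, 4d)

G = co_attention(torch.randn(2, 50, 64), torch.randn(2, 12, 64))
print(G.shape)  # torch.Size([2, 50, 256])
```

A self-attention variant simply sets Q = P (masking each position against itself), which is what lets non-adjacent paragraph words exchange information.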
3) AGGREGATION PREDICTION LAYER
In this layer, aggregation vectors are generated to predict candidate answers. We mainly focus on the following parts.
• Aggregation strategies. Aggregation strategies vary across network frameworks. BiDAF [87] and Multi-Perspective Matching [95] leveraged Bi-LSTMs for semantic information aggregation. FastQAExt [96] adopted two feed-forward neural networks to generate the probability distributions of the start and end positions of the answers, then used beam search to determine the range of the answers.
• Iterative prediction strategies. DCN [86] consisted of a co-attentive encoder and a dynamic pointing decoder, which adopted a multi-round iteration mechanism. In each round of iteration, the decoder estimated the start and end of the answer span. Based on the prediction of the previous iteration, an LSTM and a Highway Maxout Network were used to update the prediction of the answer span in the next iteration. ReasoNet [97] and Mnemonic Reader [94] used the memory network framework to perform iterative prediction. DCN+ [90] and Reinforced Mnemonic Reader [94] iteratively predicted the start and end positions by reinforcement learning.
• Interference discarding strategies. Discarding interference items dynamically during the prediction process can improve the accuracy and generalization of models, as in DSDR [98] and SAN [99].
• Loss function. Based on the extracted answer span, the loss function is generally defined as the sum of the negative log-probabilities of the start and end positions of the gold answers [49], which can be formulated as follows (a code sketch of both losses is given at the end of subsection III-C):
$$L = -\log \frac{e^{s_a}}{\sum_{i=1}^{n} e^{s_i}} - \log \frac{e^{g_b}}{\sum_{j=1}^{n} e^{g_j}} \tag{6}$$
Here $s_j$ and $g_j$ are the scores for the start and end bounds produced by the model for token $j$, and $a$ and $b$ are the start and end tokens. In multi-paragraph reading comprehension tasks, the reading comprehension model is employed on both negative paragraphs and positive paragraphs, so a no-answer prediction term needs to be added to the loss function as in [49], [100]:
$$L = -\log \frac{(1 - \delta)\,e^{z} + \delta\, e^{s_a} e^{g_b}}{e^{z} + \sum_{i=1}^{n} \sum_{j=1}^{n} e^{s_i} e^{g_j}} \tag{7}$$
Here $\delta$ is 1 if an answer exists and 0 otherwise, and $z$ represents the weight given to a ''no-answer'' possibility.

C. FINAL ANSWER SELECTION
Final answer selection mainly selects the final answer from multiple candidate answers using feature aggregation. Aggregation methods can be divided into the following types.
• Evidence Aggregation. Wang et al. proposed a method of candidate answer re-ranking, mainly based on two types of evidence [53]: (i) replicated evidence: a candidate answer that appears more times in different passages may have a higher probability of being the correct answer; (ii) complementary evidence: aggregating multiple passages can entail multiple aspects of the question, so as to ensure the completeness of the answer. In the inference part, Wang et al. leveraged a classical textual entailment model, Match-LSTM [71], to infer the relevance of the answer spans [53]. Moreover, Lin et al. and Zhong et al. adopted a coarse-to-fine strategy to select related paragraphs and aggregated evidence from them to predict the final answer [48], [101].
• Multi-stage Aggregation. Wang et al. divided the open-domain textual QA task into two stages [47]: candidate paragraph ranking and answer extraction, and jointly optimized the expected losses of the two stages through reinforcement learning. Wang et al. divided reading comprehension into candidate extraction and answer selection, jointly trained the two-stage process in an end-to-end model, and made improvements on the final prediction [102]. Pang et al. and Wang et al. divided the open-domain textual QA task into reading extraction and answer selection; they leveraged a beam search strategy to find the final answer with maximum probability considering both stages [103], [104]. Hu et al. proposed an end-to-end open-domain textual QA model, which contains retrieving, reading, and reranking modules [38].
• Fusion of Knowledge Bases and Text. Recently, several works have attempted to incorporate external knowledge to improve performance on a variety of tasks, such as [105] for natural language inference, [106] for the cloze-style QA task, and [107] for the multi-hop QA task. Sun et al. proposed a method to fuse multi-source information at an early stage to improve the overall QA task [108]. Weissenborn et al. proposed an architecture to dynamically integrate explicit background knowledge into Natural Language Understanding models [109].
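Returning to the loss functions of subsection III-B3, the following is a minimal PyTorch rendering of Eqs. (6) and (7): Eq. (6) is two cross-entropies over start/end scores, and Eq. (7) adds a learned ''no-answer'' score z. The tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def span_loss(s, g, a, b):
    """Eq. (6): s, g are (n,) start/end scores over tokens; a, b are the
    gold start/end positions. Two independent softmax cross-entropies."""
    return -F.log_softmax(s, dim=-1)[a] - F.log_softmax(g, dim=-1)[b]

def span_loss_no_answer(s, g, z, a, b, has_answer):
    """Eq. (7): z is a scalar 'no-answer' score and delta = 1 iff an answer
    exists; the denominator sums e^{s_i} e^{g_j} over all (i, j) span pairs
    plus e^z. (A production version would use logsumexp for stability.)"""
    delta = 1.0 if has_answer else 0.0
    numer = (1 - delta) * torch.exp(z) + delta * torch.exp(s[a] + g[b])
    denom = torch.exp(z) + torch.exp(s[:, None] + g[None, :]).sum()
    return -torch.log(numer / denom)

s, g, z = torch.randn(20), torch.randn(20), torch.randn(())
print(span_loss(s, g, a=3, b=5))
print(span_loss_no_answer(s, g, z, a=3, b=5, has_answer=True))
```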
FIGURE 3. A synthetic example of the LSTM-Jump model. In this example, the maximum jump size is 5, the number of tokens read before a jump is 2, and the number of jumps allowed is 10. The green softmax is used for jump predictions. (Figure source: Yu et al. [54])
D. ACCELERATION METHODS
Although current open-domain textual QA systems have achieved significant advancements, these models become slow and cumbersome [110] with multi-layer [111] and multi-stage [53], [102] architectures along with various features [81], [87], [137]. Moreover, ensemble models are employed to further improve performance, which requires a large amount of computational resources. Open-domain textual QA systems, however, are required to be fast in paragraph index&ranking as well as accurate in answer extraction. Therefore, we would like to discuss some hot topics regarding acceleration methods in this section.

1) MODEL ACCELERATION
Because deep learning models are complex and computationally expensive, automated machine learning (AutoML) technologies have aroused widespread interest in hyperparameter optimization and neural architecture search methods [112]–[114]. However, there is little research on AutoML acceleration for open-domain textual QA systems. In order to reduce complexity while guaranteeing quality, many models have been proposed to accelerate reading processing, namely model acceleration. Hu et al. [115] proposed a knowledge distillation method, which transferred knowledge from an ensemble model to a single model with little loss in performance. In addition, it is known that LSTMs, which are widely used in open-domain textual QA systems [110], are difficult to parallelize and scale due to their sequential nature. Consequently, some researchers replace the recurrent structures [110] or the attention layer [96] with more efficient components, such as the Transformer [83] and SRU [111], or limit the range of co-attention [116].
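As one concrete instance of model acceleration, the sketch below shows the generic knowledge-distillation objective (soft teacher targets with a temperature). It illustrates the idea of transferring an ensemble's knowledge into a single model, not the exact procedure of [115]; the temperature and mixing weight are conventional defaults.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with KL divergence to the teacher's
    softened distribution. T is the temperature, alpha the mixing weight."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# E.g., start-position logits from an ensemble (teacher) and a single reader.
teacher = torch.randn(4, 100)            # averaged ensemble logits
student = torch.randn(4, 100, requires_grad=True)
gold = torch.randint(0, 100, (4,))
distillation_loss(student, teacher, gold).backward()
```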
2) ACTION ACCELERATIONS
There are some works that boost the sequence reading speed while maintaining performance, namely action acceleration. These approaches can dynamically employ actions to speed up reading, such as jumping, skipping, skimming, and early stopping. We illustrate the details from the following perspectives.
• Jump reading determines, from the current word, how many words should be skipped before the next reading. For example, Yu and Liu [54] proposed LSTM-Jump, which was built upon the basics of LSTM networks and reinforcement learning, to determine the number of tokens or sentences to jump. As shown in Fig. 3, the softmax gives out a distribution over the jumping steps between 1 and the maximum jump size. This method can greatly improve reading efficiency, but the decision action can only jump forward, which may be ineffective in complex reasoning tasks. Therefore, Yu et al. [117] proposed an approach to decide whether to skip tokens, re-read the current sentence, or stop reading and output the answer, and LSTM-shuttle [118] proposed a method to either read forward or read back to increase accuracy during speed reading.
• Skim reading determines, according to the current word, whether to skim a token rather than read it fully (see the sketch following this list). Unlike previous methods that use reinforcement learning to make action decisions, skip-rnn [119] adjusted the RNN module to determine whether each step's input is skipped, directly copying the state of the previous hidden layer. However, these earlier methods mainly target sequence reading and classification tasks, with experiments mostly on the cloze-style QA task [31]. Skim-RNN [55] then conducted comparative experiments on reading comprehension tasks. Specifically, Skim-RNN updates only the first few dimensions of the hidden state through a small RNN, trading off between the amount of computation and the discard rate. Moreover, Hansen et al. [120] proposed the first speed reading model including both jump and skip actions.
• Other speed reading applications: JUMPER [36] provided fast reading feedback for legal texts, and Johansen and Socher [121] focused on sentiment classification tasks. Choi et al. [122] tackled long document-oriented QA tasks with CNN-based sentence selection and reading. Hu et al. [38] proposed an early-stopping mechanism to efficiently terminate the encoding process of unrelated paragraphs.
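To make the skim-reading idea concrete, the sketch below updates only the first few hidden dimensions with a small RNN when a token is deemed skimmable; the externally supplied skim mask stands in for the learned discrete decision of Skim-RNN [55], and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class SkimReader(nn.Module):
    """Per token, either the big RNN updates the full hidden state or a
    small RNN updates only its first `small` dimensions (a skim step)."""
    def __init__(self, dim=100, hidden=100, small=20):
        super().__init__()
        self.big = nn.GRUCell(dim, hidden)
        self.small = nn.GRUCell(dim, small)
        self.k = small

    def forward(self, tokens, skim_mask):
        h = tokens.new_zeros(tokens.size(0), self.big.hidden_size)
        for t in range(tokens.size(1)):
            x = tokens[:, t]
            full = self.big(x, h)
            # Skim: cheap update of the first k dims, rest of state unchanged.
            part = torch.cat([self.small(x, h[:, :self.k]), h[:, self.k:]], -1)
            m = skim_mask[:, t].unsqueeze(-1)    # 1 -> skim, 0 -> full read
            h = m * part + (1 - m) * full
        return h

reader = SkimReader()
x = torch.randn(2, 8, 100)                     # (batch, seq, dim)
mask = torch.randint(0, 2, (2, 8)).float()     # stand-in skim decisions
print(reader(x, mask).shape)                   # torch.Size([2, 100])
```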
E. DATASETS
In this subsection, we introduce several datasets related to open-domain textual QA. Owing to the release of these datasets, the development of open-domain textual QA has been greatly promoted.
IV. DISCUSSION
In this section, we summarize the surveyed models, discuss the challenges and limitations of existing works, and point out several promising prospective research directions, which we believe are critical to the present state of the field.

A. SUMMARY OF MODELS
We summarize current hot topics in Figure 4 and categorize the structure of some models in Table 4 according to the technologies illustrated in Section III. Some works are designed for single-document QA settings, such as BiDAF [87], QAnet [91], and SLQA [93], where the ranking stage is not needed. On the other hand, other works need to search and filter the paragraphs from multiple documents in open-domain textual QA settings. So we divide Table 4 into two parts: the upper part for the MRC models and the lower part for the open-domain textual QA models.

TABLE 4. The structure of some models. The top half of the table contains the MRC models, and the bottom half contains the open-domain textual QA models, which include the paragraph ranking stage.

As can be seen from Table 4, most works use IR methods such as TF-IDF and BM25 in the ranking stage. Recently, some works such as ORQA [62] and DFGN [84] adopt BERT [61] to select paragraphs. In the extractive reading stage, most works utilize GloVe embeddings [77], while recent models tend to use pre-trained language models such as ELMo [81] or BERT [61] for text feature encoding. As for the attention mechanism, most works adopt either co-attention or self-attention, or combine both of them to better exchange information between questions and documents. For the aggregation prediction, most works adopt RNN-based approaches (LSTM or GRU), while some recent works leverage BERT [61]. In the final answer selection, multi-stage aggregation is the main solution, while few works adopt the evidence aggregation strategy.

B. CHALLENGES AND LIMITATIONS
We first present the challenges and limitations of open-domain textual QA systems due to the use of deep learning techniques. There are several common limitations
of deep learning techniques [126], which also affect deep learning based open-domain textual QA systems.
• Interpretability. It is well known that the process of deep learning is like a black box. Due to the activation functions and backward derivation, it is hard to analyze the neural network function, which makes the final answer theoretically unpredictable.
• Data Hungry. As mentioned in subsection II-B, deep learning is data-driven, which also brings some challenges [126]. We can also see this in subsection III-E, where the total number of samples in each dataset is larger than 10k. It is very expensive to build large-scale datasets for open-domain textual QA even when annotation tools are provided. For example, the public dataset released by Google [127] consists of 307,373 training examples with single annotations, 7,830 examples with 5-way annotations for development data, and a further 7,842 examples 5-way annotated and sequestered as test data.
• Computing Resource Reliance. In addition to large-scale data, large-scale neural network models are generally employed for the complex processing in deep learning based open-domain textual QA, as mentioned in subsection III-D. However, it is very resource-consuming to train such complex models, while real-time feedback is often required by users of QA systems. In such cases, large-scale computing resources are the basic configuration for training or inference.

We then present several problems from the following three parts, corresponding to Section III.
• Index & Ranking. Recent works usually adopt interactive attention mechanisms to improve the accuracy of ranking. However, this is not beneficial for either efficiency or scalability, since each passage needs to be encoded along with individual questions. Although using BERT [61] or other self-attention pre-training models [82] to extract text features can improve the scalability, running these models over hundreds of paragraphs is computationally costly since these models usually have large sizes and consist of numerous layers. Moreover, using indexable query-agnostic phrase representations can reduce the computational cost while ensuring accuracy in reading comprehension, whereas the accuracy is still low in open-domain textual QA [128].
• Machine Reading Comprehension. Existing extractive reading technology has made great progress, and several reading comprehension models even surpass human performance. However, these MRC models are complex and lack interpretability, which makes it difficult to evaluate the performance and analyze the generalization ability of each neural module. With the improvement of performance along with the increase of model size, it is also a problem that running these models consumes a lot of energy [129]. Moreover, existing models are vulnerable to adversarial attacks [130], making it difficult to deploy them in real-world QA applications.
• Aggregation Prediction. Existing predictive reasoning usually supposes that the answer span only appears in a single paragraph, or that the answer text is short [37]. However, in the real world, the answer span usually appears in several paragraphs or even requires multi-hop inference. How to aggregate evidence across multiple mentioned text snippets to find the answer remains a great challenge.

C. RECENT TRENDS
We summarize several recent trends regarding open-domain textual QA, which are listed as follows.
1) Complex Reasoning. As the datasets get larger, reasoning becomes more complex, and open-domain textual QA systems are increasingly required to reason across multiple paragraphs.
[25] F. M. Suchanek, G. Kasneci, and G. Weikum, ‘‘Yago: A core of semantic knowledge,’’ in Proc. 16th Int. Conf. World Wide Web, New York, NY, USA, 2007, pp. 697–706.
[26] S. Sarawagi and S. Chakrabarti, ‘‘Open-domain quantity queries on Web tables: Annotation, response, and consensus models,’’ in Proc. 20th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2014, pp. 711–720.
[27] D. Downey, S. Dumais, D. Liebling, and E. Horvitz, ‘‘Understanding the relationship between searchers’ queries and information goals,’’ in Proc. 17th ACM Conf. Inf. Knowl. Mining (CIKM), 2008, pp. 449–458.
[28] P. Pasupat and P. Liang, ‘‘Compositional semantic parsing on semi-structured tables,’’ in Proc. Int. Conf. World Wide Web, Aug. 2014, pp. 1–11.
[29] J. Welbl, P. Stenetorp, and S. Riedel, ‘‘Constructing datasets for multi-hop reading comprehension across documents,’’ Trans. Assoc. Comput. Linguistics, vol. 6, pp. 287–302, Dec. 2018.
[30] B. Dhingra, K. Mazaitis, and W. W. Cohen, ‘‘Quasar: Datasets for question answering by search and reading,’’ 2017, arXiv:1707.03904. [Online]. Available: http://arxiv.org/abs/1707.03904
[31] K. M. Hermann, T. Kočiský, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom, ‘‘Teaching machines to read and comprehend,’’ in Proc. Adv. Neural Inf. Process. Syst., Montreal, QC, Canada, Jan. 2015, pp. 1693–1701.
[32] F. Hill, A. Bordes, S. Chopra, and J. Weston, ‘‘The goldilocks principle: Reading children’s books with explicit memory representations,’’ in Proc. Int. Conf. Learn. Represent. (ICLR), 2016, pp. 1–13.
[33] T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, and L. Deng, ‘‘MS MARCO: A human generated machine reading comprehension dataset,’’ in Proc. Workshop Cognit. Comput., Integrating Neural Symbolic Approaches Co-Located 30th Annu. Conf. Neural Inf. Process. Syst. (NIPS), 2016, pp. 1–10.
[34] T. Kočiský, J. Schwarz, P. Blunsom, C. Dyer, K. M. Hermann, G. Melis, and E. Grefenstette, ‘‘The NarrativeQA reading comprehension challenge,’’ Trans. Assoc. Comput. Linguistics, vol. 6, pp. 317–328, Dec. 2018.
[35] G. Balikas, A. Krithara, I. Partalas, and G. Paliouras, ‘‘BioASQ: A challenge on large-scale biomedical semantic indexing and question answering,’’ in Multimodal Retrieval in the Medical Domain. Cham, Switzerland: Springer, 2015, pp. 26–39.
[36] X. Liu, L. Mou, H. Cui, Z. Lu, and S. Song, ‘‘JUMPER: Learning when to make classification decisions in reading,’’ in Proc. 27th Int. Joint Conf. Artif. Intell., Stockholm, Sweden, Jul. 2018, pp. 4237–4243.
[37] D. Chen, A. Fisch, J. Weston, and A. Bordes, ‘‘Reading Wikipedia to answer open-domain questions,’’ in Proc. Annu. Meeting Assoc. Comput. Linguistics (ACL), vol. 1, Vancouver, BC, Canada, 2017, pp. 1870–1879.
[38] M. Hu, Y. Peng, Z. Huang, and D. Li, ‘‘Retrieve, read, rerank: Towards end-to-end multi-document reading comprehension,’’ in Proc. 57th Annu. Meeting Assoc. Comput. Linguistics, Florence, Italy, 2019, pp. 2285–2295.
[39] B. Hu, Z. Lu, H. Li, and Q. Chen, ‘‘Convolutional neural network architectures for matching natural language sentences,’’ in Proc. Adv. Neural Inf. Process. Syst., Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds. Cambridge, MA, USA: Curran Associates, 2014, pp. 2042–2050.
[40] T. Kenter, A. Borisov, C. Van Gysel, M. Dehghani, M. de Rijke, and B. Mitra, ‘‘Neural networks for information retrieval,’’ in Proc. 11th ACM Int. Conf. Web Search Data Mining, New York, NY, USA, 2018, pp. 779–780.
[41] L. Yunjuan, Z. Lijun, M. Lijuan, and M. Qinglin, ‘‘Research and application of information retrieval techniques in intelligent question answering system,’’ in Proc. 3rd Int. Conf. Comput. Res. Develop., Mar. 2011, pp. 188–190.
[42] Y. Hao, Y. Zhang, K. Liu, S. He, Z. Liu, H. Wu, and J. Zhao, ‘‘An end-to-end model for question answering over knowledge base with cross-attention combining global knowledge,’’ in Proc. 55th Annu. Meeting Assoc. Comput. Linguistics, vol. 1, Vancouver, BC, Canada, 2017, pp. 221–231.
[43] B. Katz, S. Felshin, J. J. Lin, and G. Marton, ‘‘Viewing the Web as a virtual database for question answering,’’ in New Directions in Question Answering. Palo Alto, CA, USA: AAAI Press, 2004, ch. 17, pp. 215–226.
[44] E. M. Voorhees, ‘‘The TREC-8 question answering track report,’’ in Proc. Text Retr. Conf. (TREC), 1999, pp. 77–82.
[45] M. Petrochuk and L. Zettlemoyer, ‘‘SimpleQuestions nearly solved: A new upperbound and baseline approach,’’ in Proc. Conf. Empirical Methods Natural Lang. Process., 2018, pp. 554–558.
[46] J. Cheng, L. Dong, and M. Lapata, ‘‘Long short-term memory-networks for machine reading,’’ in Proc. Conf. Empirical Methods Natural Lang. Process., 2016, pp. 551–561.
[47] S. Wang, M. Yu, X. Guo, Z. Wang, T. Klinger, W. Zhang, S. Chang, G. Tesauro, B. Zhou, and J. Jiang, ‘‘R3: Reinforced ranker-reader for open-domain question answering,’’ in Proc. 32nd AAAI Conf. Artif. Intell., New Orleans, LA, USA, 2018, pp. 5981–5988.
[48] Y. Lin, H. Ji, Z. Liu, and M. Sun, ‘‘Denoising distantly supervised open-domain question answering,’’ in Proc. 56th Annu. Meeting Assoc. Comput. Linguistics, vol. 1, 2018, pp. 1736–1745.
[49] C. Clark and M. Gardner, ‘‘Simple and effective multi-paragraph reading comprehension,’’ in Proc. 56th Annu. Meeting Assoc. Comput. Linguistics, vol. 1, 2018, pp. 845–855.
[50] S. Zhang, L. Yao, A. Sun, and Y. Tay, ‘‘Deep learning based recommender system: A survey and new perspectives,’’ ACM Comput. Surv., vol. 52, no. 1, pp. 1–38, Feb. 2019.
[51] J. Chen, X. Qiu, P. Liu, and X. Huang, ‘‘Meta multi-task learning for sequence modeling,’’ in Proc. 32nd AAAI Conf. Artif. Intell., 2018, pp. 5070–5077.
[52] C. Sun, A. Shrivastava, S. Singh, and A. Gupta, ‘‘Revisiting unreasonable effectiveness of data in deep learning era,’’ in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 843–852.
[53] S. Wang, M. Yu, J. Jiang, W. Zhang, X. Guo, S. Chang, Z. Wang, T. Klinger, G. Tesauro, and M. Campbell, ‘‘Evidence aggregation for answer re-ranking in open-domain question answering,’’ in Proc. Int. Conf. Learn. Represent. (ICLR), 2018, pp. 1–14.
[54] A. W. Yu, H. Lee, and Q. Le, ‘‘Learning to skim text,’’ in Proc. 55th Annu. Meeting Assoc. Comput. Linguistics, vol. 1, 2017, pp. 1880–1890.
[55] M. Seo, S. Min, A. Farhadi, and H. Hajishirzi, ‘‘Neural speed reading via skim-RNN,’’ in Proc. Int. Conf. Learn. Represent. (ICLR), 2018, pp. 1–14.
[56] B. Mitra and N. Craswell, ‘‘An introduction to neural information retrieval,’’ Found. Trends Inf. Retr., vol. 13, no. 1, pp. 1–126, Dec. 2018.
[57] K. D. Onal, Y. Zhang, I. S. Altingovde, M. M. Rahman, P. Karagoz, A. Braylan, B. Dang, H.-L. Chang, H. Kim, Q. McNamara, A. Angert, E. Banner, V. Khetan, T. McDonnell, A. T. Nguyen, D. Xu, B. C. Wallace, M. de Rijke, and M. Lease, ‘‘Neural information retrieval: At the end of the early years,’’ Inf. Retr. J., vol. 21, nos. 2–3, pp. 111–182, Jun. 2018.
[58] J. Chu-Carroll, J. Fan, B. K. Boguraev, D. Carmel, D. Sheinwald, and C. Welty, ‘‘Finding needles in the haystack: Search and candidate generation,’’ IBM J. Res. Develop., vol. 56, nos. 3–4, p. 6, May 2012.
[59] S. Kato, R. Togashi, H. Maeda, S. Fujita, and T. Sakai, ‘‘LSTM vs. BM25 for open-domain QA: A hands-on comparison of effectiveness and efficiency,’’ in Proc. 40th Int. ACM SIGIR Conf. Res. Develop. Inf. Retr., New York, NY, USA, 2017, pp. 1309–1312.
[60] M. Seo, T. Kwiatkowski, A. Parikh, A. Farhadi, and H. Hajishirzi, ‘‘Phrase-indexed question answering: A new challenge for scalable document comprehension,’’ in Proc. Conf. Empirical Methods Natural Lang. Process., 2018, pp. 559–564.
[61] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, ‘‘BERT: Pre-training of deep bidirectional transformers for language understanding,’’ in Proc. ACL Conf. NAACL HLT, Minneapolis, MN, USA, Jun. 2019, pp. 4171–4186.
[62] K. Lee, M.-W. Chang, and K. Toutanova, ‘‘Latent retrieval for weakly supervised open domain question answering,’’ in Proc. 57th Annu. Meeting Assoc. Comput. Linguistics, Florence, Italy, 2019, pp. 6086–6096.
[63] E. H. Hovy, L. Gerber, U. Hermjakob, M. Junk, and C. Lin, ‘‘Question answering in Webclopedia,’’ in Proc. 9th Text Retr. Conf. (TREC), 2000, pp. 1–10.
[64] T.-Y. Liu, ‘‘Learning to rank for information retrieval,’’ Found. Trends Inf. Retr., vol. 3, no. 3, pp. 225–331, 2007.
[65] P. Li, C. J. C. Burges, and Q. Wu, ‘‘McRank: Learning to rank using multiple classification and gradient boosting,’’ in Proc. 20th Int. Conf. Neural Inf. Process. Syst., Jul. 2007, pp. 897–904.
[66] K. Crammer and Y. Singer, ‘‘Pranking with ranking,’’ in Advances in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2001, pp. 641–647.
[67] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender, ‘‘Learning to rank using gradient descent,’’ in Proc. 22nd Int. Conf. Mach. Learn. (ICML), 2005, pp. 89–96.
[68] M.-F. Tsai, T.-Y. Liu, T. Qin, H.-H. Chen, and W.-Y. Ma, ‘‘FRank: A ranking method with fidelity loss,’’ in Proc. 30th Annu. Int. ACM SIGIR Conf. Res. Develop. Inf. Retr., Jul. 2007, pp. 383–390.
[69] C. J. C. Burges, R. Ragno, and Q. V. Le, ‘‘Learning to rank with nonsmooth cost functions,’’ in Proc. 19th Int. Conf. Neural Inf. Process. Syst., Jun. 2006, pp. 193–200.
[70] M. Taylor, J. Guiver, S. Robertson, and T. Minka, ‘‘SoftRank: Optimizing non-smooth rank metrics,’’ in Proc. Int. Conf. Web Search Data Mining, Aug. 2008, pp. 77–86.
[71] S. Wang and J. Jiang, ‘‘Machine comprehension using match-LSTM and answer pointer,’’ in Proc. Int. Conf. Learn. Represent. (ICLR), 2017, pp. 1–3.
[72] M. Tan, C. dos Santos, B. Xiang, and B. Zhou, ‘‘Improved representation learning for question answer matching,’’ in Proc. 54th Annu. Meeting Assoc. Comput. Linguistics, vol. 1, 2016.
[73] A. Shrivastava and P. Li, ‘‘Improved asymmetric locality sensitive hashing (ALSH) for maximum inner product search (MIPS),’’ in Proc. 31st Conf. Uncertainty Artif. Intell., Arlington, VA, USA, 2015, pp. 812–821.
[74] J. Johnson, M. Douze, and H. Jégou, ‘‘Billion-scale similarity search with GPUs,’’ 2017, arXiv:1702.08734. [Online]. Available: http://arxiv.org/abs/1702.08734
[75] P. M. Htut, S. Bowman, and K. Cho, ‘‘Training a ranking function for open-domain question answering,’’ in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics: Student Res. Workshop, 2018, pp. 120–127.
[76] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes, ‘‘Supervised learning of universal sentence representations from natural language inference data,’’ in Proc. Conf. Empirical Methods Natural Lang. Process., 2017, pp. 670–680.
[77] J. Pennington, R. Socher, and C. Manning, ‘‘GloVe: Global vectors for word representation,’’ in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), 2014, pp. 1532–1543.
[78] C. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. Bethard, and D. McClosky, ‘‘The Stanford CoreNLP natural language processing toolkit,’’ in Proc. 52nd Annu. Meeting Assoc. Comput. Linguistics: Syst. Demonstrations, 2014, pp. 55–60.
[79] M. Gardner, J. Grus, M. Neumann, O. Tafjord, P. Dasigi, N. F. Liu, M. Peters, M. Schmitz, and L. Zettlemoyer, ‘‘AllenNLP: A deep semantic natural language processing platform,’’ in Proc. Workshop NLP Open Source Softw. (NLP-OSS), 2018, pp. 1–6.
[80] S. Bird and E. Loper, ‘‘NLTK: The natural language toolkit,’’ in Proc. ACL Interact. Poster Demonstration Sessions, 2004, pp. 1–5.
[81] M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, ‘‘Deep contextualized word representations,’’ in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics: Hum. Lang. Technol., vol. 1, 2018, pp. 2227–2237.
[82] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, ‘‘Improving language understanding with unsupervised learning,’’ OpenAI, San Francisco, CA, USA, Tech. Rep., 2018. [Online]. Available: https://openai.com/blog/language-unsupervised/
[83] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, ‘‘Attention is all you need,’’ in Proc. Adv. Neural Inf. Process. Syst. (NIPS), vol. 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., 2017, pp. 5998–6008.
[84] L. Qiu, Y. Xiao, Y. Qu, H. Zhou, L. Li, W. Zhang, and Y. Yu, ‘‘Dynamically fused graph network for multi-hop reasoning,’’ in Proc. 57th Annu. Meeting Assoc. Comput. Linguistics, 2019, pp. 6140–6150.
[85] S. Wang and J. Jiang, ‘‘Learning natural language inference with LSTM,’’ in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics: Hum. Lang. Technol., 2016, pp. 1442–1451.
[86] C. Xiong, V. Zhong, and R. Socher, ‘‘Dynamic coattention networks for question answering,’’ in Proc. Int. Conf. Learn. Represent. (ICLR), 2017, pp. 1–11.
[87] M. J. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi, ‘‘Bidirectional attention flow for machine comprehension,’’ in Proc. Int. Conf. Learn. Represent. (ICLR), 2017.
[92] H.-Y. Huang, C. Zhu, Y. Shen, and W. Chen, ‘‘FusionNet: Fusing via fully-aware attention with application to machine comprehension,’’ in Proc. Int. Conf. Learn. Represent. (ICLR), 2018, pp. 1–20.
[93] W. Wang, M. Yan, and C. Wu, ‘‘Multi-granularity hierarchical attention fusion networks for reading comprehension and question answering,’’ in Proc. 56th Annu. Meeting Assoc. Comput. Linguistics, vol. 1, 2018, pp. 1705–1714.
[94] M. Hu, Y. Peng, Z. Huang, X. Qiu, F. Wei, and M. Zhou, ‘‘Reinforced mnemonic reader for machine reading comprehension,’’ in Proc. 27th Int. Joint Conf. Artif. Intell., Jul. 2018, pp. 4099–4106.
[95] Z. Wang, H. Mi, W. Hamza, and R. Florian, ‘‘Multi-perspective context matching for machine comprehension,’’ 2016, arXiv:1612.04211. [Online]. Available: http://arxiv.org/abs/1612.04211
[96] D. Weissenborn, G. Wiese, and L. Seiffe, ‘‘Making neural QA as simple as possible but not simpler,’’ in Proc. 21st Conf. Comput. Natural Lang. Learn. (CoNLL), 2017, pp. 271–280.
[97] Y. Shen, P.-S. Huang, J. Gao, and W. Chen, ‘‘ReasoNet: Learning to stop reading in machine comprehension,’’ in Proc. 23rd ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, New York, NY, USA, 2017, pp. 1047–1055.
[98] X. Wang, Z. Huang, Y. Zhang, L. Tan, and Y. Liu, ‘‘DSDR: Dynamic semantic discard reader for open-domain question answering,’’ in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Jul. 2018, pp. 1–7.
[99] X. Liu, Y. Shen, K. Duh, and J. Gao, ‘‘Stochastic answer networks for machine reading comprehension,’’ in Proc. 56th Annu. Meeting Assoc. Comput. Linguistics, vol. 1, 2018, pp. 1694–1704.
[100] M. Hu, F. Wei, Y. Peng, Z. Huang, N. Yang, and D. Li, ‘‘Read + verify: Machine reading comprehension with unanswerable questions,’’ Proc. AAAI Conf. Artif. Intell., vol. 33, pp. 6529–6537, Jul. 2019.
[101] V. Zhong, C. Xiong, N. Keskar, and R. Socher, ‘‘Coarse-grain fine-grain coattention network for multi-evidence question answering,’’ in Proc. Int. Conf. Learn. Represent. (ICLR), 2019, pp. 1–27.
[102] Z. Wang, J. Liu, X. Xiao, Y. Lyu, and T. Wu, ‘‘Joint training of candidate extraction and answer selection for reading comprehension,’’ in Proc. 56th Annu. Meeting Assoc. Comput. Linguistics, vol. 1, 2018, pp. 1715–1724.
[103] L. Pang, Y. Lan, J. Guo, J. Xu, L. Su, and X. Cheng, ‘‘HAS-QA: Hierarchical answer spans model for open-domain question answering,’’ in Proc. AAAI Conf. Artif. Intell., vol. 33, Jul. 2019, pp. 6875–6882.
[104] Y. Wang, K. Liu, J. Liu, W. He, Y. Lyu, H. Wu, S. Li, and H. Wang, ‘‘Multi-passage machine reading comprehension with cross-passage answer verification,’’ in Proc. 56th Annu. Meeting Assoc. Comput. Linguistics, 2018, pp. 1918–1927.
[105] Q. Chen, X. Zhu, Z.-H. Ling, D. Inkpen, and S. Wei, ‘‘Neural natural language inference models enhanced with external knowledge,’’ in Proc. 56th Annu. Meeting Assoc. Comput. Linguistics, vol. 1, Melbourne, NSW, Australia, 2018, pp. 2406–2417.
[106] T. Mihaylov and A. Frank, ‘‘Knowledgeable reader: Enhancing cloze-style reading comprehension with external commonsense knowledge,’’ in Proc. 56th Annu. Meeting Assoc. Comput. Linguistics, vol. 1, Melbourne, NSW, Australia, 2018, pp. 821–832.
[107] L. Bauer, Y. Wang, and M. Bansal, ‘‘Commonsense for generative multi-hop question answering tasks,’’ in Proc. Conf. Empirical Methods Natural Lang. Process., Brussels, Belgium, 2018, pp. 1–32.
[108] H. Sun, B. Dhingra, M. Zaheer, K. Mazaitis, R. Salakhutdinov, and W. Cohen, ‘‘Open domain question answering using early fusion of knowledge bases and text,’’ in Proc. Conf. Empirical Methods Natural Lang. Process., Brussels, Belgium, 2018.
Process., 2018, pp. 4231–4242.
Represent. (ICLR), Toulon, France, 2017, pp. 1–14.
[88] Y. Gong and S. Bowman, ‘‘Ruminating reader: Reasoning with gated [109] D. Weissenborn, T. Kocišký, and C. Dyer, ‘‘Dynamic integration of
multi-hop attention,’’ in Proc. Workshop Mach. Reading Question Answer- background knowledge in neural NLU systems,’’ 2017, arXiv:1706.02596.
ing, 2018, pp. 1–11. [Online]. Available: http://arxiv.org/abs/1706.02596
[89] W. Wang, N. Yang, F. Wei, B. Chang, and M. Zhou, ‘‘Gated self- [110] D. Chen, ‘‘Neural reading comprehension and beyond,’’ Ph.D. disserta-
matching networks for reading comprehension and question answering,’’ tion, Dept. Comput. Sci., Stanford Univ., Stanford, CA, USA, 2018.
in Proc. 55th Annu. Meeting Assoc. Comput. Linguistics, vol. 1, 2017, [111] T. Lei, Y. Zhang, S. I. Wang, H. Dai, and Y. Artzi, ‘‘Simple recurrent units
pp. 189–198. for highly parallelizable recurrence,’’ in Proc. Conf. Empirical Methods
[90] C. Xiong, V. Zhong, and R. Socher, ‘‘DCN+: Mixed objective and deep Natural Lang. Process., 2018, pp. 4470–4481.
residual coattention for question answering,’’ in Proc. Int. Conf. Learn. [112] M. Feurer, A. Klein, K. Eggensperger, J. T. Springenberg, M. Blum, and
Represent. (ICLR), 2018, pp. 1–10. F. Hutter, ‘‘Efficient and robust automated machine learning,’’ in Proc. Adv.
[91] A. W. Yu, D. Dohan, Q. Le, T. Luong, R. Zhao, and K. Chen, Neural Inf. Process. Syst. Annu. Conf. Neural Inf. Process. Syst., 2015,
‘‘QANet: Combining local convolution with global self-attention for pp. 2962–2970.
reading comprehension,’’ in Proc. Int. Conf. Learn. Represent. (ICLR), [113] F. Hutter, L. Kotthoff, and J. Vanschoren, Eds., Efficient and Robust
2018, pp. 1–16. Automated Machine Learning. Berlin, Germany: Springer, 2018.
[114] S. Estevez-Velarde, Y. Gutiérrez, A. Montoyo, and Y. Almeida-Cruz, ‘‘AutoML strategy based on grammatical evolution: A case study about knowledge discovery from text,’’ in Proc. 57th Annu. Meeting Assoc. Comput. Linguistics, Florence, Italy, 2019, pp. 4356–4365.
[115] M. Hu, Y. Peng, F. Wei, Z. Huang, D. Li, N. Yang, and M. Zhou, ‘‘Attention-guided answer distillation for machine reading comprehension,’’ in Proc. Conf. Empirical Methods Natural Lang. Process., 2018, pp. 2077–2086.
[116] S. Min, V. Zhong, R. Socher, and C. Xiong, ‘‘Efficient and robust question answering from minimal context over documents,’’ in Proc. 56th Annu. Meeting Assoc. Comput. Linguistics, vol. 1, 2018, pp. 1725–1735.
[117] K. Yu, Y. Liu, A. G. Schwing, and J. Peng, ‘‘Fast and accurate text classification: Skimming, rereading and early stopping,’’ in Proc. ICLR Workshop, 2018, pp. 1–12.
[118] T.-J. Fu and W.-Y. Ma, ‘‘Speed reading: Learning to read ForBackward via shuttle,’’ in Proc. Conf. Empirical Methods Natural Lang. Process., 2018, pp. 4439–4448.
[119] V. Campos, B. Jou, X. Giró-i-Nieto, J. Torres, and S.-F. Chang, ‘‘Skip RNN: Learning to skip state updates in recurrent neural networks,’’ in Proc. Int. Conf. Learn. Represent. (ICLR), 2018, pp. 1–17.
[120] C. Hansen, C. Hansen, S. Alstrup, J. G. Simonsen, and C. Lioma, ‘‘Neural speed reading with structural-JUMP-LSTM,’’ in Proc. Int. Conf. Learn. Represent. (ICLR), 2019, pp. 1–10.
[121] A. Johansen and R. Socher, ‘‘Learning when to skim and when to read,’’ in Proc. 2nd Workshop Represent. Learn. NLP, 2017, pp. 257–264.
[122] E. Choi, D. Hewlett, J. Uszkoreit, I. Polosukhin, A. Lacoste, and J. Berant, ‘‘Coarse-to-fine question answering for long documents,’’ in Proc. 55th Annu. Meeting Assoc. Comput. Linguistics, vol. 1, 2017, pp. 209–220.
[123] R. Das, S. Dhuliawala, M. Zaheer, and A. McCallum, ‘‘Multi-step retriever-reader interaction for scalable open-domain question answering,’’ in Proc. Int. Conf. Learn. Represent. (ICLR), 2019, pp. 1–13.
[124] S. Back, S. Yu, S. R. Indurthi, J. Kim, and J. Choo, ‘‘MemoReader: Large-scale reading comprehension through neural memory controller,’’ in Proc. Conf. Empirical Methods Natural Lang. Process., 2018, pp. 2131–2140.
[125] Y. Zhuang and H. Wang, ‘‘Token-level dynamic self-attention network for multi-passage reading comprehension,’’ in Proc. 57th Annu. Meeting Assoc. Comput. Linguistics, Florence, Italy, 2019, pp. 2252–2262.
[126] G. Marcus, ‘‘Deep learning: A critical appraisal,’’ 2018, arXiv:1801.00631. [Online]. Available: http://arxiv.org/abs/1801.00631
[127] T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, M. Kelcey, J. Devlin, K. Lee, K. N. Toutanova, L. Jones, M.-W. Chang, A. Dai, J. Uszkoreit, Q. Le, and S. Petrov, ‘‘Natural questions: A benchmark for question answering research,’’ Trans. Assoc. Comput. Linguistics, vol. 7, pp. 453–466, Aug. 2019.
[128] M. Seo, J. Lee, T. Kwiatkowski, A. Parikh, A. Farhadi, and H. Hajishirzi, ‘‘Real-time open-domain question answering with dense-sparse phrase index,’’ in Proc. 57th Annu. Meeting Assoc. Comput. Linguistics, Florence, Italy, 2019, pp. 4430–4441.
[129] E. Strubell, A. Ganesh, and A. McCallum, ‘‘Energy and policy considerations for deep learning in NLP,’’ in Proc. 57th Annu. Meeting Assoc. Comput. Linguistics, Florence, Italy, 2019, pp. 1–6.
[130] E. Wallace, S. Feng, N. Kandpal, M. Gardner, and S. Singh, ‘‘Universal adversarial triggers for attacking and analyzing NLP,’’ in Proc. Conf. Empirical Methods Natural Lang. Process. 9th Int. Joint Conf. Natural Lang. Process. (EMNLP-IJCNLP), 2019, pp. 2153–2162.
[131] Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. Cohen, R. Salakhutdinov, and C. D. Manning, ‘‘HotpotQA: A dataset for diverse, explainable multi-hop question answering,’’ in Proc. Conf. Empirical Methods Natural Lang. Process., 2018, pp. 1–12.
[132] D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner, ‘‘DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs,’’ in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics (NAACL), 2019, pp. 1–12.
[133] M. Ding, C. Zhou, Q. Chen, H. Yang, and J. Tang, ‘‘Cognitive graph for multi-hop reading comprehension at scale,’’ in Proc. 57th Annu. Meeting Assoc. Comput. Linguistics, Florence, Italy, 2019, pp. 2694–2703.
[134] M. Tu, G. Wang, J. Huang, Y. Tang, X. He, and B. Zhou, ‘‘Multi-hop reading comprehension across multiple documents by reasoning over heterogeneous graphs,’’ in Proc. 57th Annu. Meeting Assoc. Comput. Linguistics, Florence, Italy, 2019, pp. 2704–2713.
[135] M. Hu, Y. Peng, Z. Huang, and D. Li, ‘‘A multi-type multi-span network for reading comprehension that requires discrete reasoning,’’ in Proc. Conf. Empirical Methods Natural Lang. Process. 9th Int. Joint Conf. Natural Lang. Process. (EMNLP-IJCNLP), 2019, pp. 1596–1606.
[136] S. Yu, S. R. Indurthi, S. Back, and H. Lee, ‘‘A multi-stage memory augmented neural network for machine reading comprehension,’’ in Proc. Workshop Mach. Reading Question Answering, 2018, pp. 21–30.
[137] T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsch, and A. Joulin, ‘‘Advances in pre-training distributed word representations,’’ in Proc. 11th Int. Conf. Lang. Resour. Eval. (LREC). Miyazaki, Japan: European Language Resources Association (ELRA), May 2018. [Online]. Available: https://www.aclweb.org/anthology/L18-1008
ZHEN HUANG was born in Hunan, China, in 1984. He received the B.S. and Ph.D. degrees from the National University of Defense Technology (NUDT), in 2006 and 2012, respectively. He was a visiting student with Eurecom, in 2009. From 2012 to 2016, he was an Assistant Professor with the Science and Technology on Parallel and Distributed Laboratory (PDL), NUDT, where he is currently an Associate Professor. He is the author of more than 40 articles. His research interests include natural language processing, distributed storage, and artificial intelligence. His Ph.D. thesis received the Excellent Doctoral Thesis Award of Hunan Province. He also received the Best Paper Award at ICCCT, in 2011.

SHIYI XU was born in Hubei, China, in 1991. She received the B.E. degree from Minnan Normal University, China, in 2013. She is currently pursuing the master's degree with the Science and Technology on Parallel and Distributed Laboratory (PDL), National University of Defense Technology (NUDT), Changsha, China. Her research interests include natural language processing and artificial intelligence.

MINGHAO HU received the M.S. degree from the National University of Defense Technology (NUDT), in 2016, where he is currently pursuing the Ph.D. degree. He has published articles in top-tier conferences, such as ACL, EMNLP, AAAI, and IJCAI. His research interests include natural language processing and machine reading comprehension.

XINYI WANG was born in China, in 1995. She received the B.E. degree from the National University of Defense Technology (NUDT), where she is currently pursuing the master's degree with the Science and Technology on Parallel and Distributed Laboratory (PDL). Her research interests include natural language processing.

JINYAN QIU received the M.S. degree in computer science and technology from the National University of Defense Technology (NUDT), in 2008. He is currently an Assistant Engineer with the H.R. Support Center. His research interests include deep learning and big data.

YONGQUAN FU received the M.S. and Ph.D. degrees in computer science and technology from the National University of Defense Technology (NUDT), in 2007 and 2012, respectively. He is currently an Associate Professor with NUDT. His research interests include network machine learning and distributed systems.

YUXING PENG was born in 1963. He received the bachelor's degree in computer science from the Beijing University of Aeronautics and Astronautics, and the M.S. and Ph.D. degrees from the National University of Defense Technology (NUDT). He was a Head Coach of the school's ACM programming contest teams. He is currently a Researcher in computer science and a Ph.D. Supervisor with the Science and Technology on Parallel and Distributed Laboratory (PDL), NUDT. He has trained more than 50 gold medal winners and more than 70 silver medal winners in international contests. His research interests include distributed computing, virtual computing environments, cloud computing, big data, and intelligent computing. He received the Gold Medal of the Military Academy Talents Cultivation Award, in 2010, the Excellent Doctoral Thesis Mentor Award of Hunan Province, in 2013, and the ACM ICPC World Finals Outstanding Coach Award, in 2015.

YUNCAI ZHAO was born in Hunan, China, in 1975. He received the bachelor's degree from the Naval University of Engineering, in 1994. He is currently a Senior Engineer with Unit 31011, PLA. His research interests include artificial intelligence, international relations, and international strategy.

CHANGJIAN WANG received the B.S., M.S., and Ph.D. degrees in computer science and technology from the National University of Defense Technology (NUDT), Changsha, China. He is currently an Associate Professor with NUDT. His research interests include databases, distributed computing, cloud computing, big data, and machine learning.