1 Introduction
Over the past decade, knowledge graphs have been increasingly adopted to structure
and describe data in various fields such as education, biology [1] or social media [2].
These knowledge graphs are often composed of millions or billions of nodes and
edges, and are published in the Resource Description Framework (RDF). However,
querying such knowledge graphs requires specialized knowledge in query languages
such as SPARQL as well as deep understanding of the underlying structure of these
graphs. Hence, a wide range of end-users without deep knowledge of these technical
concepts is excluded from querying these knowledge graphs effectively.
This drawback has triggered the design of natural language interfaces to knowl-
edge graphs to enable non-tech savvy users to query ever more complex data [3–5].
With the development of the Semantic Web, a large amount of new structured
data has become available in the form of knowledge graphs on the web. Hence,
natural language interfaces and in particular Question-Answering (QA) systems
over knowledge graphs have gained importance [6].
Even though these QA systems significantly improve the usability of knowledge
graphs for non-technical users, they are far from perfect. Translating from natural
language to SPARQL is a hard problem due to the ambiguity of the natural lan-
guage. For instance, the word “Zurich” could refer to the city of Zurich, the canton
of Zurich or the company “Zurich Financial Services”. To provide the correct result,
a QA system needs to understand the users’ intention. Moreover, knowledge graphs
are typically very complex and thus exhaustively enumerating all possible answer
combinations is often prohibitive.
To extract answers from a given knowledge graph, QA systems usually translate
natural language questions into a formal representation of a query by using tech-
niques from natural language processing, databases, information retrieval, machine
learning and the Semantic Web [7]. However, the accuracy of these systems still
needs to be improved and a significant amount of work is required to make these
systems practical in the real-world [8].
In order to tackle this hard challenge of translating from natural language to
SPARQL, one approach is to break down the problem into smaller, more man-
ageable sub-problems. In particular, we can conceptualize the problem as a linear,
modular pipeline with components like Named Entity Recognition (NER), Relation
Extraction (RE) and Query Generation (QG). Consider, for instance, the query
“How many people work in Zurich?”. The NER-component recognizes “Zurich”
as an entity which could have the three meanings as mentioned above. The RE-
component recognizes the relation “works”. Finally, the QG-component needs to
generate a SPARQL query by taking into account the structure of the knowledge
graphs. However, most implementations of QA systems over knowledge graphs are
not subdivided into such independent components [9]. Recently, truly modular QA
systems have been introduced so that components can be reused and exchanged.
This modular approach enables different research efforts to tackle parts of the over-
all challenge.
In this paper we provide a novel modular implementation of a QA system. We
build on the modular design of the Frankenstein framework [10] and the SPARQL
Query Generator (SQG) [9]. At the most basic level, our system is structured into
two parts: one part is knowledge graph-dependent, while the other part is knowledge
graph-independent (see Fig. 1). The basic idea is to break the task of translating
from natural language to SPARQL into the following components: (1) Question
analysis, i.e. syntactic parsing. (2) Question type classification, i.e. is it a yes/no
question or a count question? (3) Phrase mapping, i.e. mapping of entities and re-
lationships in the natural language to the corresponding entities and relationships
in the knowledge graph. (4) Query generation, i.e. constructing a SPARQL query based
on the entities and relationships identified in the knowledge graph. (5) Query rank-
ing, i.e. ranking the most relevant query highest. Details are discussed in Section
3.
To sum up, we introduce a novel modular implementation of a QA system based
on the Frankenstein framework [10] and the SPARQL Query Generator (SQG) [9].
Through a careful design, including the choice of components, our system outper-
forms the state of the art while requiring minimal training data. More specifically,
we make the following main contributions:
• We subdivide our QA system into knowledge graph-dependent and knowledge
graph-independent modules. In this way, our QA system can easily be ap-
plied to previously unseen data domains. In particular, the independent modules
(question type classification model and the query generation model) do not
require any domain-specific knowledge.
• Large and representative training data sets for RDF graphs are hard to
devise [11]. In our system only the modules for question type classification
and query ranking require training. As pointed out above, both modules are
knowledge graph-independent. Consequently, they can be trained on general-
purpose datasets. A training set of a few hundred queries has been shown to
be sufficient in our experiments (see Section 4).
• In contrast to previous systems we use an ensemble method for phrase map-
ping. Moreover, question type classification is performed by a Random Forest
Classifier, which outperforms previous methods.
• We extended the query generation algorithm in [9] to include more complex
queries. Our system includes query ranking with Tree-structured Long Short-
Term Memory (Tree-LSTM) [12] to sort candidate queries according to the
similarity of their syntactic and semantic structure to that of the input question.
• We show that our QA system outperforms the state-of-the-art systems by 15% on
the QALD-7 dataset and by 48% on the LC-QuAD dataset, respectively.
• We make our source code available (see Section 4).
The paper is organized as follows. Section 2 gives an overview of related work
on QA systems over knowledge graphs. Section 3 describes the architecture of our
proposed system. Section 4 provides a detailed experimental evaluation including a
comparison against state-of-the-art systems. Finally, Section 5 concludes the paper
and gives directions for future research.
2 Related Work
Since our paper proposes a solution for querying knowledge graphs, we will now
review the major work on QA systems over knowledge graphs such as [10, 13–
15]. In particular, we focus our discussions on systems that are most relevant for
understanding the contributions of our proposed QA system.
ganswer2 [13] answers natural language questions through a graph data-driven
solution composed of offline and online phases. In the offline phase, the semantic
equivalence between relation phrases and predicates is obtained through a graph
mining algorithm. Afterwards a paraphrase dictionary is built to record the obtained
semantic equivalence. The online phase contains a question understanding stage and
a query evaluation stage. In the question understanding stage, a semantic query graph
is built to represent the user’s intention by extracting semantic relations from the
dependency tree of the natural language question based on the previously built para-
phrase dictionary. Afterwards, a subgraph of the knowledge graph, which matches
the semantic query graph through subgraph isomorphism, is selected. The final an-
swer is returned based on the selected subgraph in the query evaluation stage. In
contrast to ganswer2, our proposed system is component based. Our framework can
be decomposed into independent components and therefore the overall accuracy can
be improved by enhancing each component individually. As a result, our proposed
system is much more flexible in terms of adapting to new techniques for question
understanding and query evaluation.
WDAqua [14] is a QA component which can answer questions over DBpedia and
Wikidata through both full natural language queries and keyword queries. In ad-
dition, WDAqua supports four different languages over Wikidata, namely English,
French, German and Italian. WDAqua uses a rule-based combinatorial approach
which constructs SPARQL queries based on the semantics encoded in the underlying
knowledge base. As a result, WDAqua does not use a machine learning algorithm to
translate natural language questions into SPARQL queries. Hence, WDAqua does
not suffer from over-fitting problems. However, due to the limitations of human-
defined transformation rules, the coverage and diversity of the generated SPARQL
queries are limited. For instance, the generated SPARQL queries contain at most
two triple patterns. Moreover, the modifiers in the generated queries are limited
to the ‘COUNT’ operator. Adding a new operator in the generated queries would
require significant work in designing the transformation rules. For machine
learning-based systems, in contrast, collecting new question-answer pairs would suffice.
WDAqua-core1 [15] constructs queries in four consecutive steps: question expan-
sion, query construction, query ranking and answer decision. In the first step, all
possible entities, properties and classes in the question are identified through lexi-
calization. Then, a set of queries is constructed based on the combinations of the
previously identified entities, properties and classes in four manually defined pat-
terns. In the third step, the candidate queries are ranked based on five features
including the number of variables and triples in the query, the number of the words
in the question which are covered by the query, the sum of the relevance of the
resources and the edit distance between the resource and the word. In the last step,
logistic regression is used to determine whether the user’s intention is reflected in
the whole candidate list and whether the answer is correct. There are two main
differences between our proposed system and WDAqua-core1. Firstly, we use an
ensemble method of state-of-the-art entity detection methods instead of using lexi-
calization. Therefore, the coverage of identified intentions is improved enormously.
In addition, we use a Tree-LSTM to compute the similarity between NL questions
and SPARQL queries as the ranking score instead of the five simple features selected
by the authors of [15]. Hence, the final selected query is more likely to express the
true intention of the question and extract the right answer.
Frankenstein [10] decomposes the problem into several QA component tasks and
builds the whole QA pipeline by integrating 29 state-of-the-art QA components.
Frankenstein first extracts features such as question length, answer type, special
words and part-of-speech (POS) tags from the input questions. Afterwards, a QA
optimization algorithm is implemented in two steps to automatically build the final
QA pipeline by selecting the best performing QA components from the 29 reusable
QA components based on the questions. In the first step, the performance of each
component is predicted based on the question features and then the best perform-
ing QA components are selected based on the predicted performance. In the second
step, the QA pipeline is dynamically generated based on the selected components
and answers are returned by executing the generated QA pipeline. Compared to
Frankenstein, our proposed system uses an ensemble method instead of only select-
ing the best performing QA component. What is more, we use an improved version
of the query construction component [9] rather than selecting among the currently
published QA components.
3 Methods
Here we describe the details of our proposed QA system. In particular, our system
translates natural language questions to SPARQL queries in five steps (see Fig. 1 in
Section 1). At each step, a relevant task is solved independently by one individual
software component. First, the input question is processed by the question analysis
component, based solely on syntactic features. Afterwards, the type of the question
is identified and phrases in the question are mapped to corresponding resources and
properties in the underlying RDF knowledge graph. A number of SPARQL queries
are generated based on the mapped resources and properties. A ranking model
based on Tree-structured Long Short-Term Memory (Tree-LSTM) [12] is applied to
sort the candidate queries according to the similarity between their syntactic and
semantic structure relative to the input question. Finally, answers are returned to
the user by executing the generated query against the underlying knowledge graph.
In the proposed architecture, only the Phrase Mapping is dependent on the specific
underlying knowledge graph because it requires the concrete resources, properties
and classes. All other components are independent of the underlying knowledge
graph and therefore can be applied to another knowledge domain without being
modified.
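To make the data flow between these components concrete, the following sketch outlines the pipeline as a small orchestration class. It is a minimal illustration in Python; the component names and signatures are placeholders, not the exact interfaces of our implementation.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class QAPipeline:
    # Illustrative outline of the five-step pipeline; every callable is a
    # placeholder for the corresponding module described above.
    analyze: Callable        # (1) question analysis: lemmas and dependency parse tree
    classify_type: Callable  # (2) question type: 'List', 'Count' or 'Boolean'
    map_phrases: Callable    # (3) phrase mapping (the only knowledge graph-dependent step)
    generate: Callable       # (4) candidate SPARQL query generation
    rank: Callable           # (5) Tree-LSTM-based query ranking
    execute: Callable        # runs a SPARQL query against the knowledge graph

    def answer(self, question: str) -> List[str]:
        analysis = self.analyze(question)
        question_type = self.classify_type(analysis)
        mappings = self.map_phrases(analysis)
        candidates = self.generate(mappings, question_type)
        best_query = self.rank(question, candidates)
        return self.execute(best_query)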
The question analysis component relies on purely syntactic features to recognize the
named entities, to identify the relations between the tokens and, finally, to determine
the dependency label of each question component [2].
Moreover, the questions are lemmatized and a dependency parse tree is generated.
The resulting lemma representation and the dependency parse tree are used later
for question classification and query ranking.
The goal of lemmatization is to reduce the inflectional forms of a word to a common
base form. For instance, a question “Who is the mayor of the capital of French
Polynesia?” can be converted to the lemma representation as “Who be the mayor
of the capital of French Polynesia?”.
Dependency parsing is the process of analyzing the syntactic structure of a sen-
tence to establish semantic relationships between its components. The dependency
parser generates a dependency parse tree [16] that contains typed labels denot-
ing the grammatical relationships for each word in the sentence (see Fig. 2 for an
example).
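For illustration, both the lemma representation and the dependency parse can be produced with an off-the-shelf parser such as spaCy, whose parser is described in [16]; the snippet below is a sketch of the idea rather than the exact tooling and model used.

import spacy

nlp = spacy.load("en_core_web_sm")  # English pipeline with tagger, parser and lemmatizer
doc = nlp("Who is the mayor of the capital of French Polynesia?")

# Lemma representation of the question (cf. the example above).
lemmas = " ".join(token.lemma_ for token in doc)

# Dependency parse: one (token, dependency label, head token) entry per word,
# later used for question classification and query ranking.
parse = [(token.text, token.dep_, token.head.text) for token in doc]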
The first type is the 'List' question type; an example question could be 'Who is the
wife of Obama?'. The expected answer to 'List' questions is a list of resources in the
underlying knowledge graph.
The second type is the ‘Count’ question type, where the keyword ‘COUNT’ exists
in the corresponding SPARQL query. These kinds of questions usually start with a
particular word such as “how”. One example question could be ‘How many compa-
nies were founded in the same year as Google?’. The expected answer to a ‘Count’
question is a number.
Note that sometimes the expected answer to a ‘Count’ question could be directly
extracted as the value of the property in the underlying knowledge graph instead
of being calculated by the ‘COUNT’ SPARQL set function. For example, the an-
swer of the question ‘How many people live in the capital of Australia?’ is already
stored as the value of http://dbpedia.org/ontology/populationTotal. As a result,
this question is treated as type 'List' instead of 'Count'.
Finally, the ‘Boolean’ question type must contain the keyword “ASK” in the corre-
sponding SPARQL query. For example: ‘Is there a video game called Battle Chess?’.
The expected answer is of a Boolean value - either True or False.
We use a machine learning method instead of heuristic rules to classify question
types because it is hard to correctly capture all the various question formulations.
For example, consider the question 'How many people live in Zurich?', which starts
with 'How many' but belongs to question type 'List' rather than 'Count' (as in
the example above). Similar questions include 'How high is Mount Everest?', which
also belongs to question type 'List'. In order to capture those special questions,
many specific cases must be considered while hand-crafting heuristic rules. Instead,
using a machine learning algorithm for question type classification saves the tedious
manual work and can automatically capture such questions as long as the training
data is large and sufficiently diverse.
To automatically derive the question type, we first convert each word of the orig-
inal question into its lemma representation. Then we use term frequency-inverse
document frequency (TF-IDF) to convert the resulting questions into a numeric fea-
ture vector [17]. Afterwards, we train a Random Forest model [18] on these numeric
feature vectors to classify questions into ‘List’, ‘Count’ and ‘Boolean’ questions.
Our experimental results demonstrate that this simple model is good enough for
this classification task (see Section 4). Subsequently, a SPARQL query will be con-
structed based on the derived question type. For instance, 'ASK WHERE' is used
in the SPARQL query of a 'Boolean' question rather than 'SELECT * WHERE'.
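A minimal sketch of this classifier, assuming scikit-learn and a small set of lemmatized training questions (the examples shown are illustrative only):

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Lemmatized questions paired with their question types (toy training data).
questions = [
    "who be the mayor of the capital of french polynesia",
    "how many company be found in the same year as google",
    "be there a video game call battle chess",
]
labels = ["List", "Count", "Boolean"]

# TF-IDF features followed by a Random Forest classifier, as described above.
classifier = make_pipeline(TfidfVectorizer(), RandomForestClassifier())
classifier.fit(questions, labels)
print(classifier.predict(["how many people live in zurich"]))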
Figure 3: The phrase mapping result for the example question: "Who is the mayor of the capital of French Polynesia?". Abbreviations: dbo = DBpedia ontology, dbr = DBpedia resource.
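The phrase mapping step combines the output of several existing entity and relation linking tools in an ensemble, as noted in Section 1. A simplified sketch of one way such an ensemble can be realized, here as a plain union of candidates; the linker callables are placeholders for the external tools that the component wraps:

from typing import Callable, Dict, List, Set

def ensemble_phrase_mapping(question: str,
                            entity_linkers: List[Callable[[str], Set[str]]],
                            relation_linkers: List[Callable[[str], Set[str]]]) -> Dict[str, Set[str]]:
    # Each linker is a placeholder callable mapping the question to a set of
    # candidate URIs (e.g. dbr:French_Polynesia or dbo:mayor as in Fig. 3).
    resources: Set[str] = set()
    properties: Set[str] = set()
    for linker in entity_linkers:
        resources |= linker(question)
    for linker in relation_linkers:
        properties |= linker(question)
    return {"resources": resources, "properties": properties}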
In more complex SPARQL queries, more than one variable may be involved.
Therefore, set S is extended by adding the relationship to a new variable [9]. For ex-
ample, the triple pattern <dbr:French_Polynesia dbo:capital ?uri> in S can
be extended by adding another triple pattern <?uri dbo:mayor ?uri'> because
dbo:mayor is one mapped property in the example question and such a relationship
exists in the underlying knowledge graph. The triple pattern <?uri dbo:country
dbr:France> can be extended by adding <?uri' dbo:mayor ?uri> to S for the
same reason.
We choose to examine only the subgraph containing the mapped resources and
properties instead of traversing the whole underlying knowledge graph. As a result,
our approach dramatically decreases the computation time compared to [9]. Consid-
ering the whole knowledge graph instead would hurt both precision and execution
time. For example, the time needed to execute all the possible entity-property
combinations increases significantly with the number of properties. As a result, the
number of plausible queries to be considered grows significantly too, and consequently,
so does the time needed to compute the similarity between questions and SPARQL queries.
A list of triples needs to be selected from set S to build the ‘WHERE’ clause
in the SPARQL query. However, the output of the mapped resources, proper-
ties and classes from the phrase mapping step may be incorrect and some of
them may be unnecessary. Therefore, instead of only choosing the combination
of triples which contains all the mapped resources and properties and has the
maximum number of triples, combinations of any size are constructed from all
triples in S, as long as such a relationship exists in the underlying knowledge graph.
For example, (<dbr:French_Polynesia dbo:capital ?uri> , <?uri dbo:mayor
?uri'>) is one possible combination and (<?uri dbo:country dbr:France> ,
<?uri' dbo:mayor ?uri>) is another possible combination. Given the question
type information, each possible combination can be used to build one SPARQL
query. As a result, many possible candidate queries are generated for each input
question.
Algorithm 1 summarizes the process of constructing set S of all possible triples
and set Q of all possible queries, where E′ is the set of all mapped resources, P′ is
the set of all mapped properties and K is the underlying knowledge graph. The basic
idea of generating all possible triple patterns is taken from previous research [9].
However, we improve that approach to be able to generate more possible WHERE
clauses and thus, to be able to handle more complex queries (see lines 15-24 of the
algorithm below).
Algorithm 1: Construct the Set of All Possible Triples and Queries with
WHERE Clauses
Data: E′, P′, K
Result: S: Set of All Possible Triple Patterns
Q: Set of Queries with WHERE Clauses
1  S1 = {(e1, p, e2) | e1 ∈ E′, e2 ∈ E′, p ∈ P′, (e1, p, e2) ∈ K}
2  S2 = {(e, p, ?uri) | e ∈ E′, p ∈ P′, (e, p, ?uri) ∈ K}
3  S3 = {(?uri, p, e) | e ∈ E′, p ∈ P′, (?uri, p, e) ∈ K}
4  S = S1 ∪ S2 ∪ S3
5  S′ = {}
6  foreach (e1, p, e2) ∈ S do
7      S1 = {(e1, p, ?uri) | e1 ∈ E′, p ∈ P′, (e1, p, ?uri) ∈ K}
8      S2 = {(?uri, p, e1) | e1 ∈ E′, p ∈ P′, (?uri, p, e1) ∈ K}
9      S3 = {(e2, p, ?uri) | e2 ∈ E′, p ∈ P′, (e2, p, ?uri) ∈ K}
10     S4 = {(?uri, p, e2) | e2 ∈ E′, p ∈ P′, (?uri, p, e2) ∈ K}
11     S′ = S′ ∪ S1 ∪ S2 ∪ S3 ∪ S4
12 end
13 S = S ∪ S′
14 S′ = {}
15 foreach ((e, p, ?uri), (?uri′, p, e)) ∈ S × S do
16     S1 = {(?uri, p′, ?uri′) | p′ ∈ P′, (?uri, p′, ?uri′) ∈ K}
17     S2 = {(?uri′, p′, ?uri) | p′ ∈ P′, (?uri′, p′, ?uri) ∈ K}
18     S′ = S′ ∪ S1 ∪ S2
19 end
20 S = S ∪ S′
21 Q = {}
22 for k = 1 to |S| do
23     Q = Q ∪ {q | q ∈ k-combinations(S), q ∈ K}
24 end
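A compact Python sketch of this enumeration is shown below. It is a simplified rendering of Algorithm 1, assuming a helper in_kg(patterns) that checks (for example via a SPARQL ASK query) whether a list of triple patterns can be matched in the underlying knowledge graph; the helper and data representation are assumptions for illustration.

from itertools import combinations

def generate_candidate_patterns(mapped_entities, mapped_properties, in_kg):
    # mapped_entities / mapped_properties: URIs from the phrase mapping step (E', P').
    # in_kg(patterns): True if the triple patterns ('?'-prefixed terms are variables)
    # can be matched against the knowledge graph K. "?uri2" stands for ?uri' above.
    E, P = mapped_entities, mapped_properties
    S = set()
    # Lines 1-4: patterns connecting mapped entities directly or to one variable.
    S |= {(e1, p, e2) for e1 in E for e2 in E for p in P if in_kg([(e1, p, e2)])}
    S |= {(e, p, "?uri") for e in E for p in P if in_kg([(e, p, "?uri")])}
    S |= {("?uri", p, e) for e in E for p in P if in_kg([("?uri", p, e)])}
    # Lines 6-13: expand every pattern with further patterns around its entities.
    expansion = set()
    for (s, _, o) in S:
        for e in (s, o):
            if e.startswith("?"):
                continue
            expansion |= {(e, p2, "?uri") for p2 in P if in_kg([(e, p2, "?uri")])}
            expansion |= {("?uri", p2, e) for p2 in P if in_kg([("?uri", p2, e)])}
    S |= expansion
    # Lines 15-20: connect the two variables with a further mapped property,
    # which enables longer chains such as capital -> mayor.
    S |= {("?uri", p2, "?uri2") for p2 in P if in_kg([("?uri", p2, "?uri2")])}
    S |= {("?uri2", p2, "?uri") for p2 in P if in_kg([("?uri2", p2, "?uri")])}
    # Lines 21-24: every satisfiable combination of patterns becomes a candidate WHERE clause.
    Q = [combo for k in range(1, len(S) + 1)
         for combo in combinations(sorted(S), k) if in_kg(list(combo))]
    return S, Q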
Since there is an intrinsic tree-like structure in both SPARQL queries and nat-
ural language questions, one basic assumption of our proposed QA system is that
the candidate queries can be distinguished based on their syntactic structure [9].
Therefore, the similarity between candidate queries and an input question can be
used to rank them. Since the desired query should capture the intention of the input
question, the final output is the candidate query that is most similar to the input
question in terms of syntactical structure. The similarity between candidate queries
and the input question is measured based on a Tree-structured Long Short-Term
Memory (Tree-LSTM) [12], which takes into consideration the tree representation
of the input rather than just its sequential order.
In order to clarify how we use Tree-LSTMs for query ranking, let us revisit the
whole query processing phase. In the preprocessing phase for the input question,
the words corresponding to the mapped resources in the question are substituted
with a placeholder. After that, the dependency parse tree of the input question is
created.
Let us consider an example by first revisiting Fig. 2, which shows the dependency
parse tree of the example question "Who is the mayor of the capital of French
Polynesia?". Fig. 4 then shows the tree representation of four possible queries for
the same question. According to the output of the Tree-LSTM, the first one has the
highest similarity score among all possible candidate queries.
The similarity between the question tree and each candidate query tree is predicted
using a neural network that considers both the distance and angle between the
representations of the sentence pair. As cost function, we use the regularized
Kullback–Leibler (KL) divergence between the predicted and the target distributions.
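Concretely, in the semantic relatedness model of [12] that this predictor is based on, the Tree-LSTM representations $h_L$ (question) and $h_R$ (candidate query) are compared through their element-wise product (angle) and absolute difference (distance):

\[ h_{\times} = h_L \odot h_R, \qquad h_{+} = |h_L - h_R|, \]
\[ h_s = \sigma\left(W^{(\times)} h_{\times} + W^{(+)} h_{+} + b^{(h)}\right), \qquad \hat{p}_{\theta} = \operatorname{softmax}\left(W^{(p)} h_s + b^{(p)}\right), \]

and the model is trained by minimizing the regularized KL divergence $\mathrm{KL}\left(p \,\|\, \hat{p}_{\theta}\right)$ between a sparse target distribution $p$ encoding the gold similarity score and the prediction $\hat{p}_{\theta}$.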
Since SPARQL queries are syntactically composed of triple patterns, a typical
SPARQL query is often much longer than the corresponding natural language
question. Nevertheless, both questions and SPARQL queries exhibit a syntactic
tree-like structure. Hence, it is more natural to choose a Tree-LSTM than a sequential
Recurrent Neural Network to compute the similarity between question-query pairs.
Compared to vanilla Recurrent Neural Networks (RNNs), which memorize in-
formation over only a few steps back, LSTM networks can preserve sequence
information over a much longer period of time by introducing a memory cell as a
more complex computational unit added to the vanilla RNN structure. However,
standard LSTMs support input only in a sequential manner, whereas natural lan-
guage naturally exhibits a tree-like structure.
In contrast to the standard LSTM, the Tree-LSTM incorporates information not
only from an input vector but also from the hidden states of arbitrarily many child
units - instead of the hidden state of only the previous time step. Because of its
superior ability to represent sentence meaning, Tree-LSTM has been shown empir-
ically to outperform strong LSTM baselines in tasks such as predicting semantic
relatedness [12]. For more methodological detail on Tree-LSTMs, see the original
article [12].
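To illustrate how the child states enter the computation, the following is a minimal sketch of a Child-Sum Tree-LSTM node update following the equations in [12]; it is a simplified stand-in, assuming PyTorch, and not our exact ranking implementation.

import torch
import torch.nn as nn

class ChildSumTreeLSTMCell(nn.Module):
    # One node update of a Child-Sum Tree-LSTM (Tai et al. [12]).
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.iou = nn.Linear(input_dim + hidden_dim, 3 * hidden_dim)  # input, output, update gates
        self.f_x = nn.Linear(input_dim, hidden_dim)   # forget gate, input part
        self.f_h = nn.Linear(hidden_dim, hidden_dim)  # forget gate, per-child part

    def forward(self, x, child_h, child_c):
        # x: (1, input_dim); child_h, child_c: (num_children, hidden_dim).
        h_sum = child_h.sum(dim=0, keepdim=True)                    # sum of child hidden states
        i, o, u = self.iou(torch.cat([x, h_sum], dim=1)).chunk(3, dim=1)
        i, o, u = torch.sigmoid(i), torch.sigmoid(o), torch.tanh(u)
        f = torch.sigmoid(self.f_x(x) + self.f_h(child_h))          # one forget gate per child
        c = i * u + (f * child_c).sum(dim=0, keepdim=True)          # new memory cell
        h = o * torch.tanh(c)                                       # new hidden state
        return h, c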
4 Results
In this section we describe the experimental evaluation of our system. In order to
make our experiments reproducible, we provide our source code for download at
https://github.com/Sylvia-Liang/QAsparql. We ran the experiments on two well-
established real-world data sets: the Question Answering over Linked Data challenge
(QALD) [24] and the Large-Scale Complex Question Answering Dataset
(LC-QuAD) [25]. Our results show that our QA system outperforms the state-of-
the-art systems by 15% on the QALD-7 dataset and by 48% on the LC-QuAD
dataset.
\[ F_1\text{-score}(q) = 2 \times \frac{\mathrm{precision}(q) \times \mathrm{recall}(q)}{\mathrm{precision}(q) + \mathrm{recall}(q)} \tag{3} \]
The macro-average precision, recall and F1-score are calculated as the average
precision, recall and F1-score values for all the questions, respectively.
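Concretely, writing $\mathcal{Q}$ for the set of evaluated questions:

\[ \mathrm{Precision}_{\mathrm{macro}} = \frac{1}{|\mathcal{Q}|} \sum_{q \in \mathcal{Q}} \mathrm{precision}(q), \quad \mathrm{Recall}_{\mathrm{macro}} = \frac{1}{|\mathcal{Q}|} \sum_{q \in \mathcal{Q}} \mathrm{recall}(q), \quad F_{1,\mathrm{macro}} = \frac{1}{|\mathcal{Q}|} \sum_{q \in \mathcal{Q}} F_1\text{-score}(q). \]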
The Tree-LSTM ranking model was trained with the Adagrad optimizer [29] and a
mini-batch size of 25 examples. KL divergence was used as the loss function; it
provides a useful distance measure for continuous distributions and is often useful
when performing direct regression over the space of (discretely sampled) continuous
output distributions [30].
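A minimal sketch of this training configuration, assuming PyTorch and a model that outputs log-probabilities over similarity bins (the names and the learning rate are illustrative):

import torch
import torch.nn.functional as F

def training_step(model, optimizer, question_tree, query_tree, target_dist):
    # One update on a (question, candidate query) pair; in practice the pairs
    # are processed in mini-batches of 25 examples.
    optimizer.zero_grad()
    log_pred = model(question_tree, query_tree)     # log-probabilities over similarity bins
    loss = F.kl_div(log_pred, target_dist, reduction="batchmean")
    loss.backward()
    optimizer.step()
    return loss.item()

# Adagrad optimizer as in [29]; the learning rate is a placeholder value.
# optimizer = torch.optim.Adagrad(model.parameters(), lr=0.05)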
Here we analyze the results for the Random Forest in more detail. In particular,
we are interested in the classification accuracy for the three different query types.
Let us first start with the LC-QuAD dataset. Table 3 shows the precision, recall and
F1-score for each question type. For the LC-QuAD dataset we achieve the highest
F1-score for list queries, followed by Boolean and count queries. For the QALD-7
dataset, the F1-score for list queries is again the highest, while for Boolean queries
it is the lowest.
The question type classification accuracy results on the LC-QuAD dataset are as
follows: 99.9% for ‘List’ questions, 97% for ‘Count’ questions, and 98% for ‘Boolean’
questions. These high accuracy values are due to the generation mechanism of the
LC-QuAD dataset. This dataset is generated by converting SPARQL queries to
Normalized Natural Question Templates (NNQTs) which act as canonical struc-
tures. Afterwards, natural language questions are composed by manually correcting
the generated NNQTs [25]. Therefore, the questions in LC-QuAD contain much
fewer noisy patterns compared to other collected natural language questions. As a
result, the performance of the Random Forest Classifier on LC-QuAD dataset is
quite satisfactory.
When considering the QALD-7 dataset for the question type classification, our
approach performed slightly worse than with the LC-QuAD dataset as shown in
Table 4. The accuracy for ‘List’ questions is 97%, for ‘Count’ questions 93% and for
‘Boolean’ questions 86%. The reduction in performance is mainly due to the differ-
ent qualities of the datasets. For instance, the QALD-7 dataset contains questions
with richer characteristics such as ‘Boolean’ questions starting with ‘Are’ or ’Was’.
However, the LC-QuAD dataset contains very few such ‘Boolean’ questions, which
results in the dramatic decrease in the accuracy for ‘Boolean’ questions.
Table 6 shows that on the QALD-7 dataset our QA system also significantly
outperforms the state-of-the-art systems WDAqua [14] and ganswer2 [26].
Our in-depth analysis of the failed questions shows that no SPARQL query was
generated for 968 questions in the LC-QuAD dataset and 80 questions in the QALD-7
dataset. Most of these failures were related to the phrase mapping step where the
required resources, properties or classes could not be detected.
For instance, most of these failures are related to detecting properties implicitly
stated in the input question. In such cases, the properties required to build the
SPARQL query cannot be inferred from the input question. For example, consider
the question “How many golf players are there in Arizona State Sun Devils?”. The
correct SPARQL query should be:
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX res: <http://dbpedia.org/resource/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT (COUNT(DISTINCT ?uri) AS ?num) WHERE {
?uri dbo:college res:Arizona_State_Sun_Devils .
?uri rdf:type dbo:GolfPlayer }
The property http://dbpedia.org/ontology/college is necessary to build the cor-
rect SPARQL query but it is impossible to detect it solely from the input question.
Therefore, the bottleneck of designing QA systems over knowledge graphs lies in
the phrase mapping step, i.e. detecting the corresponding resources, properties and
classes in the underlying knowledge graph.
The previous experiments showed the end-to-end performance of our system. We
will now provide a more detailed performance analysis based on the question type of
the natural language questions. Both Table 7 and Table 8 show that the performance
on 'List' questions is much better than the performance on 'Boolean' questions.
Low recall for ‘Boolean’ questions might be caused by the intrinsic structure of the
SPARQL query. For instance, the question “Is Tom Cruise starring in Rain Man?”
has the following SPARQL query:
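A sketch of this query, assuming standard DBpedia prefixes and the dbo:starring property (the exact gold query may differ), is:

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
ASK WHERE { dbr:Rain_Man dbo:starring dbr:Tom_Cruise }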
In contrast, a 'Count' question requires aggregating over an additional
variable in the WHERE clause. However, the number of 'Count' questions in the
training dataset is relatively small, as there are only 658 'Count' questions in the LC-
QuAD dataset. Therefore, more training data is required in order to fully learn the
characteristics of those complex queries.
5 Discussion
5.1 Conclusions
This paper presents a novel approach to constructing QA systems over knowledge
graphs. Our proposed QA system first identifies the type of each question by training
a Random Forest model. Then, an ensemble approach comprised of various entity
recognition and property mapping tools is used in the phrase mapping task. All pos-
sible triple patterns are then extracted based on the mapped resources, properties
and classes. Possible SPARQL queries are constructed by combining these triple
patterns in the query generation step. In order to select the correct SPARQL query
among a number of candidate queries for each question, a ranking model based on
Tree-LSTM is used in the query ranking step. The ranking model takes into ac-
count both the syntactical structure of the question and the tree representation of
the candidate queries to select the most plausible SPARQL query which represents
the correct intention behind the question. Experimental results demonstrate that
our proposed QA system outperforms the state-of-the-art results by 15% on the QALD-7
dataset and by 48% on the LC-QuAD dataset, respectively.
The advantage of our QA system is that it requires neither laborious feature
engineering nor a list of heuristic rules for mapping a natural language question
to a query template and then to a SPARQL query. In this sense, our system avoids
the over-fitting problem that usually arises when defining heuristic rules for
converting from natural language to a SPARQL query. In addition, our
proposed system can be used on large open-domain knowledge graphs and handle
noisy inputs, as it uses an ensemble method in the phrase mapping task, which
leads to a significant performance improvement. What is more, each component
in our QA system is reusable and can be integrated with other components to
construct a new QA system that further improves the performance. This proposed
system can be easily applied to previously unseen domains because the question type
classification model and the query generation model do not require any domain
specific knowledge.
One important design question might concern our choice of a modular architec-
ture, rather than an end-to-end system. The reason behind this choice is that the
modular approach makes the QA system more independent and less susceptible
to data schema changes. An end-to-end system often needs to be re-trained due to
potentially frequent changes of the underlying database. However, in a modular sys-
tem, only one or two components are affected by changes in the underlying
database; as a result, the training time and computing effort for updating the
modular system are much smaller than for an end-to-end system. In addition, the
adjustments to the architecture required to match a changed underlying database
are also much smaller for a modular system compared to an end-to-end system.
However, the phrase mapping model used in this paper performs well on DBpedia but not on other
knowledge graphs because it uses many pre-trained tools for DBpedia. In order to
make this system fully independent of the underlying knowledge graph, and for
it to be easily transferable to a new domain, the models used in this component
could be changed to more general models. For instance, DeepType [32] could map
resources in Wikidata [33], Freebase [31] and YAGO2 [34]. If no pre-trained phrase
mapping models are available for a specific knowledge graph, one simple model is
to measure the similarity between the phrases in question and the labels of resources
in the knowledge graph. In order to improve the accuracy of this simple approach,
specific tailoring for each knowledge graph would be required.
Abbreviations
Author’s contributions
Acknowledgements
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Funding
We thank the Swiss National Science Foundation for funding (NRP 75, grant 407540 167149).
Author details
1 The work was executed while at ETH Swiss Federal Institute of Technology, Rämistrasse 101, 8092, Zurich,
Switzerland. 2 Zurich University of Applied Sciences, Obere Kirchgasse 2, 8400, Winterthur, Switzerland. 3 SIB Swiss
Institute of Bioinformatics, Quartier Sorge - Bâtiment Amphipôle, 1015, Lausanne, Switzerland. 4 Department of
Ecology and Evolution - University of Lausanne, Quartier Sorge - Bâtiment Biophore, 1015, Lausanne, Switzerland.
References
1. Boutet, E., Lieberherr, D., Tognolli, M., Schneider, M., Bairoch, A.: Uniprotkb/swiss-prot. In: Plant
Bioinformatics, pp. 89–112. Springer, ??? (2007)
2. Diefenbach, D., Lopez, V., Singh, K., Maret, P.: Core techniques of question answering systems over knowledge
bases: a survey. Knowledge and Information systems 55(3), 529–569 (2018)
3. Li, F., Jagadish, H.: Constructing an interactive natural language interface for relational databases. Proceedings
of the VLDB Endowment 8(1), 73–84 (2014)
4. Basik, F., Hättasch, B., Ilkhechi, A., Usta, A., Ramaswamy, S., Utama, P., Weir, N., Binnig, C., Cetintemel,
U.: Dbpal: A learned nl-interface for databases. In: Proceedings of the 2018 International Conference on
Management of Data, pp. 1765–1768 (2018). ACM
5. Affolter, K., Stockinger, K., Bernstein, A.: A comparative survey of recent natural language interfaces for
databases. The VLDB Journal (2019). doi:10.1007/s00778-019-00567-8
6. Höffner, K., Walter, S., Marx, E., Usbeck, R., Lehmann, J., Ngonga Ngomo, A.-C.: Survey on challenges of
question answering in the semantic web. Semantic Web 8(6), 895–920 (2017)
7. Singh, K., Lytra, I., Radhakrishna, A.S., Shekarpour, S., Vidal, M.-E., Lehmann, J.: No one is perfect:
Analysing the performance of question answering components over the dbpedia knowledge graph. arXiv preprint
arXiv:1809.10044 (2018)
8. Sima, A.C., Mendes de Farias, T., Zbinden, E., Anisimova, M., Gil, M., Stockinger, H., Stockinger, K.,
Robinson-Rechavi, M., Dessimoz, C.: Enabling semantic queries across federated bioinformatics databases.
Database 2019 (2019)
9. Zafar, H., Napolitano, G., Lehmann, J.: Formal query generation for question answering over knowledge bases.
In: European Semantic Web Conference, pp. 714–728 (2018). Springer
10. Singh, K., Radhakrishna, A.S., Both, A., Shekarpour, S., Lytra, I., Usbeck, R., Vyas, A., Khikmatullaev, A.,
Punjani, D., Lange, C., Vidal, M.E., Lehmann, J., Auer, S.: Why reinvent the wheel: Let’s build question
answering systems together. In: Proceedings of the 2018 World Wide Web Conference (2018)
11. Trivedi, P., Maheshwari, G., Dubey, M., Lehmann, J.: Lc-quad: A corpus for complex question answering over
knowledge graphs. In: International Semantic Web Conference, pp. 210–218 (2017). Springer
12. Tai, K.S., Socher, R., Manning, C.D.: Improved semantic representations from tree-structured long short-term
memory networks. arXiv preprint arXiv:1503.00075 (2015)
13. Zou, L., Huang, R., Wang, H., Yu, J., He, W., Zhao, D.: Natural language question answering over rdf - a
graph data driven approach. Proceedings of the ACM SIGMOD International Conference on Management of
Data (2014). doi:10.1145/2588555.2610525
14. Diefenbach, D., Singh, K., Maret, P.: Wdaqua-core0: A question answering component for the research
community. In: Dragoni, M., Solanki, M., Blomqvist, E. (eds.) Semantic Web Challenges, pp. 84–89. Springer,
Cham (2017)
15. Diefenbach, D., Both, A., Singh, K., Maret, P.: Towards a question answering system over the semantic web.
Semantic Web (Preprint), 1–19 (2018)
16. Honnibal, M., Johnson, M.: An improved non-monotonic transition system for dependency parsing. In:
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1373–1378
(2015)
17. Baeza-Yates, R., Ribeiro-Neto, B., et al.: Modern Information Retrieval vol. 463. ACM press New York, ???
(1999)
18. Breiman, L.: Random forests. Machine learning 45(1), 5–32 (2001)
19. Morsey, M., Lehmann, J., Auer, S., Stadler, C., Hellmann, S.: Dbpedia and the live extraction of structured
data from wikipedia. Program: electronic library and information systems 46, 157–181 (2012).
doi:10.1108/00330331211221828
20. Daiber, J., Jakob, M., Hokamp, C., Mendes, P.N.: Improving efficiency and accuracy in multilingual entity
extraction. In: Proceedings of the 9th International Conference on Semantic Systems (I-Semantics) (2013)
21. Ferragina, P., Scaiella, U.: Tagme: On-the-fly annotation of short text fragments (by wikipedia entities). In:
Proceedings of the 19th ACM International Conference on Information and Knowledge Management. CIKM
’10, pp. 1625–1628. ACM, New York, NY, USA (2010). doi:10.1145/1871437.1871689.
http://doi.acm.org/10.1145/1871437.1871689
22. Dubey, M., Banerjee, D., Chaudhuri, D., Lehmann, J.: EARL: joint entity and relation linking for question
answering over knowledge graphs. CoRR abs/1801.03825 (2018). 1801.03825
23. Sakor, A., Onando Mulang’, I., Singh, K., Shekarpour, S., Esther Vidal, M., Lehmann, J., Auer, S.: Old is gold:
Linguistic driven approach for entity and relation linking of short text. In: Proceedings of the 2019 Conference
of the North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short Papers), pp. 2336–2346. Association for Computational Linguistics,
Minneapolis, Minnesota (2019). https://www.aclweb.org/anthology/N19-1243
24. Lopez, V., Unger, C., Cimiano, P., Motta, E.: Evaluating question answering over linked data. Web Semantics
Science Services And Agents On The World Wide Web 21, 3–13 (2013). doi:10.1016/j.websem.2013.05.006
25. Trivedi, P., Maheshwari, G., Dubey, M., Lehmann, J.: Lc-quad: A corpus for complex question answering over
knowledge graphs. In: d’Amato, C., Fernandez, M., Tamma, V., Lecue, F., Cudré-Mauroux, P., Sequeda, J.,
Lange, C., Heflin, J. (eds.) The Semantic Web – ISWC 2017, pp. 210–218. Springer, Cham (2017)
26. Usbeck, R., Ngomo, A.-C.N., Haarmann, B., Krithara, A., Röder, M., Napolitano, G.: 7th open challenge on
question answering over linked data (qald-7). In: Dragoni, M., Solanki, M., Blomqvist, E. (eds.) Semantic Web
Challenges, pp. 59–69. Springer, Cham (2017)
27. Usbeck, R., Ngomo, A.-C.N., Haarmann, B., Krithara, A., Röder, M., Napolitano, G.: 7th open challenge on
question answering over linked data (qald-7). In: Dragoni, M., Solanki, M., Blomqvist, E. (eds.) Semantic Web
Challenges, pp. 59–69. Springer, Cham (2017)
28. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information.
Transactions of the Association for Computational Linguistics 5, 135–146 (2017)
29. Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization.
Journal of Machine Learning Research 12(Jul), 2121–2159 (2011)
30. Kullback, S.: Information Theory and Statistics. Courier Corporation, ??? (1997)
31. Bollacker, K., Evans, C., Paritosh, P., Sturge, T., Taylor, J.: Freebase: A collaboratively created graph database
for structuring human knowledge. In: Proceedings of the 2008 ACM SIGMOD International Conference on
Management of Data. SIGMOD ’08, pp. 1247–1250. ACM, New York, NY, USA (2008).
doi:10.1145/1376616.1376746. http://doi.acm.org/10.1145/1376616.1376746
32. Raiman, J.R., Raiman, O.M.: Deeptype: multilingual entity linking by neural type system evolution. In:
Thirty-Second AAAI Conference on Artificial Intelligence (2018)
33. Vrandečić, D., Krötzsch, M.: Wikidata: a free collaborative knowledge base (2014)
34. Hoffart, J., Suchanek, F.M., Berberich, K., Weikum, G.: Yago2: A spatially and temporally enhanced knowledge
base from wikipedia. Artificial Intelligence 194, 28–61 (2013)