
Abstract

With the development of the Semantic Web, a lot of new structured data has become available on the Web in the form of knowledge bases (KBs). Making this valuable data accessible and usable for end-users is one of the main goals of question answering (QA) over KBs. Most current QA systems query one KB, in one language (namely English). The existing approaches are not designed to be easily adaptable to new KBs and languages.

We first introduce a new approach for translating natural language questions to SPARQL queries. It is able to query several KBs simultaneously, in different languages, and can easily be ported to other KBs and languages. In our evaluation, the impact of our approach is proven using 5 different well-known and large KBs: Wikidata, DBpedia, MusicBrainz, DBLP and Freebase as well as 5 different languages namely English, German, French, Italian and Spanish. Second, we show how we integrated our approach, to make it easily accessible by the research community and by end-users.

To summarize, we provide a conceptual solution for multilingual, KB-agnostic question answering over the Semantic Web. The provided first approximation validates this concept.

Keywords

Question answering, multilinguality, portability, QALD, SimpleQuestions

1. Introduction

Question answering (QA) is a research field in computer science that started in the sixties [38]. In the Semantic Web, a lot of new structured data has become available in the form of knowledge bases (KBs). Nowadays, there are KBs about media, publications, geography, life sciences and more.¹

¹ http://lod-cloud.net

The core purpose of a QA system over KBs is to retrieve the desired information from one or many KBs, using natural language questions. This is generally addressed by translating a natural language question to a SPARQL query. Current research does not address the challenge of multilingual, KB-agnostic QA for both full and keyword questions (Table 1).

Table 1

Selection of QA systems evaluated over the most popular benchmarks. We indicated their capabilities with respect to multilingual questions, different KBs and different typologies of questions (full = “well-formulated natural language questions”, key = “keyword questions”)

QA system	Lang	KBs	Type
gAnswer [66] (QALD-3 Winner)	En	DBpedia	Full
Xser [60] (QALD-4 & 5 Winner)	En	DBpedia	Full
UTQA [59]	En, es, fs	DBpedia	Full
Jain [35] (WebQuestions Winner)	En	Freebase	Full
Lukovnikov [39] (SimpleQuestions Winner)	En	Freebase	Full
Ask Platypus (https://askplatyp.us)	En	Wikidata	Full
WDAqua-core1	En, fr, de, it, es	Wikidata, DBpedia, Freebase, DBLP, MusicBrainz	Full & key

There are multiple reasons for that. Many QA approaches rely on language-specific tools (NLP tools), e.g., SemGraphQA [3], gAnswer [66] and Xser [60]. Therefore, it is difficult or impossible to port them to a language-agnostic system. Additionally, many approaches make particular assumptions on how the knowledge is modelled in a given KB (generally referred to as “structural gap” [12]). This is the case of AskNow [19] and DEANNA [61].

There are also approaches which are difficult to port to new languages or KBs because they need a lot of training data, which is difficult and expensive to create. This is, for example, the case of Bordes et al. [5]. Finally, there are approaches that have not been shown to scale well. This is, for example, the case of SINA [48].

In this paper, we present an algorithm that addresses all of the above drawbacks and that can compete, in terms of F-measure, with many existing approaches.

This publication is organized as follows. In Section 2, we present related work. In Sections 3 and 4, we describe the algorithm providing the foundations of our approach. In Section 5, we provide the results of our evaluation over different benchmarks. In Section 6, we show how we implemented our algorithm as a service so that it is easily accessible to the research community, and how we extended a series of existing services so that our approach can be directly used by end-users. We conclude with Section 7.

2. Related work

In the context of QA, a large number of systems have been developed in recent years. For a complete overview, we refer to [12]. Most of them were evaluated on one of the following three popular benchmarks: WebQuestions [4], SimpleQuestions [5] and QALD.²

² http://www.sc.cit-ec.uni-bielefeld.de/qald/

WebQuestions contains 5,810 questions that can be answered by one reified statement. SimpleQuestions contains 108,442 questions that can be answered using a single binary relation. The QALD challenge versions include more complex questions than the previous two benchmarks, but each contains only between 100 and 450 questions, so they are small datasets by comparison.

The high number of questions in WebQuestions and SimpleQuestions led to many supervised-learning approaches for QA. Deep learning approaches in particular, like Bordes et al. [5] and Zhang et al. [62], have become very popular in recent years. The main drawback of these approaches is the training data itself. Creating a new training dataset for a new language or a new KB might be very expensive. For example, Berant et al. [4] report that they spent several thousand dollars on the creation of WebQuestions using Amazon Mechanical Turk. The problem of adapting these approaches to new datasets and languages can also be seen from the fact that all these systems work only for English questions over Freebase.

A list of the QA systems that were evaluated with QALD-3, QALD-4, QALD-5, QALD-6, QALD-7, QALD-8 can be found in Table 3. According to [12] less than 10% of the approaches were applied to more than one language and 5% to more than one KB. The reason is the heavy use of NLP tools or NL features like in Xser [60], gAnswer [66] or QuerioDali [37].

The problem of QA in English over MusicBrainz³ was proposed in QALD-1, in the year 2011. Two QA systems tackled this problem. Since then the MusicBrainz KB⁴ completely changed. We are not aware of any QA system over DBLP.⁵

³ https://musicbrainz.org
⁴ https://github.com/LinkedBrainz/MusicBrainz-R2RML
⁵ http://dblp.uni-trier.de

In summary, most QA systems work only in English and over one KB. Multilinguality is poorly addressed while portability is not addressed at all. The few systems that address multilinguality rely on syntactic parsing techniques [31,51,59].

The fact that QA systems often reuse existing techniques and need several services to be exposed to the end-user leads to the idea of developing QA systems in a modular way. At least four frameworks tried to achieve this goal: QALL-ME [21], openQA [40], the Open Knowledge Base and Question-Answering (OKBQA) challenge⁶ and Qanary [6,13,49]. We integrated our system as a Qanary QA component called WDAqua-core1. We chose Qanary for two reasons. First, it offers a series of off-the-shelf services related to QA systems and second, it allows a QA system to be freely configured based on existing QA components.

⁶ http://www.okbqa.org/

Fig. 1.

Conceptual overview of the approach.

3. Approach for QA over Knowledge Bases

In this section, we present our multilingual, KB-agnostic approach for QA. It is based on the observation that many questions can be understood from the semantics of the words in the question while the syntax of the question has less importance. For example, consider the question “Give me actors born in Berlin”. This question can be reformulated in many ways like “In Berlin were born which actors?” or as a keyword question “Berlin, actors, born in”. In this case by knowing the semantics of the words “Berlin”, “actors”, “born”, we are able to deduce the intention of the user. This holds for many questions, i.e. they can be correctly interpreted without considering the syntax as the semantics of the words is sufficient for them. Taking advantage of this observation is the main idea of our approach. The KB encodes the semantics of the words and it can tell what is the most probable interpretation of the question (w.r.t. the knowledge model described by the KB).

Our approach is decomposed into 4 steps: question expansion, query construction, query ranking and answer decision. A conceptual overview is given in Fig. 1. In the following, the processing steps are described. As a running example, we consider the question “Give me philosophers born in Saint-Etienne”. For the sake of simplicity, we use DBpedia as KB to answer the question. However, it is important to recognize that no assumptions either about the language or the KB are made. Hence, even the processing of the running example is language- and KB-agnostic.

Table 2
Expansion step for the question “Give me philosophers born in Saint Étienne”. The first column enumerates the candidates that were found. Here, 117 possible entities, properties and classes were found from the question. The second, third and fourth columns indicate the position of the n-gram in the question and the n-gram itself. The last column is for the associated IRI. Note that many possible meanings are considered: line 9 says that “born” may refer to a crater, line 52 that “saint” may refer to a software and line 114 that the string “Saint Étienne” may refer to a band

n	Start	End	n-gram	Resource
1	2	3	Philosophers	dbrc:Philosophes
2	2	3	Philosophers	dbr:Philosophes
3	2	3	Philosophers	dbo:Philosopher
4	2	3	Philosophers	dbrc:Philosophers
5	2	3	Philosophers	dbr:Philosopher
6	2	3	Philosophers	dbr:Philosophy
7	2	3	Philosophers	dbo:philosophicalSchool
8	3	4	Born	dbr:Born,_Netherlands
9	3	4	Born	dbr:Born_(crater)
10	3	4	Born	dbr:Born_auf_dem_Darß
11	3	4	Born	dbr:Born,_Saxony-Anhalt
⋮
42	3	4	Born	dbp:bornAs
43	3	4	Born	dbo:birthDate
44	3	4	Born	dbo:birthName
45	3	4	Born	dbp:bornDay
46	3	4	Born	dbp:bornYear
47	3	4	Born	dbp:bornDate
48	3	5	Born in	dbp:bornIn
49	3	5	Born in	dbo:birthPlace
50	3	5	Born in	dbo:hometown

n	Start	End	n-gram	Resource
52	5	6	Saint	dbr:SAINT_(software)
53	5	6	Saint	dbr:Saint
54	5	6	Saint	dbr:Boxers_and_Saints
55	5	6	Saint	dbr:Utah_Saints
56	5	6	Saint	dbr:Saints,_Luton
57	5	6	Saint	dbr:Baba_Brooks
58	5	6	Saint	dbr:Battle_of_the_Saintes
59	5	6	Saint	dbr:New_York_Saints
⋮
106	5	6	Saint	dbp:saintPatron
107	5	6	Saint	dbp:saintsDraft
108	5	6	Saint	dbp:saintsSince
109	5	6	Saint	dbo:patronSaint
110	5	6	Saint	dbp:saintsCollege
111	5	6	Saint	dbp:patronSaintOf
112	5	6	Saint	dbp:patronSaint(s)
113	5	6	Saint	dbp:patronSaint’sDay
114	5	7	Saint etienne	dbr:Saint_Etienne_(band)
115	5	7	Saint etienne	dbr:Saint_Etienne
116	5	7	Saint etienne	dbr:Saint-Étienne
117	6	7	Etienne	dbr:Étienne

3.1. Expansion

Following a recent survey [12], we call a name of an entity, a property or a class a lexicalization. For example, “first man on the moon” and “Neil Armstrong” are both lexicalizations of dbr:Neil_Armstrong. In this step, we want to identify all Internationalized Resource Identifiers (IRIs) of entities, properties and classes which the question could refer to. To achieve this, we use the following rules:

All IRIs are searched whose lexicalization (up to stemming) is a word n-gram (up to stemming) in the question.

If an n-gram is a stop word (like “is”, “are”, “of”, “give”, …), then we exclude the IRIs associated to it. This is due to the observation that the semantics are important to understand a question and the fact that stop words do not carry a lot of semantics. Moreover, by removing the stop words the time needed in the next step is decreased.

An example is given in Table 2. The stop words and the lexicalizations used for the different languages and KBs are described in Section 5.1. In this part, we used the well-known Apache Lucene⁷ technology, which allows fast retrieval while providing a small disk and memory footprint.

⁷ https://lucene.apache.org
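The two expansion rules can be illustrated with a minimal Python sketch. It is not the Lucene-based implementation described above: the in-memory `lexicon` dictionary, the `stop_words` set, the `max_n` limit and the function name `expand` are all assumptions made for illustration, and stemming is omitted for brevity. Positions follow the convention of Table 2 (word index of the start, exclusive word index of the end).

```python
def expand(question, lexicon, stop_words, max_n=3):
    """Return (start, end, n-gram, IRI) candidates for a question.

    lexicon maps a lower-cased lexicalization to a list of IRIs
    (a hypothetical stand-in for the real label index).
    """
    words = question.lower().replace("?", "").split()
    candidates = []
    for start in range(len(words)):
        last_end = min(start + max_n, len(words))
        for end in range(start + 1, last_end + 1):
            ngram = " ".join(words[start:end])
            # Rule 2: n-grams that are stop words contribute no IRIs.
            if ngram in stop_words:
                continue
            # Rule 1: every IRI whose lexicalization equals the n-gram.
            for iri in lexicon.get(ngram, ()):
                candidates.append((start, end, ngram, iri))
    return candidates


# Toy data; a real index would hold all labels of the KB.
toy_lexicon = {
    "philosophers": ["dbo:Philosopher"],
    "born in": ["dbo:birthPlace"],
    "saint etienne": ["dbr:Saint-Étienne"],
}
toy_stop_words = {"give", "me", "in"}

for candidate in expand("Give me philosophers born in Saint Etienne",
                        toy_lexicon, toy_stop_words):
    print(candidate)
```

With a full label index the same loop would produce the many ambiguous candidates listed in Table 2; the disambiguation is deliberately left to the later steps.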

3.2. Query construction

In this step, we construct a set of queries that represent possible interpretations of the given question within the given KB. To this end, we heavily utilize the semantics encoded in the particular KB. We start with a set R of IRIs from the previous step. The goal is to construct all possible queries containing the IRIs in R which give a non-empty result-set. Let V be the set of variables. Based on the complexity of the questions in current benchmarks, we restrict our approach to queries satisfying 4 patterns: $\begin{array}{l} \text{SELECT / ASK } var \\ \text{WHERE } \{\ s_1\ s_2\ s_3\ .\ \} \\ \text{SELECT / ASK } var \\ \text{WHERE } \{\ s_1\ s_2\ s_3\ .\ s_4\ s_5\ s_6\ .\ \} \end{array}$ with $s_1, \ldots, s_6 \in R \cup V$ and $var \in \{s_1, \ldots, s_6\} \cap V$.

These correspond to all queries containing one or two triple patterns that can be created starting from the IRIs in R. Moreover, for entity linking, we add the following two patterns: $\begin{array}{l} \text{SELECT ?x} \\ \text{WHERE } \{ \text{ VALUES ?x } \{\ iri\ \} \text{ . } \} \\ \text{SELECT ?x} \\ \text{WHERE } \{ \text{ VALUES ?x } \{\ iri\ \} \text{ . } iri \text{ ?p } iri_1 \text{ . } \} \end{array}$ with $iri, iri_1 \in R$. These correspond to all queries returning directly one of the IRIs in R with possibly one additional triple.

Note that these last queries just give back directly an entity and should be generated for a question like: “What is Apple Company?” or “Who is Marie Curie?”. An example of generated queries is given in Fig. 2.
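To make the one-triple pattern concrete, the following Python sketch enumerates such patterns by brute force over a toy in-memory KB. It is only an illustration under stated assumptions: the function names `one_triple_patterns` and `to_sparql`, the variable names `?s`, `?p`, `?o` and the toy triple are hypothetical, and the scan over all triples is a naive substitute for the efficient construction used in the approach. Because each pattern is derived from an existing triple, its result-set is non-empty by construction.

```python
from itertools import product


def one_triple_patterns(kb_triples, resources):
    """Enumerate one-triple patterns over R and variables that match the KB."""
    r = set(resources)
    variables = ("?s", "?p", "?o")
    patterns = set()
    for triple in kb_triples:
        # A term from R may stay fixed or be generalized to a variable;
        # any other term can only be a variable.
        options = [(term, var) if term in r else (var,)
                   for term, var in zip(triple, variables)]
        for pattern in product(*options):
            has_iri = any(not t.startswith("?") for t in pattern)
            has_var = any(t.startswith("?") for t in pattern)
            if has_iri and has_var:
                patterns.add(pattern)
    return patterns


def to_sparql(pattern):
    """Render a pattern as a SELECT query projecting its first variable."""
    var = next(t for t in pattern if t.startswith("?"))
    return f"SELECT {var} WHERE {{ {' '.join(pattern)} . }}"


# Hypothetical toy KB with a single triple.
toy_kb = [("ex:person1", "dbo:birthPlace", "dbr:Saint-Étienne")]
toy_r = ["dbo:birthPlace", "dbr:Saint-Étienne"]

for p in sorted(one_triple_patterns(toy_kb, toy_r)):
    print(to_sparql(p))
```

Two-triple patterns could be built the same way by joining two matching triples on a shared term, at the cost of a quadratic scan, which is the inefficiency the distance-based construction is meant to avoid.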

The main challenge is the efficient construction of these SPARQL queries. The main idea is to perform in the KB graph a breadth-first search of depth 2 starting from every IRI in R. While exploring the KB from an IRI $r_{i} \in R$, the distance $d_{r_{i}, r_{j}}$ to every other IRI $r_{j} \in R$ (where $r_{j} \neq r_{i}$) is stored. These numbers are used when constructing the queries shown above. For a detailed algorithm of the query construction phase, please see Section 4. In summary, in this section we computed a set of possible SPARQL queries (candidates). They are driven by the lexicalizations computed in Section 3.1 and represent the possible intentions expressed by the question of the user.

Fig. 2.

Some of the 395 queries constructed for the question “Give me philosophers born in Saint Etienne”. Note that all queries could be semantically related to the question. The second one is returning “Saint-Etienne” as a band, the third one the birth date of people born in the city of “Saint-Etienne” and the fourth one the birth date of persons related to philosophy.

3.3. Ranking

Now the computed candidates need to be ordered by their probability of answering the question correctly. Hence, we rank them based on the following features:

Number of the words in the question which are covered by the query. For example, the first query in Fig. 2 is covering two words (“Saint” and “born”, where “born” is covered by the property dbo:hometown).

The edit distance of the label of the resource and the word it is associated to. For example, the edit distance between the label of dbp:bornYear (which is “born year”) and the word “born” is 5.

The sum of the relevance of the resources, (e.g. the number of inlinks and the number of outlinks of a resource). This is a knowledge base independent choice, but it is also possible to use a specific score for a KB (like page-rank [16]).

The number of variables in the query.

The number of triples in the query.

If no training data is available, then we rank the queries using a linear combination of the above 5 features, where the weights are determined manually. Otherwise we assume a training dataset of questions together with the corresponding answer sets, which can be used to calculate the F-measure for each of the SPARQL query candidates. As a ranking objective, we want to order the SPARQL query candidates in descending order with respect to the F-measure. In our implementation we rank the queries using RankLib⁸ with Coordinate Ascent [41]. At test time the learned model is used to rank the queries, the top-ranked query is executed against a SPARQL endpoint, and the result is computed. An example is given in Fig. 3. Note that we do not use syntactic features. However, it is possible to use them to further improve the ranking.

⁸ https://sourceforge.net/p/lemur/wiki/RankLib/
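The ranking without training data can be sketched in a few lines of Python. The sketch below is an assumption-laden illustration, not the RankLib setup: the feature names, the weight values and the function names are hypothetical, and only the form (a manually weighted linear combination of the five features) comes from the text. The edit distance is the standard Levenshtein distance, which reproduces the value 5 quoted above for “born year” and “born”.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def score(candidate, weights):
    """Linear combination of the features named in weights."""
    return sum(weights[name] * candidate[name] for name in weights)


def rank(candidates, weights):
    """Order candidates by descending score."""
    return sorted(candidates, key=lambda c: score(c, weights), reverse=True)


# Hypothetical, manually chosen weights for the five features.
toy_weights = {
    "covered_words": 1.0,
    "edit_distance": -0.1,
    "relevance": 0.01,
    "variables": -0.2,
    "triples": -0.1,
}
toy_candidates = [
    {"name": "B", "covered_words": 2, "edit_distance": edit_distance("born year", "born"),
     "relevance": 10, "variables": 1, "triples": 1},
    {"name": "A", "covered_words": 3, "edit_distance": 0,
     "relevance": 10, "variables": 1, "triples": 2},
]
print([c["name"] for c in rank(toy_candidates, toy_weights)])
```

When training data is available, the same feature vectors would instead be passed to a learning-to-rank model, with the F-measure of each candidate as the relevance target.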

Fig. 3.

The top 4 generated queries for the question “Give me philosophers born in Saint Étienne”. (1) is the query that best matches the question; (2) gives philosophical schools of people born in Saint-Étienne; (3), (4) give people born in Saint-Étienne or that live in Saint-Étienne. The order can be seen as a decreasing approximation to what was asked.

3.4. Answer decision

The computations in the previous section lead to a list of ranked SPARQL query candidates representing our possible interpretations of the user’s intentions. We could directly give back the result of the first-ranked query from the previous step. This will (nearly) always generate an answer. However, there are situations where no suitable interpretation is generated for the question, or where the question is not answerable over the given KB. To determine if such a situation occurred, we add an additional step in which we decide if the result-set of the first query should be returned or if the system should not give back any result. This corresponds to a binary classification problem. We train a model based on logistic regression. We use a training set consisting of SPARQL queries, each labelled True or False: the label is True if the F-score of the SPARQL query is greater than a threshold $\theta_{1}$ and False otherwise. Once the model is trained, it can compute a confidence score $p_{Q} \in [0, 1]$ for a query Q. In our exemplary implementation we assume a correctly ordered list of SPARQL query candidates computed in Section 3.3. Hence, it only needs to be checked whether $p_{Q_{1}} \geq \theta_{2}$ holds for the first-ranked query $Q_{1}$ of the SPARQL query candidates. If it does, we answer the question; otherwise we assume that the whole candidate list does not reflect the user’s intention and refuse to answer. Note that $p_{Q}$ can be interpreted as the confidence that the QA system has in the generated SPARQL query Q, i.e. in the generated answer.
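The decision rule can be written down as a short Python sketch. It assumes that a logistic regression model has already been trained, so the `weights` and `bias` values below are placeholders rather than learned parameters, and the feature vector attached to each candidate is hypothetical (the text does not specify which features the classifier uses). The names `theta_1` and `theta_2` mirror the two thresholds in the text.

```python
import math


def training_label(f_score, theta_1):
    """Label of a training query: True if its F-score exceeds theta_1."""
    return f_score > theta_1


def confidence(features, weights, bias):
    """Logistic regression confidence p_Q in [0, 1] for one query."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def decide(ranked_candidates, weights, bias, theta_2):
    """Return the top-ranked query if its confidence reaches theta_2, else None."""
    if not ranked_candidates:
        return None
    query, features = ranked_candidates[0]
    if confidence(features, weights, bias) >= theta_2:
        return query
    return None


# Placeholder parameters and a single hypothetical candidate.
toy_ranked = [("Q1", [1.0, 0.0])]
print(decide(toy_ranked, weights=[2.0, -1.0], bias=0.0, theta_2=0.5))
```

Raising `theta_2` trades recall for precision: fewer questions are answered, but those that are answered come with higher confidence.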

3.5. Multiple KBs

Note that the approach can also be extended, as it is, to multiple KBs. In the query expansion step, one just has to take into consideration the labels of all KBs. In the query construction step, one can consider multiple KBs as one graph having multiple unconnected components. The query ranking and answer decision steps remain exactly the same.

3.6. Discussion

Overall, we follow a combinatorial approach with efficient pruning, that relies on the semantics encoded in the underlying KB.

In the following, we want to emphasize the advantages of this approach using some examples.

Joint disambiguation of entities and relations: For example, for interpreting the question “How many inhabitants has Paris?” between the hundreds of different meanings of “Paris” and “inhabitants” the top ranked queries contain the resources called “Paris” which are cities, and the property indicating the population, because only these make sense semantically.

Portability to different KBs: One problem in QA over KBs is the semantic gap, i.e. the difference between how we think that the knowledge is encoded in the KB and how it actually is. For example, in our approach, for the question “What is the capital of France?”, we generate the query

which probably most users would have expected, but also the query

which refers to an overview article in Wikipedia about the capitals of France and which most users would probably not expect. This important feature allows the approach to be ported to different KBs, since it is independent of how the knowledge is encoded.

Ability to bridge over implicit relations: We are able to bridge over implicit relations. For example, given “Give me German mathematicians” the following query is computed:

Here ?p1 is: dbo:field, dbo:occupation, dbo:profession

and ?p2 is: dbo:nationality, dbo:birthPlace, dbo:deathPlace, dbo:residence.

Note that all these properties could be intended for the given question, even if dbo:deathPlace could be seen as an over-generalization.

Easy to port to new languages: The only parts where the language is relevant are the stop word removal and stemming. Since these are very easy to adapt to new languages, one can port the approach easily to other languages.

Permanent system refinement: It is possible to improve the system over time. The system generates multiple queries. This fact can be used to easily create new training datasets as is shown in [14]. Using these datasets one can refine the ranker to perform better on the asked questions.

System robust to malformed questions and keyword questions: We are not using part-of-speech tagging or dependency parsers in the approach which makes it very robust to malformed questions. For this reason, keyword questions are also supported.

A disadvantage of our exemplary implementation is that the identification of relations relies on a dictionary. Note that methods not based on dictionaries follow one of the following strategies. Either they try to learn ways to express the relation from big training corpora (like in [5]), so that the problem is shifted to creating suitable training sets. Alternatively, text corpora are used to either extract lexicalizations for properties (like in [4]) or learn word embeddings (like in [31]). Hence, possible improvements might be applied to this task in the future.

To conclude, the proposed approach uses the knowledge encoded in the KB to construct candidate queries. This is novel and responsible for the main distinctive features of the approach: easy portability to new languages, easy portability to new KBs and robustness to different types of questions.

4. Fast candidate generation

In this section, we explain how the SPARQL queries described in Section 3.2 can be constructed efficiently.

Let R be a set of resources. We consider the KB as a directed labeled graph G:

Definition 1 (Graph).

A directed labeled graph is an ordered triple $G = (V, E, f)$, such that:

V is a non-empty set, called the vertex set;

E is a set, called the edge set, such that $E \subset \{(v, w) : v, w \in V\}$, i.e. a subset of the pairs of V;

For a set L called the label set, f is a function $f : E \to L$, i.e. a function that assigns to each edge a label $p \in L$. We indicate an edge with label p as $e = (v, p, w)$.

To compute the pairwise distance in G between every resource in R, we do a breadth-first search from every resource in R in an undirected way (i.e. we traverse the graph in both directions).

We define a distance function d as follows. Assume we start from a vertex r and find the following two edges $e_{1} = (r, p_{1}, r_{1})$ , $e_{2} = (r_{1}, p_{2}, r_{2})$ . We say that $d_{r, p_{1}} = 1$ , $d_{r, r_{1}} = 2$ , $d_{r, p_{2}} = 3$ and so on. When an edge is traversed in the opposite direction, we add a minus sign. For example, given the edges $e_{1} = (r, p_{1}, r_{1})$ and $e_{2} = (r_{2}, p_{2}, r_{1})$ , we say $d_{r, p_{2}} = - 3$ . For a vertex or edge r, and a variable x we artificially set $d_{r, x}$ to be any possible integer number. Moreover, we set $d_{x, y} = d_{y, x}$ for any $x, y$ . The algorithm to compute these numbers can be found in Algorithm 1.
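The signed distance convention can be illustrated with the following Python sketch, which enumerates all paths of one or two edges in both directions from each start resource. It is a simplified stand-in for Algorithm 1, not a reproduction of it: the KB is a plain list of triples instead of an HDT index, the function name `signed_distances` and the toy data are hypothetical, starts are limited to resources occurring as subject or object, and it is an assumption of this sketch that all distances found for a pair are kept as a set and that the sign reflects the direction of the last traversed edge.

```python
from collections import defaultdict


def signed_distances(triples, resources):
    """Map (start, target) pairs of resources to their signed distances."""
    targets = set(resources)
    # Each triple can be walked forwards (+1) or backwards (-1).
    steps = defaultdict(list)
    for s, p, o in triples:
        steps[s].append((p, o, 1, (s, p, o)))
        steps[o].append((p, s, -1, (s, p, o)))
    dist = defaultdict(set)

    def record(start, item, value):
        if item in targets and item != start:
            dist[(start, item)].add(value)

    for start in targets:
        for p1, n1, sign1, t1 in steps[start]:
            record(start, p1, sign1 * 1)   # first property
            record(start, n1, sign1 * 2)   # first neighbour
            for p2, n2, sign2, t2 in steps[n1]:
                if t2 == t1:
                    continue               # do not walk back along the same triple
                record(start, p2, sign2 * 3)   # second property
                record(start, n2, sign2 * 4)   # second neighbour
    return dict(dist)


# Hypothetical toy KB for the running example.
toy_triples = [
    ("ex:person1", "dbo:birthPlace", "dbr:Saint-Étienne"),
    ("ex:person1", "rdf:type", "dbo:Philosopher"),
]
toy_resources = ["dbo:Philosopher", "dbo:birthPlace", "dbr:Saint-Étienne"]
for pair, values in sorted(signed_distances(toy_triples, toy_resources).items()):
    print(pair, sorted(values))
```

In the toy KB, `dbo:birthPlace` is reached from `dbr:Saint-Étienne` against the edge direction, so its distance is -1, matching the minus-sign rule above.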

Algorithm 1:

Algorithm to compute the pairwise distance between every resource in a set R appearing in a KB

The algorithm of our exemplary implementation simply traverses the graph starting from the nodes in R in a breadth-first search manner and keeps track of the distances as defined above. The breadth-first search is done by using HDT [20] as an indexing structure.⁹ Note that HDT was originally developed as an exchange format for RDF files that is queryable. A rarely mentioned feature of HDT is that it is perfectly suitable for performing breadth-first search operations over RDF data. In HDT, the RDF graph is stored as an adjacency list, which is an ideal data structure for breadth-first search operations. This is not the case for traditional triple-stores. The use of HDT at this point is key for two reasons: (1) the performance of the breadth-first search operations, and (2) the low footprint of the index in terms of disk and memory space. Roughly, a 100 GB RDF dump can be compressed to an HDT file of approx. 10 GB [20].

⁹ https://www.w3.org/Submission/2011/03/

Based on the numbers above, we now want to recursively construct all triple patterns with at most K triples and one projection variable. Given a triple pattern T, we only want to build connected triple patterns while adding triples to T. This can be done recursively using the algorithm described in Algorithm 2. Note that thanks to the numbers collected during the breadth-first search operations, this can be performed very fast. Once the triple patterns are constructed, one can choose any of the variables which are in subject or object position as a projection variable.

The decision to generate a SELECT and/or an ASK query is made depending on some regular expressions over the beginning of the question.
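A possible shape of such a rule is sketched below in Python. The actual regular expressions are not given in the text, so the pattern `ASK_PREFIX`, the list of English auxiliary verbs and the function name `query_forms` are purely illustrative assumptions; a multilingual system would need one such pattern per language.

```python
import re

# Hypothetical pattern: English questions starting with an auxiliary verb
# are treated as yes/no questions.
ASK_PREFIX = re.compile(r"^\s*(is|are|was|were|does|do|did|has|have)\b",
                        re.IGNORECASE)


def query_forms(question):
    """Return the query forms to generate for a question."""
    if ASK_PREFIX.match(question):
        return ["ASK"]
    return ["SELECT"]


print(query_forms("Is Berlin the capital of Germany?"))
print(query_forms("Give me actors born in Berlin"))
```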

5. Evaluation

To validate the approach w.r.t. multilinguality, portability and robustness, we evaluated our approach using multiple benchmarks for QA that appeared in the last years. The different benchmarks are not comparable and they focus on different aspects of QA. For example SimpleQuestions focuses on questions that can be solved by one simple triple-pattern, while LC-QuAD focuses on more complex questions. Moreover, the QALD questions address different challenges including multilinguality and the use of keyword questions. Unlike previous works, we do not focus on one benchmark, but we analyze the behaviour of our approach under different scenarios. This is important, because it shows that our approach is not adapted to one particular benchmark, as it is often done by existing QA systems, and proves its portability.

We tested our approach on 5 different datasets, namely Wikidata,¹⁰ DBpedia,¹¹ MusicBrainz,¹² DBLP¹³ and Freebase.¹⁴ Moreover, we evaluated our approach on five different languages, namely English, German, French, Italian and Spanish. First, we describe how we selected stop words and collected lexicalizations for the different languages and KBs, then we describe and discuss our results.

¹⁰ https://www.wikidata.org/
¹¹ http://dbpedia.org
¹² https://musicbrainz.org
¹³ http://dblp.uni-trier.de
¹⁴ https://developers.google.com/freebase/

5.1. Stop words and lexicalizations

As stop words, we use the lists, for the different languages, provided by Lucene, together with some words which are very frequent in questions like “what”, “which”, “give”.

Algorithm 2:

Recursive algorithm to create all connected triple patterns from a set R of resources with maximal K triple patterns. L contains the triple patterns created recursively and $L^{(k)}$ indicates the triple patterns with exactly k triples. Moreover, $x_{k, 1}, x_{k, 2}, x_{k, 3}$ are new variables that are added in step k. Note that the “if not” conditions correspond to the four possibilities a), b), c), d) of joining two triples which are depicted above. Note that they are very often not fulfilled. This guarantees the speed of the process

Depending on the KB, we followed different strategies to collect lexicalizations. Since Wikidata has a large number of lexicalizations, we simply took all lexicalizations associated to a resource through rdfs:label,¹⁵ skos:prefLabel¹⁶ and skos:altLabel. For DBpedia, we only used the English DBpedia, where first all lexicalizations associated to a resource through the rdfs:label property were collected. Secondly, we followed the disambiguation and redirect links to get additional ones and also took into account available demonyms via dbo:demonym (e.g. to dbr:Europe we also associate the lexicalization “European”). Thirdly, by following the inter-language links, we associated the labels from the other languages to the resources. DBpedia properties are poorly covered with lexicalizations, especially when compared to Wikidata. For example, the property dbo:birthPlace has only one lexicalization, namely “birth place”, while the corresponding Wikidata property P19 has 10 English lexicalizations like “birthplace”, “born in”, “location born”, “birth city”. In our exemplary implementation two strategies were implemented. First, since we aim at a QA system for the Semantic Web, we can also take into account interlinkings between properties of distinct KBs, so that lexicalizations are merged from all KBs currently considered. There, the owl:sameAs links from DBpedia relations to Wikidata are used and every lexicalization present in Wikidata is associated to the corresponding DBpedia relation. Secondly, the DBpedia abstracts are used to find more lexicalizations for the relations. To find new lexicalizations of a property p we follow the strategy proposed by [22]. We extracted from the KB the subject-object pairs (x, y) that are connected by p. Then the abstracts are scanned and all sentences are retrieved which contain both label(x) and label(y). At the end, the segments of text between label(x) and label(y), or label(y) and label(x), are extracted. We rank the extracted text segments and we choose the most frequent ones. This was done only for English.

¹⁵ rdfs: http://www.w3.org/2000/01/rdf-schema#
¹⁶ skos: http://www.w3.org/2004/02/skos/core#
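The abstract-based strategy can be sketched as follows in Python. This is a minimal illustration under stated assumptions, not the procedure of [22] in detail: the function name `property_lexicalizations`, the `top_k` parameter, the regular-expression matching and the toy labels and sentences are all hypothetical, and real abstracts would first need sentence splitting.

```python
import re
from collections import Counter


def property_lexicalizations(label_pairs, sentences, top_k=3):
    """Return the most frequent text segments found between two labels.

    label_pairs holds (label(x), label(y)) for pairs connected by the property.
    """
    counts = Counter()
    for label_x, label_y in label_pairs:
        # Look for the segment in both orders: x ... y and y ... x.
        for first, second in ((label_x, label_y), (label_y, label_x)):
            pattern = re.compile(
                re.escape(first) + r"\s+(.+?)\s+" + re.escape(second))
            for sentence in sentences:
                for match in pattern.finditer(sentence):
                    counts[match.group(1).lower()] += 1
    return [segment for segment, _ in counts.most_common(top_k)]


# Toy labels and sentences, invented for illustration only.
toy_pairs = [("Alice", "Paris"), ("Bob", "Rome")]
toy_sentences = [
    "Alice was born in Paris.",
    "Bob was born in Rome.",
    "Bob died in Rome.",
]
print(property_lexicalizations(toy_pairs, toy_sentences, top_k=1))
```

The frequency ranking is what filters out segments such as “died in”, which co-occur with the same pairs but less often than the intended relation.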

For MusicBrainz we used the lexicalizations attached to purl:title,¹⁷ foaf:name,¹⁸ skos:altLabel and rdfs:label. For DBLP we used only those attached to rdfs:label. Note that MusicBrainz and DBLP contain only a few properties. We aligned them manually with Wikidata and moved the lexicalizations from one KB to the other. The mappings can be found under http://goo.gl/ujbwFW and http://goo.gl/ftzegZ respectively. This took in total 1 hour of manual work.

¹⁷ purl: http://purl.org/dc/elements/1.1/
¹⁸ foaf: http://xmlns.com/foaf/.

For Freebase, we considered the lexicalizations attached to rdfs:label. We also followed the few available links to Wikidata. Finally, we took the 20 most prominent properties in the training set of the SimpleQuestions benchmark and looked at their lexicalizations in the first 100 questions of SimpleQuestions. We manually extracted the lexicalizations for them. This took 1 hour of manual work. We did not use the other (75,810 training and 10,845 validation) questions, i.e., unlike previous works we only took a small fraction of the available training data.

We want to briefly discuss the strategies we used here. We do not see any option other than manually indicating the lexicalizations for instances. For example, in MusicBrainz the property purl:title must be selected, otherwise one cannot find any existing album. On the other hand, there could be a property expressing the cover description of an album. We do not see any method to determine automatically why the property purl:title should be used as a lexicalization, but not the one about the cover description. We therefore think that the only available solution is to make the standard clearer on how to express such information. Regarding the lexicalization of relations, the situation is different. The literature contains a number of approaches that can be used to generate them. For an overview, we refer to Section 7 in [12]. All works suppose one of the three following situations: there is some free text that also contains such knowledge, or the lexicalizations are expressed in some external database like WordNet, or a training repository to learn them from question and answer pairs is available. For knowledge bases like MusicBrainz none of the three alternatives is available, so we believe that in this case, too, the manual work cannot be avoided.

Table 3

This table summarizes the results obtained by the QA systems evaluated with QALD-3 (over DBpedia 3.8), QALD-4 (over DBpedia 3.9), QALD-5 (over DBpedia 2014), QALD-6 (over DBpedia 2015-10), QALD-7 (over DBpedia 2016-04) and QALD-8 (over DBpedia 2016-10). We indicate with “∗” the systems that did not participate directly in the challenges, but were evaluated on the same benchmark afterwards. We indicate the average running time of a query for the systems where we found it. Even though the runtime evaluations were executed on different hardware, they still help to give an idea of the scalability.

QA system	Language	Type	Total	Precision	Recall	F-measure	Runtime	Ref
QALD-3
WDAqua-core1	En	Full	100	0.64	0.42	0.51	1.01	-
WDAqua-core1	En	Key	100	0.71	0.37	0.48	0.79	-
WDAqua-core1	De	Key	100	0.79	0.31	0.45	0.22	-
WDAqua-core1	De	Full	100	0.79	0.28	0.42	0.30	-
WDAqua-core1	Fr	Key	100	0.83	0.27	0.41	0.26	-
gAnswer [66]∗	En	Full	100	0.40	0.40	0.40	≈1 s	[66]
WDAqua-core1	Fr	Full	100	0.70	0.26	0.38	0.37	-
WDAqua-core1	Es	Full	100	0.77	0.24	0.37	0.27	-
WDAqua-core1	It	Full	100	0.79	0.23	0.36	0.30	-
WDAqua-core1	It	Key	100	0.84	0.23	0.36	0.24	-
WDAqua-core1	Es	Key	100	0.80	0.23	0.36	0.23	-
RTV [28]	En	Full	99	0.32	0.34	0.33	-	[7]
Intui2 [17]	En	Full	99	0.32	0.32	0.32	-	[7]
SINA [48]∗	En	Full	100	0.32	0.32	0.32	≈10-20 s	[48]
DEANNA [61]∗	En	Full	100	0.21	0.21	0.21	≈1-50 s	[66]
SWIP [45]	En	Full	99	0.16	0.17	0.17	-	[7]
Zhu et al. [64]∗	En	Full	99	0.38	0.42	0.38	-	[64]
QALD-4
Xser [60]	En	Full	50	0.72	0.71	0.72	-	[53]
WDAqua-core1	En	Key	50	0.76	0.40	0.52	0.32 s	-
WDAqua-core1	En	Full	50	0.56	0.30	0.39	0.46 s	-
gAnswer [66]	En	Full	50	0.37	0.37	0.37	0.973 s	[53]
CASIA [33]	En	Full	50	0.32	0.40	0.36	-	[53]
WDAqua-core1	De	Key	50	0.92	0.20	0.33	0.04 s	-
WDAqua-core1	Fr	Key	50	0.92	0.20	0.33	0.06 s	-
WDAqua-core1	It	Key	50	0.92	0.20	0.33	0.04 s	-
WDAqua-core1	Es	Key	50	0.92	0.20	0.33	0.05 s	-
WDAqua-core1	De	Full	50	0.90	0.20	0.32	0.06 s	-
WDAqua-core1	It	Full	50	0.92	0.20	0.32	0.16 s	-
WDAqua-core1	Es	Full	50	0.90	0.20	0.32	0.06 s	-
WDAqua-core1	Fr	Full	50	0.86	0.18	0.29	0.09 s	-
Intui3 [18]	En	Full	50	0.23	0.25	0.24	-	[53]
ISOFT [44]	En	Full	50	0.21	0.26	0.23	-	[53]
Hakimov [32]∗	En	Full	50	0.52	0.13	0.21	-	[32]

QALD-5
Xser [60]	En	Full	50	0.74	0.72	0.73	-	[54]
UTQA [59]	En	Full	50	-	-	0.65	-	[59]
UTQA [59]	Es	Full	50	0.55	0.53	0.54	-	[59]
UTQA [59]	Fs	Full	50	0.53	0.51	0.52	-	[59]
WDAqua-core1	En	Full	50	0.56	0.41	0.47	0.62 s	-
WDAqua-core1	En	Key	50	0.60	0.27	0.37	0.50 s	-
AskNow [19]	En	Full	50	0.32	0.34	0.33	-	[19]
QAnswer [47]	En	Full	50	0.34	0.26	0.29	-	[54]
WDAqua-core1	De	Full	50	0.92	0.16	0.28	0.20 s	-
WDAqua-core1	De	Key	50	0.90	0.16	0.28	0.19 s	-
WDAqua-core1	Fr	Full	50	0.90	0.16	0.28	0.19 s	-
WDAqua-core1	Fr	Key	50	0.90	0.16	0.28	0.18 s	-
WDAqua-core1	It	Full	50	0.88	0.18	0.30	0.20 s	-
WDAqua-core1	It	Key	50	0.90	0.16	0.28	0.18 s	-
WDAqua-core1	Es	Full	50	0.88	0.14	0.25	0.20 s	-
WDAqua-core1	Es	Key	50	0.90	0.14	0.25	0.20 s	-
SemGraphQA [3]	En	Full	50	0.19	0.20	0.20	-	[54]
YodaQA [2]	En	Full	50	0.18	0.17	0.18	-	[54]
QuerioDali [37]	En	Full	50	?	?	?	?	[37]
QALD-6
WDAqua-core1	En	Full	100	0.55	0.34	0.42	1.28 s	-
WDAqua-core1	De	Full	100	0.73	0.29	0.41	0.41 s	-
WDAqua-core1	De	Key	100	0.85	0.27	0.41	0.30 s	-
WDAqua-core1	En	Key	100	0.51	0.30	0.37	1.00 s	-
SemGraphQA [3]	En	Full	100	0.70	0.25	0.37	-	[55]
WDAqua-core1	Fr	Key	100	0.78	0.23	0.36	0.34 s	-
WDAqua-core1	Fr	Full	100	0.57	0.22	0.32	0.46 s	-
WDAqua-core1	Es	Full	100	0.69	0.19	0.30	0.45 s	-
WDAqua-core1	Es	Key	100	0.83	0.18	0.30	0.35 s	-
WDAqua-core1	It	Key	100	0.75	0.17	0.28	0.34 s	-
AMUSE [31]	En	Full	100	-	-	0.26	-	[31]
WDAqua-core1	It	Full	100	0.62	0.15	0.24	0.43 s	-
AMUSE [31]	Es	Full	100	-	-	0.20	-	[31]
AMUSE [31]	De	Full	100	-	-	0.16	-	[31]
QALD-7
AMAL [46]	Fr	Full	47	0.72	0.72	0.72	-	[57]
gAnswer2 [34]	En	Full	50	0.40	0.46	0.41	3.33 s	[23]
WDAqua-core1	En	Full	50	0.39	0.40	0.37	0.55 s	[23]
WDAqua-core1	De	Full	50	0.23	0.26	0.22	0.53 s	[24]
WDAqua-core1	Fr	Full	50	0.13	0.16	0.14	0.45 s	[25]
WDAqua-core1	It	Full	50	0.09	0.05	0.03	0.44 s	[26]
QALD-8
WDAqua-core1	En	Full	47	0.39	0.40	0.39	1.72 s	[27,56]
gAnswer2 [34]	En	Full	47	0.39	0.39	0.39	1.92 s	[27,56]
QAKIS [8]	En	Full	47	0.06	0.05	0.06	15.41 s	[27,56]

5.2. Experiments

To show the performance of the approach on different scenarios, we benchmarked it using the following benchmarks.

Table 4
The table shows the results of WDAqua-core1 over the QALD-7 task 4 training dataset. We used Wikidata (dated 2016-11-28)

QA System	Language	Type	Number	Precision	Recall	F-measure	Runtime	Ref
QALD-7 task 4, training dataset
WDAqua-core1	En	Full	100	0.37	0.39	0.37	1.68 s	-
WDAqua-core1	En	Key	100	0.35	0.38	0.35	0.80 s	-
WDAqua-core1	Es	Key	100	0.31	0.32	0.31	0.45 s	-
Sorokin et al. [50]	En	Full	100	-	-	0.29	-	[50]
WDAqua-core1	De	Key	100	0.27	0.28	0.27	1.13 s	-
WDAqua-core1	Fr	Key	100	0.27	0.30	0.27	1.14 s	-
WDAqua-core1	Fr	Full	100	0.27	0.31	0.27	1.05 s	-
WDAqua-core1	Es	Full	100	0.24	0.26	0.24	0.65 s	-
WDAqua-core1	De	Full	100	0.18	0.20	0.18	0.82 s	-
WDAqua-core1	It	Full	100	0.19	0.20	0.18	1.00 s	-
WDAqua-core1	It	Key	100	0.17	0.18	0.16	0.44 s	-

5.2.1. Benchmarks

QALD: We evaluated our approach using the QALD benchmarks. These benchmarks allow us to see the performance on multiple languages and over both full-natural language questions and keyword questions. We executed the benchmarks for QALD-3 to QALD-6 locally while respecting the metrics of the original benchmarks. For QALD-7 and QALD-8 we relied on Gerbil for QA [58]. We did not use Gerbil for QA for all the benchmarks since Gerbil for QA uses only the newest version of DBpedia while QALD-3 to QALD-6 were based on different versions of DBpedia. Moreover, Gerbil for QA does not support benchmarking for keyword queries so that the benchmark results are not presented for them on QALD-7 and QALD-8. Finally note that the benchmarking metrics changed from QALD-7.
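As a reading aid for Table 3, the following minimal sketch relates the F-measure column to the precision and recall columns. It assumes that, for QALD-3 to QALD-6, the F-measure is the harmonic mean of the reported precision and recall; the function name `f_measure` is ours and not part of any benchmark tooling. Because the published values are rounded, a recomputed value can differ by 0.01 for some rows. For QALD-7 and QALD-8 the reported F-measure does not follow from the reported precision and recall in this way (e.g. 0.39 and 0.40 are listed with 0.37), which is consistent with the change of metrics mentioned above.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) of WDAqua-core1, English, full questions, from Table 3.
rows = {
    "QALD-3": (0.64, 0.42),
    "QALD-6": (0.55, 0.34),
}

for benchmark, (p, r) in rows.items():
    # Rounded to two decimals, as in the table.
    print(benchmark, round(f_measure(p, r), 2))
```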

The results are given in Table 3 together with state-of-the-art systems. To find these, we used Google Scholar to select all publications about QA systems that cited one of the QALD challenge publications. Note that, in the past, QA systems were evaluated only on one or two of the QALD benchmarks. We provide, for the first time, an estimation of the differences between the benchmark series. Over English, we outperformed 90% of the proposed approaches. We achieve results similar to gAnswer2 [34], while we do not beat Xser [60], UTQA [59] and AMAL [46]. Note that Xser and UTQA required training data beyond that provided in the benchmark, which entailed a significant cost in terms of manual effort. AMAL uses manual translations of the labels to address the lexical gap and can answer only questions with one triple pattern. Moreover, the robustness of these systems over keyword questions is probably not guaranteed. We cannot prove this claim because neither the source code nor a web service is available for these systems.

Due to the manual effort required to do an error analysis for all benchmarks and the limited space, we restricted the analysis to the QALD-6 benchmark. The error sources over the 100 questions are the following:

40% (26 errors) are due to lexical gap (e.g. for “Who played Gus Fring in Breaking Bad?” the property dbo:portrayer is expected)

28% (18 errors) come from wrong ranking

12% (8 errors) are due to the missing support of superlatives and comparatives in our implementation (e.g. “Which Indian company has the most employees?”)

9% (4 errors) from the need of complex queries with unions or filters (e.g. the question “Give me a list of all critically endangered birds.” requires a filter on dbo:conservationStatus equal “CR”)

6% (4 errors) come from out of scope questions (i.e. questions that should not be answered)

2% (1 error) from too ambiguous questions (e.g. “Who developed Slack?” is expected to refer to a “cloud-based team collaboration tool” while we interpret it as “linux distribution”).

One can see that keyword queries always perform worse as compared to full natural language queries. The reason is that the formulation of the keyword queries does not allow us to decide if the query is an ASK query or if a COUNT is needed (e.g. “Did Elvis Presley have children?” is formulated as “Elvis Presley, children”). This means that we automatically get these questions wrong.
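To make the ambiguity concrete, the sketch below builds the three SPARQL query forms that are all compatible with the keyword input “Elvis Presley, children”. It is an illustration only: the helper `candidate_queries` is hypothetical, and the choice of dbr:Elvis_Presley and dbo:child as the matched resource and property is an assumption, not the output of our system.

```python
from typing import Dict

PREFIXES = (
    "PREFIX dbr: <http://dbpedia.org/resource/>\n"
    "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
)


def candidate_queries(subject: str, prop: str) -> Dict[str, str]:
    """Return the SELECT, ASK and COUNT variants over the same triple pattern."""
    pattern = f"{subject} {prop} ?o"
    return {
        # "Who are the children of Elvis Presley?"
        "select": f"{PREFIXES}SELECT ?o WHERE {{ {pattern} }}",
        # "Did Elvis Presley have children?"
        "ask": f"{PREFIXES}ASK WHERE {{ {pattern} }}",
        # "How many children did Elvis Presley have?"
        "count": f"{PREFIXES}SELECT (COUNT(?o) AS ?n) WHERE {{ {pattern} }}",
    }


# Assumed mapping of the keywords to a DBpedia resource and property.
for form, query in candidate_queries("dbr:Elvis_Presley", "dbo:child").items():
    print(form, "->", query.splitlines()[-1])
```

All three queries share the same triple pattern, so the keywords alone do not determine which form is intended.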

To show the performance over Wikidata, we consider the QALD-7 task 4 training dataset. This originally provided only English questions. The QALD-7 task 4 training dataset reuses questions over DBpedia from previous challenges for which translations in other languages were available. We moved these translations to the dataset. The results can be seen in Table 4. Except for English, keyword questions are easier than full natural language questions. The reason is the formulation of the questions: for keyword questions the lexical gap is smaller. For example, the keyword question corresponding to the question “Qui écrivit Harry Potter?” is “écrivain, Harry Potter”. Stemming does not suffice to map “écrivit” to “écrivain”; lemmatization would be needed. This problem is much smaller for English, where the effect described over DBpedia dominates. We can see that the best performing language is English, while the worst performing language is Italian. This is mostly related to the smaller number of lexicalizations for Italian. Note that the performance of the QA approach over Wikidata correlates with the number of lexicalizations for resources and properties in the different languages, as described in [36]. This indicates that the quality of the data in different languages directly affects the performance of the QA system. Hence, we can expect that our results will improve as the data quality increases. Finally, we outperform the other QA system presented for this benchmark.

Table 5
This table summarizes the QA systems evaluated over SQA2018. This benchmark is designed to measure the scalability of the approach, i.e. how many questions can be answered on increasing load

QA System	Language	Type	Total	Precision	Recall	F-measure	Power	Ref
SQA2018
WDAqua-core1	En	Full	1830	0.37	0.38	0.36	0.47	[43]
GQA [65]	En	Full	1830	0.02	0.02	0.02	0.03	[43]
LAMA [46]	En	Full	1830	0.01	0.02	0.01	0.02	[43]

Table 6

This table summarizes the QA systems evaluated over SimpleQuestions. Every system was evaluated over FB2M except the ones marked with (∗) which were evaluated over FB5M

QA System	Language	Type	Total	Accuracy	Runtime	Ref
SimpleQuestions
Lukovnikov et al.	En	Full	21687	0.712	-	[39]
Golub and He	En	Full	21687	0.709	-	[30]
Yin et al.	En	Full	21687	0.683	-	[63]
Bordes et al.	En	Full	21687	0.627	-	[5]
Dai et al.∗	En	Full	21687	0.626	-	[9]
WDAqua-core1∗	En	Full	21687	0.571	2.1 s	-

Table 7

This table summarizes the results of WDAqua-core1 over some newly appeared benchmarks

Benchmark	Lang	Type	Total	Precision	Recall	F-measure	Runtime
LC-QuAD	En	Full	5000	0.59	0.38	0.46	1.5 s
WDAquaCore0Questions	Mixed	Mixed	689	0.79	0.46	0.59	1.3 s

Table 8

Comparison on QALD-6 when querying only DBpedia and multiple KBs at the same time

Dataset	Language	Type	Total	Precision	Recall	F-measure	Runtime
DBpedia	En	Full	100	0.55	0.34	0.42	1.37 s
All KBs supported	En	Full	100	0.49	0.39	0.43	11.94 s

SQA2018: SQA2018 is a benchmark to test how a QA system behaves under heavy load. Each minute, the benchmark sends an increasing number of queries: first 1, then 2, 4, 8 and so on. The performance is measured using the power metric, which takes into consideration the precision, the recall and the number of queries that could be answered. The exact metric is reported in [43]. The results in Table 5 show that WDAqua-core1 outperforms the competing approaches by a large margin. This shows that scalability is well addressed.
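The load pattern can be pictured with a few lines of code. This is only a sketch of the schedule “1, 2, 4, 8 and so on”, assuming that the number of queries doubles exactly every minute; the function name `load_schedule` is ours, and the power metric itself is not reproduced here (see [43] for its definition).

```python
from typing import List


def load_schedule(minutes: int) -> List[int]:
    """Number of queries sent in each minute, doubling from 1."""
    return [2 ** minute for minute in range(minutes)]


schedule = load_schedule(6)
print(schedule)       # queries sent per minute
print(sum(schedule))  # total queries sent after six minutes
```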

SimpleQuestions: SimpleQuestions contains 108,442 questions that can be solved using one triple pattern. We trained our system using the first 100 questions in the training set. The results of our system, together with the state-of-the-art systems, are presented in Table 6. For this evaluation, we restricted the generated queries to those with one triple pattern. The accuracy of the system is about 14 percentage points below the state of the art (0.571 versus 0.712 in Table 6). Note that we achieve this result by considering only 100 of the 75,810 questions in the training set, and by investing 1 hour of manual work in creating lexicalizations for properties. Concretely, instead of generating a training dataset with 80,000 questions, which can cost several thousands of euros, we invested 1 hour of manual work, with the result of losing (only) 14 points in accuracy!
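The two figures in this comparison can be checked with simple arithmetic. The sketch below uses only numbers stated in the text and in Table 6; the helper names are ours.

```python
def accuracy_gap_points(best: float, ours: float) -> float:
    """Difference between two accuracies, in percentage points."""
    return round((best - ours) * 100, 1)


def share_percent(used: int, total: int) -> float:
    """Share of the training questions that were actually used, in percent."""
    return round(used / total * 100, 2)


# 0.712: best accuracy in Table 6; 0.571: WDAqua-core1 in Table 6.
print(accuracy_gap_points(0.712, 0.571))
# 100 of the 75,810 training questions were used.
print(share_percent(100, 75_810))
```

The gap is therefore a difference in absolute accuracy (percentage points), obtained with well under one percent of the training questions.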

Note that the SimpleQuestions dataset is highly skewed towards certain properties (it contains 1629 properties, the 20 most frequent properties cover nearly 50% of the questions). Therefore, it is not clear how the other QA systems behave with respect to properties not appearing in the training dataset and with respect to keyword questions. Moreover, it is not clear how to port the existing approaches to new languages and it is not possible to adapt them to more difficult questions. These points are solved using our approach. Hence, we provided here, for the first time, a quantitative analysis of the impact of big training data corpora on the quality of a QA system.
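The skew can be quantified as the share of questions covered by the k most frequent properties. The following sketch shows one way to compute it, assuming one gold property per question; the function name and the toy input are ours, and the toy list is not SimpleQuestions data.

```python
from collections import Counter
from typing import Iterable


def top_k_coverage(properties: Iterable[str], k: int) -> float:
    """Fraction of questions whose property is among the k most frequent ones."""
    counts = Counter(properties)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(n for _, n in counts.most_common(k))
    return covered / total


# Toy example: one property label per question (made-up labels).
toy = ["p1"] * 5 + ["p2"] * 3 + ["p3", "p4"]
print(top_k_coverage(toy, 2))  # share covered by the two most frequent properties
```

Applied to the gold properties of the SimpleQuestions training set with k = 20, such a function would be expected to return a value close to 0.5, in line with the statement above.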

LC-QuAD & WDAquaCore0Questions: Recently, a series of new benchmarks have been published. LC-QuAD [52] is a benchmark containing 5,000 English questions and it concentrates on complex questions. WDAquaCore0Questions [14] is a benchmark containing 689 questions over multiple languages and addressing mainly Wikidata, generated from the logs of a live running QA system. The questions are a mixture of real-world keyword and malformed questions. In Table 7, we present the first baselines for these benchmarks.

Multiple KBs: The only available benchmark that tackles multiple KBs was presented in QALD-4 task 2. The KBs are rather small and perfectly interlinked. This is not the case for the KBs considered here. We therefore evaluated the ability to query multiple KBs differently. We ran the questions of the QALD-6 benchmark, which was designed for DBpedia, both over DBpedia (only) and over DBpedia, Wikidata, MusicBrainz, DBLP and Freebase. Note that, while the original questions have a solution over DBpedia, a good answer could also be found over the other datasets. We therefore manually checked whether the answers that were found in other KBs are right (independently of which KB was chosen by the QA system to answer the question). The results are presented in Table 8. WDAqua-core1 chose 53 times to answer a question over DBpedia, 39 times over Wikidata and the other 8 times over a different KB. Note that we get better results when querying multiple KBs. Globally, we get better recall and lower precision, which is expected. While scalability is an issue, we are able to pick the right KB to find the answer!

Note: We did not tackle the WebQuestions benchmark for the following reasons. While it has been shown that WebQuestions can be addressed using non-reified versions of Freebase, this was not the original goal of the benchmark. More than 60% of the QA systems benchmarked over WebQuestions are tailored towards its reification model. There are two important points here. First, most KBs in the Semantic Web use binary statements. Secondly, in the Semantic Web community, many different reification models have been developed, as described in [29].

5.2.2. Setting

All experiments were performed on a virtual machine with 4 cores of an Intel Xeon E5-2667 v3 at 3.2 GHz, 16 GB of RAM and 500 GB of SSD disk. Note that the whole infrastructure was running on this machine, i.e. all indexes and the triple stores needed to compute the answers (no external service was used). The original data dumps sum up to 336 GB. Note that across all benchmarks we can answer a question in less than 2 seconds, except when all KBs are queried at the same time, which shows that the algorithm should be parallelized for further optimization.
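One possible shape of such a parallelization is sketched below. It is not our implementation: `answer_over_kb` is a hypothetical stand-in for running the pipeline over a single KB, and selecting the final answer by the highest score is an assumption made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

KBS = ["dbpedia", "wikidata", "musicbrainz", "dblp", "freebase"]


def answer_over_kb(question: str, kb: str) -> Dict:
    """Hypothetical placeholder: run the QA pipeline over one KB."""
    return {"kb": kb, "answer": None, "score": 0.0}


def answer_over_all(
    question: str,
    kbs: List[str] = KBS,
    answer_fn: Callable[[str, str], Dict] = answer_over_kb,
) -> Dict:
    """Query every KB concurrently and keep the highest-scoring candidate."""
    with ThreadPoolExecutor(max_workers=len(kbs)) as pool:
        # One task per KB; pool.map preserves the order of `kbs`.
        results = list(pool.map(lambda kb: answer_fn(question, kb), kbs))
    # Assumed selection rule: the candidate with the highest score wins.
    return max(results, key=lambda r: r["score"])


print(answer_over_all("Give me museums in Lyon")["kb"])
```

With threads, the total latency would be bounded by the slowest KB rather than by the sum over all KBs, provided that the per-KB work is dominated by I/O such as index and triple-store lookups.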

6. Provided services for multilingual and multi-KB QA

We have presented an algorithm that can be easily ported to new KBs and that can query multiple KBs at the same time. In the evaluation section, we have shown that our approach is competitive while offering the advantage of being multilingual and robust to keyword questions. Moreover, we have shown that we can achieve acceptable run-times on a modern laptop. In this section, we describe how we integrated the approach into an actual service and how we combined it with existing services so that it can be directly used by end-users.

First, we integrated WDAqua-core1 into Qanary [6,13], a framework to integrate QA components. This way, WDAqua-core1 can be accessed via RESTful interfaces, for example to benchmark it via Gerbil for QA [58]. It also allows combining it with services that are already integrated into Qanary, like a speech recognition component based on Kaldi19 and a language detection component based on [42]. Moreover, the integration into Qanary allows reusing Trill [10], a reusable front-end for QA systems. A screenshot of Trill using WDAqua-core1 in the back-end can be found in Fig. 4.

¹⁹
http://kaldi-asr.org

Secondly, we reused and extended Trill to make it easily portable to new KBs. While Trill originally supported only DBpedia and Wikidata, it now also supports MusicBrainz, DBLP and Freebase. We designed the extension so that it can be easily ported to new KBs. Enabling the support for a new KB is mainly reduced to writing an adapted SPARQL query for the new KB. Additionally, the extension allows selecting multiple KBs at the same time.
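The idea of “one adapted SPARQL query per KB” can be sketched as a registry of query templates. The code below is a hypothetical illustration and not Trill's actual configuration: the dictionary, the function name, the shape of the queries, and the expansion of purl:title to the Dublin Core IRI are all assumptions.

```python
RDFS = "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"

# Hypothetical registry: one label-lookup template per supported KB.
LABEL_QUERY_TEMPLATES = {
    "dbpedia": RDFS
    + 'SELECT ?label WHERE {{ <{uri}> rdfs:label ?label . '
    + 'FILTER(lang(?label) = "{lang}") }}',
    # Assumed IRI for the title property used as label in MusicBrainz.
    "musicbrainz": "SELECT ?label WHERE {{ <{uri}> "
    + "<http://purl.org/dc/elements/1.1/title> ?label }}",
}


def label_query(kb: str, uri: str, lang: str = "en") -> str:
    """Fill the template registered for `kb`; fail loudly for unknown KBs."""
    try:
        template = LABEL_QUERY_TEMPLATES[kb]
    except KeyError:
        raise ValueError(f"no query template registered for KB {kb!r}") from None
    return template.format(uri=uri, lang=lang)


print(label_query("dbpedia", "http://dbpedia.org/resource/Lyon", "fr"))
```

Under this design, supporting a further KB amounts to adding one entry to the registry, while the rest of the front-end stays unchanged.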

Thirdly, we adapted some services that are used in Trill to be easily portable to new KBs. These include SPARQLToUser [11], a tool that generates a human-readable version of a SPARQL query, and SummaServer [16], a service for entity summarization. Finally, we extended Trill to present aggregated information in the form of collections of images and map aggregations, depending on the available information [15]. All these tools now support the 5 mentioned KBs and the 5 mentioned languages.

A public online demo is available under: www.wdaqua.eu/qa

Moreover, there is an open API available that is described at http://wdaqua-frontend.univ-st-etienne.fr/faq

This is for example implemented by Gerbil for QA [58] and the DBpediaChat [1].20

²⁰

http://chat.dbpedia.org/

Fig. 4.

Screenshot of Trill, using in the back-end WDAqua-core1, for the question “Give me museums in Lyon”.

7. Conclusion and future work

In this paper, we introduced a novel concept for multilingual and KB-agnostic QA. Owing to the described characteristics of our approach, portability is ensured, which is a significant advantage over previous approaches. We have shown the power of our approach in an extensive evaluation over multiple benchmarks. Hence, we have clearly shown our contributions with respect to qualitative improvements (languages, KBs) and quantitative improvements (outperforming many existing systems and querying multiple KBs), as well as the capability of our approach to scale to very large KBs like DBpedia.

We have applied our algorithm and adapted a set of existing services so that end-users can query, using multiple languages, multiple KBs at the same time, using a unified interface. Hence, we provided here a major step towards QA over the Semantic Web following our larger research agenda of providing QA over the LOD cloud.

In the future, we want to tackle the following points. First, we want to parallelize our approach, so that acceptable response times are achieved when querying multiple KBs. Secondly, we want to query more and more KBs (hints to interesting KBs are welcome). Thirdly, based on the lessons learned from querying multiple KBs, we want to give a set of recommendations for RDF datasets, so that they are fit for QA. Fourth, we want to extend our approach to also query reified data. Fifth, we would like to extend the approach to be able to answer questions including complex operators like aggregations and functions. We believe that our work can further boost the expansion of the Semantic Web, since we presented a solution that allows end-users to consume RDF data directly while requiring low hardware investments.

Note.

There is a patent pending for the presented approach. It was submitted on 18 January 2018 to the EPO and has the number EP18305035.0.

Acknowledgements

This project has received funding from the European Union’s Horizon 2020 research and innovation program under the Marie Sklodowska-Curie grant agreement No 642795.

References

1. R.G. Athreya, A.-C.N. Ngomo and Usbeck, Enhancing community interactions with data-driven chatbots-the DBpedia chatbot, in: Companion of the Web Conference 2018 on the Web Conference 2018, WWW 2018, Lyon, France, April 23–27, 2018, 2018, pp. 143–146. doi:10.1145/3184558.3186964.

2. Baudiš and Šedivỳ, QALD Challenge and the YodaQA System: Prototype Notes, 2015.

3. Beaumont, Grau and A.-L. Ligozat, SemGraphQA@QALD5: LIMSI participation at QALD5@CLEF, in: Working Notes of CLEF 2015 – Conference and Labs of the Evaluation Forum, Toulouse, France, 2015, pp. 8–11, http://ceur-ws.org/Vol-1391/164-CR.pdf.

4. Berant, Chou, Frostig and Liang, Semantic parsing on freebase from question-answer pairs, in: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, Seattle, Washington, USA, 18–21 October 2013, 2013, pp. 1533–1544, Grand Hyatt Seattle, A meeting of SIGDAT, a Special Interest Group of the ACL, http://aclweb.org/anthology/D/D13/D13-1160.pdf.

5. Bordes, Usunier, Chopra and Weston, Large-scale Simple Question Answering with Memory Networks. CoRR, 2015, abs/1506.02075 (2015), http://arxiv.org/abs/1506.02075, arXiv:1506.02075.

6. Both, Diefenbach, Singh, Shekarpour, Cherix and Lange, Qanary – a methodology for vocabulary-driven open question answering systems, in: ESWC 2016, 2016.

7. Cabrio, Cimiano, López, A.-C.N. Ngomo, Unger and Walter, QALD-3: Multilingual question answering over linked data, in: Working Notes for CLEF 2013 Conference, Valencia, Spain, September 23–26, 2013, 2013, http://ceur-ws.org/Vol-1179/CLEF2013wn-QALD3-CabrioEt2013.pdf.

8. Cabrio, Cojan, Gandon and Hallili, Querying multilingual DBpedia with QAKiS, in: The Semantic Web: ESWC 2013 Satellite Events – ESWC 2013 Satellite Events, Montpellier, France, May 26–30, 2013, 2013, pp. 194–198. Revised Selected Papers. doi:10.1007/978-3-642-41242-4_23.

9. Dai, Li and Xu, CFO: Conditional focused neural question answering with large-scale knowledge bases, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, Berlin, Germany, August 7–12, 2016, Vol. 1, Long Papers, 2016, http://aclweb.org/anthology/P/P16/P16-1076.pdf.

10. Diefenbach, Amjad, Both, K.D. Singh and Maret, Trill: A reusable front-end for QA systems, in: The Semantic Web: ESWC 2017 Satellite Events – ESWC 2017 Satellite Events, Portorož, Slovenia, May 28–June 1, 2017, 2017, pp. 48–53. Revised Selected Papers. doi:10.1007/978-3-319-70407-4_10.

11. Diefenbach, Dridi, K.D. Singh and Maret, SPARQLtoUser: Did the question answering system understand me? in: Joint Proceedings of BLINK2017: 2nd International Workshop on Benchmarking Linked Data and NLIWoD3: Natural Language Interfaces for the Web of Data Co-Located with 16th International Semantic Web Conference (ISWC 2017), Vienna, Austria, October 21–22, 2017, 2017, http://ceur-ws.org/Vol-1932/paper-01.pdf.

12. Diefenbach, López, K.D. Singh and Maret, Core techniques of question answering systems over knowledge bases: A survey, Knowl. Inf. Syst. 55(3) (2018), 529–569. doi:10.1007/s10115-017-1100-y.

13. Diefenbach, Singh, Both, Cherix, Lange and Auer, The qanary ecosystem: Getting new insights by composing question answering pipelines, in: Proceedings of Web Engineering – 17th International Conference, ICWE 2017, Rome, Italy, June 5–8, 2017, 2017, pp. 171–189. doi:10.1007/978-3-319-60131-1_10.

14. Diefenbach, T.P. Tanon, K.D. Singh and Maret, Question answering benchmarks for Wikidata, in: Proceedings of the ISWC 2017 Posters & Demonstrations and Industry Tracks Co-Located with 16th International Semantic Web Conference (ISWC 2017), Vienna, Austria, October 23–25, 2017, http://ceur-ws.org/Vol-1963/paper555.pdf.

15. Diefenbach, Tardiveau, Both, Singh and Maret, Lessons learned from a knowledge-driven search application on-top of large data sets, in: Workshop on Visual Interfaces for Big Data Environments in Industrial Applications (VisBIA 2018) Co-Located with International Conference on Advanced Visual Interfaces (AVI 2018), 2017, pp. 32–39.

16. Diefenbach and Thalhammer, PageRank and Generic Entity Summarization for RDF Knowledge Bases, in: Proceedings of the Semantic Web – 15th International Conference, ESWC 2018, Heraklion, Crete, Greece, June 3–7, 2018, 2018, pp. 145–160. doi:10.1007/978-3-319-93417-4_10.

17. Dima, Intui2: A prototype system for question answering over linked data, in: Working Notes for CLEF 2013 Conference, Valencia, Spain, September 23–26, 2013, 2013. http://ceur-ws.org/Vol-1179/CLEF2013wn-QALD3-Dima2013.pdf.

18. Dima, Answering natural language questions with Intui3, in: Working Notes for CLEF 2014 Conference, Sheffield, UK, September 15–18, 2014, 2014, pp. 1201–1211, http://ceur-ws.org/Vol-1180/CLEF2014wn-QA-Dima2014.pdf.

19. Dubey, Dasgupta, Sharma, Höffner and Lehmann, AskNow: A framework for natural language query formalization in SPARQL, in: Proceedings of the Semantic Web. Latest Advances and New Domains – 13th International Conference, ESWC 2016, Heraklion, Crete, Greece, May 29–June 2, 2016, 2016, pp. 300–316. doi:10.1007/978-3-319-34129-3_19.

20. J.D. Fernández, M.A. Martínez-Prieto, Gutiérrez, Polleres and Arias, Binary RDF representation for publication and exchange (HDT), J. Web Sem. 19 (2013), 22–41. doi:10.1016/j.websem.2013.01.002.

21. Ó. Ferrández, Spurk, Kouylekov, Dornescu, Ferrández, Negri, Izquierdo, Tomás, Orasan, Neumann, Magnini and J.L.V. González, The QALL-ME framework: A specifiable-domain multilingual question answering architecture, J. Web Sem. 9(2) (2011), 137–145. doi:10.1016/j.websem.2011.01.002.

22. Gerber and A.-C.N. Ngomo, Bootstrapping the linked data web, in: 1st Workshop on Web Scale Knowledge Extraction@ ISWC, Vol. 2011, 2011.

23. Gerbil, 2018, http://gerbil-qa.aksw.org/gerbil/experiment?id=201809180002.

24. Gerbil, 2018, http://gerbil-qa.aksw.org/gerbil/experiment?id=201809180004.

25. Gerbil, 2018, http://gerbil-qa.aksw.org/gerbil/experiment?id=201809180003.

26. Gerbil, 2018, http://gerbil-qa.aksw.org/gerbil/experiment?id=201809180004.

27. Gerbil, 2018, http://gerbil-qa.aksw.org/gerbil/experiment?id=201710220000.

28. Giannone, Bellomaria and Basili, A HMM-based approach to question answering against linked data, in: Working Notes for CLEF 2013 Conference, Valencia, Spain, September 23–26, 2013, 2013, pp. 23–26, http://ceur-ws.org/Vol-1179/CLEF2013wn-QALD3-GiannoneEt2013.pdf.

29. J.M. Giménez-García, Zimmermann and Maret, NdFluents: An ontology for annotated statements with inference preservation, in: Proceedings of the Semantic Web – 14th International Conference, ESWC 2017, Part I, Portorož, Slovenia, May 28–June 1, 2017, 2017, pp. 638–654. doi:10.1007/978-3-319-58068-5_39.

30. Golub and He, Character-Level Question Answering with Attention. CoRR, 2016. arXiv:1604.00727, http://arxiv.org/abs/1604.00727.

31. Hakimov, Jebbara and Cimiano, AMUSE: Multilingual Semantic Parsing for Question Answering over Linked Data. CoRR, 2018, abs/1802.09296 (2018), http://arxiv.org/abs/1802.09296, arXiv:1802.09296.

32. Hakimov, Unger, Walter and Cimiano, Applying semantic parsing to question answering over linked data: Addressing the lexical gap, in: Proceedings of Natural Language Processing and Information Systems – 20th International Conference on Applications of Natural Language to Information Systems, NLDB 2015, Passau, Germany, June 17–19, 2015, 2015, pp. 103–109. doi:10.1007/978-3-319-19581-0_8.

33. He, Zhang, Liu and Zhao, CASIA@V2: A MLN-based question answering system over linked data, in: Working Notes for CLEF 2014 Conference, Sheffield, UK, September 15–18, 2014, 2014, pp. 1249–1259. http://ceur-ws.org/Vol-1180/CLEF2014wn-QA-ShizhuEt2014.pdf.

34. Hu, Zou, J.X. Yu, Wang and Zhao, Answering natural language questions by subgraph matching over knowledge graphs, IEEE Trans. Knowl. Data Eng. 30(5) (2018), 824–837. doi:10.1109/TKDE.2017.2766634.

35. Jain, Question answering over knowledge base using factual memory networks, in: Proceedings of the Student Research Workshop, SRW@HLT-NAACL 2016, the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego California, USA, June 12–17, 2016, 2016, pp. 109–115, http://aclweb.org/anthology/N/N16/N16-2016.pdf.

36. Kaffee, Piscopo, Vougiouklis, Simperl, Carr and Pintscher, A glimpse into babel: An analysis of multilinguality in wikidata, in: Proceedings of the 13th International Symposium on Open Collaboration, OpenSym 2017, Galway, Ireland, August 23–25, 2017, 2017, pp. 14:1–14:5. doi:10.1145/3125433.3125465.

37. López, Tommasi, Kotoulas and Wu, QuerioDALI: Question answering over dynamic and linked knowledge graphs, in: Proceedings of the Semantic Web – ISWC 2016 – 15th International Semantic Web Conference, Part II, Kobe, Japan, October 17–21, 2016, 2016, pp. 363–382. doi:10.1007/978-3-319-46547-0_32.

38. López, V.S. Uren, Sabou and Motta, Is question answering fit for the semantic web?: A survey, Semantic Web 2(2) (2011), 125–155. doi:10.3233/SW-2011-0041.

39. Lukovnikov, Fischer, Lehmann and Auer, Neural network-based question answering over knowledge graphs on word and character level, in: Proceedings of the 26th International Conference on World Wide Web, WWW 2017, Perth, Australia, April 3–7, 2017, 2017, pp. 1211–1220. doi:10.1145/3038912.3052675.

40.

Marx ,

Usbeck ,

A.-C.N.

Ngomo ,

Höffner ,

Lehmann and

Auer , Towards an open question answering architecture, in: Proceedings of the 10th International Conference on Semantic Systems, SEMANTICS 2014, Leipzig, Germany, September 4–5, 2014, 2014, pp. 57–60. doi: 10.1145/2660517.2660519.

41.

Metzler and

W.B.

Croft , Linear feature-based models for information retrieval, Inf. Retr.10(3) (2007), 257–274. doi:10.1007/s10791-006-9019-z.

42.

Nakatani , Language Detection Library for Java, 2010, https://github.com/shuyo/language-detection.

43.

Napolitano ,

Usbeck and

A.-C.N.

Ngomo , The scalable question answering over linked data (SQA) challenge 2018, in: Semantic Web Challenges – 5th SemWebEval Challenge at ESWC 2018, Heraklion, Greece, June 3–7, 2018, 2018, pp. 69–75, Revised Selected Papers. doi:10.1007/978-3-030-00072-1_6.

44. Park, Shim and G.G. Lee, ISOFT at QALD-4: Semantic similarity-based question answering system over linked data, in: Working Notes for CLEF 2014 Conference, Sheffield, UK, September 15–18, 2014, 2014, pp. 1236–1248. http://ceur-ws.org/Vol-1180/CLEF2014wn-QA-ParkEt2014.pdf.

45. Pradel, Haemmerlé and Hernandez, A semantic web interface using patterns: The SWIP system, in: Graph Structures for Knowledge Representation and Reasoning, Springer, 2012.

46. Radoev, Zouaq, Tremblay and Gagnon, A language adaptive method for question answering on French and English, in: Semantic Web Challenges – 5th SemWebEval Challenge at ESWC 2018, Heraklion, Greece, June 3–7, 2018, 2018, pp. 98–113, Revised Selected Papers. doi:10.1007/978-3-030-00072-1_9.

47. Ruseti, Mirea, Rebedea and Trausan-Matu, QAnswer – enhanced entity matching for question answering over linked data, in: Working Notes of CLEF 2015 – Conference and Labs of the Evaluation Forum, Toulouse, France, September 8–11, 2015, 2015. http://ceur-ws.org/Vol-1391/99-CR.pdf.

48. Shekarpour, Marx, A.-C.N. Ngomo and Auer, SINA: Semantic interpretation of user queries for question answering on interlinked data, J. Web Sem. 30 (2015), 39–51. doi:10.1016/j.websem.2014.06.002.

49. Singh, Both, Diefenbach and Shekarpour, Towards a message-driven vocabulary for promoting the interoperability of question answering systems, in: Tenth IEEE International Conference on Semantic Computing, ICSC 2016, Laguna Hills, CA, USA, February 4–6, 2016, 2016, pp. 386–389. doi:10.1109/ICSC.2016.59.

50. Sorokin and Gurevych, End-to-end representation learning for question answering with weak supervision, in: Semantic Web Challenges – 4th SemWebEval Challenge at ESWC 2017, Portoroz, Slovenia, May 28–June 1, 2017, 2017, pp. 70–83, Revised Selected Papers. doi:10.1007/978-3-319-69146-6_7.

51. T.P. Tanon, M.D. de Assunção, Caron and F.M. Suchanek, Demoing Platypus – A multilingual question answering platform for Wikidata, in: The Semantic Web: ESWC 2018 Satellite Events – ESWC 2018 Satellite Events, Heraklion, Crete, Greece, June 3–7, 2018, 2018, pp. 111–116, Revised Selected Papers. doi:10.1007/978-3-319-98192-5_21.

52. Trivedi, Maheshwari, Dubey and Lehmann, LC-QuAD: A corpus for complex question answering over knowledge graphs, in: Proceedings of the Semantic Web – ISWC 2017 – 16th International Semantic Web Conference, Part II, Vienna, Austria, October 21–25, 2017, 2017, pp. 210–218. doi:10.1007/978-3-319-68204-4_22.

53. Unger, Forascu, López, A.-C.N. Ngomo, Cabrio, Cimiano and Walter, Question answering over linked data (QALD-4), in: Working Notes for CLEF 2014 Conference, Sheffield, UK, September 15–18, 2014, 2014, pp. 1172–1180. http://ceur-ws.org/Vol-1180/CLEF2014wn-QA-UngerEt2014.pdf.

54. Unger, Forascu, López, A.-C.N. Ngomo, Cabrio, Cimiano and Walter, Question answering over linked data (QALD-5), in: Working Notes for CLEF 2015 Conference, 2015.

55. Unger, A.-C.N. Ngomo and Cabrio, 6th open challenge on question answering over linked data (QALD-6), in: Semantic Web Challenges – Third SemWebEval Challenge at ESWC 2016, Heraklion, Crete, Greece, May 29–June 2, 2016, 2016, pp. 171–177, Revised Selected Papers. doi:10.1007/978-3-319-46565-4_13.

56. Usbeck, A.-C.N. Ngomo, Conrads, Röder and Napolitano, 8th challenge on question answering over linked data (QALD-8) (invited paper), in: Joint Proceedings of the 4th Workshop on Semantic Deep Learning (SemDeep-4) and NLIWoD4: Natural Language Interfaces for the Web of Data (NLIWOD-4) and 9th Question Answering over Linked Data Challenge (QALD-9) Co-Located with 17th International Semantic Web Conference (ISWC 2018), Monterey, California, USA, October 8th–9th, 2018, 2018, pp. 51–57. http://ceur-ws.org/Vol-2241/paper-05.pdf.

57. Usbeck, A.-C.N. Ngomo, Haarmann, Krithara, Röder and Napolitano, 7th open challenge on question answering over linked data (QALD-7), in: Semantic Web Challenges – 4th SemWebEval Challenge at ESWC 2017, Portoroz, Slovenia, May 28–June 1, 2017, 2017, pp. 59–69, Revised Selected Papers. doi:10.1007/978-3-319-69146-6_6.

58. Usbeck, Röder, Hoffmann, Conrads, Huthmann, A.-C.N. Ngomo, Demmler and Unger, Benchmarking question answering systems, Semantic Web Journal 2016 (2016).

59. A.P.B. Veyseh, Cross-lingual question answering using common semantic space, in: Proceedings of TextGraphs@NAACL-HLT 2016: The 10th Workshop on Graph-Based Methods for Natural Language Processing, San Diego, California, USA, June 17, 2016, 2016, pp. 15–19. http://aclweb.org/anthology/W/W16/W16-1403.pdf.

60. Xu, Feng and Zhao, Answering natural language questions via phrasal semantic parsing, in: Working Notes for CLEF 2014 Conference, Sheffield, UK, September 15–18, 2014, 2014, pp. 1260–1274. http://ceur-ws.org/Vol-1180/CLEF2014wn-QA-XuEt2014.pdf.

61. Yahya, Berberich, Elbassuoni, Ramanath, Tresp and Weikum, Natural language questions for the web of data, in: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL 2012, Jeju Island, Korea, July 12–14, 2012, 2012, pp. 379–390. http://www.aclweb.org/anthology/D12-1035.

62. Yin, W.X. Zhao and Li, Type-aware question answering over knowledge base with attention-based tree-structured neural networks, J. Comput. Sci. Technol. 32(4) (2017), 805–813. doi:10.1007/s11390-017-1761-8.

63. Yin, Yu, Xiang, Zhou and Schütze, Simple question answering by attentive convolutional neural network, in: COLING 2016, 26th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, Osaka, Japan, December 11–16, 2016, 2016, pp. 1746–1756. http://aclweb.org/anthology/C/C16/C16-1164.pdf.

64. Zhu, Ren, Liu, Wang, Tian and Yu, A graph traversal based approach to answer non-aggregation questions over DBpedia, in: Semantic Technology – 5th Joint International Conference, JIST 2015, Yichang, China, November 11–13, 2015, 2015, pp. 219–234, Revised Selected Papers. doi:10.1007/978-3-319-31676-5_16.

65. Zimina, Nummenmaa, Järvelin, Peltonen, Stefanidis and Hyyrö, GQA: Grammatical question answering for RDF data, in: Semantic Web Challenges – 5th SemWebEval Challenge at ESWC 2018, Heraklion, Greece, June 3–7, 2018, 2018, pp. 82–97, Revised Selected Papers. doi:10.1007/978-3-030-00072-1_8.

66. Zou, Huang, Wang, J.X. Yu, He and Zhao, Natural language question answering over RDF: A graph data driven approach, in: International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, June 22–27, 2014, 2014, pp. 313–324. doi:10.1145/2588555.2610525.

Towards a question answering system over the Semantic Web

