Abstract
The necessity of making the Semantic Web more accessible for lay users, together with the uptake of interactive systems and smart assistants for the Web, has spawned a new generation of RDF-based question answering systems. However, the fair evaluation of these systems remains a challenge due to the different types of answers they provide. Hence, reproducing published experiments or even benchmarking on the same datasets remains a complex and time-consuming task.
We present a novel online benchmarking platform for question answering (QA) that relies on the FAIR principles to support the fine-grained evaluation of question answering systems. We detail how the platform enables the fair benchmarking of question answering systems through the rewriting of URIs and URLs. In addition, we provide different evaluation metrics, datasets and pre-implemented systems, as well as support for novel formats for the interactive and non-interactive benchmarking of question answering systems. Our analysis shows that most current frameworks are tailored towards particular datasets and challenges and do not provide generic models. Moreover, while most frameworks perform well in the annotation of entities and properties, the generation of SPARQL queries from annotated text remains a challenge.
Introduction
The Web of Data has grown to contain billions of facts pertaining to a large variety of domains. While this wealth of data can be easily accessed by experts, it remains difficult to use for non-experts [7,43]. This need has led to the development of a large number of question answering (QA) and keyword search tools for the Web of Data [37–40,44]. As benchmarking has been credited with accelerating research, many campaigns and challenges (e.g., Question Answering on Linked Data [38–40], BioASQ [37]) have evolved around the QA research field (see Section 2) since the advent of the first question answering system [17]. A significant improvement in the F-measure of question answering frameworks has been achieved in recent years, an increase which is partly due to the existence of such campaigns [20]. However, evaluation datasets, measures and QA system processes are hardly documented. In addition, the few existing testbeds are commonly tailored towards a particular challenge and cannot be used universally. Hence, there is no overview of the performance of frameworks outside of the challenges, making the evaluation of (1) the state of the art and (2) the weaknesses and strengths of existing systems tedious if not impossible.
Motivated by the more than 17,000 experiments that have already been run on GERBIL [45] and the improvement of named entity recognition (NER) and entity linking (EL) systems by over 12% F-measure since the deployment of GERBIL, we address the aforementioned drawbacks by presenting a novel benchmarking platform for question answering systems dubbed GERBIL QA.

Overview of the Question Answering Benchmarking platform based on the GERBIL core.
GERBIL QA follows the FAIR principles (findability, accessibility, interoperability and reusability) [49].
Our approach differs from the state of the art (including GERBIL) and addresses the following drawbacks of existing challenges and systems for evaluating question answering:
Datasets: Current evaluation campaigns and challenges offer a dataset (mostly by mere reference) and a set of questions without any extensibility. We address this drawback by allowing for the user-driven addition of datasets.
Reference implementations: Developing QA systems by objectively assessing the weaknesses of one’s own system in comparison to existing solutions was previously nearly impossible. With GERBIL QA, users can continuously benchmark their systems against the solutions included in the platform.
Evaluation: The fair evaluation of knowledge-base-driven QA systems across the different URIs and URLs (e.g., Wikipedia vs. Freebase) used to refer to the same real-world object had not been investigated before this project; it is now an integral part of GERBIL QA.
To address these challenges, GERBIL QA provides the following novel contributions:
We offer 8 metrics for benchmarking QA systems as well as 6 novel QA (sub-)experiment types to (1) allow for a fine-grained evaluation of QA systems and (2) improve the diagnostic process. While we reuse the existing GERBIL core, we provide novel matching and metric calculations for QA, since existing evaluation platforms do not offer this functionality. We integrate 6 existing QA systems into the platform and provide an unprecedented bundle of 22 question answering datasets (QALD-1 to QALD-6 and NLQ) to evaluate these systems. We hence present the first integral comparison of QA systems for Linked Data across challenges. Our framework supports both online systems and file-based evaluation campaigns over a large variety of datasets. That is, we allow system results as well as datasets to be uploaded on the fly, and systems to be registered as webservices. In addition, we support three widely used formats for the interactive communication with QA systems via webservices.
Note that GERBIL QA reuses the mechanisms provided by GERBIL to offer citable, stable experiment URIs and descriptions that are both human- and machine-readable. To this end, GERBIL QA uses the recently proposed DataID [6] ontology, which combines VoID [2] and DCAT [25] metadata with PROV-O [23] provenance information and ODRL [27] licenses to describe datasets.
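To illustrate the kind of machine-readable description this yields, the following minimal sketch builds a small dataset description with Apache Jena. The experiment and dataset URIs are hypothetical, the DataID namespace is assumed, and the actual descriptions produced by GERBIL QA may use further VoID, DCAT and ODRL properties.

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.DCTerms;
import org.apache.jena.vocabulary.RDF;

// Sketch of a DataID/DCAT-style dataset description; URIs are illustrative assumptions.
public class DatasetDescriptionSketch {

    public static void main(String[] args) {
        String DATAID = "http://dataid.dbpedia.org/ns/core#"; // assumed DataID namespace
        String DCAT   = "http://www.w3.org/ns/dcat#";
        String PROV   = "http://www.w3.org/ns/prov#";

        Model model = ModelFactory.createDefaultModel();
        Resource dataset = model.createResource("http://example.org/gerbil-qa/dataset/qald-6-train");
        dataset.addProperty(RDF.type, model.createResource(DATAID + "Dataset"));
        dataset.addProperty(RDF.type, model.createResource(DCAT + "Dataset"));
        dataset.addProperty(DCTerms.title, "QALD-6 train (illustrative)");
        // link the dataset to the (hypothetical) experiment that used it
        dataset.addProperty(model.createProperty(PROV, "wasGeneratedBy"),
                model.createResource("http://example.org/gerbil-qa/experiment/123"));

        model.write(System.out, "TURTLE");
    }
}
```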
A demo of the system is available at
Like in other disciplines, QA researchers and practitioners require reliable test environments and comparison methods to speed up their development and lower entrance barriers. There has thus been a number of challenges and campaigns attracting researchers as well as industry practitioners to QA. Since 1998, the TREC conference, especially its QA track [47], has aimed to provide domain-independent evaluations over large, unstructured corpora. This seminal campaign pushed research projects forward over the course of more than ten instalments. The latest TREC QA track tackles the field of live QA.
Next to that, the BioASQ series [37] challenges semantic indexing as well as QA systems on biomedical data and is currently in its fifth instalment. Here, systems have to work on RDF as well as textual data to return matching triples as well as text snippets. Moreover, the OKBQA initiative fosters the collaborative development of QA systems over knowledge bases.
In the following, we will use the QALD datasets and formats (QALD-XML, QALD-JSON) as a base for our benchmarking suite, since they have been adopted by more than 20 QA systems since 2011 (see [11,20] and Table 2). So far, the yearly QALD events enable participants to upload XML- or JSON-based system answers to previously uploaded files on the QALD website.
Built-in datasets and their features
In contrast to existing challenges and campaigns, our platform allows the use of curated, updated benchmark datasets (e.g., via GitHub) instead of once-uploaded static files, and allows specific experiments to refer to a specific version of a dataset by providing the time and date at which the experiment was executed. This is a major issue when aiming to run benchmarks developed on previous versions of a dataset whose SPARQL endpoint has been updated over the years (e.g., running QALD-3 on the 2016 DBpedia endpoint), as the results achieved differ completely from those specified in the benchmark, with some queries no longer being executable. In addition, GERBIL QA allows the implementation of wrappers for QA systems using REST interfaces so that they can be benchmarked online and in real time (see Section 4).
We refer the interested reader to our dataset project homepage for further details.
In its current version, our framework supports 21 QALD campaign datasets (QALD-1 to QALD-6) as well as the NLQ dataset.
In contrast to the existing benchmarking campaigns, GERBIL QA allows supplementary datasets to be added. Users can (1) add them to the project repository and write a dataset wrapper in Java, or (2) upload a dataset as a file via our Web interface for one particular experiment only. The first option enables other users to benchmark with this dataset and can thus spark the generation of new datasets. The second option allows the benchmarking of not-yet-finished or non-disclosed datasets. In addition to supporting JSON and XML files in the QALD format, GERBIL QA supports an extension dubbed eQALD-JSON, which we developed to address some of the drawbacks of the QALD format.
(See also the project wiki pages for updates about the formats.)
Existing formats lack the possibility to measure a system’s ability to recognize entities, classes or properties. Moreover, QALD-XML and QALD-JSON do not allow systems to be benchmarked with respect to their confidence in the computed answer. The main advantage of eQALD-JSON is thus that it represents the answers of a QA system in a way that supports the full set of benchmark types provided by GERBIL QA by explicating annotations, underlying SPARQL queries and more (see Section 5). In particular, it includes (1) a knowledge base version, (2) questions in multiple languages and equivalent keyword queries, (3) annotations of the question w.r.t. RDF resources and properties, (4) meta-information such as answer type and answer item type, and (5) a schemaless query.
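As an illustration of this structure, the following minimal sketch models one eQALD-JSON entry as a Java class with Jackson annotations. All field names are assumptions derived from the enumeration above and do not constitute the normative eQALD-JSON schema.

```java
import com.fasterxml.jackson.annotation.JsonProperty;
import java.util.List;
import java.util.Map;

// Hypothetical model of a single eQALD-JSON entry; field names are assumptions
// based on the information listed above, not the normative schema.
public class EQaldQuestion {

    @JsonProperty("id")
    public String id;

    // (1) knowledge base and version the gold answers refer to, e.g. "DBpedia 2016-04"
    @JsonProperty("knowledgebase")
    public String knowledgeBase;

    // (2) the question in multiple languages plus equivalent keyword queries
    @JsonProperty("question")
    public Map<String, String> questionByLanguage;
    @JsonProperty("keywords")
    public Map<String, String> keywordsByLanguage;

    // (3) annotations of the question w.r.t. RDF resources and properties
    @JsonProperty("annotations")
    public List<Annotation> annotations;

    // (4) meta-information such as answer type and answer item type
    @JsonProperty("answertype")
    public String answerType;
    @JsonProperty("answeritemtype")
    public String answerItemType;

    // (5) a schemaless (pseudo) query plus the underlying SPARQL query
    @JsonProperty("pseudoquery")
    public String pseudoQuery;
    @JsonProperty("sparql")
    public String sparqlQuery;

    // answers, optionally with the system's confidence when used as a response format
    @JsonProperty("answers")
    public List<String> answers;
    @JsonProperty("confidence")
    public Double confidence;

    public static class Annotation {
        @JsonProperty("surfaceform") public String surfaceForm;
        @JsonProperty("uri")         public String uri; // linked RDF resource or property
    }
}
```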
Systems that participated in past QALD challenges. Note that having a publication is optional for QALD. U means unreliable webservice, N not yet implemented due to a non-open API, M human interaction needed.
Table 2 shows that many systems from previous challenges and campaigns do not offer webservices, which makes it difficult to benchmark them with novel datasets. Some offer webservice interfaces, but these are either not open or demand human input. Other systems do not provide comprehensive answerset representations, e.g., they return whole paragraphs containing the answer to a question instead of a Linked Data URI. Yet other systems that participated in past challenges left no trace that would allow a fine-grained quality assessment, as they provide neither publications nor webservices. Thus, the first release of GERBIL QA contains only 6 implemented system webservice clients, which are capable of answering hybrid, multilingual questions or keyword queries. These systems are:
HAWK [42], the first hybrid-source QA system, which processes RDF as well as textual information to answer a single input query. HAWK uses a mix of computational linguistics and semantic annotations to build SPARQL queries.
SINA [34], a keyword and natural language query search engine which exploits the structure of RDF graphs to implement an explorative search approach. The system uses Hidden Markov Models to choose the correct datasets to query within its SPARQL generation process.
YodaQA [3], a modular, open-source, hybrid approach built on top of the Apache UIMA framework.
QAKiS [7], a language-agnostic QA system grounded in ontology-relation matches. Here, the relation matches are based on surface forms extracted from Wikipedia so as to cover a wide variety of contexts. QAKiS matches only one relation per query and relies on basic heuristics that do not account for the full variety of natural language.
QANARY [5], which strives to reuse as many existing components as possible to enable best-of-breed QA systems, following a new methodology for combining pre-existing modules. QANARY is thus both a rapid development environment for new QA systems and a QA system in itself.
OKBQA [22], which was recently introduced by Kim et al. to likewise facilitate a strong collaboration among experts. The Open Knowledge Base Question Answering framework thus supports the development of new QA systems in a collaborative and intuitive way.
Currently, GERBIL QA supports the addition of three types of systems: (1) services implemented as Java-based wrappers (see above), (2) services configured via the Web interface as webservices, or (3) file uploads. Option (2) demands responses as either QALD-JSON or eQALD-JSON, while option (3) supports QALD-XML files as well. For option (1), we implemented the 4 systems that were available as webservices and returned Linked Data. We tested option (2) using the recent QANARY [35] framework. Option (3) was tested with the QALD-6 data and will be used for the 7th instalment of the QALD challenge. This option enables developers to benchmark their system without setting up a webservice endpoint under a public address. As in the main GERBIL platform, experiments and log files remain private until published, i.e., companies and interested parties can test their systems online without fearing premature publication.
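The sketch below illustrates the kind of HTTP interaction a system registered via option (2) has to support: the benchmark sends a question and expects an (e)QALD-JSON answerset in return. The endpoint URL, the example question and the parameter names "query" and "lang" are assumptions for illustration only.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Minimal sketch of querying a QA system registered as a webservice (option 2).
// Endpoint and parameter names are illustrative assumptions.
public class WebserviceClientSketch {

    public static void main(String[] args) throws Exception {
        String endpoint = "http://example.org/my-qa-system/ask"; // hypothetical endpoint
        String body = "query=" + URLEncoder.encode("What is the capital of Germany?", StandardCharsets.UTF_8)
                    + "&lang=en";

        HttpRequest request = HttpRequest.newBuilder(URI.create(endpoint))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // The benchmark would now parse the (e)QALD-JSON answerset and match it
        // against the gold standard; here we only print the raw response.
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```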
In this section, we explain the different experiment types used to evaluate a QA system, as well as how the evaluation metrics are computed and how system answers are compared. Throughout this section, we use the question “Who are the children of Ann Dunham?” as a running example.
Experiment types
GERBIL QA allows the performance of common components of QA systems (named entity recognition, entity linking, etc.) to be measured in addition to the benchmarking of whole QA systems. We use the term sub-experiments to denote experiments for benchmarking such sub-components. We designed and implemented 5 sub-experiments inspired by past evaluation campaigns, following the motivation to measure sub-components suggested by Both et al. [5]. For the sub-experiments P2KB and RE2KB we further argue that, in most QA systems, two different components are responsible for linking resources and properties, and thus these features must be evaluated independently, in line with a recent study [33]. The goal is to provide system designers, researchers and decision makers with the opportunity to spot particular flaws in a QA pipeline and to gain in-depth insights into the performance of systems on different aspects of diverse datasets.
The data necessary to carry out all five of these sub-experiments can be provided via eQALD-JSON. For four of the five new sub-experiments, the needed data can also be derived from the SPARQL query that may be returned by the QA system via QALD-XML or QALD-JSON (see Table 3).
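The sketch below shows, for the running example, how gold resources and properties for such sub-experiments could be derived from a SPARQL query using Apache Jena. The query shown is an illustrative DBpedia query for the running example, and the actual derivation logic in GERBIL QA may differ.

```java
import org.apache.jena.graph.Node;
import org.apache.jena.query.Query;
import org.apache.jena.query.QueryFactory;
import org.apache.jena.sparql.core.TriplePath;
import org.apache.jena.sparql.syntax.ElementPathBlock;
import org.apache.jena.sparql.syntax.ElementVisitorBase;
import org.apache.jena.sparql.syntax.ElementWalker;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

// Sketch: deriving gold resources and properties for sub-experiments from a SPARQL query.
public class SparqlGoldExtractor {

    public static void main(String[] args) {
        String sparql = "PREFIX dbo: <http://dbpedia.org/ontology/> "
                      + "PREFIX dbr: <http://dbpedia.org/resource/> "
                      + "SELECT DISTINCT ?child WHERE { dbr:Ann_Dunham dbo:child ?child . }";

        Query query = QueryFactory.create(sparql);
        Set<String> resources = new HashSet<>();   // gold entities, e.g. dbr:Ann_Dunham
        Set<String> properties = new HashSet<>();  // gold properties, e.g. dbo:child

        ElementWalker.walk(query.getQueryPattern(), new ElementVisitorBase() {
            @Override
            public void visit(ElementPathBlock block) {
                for (Iterator<TriplePath> it = block.patternElts(); it.hasNext(); ) {
                    TriplePath tp = it.next();
                    collect(tp.getSubject(), resources);
                    collect(tp.getPredicate(), properties); // null for complex property paths
                    collect(tp.getObject(), resources);
                }
            }
        });

        System.out.println("Resources:  " + resources);
        System.out.println("Properties: " + properties);
    }

    private static void collect(Node node, Set<String> target) {
        if (node != null && node.isURI()) {
            target.add(node.getURI());
        }
    }
}
```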
Availability of sub-experiments if the data is in the QALD format without a SPARQL query, in the QALD format including a SPARQL query (i.S.q.), and in the eQALD-JSON format.
The prefixes used in the following examples refer to the DBpedia namespaces, e.g., dbr: stands for http://dbpedia.org/resource/ and dbo: for http://dbpedia.org/ontology/.
Note that if a different answerset is returned, our matching algorithm tries to align the answers (see Section 5.3).
GERBIL QA implements 9 evaluation metrics: micro- as well as macro-averaged precision, recall and F-measure, a QALD-specific macro F1 metric, as well as the runtime and the number of errors of webservice-based QA systems [45]. As a reminder, the F1-score is the harmonic mean of precision and recall, i.e., F1 = 2 · precision · recall / (precision + recall).
For the micro precision, recall and F-measure, we first collect all true positives, false positives and false negatives over all questions and only compute the metrics at the end, on these global counts. Thus, this measure gives more weight to questions with many answers.
For the macro metrics, we calculate precision, recall and F-measure per question and average each of these metrics individually at the end. Thus, this measure puts more weight on whether a system can answer all questions correctly. Note that the macro F-measure does not necessarily lie between the macro precision and the macro recall.
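With $TP_q$, $FP_q$ and $FN_q$ denoting the true positives, false positives and false negatives for question $q$ in the question set $Q$, the two averaging schemes correspond to the standard definitions:

$$
P_{\text{micro}} = \frac{\sum_{q \in Q} TP_q}{\sum_{q \in Q} (TP_q + FP_q)}, \qquad
R_{\text{micro}} = \frac{\sum_{q \in Q} TP_q}{\sum_{q \in Q} (TP_q + FN_q)}, \qquad
F_{1,\text{micro}} = \frac{2 \, P_{\text{micro}} \, R_{\text{micro}}}{P_{\text{micro}} + R_{\text{micro}}}
$$

$$
P_{\text{macro}} = \frac{1}{|Q|} \sum_{q \in Q} P_q, \qquad
R_{\text{macro}} = \frac{1}{|Q|} \sum_{q \in Q} R_q, \qquad
F_{1,\text{macro}} = \frac{1}{|Q|} \sum_{q \in Q} F_{1,q}
$$

The last equation also explains the note above: the macro F-measure averages the per-question F-measures rather than being computed from the macro precision and recall, and hence need not lie between them.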
The metrics handle the following special cases:
If the gold answerset is empty and the system responds with an empty answerset, we set precision, recall and F-measure to 1.
If the gold answerset is empty but the system responds with a non-empty answerset, we set precision, recall and F-measure to 0.
If there is a gold answer but the QA system responds with an empty answerset, we assume the system could not answer. Thus, we set precision, recall and F-measure to 0.
In all other cases, we calculate the standard precision, recall and F-measure per question.
For the macro F1 QALD metric, we decided to provide a metric that is more comparable to older QALD challenges and also follows community requests.
If the gold answerset is not empty but the QA system responds with an empty answerset, it is assumed that the system determined that it cannot answer the question. Here, we set the precision to 1 and the recall and F-measure to 0.
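A minimal sketch of these per-question rules, including the QALD-style handling of empty system answers, is given below. It is an illustration of the rules above, not GERBIL QA's actual implementation.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the per-question precision/recall/F-measure rules described above.
public class PerQuestionMetrics {

    public record Scores(double precision, double recall, double f1) {}

    public static Scores evaluate(Set<String> gold, Set<String> system, boolean qaldVariant) {
        if (gold.isEmpty()) {
            // empty gold answerset: 1 if the system also answers with an empty set, else 0
            double v = system.isEmpty() ? 1.0 : 0.0;
            return new Scores(v, v, v);
        }
        if (system.isEmpty()) {
            // non-empty gold but empty system answer: precision 0 (standard) or 1 (QALD variant),
            // recall and F-measure 0 in both cases
            return new Scores(qaldVariant ? 1.0 : 0.0, 0.0, 0.0);
        }
        Set<String> truePositives = new HashSet<>(system);
        truePositives.retainAll(gold);
        double precision = (double) truePositives.size() / system.size();
        double recall = (double) truePositives.size() / gold.size();
        double f1 = (precision + recall) == 0 ? 0.0
                : 2 * precision * recall / (precision + recall);
        return new Scores(precision, recall, f1);
    }

    public static void main(String[] args) {
        Set<String> gold = Set.of("dbr:Barack_Obama", "dbr:Maya_Soetoro-Ng");
        Set<String> system = Set.of("dbr:Barack_Obama");
        System.out.println(evaluate(gold, system, false)); // precision 1.0, recall 0.5, F1 ~0.67
    }
}
```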
In addition, GERBIL QA allows the implementation of further metrics [30]. For instance, a hierarchical F-measure could be used for the AIT2KB sub-experiment [31].
A general problem when benchmarking current QA systems is the heterogeneity of their answerset formats. The example question might be answered with the resources listed above or with the names “Maya Soetoro-Ng” and “Barack Obama”.
Our approach chooses a matching strategy based on the type of response that is expected by the benchmark.
If the gold standard answerset asks for a list of resources, as in the running example above, our approach can handle two types of system answersets. First, if the QA system returns an RDF resource, GERBIL QA relies on the transitive closure of resource URIs that are connected by owl:sameAs links.
Second, if the answer to a question demanding a resource is a plain string, GERBIL QA tries to interpret it as the label of a resource. However, a returned label like “Barack Obama” might be shared by several resources. In this case, all such resources are retrieved and used as input for the resource-based strategy described above. Note that this might decrease the precision of the system, since not all retrieved resources sharing the given label match the expected answers.
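A sketch of this label-based fallback is shown below, using Apache Jena to resolve a plain-string answer to all resources carrying that label before handing them to the resource-based matching. The endpoint and the restriction to English labels are assumptions for illustration.

```java
import org.apache.jena.query.ParameterizedSparqlString;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.ResultSet;
import java.util.HashSet;
import java.util.Set;

// Sketch of the label-based fallback: resolve a string answer to candidate resources.
public class LabelMatchingSketch {

    public static Set<String> resourcesForLabel(String label) {
        ParameterizedSparqlString pss = new ParameterizedSparqlString(
                "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> "
              + "SELECT DISTINCT ?resource WHERE { ?resource rdfs:label ?label . }");
        pss.setLiteral("label", label, "en"); // assumes English labels

        Set<String> resources = new HashSet<>();
        try (QueryExecution qexec = QueryExecutionFactory
                .sparqlService("https://dbpedia.org/sparql", pss.asQuery())) {
            ResultSet results = qexec.execSelect();
            while (results.hasNext()) {
                resources.add(results.next().getResource("resource").getURI());
            }
        }
        return resources;
    }

    public static void main(String[] args) {
        // Several resources may share the label, which can lower the measured precision.
        System.out.println(resourcesForLabel("Barack Obama"));
    }
}
```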
Another problem is that over time the correct answer to a question can change, and QA systems should provide the most recent answers. A question referring to the current president of the USA would generate different results today than in 2015. To address this problem, GERBIL QA provides a way to update the answers of older QA datasets: if present, the SPARQL query in the QALD file can be executed against a specific, configurable knowledge base to retrieve the latest answers. The main drawback of this feature is that the resulting answerset is not manually curated and may contain unchecked results.
Strings, dates and numbers are currently matched by exact string matching. In the future, we will extend this with more sophisticated matching strategies such as lexical mapping, e.g., towards XSD datatypes.
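The following short sketch contrasts the current exact string matching with the kind of datatype-aware comparison envisioned for the future. It is purely illustrative and not part of the platform.

```java
import java.math.BigDecimal;
import java.time.LocalDate;

// Illustration: exact string matching vs. datatype-aware (XSD-style) comparison of literals.
public class LiteralMatchingSketch {

    // current behaviour: literals only match if their lexical forms are identical
    static boolean exactMatch(String gold, String system) {
        return gold.equals(system);
    }

    // possible future behaviour: compare values after mapping to a numeric datatype
    static boolean numericMatch(String gold, String system) {
        return new BigDecimal(gold).compareTo(new BigDecimal(system)) == 0;
    }

    // possible future behaviour: compare values after mapping to a date datatype
    static boolean dateMatch(String gold, String system) {
        return LocalDate.parse(gold).equals(LocalDate.parse(system));
    }

    public static void main(String[] args) {
        System.out.println(exactMatch("42", "42.0"));               // false
        System.out.println(numericMatch("42", "42.0"));             // true
        System.out.println(dateMatch("2008-11-04", "2008-11-04"));  // true
    }
}
```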
Besides these result-focused metrics, our method measures the performance of live systems in two ways. First, it computes the average time a system needs to generate a response. Second, it counts the number of errors returned by the system, or that occur during communication with the system.
Diagnostics
The implemented sub-experiments lead to detailed insights about a system’s performance. We created an example experiment to illustrate these diagnostic capabilities.
To foster an open community of QA researchers, we need a reliable platform for managing experimental data in a citable and comparable way, both readable for humans and machines. Thus, we published the GERBIL QA platform under the permanent ID
We presented this platform as a prototype to the W3C community group on Natural Language Interfaces for the Web of Data.
We are seeing tremendous interest in the platform even though it has not yet been published at any conference or in any journal. GERBIL QA has already been used for 85 experiments comprising more than 940 sub-experiment executions. For example, the developers of HAWK use the system to measure the performance of different configurations via file uploads.
We present the first online benchmarking system for question answering approaches over factoid questions. Our platform strives to speed up the development process by offering diverse datasets, systems and interfaces to generate repeatable and citable experiments with in-depth analytics of a system’s performance. A known limitation is our focus on RDF-based systems (RDF resource matching, and the need for a SPARQL query to run the sub-experiments), which we seek to overcome in the future by standardizing how interfaces communicate the needed information, e.g., by demanding a SPARQL query within the result set.
In near-future developments, we will add further metrics such as the hierarchical F-measure, novel datasets (e.g., LC-QuAD [36], currently the largest QA benchmark dataset, or the Wikidata-based dataset [12] presented at the 3rd NLIWOD workshop) and more systems. Moreover, we will unify the method of matching system answers with gold standard answers, thus fostering a fast-paced, open science movement. We will look into evaluation campaigns such as TREC LiveQA and CLEF to broaden our scope and also include non-factoid QA. To this end, we need to enable hybrid, crowd-based evaluations within the workflow of the existing automatic evaluation. Furthermore, we will add this benchmarking platform to the H2020 HOBBIT project.
Acknowledgements
The authors gratefully acknowledge financial support from the German Federal Ministry of Education and Research within Eurostars, a joint programme of EUREKA and the European Community, under the projects E!9367 DIESEL and E!9725 QAMEL, as well as from the European Union’s H2020 research and innovation action HOBBIT (GA 688227). We thank the QANARY team for inspiring discussions. Furthermore, we want to thank Jin-Dong Kim for his thoughts on the novel QA format. We also acknowledge support by the BMVI projects LIMBO (project no. 19F2029C) and OPAL (project no. 19F20284) as well as by the German Federal Ministry of Education and Research (BMBF) within ‘KMU-innovativ: Forschung für die zivile Sicherheit’ and the project SOLIDE (no. 13N14456).
