Abstract
The necessity of making the Semantic Web more accessible for lay users, together with the uptake of interactive systems and smart assistants for the Web, has spawned a new generation of RDF-based question answering systems. However, the fair evaluation of these systems remains a challenge due to the different types of answers they provide. Hence, reproducing published experiments or even benchmarking on the same datasets remains a complex and time-consuming task.
We present a novel online benchmarking platform for question answering (QA) that relies on the FAIR principles to support the fine-grained evaluation of question answering systems. We detail how the platform enables the fair benchmarking of question answering systems through the rewriting of URIs and URLs. In addition, we provide different evaluation metrics, datasets and pre-implemented systems, as well as support for novel formats for the interactive and non-interactive benchmarking of question answering systems. Our analysis shows that most current frameworks are tailored towards particular datasets and challenges and do not provide generic models. Moreover, while most frameworks perform well in the annotation of entities and properties, the generation of SPARQL queries from annotated text remains a challenge.
Introduction
The Web of Data has grown to contain billions of facts pertaining to a large variety of domains. While this wealth of data can be easily accessed by experts, it remains difficult to use for non-experts [7,43]. This need has led to the development of a large number of question answering (QA) and keyword search tools for the Web of Data [37–40,44]. As benchmarking has been credited with accelerating research, many campaigns and challenges (e.g., Question Answering on Linked Data [38–40], BioASQ [37]) have evolved around the QA research field (see Section 2) since the advent of the first question answering system [17]. A significant improvement in the F-measure of question answering frameworks has been achieved in recent years, an increase which is partly due to the existence of such campaigns [20]. However, evaluation datasets, measures and QA system processes are hardly documented. In addition, the few existing testbeds are commonly tailored towards a particular challenge and cannot be used universally. Hence, there is no overview of the performance of frameworks outside of the challenges, making the evaluation of (1) the state of the art and (2) the weaknesses and strengths of existing systems tedious if not impossible.
Motivated by the more than 17,000 experiments that have already been run on GERBIL [45] and the improvement of named entity recognition (NER) and entity linking (EL) systems by over 12% F-measure since the deployment of GERBIL, we address the aforementioned drawbacks by presenting a novel benchmarking platform for question answering systems dubbed GERBIL QA.

Overview of the Question Answering Benchmarking platform based on the GERBIL core.
GERBIL QA follows the FAIR principles (findability, accessibility, interoperability and reusability) [49].
Our approach differs from the state of the art (including GERBIL) and addresses the following drawbacks of existing challenges and systems for evaluating question answering:
Datasets: Current evaluation campaigns and challenges offer a dataset (mostly by mere reference) and a set of questions without any extensibility. We address this drawback by allowing for the user-driven addition of datasets.
Reference implementations: Developing QA systems by objectively assessing the weaknesses of one’s own system in comparison to existing solutions was previously nearly impossible. With GERBIL QA, users can continuously benchmark their systems against the solutions included in the platform.
Evaluation: The fair evaluation of knowledge-base-driven QA systems across the different URIs and URLs (e.g., Wikipedia vs. Freebase) used to refer to the same real-world object had not been investigated before this project; it is now an integral part of GERBIL QA.
To address these challenges, GERBIL QA provides the following novel contributions:
We offer 8 metrics for benchmarking QA systems as well as 6 novel QA (sub-)experiment types to (1) allow for a fine-grained evaluation of QA systems and (2) improve the diagnostic process. While we reuse the existing GERBIL core, we provide novel matching and metric calculations for QA, since existing evaluation platforms do not offer this functionality. We integrate 6 existing QA systems into the platform and provide an unprecedented bundle of 22 question answering datasets (QALD-1 to QALD-6 and NLQ) to evaluate these systems. We hence present the first integral comparison of QA systems for Linked Data across challenges. Our framework supports both online systems and file-based evaluation campaigns over a large variety of datasets. That is, we allow system results as well as datasets to be uploaded on the fly, and systems to be registered as webservices. In addition, we support three widely used formats for the interactive communication with QA systems via webservices.
Note that GERBIL QA reuses the mechanisms provided by GERBIL to offer citable, stable experiment URIs and descriptions that are both human- and machine-readable. To this end, GERBIL QA uses the recently proposed DataID [6] ontology, which combines VoID [2] and DCAT [25] metadata with PROV-O [23] provenance information and ODRL [27] licenses to describe datasets.
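To illustrate the kind of machine-readable description this yields, the following minimal sketch builds a small dataset description with Apache Jena. The experiment and dataset URIs are hypothetical, the DataID namespace is assumed, and the actual descriptions produced by GERBIL QA may use further VoID, DCAT and ODRL properties.

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.DCTerms;
import org.apache.jena.vocabulary.RDF;

// Sketch of a DataID/DCAT-style dataset description; URIs are illustrative assumptions.
public class DatasetDescriptionSketch {

    public static void main(String[] args) {
        String DATAID = "http://dataid.dbpedia.org/ns/core#"; // assumed DataID namespace
        String DCAT   = "http://www.w3.org/ns/dcat#";
        String PROV   = "http://www.w3.org/ns/prov#";

        Model model = ModelFactory.createDefaultModel();
        Resource dataset = model.createResource("http://example.org/gerbil-qa/dataset/qald-6-train");
        dataset.addProperty(RDF.type, model.createResource(DATAID + "Dataset"));
        dataset.addProperty(RDF.type, model.createResource(DCAT + "Dataset"));
        dataset.addProperty(DCTerms.title, "QALD-6 train (illustrative)");
        // link the dataset to the (hypothetical) experiment that used it
        dataset.addProperty(model.createProperty(PROV, "wasGeneratedBy"),
                model.createResource("http://example.org/gerbil-qa/experiment/123"));

        model.write(System.out, "TURTLE");
    }
}
```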
A demo of the system is available at
Like in other disciplines, QA researchers and practitioners require reliable test environments and comparison methods to speed up their development and lower entrance barriers. There has thus been a number of challenges and campaigns attracting researchers as well as industry practitioners to QA. Since 1998, the TREC conference, especially its QA track [47], has aimed to provide domain-independent evaluations over large, unstructured corpora. This seminal campaign pushed research projects forward over the course of more than ten instalments. The latest TREC QA track tackles the field of live QA.
Next to that, the BioASQ series [37] challenges semantic indexing as well as QA systems on biomedical data and is currently in its fifth instalment. Here, systems have to work on RDF as well as textual data to return matching triples as well as text snippets. Moreover, the OKBQA initiative fosters the collaborative development of QA systems over knowledge bases.
In the following, we will use the QALD datasets and formats (QALD-XML, QALD-JSON) as a base for our benchmarking suite, since they have been adopted by more than 20 QA systems since 2011 (see [11,20] and Table 2). So far, the yearly QALD events enable participants to upload XML- or JSON-based system answers to previously uploaded files on the QALD website.
Built-in datasets and their features
In contrast to existing challenges and campaigns, our platform allows the use of curated, updated benchmark datasets (e.g., via GitHub) instead of once-uploaded static files, and allows specific experiments to refer to a specific version of a dataset by providing the time and date at which the experiment was executed. This is a major issue when aiming to run benchmarks developed on previous versions of a dataset whose SPARQL endpoint has been updated over the years (e.g., running QALD-3 on the 2016 DBpedia endpoint), as the results achieved differ completely from those specified in the benchmark, with some queries no longer being executable. In addition, GERBIL QA allows the implementation of wrappers for QA systems using REST interfaces so that they can be benchmarked online and in real time (see Section 4).
We refer the interested reader to our dataset project homepage for further details.
In its current version, our framework supports 21 QALD campaign datasets (QALD-1 to QALD-6) as well as the NLQ dataset.
In contrast to the existing benchmarking campaigns, GERBIL QA allows supplementary datasets to be added. Users can (1) add them to the project repository and write a dataset wrapper in Java, or (2) upload a dataset as a file via our Web interface for one particular experiment only. The first option enables other users to benchmark with this dataset and can thus spark the generation of new datasets. The second option allows the benchmarking of not-yet-finished or non-disclosed datasets. In addition to supporting JSON and XML files in the QALD format, GERBIL QA supports an extension dubbed eQALD-JSON, which we developed to address some of the drawbacks of the QALD format.
(See also the project wiki pages for updates about the formats.)
Existing formats lack the possibility to measure a system’s ability to recognize entities, classes or properties. Moreover, QALD-XML and QALD-JSON do not allow systems to be benchmarked with respect to their confidence in the computed answer. The main advantage of eQALD-JSON is thus that it represents the answers of a QA system in a way that supports the full set of benchmark types provided by GERBIL QA by explicating annotations, underlying SPARQL queries and more (see Section 5). In particular, it includes (1) a knowledge base version, (2) questions in multiple languages and equivalent keyword queries, (3) annotations of the question w.r.t. RDF resources and properties, (4) meta-information such as answer type and answer item type, and (5) a schemaless query.
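As an illustration of this structure, the following minimal sketch models one eQALD-JSON entry as a Java class with Jackson annotations. All field names are assumptions derived from the enumeration above and do not constitute the normative eQALD-JSON schema.

```java
import com.fasterxml.jackson.annotation.JsonProperty;
import java.util.List;
import java.util.Map;

// Hypothetical model of a single eQALD-JSON entry; field names are assumptions
// based on the information listed above, not the normative schema.
public class EQaldQuestion {

    @JsonProperty("id")
    public String id;

    // (1) knowledge base and version the gold answers refer to, e.g. "DBpedia 2016-04"
    @JsonProperty("knowledgebase")
    public String knowledgeBase;

    // (2) the question in multiple languages plus equivalent keyword queries
    @JsonProperty("question")
    public Map<String, String> questionByLanguage;
    @JsonProperty("keywords")
    public Map<String, String> keywordsByLanguage;

    // (3) annotations of the question w.r.t. RDF resources and properties
    @JsonProperty("annotations")
    public List<Annotation> annotations;

    // (4) meta-information such as answer type and answer item type
    @JsonProperty("answertype")
    public String answerType;
    @JsonProperty("answeritemtype")
    public String answerItemType;

    // (5) a schemaless (pseudo) query plus the underlying SPARQL query
    @JsonProperty("pseudoquery")
    public String pseudoQuery;
    @JsonProperty("sparql")
    public String sparqlQuery;

    // answers, optionally with the system's confidence when used as a response format
    @JsonProperty("answers")
    public List<String> answers;
    @JsonProperty("confidence")
    public Double confidence;

    public static class Annotation {
        @JsonProperty("surfaceform") public String surfaceForm;
        @JsonProperty("uri")         public String uri; // linked RDF resource or property
    }
}
```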
Systems that participated in past QALD challenges. Note that having a publication is optional for QALD. U means unreliable webservice, N not yet implemented due to a non-open API, M human interaction needed.
Table 2 shows that many systems from previous challenges and campaigns do not offer webservices, which makes it difficult to benchmark them with novel datasets. Some offer webservice interfaces, but these are either not open or demand human input. Other systems do not provide comprehensive answerset representations, e.g., they return whole paragraphs containing the answer to a question instead of a Linked Data URI. Yet other systems that participated in past challenges left no trace that would allow a fine-grained quality assessment, as they provide neither publications nor webservices. Thus, the first release of GERBIL QA contains only 6 implemented system webservice clients, which are capable of answering hybrid, multilingual questions or keyword queries. These systems are:
HAWK [42], the first hybrid-source QA system, which processes RDF as well as textual information to answer a single input query. HAWK uses a mix of computational linguistics and semantic annotations to build SPARQL queries.
SINA [34], a keyword and natural language query search engine which exploits the structure of RDF graphs to implement an explorative search approach. The system uses Hidden Markov Models to choose the correct datasets to query within its SPARQL generation process.
YodaQA [3], a modular, open-source, hybrid approach built on top of the Apache UIMA framework.
QAKiS [7], a language-agnostic QA system grounded in ontology-relation matches. Here, the relation matches are based on surface forms extracted from Wikipedia so as to cover a wide variety of contexts. QAKiS matches only one relation per query and relies on basic heuristics that do not account for the full variety of natural language.
QANARY [5], which strives to reuse as many existing components as possible to enable best-of-breed QA systems, following a new methodology for combining pre-existing modules. QANARY is thus both a rapid development environment for new QA systems and a QA system in itself.
OKBQA [22], which was recently introduced by Kim et al. to likewise facilitate a strong collaboration among experts. The Open Knowledge Base Question Answering framework thus supports the development of new QA systems in a collaborative and intuitive way.
Currently, GERBIL QA supports the addition of three types of systems: (1) services implemented as Java-based wrappers (see above), (2) services configured via the Web interface as webservices, or (3) file uploads. Option (2) demands responses as either QALD-JSON or eQALD-JSON, while option (3) supports QALD-XML files as well. For option (1), we implemented the 4 systems that were available as webservices and returned Linked Data. We tested option (2) using the recent QANARY [35] framework. Option (3) was tested with the QALD-6 data and will be used for the 7th instalment of the QALD challenge. This option enables developers to benchmark their system without setting up a webservice endpoint under a public address. As in the main GERBIL platform, experiments and log files remain private until published, i.e., companies and interested parties can test their systems online without fearing premature publication.
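The sketch below illustrates the kind of HTTP interaction a system registered via option (2) has to support: the benchmark sends a question and expects an (e)QALD-JSON answerset in return. The endpoint URL, the example question and the parameter names "query" and "lang" are assumptions for illustration only.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Minimal sketch of querying a QA system registered as a webservice (option 2).
// Endpoint and parameter names are illustrative assumptions.
public class WebserviceClientSketch {

    public static void main(String[] args) throws Exception {
        String endpoint = "http://example.org/my-qa-system/ask"; // hypothetical endpoint
        String body = "query=" + URLEncoder.encode("What is the capital of Germany?", StandardCharsets.UTF_8)
                    + "&lang=en";

        HttpRequest request = HttpRequest.newBuilder(URI.create(endpoint))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // The benchmark would now parse the (e)QALD-JSON answerset and match it
        // against the gold standard; here we only print the raw response.
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```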
In this section, we explain the different experiment types used to evaluate a QA system, as well as how the evaluation metrics are computed and how system answers are compared. Throughout this section, we use the question “Who are the children of Ann Dunham?” as a running example.
Experiment types
GERBIL QA allows the performance of common components of QA systems (named entity recognition, entity linking, etc.) to be measured in addition to the benchmarking of whole QA systems. We use the term sub-experiments to denote experiments for benchmarking such sub-components. We designed and implemented 5 sub-experiments inspired by past evaluation campaigns, following the motivation to measure sub-components suggested by Both et al. [5]. For the sub-experiments P2KB and RE2KB we further argue that, in most QA systems, two different components are responsible for linking resources and properties, and thus these features must be evaluated independently, in line with a recent study [33]. The goal is to provide system designers, researchers and decision makers with the opportunity to spot particular flaws in a QA pipeline and to gain in-depth insights into the performance of systems on different aspects of diverse datasets.
The data necessary to carry out all five of these sub-experiments can be provided via eQALD-JSON. For four of the five new sub-experiments, the needed data can also be derived from the SPARQL query that may be returned by the QA system via QALD-XML or QALD-JSON (see Table 3).
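The sketch below shows, for the running example, how gold resources and properties for such sub-experiments could be derived from a SPARQL query using Apache Jena. The query shown is an illustrative DBpedia query for the running example, and the actual derivation logic in GERBIL QA may differ.

```java
import org.apache.jena.graph.Node;
import org.apache.jena.query.Query;
import org.apache.jena.query.QueryFactory;
import org.apache.jena.sparql.core.TriplePath;
import org.apache.jena.sparql.syntax.ElementPathBlock;
import org.apache.jena.sparql.syntax.ElementVisitorBase;
import org.apache.jena.sparql.syntax.ElementWalker;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

// Sketch: deriving gold resources and properties for sub-experiments from a SPARQL query.
public class SparqlGoldExtractor {

    public static void main(String[] args) {
        String sparql = "PREFIX dbo: <http://dbpedia.org/ontology/> "
                      + "PREFIX dbr: <http://dbpedia.org/resource/> "
                      + "SELECT DISTINCT ?child WHERE { dbr:Ann_Dunham dbo:child ?child . }";

        Query query = QueryFactory.create(sparql);
        Set<String> resources = new HashSet<>();   // gold entities, e.g. dbr:Ann_Dunham
        Set<String> properties = new HashSet<>();  // gold properties, e.g. dbo:child

        ElementWalker.walk(query.getQueryPattern(), new ElementVisitorBase() {
            @Override
            public void visit(ElementPathBlock block) {
                for (Iterator<TriplePath> it = block.patternElts(); it.hasNext(); ) {
                    TriplePath tp = it.next();
                    collect(tp.getSubject(), resources);
                    collect(tp.getPredicate(), properties); // null for complex property paths
                    collect(tp.getObject(), resources);
                }
            }
        });

        System.out.println("Resources:  " + resources);
        System.out.println("Properties: " + properties);
    }

    private static void collect(Node node, Set<String> target) {
        if (node != null && node.isURI()) {
            target.add(node.getURI());
        }
    }
}
```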
Availability of sub-experiments if the data is in the QALD format without a SPARQL query, in the QALD format including a SPARQL query (i.S.q.), and in the eQALD-JSON format.
The prefixes used in the following examples refer to the DBpedia namespaces, e.g., dbr: stands for http://dbpedia.org/resource/ and dbo: for http://dbpedia.org/ontology/.
Note that if a different answerset is returned, our matching algorithm tries to align the answers (see Section 5.3).
GERBIL QA implements 9 evaluation metrics: micro- as well as macro-averaged precision, recall and F-measure, a QALD-specific macro F1 metric, as well as the runtime and the number of errors of webservice-based QA systems [45]. As a reminder, the F1-score is the harmonic mean of precision and recall, i.e., F1 = 2 · precision · recall / (precision + recall).
For the micro precision, recall and F-measure, we first collect all true positives, false positives and false negatives over all questions and only compute the metrics at the end, on these global counts. Thus, this measure gives more weight to questions with many answers.
For the macro metrics, we calculate precision, recall and F-measure per question and average each of these metrics individually at the end. Thus, this measure puts more weight on whether a system can answer all questions correctly. Note that the macro F-measure does not necessarily lie between the macro precision and the macro recall.
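With $TP_q$, $FP_q$ and $FN_q$ denoting the true positives, false positives and false negatives for question $q$ in the question set $Q$, the two averaging schemes correspond to the standard definitions:

$$
P_{\text{micro}} = \frac{\sum_{q \in Q} TP_q}{\sum_{q \in Q} (TP_q + FP_q)}, \qquad
R_{\text{micro}} = \frac{\sum_{q \in Q} TP_q}{\sum_{q \in Q} (TP_q + FN_q)}, \qquad
F_{1,\text{micro}} = \frac{2 \, P_{\text{micro}} \, R_{\text{micro}}}{P_{\text{micro}} + R_{\text{micro}}}
$$

$$
P_{\text{macro}} = \frac{1}{|Q|} \sum_{q \in Q} P_q, \qquad
R_{\text{macro}} = \frac{1}{|Q|} \sum_{q \in Q} R_q, \qquad
F_{1,\text{macro}} = \frac{1}{|Q|} \sum_{q \in Q} F_{1,q}
$$

The last equation also explains the note above: the macro F-measure averages the per-question F-measures rather than being computed from the macro precision and recall, and hence need not lie between them.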
The metrics handle the following special cases:
If the gold answerset is empty and the system responds with an empty answerset, we set precision, recall and F-measure to 1.
If the gold answerset is empty but the system responds with a non-empty answerset, we set precision, recall and F-measure to 0.
If there is a gold answer but the QA system responds with an empty answerset, we assume the system could not answer. Thus, we set precision, recall and F-measure to 0.
In all other cases, we calculate the standard precision, recall and F-measure per question.
For the macro F1 QALD metric, we decided to provide a metric that is more comparable to older QALD challenges and also follows community requests.
If the gold answerset is not empty but the QA system responds with an empty answerset, it is assumed that the system determined that it cannot answer the question. Here, we set the precision to 1 and the recall and F-measure to 0.
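A minimal sketch of these per-question rules, including the QALD-style handling of empty system answers, is given below. It is an illustration of the rules above, not GERBIL QA's actual implementation.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the per-question precision/recall/F-measure rules described above.
public class PerQuestionMetrics {

    public record Scores(double precision, double recall, double f1) {}

    public static Scores evaluate(Set<String> gold, Set<String> system, boolean qaldVariant) {
        if (gold.isEmpty()) {
            // empty gold answerset: 1 if the system also answers with an empty set, else 0
            double v = system.isEmpty() ? 1.0 : 0.0;
            return new Scores(v, v, v);
        }
        if (system.isEmpty()) {
            // non-empty gold but empty system answer: precision 0 (standard) or 1 (QALD variant),
            // recall and F-measure 0 in both cases
            return new Scores(qaldVariant ? 1.0 : 0.0, 0.0, 0.0);
        }
        Set<String> truePositives = new HashSet<>(system);
        truePositives.retainAll(gold);
        double precision = (double) truePositives.size() / system.size();
        double recall = (double) truePositives.size() / gold.size();
        double f1 = (precision + recall) == 0 ? 0.0
                : 2 * precision * recall / (precision + recall);
        return new Scores(precision, recall, f1);
    }

    public static void main(String[] args) {
        Set<String> gold = Set.of("dbr:Barack_Obama", "dbr:Maya_Soetoro-Ng");
        Set<String> system = Set.of("dbr:Barack_Obama");
        System.out.println(evaluate(gold, system, false)); // precision 1.0, recall 0.5, F1 ~0.67
    }
}
```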
In addition, GERBIL QA allows the implementation of further metrics [30]. For instance, a hierarchical F-measure could be used for the AIT2KB sub-experiment [31].
A general problem when benchmarking current QA systems is the heterogeneity of their answerset formats. The example question might be answered with the resources listed above or with the names “Maya Soetoro-Ng” and “Barack Obama”.
Our approach chooses a matching strategy based on the type of response that is expected by the benchmark.
If the gold standard answerset asks for a list of resources, as in the running example above, our approach can handle two types of system answersets. First, if the QA system returns an RDF resource, GERBIL QA relies on the transitive closure of resource URIs that are connected by owl:sameAs links.
Second, if the answer to a question demanding a resource is a plain string, GERBIL QA tries to interpret it as the label of a resource. However, a returned label like “Barack Obama” might be shared by several resources. In this case, all such resources are retrieved and used as input for the resource-based strategy described above. Note that this might decrease the precision of the system, since not all retrieved resources sharing the given label match the expected answers.
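A sketch of this label-based fallback is shown below, using Apache Jena to resolve a plain-string answer to all resources carrying that label before handing them to the resource-based matching. The endpoint and the restriction to English labels are assumptions for illustration.

```java
import org.apache.jena.query.ParameterizedSparqlString;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.ResultSet;
import java.util.HashSet;
import java.util.Set;

// Sketch of the label-based fallback: resolve a string answer to candidate resources.
public class LabelMatchingSketch {

    public static Set<String> resourcesForLabel(String label) {
        ParameterizedSparqlString pss = new ParameterizedSparqlString(
                "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> "
              + "SELECT DISTINCT ?resource WHERE { ?resource rdfs:label ?label . }");
        pss.setLiteral("label", label, "en"); // assumes English labels

        Set<String> resources = new HashSet<>();
        try (QueryExecution qexec = QueryExecutionFactory
                .sparqlService("https://dbpedia.org/sparql", pss.asQuery())) {
            ResultSet results = qexec.execSelect();
            while (results.hasNext()) {
                resources.add(results.next().getResource("resource").getURI());
            }
        }
        return resources;
    }

    public static void main(String[] args) {
        // Several resources may share the label, which can lower the measured precision.
        System.out.println(resourcesForLabel("Barack Obama"));
    }
}
```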
Another problem is that over time the correct answer to a question can change, and QA systems should provide the most recent answers. A question referring to the current president of the USA would generate different results today than in 2015. To address this problem, GERBIL QA provides a way to update the answers of older QA datasets: if present, the SPARQL query in the QALD file can be executed against a specific, configurable knowledge base to retrieve the latest answers. The main drawback of this feature is that the resulting answerset is not manually curated and may contain unchecked results.
Strings, dates and numbers are currently matched by exact string matching. In the future, we will extend this with more sophisticated matching strategies such as lexical mapping, e.g., towards XSD datatypes.
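The following short sketch contrasts the current exact string matching with the kind of datatype-aware comparison envisioned for the future. It is purely illustrative and not part of the platform.

```java
import java.math.BigDecimal;
import java.time.LocalDate;

// Illustration: exact string matching vs. datatype-aware (XSD-style) comparison of literals.
public class LiteralMatchingSketch {

    // current behaviour: literals only match if their lexical forms are identical
    static boolean exactMatch(String gold, String system) {
        return gold.equals(system);
    }

    // possible future behaviour: compare values after mapping to a numeric datatype
    static boolean numericMatch(String gold, String system) {
        return new BigDecimal(gold).compareTo(new BigDecimal(system)) == 0;
    }

    // possible future behaviour: compare values after mapping to a date datatype
    static boolean dateMatch(String gold, String system) {
        return LocalDate.parse(gold).equals(LocalDate.parse(system));
    }

    public static void main(String[] args) {
        System.out.println(exactMatch("42", "42.0"));               // false
        System.out.println(numericMatch("42", "42.0"));             // true
        System.out.println(dateMatch("2008-11-04", "2008-11-04"));  // true
    }
}
```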
Besides these result-focused metrics, our method measures the performance of live systems in two ways. First, it computes the average time a system needs to generate a response. Second, it counts the number of errors returned by the system, or that occur during communication with the system.
Diagnostics
The implemented sub-experiments lead to detailed insights about a system’s performance. We created an example experiment to illustrate these diagnostic capabilities.
To foster an open community of QA researchers, we need a reliable platform for managing experimental data in a citable and comparable way, both readable for humans and machines. Thus, we published the GERBIL QA platform under the permanent ID
We presented this platform as a prototype to the W3C community group on Natural Language Interfaces for the Web of Data.
We are seeing tremendous interest in the platform even though it has not yet been published at any conference or in any journal. GERBIL QA has already been used for 85 experiments comprising more than 940 sub-experiment executions. For example, the developers of HAWK use the system to measure the performance of different configurations via file uploads.
We present the first online benchmarking system for question answering approaches over factoid questions. Our platform strives to speed up the development process by offering diverse datasets, systems and interfaces to generate repeatable and citable experiments with in-depth analytics of a system’s performance. A known limitation is our focus on RDF-based systems (RDF resource matching, and the need for a SPARQL query to run the sub-experiments), which we seek to overcome in the future by standardizing how interfaces communicate the needed information, e.g., by demanding a SPARQL query within the result set.
In near-future developments, we will add further metrics such as the hierarchical F-measure, novel datasets (e.g., LC-QuAD [36], currently the largest QA benchmark dataset, or the Wikidata-based dataset [12] presented at the 3rd NLIWOD workshop) and more systems. Moreover, we will unify the method of matching system answers with gold standard answers, thus fostering a fast-paced, open science movement. We will look into evaluation campaigns such as TREC LiveQA and CLEF to broaden our scope and also include non-factoid QA. To this end, we need to enable hybrid, crowd-based evaluations within the workflow of the existing automatic evaluation. Furthermore, we will add this benchmarking platform to the H2020 HOBBIT project.
Acknowledgements
The authors gratefully acknowledge financial support from the German Federal Ministry of Education and Research within Eurostars, a joint programme of EUREKA and the European Community, under the projects E!9367 DIESEL and E!9725 QAMEL, as well as from the European Union’s H2020 research and innovation action HOBBIT (GA 688227). We thank the QANARY team for inspiring discussions. Furthermore, we want to thank Jin-Dong Kim for his thoughts on the novel QA format. We also acknowledge support by the BMVI projects LIMBO (project no. 19F2029C) and OPAL (project no. 19F20284) as well as by the German Federal Ministry of Education and Research (BMBF) within ‘KMU-innovativ: Forschung für die zivile Sicherheit’ and the project SOLIDE (no. 13N14456).
