Abstract
Semantic Question Answering (SQA) removes two major access requirements to the Semantic Web: the mastery of a formal query language like SPARQL and knowledge of a specific vocabulary. Because of the complexity of natural language, SQA presents difficult challenges and many research opportunities. Instead of a shared effort, however, many essential components are redeveloped, which is an inefficient use of researchers' time and resources. This survey analyzes 62 different SQA systems, which are systematically and manually selected using predefined inclusion and exclusion criteria, leading to 72 selected publications out of 1960 candidates. We identify common challenges, structure solutions, and provide recommendations for future systems. This work is based on publications from the end of 2010 to July 2015 and is also compared to older but similar surveys.
Introduction
Semantic Question Answering (SQA) is defined by users (1) asking questions in natural language (NL) (2) using their own terminology to which they (3) receive a concise answer generated by querying an RDF knowledge base (definition based on Hirschman and Gaizauskas [80]).
This survey follows a strict discovery methodology: objective inclusion and exclusion criteria are used to find and restrict publications on SQA.
The period before is already covered by Cimiano and Minock [37].
Sources of publication candidates along with the number of publications in total, after excluding based on conference tracks (I), based on the title (II), and finally based on the full text (selected). Works that are found both in a conference’s proceedings and in Google Scholar are only counted once, as selected for that conference. The QALD 2 proceedings are included in ILD 2012, QALD 3 [27] and QALD 4 [31] in the CLEF 2013 and 2014 working notes
This section gives an overview of recent QA and SQA surveys (see Table 2) and differences to this work, as well as QA and SQA evaluation campaigns, which quantitatively compare systems.
Other surveys
Semantically weak constructions. Cannot be answered as the information required is not contained in the knowledge base.
Other surveys by year of publication. Surveyed years are given except when a dataset is theoretically analyzed. Approaches addressing specific types of data are also indicated
In contrast to the surveys mentioned above, we do not focus on the overall performance or domain of a system, but on analyzing and categorizing methods that tackle specific problems. Additionally, we build upon the existing surveys and describe new state-of-the-art systems published after the aforementioned surveys, in order to keep track of new research ideas.
Contrary to QA surveys, which qualitatively compare systems, there are also evaluation campaigns, which quantitatively compare them using benchmarks. Those campaigns show how different open-domain QA systems perform on realistic questions on real-world knowledge bases. This accelerates the evolution of QA in four different ways: First, new systems do not have to include their own benchmark, shortening system development. Second, standardized evaluation allows for better research resource allocation, as it is easier to determine which approaches are worthwhile to develop further. Third, the addition of new challenges to the questions of each new benchmark iteration motivates addressing those challenges. And finally, the competitive pressure to keep pace with the top-scoring systems compels the emergence and integration of shared best practices. On the other hand, evaluation campaign proceedings do not describe single components of those systems in great detail. By focusing on complete systems, research effort is spread across multiple components, possibly duplicating existing efforts, instead of being focused on a single one.
System frameworks
System frameworks provide an abstraction in which generic functionality can be selectively changed by additional third-party code. In document retrieval, there are many existing frameworks, such as Lucene.
Document retrieval frameworks usually split the retrieval process into three steps: (1) query processing, (2) retrieval and (3) ranking. In the (1) query processing step, query analyzers identify documents in the data store. Thereafter, the query is used to (2) retrieve documents that match the query terms resulting from the query processing. Finally, the retrieved documents are (3) ranked according to some ranking function, commonly tf-idf [134]. Developing an SQA framework is a hard task because many systems work with a mixture of NL techniques on top of traditional IR systems. Some systems make use of the syntactic graph behind the question [142] to deduce the query intention, whereas others use the knowledge graph [129]. There are hybrid systems that work on both structured and unstructured data [144] or on a combination of systems [71]. Therefore, they contain highly specific steps. This has led to a new research subfield that focuses on QA frameworks, that is, the design and development of common features for SQA systems.
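The three-step pipeline can be sketched with a toy in-memory document store and plain tf-idf scoring. This is a minimal illustration under made-up data, not the implementation of any framework discussed here:

```python
import math
from collections import Counter

# Toy document store; contents are illustrative only.
docs = {
    "d1": "the capital of france is paris",
    "d2": "paris is a city in france",
    "d3": "berlin is the capital of germany",
}

def tokenize(text):
    return text.lower().split()

def tf_idf_rank(query, docs):
    n = len(docs)
    tokenized = {d: tokenize(t) for d, t in docs.items()}
    # document frequency per term
    df = Counter()
    for toks in tokenized.values():
        for term in set(toks):
            df[term] += 1
    scores = {}
    for d, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):                        # (1) query processing
            if term in tf:                                  # (2) retrieval
                score += tf[term] * math.log(n / df[term])  # (3) tf-idf ranking
        scores[d] = score
    return sorted(scores, key=scores.get, reverse=True)

print(tf_idf_rank("capital of france", docs))
```

Real frameworks replace each step with pluggable components (analyzers, retrievers, similarity functions), which is exactly the extension point SQA frameworks struggle to standardize.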
openQA [95]8
The 72 surveyed publications describe 62 distinct systems or approaches. The implementation of an SQA system can be very complex and may depend on, and thus reuse, several known techniques. SQA systems are typically composed of two stages: (1) the query analyzer and (2) retrieval. The query analyzer generates or formats the query that will be used to recover the answer at the retrieval stage. A wide variety of techniques can be applied at the analyzer stage, such as tokenisation, disambiguation, internationalization, logical forms, semantic role labels, question reformulation, coreference resolution, relation extraction and named entity recognition, amongst others. For some of those techniques, such as natural language (NL) parsing and part-of-speech (POS) tagging, mature all-purpose methods are available and commonly reused. Other techniques, such as disambiguation between multiple possible answer candidates, are not readily available in a domain-independent fashion. Thus, high-quality solutions can only be obtained by the development of new components. This section exemplifies some of the reviewed systems and their novelties to highlight current research questions, while the next section presents the contributions of all analyzed papers to specific challenges.
Hakimov et al. [72] propose an SQA system using syntactic dependency trees of input questions. The method consists of three main steps: (1) Triple patterns are extracted using the dependency tree and POS tags of the questions. (2) Entities, properties and classes are extracted and mapped to the underlying knowledge base. Recognized entities are disambiguated using page links between all spotted named entities as well as string similarity. Properties are disambiguated by using relational linguistic patterns from PATTY [107], which allows a more flexible mapping, such as “die” to dbo:deathPlace (see Table 3). Finally, (3) question words are matched to the respective answer type, such as “who” to
URL prefixes used throughout this work
PARALEX [54] only answers questions for subjects or objects of property-object or subject-property pairs, respectively. It contains phrase to concept mappings in a lexicon that is trained from a corpus of paraphrases, which is constructed from the question-answer site WikiAnswers.9
Xser [149] is based on the observation that SQA contains two independent steps. First, Xser determines the question structure solely based on a phrase-level dependency graph and second uses the target knowledge base to instantiate the generated template. Moving to another domain with a different knowledge base thus only affects parts of the approach, so that the conversion effort is lessened.
QuASE [136] is a three stage open domain approach based on web search and the Freebase knowledge base.10
DEV-NLQ [63] is based on lambda calculus and an event-based triple store11
CubeQA [81,82] is a novel approach to SQA over multi-dimensional statistical Linked Data using the RDF Data Cube Vocabulary,
QAKiS [26,28,39] queries several multilingual versions of DBpedia at the same time by filling the produced SPARQL query with the corresponding language-dependent properties and classes. Thus, it can retrieve correct answers even in cases of missing information in the language-dependent knowledge base.
Freitas and Curry [59] evaluate a distributional-compositional semantics approach that is independent from manually created dictionaries but instead relies on co-occurring words in text corpora. The vector space over the set of terms in the corpus is used to create a distributional vector space based on the weighted term vectors for each concept. An inverted Lucene index is adapted to the chosen model.
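The core idea of such a distributional vector space can be sketched by building co-occurrence vectors from a toy corpus and comparing them with cosine similarity. The corpus and window size below are arbitrary illustrative choices, not the model of Freitas and Curry:

```python
import math
from collections import defaultdict, Counter

# Tiny hypothetical corpus; each word is represented by the counts of
# words co-occurring with it within a fixed window.
corpus = "the author wrote the book . the writer wrote a novel .".split()

def cooccurrence_vectors(tokens, window=2):
    vectors = defaultdict(Counter)
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                vectors[w][tokens[j]] += 1
    return vectors

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

vecs = cooccurrence_vectors(corpus)
# "author" and "writer" occur in similar contexts, so their vectors overlap.
print(cosine(vecs["author"], vecs["writer"]))
```

In the evaluated approach, such vectors are built at corpus scale and served from an adapted inverted Lucene index rather than from an in-memory dictionary.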
Instead of querying a specific knowledge base, Sun et al. [136] use web search engines to extract relevant text snippets, which are then linked to Freebase, where a ranking function is applied and the highest ranked entity is returned as the answer.
HAWK [144] is the first hybrid-source SQA system that processes Linked Data as well as textual information to answer one input query. HAWK uses an eight-fold pipeline comprising part-of-speech tagging, entity annotation, dependency parsing, linguistic pruning heuristics for an in-depth analysis of the natural language input, semantic annotation of properties and classes, the generation of basic triple patterns for each component of the input query, the discarding of queries containing disconnected query graphs, and the subsequent ranking of the remaining queries.
SWIP (Semantic Web Interface using Patterns) [118] generates a pivot query, a hybrid structure between the natural language question and the formal SPARQL target query. Generating the pivot queries consists of three main steps: (1) named entity identification, (2) query focus identification and (3) sub-query generation. To formalize the pivot queries, the query is mapped to linguistic patterns, which are created by hand by domain experts. If there are multiple applicable linguistic patterns for a pivot query, the user chooses between them.
Hakimov et al. [73] adapt a semantic parsing algorithm to SQA which achieves a high performance but relies on large amounts of training data which is not practical when the domain is large or unspecified.
Several industry-driven SQA-related projects have emerged over the last years. For example, DeepQA of IBM Watson [71], which was able to win the Jeopardy! challenge against human experts.
YodaQA [15] is a modular open source hybrid approach built on top of the Apache UIMA framework13
Further, KAIST’s Exobrain14
Cheng et al. [34] propose a random surfer model extended by a notion of centrality, i.e., a computation of the central elements involving similarity (or relatedness) between them as well as their informativeness. The similarity is given by a combination of the relatedness between their properties and their values.
Ngomo et al. [111] present another approach that automatically generates natural language descriptions of resources using their attributes. The rationale behind SPARQL2NL is to verbalize RDF data. For example, "123"ˆˆ<http://dbpedia.org/datatype/squareKilometre> can be verbalized as “123 square kilometres”.
In this section, we address seven challenges that have to be faced by state-of-the-art SQA systems. All mentioned challenges are currently open research fields. For each challenge, we describe efforts mentioned in the 72 selected publications. Challenges that affect SQA, but that are not to be solved by SQA systems, such as speech interfaces, data quality and system interoperability, are analyzed in Shekarpour et al. [130].
Lexical gap
In a natural language, the same meaning can be expressed in different ways. Natural language descriptions of RDF resources are provided by values of the rdfs:label property. If normalizations are not enough, the distance, and its complementary concept, similarity, can be quantified using a string distance measure.
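Such a string distance can be computed, for instance, with the Levenshtein edit distance, which several surveyed systems use to match question phrases against resource labels. Below is a minimal self-contained sketch; the normalized similarity variant is one common convention, not prescribed by any particular system:

```python
# Levenshtein edit distance via a rolling dynamic-programming row.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# Normalized similarity in [0, 1]: 1 - distance / max length.
def similarity(a, b):
    m = max(len(a), len(b))
    return 1.0 - levenshtein(a, b) / m if m else 1.0

print(levenshtein("running", "runing"))   # one deletion
print(similarity("running", "runs"))
```

A matcher would then accept a label as a candidate when this similarity exceeds a threshold, trading precision against recall.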
Different techniques for bridging the lexical gap along with examples of deviations of the word “running” that these techniques cover
In traditional document-based search engines with high recall and low precision, this trade-off is more common than in SQA. SQA is typically optimized for concise answers and high precision, since a SPARQL query with an incorrectly identified concept mostly results in a wrong set of answer resources. However, AQE can be used as a backup method in case there is no direct match. One of the surveyed publications is an experimental study [128] that evaluates the impact of AQE on SQA. It analyzed different lexical and semantic features: lexical features include synonyms, hypernyms and hyponyms; semantic features make use of RDF graphs and the RDFS vocabulary, such as equivalent classes, subclasses and superclasses.
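As an illustration of AQE as a backup method, the sketch below expands query terms with synonyms from a hand-crafted, purely hypothetical lexicon; actual systems would draw expansions from WordNet synsets or from RDFS equivalence and subclass axioms instead:

```python
# Hypothetical synonym lexicon; real AQE would consult WordNet or an
# RDFS class hierarchy rather than a hard-coded table.
SYNONYMS = {
    "movie": ["film"],
    "author": ["writer"],
}

def expand_query(terms):
    """Return the query terms plus any known synonyms, preserving order."""
    expanded = []
    for t in terms:
        expanded.append(t)
        expanded.extend(SYNONYMS.get(t, []))
    return expanded

print(expand_query(["movie", "director"]))
```

Since every added synonym can match unintended resources, such expansion is best applied only when the unexpanded query yields no result.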
Examples: “X wrote Y” and “Y is written by X”; “if X writes a book, X is called the author of it.”
PATTY [107] detects entities in sentences of a corpus and determines the shortest path between the entities. The path is then expanded with occurring modifiers and stored as a pattern. Thus, PATTY is able to build up a pattern library on any knowledge base with an accompanying corpus.
BOA [69] generates linguistic patterns using a corpus and a knowledge base. For each property in the knowledge base, sentences from a corpus are chosen containing examples of subjects and objects for this particular property. BOA assumes that each resource pair that is connected in a sentence exemplifies another label for this relation and thus generates a pattern from each occurrence of that word pair in the corpus.
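A toy illustration of this idea: given a sentence and a known subject-object pair of a property, the text connecting the two labels is extracted as a candidate natural language pattern. This is a drastic simplification of BOA, with made-up example data:

```python
import re

def extract_pattern(sentence, subject_label, object_label):
    """Return the text between a known subject and object label,
    as a candidate pattern for the property linking them."""
    m = re.search(re.escape(subject_label) + r"(.+?)" + re.escape(object_label),
                  sentence)
    return m.group(1).strip() if m else None

# Hypothetical (subject, object) pair for an authorship property.
print(extract_pattern("Dan Brown wrote The Da Vinci Code",
                      "Dan Brown", "The Da Vinci Code"))
```

BOA additionally scores and filters such candidates over a large corpus, since a single co-occurrence may connect the pair by coincidence rather than by the target relation.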
PARALEX [54] contains phrase to concept mappings in a lexicon that is trained from a corpus of paraphrases from the QA site WikiAnswers. The advantage is that no manual templates have to be created as they are automatically learned from the paraphrases.
BELA [146] implements four layers. First, the question is mapped directly to concepts of the ontology using an index lookup. Second, the question is mapped to the ontology based on Levenshtein distance, if the similarity between a word from the question and a property from the ontology exceeds a certain threshold. Third, WordNet is used to find synonyms for a given word. Finally, BELA uses explicit semantic analysis (ESA) by Gabrilovich and Markovitch [65]. The evaluation is carried out on the QALD 2 [143] test dataset and shows that the simpler steps, like index lookup and Levenshtein distance, had the most positive influence on answering questions, so that many questions can be answered with simple mechanisms.
Park et al. [115] answer natural language questions via regular expressions and keyword queries with a Lucene-based index. Furthermore, the approach uses DBpedia [92] as well as their own triple extraction method on the English Wikipedia.
Ambiguity is the phenomenon of the same phrase having different meanings; this can be structural and syntactic (like “flying planes”) or lexical and semantic (like “bank”). We distinguish between homonymy, where the same string accidentally refers to different concepts (as in money bank vs. river bank), and polysemy, where the same string refers to different but related concepts (as in bank as a company vs. bank as a building). We also distinguish between synonymy and taxonomic relations such as metonymy and hypernymy. In contrast to the lexical gap, which impedes the recall of an SQA system, ambiguity negatively affects its precision. Ambiguity is the flip side of the lexical gap.
This problem is aggravated by the very methods used for overcoming the lexical gap: the looser the matching criteria become (increasing recall), the more candidates are found, and these are generally less likely to be correct than closer matches.
In SQA,
Instead of trying to resolve ambiguity automatically, some approaches let the user clarify the exact intent, either in all cases or only for ambiguous phrases: SQUALL [56,57] defines a controlled, English-based vocabulary that is enhanced with knowledge from a given triple store. While this ideally results in a high performance, it moves the problem of the lexical gap and disambiguation fully to the user. As such, it covers a middle ground between SPARQL and full-fledged SQA, with the authors' intent that learning the grammatical structure of this proposed language is easier for a non-expert than learning SPARQL. A cooperative approach that places less of a burden on the user is proposed in [96], which transforms the question into a discourse representation structure and starts a dialogue with the user for all occurring ambiguities. CrowdQ [48] is an SQA system that decomposes complex queries into simple parts (keyword queries) and uses crowdsourcing for disambiguation. It avoids excessive usage of crowd resources by creating general templates as an intermediate step. FREyA (Feedback, Refinement and Extended VocabularY Aggregation) [42] represents phrases as potential ontology concepts, which are identified by heuristics on the syntactic parse tree. Ontology concepts are identified by matching their labels with phrases from the question without regarding its structure. A consolidation algorithm then matches both potential and ontology concepts. In case of ambiguities, the user is asked for feedback. Disambiguation candidates are created using string similarity in combination with WordNet synonym detection. The system learns from the user selections, thereby improving its precision over time. TBSL [142] uses both a domain-independent and a domain-dependent lexicon, so that it performs well on a specific domain but is still adaptable to another one. It uses AutoSPARQL [89] to refine the learned SPARQL query using the QTL algorithm for supervised machine learning.
The user marks certain answers as correct or incorrect and triggers a refinement. This is repeated until the user is satisfied with the result. An extension of TBSL is DEQA [91], which combines Web extraction with OXPath [64], interlinking with LIMES [110] and SQA with TBSL. It can thus answer complex questions about objects which are only available as HTML. Another extension of TBSL is ISOFT [114], which uses explicit semantic analysis to help bridge the lexical gap. NL-Graphs [53] combines SQA with an interactive visualization of the graph of triple patterns in the query, which is close to the SPARQL query structure yet still intuitive to the user. Users that find errors in the query structure can either reformulate the query or modify the query graph. KOIOS [18] answers queries on natural environment indicators and allows the user to refine the answer to a keyword query by faceted search. Instead of relying on a given ontology, a schema index is generated from the triples and then connected with the keywords of the query. Ambiguity is resolved by user feedback on the top-ranked results.
A different way to restrict the set of answer candidates and thus handle ambiguity is to determine the expected answer type of a factual question. The standard approach to determine this type is to identify the focus of the question and to map this type to an ontology class. In the example “Which books are written by Dan Brown?”, the focus is “books”, which is mapped to dbo:Book. There is, however, a long tail of rare answer types that are not as easily alignable to an ontology, which, for instance, Watson [71] tackles using the TyCor [87] framework for type coercion. Instead of the standard approach, candidates are first generated using multiple interpretations and then selected based on a combination of scores. Besides trying to align the answer type directly, it is
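The standard expected-answer-type approach can be sketched as a simple lookup from the question focus to an ontology class; the mapping table here is illustrative and not taken from any surveyed system:

```python
# Hypothetical focus-to-class mapping; real systems derive the focus from
# the parse tree and align it to the ontology, rather than using a table.
FOCUS_TO_CLASS = {
    "who": "dbo:Person",
    "where": "dbo:Place",
    "books": "dbo:Book",
}

def expected_answer_type(question):
    """Return the ontology class matching the first known focus word."""
    for token in question.lower().rstrip("?").split():
        if token in FOCUS_TO_CLASS:
            return FOCUS_TO_CLASS[token]
    return None  # long-tail answer types: fall back to e.g. type coercion

print(expected_answer_type("Which books are written by Dan Brown?"))
```

The `None` branch corresponds to the long tail mentioned above, where frameworks like TyCor score candidate types instead of relying on a direct alignment.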
Multilingualism
Knowledge on the Web is expressed in various languages. While RDF resources can be described in multiple languages at once using language tags, there is no single language that is always used in Web documents. Additionally, users have different native languages. A more flexible approach is thus to have SQA systems that can handle multiple input languages, which may even differ from the language used to encode the knowledge. Deines and Krechel [46] use
Complex queries
Simple questions can most often be answered by translation into a set of simple triple patterns. Problems arise when several facts have to be found out, connected and then combined. Queries may also request a specific result order or results that are aggregated or filtered.
YAGO-QA [1] allows nested queries when the subquery has already been answered, for example “Who is the governor of the state of New York?” after “What is the state of New York?” YAGO-QA extracts facts from Wikipedia (categories and infoboxes), WordNet and GeoNames. It contains different surface forms such as abbreviations and paraphrases for named entities.
PYTHIA [140] is an ontology-based SQA system with an automatically built ontology-specific lexicon. Due to the linguistic representation, the system is able to answer natural language questions with linguistically more complex constructions, involving quantifiers, numerals, comparisons and superlatives, negations and so on.
IBM Watson [71] handles complex questions by first determining the focus element, which represents the searched entity. The information about the focus element is used to predict the lexical answer type and thus restrict the range of possible answers. This approach allows for indirect questions and multiple sentences.
Shekarpour et al. [125,126], as mentioned in Section 5.2, propose a model that uses a combination of knowledge base concepts with a hidden Markov model (HMM) to handle complex queries.
Intui2 [49] is an SQA system based on DBpedia.
GETARUNS [47] first creates a logical form out of a query, which consists of a focus, a predicate and arguments. The focus element identifies the expected answer type. For example, the focus of “Who is the mayor of New York?” is “person”, the predicate is “be” and the arguments are “mayor of New York”. If no focus element is detected, a yes/no question is assumed. In the second step, the logical form is converted to a SPARQL query by mapping elements to resources via label matching. The resulting triple patterns are then split up again, as properties are referenced by unions over both possible directions, as in (
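The union over both property directions can be illustrated by generating a direction-agnostic SPARQL graph pattern; the resource and property names below are hypothetical examples, not taken from GETARUNS:

```python
def both_directions(var, prop, resource):
    """Build a SPARQL UNION pattern querying the property in both
    directions, since the question does not reveal which way it points."""
    return ("{ %s %s %s . } UNION { %s %s %s . }"
            % (var, prop, resource, resource, prop, var))

# Illustrative identifiers; dbo:mayor and dbr:New_York_City are assumptions.
pattern = both_directions("?x", "dbo:mayor", "dbr:New_York_City")
print(pattern)
```

Embedding such a pattern in a SELECT query retrieves matches regardless of whether the knowledge base models the relation from subject to object or the other way around.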
Distributed knowledge
If the concept information referred to in a query is represented by distributed RDF resources, the information needed for answering it may be missing if only a single knowledge base, or not all of the relevant knowledge bases, are found. In single datasets with a single source, such as DBpedia, however, most of the concepts have at most one corresponding resource. In case of combined datasets, this problem can be dealt with by creating
Some questions are only answerable with multiple knowledge bases, and we assume already created links for the sake of this survey. The ALOQUS [86] system tackles this problem by first using the PROTON [43] upper-level ontology to phrase the queries. The ontology is then aligned to those of other knowledge bases using the BLOOMS [85] system. Complex queries are decomposed into separately handled subqueries after coreferences are resolved, such as in “List the Semantic Web
Herzig et al. [79] search for entities and consolidate results from multiple knowledge bases. Similarity metrics are used both to determine and rank result candidates of each data source and to identify matches between entities from different data sources.
In basic RDF, each fact, which is expressed by a triple, is assumed to be true, regardless of circumstances. In the real world and in natural language, however, the truth value of many statements is not a constant but a function of location, time, or both.
Melo et al. [96] propose to include the implicit temporal and spatial context of the user in a dialog in order to resolve ambiguities. The approach also includes spatial, temporal and other implicit information.
QALL-ME [55] is a multilingual framework based on description logics that uses the spatial and temporal context of the question. If this context is not explicitly given, the location and time of the user posing the question are added to the query. This context is also used to determine the language used for the answer, which can differ from the language of the question.
Younis et al. [154] employ an inverted index for named entity recognition that enriches semantic data with spatial relationships such as crossing, inclusion and nearness. This information is then made available for SPARQL queries.
For complex questions, where the resulting SPARQL query contains more than one basic graph pattern, sophisticated approaches are required to capture the structure of the underlying query. Current research follows two paths, namely (1) template based approaches, which map input questions to either manually or automatically created SPARQL query templates or (2) template-free approaches that try to build SPARQL queries based on the given syntactic structure of the input question.
For the first solution, many (1) template-driven approaches have been proposed, like TBSL [142] or SINA [125,126]. Furthermore, CASIA [77] generates graph pattern templates by using the question type, named entities and POS tags. The generated graph patterns are then mapped to resources using WordNet, PATTY and similarity measures. Finally, the possible graph pattern combinations are used to build SPARQL queries. The system focuses on the generation of SPARQL queries that do not need filter conditions, aggregations or superlatives.
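The template-driven idea common to these systems can be sketched as follows: a SPARQL template selected by question type is instantiated with the resources produced by the mapping step. Template shapes, question types and identifiers below are illustrative, not those of any particular system:

```python
# Hypothetical templates keyed by question type; real systems choose
# among many more shapes and learn or hand-craft them.
TEMPLATES = {
    "factoid": "SELECT ?x WHERE { %(subject)s %(property)s ?x . }",
    "count":   "SELECT (COUNT(?x) AS ?c) WHERE { %(subject)s %(property)s ?x . }",
}

def instantiate(question_type, bindings):
    """Fill the chosen template with the mapped resources."""
    return TEMPLATES[question_type] % bindings

# Illustrative bindings; dbr:Dan_Brown and dbo:author are assumptions.
query = instantiate("factoid",
                    {"subject": "dbr:Dan_Brown", "property": "dbo:author"})
print(query)
```

The rigidity criticized in the survey is visible here: any question whose structure matches no predefined template cannot be answered, which is exactly what template-free approaches try to avoid.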
Ben Abacha and Zweigenbaum [16] focus on a narrow medical patients-treatment domain and use manually created templates alongside machine learning.
Damova et al. [44] return well formulated natural language sentences that are created using a template with optional parameters for the domain of paintings. Between the input query and the SPARQL query, the system places the intermediate step of a multilingual description using the Grammatical Framework [122], which enables the system to support 15 languages.
Rahoman and Ichise [120] propose a template based approach using keywords as input. Templates are automatically constructed from the knowledge base.
However, (2) template-free approaches require the additional effort of ensuring coverage of every possible basic graph pattern [144]. Thus, only a few SQA systems have tackled this approach so far.
Xser [149] first assigns semantic labels, i.e., variables, entities, relations and categories, to phrases by casting them to a sequence labelling pattern recognition problem which is then solved by a structured perceptron. The perceptron is trained using features including n-grams of POS tags, NER tags and words. Thus, Xser is capable of covering any complex basic graph pattern.
Going beyond SPARQL queries is TPSM, the open domain Three-Phases Semantic Mapping [68] framework. It maps natural language questions to OWL queries using Fuzzy Constraint Satisfaction Problems. Constraints include surface text matching, preference of POS tags and the similarity degree of surface forms. The set of correct mapping elements acquired using the FCSP-SM algorithm is combined into a model using predefined templates.
An extension of gAnswer [156] (see Section 5.2) is based on question understanding and query evaluation. First, the approach uses a relation mining algorithm to find triple patterns in queries, as well as relation extraction, POS tagging and dependency parsing. Second, it tries to find a matching subgraph for the extracted triples and scores the candidates based on a confidence score. Finally, the top-k subgraph matches are returned. The evaluation on QALD 3 shows that mapping NL questions to graph patterns is not as powerful as generating SPARQL (template) queries with respect to the aggregation and filter functions needed to answer several benchmark input questions.
Conclusion
In this survey, we analyzed 62 systems and their contributions to seven challenges for SQA systems. SQA is an active research field with many existing and diverse approaches covering a multitude of research challenges, domains and knowledge bases.
We only cover QA on the Semantic Web, that is, approaches that retrieve resources as Linked Data from RDF knowledge bases. As similar challenges are faced by QA unrelated to the Semantic Web, we refer to Section 3. We chose not to go into detail on approaches that do not retrieve resources from RDF knowledge bases. Moreover, our consensus on best practices can be found in Table 6. The upcoming HOBBIT
Number of publications per year per addressed challenge. Percentages are given for the fully covered years 2011–2014 separately and for the whole covered timespan, with 1 decimal place. For a full list, see Table 7
Overall, the authors of this survey cannot observe a research drift towards any of the challenges. The number of publications in a certain research challenge does not decrease significantly, which can be seen as an indicator that none of the challenges is solved yet (see Table 5). Naturally, since only a small number of publications addressed each challenge in a given year, one cannot draw statistically valid conclusions. The challenges proposed by Cimiano and Minock [37] and reduced within this survey appear to be still valid.
Established and actively researched as well as envisioned techniques for solving each challenge
Bridging the (1) lexical gap has to be tackled by every SQA system in order to retrieve results with high recall. For named entities, this is commonly achieved using a combination of reliable and mature natural language processing algorithms for string similarity and either stemming or lemmatization, see Table 6. Automatic Query Expansion (AQE), for example with WordNet synonyms, is prevalent in information retrieval but only rarely used in SQA, despite its potential negative effects on precision: synonyms and other related words almost never have exactly the same meaning.
Surveyed publications from November 2010 to July 2015, inclusive, along with the challenges they explicitly address and the approach or system they belong to. Additionally annotated is the use of light expressions as well as the use of intermediate templates. In case the system or approach is not named in the publication, a name is generated using the last name of the first author and the year of the first included publication
The next challenge, (2) ambiguity, is addressed by the majority of the publications, but the percentage does not increase over time, presumably because of use cases with small knowledge bases, where its impact is minuscule. For systems intended for long-term usage by the same persons, we regard as promising the integration of previous questions, time and location, as is already common in search engines for the Web of Documents. There is a variety of established disambiguation methods, which use the context of a phrase to determine the most likely RDF resource; some of these are based on unstructured text collections and others on RDF resources. As we could identify no clear winner, we recommend system developers to base their decisions on the resources (such as query logs, ontologies, thesauri) available to them. Many approaches reinvent disambiguation efforts and thus, like for the lexical gap, holistic, knowledge-base-aware, reusable systems are needed to facilitate faster research.
Despite its inclusion in QALD 3 and subsequent campaigns, publications dealing with (3) multilingualism remain a small minority. Automatic translation of parts of or the whole query requires the least development effort but suffers from imperfect translations. A higher quality can be achieved by using components, such as parsers and synonym libraries, for multiple languages. A possible future research direction is to make use of several language versions at once to use the power of a unified graph [39]. For instance, DBpedia [92] provides a knowledge base in more than 100 languages, which could form the base of a multilingual SQA system.
Complex operators (4) seem to be used only in specific tasks or factual questions. Most systems either use the syntactic structure of the question or some form of knowledge-base aware logic. Future research will be directed towards domain-independence as well as non-factual queries.
Approaches using (5) distributed knowledge as well as those incorporating (6) procedural, temporal and spatial data remain niches. Procedural SQA does not exist yet, as present approaches return unstructured text in the form of already written step-by-step instructions. While we consider future development of procedural SQA feasible with existing techniques, as far as we know there is no RDF vocabulary for, or knowledge base with, procedural knowledge yet.
The (7) templates challenge, which subsumes the question of mapping a question to a query structure, is still unsolved. Although the development of template-based approaches seems to have decreased in 2014, presumably because of their low flexibility on open-domain tasks, they still present the fastest way to develop a novel SQA system, but the limitation to simple query structures has yet to be overcome.
Future research should be directed at more modularization, automatic reuse, self-wiring and encapsulated modules with their own benchmarks and evaluations. Thus, novel research fields can be tackled by reusing already existing parts and focusing on the core research problem itself. A step in this direction is QANARY [23], which describes how to modularize QA systems by providing a core QA vocabulary against which existing vocabularies are bound. Another research direction is SQA systems as aggregators or frameworks for other systems or algorithms, to benefit from the set of existing approaches. Furthermore, benchmarking will move to single algorithmic modules instead of benchmarking a system as a whole. The target of local optimization is benchmarking a process at its individual steps, but global benchmarking is still needed to measure the impact of error propagation across the chain. A Turing-test-like spirit would suggest that the latter is more important, as the local measures are never fully representative. Additionally, we foresee a move from factual benchmarks over common sense knowledge to more domain-specific questions without purely factual answers. Thus, there is a movement towards multilingual, multi-knowledge-source SQA systems that are capable of understanding noisy, human natural language input.
Acknowledgements
This work has been supported by the FP7 project GeoKnow (GA No. 318159), the BMWI project SAKE (Project No. 01MD15006E) and by the Eurostars projects DIESEL (E!9367) and QAMEL (E!9725) as well as the European Union’s H2020 research and innovation action HOBBIT (GA 688227).
