Abstract

Recent and intensive research in the biomedical area has made it possible to accumulate and disseminate biomedical knowledge through various knowledge bases increasingly available on the Web. Exploiting this knowledge requires creating links between these bases and using them jointly. Linked Data, the SPARQL language and natural language question answering interfaces provide interesting solutions for querying such knowledge bases. However, while using biomedical Linked Data is crucial, life-science researchers may have difficulties using the SPARQL language. Interfaces based on natural language question answering are recognized to be suitable for querying knowledge bases. In this paper, we propose a method for translating natural language questions into SPARQL queries. We use Natural Language Processing tools, semantic resources and RDF triple descriptions. We designed a four-step method which linguistically and semantically annotates questions, performs an abstraction of these questions, then builds a representation of the SPARQL queries, and finally generates the queries. The method is designed on 50 questions over three biomedical knowledge bases used in task 2 of the QALD-4 challenge framework and evaluated on 27 new questions. It achieves good performance with 0.78 F-measure on the test set. The method for translating questions into SPARQL queries is implemented as a Perl module and is available at http://search.cpan.org/~thhamon/RDF-NLP-SPARQLQuery/.

Keywords

Natural Language Processing, SPARQL, biomedical domain, semantic resources

1. Introduction

Recent and intensive research in the biomedical area has made it possible to accumulate and disseminate biomedical knowledge through various knowledge bases (KBs) increasingly available on the Web. Such life-science KBs usually focus on a specific type of biomedical information: clinical studies in ClinicalTrials.gov (http://clinicaltrials.gov/), drugs and their side effects in Sider [16], chemical, pharmacological and target information on drugs in DrugBank [34], etc.

Nowadays, creating connections between these KBs is crucial for obtaining a more global and comprehensive view on the links between different biomedical components. Such links are also required for inducing and producing new knowledge from the already available data. There is a great endeavour in the definition of Open Linked Data to connect such knowledge and in taking advantage of the SPARQL language to query multiple KBs jointly. In particular, the creation of fine-grained links between the existing KBs related to drugs is a great challenge that is being addressed, for instance, by the project Linked Open Drug Data (LODD, http://www.w3.org/wiki/HCLSIG/LODD).
The knowledge recorded in the KBs and dataset interlinks are represented as RDF triples, on the basis of which Linked Data can then be queried through a SPARQL endpoint. However, typical users of this knowledge, such as physicians, life-science researchers or even patients, can manage neither the syntactic and semantic requirements of the SPARQL language, nor the structure of various KBs. This situation impedes an efficient use of KBs and retrieval of useful information [14]. It has been observed that it is important to design friendly interfaces that manage the technical and semantic complexity of the task and provide simple approaches for querying KBs [12]. The main challenge is then to design optimal methodologies for an easy and reproducible rewriting of natural language questions into SPARQL queries. In the remainder of this work, the term question means the natural language expressions uttered by human users to formulate their information need, while the term query designates the same expression formalised with the SPARQL syntax and semantics. Terms refer to the single- or multi-word linguistic entities which are extracted by the automatic term extractor, while semantic entities refer to the entities provided by the semantic resources (e.g. DrugBank). Unlike terms, semantic entities are assigned to semantic types (e.g. sideEffect, drug, foodInteraction).

We start with the presentation of some related work.
2. Related work

Querying Linked Data requires defining end-user interfaces which hide the underlying structure of the KB as well as the SPARQL syntax. Usually, three ways are identified for querying Linked Data: knowledge-based specific interfaces, Graphical Query Builders and question answering systems. However, it has been demonstrated that natural language interfaces are the most suitable [12]: indeed, for querying KBs and Semantic Web data, the use of full and standard sentences is preferred to the use of keywords, menus or graphs.

Another important distinction is related to the types of Linked Data which are processed, i.e. typically, general [5,15,32] or specialized [1] KBs, and the purpose of this processing. Concerning the purpose, two kinds of work can be distinguished: (1) transformation of natural language questions into SPARQL queries; (2) transformation of natural language questions into SPARQL queries and questioning KBs. We first present works that address only the transformation of natural language questions into SPARQL queries without generating and evaluating the URIs which should be returned (Section 2.1). We then present those works that go beyond and, in addition, query KBs and evaluate the obtained answers (Section 2.2). Finally, we also present other related research questions addressed in the work (Section 2.3).

2.1. Transformation of questions into queries

The main objective of this kind of work is to propose methods for a more efficient transformation of natural language questions into SPARQL queries. The evaluation of these approaches focuses on the syntactic correctness of the generated queries, but does not take into account the URIs which should be found through these queries. Most of the existing approaches rely on patterns or templates.

The question answering system AutoSPARQL is based on active supervised machine learning independent of the KB [18]. The SPARQL query model is learnt from natural language questions. The authors report that the 50 questions of the QALD-1 challenge are successfully transformed with the system. Another method is based on modular patterns for parsing the questions [26]. Semantic relations are identified on the basis of the first keywords detected. The method is tested on 160 movie-related questions. One advantage is that the method requires only four general and modular query patterns, while in a previous work of the authors, twelve patterns were necessary [25].

The use of resources automatically derived from ontologies or KBs also provides the possibility to transform questions into formal queries, such as those defined using the SeRQL language [31]: questions undergo a set of treatments (e.g. linguistic analysis, string similarity computation). Applied to a set of 22 questions, the method can interpret and transform correctly 15 questions (68%).

Existing tools can also be utilized. For instance, a multilingual toolkit, called Grammatical Framework, available for 36 languages [27] has been used for the transformation of questions into queries [6]. Correspondences between linguistic units and SPARQL elements are established: common noun (kind), noun phrase (entity), verb phrase (property) and verb phrase with a higher arity (relation). The evaluation is performed on seven languages. The results indicate that up to 112 basic query patterns are to be used and can be combined with several logical operators.

A manually written grammar together with ontological knowledge allows the processing of 145 questions out of 164 (88% coverage) [9].

Finally, [1] aims at translating natural language questions into SPARQL queries. The proposed method relies on a hybrid approach: an SVM machine learning-based approach, which is used to extract the characteristics of the questions (named entities, relations), is combined with patterns to generate the SPARQL queries. The method is applied to medical questions taken from a journal. The evaluation is carried out on 100 questions. The method achieves a precision of 0.62.

Queries generated by such systems can be further used for querying linked KBs, provided that the reference data, i.e. the expected resulting URIs, are available.

2.2. Query generation and querying Linked Data

The main objective of the following related work is more complex than that of the work presented in Section 2.1: first, the system has to transform the natural language questions into SPARQL queries; second, it has to query the KB in order to get the best possible results from the Linked Data. The main advantage of this kind of work is that it covers the entire querying process. Moreover, it makes it possible to evaluate the final results (answers extracted from the KBs) and to provide precise evaluation figures. Often, NLP tools and methods are used for the transformation of questions into queries. We can mention three such experiments.

In one study, the system is template-based and relies on NLP tools and semantic resources [32]. The application of the system on 50 questions from DBpedia proposed by the QALD-2 challenge gives competitive results with an average of 0.62 F-measure obtained with 39 questions (the average recall is 0.63 and the average precision is 0.61), but shows a low coverage because 11 questions are not covered by the templates.

Notice that recently, the Question Answering over Linked Data (QALD-4) challenge proposed the task Biomedical question answering over interlinked data (http://sc.cit-ec.uni-bielefeld.de/qald/), dedicated to the retrieval of precise biomedical information from linked KBs with questions expressed in natural language. The other participant considered the task as a controlled natural language problem which is addressed with a specifically designed Grammatical Framework grammar [21]. It relies on an extensive manual definition of the grammar.
2.3. Other related research questions

Other research questions can be related to the querying of linked data with natural language interfaces. Usually this kind of work aims at improving specific points: identification of different types of SPARQL queries (select, construct, ask, describe) [19], detection of named entities [28], generation of SPARQL templates (http://www.lodqa.org/docs/references/) [13], classification of semantic correspondences between question units and query elements [10], design of a SPARQL solver based on constraint programming to query RDF documents [17], or processing of complex queries and their decomposition into sub-queries [23]. Finally, an approach based on a knowledge-driven disambiguation of questions and on a coloured activation of the query graph [20] has been tested on 100 questions from the QALD-3 dataset. According to tested settings, the system F-measure varies from 0.4 (QALD-3 dataset) to 0.6 with the entity search, and up to 0.8 with a boolean setting. Among the difficulties observed, the authors notice errors due to the relation interpretation, missing lexical knowledge, parsing of complex questions and remaining difficulties with ambiguities.

Fig. 1.
Global workflow of the system. Square boxes represent processing steps (they are detailed in Figs 2 to 7). Rounded boxes describe the resources used for the processing of questions and queries.
3. Objectives

The objective of our work is to propose an end-to-end method for translating natural language questions into SPARQL queries and for querying KBs. The method is based on the use of Natural Language Processing (NLP) tools and resources for enriching questions with linguistic and semantic information. Questions are then translated into SPARQL with a rule-based approach.

Our method goes further than those presented in Section 2.2. Indeed, we propose to use information available in the Linked Data resources to semantically annotate the questions and to define frames (i.e., linguistic representations of the RDF schema) in order to model and build SPARQL queries. Thus, in comparison with the closest work [32], our method makes extensive use of NLP for processing Linked Data. Besides, as our method performs an end-to-end processing, our work is also related to several aspects presented in Section 2.3: identification of different types of SPARQL queries [19], detection of named entities [28], generation of SPARQL templates [13], etc.

The paper is structured as follows. We describe the proposed method in Section 4 and then the semantic resources available and developed for enriching the questions in Section 5. The evaluation of the method is presented in Section 6 and we finally discuss our results in Section 8.

4. Question translation into SPARQL query

Fig. 2.

Linguistic and semantic annotation process. Square boxes represent the steps of linguistic and semantic analysis of questions. Rounded boxes indicate the resources used for the semantic annotation.

To translate natural language questions into SPARQL queries, we design a four-step rule-based approach that relies on NLP methods, semantic resources and RDF triple descriptions (see Fig. 1):

Natural language questions are annotated with linguistic and semantic information (Section 4.1). This step aims at associating a linguistic and semantic description with words and terms constituting the questions.

Linguistic and semantic information is used for abstracting questions (Section 4.2). This step aims at identifying relevant elements within questions and at building a representation of these elements.

The abstracted questions are used for constructing the corresponding SPARQL query representations (Section 4.3). This step joins together previously identified elements and defines a structure representing the SPARQL graph pattern.

This graph pattern representation is used to generate each SPARQL query as a string (Section 4.4).

We design our approach on 50 questions proposed by task 2, Biomedical question answering over interlinked data, of the QALD-4 challenge, and evaluate it on 27 newly defined questions. A sample of the new test set is given in Section 6.1. We work with three KBs: DrugBank, Diseasome, and Sider, described in Section 5. To illustrate the different steps of our approach, we use the following questions:

(a) What is the side effects of drugs used for Tuberculosis?

(b) Which approved drugs interact with fibers?

(c) List drugs that lead to strokes and arthrosis.

(d) Give me drugs in the gaseous state.

(e) Which drugs have no side-effects?

(f) Which is the least common chromosome location?

(g) Which foods does allopurinol interact with?

Note that the source questions are kept as provided despite the misspelling (first question for instance).

Fig. 3.

Examples of the linguistic annotation of questions issued from the QALD-4 challenge dataset. The source questions are kept as provided despite the misspellings they may contain (question (a) for instance). Gray rounded boxes represent words. Subscript text indicates Part-of-Speech tags computed by TreeTagger [29].

4.1. Linguistic and semantic annotation of questions

The annotation step aims at associating a linguistic and semantic description with words and terms from the questions (see Fig. 2).

First, the linguistic annotation aims at parsing questions in order to identify numerical values (such as numbers and solubility values) and words. During this step, part-of-speech tags and lemmas are associated with words. To achieve that, we use the TreeTagger POS-tagger [29]. Figure 3 illustrates the obtained linguistic annotation of questions. POS tagging errors are intentionally kept in the examples (e.g. List tagged as noun instead of verb in Fig. 3(c)).

Fig. 4.

Examples of the question pre-processing. Gray rounded boxes represent words and semantic entities. Bracketed subscript texts are semantic types associated with semantic entities.

The objective of the semantic annotation is to identify semantic entities, i.e. terms together with the associated semantic types representing their meaning. Figure 4 displays the obtained semantic annotation of the example questions. This step relies on semantic resources, such as DrugBank, Sider and Diseasome (see Section 5), used in order to recognize semantic entities, such as disease names and side effects. The semantic entity recognition is based on the TermTagger Perl module (http://search.cpan.org/~thhamon/Alvis-TermTagger).

For instance, in Fig. 4(a), Tuberculosis is recognized as an entity with two concurrent semantic types: diseasome/disease/1154 and sider/side-effects/C0041296.
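The dictionary-based recognition described above can be pictured with a minimal sketch. It is not the TermTagger implementation: the lexicon, the function name `annotate` and the greedy longest-match strategy are assumptions made for illustration, and only the Tuberculosis types are taken from the example above.

```python
import re

# Toy lexicon (hypothetical): surface form -> semantic types.
# Only the two types of "tuberculosis" come from the example in the text.
LEXICON = {
    "tuberculosis": ["diseasome/disease/1154", "sider/side-effects/C0041296"],
    "side effects": ["sider/side-effects"],
    "drugs": ["drugbank/drugs"],
}

def annotate(question, lexicon=LEXICON, max_len=4):
    """Return (start, end, text, types) tuples using greedy longest match."""
    tokens = re.findall(r"\w+", question.lower())
    entities = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in lexicon:
                entities.append((i, i + n, candidate, lexicon[candidate]))
                i += n
                break
        else:
            i += 1  # no entity starts at this token
    return entities

entities = annotate("What is the side effects of drugs used for Tuberculosis?")
```

An entity may carry several concurrent semantic types, as Tuberculosis does here; choosing between them is left to the later disambiguation steps.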

Fig. 5.

Question abstraction process. Square boxes represent the abstraction steps of questions. Rounded boxes indicate the resources used.

However, because semantic resources often suffer from low coverage [3,22], we also extract terms which usually correspond to noun phrases relevant for the targeted domain from the questions in order to improve the coverage of our approach. For instance, the terms side effects of drugs (Fig. 4(a)) and fibers (Fig. 4(b)) are extracted while none of them is provided by the semantic resources. The term extractor (http://search.cpan.org/~thhamon/Lingua-YaTeA/) [2] is used for this task. It performs shallow parsing of the POS-tagged and lemmatized text by chunking it according to syntactic frontiers (pronouns, conjugated verbs, typographic marks, etc.) in order to identify noun phrases. Then, parsing patterns are recursively applied and provide parsed terminological entities. These parsing patterns have been manually defined in a previous work [2] and are available in the configuration files of the Lingua-YaTeA Perl module. They take into account the morpho-syntactic variation and represent basic syntactic dependencies within terminological entities. Each term is represented in a syntactic tree, while its sub-terms are also considered as terms in the current configuration (e.g. side effects of drugs gives side effects and drugs in Fig. 4(a)). No semantic types are associated with the extracted terms.

The processing of questions also requires identifying expressions of negation (e.g. no) and quantification (e.g. number of, least of). Words expressing negation (e.g. no in Fig. 4(e)) are identified through regular expressions provided by the NegEx resource [4]. Then, their scope is computed to detect terms that are negated within questions. For performing the task, we use the NegEx algorithm (http://search.cpan.org/~osler/Lingua-NegEx/). We also collect and identify quantification expressions in the questions processed.
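To make the notion of negation scope concrete, here is a simplified sketch. It is not the NegEx algorithm: the trigger list, the terminator list, the fixed forward window and the function name `negated_terms` are all assumptions chosen for illustration.

```python
import re

# Tiny illustrative subsets; NegEx ships far richer trigger lists.
NEGATION_TRIGGERS = {"no", "not", "without"}
SCOPE_TERMINATORS = {"but", "however", "although"}

def negated_terms(question, terms, window=5):
    """Return the terms found within `window` tokens after a negation trigger."""
    tokens = re.findall(r"[\w-]+", question.lower())
    negated = set()
    for i, token in enumerate(tokens):
        if token not in NEGATION_TRIGGERS:
            continue
        scope = []
        for following in tokens[i + 1:i + 1 + window]:
            if following in SCOPE_TERMINATORS:
                break  # the scope stops at a terminator
            scope.append(following)
        scope_text = " ".join(scope)
        for term in terms:
            if re.search(r"\b" + re.escape(term.lower()) + r"\b", scope_text):
                negated.add(term)
    return negated

negated = negated_terms("Which drugs have no side-effects?", ["drugs", "side-effects"])
```

In this sketch only terms to the right of the trigger can be negated, which is enough for question (e) but is a deliberate simplification.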

4.2. Question abstraction

The question abstraction step aims at identifying relevant elements within questions and at building a representation of these elements (see Fig. 5). It relies on linguistic and semantic annotations associated with the question words detected at the previous step.

Before the identification of relevant elements, annotations are post-processed in order to disambiguate the generated semantic annotations. Indeed, annotated semantic entities may receive conflicting, concurrent or erroneous semantic types. For instance, in Fig. 4(g), allopurinol received two similar semantic types (one from DrugBank (drugbank/gen/DB00437) and one from Sider (sider/drugname/83786)), in Fig. 4(a), Tuberculosis is tagged as disease (diseasome/disease/1154) and side-effect (sider/side-effects/C0041296), while in Fig. 4(c), lead is erroneously tagged as drug (drugbank/drug/7191). The post-processing first aims at selecting those entities and semantic types that may be useful for the next steps. Therefore, semantic entities like lead are removed if they are part of larger entities (lead to). As part of this post-processing, larger terms which do not include other semantic entities are kept in order to increase the coverage of our approach. For instance, in Fig. 4(a), the terms side effects of drugs and effects of drugs are removed because two components (side effects and drugs) are semantically tagged, while in Fig. 4(b), the term fibers is kept.
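The two filtering rules of this post-processing can be sketched as follows. The token-span representation and the function names are assumptions made for the example; the actual module may represent annotations differently.

```python
def covers(outer, inner):
    """True if the (start, end) span of `outer` fully contains that of `inner`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]

def length(span):
    return span[1] - span[0]

def postprocess(entities, terms):
    """Filter annotations given as (start, end, text) token spans.

    Rule 1: drop a semantic entity nested in a strictly larger entity.
    Rule 2: drop an extracted term that includes a kept semantic entity
            (a term with exactly the same span as an entity is redundant too).
    """
    kept_entities = [
        e for e in entities
        if not any(length(o) > length(e) and covers(o, e) for o in entities)
    ]
    kept_terms = [
        t for t in terms
        if not any(covers(t, e) for e in kept_entities)
    ]
    return kept_entities, kept_terms

# Question (c): "lead" is nested in the larger entity "lead to".
entities_c = [(3, 4, "lead"), (3, 5, "lead to"), (5, 6, "strokes"), (7, 8, "arthrosis")]
kept_c, _ = postprocess(entities_c, [])
```

Applied to question (a), the same function discards the terms side effects of drugs and effects of drugs, while the term fibers of question (b) survives because it contains no semantic entity.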

Besides, in order to choose the correct predicate it may be necessary to consider semantic types in the context of words or phrases corresponding to the predicates. For instance, the phrase interact with is ambiguous and may correspond to two predicates:

foodInteraction when its context contains a semantic entity with the type food (Fig. 4(g)),

InteractionDrug1 when its context contains a semantic entity with the type drug (Fig. 4(b)).
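A contextual rule of this kind can be sketched as an ordered lookup. The rule table, its ordering (food checked before drug, since a question about food interactions usually mentions a drug as well) and the function name `choose_predicate` are assumptions for illustration, not the format of the actual rule file.

```python
# Hypothetical rule table: phrase -> ordered (required context type, predicate).
CONTEXT_RULES = {
    "interact with": [
        ("food", "foodInteraction"),
        ("drug", "InteractionDrug1"),
    ],
}

def choose_predicate(phrase, context_types, rules=CONTEXT_RULES):
    """Return the first predicate whose required type occurs in the context."""
    for required_type, predicate in rules.get(phrase, []):
        if required_type in context_types:
            return predicate
    return None  # the phrase stays ambiguous or is not a known predicate

# Question (g) mentions both a food and a drug (allopurinol).
predicate_g = choose_predicate("interact with", {"food", "drug"})
```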

In that respect, we manually analyze the predicate names in order to define rewriting rules and to adjust (modify or delete) semantic types associated with a given entity according to its context. Other rules may also modify or delete the entity itself. In total, we defined 44 contextual rewriting rules (available at http://cpansearch.perl.org/src/THHAMON/RDF-NLP-SPARQLQuery-0.1/etc/nlquestion/SemanticTypeCorresp.rc) based on the 112 predicate names and on the documentation from the exploited KBs, mainly DrugBank (http://www.drugbank.ca/documentation). Additionally, the vocabulary from the 25 questions of the QALD-4 training set (see Section 6.1) was used. Notice that supplementary disambiguation of the annotations is also performed during the query construction step when arguments of the predicate or the question topic share the same semantics and are connected.

We define the question topic as the type of semantic entity which is the major context of the question, characterizing the user interest [8]. For instance, in Fig. 4(a), the question topic is sideEffect.

For performing question abstraction, we identify information related to the query structure:

Definition of the result form: Negated terms and information related to coordination markers, aggregation operators, and requirements on specific result forms are recorded and will be used at the end of the query construction step or during the query generation step. Questions are scanned for identifying negated terms but also for identifying aggregation operation on the results, e.g. number for count, mean for avg or higher for max, and specific result forms such as Boolean queries (ASK). Thus, in Fig. 6(f), the result form is the aggregation operation min applied to the object of the predicate chromosomeLocation, while the result form of other questions is SELECT. Also, presence of the negated semantic entity side-effects in Fig. 6(e) and of the coordination and in Fig. 6(c) are recorded.

Identification of the question topic: We assume that the first semantic entity occurring in the sentence with a given expected semantic type corresponds to the question topic. The expected semantic types are those provided by the RDF subjects and objects issued from the resources. This information will be used during the query construction step. As an illustration, the question topic is identified as sideEffect in Fig. 6(a) and chromosomeLocation in Fig. 6(f).

Fig. 6.
Examples of the question abstraction. The left part of the sub-figures displays the graph representation of the identified frames: gray boxes represent subjects and objects of the predicates together with their semantic types, while edges represent predicates and their semantic types. The right part of the sub-figures represents semantic entities and terms identified in questions together with the associated information: question topic (QT), URI, etc.

Identification of predicates and arguments: We use linguistic representations of RDF schemas, i.e. frames which contain one predicate and at least two elements with associated semantic types. In that respect, potential predicates, subjects and objects of frames are identified among the semantic entities and then recorded in a table: entries are semantic types of the elements and refer to linguistic, semantic and SPARQL information associated with these elements. Subjects and objects are fully described in the table with the inflected and lemmatized forms of words or terms, the corresponding SPARQL types and indicators on their use as object or subject of a given predicate. Concerning the predicates, only the semantic types of their arguments are instantiated. Subjects and objects can be URIs, RDF typed literals (numerical values or strings) and extracted terms (these are considered as elements of regular expressions).

For instance, in Fig. 6(a), two predicates are identified: sideEffect and possibleDrug. Given the RDF schemas of these predicates, the expected arguments of the former are sider/drugs and sider/side-effects, while the arguments of the latter are diseasome/diseases and drugbank/drugs. In Fig. 6(d), the predicate state as well as the expected arguments drugbank/drugs and Gas/String are recognized.

Scope of coordination: Arguments and predicates in the neighbourhood of coordination are identified. These elements are recorded as coordinated, e.g. the semantic entities strokes and arthrosis are related by the coordination and in Fig. 6(c).

Figure 6 presents graph representations and abstractions of the questions presented in Fig. 4.
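Two of the abstraction decisions listed above, the result form and the question topic, can be sketched in a few lines. The cue words for count, avg and max come from the text; mapping least to min is inferred from question (f), and the list of ASK starters as well as the function names are assumptions.

```python
import re

# number/mean/higher are the cues cited in the text; "least" is inferred
# from question (f); ASK_STARTERS is a hypothetical list.
AGGREGATION_CUES = {"number": "COUNT", "mean": "AVG", "higher": "MAX", "least": "MIN"}
ASK_STARTERS = {"is", "are", "does", "do"}

def result_form(question):
    """Return (query form, aggregation operator or None)."""
    tokens = re.findall(r"[\w-]+", question.lower())
    aggregate = next((AGGREGATION_CUES[t] for t in tokens if t in AGGREGATION_CUES), None)
    form = "ASK" if tokens and tokens[0] in ASK_STARTERS else "SELECT"
    return form, aggregate

def question_topic(entities, expected_types):
    """Return the type of the first entity whose type is an expected one.

    `entities` is a list of (text, semantic type) pairs in order of occurrence.
    """
    for _text, sem_type in entities:
        if sem_type in expected_types:
            return sem_type
    return None

form_f = result_form("Which is the least common chromosome location?")
topic_a = question_topic(
    [("side effects", "sideEffect"), ("drugs", "drug"), ("Tuberculosis", "disease")],
    {"sideEffect", "drug", "disease"},
)
```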
4.3. Query construction

Fig. 7.

Query construction process. Square boxes represent construction steps of queries. Rounded boxes indicate the resource used.

Fig. 8.

Examples of the query construction. Graphs represent the constructed queries and consequently, the SPARQL graph patterns. Gray boxes represent subjects and objects of the predicates instantiated by the semantic entities and terms. Variables (?v0, etc.) associated with the predicate arguments are also displayed.

The objective of the query construction step is to join previously identified elements together, and to build a representation of the SPARQL graph pattern (introduced by the keyword WHERE). Figure 7 presents the workflow of the query construction process and Fig. 8 illustrates the construction of the queries corresponding to the example questions.

Thus, the predicate arguments are instantiated by URIs associated with the subjects, objects, variables, and numerical values or strings. For each question, we perform several associations:

The question topic is associated with one predicate argument and this is represented through a variable. Hence, this variable is associated with two elements: the question topic and one of the predicate arguments that matches the semantic type of this question topic. Notice that it is not necessary to associate all the predicate arguments that have the same semantic type with the question topic for now. Moreover, at the end of this step, the question topic may remain unassociated with any predicate. In Fig. 8(a), the ?v0 variable represents the association between the question topic and the object (with the expected type sider/side-effects) of the sideEffect predicate. In Fig. 8(b), only the subject drugbank/drugs of the first drugType predicate is associated with the node corresponding to the question topic while the remaining association (subject drugbank/drugs of the foodInteraction predicate) will be processed during the next step. Performing this association at the beginning of the query construction is helpful for removing some ambiguities. For instance, in Fig. 8(a), the connection of the object of the sideEffect predicate prevents further use of the semantic type sider/side-effects/C0041296 of Tuberculosis (which was identified during the previous step, as illustrated in Fig. 6(a)).

The predicate arguments are associated with semantic entities identified during the question abstraction, as they concern elements referring to URIs. Moreover, each predicate with arguments in the coordination scope is duplicated and arguments are also associated with semantic entities, if needed. Thus, in Fig. 8(a), Tuberculosis with the semantic type diseasome/disease/1154 is associated with the subject of the possibleDrug predicate, in Fig. 8(b), approved is associated with the object of the drugType predicate, and in Fig. 8(d), the semantic entity gaseous is associated with the object of the state predicate. In Fig. 8(c), since the semantic entities strokes and arthrosis are coordinated, the sideEffect predicate is duplicated and each of these two semantic entities is associated with one of the predicate instances. Note that, in this example, strokes is also implicitly disambiguated and not further considered as a disease.

The predicates are associated with each other through their subjects and objects, and the association is then represented by a variable. For example, in Fig. 8(b), the subjects of the drugType and foodInteraction predicates are related. Hence, both are associated with the variable ?v0, which is already referred to as the question topic.

Predicates from different datasets are joined together. We use the sameAs description to identify URIs referring to the same element. New variables are defined in order to relate two predicates. This kind of association occurs in the example in Fig. 8(a): The subject of the sideEffect predicate and the object of the possibleDrug predicate are related through the sameAs predicate linking semantic entities sider/drugs and drugbank/drugs, respectively identified by the variables ?v1 and ?v2.

The remaining question topics are associated with arguments of the sameAs predicate. The above examples do not require such an association.

The arguments corresponding to the STRING type are associated with the extracted terms. These arguments are related to the string matching operator REGEX. Thus, the terms are considered as string expressions. This is the case of the term fibers in Fig. 8(b) which will be represented as a regular expression in the next step.

At this point, the predicate arguments which remain unassociated are replaced by new variables in order to avoid empty literals.

Finally, the negation operators are processed: Predicates are marked as negated, while the arguments corresponding to negated terms are included in the new rdf:type predicate, if required. Thus, in Fig. 8(e), the object of the sideEffect predicate is negatively associated with the variable ?v1 and the rdf:type predicate with the object sider/drugs is added in the representation of the SPARQL graph pattern.

At this stage, each question is fully translated into a representation of the SPARQL query.
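The associations described in this section can be sketched on question (a). This is a heavily simplified illustration under stated assumptions: frames are plain (subject type, predicate, object type) tuples, the interlink table `SAME_AS_TYPES`, the `bound` mapping and the URI placeholder are hypothetical, and coordination, negation, REGEX arguments and same-type joins are left out.

```python
from itertools import combinations, count

# Hypothetical interlink table: pairs of types related through sameAs.
SAME_AS_TYPES = {frozenset({"sider/drugs", "drugbank/drugs"})}

def build_graph_pattern(frames, topic_type, bound):
    """Build triples from frames given as (subject_type, predicate, object_type).

    `bound` maps a semantic type to the URI of an entity found in the question.
    """
    fresh = (f"?v{i}" for i in count())
    topic_var = next(fresh)          # ?v0 is reserved for the question topic
    topic_used = False
    nodes = []                       # [subject_node, object_node] per frame
    for subject_type, _predicate, object_type in frames:
        pair = []
        for sem_type in (subject_type, object_type):
            if sem_type == topic_type and not topic_used:
                pair.append(topic_var)
                topic_used = True
            elif sem_type in bound:
                pair.append(bound[sem_type])
            else:
                pair.append(next(fresh))   # unassociated argument
        nodes.append(pair)
    triples = [(s, frame[1], o) for (s, o), frame in zip(nodes, frames)]
    # Join predicates from different datasets through sameAs.
    for i, j in combinations(range(len(frames)), 2):
        for a in (0, 1):
            for b in (0, 1):
                types = frozenset({frames[i][2 * a], frames[j][2 * b]})
                if types in SAME_AS_TYPES:
                    triples.append((nodes[i][a], "sameAs", nodes[j][b]))
    return triples

frames_a = [
    ("sider/drugs", "sideEffect", "sider/side-effects"),
    ("diseasome/diseases", "possibleDrug", "drugbank/drugs"),
]
pattern_a = build_graph_pattern(
    frames_a,
    topic_type="sider/side-effects",
    bound={"diseasome/diseases": "<diseasome/disease/1154>"},
)
```

With these inputs the sketch yields the same variable layout as the description of Fig. 8(a): ?v0 for the question topic, and ?v1 and ?v2 for the two drug arguments linked by sameAs.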

4.4. Query generation

The SPARQL query representation built during the query construction step is used to generate the SPARQL query string. Figure 9 illustrates the generated queries which correspond to the example questions.

The query generation process is composed of two parts:

The generation of the result form which takes into account the expected type of the result form (e.g. ASK or SELECT), the presence of aggregation operators and the variable associated with the question topic. For all the example questions, the result form is SELECT. Even if an aggregation operator is expected in Fig. 9(f), it requires the GROUP BY clause which will be processed during the next stage.

The generation of the graph pattern. This part consists of the generation of strings for representing each RDF triple and the filters if the predicates are negated terms. This is the case of the examples from Figs 9(a)–(e) and 9(g). Also, in Fig. 9(d), the state predicate is replaced by the corresponding URI and its object is replaced by the string Gas. When aggregation operators are used, it is also necessary to recursively generate filters and sub-queries for computing the subsets of expressions, before their aggregation. Thus, in Fig. 9(f), two sub-queries are generated for counting the number of molecules per chromosome (given by the chromosomeLocation predicate) and the corresponding minimum number. Then, a filter is defined for the selection of the molecules per chromosome that have the minimum number.

The SPARQL queries are then submitted to a SPARQL endpoint (for our experiments, we use the SPARQL endpoint provided by the QALD-4 challenge) and answers are collected for the evaluation.

Fig. 9.
Examples of the query generation.

5. Description of the semantic resources

The method described above relies on (1) the existing biomedical resources that provide information on semantic entities (Section 5.1), and (2) additional resources specifically collected and built to support the method (Section 5.2).

5.1. Domain-specific resources

To process the set of questions, we used the following three biomedical resources:

The DrugBank KB (http://www.drugbank.ca) is dedicated to drugs [34]. It merges chemical, pharmacological and pharmaceutical information from other available KBs. We exploit the documentation of this resource (http://www.drugbank.ca/documentation) to define the rewriting rules and regular expressions for the named entity recognition.

Diseasome (http://diseasome.eu) is dedicated to diseases and genes linked by known disorder/gene associations [11]. It provides a single framework with all known phenotypes and disease/gene associations, indicating the common genetic origin of many diseases. We exploit the RDF triples and the documentation of the resource to define the rewriting rules.

Fig. 10.
Example of natural language questions from the new test set.

Sider (http://sideeffects.embl.de) is dedicated to adverse effects of drugs [16]. It contains information on marketed medicines and their recorded adverse drug reactions. Information is extracted from public documents and package inserts. Information available in Sider includes side effect frequency, drug and side effect classifications, as well as links to other data, such as drug/target relations. We use the documentation and the RDF triples of this KB.

The content of each resource is provided in a specific format: RDF triples of the form subject predicate object. In that respect, we also exploit the RDF schemas of these resources to define the frames (see Section 5.2).

5.2. Additional resources for question annotation

On the basis of the RDF triples, frames are built from the RDF schemas in which the RDF predicate is the frame predicate, and subject and object of the RDF triples are the frame elements. This also includes the OWL sameAs triples. Several types of frame entities are isolated:

As indicated, subject, object and predicate become semantic entities. At least one of them must occur in questions. This way, the frames are the main resources for rewriting questions into queries;

The vocabulary specific to questions is also built. It covers for instance the aggregation operators and the types of questions;

RDF literals, issued from the named entity recognizer or the term extractor, complete the resources. The RDF literals are detected with specifically designed automata that may rely on the source KB documentation.

All these entities are associated with the expected semantic types which allow creating the queries and rewriting the RDF triples into SPARQL queries. In that respect, we can process several types of data (URIs, strings, common datatypes or regular expressions) when literals are expected.

Most of the entities are considered and processed through their semantic types, although some ambiguous entities (e.g., interaction or class) are considered atomically. For these, the rewriting rules are applied contextually to generate the semantic entities corresponding to the frames (see Section 4.2). When using the queries, the semantic types become variables and are used for connecting the query edges.
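The construction of frames from schema-level RDF triples can be sketched as follows. This is only an illustration under stated assumptions: the function names, the tuple layout and the semantic type strings (such as `sider/drugs`) are hypothetical and do not reproduce the resources distributed with the Perl module.

```python
from collections import defaultdict

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def build_frames(schema_triples):
    """Group schema triples into frames keyed by predicate.

    Each triple is (subject_type, predicate, object_type): the predicate
    becomes the frame predicate, and the subject and object types become
    the frame elements.
    """
    frames = defaultdict(set)
    for subject_type, predicate, object_type in schema_triples:
        frames[predicate].add((subject_type, object_type))
    return dict(frames)


def frames_for_entity(frames, semantic_type):
    """Return the predicates whose frames involve the given semantic type."""
    return sorted(
        predicate
        for predicate, elements in frames.items()
        if any(semantic_type in pair for pair in elements)
    )


# Hypothetical schema triples, including an OWL sameAs link between two KBs.
schema = [
    ("sider/drugs", "sider/sideEffect", "sider/side_effects"),
    ("drugbank/drugs", "drugbank/target", "drugbank/targets"),
    ("sider/drugs", OWL_SAME_AS, "drugbank/drugs"),
]
frames = build_frames(schema)
print(frames_for_entity(frames, "sider/drugs"))
```

Looking up the frames that involve a semantic type recognized in a question is one simple way to find candidate predicates for the query; the actual selection in our method also depends on the rewriting rules.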

6. Experiments and results

6.1. Training and test question set

Our training set gathers the 50 questions from the training and test sets of the QALD-4 challenge. Separately, they show unbalanced complexity but taken together they provide a balanced training set. The evaluation is performed on 27 new questions. Questions from this new test set are similar to the QALD-4 questions but may differ in the semantic entities or predicates they involve. Our method is applied to this new test set without additional adaptations. Figure 10 presents a sample of questions from the new test set available at the following URL: http://perso.limsi.fr/hamon/Files/QALD/qald-4_biomedical_additional_test.xml.

6.2. Evaluation metrics

The generated SPARQL queries are evaluated through the answers they generate. The evaluation is performed with the following macro-measures [30]:

$$M_{precision} = \frac{\sum_{i = 1}^{| q |} \frac{TP (q_i)}{TP (q_i) + FP (q_i)}}{| q |} \qquad (1)$$

$$M_{recall} = \frac{\sum_{i = 1}^{| q |} \frac{TP (q_i)}{TP (q_i) + FN (q_i)}}{| q |} \qquad (2)$$

$$M_{F\text{-}measure} = \frac{2 \times M_{precision} \times M_{recall}}{M_{precision} + M_{recall}} \qquad (3)$$

where $TP (q_i)$ are the correct answers, $FP (q_i)$ are the wrong answers and $FN (q_i)$ are the missing answers for the question $q_i$, and $| q |$ is the number of questions.

Fig. 11.

Performance per sub-step for 50 questions from the training set.

Through the use of macro-measures, all the questions are given equal weight, independently of the number of expected answers for a given SPARQL query.
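The macro-measures can be computed with a few lines of Python. The sketch below is an illustration rather than the evaluation script used in our experiments; the function name and the convention of scoring a question as 0 when a denominator is empty are assumptions.

```python
def macro_scores(per_question):
    """Compute macro precision, recall and F-measure.

    per_question: list of (tp, fp, fn) answer counts, one tuple per question.
    Assumption: a question with an empty denominator is scored 0.
    """
    n = len(per_question)
    precisions = []
    recalls = []
    for tp, fp, fn in per_question:
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    m_precision = sum(precisions) / n
    m_recall = sum(recalls) / n
    if m_precision + m_recall == 0:
        return m_precision, m_recall, 0.0
    m_f = 2 * m_precision * m_recall / (m_precision + m_recall)
    return m_precision, m_recall, m_f


# Three hypothetical questions: the first has many expected answers,
# but it weighs no more than the other two in the macro-averages.
counts = [(10, 0, 0), (1, 1, 0), (0, 2, 3)]
p, r, f = macro_scores(counts)
print(round(p, 3), round(r, 3), round(f, 3))
```

Note that the F-measure is derived from the two macro-averages, as in Eq. (3), and not averaged per question. Applying Eq. (3) to a macro-precision of 0.81 and a macro-recall of 0.76 gives 0.78 after rounding.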

6.3. Global results

Table 1 presents the overall results obtained on the training and test sets. On the test set, the macro-F-measure is 0.78 with 0.81 precision and 0.76 recall, while on the training set, the macro-F-measure is 0.86 with 0.84 precision and 0.87 recall.

Table 1
Results obtained with the training and test sets

Query set	Training (50 Q)	Test (27 Q)
Correct Queries	39	20
M-precision	0.84	0.81
M-recall	0.87	0.76
M-F-measure	0.86	0.78

Our method always proposes syntactically correct SPARQL queries for all natural language questions. On the test set, concerning the answers generated over Linked Data, 20 questions provide the exact expected answers, two questions return partial answers, and five questions return erroneous answers. On the training set, 39 SPARQL queries (out of the 50 questions) provide the expected answers, six questions return partial answers, and five questions return no answers.

6.4. System performance

Fig. 12.

System performance for an increasing number of questions (1 to 50) from the training set.

Table 2

Evaluation results of comparable end-to-end systems

System	Ref. data	P	R	F
(Unger et al., 2012) [32]	QALD-2	0.61	0.63	0.62
(Ben Abacha et al., 2012) [1]	Med. Journals	0.62
(Lukovnikov et al., 2014) [20]	QALD-3			0.4–0.8
Our system	QALD-4	0.81	0.76	0.78

Table 3

Evaluation results from the QALD-4 challenge

System	Approach	P	R	F
GFMed [21]	GF grammar	0.99	1	0.99
Our system	NLP, sem. resources	0.87	0.82	0.85
RO_FII (mentioned in [33])	semi-manual	0.16	0.16	0.16

We analyzed the system performance when translating the 50 questions from the training set, on a computer with 4 GB of memory and a 2.7 GHz dual-core CPU. Figure 11 presents the run time for each query according to the pre-processing substeps (named entity recognition, word segmentation, POS tagging, semantic entity tagging, term extraction, and negation scope identification with NegEx) and the question translation into a SPARQL query (Question2SPARQLQuery). Most of the processing time is dedicated to the TermTagger, which aims at recognizing the semantic entities. With the internal Ogmios processing (i.e., mainly the control of inputs and outputs), each question is processed in two seconds on average.

Figure 12 shows the overall system performance according to the number of questions to be processed. On the training set, the run time for processing the whole set of questions differs from the run time for processing a single question by less than two seconds.

7. Discussion

7.1. Findings

Overall, our approach exhibits good results, with an F-measure of 0.78 on the newly created test set. This value is lower than the F-measure obtained on the training set because no additional rules or adjustments were made for processing the questions from the new set. It is noteworthy that this value would have been higher if the sameAs predicate had been correctly described as symmetric. Indeed, we noticed that, for three questions, the generated SPARQL query was correct but the SPARQL endpoint did not return the expected answers of the reference data. By switching the arguments of the sameAs predicate in the queries, we observed that the expected answers were returned. Although the sameAs predicate is symmetric by definition, it appears that the instances of this predicate in the source KBs do not encode the expected symmetry of this relation.
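One possible workaround for the missing symmetric sameAs statements is to generate a pattern that matches the predicate in both directions. The Python sketch below illustrates this idea; it is not part of the evaluated system, the function name is hypothetical, and the `owl:` prefix is assumed to be declared in the query prologue.

```python
def same_as_pattern(left, right):
    """Build a SPARQL group that matches owl:sameAs in either direction.

    Assumption: the query declares
    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    """
    forward = f"{{ {left} owl:sameAs {right} . }}"
    backward = f"{{ {right} owl:sameAs {left} . }}"
    return forward + "\nUNION\n" + backward


print(same_as_pattern("?v0", "?v1"))
```

With endpoints that support SPARQL 1.1 property paths, the same effect could presumably be obtained with a single triple pattern using the path `owl:sameAs|^owl:sameAs`.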

7.2. Comparison with existing work

We compare our work with existing work in two ways, depending on whether the methods or the reference data are comparable. First, we compare our work with the end-to-end systems presented in Section 2.2. In Table 2, we indicate the available evaluation figures for precision, recall and F-measure. We can observe that our system provides competitive results.

We can also compare our system with those that participated in the QALD-4 challenge. In Table 3, we indicate the official results of the challenge: our system ranks second among the three participating systems. As already said, the first system exploits a Grammatical Framework (GF) grammar based on formal syntax, while the third one proposes a semi-manual approach combining automatic POS tagging and manual transformation of questions into queries. The results provided by our system are close to those of the best system of the challenge.

In general, we can observe that the proposed method is competitive by comparison with existing work, and is also portable to new datasets.

Similar work applied to general-language datasets shows less impressive results. For instance, a comparable approach (linguistic annotation, syntactic analysis and scoring of the right answers) applied to comparable material (QALD-3 DBpedia) reaches 0.32 F-measure [7]. We assume that processing data from specialized areas, for which terminologies and semantic resources are available, makes it possible to describe the involved concepts and scenarios in more detail. As a result, the performance of automatic systems can be higher there.

7.3. Error analysis

We performed an analysis of erroneous or partial answers, other than those caused by sameAs. The analysis shows that most of the errors are due to the management of ambiguities in the questions, which has also been noticed in previous work [20]. These errors are mainly related to:

The annotation of semantic entities. For instance, in Which genes are associated with breast cancer?, breast cancer is correctly annotated, while the reference assumes mistakenly that it should concern the semantic entity Breast cancer-1.

The expected meaning of the terms in the questions. Semantic entities mentioned in some questions may refer to specific entities, while in other questions they may refer to general entities. For instance, in What are enzymes of drugs used for anemia?, the semantic entity anemia refers to all types of anemia (Hypercholanemia, Hemolytic anemia, Aplastic anemia, etc.), and not specifically to elements that contain the label anemia.

These two main problems can be solved by using regular expressions in SPARQL graphs rather than URIs. However, we must test the influence of this modification on each query individually.
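The idea of matching labels with a regular expression instead of a fixed URI can be sketched as follows. This Python fragment is only an illustration: the function name is hypothetical, and the use of `rdfs:label` is an assumption, since the predicate that carries labels depends on each KB.

```python
def label_regex_pattern(variable, term, label_variable="?label"):
    """Build a graph pattern that selects entities whose label matches a term.

    Assumptions: labels are given by rdfs:label (prefix declared elsewhere),
    and the term is used as a case-insensitive regular expression.
    """
    # Escape backslashes and double quotes for the SPARQL string literal.
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    return (
        f"{variable} rdfs:label {label_variable} .\n"
        f'FILTER regex(str({label_variable}), "{escaped}", "i")'
    )


print(label_regex_pattern("?v0", "anemia"))
```

Such a pattern would match labels like Hemolytic anemia or Hypercholanemia, but it may also match unintended labels, which is why the effect on each query has to be checked individually.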

Other erroneous answers arise during the question abstraction step, when the question topics are wrongly identified or when the contextual rewriting rules are not applied. Errors may also occur during the query construction step: the method may wrongly connect predicate arguments and semantic entities or, on the contrary, it may not consider all the identified semantic entities. Further investigations have to be carried out to solve these limitations.

Besides, during the design of queries, we had difficulty expressing some constraints in SPARQL. For instance, the question Which approved drugs interact with calcium supplements? requires defining a regular expression with the term calcium supplement, while this term is only mentioned in coordination with other supplements in the exploited KBs (e.g. Do not take calcium, aluminum, magnesium or Iron supplements within 2 hours of taking this medication.). We assume that solving this difficulty requires more sophisticated NLP processing of the textual elements of the RDF triples, such as parsing, named entity and term recognition, and identification of discontinuous terms and term variants.

7.4. Reproducibility of our method

Our method is fully automated, once the rewriting rules have been defined. A key issue related to the reproducibility concerns the evolution of KBs. In case they are updated, it is only required to rebuild the semantic resources used for identifying the semantic entities. Yet, for managing the change of the structure of the KBs, entire frames must be regenerated. This is one direction of our ongoing research work. Moreover, the addition of new resources such as DailyMed (http://dailymed.nlm.nih.gov/) is also related to these two problems.

The method for translating questions into SPARQL queries is implemented as a Perl module and is available at http://search.cpan.org/~thhamon/RDF-NLP-SPARQLQuery/ including the rewriting rules and frames (http://cpansearch.perl.org/src/THHAMON/RDF-NLP-SPARQLQuery-0.1/etc/nlquestion/SemanticTypeCorresp.rc). The new set of 27 questions is also available at the following URL: http://perso.limsi.fr/hamon/Files/QALD/qald-4_biomedical_additional_test_withanswers.xml. Besides, some resources (e.g. expressions of quantification) used in our work are being made available.

8. Conclusion

We proposed a rule-based method to translate natural language questions into SPARQL queries. The method relies on the linguistic and semantic annotation of questions with NLP methods, semantic resources and RDF triple description. We designed our approach on 50 biomedical questions proposed by the QALD-4 challenge, and tested it on 27 newly created questions. The method achieves good performance with an F-measure of 0.78 on the set of 27 questions.

Further work aims at addressing the limitations of our current method including the management of term ambiguity, the question abstraction, and the query construction. Moreover, to avoid the manual definition of the dedicated resources required by our approach (frames, specific vocabulary and rewriting rules), we plan to investigate how to automatically build these dedicated resources from the RDF schemas of the Linked Data set. This perspective will also facilitate the integration of other biomedical resources such as DailyMed or RxNorm [24], and the use of our method in text mining applications.

Acknowledgements

This work was partly funded through the project POMELO (PathOlogies, MEdicaments, aLimentatiOn) funded by the MESHS (Maison Européenne des Sciences de l’Homme et de la Société) under the framework Projets Émergents. We thank Arthur Plesak for his editorial assistance. We are thankful to the reviewers for their useful comments and advice, which helped improve the quality of the manuscript.

References

1. A.B. Abacha and Zweigenbaum, Medical question answering: Translating medical questions into SPARQL queries, in: ACM International Health Informatics Symposium, IHI’12, Miami, FL, USA, January 28–30, 2012, Luo, Liu and C.C. Yang, eds, ACM, New York, NY, USA, 2012, pp. 41–50, doi:10.1145/2110363.2110372.

2. Aubin and Hamon, Improving term extraction with terminological resources, in: Advances in Natural Language Processing, Proc. of the 5th International Conference on NLP, FinTAL 2006, Turku, Finland, August 23–25, 2006, Salakoski, Ginter, Pyysalo and Pahikkala, eds, Lecture Notes in Computer Science, Vol. 4139, Springer, August 2006, pp. 380–387, doi:10.1007/11816508_39.

3. Bodenreider, T.C. Rindflesch and Burgun, Unsupervised, corpus-based method for extending a biomedical terminology, in: Proc. of the ACL 2002 Workshop on Natural Language Processing in the Biomedical Domain, University of Pennsylvania, Philadelphia, PA, USA, July 11, 2002, Johnson, ed., Association for Computational Linguistics, 2002, pp. 53–60, doi:10.3115/1118149.1118157.

4. W.W. Chapman, Bridewell, Hanbury, G.F. Cooper and B.G. Buchanan, A simple algorithm for identifying negated findings and diseases in discharge summaries, Journal of Biomedical Informatics 34(5) (October 2001), 301–310. doi:10.1006/jbin.2001.1029.

5. Damljanovic, Agatonovic and Cunningham, Natural language interfaces to ontologies: Combining syntactic analysis and ontology-based lookup through the user interaction, in: The Semantic Web: Research and Applications, Proc. of the 7th Extended Semantic Web Conference, ESWC 2010, Part I, Heraklion, Crete, Greece, May 30–June 3, 2010, Aroyo, Antoniou, Hyvönen, ten Teije, Stuckenschmidt, Cabral and Tudorache, eds, Lecture Notes in Computer Science, Vol. 6088, Springer, 2010, pp. 106–120, doi:10.1007/978-3-642-13486-9_8.

6. Damova, Dannélls and Enache, Multilingual retrieval interface for structured data on the web, in: Proc. of the 1st International Workshop on Natural Language Interfaces for Web of Data (NLIWoD 2014) Co-Located with the 13th International Semantic Web Conference (ISWC 2014), Riva del Garda, Italy, October 19, 2014, 2014.

7. Dima, Intui2: A prototype system for question answering over linked data, in: Working Notes for CLEF 2013 Conference, Valencia, Spain, September 23–26, 2013, Forner, Navigli, Tufis and Ferro, eds, CEUR Workshop Proceedings, Vol. 1179, CEUR-WS.org, 2013, http://ceur-ws.org/Vol-1179/CLEF2013wn-QALD3-Dima2013.pdf.

8. Duan, Cao, Lin and Yu, Searching questions by identifying question topic and question focus, in: Proc. of the 46th Annual Meeting of the Association for Computational Linguistics, ACL 2008, Columbus, Ohio, USA, June 15–20, 2008, McKeown, J.D. Moore, Teufel, Allan and Furui, eds, Association for Computational Linguistics, June 2008, pp. 156–164, http://www.aclweb.org/anthology/P08-1019.

9. Franconi, Gardent, Juarez-Castro and Perez-Beltrachini, Quelo natural language interface: Generating queries and answer descriptions, in: Proc. of the 1st International Workshop on Natural Language Interfaces for Web of Data (NLIWoD 2014) Co-Located with the 13th International Semantic Web Conference (ISWC 2014), Riva del Garda, Italy, October 19, 2014, 2014.

10. Freitas, J.C. Pereira da Silva and Curry, On the semantic mapping of schema-agnostic queries: A preliminary study, in: Proc. of the 1st International Workshop on Natural Language Interfaces for Web of Data (NLIWoD 2014) Co-Located with the 13th International Semantic Web Conference (ISWC 2014), Riva del Garda, Italy, October 19, 2014, 2014.

11. Janjić and Pržulj, The core Diseasome, Molecular BioSystems 8(10) (Aug. 2012), 2614–2625. doi:10.1039/c2mb25230a.

12. Kaufmann and Bernstein, How useful are natural language interfaces to the semantic web for casual end-users? in: The Semantic Web, 6th International Semantic Web Conference, 2nd Asian Semantic Web Conference, ISWC 2007 + ASWC 2007, Busan, Korea, November 11–15, 2007, Aberer, Choi, N.F. Noy, Allemang, Lee, L.J.B. Nixon, Golbeck, Mika, Maynard, Mizoguchi, Schreiber and Cudré-Mauroux, eds, Lecture Notes in Computer Science, Vol. 4825, Springer, 2007, pp. 281–294, doi:10.1007/978-3-540-76298-0_21.

13. J.-D. Kim and Cohen, Triple pattern variation operations for flexible graph search, in: Proc. of the 1st International Workshop on Natural Language Interfaces for Web of Data (NLIWoD 2014) Co-Located with the 13th International Semantic Web Conference (ISWC 2014), Riva del Garda, Italy, October 19, 2014, 2014.

14. Kozák, Necaský, Dedek, Klímek and Pokorný, Linked open data for healthcare professionals, in: The 15th International Conference on Information Integration and Web-Based Applications & Services, IIWAS’13, Vienna, Austria, December 2–4, 2013, E.R. Weippl, Indrawan-Santiago, Steinbauer, Kotsis and Khalil, eds, ACM, 2013, pp. 400–409. doi:10.1145/2539150.2539195.

15. Kuchmann-Beauger and Aufaure, Natural language interfaces for DataWarehouses, in: Actes des 8èmes Journées Francophones sur les Entrepôts de Données et L’Analyse en Ligne, EDA 2012, Bordeaux, France, Juin 2012, Maabout, ed., RNTI, Vol. B-8, Hermann, Juin 2012, pp. 83–92, http://editions-rnti.fr/?inprocid=1001218.

16. Kuhn, Campillos, Letunic, L.J. Jensen and Bork, A side effect resource to capture phenotypic effects of drugs, Molecular Systems Biology 6(1) (2010), 343. doi:10.1038/msb.2009.98.

17. Le Clément de Saint-Marcq, Deville, Solnon and P.-A. Champin, Un solveur léger efficace pour interroger le web sémantique, in: 8e Journées Francophones de Programmation Par Contraintes (JFPC 2012), Toulouse, France, 2012.

18. Lehmann and Bühmann, AutoSPARQL: Let users query your knowledge base, in: The Semantic Web: Research and Applications – Proc. of the 8th Extended Semantic Web Conference, ESWC 2011, Part I, Heraklion, Crete, Greece, May 29–June 2, 2011, Antoniou, Grobelnik, E.P.B. Simperl, Parsia, Plexousakis, P.D. Leenheer and J.Z. Pan, eds, Lecture Notes in Computer Science, Vol. 6643, Springer, 2011, pp. 63–79, doi:10.1007/978-3-642-21034-1_5.

19. Letouzey and Gabillon, Implementation d’un modèle de contrôle d’accès pour les documents RDF, in: SarSsi 2010, 2010.

20. Lukovnikov and A.-C. Ngonga Ngomo, SESSA – Keyword-based entity search through coloured spreading activation, in: Proc. of the 1st International Workshop on Natural Language Interfaces for Web of Data (NLIWoD 2014) Co-Located with the 13th International Semantic Web Conference (ISWC 2014), Riva del Garda, Italy, October 19, 2014, 2014.

21. Marginean, GFMed: Question answering over biomedical linked data with grammatical framework, in: Proc. of the 1st International Workshop on Natural Language Interfaces for Web of Data (NLIWoD 2014) Co-Located with the 13th International Semantic Web Conference (ISWC 2014), Riva del Garda, Italy, October 19, 2014, 2014.

22. A.T. McCray, A.C. Browne and Bodenreider, The lexical properties of the Gene Ontology, in: AMIA 2002, American Medical Informatics Association Annual Symposium, San Antonio, TX, USA, November 9–13, 2002, AMIA, 2002, pp. 504–508, http://knowledge.amia.org/amia-55142-a2002a-1.610020/t-001-1.612667/f-001-1.612668/a-101-1.612945/a-102-1.612942.

23. Montoya, Vidal and Acosta, A heuristic-based approach for planning federated SPARQL queries, in: Proc. of the Third International Workshop on Consuming Linked Data, COLD 2012, Boston, MA, USA, November 12, 2012, Sequeda, Harth and Hartig, eds, CEUR Workshop Proceedings, Vol. 905, CEUR-WS.org, 2012, pp. 63–74, http://ceur-ws.org/Vol-905/MontoyaEtAl_COLD2012.pdf.

24. NLM, RxNorm, a standardized nomenclature for clinical drugs, Technical report, National Library of Medicine, Bethesda, Maryland, 2009. Available at www.nlm.nih.gov/research/umls/rxnorm/docs/index.html.

25. Pradel, Haemmerlé and Hernandez, Expression de requêtes SPARQL à partir de patrons: Prise en compte des relations, in: Journées Francophones d’Ingénierie des Connaissances (IC), Mille, ed., Chambéry, May 2011, pp. 771–787.

26. Pradel, Haemmerlé and Hernandez, Des patrons modulaires de requêtes SPARQL dans le système SWIP, in: Journées Francophones d’Ingénierie des Connaissances (IC), June 2012, pp. 412–428.

27. Ranta, Grammatical Framework: Programming with Multilingual Grammars, CSLI Publications, Stanford, 2011.

28. Rizzo and Troncy, NERD: Evaluating named entity recognition tools in the Web of Data, in: Proc. of the Workshop on Web Scale Knowledge Extraction (WEKEX11), Bonn, Germany, October 24, 2011, Fan and Kalyanpur, eds, 2011, Bonn, Germany, http://iswc2011.semanticweb.org/fileadmin/iswc/Papers/Workshops/WeKEx/paper_6.pdf.

29. Schmid, Probabilistic part-of-speech tagging using decision trees, in: New Methods in Language Processing, Studies in Computational Linguistics, Jones and Somers, eds, UCL Press, London, GB, 1997, pp. 154–164.

30. Sebastiani, Machine learning in automated text categorization, ACM Computing Surveys 34(1) (March 2002), 1–47. doi:10.1145/505282.505283.

31. Tablan, Damljanovic and Bontcheva, A natural language query interface to structured information, in: The Semantic Web: Research and Applications, Proc. of the 5th European Semantic Web Conference, ESWC 2008, Tenerife, Canary Islands, Spain, June 1–5, 2008, Bechhofer, Hauswirth, Hoffmann and Koubarakis, eds, Lecture Notes in Computer Science, Vol. 5021, Springer, 2008, pp. 361–375. doi:10.1007/978-3-540-68234-9_28.

32. Unger, Bühmann, Lehmann, A.N. Ngomo, Gerber and Cimiano, Template-based question answering over RDF data, in: Proc. of the 21st World Wide Web Conference 2012, WWW 2012, Lyon, France, April 16–20, 2012, Mille, F.L. Gandon, Misselis, Rabinovich and Staab, eds, ACM, 2012, pp. 639–648, doi:10.1145/2187836.2187923.

33. Unger, Forascu, Lopez, A.N. Ngomo, Cabrio, Cimiano and Walter, Question answering over linked data (QALD-4), in: Working Notes for CLEF 2014 Conference, Sheffield, UK, September 15–18, 2014, Cappellato, Ferro, Halvey and Kraaij, eds, CEUR Workshop Proceedings, Vol. 1180, CEUR-WS.org, 2014, pp. 1172–1180, http://ceur-ws.org/Vol-1180/CLEF2014wn-QA-UngerEt2014.pdf.

34. D.S. Wishart, Knox, Guo, Shrivastava, Hassanali, Stothard, Chang and Woolsey, DrugBank: A comprehensive resource for in silico drug discovery and exploration, Nucleic Acids Research 34(Suppl. 1) (2006), 668–672, Database-Issue. doi:10.1093/nar/gkj067.