Abstract
The blending of linked data with ontologies leverages the access to data. GFMed introduces grammars for a controlled natural language targeted towards biomedical linked data and the corresponding controlled SPARQL language. The grammars are described in Grammatical Framework and introduce linguistic and SPARQL phrases mostly about drugs, diseases and relationships between them. The semantic and linguistic chunks correspond to Description Logic constructors. Problems and solutions for querying biomedical linked data with Romanian, besides English, are also considered in the context of GF.
Keywords
Introduction
Linked data means using the Web to connect related data. A large amount of data from various domains such as government, education, life sciences, art and others were made available in the context of the Linked Open Data initiative built around DBpedia. One of the greatest challenges of this new big set of data is querying it. In order to fill the gap between end users and formal languages like SPARQL, more approaches emerged: querying in full natural language [15], or in Controlled Natural Languages [5,6,10], or incremental query building [24].
The suitability of interfaces in natural languages for querying linked data is justified by more reasons. Frequently, the linked data is described by means of large terminologies. Connections between datasets are encouraged by the very essence of the concept of linked data. Consequently, the lack of detailed knowledge about the structure of the data makes the querying task tedious, even for users well adjusted to the semantic web technologies. A controlled natural language could hide the complexity from the user, without important limitations on the expected expressivity. At the same time, building a restricted natural language with a well defined semantic is feasible, especially in the context of the large adoption of ontologies, and more recently, of their lexicalization [8].
Querying databases with meaning representation languages that rely on natural language is an old idea. The CHILL system [23] used such a language for querying Geobase, a set of Prolog facts. The specialized parser for the language was learned through inductive logic programming. More recently, statistical machine learning was used. With an increasing popularity, controlled natural languages (CNLs) aim at giving an intuitive representation of formal representations with a trade-off between precision of formal languages and ambiguity, but high expressivity, of natural languages. A comparative analysis of controlled natural languages can be found in [13].
In this line, we propose a system, GFMed, based on application grammars manually built with Grammatical Framework (GF) [17]. All GFMed resources are available at
DrugBank is part of the project Bio2RDF [4] and contains chemical, pharmacological and pharmaceutical data about drugs with comprehensive drug target information. Diseasome provides information about human disease-gene networks, while SIDER relates drugs to their adverse reactions. The linked data version of Diseasome publishes a network of 4300 disorders and disease genes, as well as possible drugs for diseases. SIDER includes 4192 side effects, 996 drugs and 99423 drug/side effect pairs. The three datasets are connected, diseases from Diseasome are related to drugs described in DrugBank, DrugBank drugs are related to SIDER drugs with the relation owl:sameAs and there are side effects from SIDER that are also diseases.
Grammatical Framework is a special purpose programming language with a high support for multilingual applications. Therefore, starting from the grammars proposed for querying in English, a multilingual system was investigated, with Romanian as the second natural language.
The article is structured in 6 sections. Section 2 introduces the GF resource library that we propose for SPARQL, together with a very brief description of the proposed CNL. Section 3 describes the main functions from the abstract and concrete grammars which define the controlled language. Section 4 introduces our first attempt in building a multilingual language for querying biomedical data. Related work is analyzed in Section 5, while some conclusions are drawn in Section 6.
Controlled natural languages with GF for linked data

Querying linked data with natural language.
The general workflow of a system for querying linked data with GF is detailed in Fig. 1. The user inputs the query in natural language and a SPARQL query is generated with the help of the GF grammars.
GF grammars are divided into abstract and concrete grammars. An abstract grammar defines categories and functions. Each category stands for a set of trees. Functions produce trees of certain categories. The linearization types and functions are defined in concrete grammars. For each category, a linearization type is defined and for each function, a linearization function. Based on the abstract grammar and the concrete grammars for each language, GF is able to translate a phrase from one language to another by parsing it first into an abstract tree and then linearizing it by means of the concrete grammars.
GF has comprehensive libraries for syntax, lexicon and inflections in 36 languages [17]. CNLs built with GF, including ours, rely on these libraries for syntax, morphological paradigms useful for introduction of new elements in the lexicon, and coordination.

Operators which create graph patterns.
The abstract grammar of a CNL built with GF stands for the CNL’s semantic model [2]. The syntax of the CNL is defined by the concrete grammars for natural languages. In case of the SPARQL concrete grammar, since SPARQL is a formal language, the patterns described in this grammar could also be considered as the CNL’s semantic.
The first step in using GF for querying linked data is to have proper support for the language SPARQL. There is another SPARQL-based retrieval interface for structured data [5,6], yet, we preferred to define our own resource for SPARQL, with a set of types and operators. In a previous attempt to query touristic linked data [20], we did not use any GF module of type resource in writing the SPARQL concrete grammar. But, SPARQL-dedicated types and operators ease the process of building compact SPARQL grammars. We covered basic elements of SPARQL 1.1, with a limited support for term constraints, aggregates and negation, and no support for property paths of length different from 1, optional graph pattern, or assignment.
Two main types were defined:
The type

Operators which create a
A set of operators were defined on these types. There are operators that create incomplete triple patterns, such as

Operators which create SELECT and ASK queries.
Finally, there is a set of operators that create the strings for
The GFMed CNL aims at querying biomedical knowledge bases. The set of interrogative or imperative sentences includes which, what, are there, list and give me type sentences. The entities for which some data are queried are: diseases, drugs, genes, targets, side-effects. The entities can be referred: i) by their name (e.g. lepirudin, rickets, fever), ii) by a named category to which they belong (e.g. approved drugs, metabolic class), or iii) by a category defined through a certain property, as the ones mentioned in Table 1 (e.g. drugs that treat rickets, diseases associated to lrrc8). The expressivity of the CNL is enlarged by the fact that these three types can be composed, e.g. side-effects of drugs that treat diseases associated to growth hormone 1. Besides questions about these entities, certain information about properties are also covered: highest value of melting point, least common chromosomal location.
The first two types of entity references are obtained from the three datasets and are part of the lexicon. For the third type, a set of about forty words is included in the lexicon, e.g. number, of, distinct, involve, target, drug, together with the names of the properties defined by the terminology of the three datasets, e.g. predicted solubility, affected organisms, brand name, route of elimination. The next section gives a detailed description of how this CNL is built.
Classes of resources to express in CNL
Classes of resources to express in CNL

The main grammars and resources used by GFMed.
Description Logics (DLs) [3] are a family of knowledge representation languages that can be used to represent the knowledge of an application domain in a structured and formally well-understood way. In the description logic
GFMed, the proposed system, consists mainly of a GF grammar for the application domain given by SIDER, Diseasome and DrugBank datasets. GFMed also includes some minor preprocessing of questions and post-processing of translation results, mainly in order to deal with structures involving numeric values, e.g. values for water solubility, or free text, like different names of foods. Figure 5 shows the main grammars: an abstract grammar and two concrete grammars. For each concrete grammar, lexicons derived from SIDER, Diseasome and DrugBank were generated. Many syntactic structures in both English and SPARQL were driven by the datasets’ terminology.
For domain-specific applications, the GF abstract grammar must state the main semantic categories and trees of the language. For GFMed we introduced the following categories:
Main categories of GFMed grammars
The core of the abstract grammar consists of functions to build trees. GFMed’s functions can be categorized in i) functions that describe property restrictions, ii) functions for transforming a criterion of a class into a class or for transformations between different types of classes and properties, iii) functions expressing queries.
In the concrete SPARQL grammar built on top on our SPARQL resource library, each property from the targeted datasets is linearized to a value of type
When dealing with
In DLs, there are two types of roles or properties: object properties and datatype properties. Object properties relate individuals of two concepts, while datatype properties relate an individual of one concept to a value of a certain datatype. For example, the object property
Examples of class expressions and assertions in Diseasome
Examples of class expressions and assertions in Diseasome
Object properties Object properties and datatype properties are treated differently in the GFMed grammar. When it comes to object properties, the DL existential restriction ∃
Each DL constructor can be expressed in natural language in more than one way, either as noun phrase (NP), verbal phrase (VP), adjectival phrase (AP), verb-phrase-modifying adverb (Adv), relative clause (RCl) or clause with some missing part (ClSlash). These syntactic categories are defined by GF library. Each DL constructor identified at a conceptual level corresponds to more functions which build trees at the concrete English level, one for each possible syntactic structure. All the English alternatives for expressing a conceptual DL constructor have the same SPARQL linearization. This is somehow expected, as SPARQL is a formal language tightly related to DLs. The first 6 functions from Fig. 6 model restrictions on the property

Functions for diseases expressed as restrictions on the property

SPARQL linearization function for diseases expressed as restrictions on the property

Functions for restrictions on datatype properties.
In a similar manner, functions for restrictions on the inverse property of
Within this approach,
Datatype properties When it comes to restrictions on datatype properties, the English methods to express them are not anymore particular to each property, therefore it is possible to treat all with the same set of functions. Some examples are described in Fig. 8. The property becomes one of the functions’ parameters. The most important issue is that it is not possible to include all actual values in the grammar, because the set of values is not finite. This issue can not be completely solved in GF. The proposed solution is to include in the grammar generic trees with a dummy string. If the translation to SPARQL succeeds, the dummy value is replaced in the generated query during post-processing. Since the values for these restrictions tend to appear at the end of the question, e.g.
Cardinality restrictions, either on datatype or object properties, are not covered in the current version of the CNL, but they can be included without much effort: the current SPARQL resource already has support for inner queries. However, the number of functions for restrictions on object properties would double. The conjunction of DL classes that correspond to main entities of the CNL is covered for a limited subset, for example
Other described constructors include

Examples for datatype properties.
Figure 9 details two examples. The first uses the function

Abstract tree and concrete linearizations in English and SPARQL for an example with two transformation functions.
For composability reasons, transformation functions are defined for getting from a criterion to a class, or for getting from one dataset to another. The former are important for English linearization, while the latter play an important role in SPARQL linearization.
The first transformation functions take criteria and build from them the upper level linguistic structures needed in queries. For example, in order to get to the Noun Phrase

Functions for queries.
The only domain-dependent part from the SPARQL linearization of the function
The second type of transformations deals with queries requesting access to more datasets. In this case, English linearization does not alter the object of transformation, while the SPARQL linearization introduces new variables and sameAs statements, i.e. based on
An example for the composition of two transformation functions is given in Fig. 10. The innermost tree
Several types of queries were identified: give, list, which, what, for/with which, and is/are there. They are applied to one class, one criterion, or to a list of classes or criteria for classes (Fig. 11). The questions mostly deal with resource classes and criteria for these classes and less with properties. An exception to this rule is the function
The advantage of the described approach is the flexibility in the composition of trees and trees constructors, based on their types and transformation functions. For example,
The grammars can be used not only for translating natural language questions into SPARQL, but also for translating SPARQL queries into natural language questions, for the phrases that are not altered by the pre/post-processing of GFMed. For example, the query
Pre- and post-processing
GF comes with an HTTP server that supports REST services for its main functionality, as translation or parsing. GFMed includes i) the abstract grammar and the concrete grammars for English and SPARQL described previously, and ii) a Java standalone application that interfaces with the GF translation service based on these grammars.
The standalone application includes a preprocessing module, a module for consuming the translation service, and a post-processing module. Algorithm 1 describes the main steps of the translation from a natural language to SPARQL.

NaturalLanguage2SPARQL
Preprocessing includes a simple transformation of the question to lowercase, and a failure handling method. When the translation module gets a failure from the server, the failure handling method repeatedly trims the last word of the question and replaces the trimmed sequence with the dummy string
A special case of this trimming is done for situations where a list of free text values is included in the question. Question 13 from the QALD test set is an example for this situation: it includes the phrase
The transformation functions from a dataset to another are not protected against redundant application. It is possible to transform a drug from DrugBank to a drug from SIDER, just to return back to a drug from DrugBank. Therefore the parsing step could return more alternative abstract trees of variable size, with redundant application of the transformation functions as one factor that increases size. The post-processing module searches for the abstract tree with the smallest size, where the size of a tree is the number of included nodes. Once the tree is found, its SPARQL linearization is extracted. In case it was a value restriction, solved by the failure handling method, some replacements are done.
GF grammars must know, at compilation time, all the tokens that are part of the analyzed text. Therefore, GFMed includes lexicons for both SPARQL and English formed of all drugs, targets, diseases, genes, and side effects extracted from the three datasets (Fig. 5).
These lexicons where generated from the data sources available on the sites of the three datasets, either by using SPARQL endpoints, or by parsing RDF files. Also, the same results were obtained from executing SPARQL queries on the QALD-provided endpoint. Special attention was given to side effects, drugs, and genes. For the same ID of a side effect more synonym names are known, expressed through the property
Table 4 shows the number of resources identified in this way, giving both the number of distinct IDs and distinct names for each category.
Number of resources described in lexicon
Number of resources described in lexicon
The system was evaluated against training and test questions of Task 2 in QALD-4. The results on the test questions are shown in Table 5. GFMed correctly parsed all the questions, except one. It partially parsed question 21,
Results of GFMed in Task 2 of QALD-4
Results of GFMed in Task 2 of QALD-4
Grammatical Framework is a programming language for multilingual grammars. Therefore, starting from GFMed’s concrete grammar for English, the first steps were made towards a multilingual system. The Romanian language was considered, besides English.
Multilingual grammar
The default GF mechanism for building multilingual application grammars is through incomplete grammars that are language independent. Such a grammar was created for the described CNL without any significant changes. The incomplete grammar is extended by two concrete grammars, one for English and one for Romanian. Lexicons were generated for both languages. The name of the properties from the schemas of the three sets were translated manually in Romanian.
In Fig. 12, two two-place adjectives are defined in the English and the Romanian lexicons. In the incomplete grammar, they are used in linearizations of two criteria for either the property

Functions from lexicons and incomplete grammar.
For efficiency reasons, some constructions were not included in the GF Romanian library [9]. The main problems in using the GF library for Romanian were identified for expressing relational nouns and genitive relative phrases. Unlike English prepositions, Romanian prepositions have cases, and the nouns and adjectives have different forms for different cases. GF’s resource library has a good support for these cases, but we experienced a problem with expressing nouns in the genitive case due to the expected presence of the possessive article (a, al, ai, ale, alor). The correct forms are
Nonetheless, these problems, particular to the chosen language, are less relevant for the current goal, which is to test the possibility to add a new language. The incomplete grammar covers the majority of the functions from our CNL. This justifies the conclusion that the addition of other languages requires effort mostly only for the translation of the lexicon, which can be done with simple dictionaries, except for the domain-dependent names of the queried resources.
Apart from having phrases of common sense knowledge expressed in different languages, an important issue in building multilingual systems is the domain-specific terminology. When there are structured versions of the terminology in different languages, the problem is easier to solve. For example, Medical Dictionary for Regulatory Activities (MedDRA) is a multilingual terminology developed in order to provide a single standardized international medical terminology which can be used for regulatory communication and evaluation of data pertaining to medicinal products for human use. Unfortunately, Romanian is not included in the set of used languages.
However, based on existing classifications of disorders as OMIM (Online Mendelian Inheritance in Man), ICD (International Classification of Diseases) and Orphanet, we investigated a solution to build a Romanian lexicon for the named diseases from Diseasome.
OMIM captures the relation between genes and disorders. Each disorder has an OMIM code. This catalog is continuously updated. The initial bipartite graph from Diseasome was built on the OMIM version from 2005 [11], while the 2012 version of Diseasome was extended with drugs.
ICD-10 classification is the 10th revision of the International Classification of Diseases and Related Health Problems. There is a Romanian version of the Australian version ICD-10-AM, that is used in Romanian hospitals. Diagnosis Related Group (DRG)1
provides for an application that includes this classification.Orphanet is a portal for rare diseases, led by a consortium of around 40 countries. One of its freely accessible services is a classification of diseases elaborated using existing published expert classifications.2
The included alignments between disorders and external terminologies or resources, as OMIM, ICD10, MeSH (Medical Subject Headings), UMLS (Unified Medical Language System) and MedDRA, are characterized in order to specify if the terms are perfectly equivalent (exact mappings) or not. There are versions for 7 languages, but again Romanian is not included.Bio2RDF [4] is a project that aims at providing linked data for the Life Sciences and includes many medical resources formalized in RDF. DBpedia also contains information for diseases, including OMIM, MeSH or ICD identifiers, and names in different languages. One way to use it for the task of building a Romanian lexicon is to rely on a mapping between English and Romanian names without the use of ICD classification. For other languages this could be an acceptable solution, but in case of Romanian, the small number of DBpedia resources in this language make it inappropriate. The alternative is to rely on DBpedia for identification of ICD codes from OMIM codes or names, followed by the use of the Romanian version of ICD-10AM. A first problem is the incompleteness of the OMIM and ICD codes in DBpedia. Another problem is the inconsistency with the referred classifications. For example, for
With this analysis, the current solution for building the Romanian lexicon for diseases follows the following steps:
The labels of the diseases are obtained from Diseasome. The ICD code is searched in the linked data version of OMIM from Bio2RDF. The search is done either on the disease’s label or on the OMIM code extracted from the label (in case it is included). The used endpoint is The ICD code is searched in Orphanet, again based on the name or the OMIM code. In case it is found, the type of the mapping is extracted, too. The Romanian terms are extracted from the Romanian classification ICD-10-AM based on the identified ICD code. If the ICD code was obtained at step 2, the mapping is considered exact. Otherwise, the type of Orphanet mapping is analyzed, and in case it is an exact mapping, only the Romanian term is kept. In case the mapping is not exact, the initial English label is concatenated to the Romanian term.
Table 6 includes some examples of obtained terms. For example, for the disease with OMIM code 180920, Orphanet mentions two mappings to ICD-classified disorders, both having the status of not decided (ND), and one of them has Q38.4 code.
Romanian terms for diseases
The steps 2 and 3 are alternatives for finding the ICD code. None of them is complete, and their combination is also incomplete. From 4213 different Diseasome resources, using only Orphanet, 1210 Romanian terms were found, including all the mapping types, not only the exact mappings. By using also OMIM Bio2RDF, the number increases to 1815 terms. Their correctness relies solely on the existing data in the queried resources. It must be emphasized that the process of building the lexicon can be improved through i) a more detailed filtering by the name of the disease in Orphanet, since for example, in Diseasome, the phrase type ii is used and in Orphanet it appears as type 2; ii) the extensive use of synonyms of diseases from Orphanet. Medical competence is needed to validate the resulted lexicon and to solve mappings different from the exact one.
A question parsed and linearized with GFMed grammars that include the generated lexicon for Romanian terms of diseases in shown in Fig. 13. The Java GUI of GFMed is meant only for testing the grammars combined with the pre- and post-processing steps. For the end user, the fridge magnet type interface included in GF might be appropriate, since it supports incremental parsing and completion.

An example of the multilingual version of GFMed, with the Romanian term for disorder.
With the recent boost in available linked data and ontologies, the interest in extending the lexical context of ontologies increased as well [1]. A recent result in extending ontologies with a richer lexical layer is the ontology-lexicon model lemon [8,16]. This model proposes design patterns for the most common lexicalizations of labels from ontologies. The model was used in a manual approach of building an ontology-derived lexicon for DBpedia [19]. The building process consists of creating descriptions of verbalizations for classes and properties from ontologies. A significant part of DBpedia ontology was covered, 98% of the classes and 20% of the properties. Similat to this approach, GFMed grammars are built manually, starting from the schemas of datasets to be queried. Patterns are identified in our approach starting from DL constructors, mainly restrictions on properties. The patterns are described directly in GF, based on our own SPARQL resource library. The GF functions correspond to DLs constructors, facilitating their composition in a similar way to DLs.
Manual development of the grammars strongly restricts the scalability of our approach. The semantics of the targeted linked data, biomedical data from Diseasome, SIDER and DrugBank, is narrower compared to DBpedia. Nevertheless, the required precision in tackling medical data can be obtained with a manual approach. A very recent result in the automatic derivation of lexicons in lemon format is described in [21].
SQUALL [10] is another controlled natural language that allows for a translation into SPARQL queries, relying on Montague grammars. Unlike SQUALL, GFMed is appropriate for multilingual development due to the fact that is a controlled natural language built with GF. GF was previously used in multilingual systems for querying linked data in [5–7]. Compared to cultural heritage linked data, the biomedical domain calls for high composability of the recognized expressions due to the strong relations between the involved entities, like drugs and diseases.
Another CNL for the biomedical domain was created with Attempto Controlled English (ACE) for stating facts about interaction between proteins [14]. In contrast to this, GFMed CNL is a querying language, and the queried data is linked data, so it requires a translation of the identified meaning to SPARQL. The transitions between concrete syntax, semantics and then back to concrete syntax are easily captured with GF concrete and abstract grammars. Furthermore, the CNL from [14] is restricted to English, while GFMed includes also a controlled natural language for Romanian.
An incremental construction of queries is described in [24]. Relevant concepts and properties are identified at each step and the user can choose one. The use of the ontology is the shared point with our approach, but in our case the user does not interact with the ontology but uses natural language to express his needs.
From a completely different perspective, [22] and [25] propose learning and pattern matching for querying linked data in natural language. Ambiguity and named entity recognition are not easy to tackle within these approaches, unlike the case of controlled natural languages. But their important advantage is scalability and domain independence. Nevertheless, if we consider Romanian instead of English as querying language, the limited existing resources for processing this language hamper the application of many approaches based on learning and pattern matching and sustain approaches based on GF and its resources for Romanian.
Conclusion
A controlled natural language for querying biomedical linked data was introduced. The language is able to cover questions over more datasets, complex questions with different linguistic structures, and questions that involve lists and free text. It is built with Grammatical Framework by following a methodology based on DLs constructors. The functions defined in GFMed’s grammar are highly composable, due to their relation with DLs constructors. This relation, together with the fact that the main categories of the abstract grammar map to the types of the main entities from the queried datasets, makes an extension of the queried datasets with, for example, linked data about medical publications possible. A general GF resource for SPARQL was also introduced. The steps followed in building the language are not specific to the biomedical area. We consider that porting to a different domain is facilitated by three elements: i) the split between criteria for a class and the class, with criteria expressed with different syntactic categories, ii) the transformation functions, either from one criteria to a class, or from one dataset to another, iii) the operators from the SPARQL resource. In the same time, the manual character of the language building process limits the scalability and claims for future work in the automatic derivation of GF functions from ontologies extended with lexical layer.
The proposed system also addresses multilinguality. Romanian was investigated in addition to English. In order to obtain Romanian terms for diseases from Diseasome, a method based on more international classifications was analyzed, employing mainly the resources which follow the principles of linked data.
