Abstract
Large-scale knowledge graphs such as those in the Linked Open Data cloud are typically stored as subject-predicate-object triples. However, many facts about the world involve more than two entities. While n-ary relations can be converted to triples in a number of ways, unfortunately, the structurally different choices made in different knowledge sources significantly impede our ability to connect them. They also increase semantic heterogeneity, making it impossible to query the data concisely and without prior knowledge of each individual source. This article presents FrameBase, a wide-coverage knowledge base schema that uses linguistic frames to represent and query n-ary relations from other knowledge bases, providing multiple levels of granularity connected via logical entailment. Overall, this provides a means for semantic integration from heterogeneous sources under a single schema and opens up possibilities to draw on natural language processing techniques for querying and data mining.
Introduction
Over the past few years, large-scale knowledge bases (KBs) have grown to play an important role on the Web. Increasing numbers of institutions publish their data using Semantic Web standards [2] and Linked Open Data (LOD) principles, contributing to the LOD cloud. This data can be used for a variety of purposes. For instance, commercial search engines exploit these KBs to provide direct answers to user queries, while IBM’s Watson question answering system [21,34], which defeated human champions of the Jeopardy! quiz show, used them to find or to rule out answer candidates.
KBs of this sort are mostly based on simple statements expressed as subject-predicate-object triples, as defined by the RDF model [29]. Such triples are convenient to process and can be visualized as entity networks with labeled edges.
Whereas triple representations work straightforwardly for relations involving two entities, many interesting facts relate more than just two participants – a problem that has gained renewed attention in several recent papers [25,43] as well as in the current W3C proposal to add roles to schema.org [5]. For a birth event, for instance, one may wish to capture not just the time but also the location and the parents. For an actress starring in a movie, the name of the portrayed character may be relevant. Such facts naturally correspond to n-ary relations. In order to capture them as triples, several different representation schemes have been proposed.
Figure 1 shows several ways of expressing that two entities, John and Mary, married in 1964. These different modeling patterns are used across different KBs in the LOD cloud and will be discussed in more detail in Section 2.
The basic-triple pattern in Fig. 1(a) is very simple and just establishes pair-wise connections between the arguments of the n-ary relation. If one regards every triple as representing an underlying n-ary relation with only two arguments given, it could be said that this pattern occurs in every KB in the LOD cloud. It lacks the expressive power to connect more than two arguments of the same n-ary relation.
The triple-reification pattern1 in Fig. 1(b) introduces a new entity that stands for the entire triple, so that the remaining arguments of the n-ary relation can be attached to it.
This kind of reification is different from the other kind that is discussed in this paper, which is explained in Section 2.1.2. Both kinds of reification have in common that they consist of creating an entity for something that was not represented explicitly by a single entity before. To avoid confusion, we will refer to this kind of reification as triple-reification, while the other kind – more related to the field of linguistics – will be referred to as “reification” without any qualifier.
The singleton-property pattern in Fig. 1(c) [43] improves the pattern above, but still carries some of the same problems.
The pattern in Fig. 1(d) is an event-centric pattern used frequently in specific parts of many KBs (e.g. Freebase [4]), usually to represent public events by means of a reduced ad-hoc vocabulary. It uses specific properties connected to an event class. The event class is often specific but sometimes may also be general (since more specific information can often be explicitly or implicitly inferred from the specific roles).
The pattern in Fig. 1(e) is similar to the previous one but uses a reduced set of generic roles. It is found in some KBs and schemas such as the Simple Event Model (SEM) Ontology [70] and LODE (Linking Open Descriptions of Events) [61].
The pattern in Fig. 1(f) is based on “role classes” that substitute for the regular object of a triple, and to which additional properties can be appended.
Other more ad-hoc solutions exist as well.2
One is to encode the value of the third, fourth, etc. argument in the IRI of a property connecting the first two, e.g., a property such as marriedIn1964 linking John and Mary.
Fig. 1. The same information represented using different modelling patterns in different KBs in the LOD cloud. The property "same event" is meant to link entities that are not logically equivalent but represent the same underlying event.
Table 1 provides examples of these different modeling patterns in terms of the involved triples, given in an N-triples-like format. These correspond to Fig. 1, but the parts corresponding to Fig. 1(a), (b), (c), and (f) are restricted to the structures surrounding the triple that relates John and Mary.
Table 1. Triple representations of n-ary relations.
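For concreteness, the basic-triple pattern (a) and the event-centric pattern (d) can be sketched for the marriage example in an N-triples-like notation; the IRIs below are illustrative rather than taken from any particular KB.

  # (a) Basic-triple pattern: only pairwise links; nothing groups
  # the three facts into a single marriage event.
  :John :marriedTo :Mary .
  :John :marriedInYear "1964" .
  :Mary :marriedInYear "1964" .

  # (d) Event-centric pattern: a dedicated event instance connects
  # all arguments of the n-ary relation.
  :marriage1 rdf:type :Marriage .
  :marriage1 :partner1 :John .
  :marriage1 :partner2 :Mary .
  :marriage1 :year "1964" .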
As the examples show, this sort of semantic heterogeneity leads to significant data integration challenges. One KB might use a simple binary property between two entities, whereas another may instead choose a more complex representation that accommodates additional arguments (as will be analyzed in Section 2.1). The representations can easily be so at odds with each other that no particular mapping between entities could bridge the differences: there are entities on each side that have no counterpart on the other. As a consequence, heterogeneous sources become hard to connect, queries require prior knowledge of each source's modeling choices, and facts expressed under a different representation may be missed entirely.
In this article, we describe how these problems are addressed by FrameBase, a schema designed to combine expressiveness with conciseness.

Fig. 2. The knowledge from the examples in Fig. 1, represented under the FrameBase model, which combines expressiveness with conciseness through different representation layers.
The latter is achieved by offering a two-layered structure with a mechanism to convert back and forth between the Neo-Davidsonian representation and one based on direct binary predicates, using a vocabulary of automatically generated binary properties that exploits FrameBase's ties to linguistic resources. These are more concise and can be used when only two arguments are relevant, either in the KB or in a query.
This article builds upon previously published work on FrameBase [53,54] and extends it by:
Including an expanded and updated analysis of the state of the art.
Describing the addition of miniframes to the FrameBase schema.
Linking with external Linked Open Data resources such as Lexvo.org [12,13] and the Princeton RDF WordNet [40].
Creating 10,270 new Direct Binary Predicates and Reification–Dereification rules based on nouns, for which the head verbs have been extracted with a novel method.
Adding linguistically rich annotations to all Direct Binary Predicates using the Lemon model [41].
Incorporating, in Section 6, results from additional work [54–56] on the classification and generation of integration rules.
Including new illustrations, an updated structure, and a more in-depth analysis of several aspects of FrameBase.
This paper is structured as follows. Section 2 reviews related work and conducts a more thorough analysis of existing approaches for modeling n-ary relations and their space efficiency. Then, an overview of FrameBase is given in Section 3. Section 4 explains how the FrameBase schema is constructed, including rules to convert between different levels of granularity and expressiveness. Section 5 provides an evaluation of the quality of the FrameBase schema. Section 6 presents a typology and examples of integration rules used to capture knowledge from external KBs into the FrameBase schema, and existing methods to automatically create the simplest kinds of rules. Section 7 discusses challenges regarding the creation of more complex integration rules, and possible ways to address them. Section 8 provides a conclusion and outlines other potential lines of future work.
Related work
In this section, we review prior work in this area. In particular, Section 2.1 provides a deeper analysis of the patterns introduced in Fig. 1. Section 2.2 discusses previous work on integrating knowledge. Section 2.3 introduces FrameNet, which serves as the backbone of our schema, as well as other related work based on it.
Modeling patterns for N-ary relations
Table 2. Triple count associated with different approaches or patterns for modeling n-ary relations.
Different approaches or patterns for modeling n-ary relations exist, as summarized in Fig. 1 and Table 1. In Table 2, we provide a novel analysis of their general space efficiency, which has consequences with regard to their applicability for large-scale KBs. Each row considers the space efficiency of a specific modeling pattern for representing an event with n participants.
Figure 1 can be regarded as a specific case of Table 2 with n = 3.
Similarly, Table 1 can be seen as a specific case of Table 2 with n = 3.
Each pattern will be discussed in detail in the following subsections.
Basic-triple pattern
A common way to represent n-ary facts is to simply decompose them directly into binary relations between two participants [14]. However, in doing so, important information may be lost. For instance, given three triples sharing the same subject, it may no longer be possible to tell which of the objects belong to the same underlying event.
Triple-reification pattern
The RDF standard includes a method for performing reification [29] of triples, which introduces a new Internationalized Resource Identifier (IRI) for a statement and then describes the original RDF triple using three new triples with the properties rdf:subject, rdf:predicate, and rdf:object.
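Using this standard vocabulary, the reified form of the marriage triple can be sketched as follows (illustrative IRIs):

  :John :marriedTo :Mary .                 # original triple, kept in the KB
  :stmt1 rdf:type      rdf:Statement .     # often also asserted
  :stmt1 rdf:subject   :John .
  :stmt1 rdf:predicate :marriedTo .
  :stmt1 rdf:object    :Mary .
  :stmt1 :inYear       "1964" .            # additional argument attached to the statement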
Triple-reification is used in the different versions of YAGO [31,64,65] to attach additional information to the event represented by the original RDF triple (evoked by its property). It has also been proposed in the W3C WebSchema drafts [5]. This pattern is exemplified in Fig. 1(b). It has several drawbacks. Formally, the event represented by a triple and the triple as a statement are different entities with different properties: for instance, an institution may endorse the triple as a statement without endorsing the marriage. Using triple-reification, both are represented by the same RDF resource identifier, which conceptually is meant to be unambiguous. This is a potential source of confusion and inconsistency. Furthermore, the number of triples increases by a factor of 4, since each triple is accompanied by three triples describing it. The advantage of being able to include the original non-triple-reified triple only applies to the primary binary relation, and not to the other pairs of arguments. Finally, the choice of the primary pair of entities and their binary relation (John and Mary in Fig. 1(b)) is arbitrary, and a third party willing to query the KB cannot replicate the choice independently. If their choice is different, they will not obtain any results. A possible solution, which is actually implemented in YAGO, is to include the triples for the other pairs and reify them too, but this adds yet another factor of overhead, besides data redundancy that would complicate updates.
If the triplestore implementation makes use of quads,3 the identifier of the statement can be stored in the fourth position, which reduces this overhead.
The “singleton property” approach [43] aims to solve some of the issues with triple-reification by instead declaring a subproperty of the original property in the primary pair, and using this subproperty as the subject for the other arguments of the n-ary relation. This is shown in Fig. 1(c).
While the approach enables us to use RDFS reasoning to obtain the triple with the parent property that relates two of the participants, and also reduces the overhead of triple-reification, it still suffers from the problems mentioned above related to the existence of a primary pair. For example, the non-triple-reified binary relationships for the other pairs cannot be inferred from that subproperty using RDFS.
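A sketch of the singleton-property pattern for the running example (illustrative IRIs); RDFS inference recovers :John :marriedTo :Mary from the subproperty axiom:

  :marriedTo_1 rdfs:subPropertyOf :marriedTo .   # singleton subproperty for this one fact
  :John :marriedTo_1 :Mary .
  :marriedTo_1 :inYear "1964" .                  # further arguments attach to the property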
Role-class pattern
Schema.org is an effort sponsored by Google, Yahoo, and Microsoft to establish common standards for semantic markup in Web pages. It offers a method to qualify a binary predicate by adding additional information to it [32], which in practice is equivalent to representing the n-ary relation arising from adding arguments to the binary relation underlying the binary predicate. This works by substituting the object of the binary predicate with a fresh instance of a class Role (or a subclass thereof with its own properties), and appending to this role instance the original object by means of the same binary predicate, alongside other properties such as time, instrument, etc. In order to avoid confusion, it is relevant to note that schema.org's use of the term "role" differs from its standard use in linguistics, where roles are qualifying properties such as agent and patient [23]; the linguistic definition has also been adopted in some ontologies.
This transformation offers a certain level of compatibility between the simple pattern with the direct binary predicate and the complex pattern, because the binary predicate is preserved in the complex pattern, with the same subject. However, the object changes, and therefore the simple pattern as such is not truly preserved after the transformation. Besides, the definition or original contract of the direct binary predicate is broken in the complex pattern.
An example of how this conflation can lead to problems can be fully appreciated with intransitive predicates: since the underlying relation admits no natural object, the role instance inserted in the object position cannot satisfy the predicate's original definition.
Furthermore, the complex pattern produced by this method, given a direct binary predicate between two entities and a further qualifying value (like time in the example), is not equivalent to the one produced by another binary predicate between one of these entities and the qualifying value. This produces a similar effect of redundancy as in the method using triple-reification.
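The role-class pattern can be sketched as follows in the style of schema.org (schema:Role, schema:spouse, and schema:startDate are actual schema.org terms, but the example itself is illustrative):

  :John schema:spouse :role1 .             # the role instance replaces the object
  :role1 rdf:type schema:Role .
  :role1 schema:spouse :Mary .             # the same predicate re-attaches the object
  :role1 schema:startDate "1964" .         # qualifying values attach to the role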
Neo-Davidsonian pattern
Another approach, and the one that FrameBase adopts, is to make use of Neo-Davidsonian representations [33, p. 600f.]. This means that we first define an entity that represents the event or situation (also referred to as a frame) underlying the n-ary relation. Then, this entity is connected to each of the entities filling the n arguments by means of properties describing the respective semantic roles [25,44] associated with each argument position.
The process of converting from the binary representation to the Neo-Davidsonian one is called reification, but this is different from triple-reification discussed above. In triple-reification, an entity is defined that stands for a whole triple so that additional triples can be used to describe the reified triple as a unit that represents a statement. However, in the context of event semantics, reification is used to denote the process by which an entity is defined that refers to the event, process, situation, or more generally, frame, evoked by a property or binary relation. Having done this, additional information about it can then easily be added. Both kinds have in common that a new entity is defined to refer to something that before was not explicitly represented by an entity in the KB, but in one case it is an RDF statement, while in the other it is an event.
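A Neo-Davidsonian sketch of the marriage example in FrameBase style follows; the frame and FE IRIs are approximations of FrameBase's FrameNet-derived naming, not its exact identifiers:

  :marriage1 rdf:type frame:Forming_relationships .   # frame instance for the event
  :marriage1 fe:Partner_1 :John .                     # semantic roles as FE properties
  :marriage1 fe:Partner_2 :Mary .
  :marriage1 fe:Time "1964" .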
The Simple Event Model (SEM) Ontology [70] uses the general-role Neo-Davidsonian pattern in Fig. 1(e). It defines four very general entities, Event, Actor, Place, and Time. It also establishes a framework for creating more specific ones by extending these, but it does not provide these extensions, nor ways to integrate existing KBs in a way that would solve the problem of semantic heterogeneity. Similarly, LODE (Linking Open Descriptions of Events) [61] specifies only very general concepts such as the four just mentioned.
Freebase [4] was built both by tapping existing structured sources and via collaborative editing. Although it uses its own formalisms, there are official and third-party translations to RDF. Freebase makes use of mediators (also called compound value types, CVTs) as a way to merge multiple values into a single value, similar to a frame instance in the Neo-Davidsonian pattern.
Knowledge integration
Connecting and integrating different knowledge sources is a long-standing problem. For KBs, there has been substantial work on ontology alignment [17] to identify matching classes from different sources, and in some cases also instances and properties [39,42,63].
However, relatively little work has considered scenarios in which the same type of ontological knowledge is modeled in different ways, as in the different modeling patterns illustrated in Fig. 1 and explained in Section 2.1. In these cases, alignment by means of binary properties such as equivalence or subsumption is no longer sufficient because an entity in a KB may not have a direct counterpart in another KB. For instance, neither any of the properties in Fig. 1(a), nor the statement instance in Fig. 1(b), the subproperty in Fig. 1(c), nor the event instance in Fig. 1(d) can be connected by a simple equivalence or subsumption link to a single entity under the other patterns.
The EDOAL (Expressive and Declarative Ontology Alignment Language) format [10] has been proposed to express complex relationships between properties. It defines a way to describe complex correspondences but it does not address how to create them. Similarly, complex correspondence patterns between ontologies have been described and classified in an ontology [60]. However, this approach does not provide any method to create the correspondence patterns, neither fully nor semi-automatically. The iMAP tool [15] searches a space of possible complex relationships between the values of entries in two KBs, e.g., that one attribute is a concatenation or arithmetic combination of attributes in the other source.
Unlike previous work, the approach presented in this paper does not focus on matching pairs of entities but provides techniques to match knowledge that can also be expressed with complex patterns involving multiple entities at one side. However, these techniques can be combined with the existing work on creating the one-to-one mappings.
FrameNet
FrameNet [22,58] is a well-known resource in natural language processing (NLP) that defines over 1,000 frames, which represent abstract concepts that encompass situations, events, or processes. These are evoked by certain words, called Lexical Units (LUs), which can be any part of speech: nouns, verbs, adjectives, etc. For example, the verb to buy and the noun acquisition can evoke (depending on the intended sense) a “commercial transaction” frame. Frames have associated participants (called Frame Elements or FEs for short). For instance, the “commercial transaction” frame has FEs for the seller, the buyer, the goods, and so on.
FrameNet includes a corpus of text that has been annotated with frames and FEs. Each annotation consists of a frame and an LU that appears (possibly inflected) in a piece of text, and some FEs whose values also appear in the text. This information can be used for training semantic role labelling (SRL) systems, also known as semantic parsers, to extract semantics or meaning from arbitrary text.

Fig. 3. Overview of the structure of the FrameBase system.
There has been previous work on producing conversions of FrameNet to RDF as a resource [45] instead of a schema. Also, previous work [24] has proposed a framework, in the form of a meta-schema, for using frames as units of meaning to address the semantic heterogeneity problem. The framework serves as a model that can be instantiated to generate schemas from FrameNet, but it does not provide a specific one.
FRED [48] builds semantic representations of text, based on Discourse Representation Theory and with links to VerbNet [46], FrameNet [22], DOLCE Ultra-Lite [49], and other knowledge sources. Our work, in contrast, does not focus on creating representations from text but rather on converting all the knowledge in such knowledge sources to a unified schema.
Overview of FrameBase
As pointed out in the previous section, there are a number of different patterns used to represent n-ary relations in KBs.
This paper describes the construction of FrameBase, an extensible KB schema that allows for representing a wide range of knowledge, aiming at an optimal balance between the existing modeling patterns. The paper also discusses methods to integrate knowledge from external KBs.
FrameBase consists of two layers. The more expressive but also more verbose layer of the FrameBase schema is referred to as the reified layer. It consists of classes, representing frames, which can be events, situations, or processes of a very general kind. It also contains frame-element properties that specify qualities about frame instances: agents participating in different ways, time, place, cause, consequence, instrument, etc. The frames are organized in a hierarchy of macroframes, miniframes, and synset- and LU-microframes, ordered here from more general to more specific kinds of frames. Synsets and LUs (Lexical Units) are concepts imported from WordNet [18] and FrameNet [1], respectively, which are both resources from computational linguistics. FrameNet constitutes the backbone of FrameBase and is a compilation of such frames and FEs to annotate the semantics of natural language. WordNet is a computational lexicon that includes word senses grouped by synonymy and other semantic relations. Both synsets and LUs are closely related to sense-disambiguated words and therefore they are used to produce the most specific frames, whereas miniframes and macroframes represent groups of near-synonymous or related concepts.
The less verbose but also less expressive layer of the FrameBase schema is the dereified layer, which consists of direct binary predicates (DBPs). These are properties for simple binary relationships between elements of a given frame. Rather than having to query such relationships via a common frame instance, this layer enables direct querying of these binary relationships.
Data from external KBs in the LOD cloud can be imported using integration rules, which can create FrameBase instance data from the instance data of the external KBs. This paper also describes the creation of these rules in manual, semi-automatic, and automatic ways, exploiting the linguistic aspects of FrameBase inherited from FrameNet. The results for automatic and semi-automatic methods are evaluated empirically. We also provide examples of how the resulting FrameBase instance data can be queried. Figure 3 provides a general overview of the dataflow in the FrameBase system.
FrameNet-based representation
The use of FrameNet as the backbone of FrameBase is motivated by the following considerations.
FrameNet has long been used to describe the semantics of general natural language. It thus provides a relatively large and growing inventory of frames and roles, with a coverage of different domains. The average number of FEs per frame is 9.45.
FrameNet comes with a large collection of English sentences annotated with frame and frame-element labels, which enables semantic role labeling [26]. This strong connection to natural language facilitates question answering [38] and related tasks.
While FrameNet’s lexicon and annotations cover the English language, its frame inventory is abstract enough to be adopted for languages as different as Spanish and Japanese [62]. This also makes it more suitable as a basis for language-independent knowledge representation than more language-specific syntax-oriented SRL resources such as PropBank [35], although being more abstract can make the SRL task more challenging.
In terms of what is expressed as a frame and what is expressed as a role or frame element, FrameNet provides a reasonable level of granularity for the phenomena that humans care to describe. From a theoretical perspective, there is no universally appropriate single level of reification. Any frame element might itself be reified, and any two elements of a frame could be connected directly by a predicate. Using FrameNet strikes a well-motivated balance, at a point that is granular enough to constitute a model for natural language semantics. However, as Section 4.4 will explain in more detail, a second level of representation is provided in FrameBase, which is based on the direct binary predicates between frame elements, and therefore less expressive but more concise.

Fig. 4. Example of some microframes and labels under a general frame class.
Constructing the FrameBase schema
The FrameBase schema consists of a reified layer and a dereified layer, connected by inference rules. The reified layer provides a comprehensive hierarchy of frames and FEs, with lexical labels in English. The dereified layer provides direct binary predicates that can be used between the values of the FEs. The creation of the schema is carried out in the following steps.
FrameNet–WordNet mapping
While FrameNet [22,58] is the largest high-quality inventory of semantic frame descriptions and their participants, WordNet [18] is the most well-known resource capturing meanings of words in a lexical network, covering for example nouns and named entities missing in FrameNet. WordNet, for instance, serves as the backbone of YAGO’s ontology. This section proposes a novel way of mapping the two resources, which later enables us to integrate both of them into FrameBase’s schema.
WordNet contains synsets, which are sets of sense-disambiguated synonymous words with a given part of speech (POS), such as noun or verb. FrameNet contains lexical units (LUs), which are also POS-annotated words associated with frames. Because of the semantics of the containing frame, LUs are also disambiguated to a certain extent, though not with the same granularity as in WordNet (for instance, WordNet has different senses for the verb to assert corresponding to stating something categorically and to declaring or affirming something solemnly as true; this is a nuanced difference that is conflated under a single LU in the frame Statement). The objective at hand is to produce an alignment of synsets and LUs with the same meaning, which can be later used to enrich FrameBase’s FrameNet-based schema with relations and annotations from WordNet.
More specifically, the objective is to map each LU to exactly one synset. While there are some LUs that could be mapped to more than one synset, as a general rule the restriction to a single one favors precision, which is desirable for the purpose of obtaining a clean knowledge base (even at the cost of some recall). The only cases where this model would be detrimental to precision are those for which LUs do not have any associated synset, but these are few and most can easily be avoided by omitting LUs with parts of speech not covered in WordNet, such as prepositions.
This choice allows for modeling the mapping as a function from LUs to synsets.
Candidate LU–synset pairs are scored with two measures: the lexical overlap compares the words associated with the LU and the synset, while the gloss overlap compares their textual definitions. Parameters a and b weight these two measures and are trained with a greedy search starting at several randomized seeds.
Hierarchy construction
In FrameBase, frames are modeled as classes whose instances are specific events or situations. The frame elements of each frame are properties whose domain is that frame. The class hierarchy of frames is created as follows.
We use http://framebase.org/ns as the default prefix.
Another example covers both inheritance and perspectivization: using RDFS inference, an instance of a specific frame class is also an instance of the more general frame classes that it inherits from or perspectivizes.
Algorithm 1 describes how the clusters are created, defined in a set C of pairs of microframes representing edges of a graph. In the main loop, the algorithm independently considers each macroframe m. Such macroframes have microframes as direct descendants. First, for a given m, an empty edge set is initialized.
Algorithm 1. Algorithm for generating clusters.
Fig. 5. Example of six clusters of LU- and synset-microframes under a single macroframe.

In general, these lexical relations do not necessarily imply any close semantics (e.g., the verb create and the noun creature), but when restricted to synsets all tied to the same FrameNet frame, such cases are normally factored out. Therefore, the pairs obtained from these lexical relations are added as edges to C.
Finally, the transitive closure of the symmetric closure of C is calculated, which effectively creates cliques for the clusters.
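Assuming the pairs in C are stored as triples with an auxiliary property (here called :clusterEdge, an illustrative name), this closure can be sketched as a single SPARQL 1.1 CONSTRUCT query using property paths:

  CONSTRUCT { ?a :equivalentMicroframe ?b }
  WHERE {
    ?a (:clusterEdge|^:clusterEdge)+ ?b .   # inverse (^) gives the symmetric closure, + the transitive one
    FILTER (?a != ?b)
  }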
Figure 5 presents examples of clusters under a single macroframe.
Once C is obtained, a property linking the two microframes of each pair as logically equivalent is asserted.
This is yet another different but related use of the term reification. In general, reification means the process of making something real, and in the context of knowledge bases, can be used whenever a new entity is created for something that was only implicitly represented before, generally as a function of pre-existing entities.
The use of the property makes the equivalence between the microframes of a cluster explicit, so that any of them can be used interchangeably in queries.
Names, definitions, and glosses in FrameNet and WordNet are also used to create text annotations for our schema. Lexical forms are attached with the Lemon model [41], as exemplified in Fig. 6.

Fig. 6. Example of Lemon annotation for an LU-microframe.
Following the best practices in the Linked Open Data community, we link synset-microframes to IRIs in the canonical RDF translation of WordNet [40]. We also provide links to word-sense IRIs in Lexvo.org, a KB that connects information about languages, words, characters, and other human language-related entities [12,13]. This allows FrameBase to be transitively connected to other KBs in the Linked Open Data web, as well as provide multilingual support.
In general, the schema depends on OWL inference, albeit of a lighter kind, consisting merely of RDFS inference plus support for equivalence axioms between classes and properties.
While frames are convenient for representational purposes, users wishing to query the knowledge base benefit from direct binary predicates between pairs of frame elements. For example, for a birth event, binary predicates like one connecting the person directly with the place or time of birth can be queried without mentioning the frame instance.
Thus, FrameBase presents a novel mechanism to convert between frame representations and direct binary predicates. This mechanism can also allow us to avoid materializing frame instances when only two frame elements are needed.
Structure of ReDer rules
The dereification rules have the form expressed in Fig. 7. Additionally, for each dereification rule there is a converse reification rule so that one can go back from binary predicates to the frame representation. Each Direct Binary Predicate (DBP) has only one associated frame and pair of frame elements, and therefore chaining reification and dereification rules is an idempotent operation. We call the pair of a reification rule and its converse dereification rule a ReDer (reification-dereification) rule. An example of a ReDer rule is provided in Fig. 8.

Fig. 7. The general pattern of a ReDer rule. The conjunction of the three triples below is semantically equivalent to the triple above.

Fig. 8. A particular example of a ReDer rule.
The ReDer rules can be implemented in different ways.
As SPARQL CONSTRUCT queries, due to SPARQL's prominence as a standard query language for KBs [28]. These can be used to materialize the DBPs into the KB; a sketch follows after this list.
As clauses, with triples as atoms, to be fed into general-purpose inference engines, with or without materialization. For example, ReDer rules can also be implemented as rules for Jena's general-purpose rule reasoner [6].
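For instance, the dereification direction of the general pattern in Fig. 7 corresponds to a CONSTRUCT query of the following shape, where FRAME, FE-S, FE-O, and DBP are placeholders:

  CONSTRUCT { ?s dbp:DBP ?o }
  WHERE {
    ?f rdf:type frame:FRAME .   # frame instance
    ?f fe:FE-S ?s .             # FE filling the subject of the DBP
    ?f fe:FE-O ?o .             # FE filling the object of the DBP
  }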
Given an instance set (ABox), the reified and dereified layers can be stored using different strategies.
Materializing both the reified and dereified layers. This is the simplest but least space-efficient approach. Ensuring consistency between both layers after updates to a single one requires some bookkeeping.
Materializing the reified layer and virtualizing the dereified layer. This means that the triples with DBPs as predicates are not stored in the KB like the rest, but they can be inferred at query time. A general-purpose inference engine like Jena's rule reasoner [6] could handle ReDer rules, since these can be written as definite clauses. This offers moderate space efficiency. Only the dereification direction of the rules is used. Ensuring consistency after updates is trivial if only the materialized layer is updated.
Materializing frame instances with two FEs in the dereified layer and the rest in the reified layer. This offers the highest space efficiency. Ensuring consistency after updates is the most complex of the three cases, because knowledge has to be moved between the reified and dereified layers when triples with FE predicates are added or deleted.
This choice of the storage strategy is in theory orthogonal to the implementation of the ReDer rules. In practice, however, storage strategy 1 is relatively trivial to implement using SPARQL CONSTRUCT implementations of the ReDer rules, while storage strategy 2 is trivial to implement using dereification rules in Jena format. Storage strategy 3 would require internal logic (which has not been implemented so far), leaving the choice of the rule format open.
Besides the plain human-readable IRIs, the DBPs carry lexical labels and linguistically rich annotations based on the Lemon model [41].
The ReDer rules are automatically built using the syntactic annotations of English sentences given for different LUs in FrameNet, namely the grammatical functions (GFs) and phrase types (PTs) [58]. These are used in FrameNet to describe the syntactic valence properties of individual lexical items. In particular, in the annotated sentences in FrameNet, each instance of an example sentence annotated by a frame is accompanied by the GF and PT associated with each of the FEs of that frame filled in that sentence.
FrameNet provides three kinds of GF labels.
External Argument (Ext). In the case of verb LUs, it represents the subject of the LU ("[The physician] performed the surgery" [58]), any constituent that controls the subject of the LU ("[The doctor] tried to cure me"), or a dependent of a governing noun ("We are glad for the [American] decision to provide relief"). In the case of adjective LUs, it is the subject of a copular verb ("[The chair] is red"), or other semantically similar constructions ("We consider [Pat] very intelligent"). In the case of noun LUs, the external argument can be interpreted as the subject of a semantically related verb in a periphrasis ("[He] made a statement to the press").
Object (Obj). The syntactic object of a verb LU ("Voters approved [the stadium measure]").
Dependent (Dep). This is the general grammatical function assigned to adverbs, Prepositional Phrases (PPs), and some other attached constituents, but in our case only PPs are used. In these cases, the PP annotation is attached (between square brackets) to the preposition forming the PP. It can be used for verb LUs ("Give the gun [to the officer]"; PP[to]), adjective LUs ("Lee is certain [of his innocence]"; PP[of]) or noun LUs ("The letter was [to the President]"; PP[to]).
Some of the PT labels that can be found are N (noun), NP (noun phrase), Obj (object) and PPinterrog (PP interrogative).
Only constituents tagged with frame elements are assigned grammatical functions. While target words (LUs) are occasionally tagged with frame elements, they are never assigned a grammatical function.
ReDer rules and new DBPs are created using ReDer rule constructors. Each constructor specifies certain conditions on the annotations associated with a pair of FEs in an example sentence. When the conditions are met, a new DBP is generated and a ReDer rule containing the pair of FEs is created.
The constructors are shown in Figs 9–14. As in the general reification-dereification rule pattern in Fig. 7, the postfixes "-S" and "-O" in the constructors indicate the data associated with the FEs that fill the first and second arguments of the DBP, respectively, or equivalently, the respective subject and object of the resulting RDF triple. The creation of the DBP implies the creation of a dereification rule following the pattern in Fig. 7, with the corresponding frame, frame elements, and DBP substituted in.

Fig. 9. Agent-Verb-Patient ReDer rule constructor and some examples of ReDer rules created.

Fig. 10. Patient-Verb-Agent ReDer rule constructor and some examples of ReDer rules created.

Fig. 11. Agent-Verb-PP ReDer rule constructor and some examples of ReDer rules created.

Fig. 12. Patient-Verb-PP ReDer rule constructor and some examples of ReDer rules created.
The Agent-Verb-Patient constructor in Fig. 9 creates DBPs whose lexical heads are verbs, whose subjects in the KB are agents, and whose objects are patients, thus having a lexical representation in the form of linguistic predicates in active voice. The constructor inverts example sentences that are deemed to be in passive form.
There is no explicit syntactic annotation in FrameNet to indicate whether the verb LUs are evoked in passive form. Therefore, two different heuristics over the annotated grammatical functions and phrase types are used for detecting this.
The Patient-Verb-Agent constructor in Fig. 10 is the converse of the Agent-Verb-Patient constructor: it also creates DBPs whose lexical heads are verbs, but whose subject in the KB is a patient, and whose object is an agent, thus having a lexical representation using the passive voice. Every time the Agent-Verb-Patient constructor is invoked on an example sentence and a pair of FEs, the Patient-Verb-Agent constructor is invoked as well, creating the converse DBP.
The Agent-Verb-PP constructor in Fig. 11 creates DBPs whose lexical heads are verbs, whose subjects in the KB are agents, and whose objects are complements that are contained in a PP in the example sentence. In the DBP label, a new PP is included using the name of the FE-O, following the convention used to name predicates in many LOD KBs (e.g., diedOnDate, isWrittenByAuthor, etc.). However, the preposition in the PP in the example sentence is not always the most appropriate to insert in the DBP label. Therefore, Algorithm 2 is used, where different options are tried in order, with more precise but narrow-scoped ones first.

Fig. 13. Agent-Verb-Noun-PP ReDer rule constructor and some examples of created ReDer rules.

Fig. 14. Agent-Verb-Particle-Noun-PP ReDer rule constructor and some examples of created ReDer rules.
The Patient-Verb-PP constructor (Fig. 12) changes agent with patient with respect to the constructor Agent-Verb-PP, in the same way Patient-Verb-Agent does with respect to the constructor Agent-Verb-Patient. It creates verb-based DBPs whose subjects in the KB are patients instead of agents, and the DBP has a lexical representation using passive voice.
Using only agent and patient as subject of the triple prevents the constructors from forming DBPs that would rarely be useful, like those connecting the time and place, or the place and the cause.
The Agent-Verb-Noun-PP constructor (Fig. 13) and the Agent-Verb-Particle-Noun-PP constructor (Fig. 14) create ReDer rules with DBPs whose heads are nouns, based on noun LU-microframes. In these cases, a verb is needed that takes the noun as an argument, normally as a direct object. Across RDF vocabularies and ontologies, this verb is sometimes made implicit in human-readable IRIs and lexical labels alike. For example, a label mentioning only the noun statement typically presupposes a verb such as make.
The difference between these two constructors is that in the Agent-Verb-Noun-PP constructor (Fig. 13), the noun is part of the object of the verb, while in the Agent-Verb-Particle-Noun-PP constructor (Fig. 14) it is part of a PP with its own preposition.
In both cases, the verb governing the noun is obtained using the same method. For each noun LU in an annotation, the head verb is extracted by parsing the example annotated sentences with the Stanford dependency parser and searching the paths of dependencies indicated in the constructors Agent-Verb-Noun-PP and Agent-Verb-Particle-Noun-PP.6
We use collapsed CC-processed dependencies, version 3.2.0.
The Agent-Verb-Noun-PP constructor contains several possible dependency paths using dependencies of type “dobj” (direct object), “cop” (copula), “nsubj” (nominal subject), and “prep” (preposition).
The Agent-Verb-Particle-Noun-PP constructor fires in cases of phrasal verbs, where the head verb must be extracted with a particle.

Algorithm 2. The algorithm used to select the preposition.
The Subject-Copula-Adjective-PP constructor in Fig. 15 creates adjective-based DBPs using the copular verb “to be”.

Fig. 15. Subject-Copula-Adjective-PP ReDer rule constructor and some examples of created ReDer rules.
With the rules obtained with the process above, the same DBP can be associated with different reified patterns (i.e., pairs of frame elements in a given LU-microframe), owing to different senses or syntactic frames for a given verb – for example the transitive and intransitive syntactic frames for ergative verbs such as to break. This would conflate different senses, and if the reification and the dereification directions of the rules were chained, it would logically entail different pairs of frame elements, which would not be sound. Furthermore, a given reified pattern can also produce different DBPs, which would lead to redundancy. To achieve the idempotency mentioned earlier, a DBP should not be connected to more than one reified pattern (i.e. not present in more than one ReDer rule). To avoid redundancy, a reified pattern should not be connected to more than one DBP (ditto). Therefore, it is necessary to find a one-to-one matching between DBPs and reified patterns.
For alternative applications where the one-to-one restriction is not required, the full set of extracted rules could be used instead.
Evaluation
In this section, we evaluate the results of the methods used to create the FrameBase schema (Section 4) as well as some practical examples resulting from the integration of knowledge (Section 6).
First, Section 5.1 presents the evaluation of the FrameNet-WordNet mapping described in Section 4.1. Then, Sections 5.2 and 5.3 present our findings for the methods for constructing the schema hierarchy (Section 4.2) and the construction of the ReDer rules (Section 4.4). These two sets of results (summarized in Table 3) cover all the parts of the schema that are created automatically, and since the original resources (FrameNet and WordNet) are created manually, these results provide a complete evaluation of the quality of the FrameBase schema with respect to the standard of human-level annotations.
Table 3. Quality measures for the FrameBase schema for intra-cluster pairs of microframes, verb-based ReDer rules, and noun-based ReDer rules. Nuanced correctness is a variable collected over correct elements that reflects whether the element is perfectly accurate (perfect synonymy for pairs of microframes, readability for rules).
Table 4. Comparison of FrameBase's FrameNet–WordNet mapping to state-of-the-art approaches in terms of precision, recall, F1, and accuracy, from the original papers. †Mappings from WordNet 1.6 to WordNet 3.0 are used to convert from the MapNet gold standard.
To evaluate the created schema, the created FrameNet–WordNet mapping has been compared to the gold standard used to evaluate MapNet [66]. This gold standard uses older versions of FrameNet and WordNet, so mappings from WordNet 1.6 to 3.0 [9] had to be applied, removing those with a confidence lower than one, and the few LUs of FrameNet 1.3 that are not contained in FrameNet 1.5 were discarded. 5-fold cross-validation was used for obtaining the results. Table 4 compares the results against state-of-the-art approaches and the scores that they report on the MapNet gold standard. As stated as a goal when setting the cardinality restrictions in Section 4.1, the approach described in Section 4 achieves higher precision (albeit with a very narrow margin) while still maintaining good recall. For this reason, we consider it more appropriate than the previously existing ones to be used in the following steps, because high precision is usually prioritized for tasks related to knowledge representation. It must be noted, however, that there are minor differences since our results in Table 4 are evaluated without the few frames dropped between FrameNet 1.3 and 1.5, and some results [37,66] also without using the inter-version WordNet mapping. However, this also means that our new mapping developed for FrameBase provides results for more recent and updated versions of FrameNet and WordNet.
Table 5. Examples of ReDer rules created and evaluated. "C" stands for "Correct" and "R" for "Readable".
It may be relevant to note that there is in practice an upper bound to precision scores in tasks like this because of the subjective component of any gold standard. The creators of the gold standard [66] report “0.90 as Cohen’s Kappa computed over 192 LU-synset pairs for the same mapping task” by [11]. More generally, Fellbaum & Baker [19] maintain that “both people and automatic systems, when asked to assign tokens in a text to the appropriate senses in dictionaries, find the task difficult and do not agree among themselves”.
The frame hierarchy in the FrameBase schema is based on FrameNet and WordNet and the mapping created between the two resources. It provides 19,376 frames, including 11,939 LU-microframes and 6,418 synset-microframes, all with lexical labels. A total of 18,357 microframes are clustered into 8,145 logical clusters, which are the sets of microframes whose elements are linked by a logical equivalence relation. The size of the schema is 250,407 triples.
The quality of the microframe clusters has been evaluated by asking two independent reviewers to evaluate a random sample of 100 intra-cluster pairs of LU-microframes. Each pair has been annotated with two variables: correctness (1 if the pair is correct, 0 otherwise) and synonymy (only applying when the pair is correct; assigned value 1 if they are WordNet-level synonyms, or 0 if there is a change of nuance higher than that but still having significant semantic overlap). The resulting average correctness and synonymy values are summarized in Table 3.
Reification–dereification rules
Additionally, reification-dereification rules are provided, with one direct binary predicate each, carrying both human-readable IRIs and lexical labels. Of these, 83,790 are verb-based, 3,190 are noun-based, and 7,248 are adjective-based. For evaluating them, the same methodology was used, with two independent human annotators. Two different variables were used for each rule: correctness and readability. A ReDer rule is considered correct if the new name can be interpreted as a relation such that the dereified side is a necessary and sufficient condition of the reified side. A correct rule is considered to be not easily readable if the name of the direct binary predicate contains a preposition that is appropriate for some but not all possible objects, or it is not appropriate for the frame element in the name, or it contains a frame element whose meaning is not obvious for a layperson. For the latter task, the annotators were asked to provide an assessment of whether a layperson could understand certain terms: for instance, "patient" in FrameNet has a different meaning than the usual one in general language.
The obtained average correctness for verb-based and noun-based rules, together with the inter-annotator agreement in terms of Cohen's Kappa, is summarized in Table 3.
The maximum of Cohen’s Kappa is defined as the highest value that it could achieve given the distribution of scores from the raters, and can be useful when interpreting the value obtained for the coefficient [68].
Table 5 provides some examples of the evaluated rules. Rules 1–2 are both correct and readable. Rules 3–4 are correct but not readable. In rule 3, the preposition "in" would be more appropriate. In rule 4, the term "Locus" is too specialized. Rules 5–7 are not correct.
Integrating knowledge from external KBs
Knowledge from other KBs such as Freebase can be integrated using integration rules. In practice, these result in a graph transformation from the source KB to FrameBase. Formally, these are rules whose antecedent and consequent are graph patterns sharing some variables. Whenever there is an instantiation of variables that, applied to the antecedent, returns a subset of the source KB, then the consequent, after having applied the same instantiation of variables, can be added to the FrameBase instance data (the ABox in the jargon of description logics).
When the sources are in RDF, the most obvious choice for implementing integration rules is using SPARQL CONSTRUCT queries, with the WHERE clause containing the antecedent and the CONSTRUCT clause containing the consequent. Additionally, SPARQL CONSTRUCT queries support predicates and logical operators that allow for imposing additional logical conditions on the WHERE clause to match the original KB (i.e., for the rule to be fired). For non-RDF sources, a simple choice would be applying an off-the-shelf RDF converter8 to pre-process the source, after which SPARQL CONSTRUCT queries can still be used. The SPARQL examples in this and the next sections use the following prefixes.
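The exact declarations are assumed here; a minimal set consistent with the IRIs used in this article would be:

  PREFIX rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
  PREFIX rdfs:  <http://www.w3.org/2000/01/rdf-schema#>
  PREFIX owl:   <http://www.w3.org/2002/07/owl#>
  PREFIX :      <http://framebase.org/ns/>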
In Section 6.1, some examples of manually built integration rules are presented for integrating events from two different sources: DBpedia and schema.org. Besides showing concrete examples of rules, the section provides an assessment of the expressiveness of the FrameBase schema in its current state, by reviewing to which extent external knowledge can be integrated when using manually built rules. It also introduces a basic typology of integration rules. These are important steps before reviewing the task of integrating knowledge automatically.
Subsequently, Section 6.2 discusses the creation of ReDer rules based on existing work. Finally, in Section 6.3, we provide examples of queries that make use of the schema.
Manually built integration rules
We will first show two simple examples of integration rules that integrate knowledge from Freebase. They belong to two basic rule types that we label Class-Frame and Property-Frame, which will later serve as the basis for constructing more complex rules.
Class-Frame integration rules integrate a class from the source KB into a frame in FrameBase, and the outgoing properties from the external class into FE properties. The following example integrates a class from Freebase.
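A sketch of such a rule follows; the source class and properties are hypothetical Freebase-style identifiers, and the frame and FE names stand in for the actual FrameBase IRIs:

  CONSTRUCT {
    ?m rdf:type frame:Forming_relationships .   # the source instance becomes a frame instance
    ?m fe:Partner_1 ?p1 .                       # outgoing properties become FE properties
    ?m fe:Partner_2 ?p2 .
    ?m fe:Time ?t .
  }
  WHERE {
    ?m rdf:type fb:people.marriage .            # hypothetical source class (a Freebase CVT)
    ?m fb:people.marriage.spouse ?p1 .
    ?m fb:people.marriage.spouse ?p2 .
    ?m fb:people.marriage.from ?t .
    FILTER (?p1 != ?p2)
  }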
Property-Frame integration rules translate a property from the source KB into a frame and two FEs in FrameBase. The structure is similar to that of ReDer rules, but the property in the antecedent is not a FrameBase DBP, although the similarity of the structure will be exploited later to automatically produce integration rules of this type from existing ReDer rules. The following example integrates a property from Freebase.
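A sketch with a hypothetical source property; the blank node in the CONSTRUCT clause yields a fresh frame instance for every match:

  CONSTRUCT {
    _:f rdf:type frame:Leadership .   # new frame instance
    _:f fe:Leader ?leader .
    _:f fe:Governed ?org .
  }
  WHERE {
    ?org src:headedBy ?leader .       # hypothetical external property
  }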
The next example pertains to the integration of events from DBpedia. Using RDFS inference, the substitutions for the classes and properties in the antecedents of the rules also cover their subclasses and subproperties, so a single rule can capture several related structures from the source.
Below, we also present the translation of the class Event from schema.org.
We omit the subclasses here, but these have very few genuine properties, and therefore the specialization is relatively simple. Besides, the taxonomy of schema.org events has some inconsistency issues that make its use complex: the Event class is defined as capturing events such as concerts, lectures, and festivals, with properties such as "typical age range", but there are sub-events for which such properties hardly make sense.
The only extension of the FrameBase schema used for these examples was a single additional frame.
Some integration rules, namely Property-Frame rules as well as some complex Class-Frame rules, declare new instances in the CONSTRUCT clause. This can be achieved either by means of anonymous nodes, as in the examples, or by coining new, essentially skolemized IRIs. In any case, the integration rules do not link or merge frame instances that are created by different rules or different instantiations of the same rule but should correspond to the same n-ary relation. This is a later step for which an out-of-the-box entity de-duplicator [39,42,63] could be applied. What the integration rules provide are instances of the same type that actually represent the same thing (event, situation, process, i.e. frame), so that the entities can actually be linked and – if the de-duplication process has high enough precision – merged, so the full efficiency indicated in Table 2 is achieved. This would not be possible with the heterogeneous models in Fig. 1.
Automatically built integration rules
We have recently been able to devise basic Class-Frame and Property-Frame integration rules using automatic methods guided by KB-specific heuristics, which have been tested for Freebase and YAGO [56]. To build Class-Frame integration rules, a support vector machine is used to classify pairs consisting of an external class and a FrameBase frame class, using lexico-semantic features. The support vector machine is trained with a manually built gold standard for Freebase. In order to increase precision, instead of using the SVM directly, we use the scoring function from the SVM (the distance from the hyperplane) to filter the pairs classified as true, selecting the best candidate for each external class. Then, for each mapped pair of external class and frame, a Class-Frame rule is generated that maps the class's outgoing properties to FE properties of the frame.
Table 6. Number of statements and distinct frame types in the integrated data, from YAGO2s and from Freebase. The numbers in parentheses include the equivalent microframes that can be obtained with RDFS inference.
The method described for creating Property-Frame rules has also been extended by applying a previous canonicalization to the properties from the external KB and creating additional ReDer rules [55], as well as using a more advanced similarity function with a weighted combination of lexical and semantic features, and coreness features of the FEs in FrameNet. The canonicalization addresses certain common types of ambiguity in the names of properties in LOD datasets, like the omission of the verb. For instance, given a property named "author", it is not clear from the name alone if it is meant as has author or as is author of.
The ability to create Property-Frame integration rules towards FrameBase in this way, exploiting its linguistic nature and its corpus of annotations, is especially important. First, because traditional ontology alignment systems cannot produce such complex mappings, as was discussed in Section 2, and therefore their recall will be effectively equal to zero in this task. Second, because the same ontology alignment systems can be re-used to create Class-Frame rules (mapping classes with classes and properties with properties, if the ontology alignment system allows declaring constraints related to the properties’ domains). We will discuss the creation of complex Property-Frame rules in Section 7.
Additionally, a demo system has been developed that allows us to re-use these methods as search and suggestion engines behind an intuitive GUI, enabling human-level accuracy while minimizing the effort for the user [57].
Example queries
FrameBase facilitates novel forms of queries. The query in Fig. 16, for instance, uses reified patterns to find the heads of the World Bank.

Fig. 16. Example query using a reified pattern.
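The query can be sketched as follows (the frame and FE IRIs approximate FrameBase's naming; optional frame elements such as the time are omitted):

  SELECT ?leader WHERE {
    ?f rdf:type frame:Leadership .
    ?f fe:Governed dbr:World_Bank .
    ?f fe:Leader ?leader .
  }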
The results in Table 7 show example instances integrated into the FrameBase schema from both Freebase (rows 1–3, extracted from the second example integration rule above) and YAGO2s (rows 4–5, extracted with a similar integration rule made for YAGO2s) [56].
Table 7. Results from the query.
Alternatively, if the triple patterns for some of the frame elements are made optional, the query also returns results in which those elements are missing.

A DBP from the dereification rules can also be used to obtain the same non-optional results, as illustrated in the query in Fig. 18. Either of the verb-based DBPs leads or heads can be used, because the LU-microframes for these verbs are in the same cluster as the nouns leader and head, and there is a dereification rule between the Leader and Governed FEs for both.

Fig. 17. Microframes from the cluster containing leader and head.
Fig. 18. Example query using a dereified pattern.
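A sketch of such a dereified query, with an illustrative DBP name derived from the verb to head:

  SELECT ?leader WHERE {
    ?leader dbp:heads dbr:World_Bank .   # one triple replaces the whole reified pattern
  }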
Creating complex integration rules
The methods described in Section 6.2 create basic Class-Frame and Property-Frame integration rules. However, as some examples in Section 6.1 illustrate, integration rules can become very complex. In the following, we present instructive examples of complex integration rules.
Complex property-frame integration rules
An advantage of the approach of creating additional ReDer rules driven by FrameBase (Section 6.2) is that it provides richer ReDer rules, but the disadvantage is that, because it is driven by FrameBase, it may have poor recall for real-world datasets, both because of its reliance on FrameNet example sentences and because of FrameNet's non-specialized vocabulary, as opposed to the kind of knowledge present in a particular knowledge base. This problem could be significantly reduced by also updating the similarity function between DBPs and source properties, in order to account for hypernymy and synonymy relations that would allow capturing very specific concepts in source KBs for which hypernyms can be found in FrameBase (for instance, "catalyzesChemicalReaction" from a source KB could match the more general "increasesSpeedOfProcess" created from FrameBase annotations).

Fig. 19. Example of a very complex noun-based ReDer rule.



Complex class-frame integration rules
There are multiple ways in which Class-Frame rules can differ from their basic pattern. We will use the examples in Section 6.1 to illustrate this.
Sometimes, a class integration rule may need to instantiate multiple frames rather than just a single one. We distinguish two main types of this phenomenon. In the first, the instantiated frame instances are connected by FEs. In the second, several frames are evoked separately, without the instances being directly connected by any FE. When these frames describe different perspectives of the same event, there is the possibility that FrameNet links them by means of perspectivization, and therefore FrameBase can infer one from another. In this case, inference is possible because RDFS subclass and subproperty properties are used in FrameBase to reflect the perspectivization relation between frame classes and FEs, respectively. Another possible source of complexity is that FEs can be inverted, in which case the integration rules need to invert the order of the arguments.
Arbitrary combinations of these phenomena are possible, as, e.g., in the rule integrating the Event class from schema.org.
A possible way to address this problem may be by defining a reduced alphabet of transformations over a basic Class-Frame rule, similar to our list above, so that a complex Class-Frame rule can be represented as a basic initial one followed by a sequence of transformations, and this representation can be acquired via supervised learning.
However, the high number of variables involved would mean that any attempt to train a system would face extreme sparsity. Inter-annotator agreement, which is already low for simple integration rules [56], would probably be even lower. Investigating how to produce such genuinely complex rules entirely automatically thus remains an important research challenge.
In the short term, we believe that a combination of automated assistance and user feedback, as provided by user interfaces such as Klint [57], may be the best strategy whenever complex rules are needed and high-quality integration is desired.
Conclusion
FrameBase is a novel approach for integrating knowledge from different heterogeneous sources and connecting it to decades of work from the NLP community. It provides a flexible and homogeneous model to describe n-ary relations, which combines efficiency and expressiveness, and is based on a linguistically sound foundation. The ties with natural language can be exploited to automatically integrate knowledge from external sources. FrameBase opens up several new research directions, which we enumerate next.
Integrating additional sources Either using a unified approach [56] or focusing on Property-Frame rules and combining them with existing ontology alignment systems [55], additional sources could be integrated. Both generic KBs such as Wikidata [16] and domain-specific ones such as from the biomedical domain could be incorporated.
Interfacing from natural language Due to its use of linguistic resources for ontological purposes, FrameBase has significant potential for text mining and other natural language related tasks. Both pure Semantic Role Labelling (SRL) systems for FrameNet such as SEMAFOR [8] as well as text-to-ontology systems such as FRED [48] and Pikes [7] could be adapted to produce FrameBase data from natural language text. Similar methods could also enable question answering [38]. For example, for the example in Fig. 16 in Section 6.3, given the question “Who has been the head of the World_Bank?”, the SRL tool SEMAFOR [8] successfully extracts the frame Leadership with lexical unit head.noun and frame elements Governed and Leader. Based on this, and after a named entity disambiguator such as AIDA [30] has matched World_Bank to the entities in the KBs, a structured query can easily be built. Although accurate semantic role labeling is still very challenging, semantics has become one of the largest research areas in natural language processing and thus FrameBase can benefit from progress made in this area in the future.
Natural language generation FrameBase also offers opportunities for natural language generation from KBs. Dereification rules can be interpreted as syntactic templates [69] for simple English sentences without subordinate clauses. For instance, the label of a verb-based DBP can be read directly as a subject-verb-object sentence template.
Implementing virtual querying Currently, the integration rules for integrating source KBs into FrameBase have been implemented as SPARQL CONSTRUCT queries applied over the sources’ data, which can be used to materialize the integrated knowledge. An alternative implementation would involve virtual querying: using the integration rules to provide FrameBase-adapted virtual views of the source KBs. This would make it possible to re-use existing SPARQL endpoints from the different sources and enable access to the most recent version of the source data.
Further information Details and more information about FrameBase are available at http://framebase.org. The FrameBase data is freely available under a Creative Commons Attribution 4.0 International license (CC-BY 4.0).
Acknowledgements
This research was partially funded by the European Union Seventh Framework Programme (FP7/2007–2013) under grant agreement No. FP7-SEC-2012-312651 (ePOOLICE project) as well as by the Danish Council for Independent Research (DFF) under grant agreement no. DFF-4093-00301. Additional funding was provided by China 973 Program Grants 2011CBA00300, 2011CBA00301, and NSFC Grants 61033001, 61361136003, 20141330245.
