Abstract
Open information extraction approaches are useful but insufficient alone for populating
the Web with machine readable information as their results are not directly linkable to,
and immediately reusable from, other Linked Data sources. This work proposes a novel
paradigm, named Open Knowledge Extraction, and its implementation (Legalo) that performs
unsupervised, open domain, and abstractive knowledge extraction from text for producing
machine readable information. The implemented method is based on the hypothesis that
hyperlinks (either created by humans or knowledge extraction tools) provide a pragmatic
trace of semantic relations between two entities, and that such semantic relations, their
subjects and objects, can be revealed by processing their linguistic traces (i.e. the
sentences that embed the hyperlinks) and formalised as Semantic Web triples and ontology
axioms. Experimental evaluations conducted on validated text extracted from Wikipedia
pages, with the help of crowdsourcing, confirm this hypothesis, showing high performance.
A demo is available at
Populating the Semantic Web from natural language text
The vision of the Semantic Web is to populate the Web with machine understandable data so that intelligent agents will be able to automatically interpret its content – just like humans do by inspecting Web content – and assist users in performing a significant number of tasks, relieving them of cognitive overload.
The Linked Data movement [2] realised the first substantiation of this vision by bootstrapping the publication of machine understandable information, mainly taken from structured data (mostly databases) or semi-structured data (e.g. Wikipedia infoboxes). However, a large part of the Web content consists of natural language text, hence a main challenge is to extract as much relevant knowledge as possible from this content, and publish it in the form of Semantic Web triples. This work aims to solve this problem by extracting relational knowledge that is “hidden” in hyperlinks, which can be either defined manually by humans (e.g. Wikipedia pagelinks) or created automatically by Knowledge Extraction (KE) systems (e.g. a KE system can automatically add links to Wikipedia pages or to local datasets of Semantic Web entities).
Current KE systems address the task of linking pieces of text to Semantic Web entities very
well (e.g.
Nevertheless, it is desirable to enrich Web content with other semantic relations than
Cf.
Besides common sense, this hypothesis is also supported by a previous study [33], which describes the extraction of encyclopedic knowledge patterns for DBpedia types, based on links between Wikipedia pages. A user study showed that hyperlinks between Wikipedia pages determine relevant descriptive contexts for DBpedia entities at the type level, which suggests that these links mirror relevant semantic relations between entities.
A hyperlink in a Web page can be produced either by a human or a KE system (e.g., by
linking a piece of text to a Wikipedia page, which in turn refers to a Semantic Web entity,
i.e. a DBpedia entity). If a KE system recognises two or more entities in a sentence, there
is a possibility that such sentence expresses some relation between them. For example, the
following sentence: The New York Times reported that John McCarthy died. He
invented the programming language LISP.
Wikipedia:
stands for
Comparison between relations resulting from extractive and abstractive approaches for
the sentence “
In the Semantic Web era, such factual relations should be expressed as RDF triples where subjects, objects, and predicates have a URI (except for literal objects and blank nodes), and predicates are formalised as RDF/OWL properties, in order to facilitate their reuse and alignment to existing vocabularies, and for example to annotate hyperlinks with RDFa, within HTML anchor tags.
While subjects and objects can be mostly directly resolved through existing public or local Semantic Web entities, predicates are to be defined by performing “paraphrasing”, a summarisation task that abstracts over the text (when needed) in order to design labels that are as close as possible to what a human would design for a Linked Data vocabulary. In this respect, [40] distinguishes between extractive and abstractive summarisation approaches. Extractive methods select pieces of text from the original source in order to define a summary (i.e. they rely only on the available text), while abstractive techniques ideally rely on modelling the text, and then combining it with other resources and language generation techniques for generating a summary. Abstractive methods are usually applied to large documents with the aim of producing a meaningful summary of their content.
This work proposes to apply the guiding principle of abstractive techniques to open information extraction as a novel contribution. Open information extraction refers to an open domain and unsupervised extraction paradigm. Existing open information extraction approaches are mainly extractive, hence showing a complementary nature to what we present in this paper. They mostly focus on breaking text into meaningful fragments for building resources of relational patterns (e.g. PATTY [30],
Knowledge extraction for the Semantic Web should instead include an abstractive step, which
exploits a formal semantic representation of text, and produces output that is compliant
with Semantic Web principles and requirements. The method described in this paper
demonstrates this novel approach, called Open Knowledge Extraction (OKE).
For example, given the sentence:
Notice that this is the output of OIE for this sentence.
The main contributions of this work are:
the introduction of Open Knowledge Extraction (OKE), a paradigm based on unsupervised, open domain, and abstractive knowledge extraction from text for producing directly usable machine readable information;
an implementation of OKE, named Legalo, that, given an English sentence, produces a set of RDF triples representing relevant factual relations expressed in the sentence, the predicates of which are formally defined in terms of OWL axioms;
an evaluation of Legalo performed on a corpus of validated sentences from Wikipedia pages that provide evidence of factual relations. The results have been evaluated with the help of crowdsourcing and the creation of a gold standard, all showing high values of precision, recall, and accuracy;
a discussion highlighting the current limits of the approach and possible ways of improving it, and including an informal comparison of the proposed method with one of the main existing open information extraction tools.
Additionally, the paper includes a brief description of a specific implementation of OKE, specialised for extracting the semantics of Wikipedia pagelinks, which has been evaluated in [39] showing promising results.
The paper is structured as follows: Section 2 introduces a novel paradigm named Open Knowledge Extraction. Sections 3 and 4 describe the implementation of an OKE system, named Legalo, focusing on the method implemented and the pipeline of components, respectively. Legalo has been evaluated with the help of crowdsourcing, as described in Section 5. Section 6 discusses the limits of the method and possible ways to improve it, and informally compares Legalo with Open Information Extraction (OIE) [26]. Section 7 discusses relevant related research work and, finally, Section 8 summarises the contribution of this work and indicates future work.
Introducing Open Knowledge Extraction
According to [12], an Open Information Extraction (OIE) system: “facilitates domain independent discovery of relations extracted from text and readily scales to the diversity and size of the Web corpus”. In other words, OIE revolutionised the information extraction paradigm by introducing unsupervised learning, domain-independence of the extracted relations, and the ability to scale along both the size and the heterogeneity dimensions of the Web. The Open Knowledge Extraction (OKE) paradigm focuses on making the extracted relations directly usable in a Semantic Web context.
An Open Knowledge Extraction (OKE) system is expected to perform unsupervised, open domain, and web scale extraction and to additionally have the following capabilities:
To assess if a natural language sentence provides evidence of a relevant relation between a given pair of entities, which may be identified by hyperlinks; relevant here means that there are enough explicit traces in the sentence to support the existence of a (conceptual) relation;
To generate a predicate for this relation, with a label that is as close as possible to what a human would define for a Linked Data vocabulary;
To formalise this relation as an OWL object property with TBox axioms (conceptual level), as well as to produce ABox axioms (factual level) using that property.
More formally:
(Relevant relation).
Let s be a natural language textual sentence embedding some hyperlinks,
and
An OKE system is able to assess the existence of

Frame-based formal representation for the sentence: “
One of the main contributions of this paper is the implementation (and evaluation, cf. Section 5) of an OKE system, named Legalo.
A demo of Legalo is available at
The method implemented by Legalo is based on six main steps: (i) internal formal representation of the sentence
(abstractive step); (ii) assessment of the existence of a relevant relation between pairs of entities
identified in s, according to the content of the
sentence; (iii) extraction of relevant terms for
the predicate (extractive step); (iv) generation of the predicate label (abstractive
step); (v) formal definition of the
predicate within the scope of its linguistic evidence and formal representation
(abstractive step); (vi) alignment (whenever possible) to existing Semantic Web
properties.
Legalo relies on a set of rules to be applied to a frame-based formal representation G of the sentence s (cf. Definition 2). G is an RDF graph designed following a frame-based approach, where nodes represent entities mentioned in s.
(Frame-based graph).
Let s be a natural language text sentence and
Frame Semantics [15] is a formal theory of meaning: its basic idea is that humans can better understand the meaning of a single word by knowing the relational knowledge associated with that word. For example, the sense of the word buy can be clarified in a certain context or task by knowing about the situation of a commercial transfer that involves certain individuals playing specific roles, e.g. a seller, a buyer, goods, money, etc.
In this work, frames are usually expressed by verbs or other linguistic constructions,
and their occurrences in a sentence are represented as RDF n-ary
relations, all being instances of some type of event or situation (e.g.
The prefix
A formal and detailed discussion of the theory behind the frame-based formal representation
of knowledge extracted from text, as used by Legalo, is beyond the scope of this paper.
This modelling approach and its founding theories are extensively described in [15,32,38]. However, an example may be
useful to convey the intuition behind the theory. Figure 1
shows a frame-based representation of the sentence:
Prefix
To assess if a relevant relation
(Graph path).
(φ assessment: necessary condition).
If
(Assessment of φ: sufficient condition without event occurrences).
In the other case, i.e. the path includes an event occurrence,

Frame-based formal representation for the sentence: “After a move to
Let f be a node of G such that
(Assessment of φ with event occurrences: sufficient condition).
This axiom is based on linguistic typology results (e.g. [7]), by which SVO (Subject-Verb-Object) languages such as English
almost always have an explicit (or explicitable) subject. This subject is formalised in a
frame-based representation of s by means of an agentive role. Based on
this observation, our method assumes that After a move to
While it is correct to state that Chouinard Art Institute and Disney Studios co-participate in an occurrence of the frame “teach”, it is far from straightforward to paraphrase the meaning of this relation. E.g., one might say that Chouinard Art Institute and Disney Studios are both places where Rico Lebrun used to teach, but this paraphrase is not easily reconstructable from the text, and needs a stronger language generation approach, which has not been tackled for the moment. Additionally, such a paraphrase would not be usable for a binary predicate. A way to represent this relation is a generic co-participation relation, which is however too generic to be considered as relevant.
For this reason, the investigation of paraphrases of relations between entities co-participating in an event with oblique roles is left to further study. An interesting analysis of this problem that could suggest new work directions is discussed in [8].

Frame-based formal representation for the sentence: “In February 2009
As far as the generation of
For example, consider the sentence: In February 2009
The Legalo design strategy for generating predicate labels is based on three main
generative rules (GR). The first one concerns the concatenation of the labels that are
used in the shortest path connecting the two nodes, including the labels of the edges and
the labels of the node types in the path. This rule is defined by GR 1. It is important to remark that the path used as a reference for generating
the predicate label is the one connecting the nodes
Given a pair identify the shortest path(s) extract
all labels (matching sentence terms) of the edges in the
path; extract all labels of the most general types of
the nodes that compose the path (if a node is typed by a taxonomy, the most
general type in the taxonomy is extracted), except the types of
concatenate
the extracted labels following their alternating sequence in
Hence, referring to Fig. 3, Legalo will produce a predicate
label
The second rule for generating predicate labels takes into account the possible presence
of an event occurrence in the path connecting the pair
For example, consider the (excerpt of the) frame-based representation of the sentence
“
(Path including an event).
Following GR 1 and applying this additional rule for the selected pair:
leads to a predicate λ = “publish on” for
Additionally, if the right branch of the tree path is of length 1 and the only edge is a
passive role, i.e.
For example, a frame-based representation of the sentence “
(Right branch of tree path with only passive role).
If we apply the additional rules described so far to the pair
(Path including event occurrences).
Given a selected pair extract extract all
edge labels in for each
concatenate the extracted labels following
their alternating sequence in if
The third rule for predicate label generation complements GR 1 and GR 2 by associating VerbNet roles to labels. Such labels have been defined top-down by analysing VerbNet thematic roles and their usage examples. The rule is defined in GR 3.
(Thematic roles labels).
If a path contains a VerbNet thematic role, replace its label with an empty one, unless
the role is associated with a non empty label according to the following scheme:

For example, consider the (excerpt of the) frame-based representation of the sentence
“
(Thematic roles associated with labels).
By applying GR 1, 2 and 3 to the path connecting the pair:
Legalo generates a label λ = “conspire with” for
Formalisation of extracted knowledge

Frame-based formal representation for the sentence: “
Given a textual sentence s and its frame-based formal representation
G, by following the generative rules described in Section 3.3 Legalo generates a label λ for each
relation
The aim of the formalisation step is to favour the reuse of the extracted knowledge by
representing it as RDF triples, by augmenting it with informative annotations and
axiomatisation, and by linking it to existing Semantic Web data. In particular, the
formalisation step addresses the following tasks: producing an RDF triple formally
defining annotating each triple
annotating each triple and
predicate with information about the frame-based formal representation from which
they were extracted.
RDF triples can be used for annotating
hyperlinks, e.g. with RDFa; OWL axiomatisation supports ontology reuse; and scope
annotations (i.e. linguistic evidence and formal representation) support reuse in relation
extraction systems, e.g. relation extraction based on distant supervision [1,27].
Locality of produced predicates. Our method works on the assumption that each generated predicate and its associated formalisation are valid in the conceptual scope identified by the sentence s. This means that s identifies the scope of predicate name definitions, i.e. the namespace of a predicate depends on s. Pragmatically, this is implemented in Legalo by including the checksum of s in the predicate namespace. This strong locality constraint may lead to producing a high number of potentially equivalent properties (i.e. having the same intensional meaning) defined as if they were different. This issue is tackled by formalising all predicates with domain and range axioms having values, i.e. classes, from external (open domain) resources, as well as by keeping the binding between a predicate, its linguistic evidence, i.e. s, and its formal representation source, i.e. G. The latter contains information about the disambiguated senses of the verbs, i.e. frame occurrences, used in s. All these features make it possible, on the one hand, to inspect a specific property in order to understand its meaning, e.g. in case of manual reuse, and, on the other hand, to automatically reconcile predicates by computing a similarity measure based on them. In this paper, we focus on the generative part of the problem, i.e. generating usable labels for predicates and producing their formal definition, while we leave the reconciliation task to future work.
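The locality constraint described above can be illustrated with a minimal sketch. The base namespace, the function names, and the choice of SHA-256 as the checksum are assumptions made for illustration only: the text states that a checksum of s is included in the predicate namespace, but does not say which algorithm or URI scheme Legalo uses.

```python
import hashlib

# Hypothetical base namespace, used only for this illustration.
BASE_NS = "http://example.org/legalo/"


def predicate_namespace(sentence: str, base: str = BASE_NS) -> str:
    """Return a namespace URI that is local to one sentence.

    Assumption: the checksum is a SHA-256 digest of the UTF-8 encoded
    sentence; the actual checksum used by Legalo is not specified here.
    """
    digest = hashlib.sha256(sentence.encode("utf-8")).hexdigest()
    return f"{base}{digest}/"


def predicate_uri(sentence: str, local_id: str, base: str = BASE_NS) -> str:
    """Build the full URI of a predicate generated from `sentence`."""
    return predicate_namespace(sentence, base) + local_id
```

With this scheme, two predicates with the same ID generated from different sentences receive different URIs, which is exactly why a later reconciliation step is needed to detect properties with the same intensional meaning.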
RDF factual statements. For each hyperlink in s
associated with a true assessment of
According to a common Linked Data convention, using the CamelCase notation for OWL object properties makes the first term of the ID start with lower case, e.g. “invent programming language” -> inventProgrammingLanguage.
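The CamelCase convention mentioned above can be sketched as follows. The function name is hypothetical, and the sketch assumes that the generated label is a whitespace-separated sequence of terms.

```python
def to_property_id(label: str) -> str:
    """Turn a predicate label into a lowerCamelCase property ID."""
    words = label.split()
    if not words:
        return ""
    # The first term starts with a lower-case letter ...
    first = words[0][:1].lower() + words[0][1:]
    # ... and every following term starts with an upper-case letter.
    rest = [word[:1].upper() + word[1:] for word in words[1:]]
    return first + "".join(rest)


print(to_property_id("invent programming language"))
```

Run on the label from the text, the sketch prints inventProgrammingLanguage.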
For example, consider the enriched frame-based formal representation of the sentence
OWL property formalisation. For each generated property, Legalo produces
an additional set of OWL axioms that formally define it. The predicate formalisation
states that the predicate is an OWL object property, and includes domain and range axioms,
whose values are defined according to the WiBi types assigned to

Legalo’s triples produced from the sentence: “
As the reader may notice, an additional

The grounding vocabulary used for annotating the generated triples and properties with information about their linguistic and formal representation scope.
Scope annotations. Finally, Legalo annotates all generated properties and triples with information related to the linguistic and formal representation scopes from which they were derived. To this aim, a specific OWL ontology has been defined, named grounding,
The vocabulary can be downloaded from
The first two axioms create an individual of type
Alignment to Semantic Web vocabularies
This step has the goal of aligning the generated properties to existing Semantic Web ones. The idea is to maximise reuse and linking of extracted knowledge to existing Linked Data. Legalo implements a simple string matching technique based on the Levenshtein distance measure for addressing this task. The implementation of more sophisticated approaches for aligning generated properties to existing vocabularies is part of future work. Relevant related work includes ontology matching techniques such as [13] (cf. the Ontology Alignment Evaluation Initiative
Legalo uses three semantic resources for identifying possible targets for property alignment:
In principle, other resources could be added. We chose these three because they allow us both to cover most public Linked Data vocabularies (i.e. LOV and Watson) and to test with automatically generated resources (i.e. NELL).
Legalo is based on a pipeline of components and data sources, executed in the sequence illustrated in Fig. 7.

Pipeline implemented by Legalo for generating Semantic Web properties for semantic annotation of hyperlinks based on their linguistic trace, i.e. the natural language sentence including the hyperlinks. Numbers indicate the order of execution of a component in the pipeline. Edges indicate input/output flows. (*) denotes tools developed in this work, which are part of this paper's contribution.
1. FRED: Semantic Web machine reader. The core component of the system is
FRED [38], a Semantic Web
machine reader able to produce an RDF/OWL frame-based representation of a text. It integrates
the output of several NLP tools, enriches and transforms it by reusing Linguistic Frames
[32], Ontology Design Patterns [20], open data, and various vocabularies. FRED detects
events, roles, and n-ary relations and represents them in an RDF/OWL graph.
It also represents variable discourse referents, such as the variable in the first-order
predication Cat(x) extracted from the sentence The cat is on the
mat; these are formalised as reified individuals, e.g.
All figures depicted in Section 2 show examples of FRED outputs: the reader may want to consider Fig. 3, which shows the RDF/OWL graph for the sentence “In February 2009 Evile began the pre-production process for their second album with Russ Russell” as a representative output of FRED.
2. Entity pair selection. This component is in charge of detecting the
resolved entities and associating them with their lexical surface in s. This
is done by querying FRED text span annotations. Another task of this component is, for each
pair of detected entities
3. RDF/OWL writer. This component is in charge of generating a predicate for each pair of entities received as input from the previous component, by applying the generative rules described in Section 3.3 to its associated path. In addition, this component implements two more modules: the “Property matcher” and the “Formaliser”.
The “Property matcher” is in charge of finding alignments between the generated predicate and existing Semantic Web vocabularies. As described in Section 3.5, three main sources are used for retrieving semantic property candidates. For assessing their similarity with the generated predicate, a string matching algorithm was implemented, which computes a Levenshtein distance [31] between the IDs of the two predicates. This component is not intended to advance the state of the art in ontology matching; its goal is to contribute to a complete implementation of OKE and to provide a possible baseline for comparing results with future improved versions.
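The string matching performed by the “Property matcher” can be sketched as follows. The Levenshtein distance is the standard dynamic-programming formulation; the helper best_alignment, the case-insensitive comparison, and the threshold max_distance are assumptions introduced for illustration, since the acceptance criterion actually used by Legalo is not given in this passage.

```python
from typing import Iterable, Optional, Tuple


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def best_alignment(
    generated_id: str,
    candidate_ids: Iterable[str],
    max_distance: int = 2,  # hypothetical threshold, not taken from the paper
) -> Optional[Tuple[str, int]]:
    """Return the closest candidate property ID within max_distance, if any."""
    best = None
    for candidate in candidate_ids:
        distance = levenshtein(generated_id.lower(), candidate.lower())
        if distance <= max_distance and (best is None or distance < best[1]):
            best = (candidate, distance)
    return best
```

A purely lexical baseline of this kind cannot align synonymous properties whose IDs share no characters (e.g. a generated “marriedTo” and an existing “spouse”), which is consistent with the text's remark that more sophisticated matching is left to future versions.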
Finally, the RDF/OWL writer includes the component “Formaliser”. This component implements the formalisation step of the method (cf. Section 3.4). It is in charge of producing the triples that summarise the relation expressed in s and that can be used for annotating the corresponding hyperlink, of generating OWL axioms defining the domain and range of the generated predicates, and finally of annotating the produced triples and predicates with scope information.
Legalo for typing Wikipedia pagelinks. A specialised version of Legalo for typing Wikipedia pagelinks (Legalo-Wikipedia)
A demo is available at
Briefly, Legalo-Wikipedia takes as input a DBpedia entity URI, and retrieves all its pagelink triples from the Pagelinks DBpedia dataset. For each pagelink triple, it extracts all Wikipedia snippets containing a hyperlink corresponding to the triple by means of a specialised sentence extractor. Then, the subject resolver selects all and only the snippets that contain a lexicalisation of the Wikipedia page subject, by relying on the DBpedia Lexicalisations Dataset.
For example, the wikipage “In 1972, Cobb moved to
“Edited
and published by
Legalo-Wikipedia has been previously evaluated. For the sake of completeness, these results are summarised in Section 5.2 (for additional details, the reader can refer to [39]). With the help of crowdsourcing, an additional, more extensive evaluation of the current implementation of Legalo was performed, which allowed us to better assess its performance and open issues. This section reports the results of this evaluation in terms of precision, recall, and accuracy.
Legalo working hypotheses
Legalo is based on two working hypotheses.
(Relevant relation assessment).
Legalo is able to assess if, given a sentence s, a relevant relation exists
which holds between two entities, according to the content of s:
This means that if s contains evidence of a relevant relation between
Legalo is able to generate a usable
predicate
(Usable predicate
generation).
This section reports the evaluation of Legalo based on the validation of Hypothesis 1 and Hypothesis 2.
Evaluation sample. As evaluation data, a corpus
It should be noted that Legalo addresses all the capabilities of an OKE system (cf.
Section 2), however by using
The evaluation was performed using a subset of
The resulting triples, predicate formalisations, and scope annotations are accessible via a Virtuoso SPARQL endpoint.
Legalo results can be inspected at
There are several works demonstrating that crowdsourcing can be successfully used for
building and evaluating semantic resources [17,34,47]. Following these experiences, Legalo was evaluated with the help
of crowdsourcing. Five different crowdsourced tasks were defined: assessing if a sentence s provides
evidence for the referenced relation (i.e. either “institution” or “education”)
between two given entities assessing if a sentence
s provides evidence for any relation between two given entities
judging if a predicate judging if a predicate
creating a phrase λ that summarises the relation
expressed by the content of s, between two given entities
Tasks 1 and 2 were used for validating Hypothesis 1. The
results of these two tasks were then combined with those from Tasks 3 and 4, for
validating Hypothesis 2. Finally, Task 5 was used for
comparing the similarity between λ values generated by humans and
It is important to remark that Task 1 duplicates the information already available in
The CrowdFlower platform
Number of different workers that performed the crowdsourced tasks
For Tasks 1 and 2, judgements were expressed as “yes” or “no” answers. For Tasks 3 and 4, judgements could be expressed on a scale of three values: Agree (corresponding to a value of 1 when computing relevance measures), Partly Agree (corresponding to a value of 0.5), and Disagree (corresponding to a value of 0). Task 5 was completely open. The confidence measure is provided by CrowdFlower; it measures the inter-rater agreement between workers weighted by their trust values, hence indicating both agreement and quality of judgements at the same time. It is computed as described in Definition 5,
Given a task unit u, a set of possible judgements
Example of confidence score computation for a task unit
Table 3 shows the judgements of three raters on the same
task unit, where possible judgements are “yes” and “no”.
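A trust-weighted confidence score of this kind can be sketched as follows. The formula is an assumption based on the description in the text (agreement between workers weighted by their trust values): the confidence of a judgement is taken to be the sum of the trust values of the workers who gave it, divided by the sum of the trust values of all workers who judged the unit. The function names and the trust values in the usage example are hypothetical and are not the values reported in Table 3.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def confidence_scores(judgements: List[Tuple[str, float]]) -> Dict[str, float]:
    """Map each answer to its trust-weighted share for one task unit.

    `judgements` is a list of (answer, worker_trust) pairs.
    """
    total_trust = sum(trust for _, trust in judgements)
    if total_trust == 0:
        raise ValueError("at least one judgement with non-zero trust is required")
    per_answer = defaultdict(float)
    for answer, trust in judgements:
        per_answer[answer] += trust
    return {answer: trust / total_trust for answer, trust in per_answer.items()}


def aggregate(judgements: List[Tuple[str, float]]) -> str:
    """Select the answer with the highest confidence score."""
    scores = confidence_scores(judgements)
    return max(scores, key=scores.get)


# Hypothetical trust values for three raters judging the same task unit.
unit = [("yes", 0.95), ("yes", 0.89), ("no", 0.98)]
print(confidence_scores(unit), aggregate(unit))
```

Under this assumption, the scores of all answers for a unit sum to 1, and a single highly trusted rater cannot outweigh two raters of comparable trust who agree with each other.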
When aggregating results for a task unit, the judgement with the higher confidence score
is selected. Notice that
Results of Legalo performance in assessing the evidence of relations between entity
pairs in a given sentence s. Performance measures are computed on the
judgements collected in Tasks 1 and 2 based on data from
,
, and
Evaluation of Hypothesis 1
Table 4 shows the results of the evaluation of Hypothesis 1, i.e.
Legalo’s ability to assess if a sentence s provides evidence of a
relation
Results of Legalo performance in producing a usable label for relations between
entity pairs in a given sentence. Performance measures are computed on the judgements
collected in Tasks 3 and 4 based on data from
Legalo’s performance is measured by means of standard metrics: precision, recall, f-measure, and accuracy. With the aim of clarifying how to interpret them we briefly report an informal definition of true/false positive, and true/false negative in the context of Tasks 1 to 4. As for Tasks 1–2, given a sentence s, the crowd would say “yes” if a relevant relation exists between a given subject/object pair, and “no” if it does not. Legalo output means “true” (the relation exists) whenever it produces a relation, while it means “false” (the relation does not exist) whenever it does not. Hence, True positive = the number of (true, yes) pairs, False positive = the number of (true, no) pairs, True negative = the number of (false, no) pairs, False negative = the number of (false, yes) pairs.
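The definitions above translate directly into the standard formulas for the four measures. The following sketch uses hypothetical function names and made-up counts, chosen only to show the computation; they are not the figures of the evaluation.

```python
from typing import Dict, Iterable, Tuple


def confusion_counts(pairs: Iterable[Tuple[bool, str]]) -> Dict[str, int]:
    """Count outcomes from (legalo_output, crowd_answer) pairs.

    legalo_output is True when Legalo produced a relation;
    crowd_answer is "yes" or "no".
    """
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for produced, answer in pairs:
        if produced:
            counts["tp" if answer == "yes" else "fp"] += 1
        else:
            counts["fn" if answer == "yes" else "tn"] += 1
    return counts


def relevance_measures(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Precision, recall, F-measure, and accuracy from the four counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall) if precision + recall else 0.0
    )
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }


# Made-up sample: 8 (true, yes), 2 (true, no), 3 (false, no), 1 (false, yes).
sample = (
    [(True, "yes")] * 8 + [(True, "no")] * 2 + [(False, "no")] * 3 + [(False, "yes")]
)
print(relevance_measures(**confusion_counts(sample)))
```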
The results of the crowdsourced tasks demonstrate that the Legalo method has high
performance (average F-measure = 0.92) on the assessment of
Evaluation of Hypothesis 2
Table 5 shows the results of the evaluation of Hypothesis 2, i.e.
Legalo’s ability to generate usable predicates for summarising relations between
entities, according to the content of a sentence. Task 3 was designed for evaluating this
capability on specific properties, while Task 4 was designed for evaluating this
capability on any property. Each row shows the performance results indicating the type of
relation tackled and the crowdsourced task. The results for “institution” relation and for
“any” relation are computed both on the overall set of results, as well as on a subset
that ensured a higher confidence rate (i.e., only results with
For these tasks, positive values (i.e. when Legalo generates a relation, i.e. “true”) can
be judged by the crowd with “agree”, “partly agree” and “disagree”. Let A
be the number of “agree”,
Finally, Hypothesis 2 was evaluated also by computing a
similarity score between human created predicates and Legalo generated ones. Task 5 was
performed for collecting at least three labels
Given two strings
Also for Hypothesis 2, Legalo shows very satisfactory performance. An impressive result is the high average value of the semantic similarity score (0.80) between user created predicates and Legalo generated ones. This result confirms the hypothesis discussed in [39], saying that the Legalo design strategy was good at producing predicates that are very close to what a human would do when creating a Linked Data vocabulary. In the context of this work, this hypothesis can be extended to the capability to summarise such relations in a way very close to what a generic user would do. This result is very promising from the perspective of evolving Legalo into a summarisation tool, which is one of the envisioned directions of research.
However, by inspecting the different relevance measures, it emerges that while recall is
very high on all tasks (0.90 on average), average accuracy is 0.73 and average precision
is 0.75. Although these are very satisfactory performances, it is worth identifying the
cases that cause the generation of less usable or even bad results. An insight is that
lower precision and accuracy are registered especially in the generation of predicates for
“institution” (accuracy 0.62, precision 0.65) relations and for “any” relations (accuracy
0.71, precision 0.68) while for “education” relations these measures show significantly
higher values (accuracy 0.85, precision 0.92). This turns out to be an important lesson
learnt. In fact, less satisfactory precision seems due to the fact that many “institution”
relations between two entities
Currently, based on this representation, Legalo would generate a predicate by following
the path connecting X to Y, hence without considering
the information on Y. The resulting predicate in this case would be “receive from”, while
a more informative and usable one would clearly be, e.g. “receive degree from”, assuming
that the type of Y is degree. The term degree is an example of a possible
type for Y, however whatever is the type of Y, including its type in the predicate would
make it much more informative and usable. This case can be easily generalised by
exploiting the semantic information about the thematic role that Y plays
in participating in the event “Hassan Husseini became an organiser for the Communist
Party.”
and Legalo would produce the predicate “become for”. By applying the newly suggested generative rule, the generated predicate would instead be the more informative and usable “become organiser for”. This type of observation leads to the definition of additional generative rules that refine Legalo and are likely to improve its precision and accuracy. New rules are implemented based on the data collected from the evaluation results, hence the Legalo demo is constantly evolving.
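The suggested rule can be illustrated with a minimal sketch. It is not Legalo's implementation: the function name `predicate_label` and its three parameters are hypothetical, and the sketch assumes that the verb, the preposition, and the type of the intermediate participant have already been read off the graph.

```python
from typing import Optional


def predicate_label(verb: str, preposition: str,
                    participant_type: Optional[str] = None) -> str:
    """Build a predicate label from a verb and a preposition.

    participant_type is a hypothetical argument standing for the type
    (or role) of the intermediate entity; when it is given, it is placed
    between the verb and the preposition, as the suggested rule proposes.
    """
    parts = [verb]
    if participant_type:
        parts.append(participant_type)
    parts.append(preposition)
    # Normalise case and drop empty fragments before joining.
    return " ".join(p.strip().lower() for p in parts if p and p.strip())


# Without the rule: the participant type is ignored.
print(predicate_label("become", "for"))
# With the rule: the participant type enriches the label.
print(predicate_label("become", "for", "organiser"))
print(predicate_label("receive", "from", "degree"))
```

The only design choice shown here is the position of the type in the label; how the type is selected from the graph is left out on purpose.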
Evaluating the alignment with existing Semantic Web vocabularies The
matching process performed against LOV, NELL [5], and Watson [9] returned a number of
proposed alignments between predicates generated by Legalo and existing properties in
Linked Data vocabularies. In order to accept an alignment and include it in the
formalisation of a Legalo property
All triples, property formalisations, and
alignments can be retrieved at
The alignment procedure was executed on 629 Legalo properties
Evaluation results on the accuracy of the alignment between
A previous study [39] described the evaluation of Legalo-Wikipedia. In this section the results of this evaluation are reported, for the sake of completeness. The main difference between Legalo and its Wikipedia specialised version is that in the latter, the subject of the predicate is always given and there is a high probability that it is correct based on the design principles that guide Wikipedia page writing. It is worth remarking that the evaluation experiment of Legalo-Wikipedia was performed by Linked Data experts, hence comparing the new results with the previous ones provides insights on the usability of the generated predicates, regardless of the expertise of the evaluators.
The evaluation results of Legalo-Wikipedia are published as RDF data and accessible through a SPARQL endpoint.38
The evaluated sample set consisted of 629 pairs
The results of the user-based evaluation of
Kendall’s W measures the inter-rater agreement. Values range from 0 (complete disagreement) to 1 (complete agreement).
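For readers unfamiliar with the measure, the following sketch computes Kendall’s W in its basic form, without correction for ties. It assumes that each rater's judgements have already been converted into a ranking of the same items; the function name `kendalls_w` and the input layout are illustrative assumptions, not the procedure used in the evaluation.

```python
def kendalls_w(rankings):
    """Kendall's coefficient of concordance, without tie correction.

    rankings: list of m lists; each inner list holds the ranks (1..n)
    that one rater assigned to the same n items.
    """
    m = len(rankings)
    if m < 2:
        raise ValueError("at least two raters are required")
    n = len(rankings[0])
    if n < 2 or any(len(r) != n for r in rankings):
        raise ValueError("all raters must rank the same n >= 2 items")
    # Sum of the ranks received by each item across raters.
    rank_sums = [sum(r[i] for r in rankings) for i in range(n)]
    mean_rank_sum = m * (n + 1) / 2
    # Sum of squared deviations of the rank sums from their mean.
    s = sum((x - mean_rank_sum) ** 2 for x in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Three raters who rank three items identically agree completely.
print(kendalls_w([[1, 2, 3], [1, 2, 3], [1, 2, 3]]))
# Two raters with opposite rankings show no concordance.
print(kendalls_w([[1, 2, 3], [3, 2, 1]]))
```

Ratings on a three-level scale typically contain ties, in which case a tie-corrected variant of the formula would be needed.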
Evaluation results on the accuracy of
Dependency on entity linking An aspect that requires improvement is the potential dependency of Legalo's performance on the recognition and linking of DBpedia entities in a sentence: if an entity is not in DBpedia, the relation is not generated. Ideally, this is easily solvable by treating any recognised named entity in a sentence as a potential hyperlink, regardless of whether it has a URI (one can be locally created on the go). The current version of Legalo shows this capability; however, besides the need for rigorous experiments to assess its performance, anecdotal tests show that in some cases this generalisation produces noise in the results. Identifying the causes and handling them is one of our current focuses.
Passive form and skolemised entities Identifying recurrent errors helps us
identify new patterns for improving label generation. However, some recurrent mistakes
are not easily treatable. One such case can be exemplified by the following sentence:
In March 2008, Evile’s track was featured on the Wii, Xbox 360, and
PlayStation 3 video game Rock Band as downloadable content.
Sample sentences involving non-trivial relations, expressed in a generic logical form
Open domain and any kind of text sources The OKE method is meant to support knowledge extraction from text in the open domain. “Open domain” has a twofold interpretation, both valid in this context: (i) any knowledge area: meaning that the approach must be independent from the topics addressed by a text, in other words it should not be tailored to specific languages, vocabularies or terminologies; (ii) any text style: considering that natural language on the Web can have many different writing styles (e.g., a text in a Wikipedia page is certainly cleaner than an average blog text, which in turn has a completely different style from Twitter writing). The implementation presented in this paper shows very promising results, as demonstrated by the performance measured after the execution of a set of crowdsourcing tasks. This evaluation was based on texts extracted from Wikipedia pages, focusing on both specific and general domains, hence suggesting that the tool works well across knowledge areas. Nevertheless, it remains important to investigate how a change of writing style impacts the tool's performance, in order to assess its behaviour when coping with any text source (going beyond the style of Wikipedia text). This investigation is a main action point in the next evolution of this work. It should be noted that Legalo’s main tasks are relation assessment and label generation, while parsing and role labelling, which are at the base of the frame-based graph representation, are embedded in FRED. In other words, Legalo's performance highly depends on the ability of FRED to produce an accurate frame-based representation of the input sentence. This means that minimising the performance bias due to different writing styles requires intervening on FRED components (especially parsing and role labelling).
Alignment to existing Semantic Web properties As for the alignment procedure, there is also room for significant improvement, since this task was addressed by computing a simple Levenshtein distance. More sophisticated alignment methods such as those from the Ontology Alignment Initiative41
As for the alignment recall, it was not possible to compute standard recall metrics because
it is impossible to compute False Negative results i.e., all existing Semantic Web
properties that would match
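As an illustration of the kind of string matching mentioned above, the following sketch computes the Levenshtein distance between a generated predicate label and candidate property labels. The length normalisation, the `threshold` value, and the function names are assumptions made for the example; they are not taken from the alignment procedure used in this work.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Distance normalised to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def propose_alignments(label, candidates, threshold=0.7):
    """Return (candidate, score) pairs above a hypothetical threshold."""
    scored = [(c, similarity(label.lower(), c.lower())) for c in candidates]
    kept = [(c, s) for c, s in scored if s >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


print(levenshtein("kitten", "sitting"))
print(propose_alignments("receive degree from",
                         ["receive degree from", "born in"]))
```

A purely lexical measure of this kind cannot relate synonymous labels that share few characters, which is one reason why more sophisticated alignment methods are worth considering.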
Comparison to open information extraction Extracting, discovering, or summarising relations from text is not an easy task. Natural language is very subtle in providing forms that can express, allude, or entail relations, and syntax offers complex solutions to relate explicitly named entities, anaphoras to mentioned or alluded entities, concepts, and entire phrases, let alone tacit knowledge. Table 8 shows some kinds of (formalisable) relations that can be derived from text.
Some relations extracted by OIE from sample sentences
Two sample extractions by Legalo from the same sentences as in Table 9. For the sake of space we use prefix
A full-fledged analysis of those texts is possible to a certain extent, especially if associated with background knowledge (as FRED does), but the conciseness and directness of hyper-linking based on binary relations is often lost. Hence the importance of tools like Legalo, which are able to reconstruct binary relations from complex machine reading graphs.
It would be natural to compare the results of Legalo to relation extraction systems, but this would require manipulating their output, which is beyond the scope of this work. Here follows an explanation of the difficulties involved.
A state-of-the-art tool like Open Information Extraction (OIE, [26]) applies an extractive approach to relation extraction, and solves the problem by extracting segments that can be assimilated to subjects, predicates, and objects of a triplet. As reported in [19], its accuracy was not very high with the version of OIE implemented as the ReVerb tool, but it has improved considerably in recent times. However, the segments that are extracted, though useful, are not always intuitively reusable as formal RDF properties or individuals. Table 9 shows one case of a very complex segment #3, i.e. “with a West angry over Russia’s actions in Ukraine”, which is a phrase to be further analysed in order to be formalised, and typically leading to multiple triples; and another case of a complex segment #2, i.e. “developed a passion for the native flora of the arid West Darling region identifying”, which is not easily transformable into an RDF property.
The research presented here intends to go beyond text segmentation, by using an abstractive approach that selects paths in RDF graphs in order to generate RDF properties. The difference between the two approaches is striking, and leads to results that are difficult to compare. Table 10 shows two of the examples from Table 9 (the third one has no resolvable entity in the object position), as they are extracted and formalised by Legalo.
For the reasons described above, this work has not attempted a direct comparison in terms of accuracy between OIE and Legalo: it would have needed the transformation and formalisation of OIE text segments into individuals and properties, and arbitrary choices on how to formalise complex segments. In the end, what would be obtained is not a measure of their outputs, but a measure of the authors’ ability to redesign OIE’s output. For those interested in attempts to reuse heterogeneous NLP outputs for formal knowledge extraction, see [19].
The work presented here can be categorised as formal binary relation discovery and labelling from arbitrary walks in connected fully-labelled multi-digraphs, which means in practice that it is not just relation extraction (relations are extracted by FRED [38], and Legalo reuses them), but Legalo discovers complex relations that summarise information encoded in several nodes and edges in the graph (RDF graphs are actually connected, fully-labelled multi-digraphs). It considers certain paths along arbitrary directions of edges, aggregating some of the existing labels, and concatenating them in order to provide property names that are typical of Linked Data vocabularies, and finally axiomatising the properties with domain, range, subproperty, and property chain axioms.
In other words, Legalo tries to answer the following question: what is the relation that links two (possibly distant) entities in an RDF graph?
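The idea of walking a labelled multi-digraph along arbitrary edge directions and concatenating labels can be sketched as follows. This is an illustration under stated assumptions, not Legalo's algorithm: the toy triples, the node names, and the functions `find_path` and `camel_case` are hypothetical, the search is a plain breadth-first search, and all labels on the path are kept, whereas Legalo aggregates only some of them according to its heuristics.

```python
from collections import deque


def find_path(triples, source, target):
    """Breadth-first search that ignores edge direction.

    triples: iterable of (subject, label, object) edges.
    Returns a list of (label, forward) pairs along a shortest path,
    where forward is False when the edge is traversed against its
    direction, or None when the two nodes are not connected.
    """
    adjacency = {}
    for s, p, o in triples:
        adjacency.setdefault(s, []).append((o, p, True))
        adjacency.setdefault(o, []).append((s, p, False))
    queue = deque([(source, [])])
    visited = {source}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return path
        for neighbour, label, forward in adjacency.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, path + [(label, forward)]))
    return None


def camel_case(words):
    """Concatenate words into a camelCase property name."""
    words = [w.lower() for w in words if w]
    if not words:
        return ""
    return words[0] + "".join(w.capitalize() for w in words[1:])


# Toy graph with invented node and edge labels (not FRED output).
toy_triples = [
    ("Hassan_Husseini", "become", "organiser_1"),
    ("organiser_1", "for", "Communist_Party"),
]
path = find_path(toy_triples, "Hassan_Husseini", "Communist_Party")
print(path)
print(camel_case([label for label, _ in path]))
```

Recording the traversal direction of each edge keeps enough information to decide later whether a label should be inverted or dropped, which a real implementation would need.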
There is not much that can be directly comparable in the literature, but work from two related fields can be contrasted with what Legalo does: relation extraction, and automatic summarisation.
The term Open Knowledge Extraction was previously introduced in the context of Artificial Intelligence [10]. The cited work defines OKE as “conversion of arbitrary input sentences into general world knowledge represented in a logical form possibly usable for inference”, hence perfectly compatible with what is defined in this paper. It does not focus on Semantic Web technologies and languages, although it provides further support to our claims and definitions.
The closest works in relation extraction include Open Information Extraction (e.g. [26,30]), relation extraction exploiting Linked Data [24,46], and question answering on linked data [25].
Relation extraction The main antecedent to Open Information Extraction is probably the 1999 Open Mind Common Sense project [43], which adopted an ante-litteram crowdsourcing and games-with-a-purpose approach to populate a large informal knowledge base of facts expressed in triplet-based natural language. The crowd was left substantially free to express the subject, predicate, and object of a triplet, but during its evolution, forms started stabilising, or were learnt by machine learning algorithms. Currently Open Mind is being merged with several other repositories in ConceptNet [21].
Open Information Extraction (aka Machine Reading) as it is currently known in the NLP community performs bootstrapped (i.e. started with learning from a small set of seed examples, and then recursively and incrementally applied to a huge corpus, cf. [11]), open-domain, and unsupervised information extraction. E.g. OIE is based on learning frequent triplet patterns from a shallow parsing of the Web, in order to create a huge knowledge base of triplets composed of text chunks.
This idea (on a smaller scale) was explored in [6], with the goal of resolving predicates to, or to enlarge, a biomedical ontology. On the contrary, OIE extracts binary relations by segmenting the texts into triplets. However, there is usually no attempt to resolve the subjects and objects of those triplets, nor to disambiguate or harmonise the predicates used in the triples. Since predicates are not formally represented, they are hardly reusable for e.g. annotating links with RDFa tags. See Section 6 for a comparison between OIE and Legalo, proving the difficulty of even designing a comparison test.
Overall, Open Information Extraction looks like a component for extractive summarisation (see below). In [30], named entity resolution is used to resolve the subjects and objects, and there is an attempt to build a taxonomy of predicates, which are encoded as lexico-syntactic patterns rather than typical predicates.
Another important Open Information Extraction project is Never Ending Language Learning (NELL) [5], a learning tool that since 2010 processes the web for building an evolving knowledge base of facts, categories and relations. In this case there is a (shallow) attempt to build a structured ontology of recognised entities and predicates from the facts learnt by NELL. In this work, NELL is used in an attempt to align the semantic relations resulting from Legalo to the NELL ontology.
The main difference between approaches such as OIE and NELL, and Legalo is that the former focus on extracting mainly direct relations between entities, while Legalo focuses on revealing the semantics of relations between entities that can be: a) directly linked, b) implicitly linked, c) suggested by the presence of links in Web pages, d) indirectly linked, i.e. expressed by longer paths or n-ary relations. Legalo's novelty also resides in performing property label generation. From the acquisition perspective, Legalo is not bootstrapped, but it is open-domain and unsupervised.
Relation extraction and question answering targeted at Linked Data are quite different from both Open Information Extraction and Legalo, since they are oriented at formal knowledge, but they are not bootstrapped, open domain and unsupervised. They typically use a finite vocabulary of predicates (e.g. from DBpedia ontology), and use their extensional interpretation in data (e.g. DBpedia) to either link two entities recognised in some text (as in [24,46]), or to find an answer to a question, from which some entities have been recognised (as in [25]). Domain is therefore limited to the coverage of the vocabulary, and distant supervision is provided by the background knowledge (e.g. [1]). A growing repository of relationships extracted with this specific distantly supervised approach is sar-graphs [46].
Automatic summarisation Automatic summarisation deserves a short discussion, since ultimately Legalo’s relation discovery can be used as a component for that application task. According to [40], the main goal of a summary is to present the main ideas from one or more documents in less space, typically less than half of one document. Different categorisations of summaries have been proposed: topic-based, indicative, generic, etc., but the most relevant seems to distinguish between “extracts” and “abstracts”. Extracts are summaries created by reusing portions of the input text verbatim, while abstracts are created by reformulating or regenerating the extracted content. An extraction step is needed in any case, but while extracts compress the text by squeezing out unimportant material, and fuse the reused portions, abstracts typically model the text, by accessing external information, applying frames, deep parsing, etc., eventually generating a summary that in principle could contain no word in common with the original text.
Extractive summarisation is now in mass usage, e.g. with snippets provided by search engines. It has serious limits, because size and relevance of the extracts can be questionable and not as accurate as a human may be.
Legalo can be considered closer to abstractive summarisation, since it can be used to build frame-based abstractive summaries of texts, consisting of discovered binary relations, which can then be filtered for relevance. The current implementation of Legalo is not designed in view of abstractive summarisation, therefore it was not evaluated for that task, but it is appropriate to report at least one relevant example of related work in this area.
Opinosis [18] is the state-of-the-art system for abstractive summarisation. It performs graph-based summarisation, generating concise abstractive summaries of highly redundant opinions. It uses a word graph data structure to represent the text, whereas Legalo uses a semantic graph. As the authors say: “Opinosis is a shallow abstractive summariser as it uses the original text itself to generate summaries. This is unlike a true abstractive summariser that would need a deeper level of natural language understanding”. Legalo is indeed based on FRED [38], which provides such a deeper level of understanding.
In order to be considered an abstractive summariser, Legalo will need to be complemented with more capabilities to rank discovered relations across an entire text or even multiple texts, to associate them in a way that final users can make sense of, and to evaluate summaries appropriately. Results from both abstractive summarisation (e.g. [18,23,49]) and RDF graph summary (e.g. [4,37,48]) can be reused for that purpose.
Conclusion and future work
Conclusion This paper presents a novel approach for Open Knowledge Extraction, and its implementation called Legalo, for uncovering the semantics of hyperlinks based on frame-based formal representation of natural language text, and heuristics associated with subgraph patterns. The main novel aspects of the approach are: relevant relation assessment, label generation, Semantic Web property generation and formalisation.
The working hypothesis is that hyperlinks (either created by humans or knowledge extraction tools) provide a pragmatic trace of semantic relations between two entities, and that such semantic relations, their subjects and objects, can be revealed by processing their linguistic traces: the sentences that embed the hyperlinks. Evaluation experiments conducted with the help of a crowdsourcing platform confirm this hypothesis, and show very high performance: the method is able to assess the actual presence of a relation with high precision (average F-measure 0.92), and to generate accurate RDF properties between the hyperlinked entities in single-relation corpora (average F-measure 0.84), the Wikipedia page link corpus (average F-measure 0.84), as well as in the challenging open domain corpus (average F-measure 0.78). The accuracy remains constant across the crowdsourced evaluation and the comparison to the (crowdsourced) gold standard for the open domain corpus. We also provide alignments to Semantic Web vocabularies with a precision value of 0.84.
A demo of the Legalo Web service is available online,13 as well as the prototype dedicated to Wikipedia pagelinks,27 and the binary properties produced in this study can be accessed by means of a SPARQL endpoint.36
Ongoing work Current work concentrates on designing and testing new heuristics, as required by evidence emerging from experiments and tests (cf. e.g. Section 6), on identifying new ways of aligning the relations generated by Legalo to existing ontologies, and on discovering regularities in the relation taxonomies that are increasingly discovered. Additionally, new experiments are under development for assessing Legalo's scalability with respect to the diversity and size of the Web.
Future work The main research line for the future is to apply Legalo to application tasks. An obvious one is a real abstractive summarisation task, both at single-text and multiple-text level, evaluating the results against state-of-the-art tools. The challenges there include at least: (i) managing multiple (and possibly dynamically evolving) Open Knowledge Extraction graphs, (ii) assessing the relevance of discovered relations, and their dependence across the same text, or across multiple texts, and (iii) generating factoid sequences that make sense to a final user of abstractive summaries. Other applications of Legalo are also envisioned, including question answering and textual entailment.
