Abstract
The Web has evolved into a huge mine of knowledge carved in different forms, the predominant one still being the free-text document. This motivates the need for
Keywords
Introduction
The World Wide Web is nowadays one of the most prominent sources of information and knowledge. Despite the constantly increasing availability of semi-structured or structured data, a major portion of its content is still represented in an unstructured form, namely free text: understanding its meaning is a complex task for machines and still relies on subjective human interpretation. Hence, there is an ever-growing need for
In this scenario, the encyclopedia Wikipedia contains a huge amount of data, which may represent the best digital approximation of human knowledge. Recent efforts, most notably
https://googleblog.blogspot.it/2010/07/deeper-understanding-with-metaweb.html
https://www.google.com/intl/en_us/insidesearch/features/search/knowledge.html
https://plus.google.com/109936836907132434202/posts/bu3z2wVqcQc
However, the trustworthiness of a general-purpose KB like Wikidata is an essential requirement to ensure reliable (thus high-quality) content: as a support for their plausibility, data should be validated against third-party resources. Even though the Wikidata community strongly agrees on the concern,6 https://www.wikidata.org/wiki/Wikidata:Referencing_improvements_input, http://blog.wikimedia.de/2015/01/03/scaling-wikidata-success-means-making-the-pie-bigger/

Screenshot of the Wikidata
On the other hand, the DBpedia
https://en.wikipedia.org/w/index.php?title=Germany_national_football_team&oldid=738198938
(1) In Euro 1992, Germany reached the final, but lost 0–2 to Denmark.
Processing sentence (1) would produce a list of
(Germany, defeat, Defeat_01)
(Defeat_01, winner, Denmark)
(Defeat_01, loser, Germany)
(Defeat_01, score, 0–2)
(Defeat_01, competition, Euro 1992)
To fulfill both Wikidata and DBpedia duties, we aim at investigating to what extent can the
To alleviate this, crowdsourcing the annotation task has proven to dramatically reduce the financial and temporal expenses. Consequently, we plan to exploit the novel annotation approach described in [23], which provides full frame annotation in a
In this paper, we endeavor to answer the following research questions, in descending order of specificity:
How can we populate general-purpose KBs like DBpedia and Wikidata, maximizing the use of automatic techniques, while keeping their implementation at a reasonable cost? Is it possible to improve the KB A-Box coverage? To what degree can data-driven approaches contribute to homogenize the KB T-Box?
Knowledge base population
The main research challenge is formulated as a KB population problem: specifically, we tackle how to enrich DBpedia resources with novel statements extracted from the text of Wikipedia articles. We conceive the solution as a machine learning task that leverages the frame semantics linguistic theory [21,22]: we investigate how to recognize meaningful factual parts given a natural language sentence as input. We cast this as a classification activity falling into the supervised learning paradigm. In particular, we focus on the construction of a new extractor, to be integrated into the current DBpedia infrastructure. Frame semantics will enable the discovery of relations that hold between entities in raw text. Its implementation takes as input a collection of documents from Wikipedia (i.e., the corpus) and outputs a structured dataset composed of machine-readable statements.
A-Box coverage
The DBpedia ontology (DBPO) suffers from a known data coverage issue [24,43,46]: ideally, each Wikipedia page should have a 1-to-1 mapping to a DBpedia resource. However, this does not seem to reflect the actual state of affairs: for instance, the
As per the 2015 release, based on the Wikipedia dumps from January 2015.
Fact extraction examples on the Germany national football team article (English Wikipedia)
We argue that both DBPO and the Wikidata ontology (WDO) are exceedingly unbalanced. This is attributable to the collaborative nature of their development and maintenance: any registered contributor can edit them by adding, deleting or modifying their content, after a possible discussion with the user community. At the time of writing this paper (September 2016), the latest DBPO stable release11
In this paper, we focus on Wikipedia as the source corpus and on DBpedia as the target KB. We propose to apply NLP techniques to Wikipedia text in order to harvest structured facts that can be used to automatically add novel statements to DBpedia. Our
The remainder of this paper is structured as follows. We introduce a use case in Section 2, which will drive the implementation of our system. Its high-level architecture is then described in Section 3, which outlines the core modules detailed in Sections 4, 5, 6, 7, 8, and 9. A baseline system is reported in Section 10: this enables the comparative evaluation presented in Section 11, along with an assessment of the T-Box and A-Box enrichment capabilities. In Section 12, we gather a list of research and technical considerations to pave the way for future work. The state of the art is reviewed in Section 13, before our conclusions are drawn in Section 14.
Use case
Soccer is a widely attested domain in Wikipedia: according to the Italian DBpedia, the Italian Wikipedia counts a total of
https://it.wikipedia.org/w/index.php?title=Nazionale_di_calcio_della_Germania&oldid=83055709
The implementation workflow, depicted in Fig. 2 and applied to the Italian-language use case, is as follows.
High level overview of the
Each selected LU will trigger one or more frames together with their FEs, depending on the definitions contained in a given
We proceed with a simplification of the original frame semantics theory with respect to two aspects: (a) LUs may belong to additional POS categories (e.g., nouns), but we focus on verbs, since we assume that they are more likely to trigger factual information; (b) depending on the frame repository, full lexical coverage may not be guaranteed (i.e., some LUs may not trigger any frames), but we expect that ours will; otherwise, LU candidates would not generate any facts.

Since Wikipedia also contains semi-structured data, such as formatting templates, tables, references, images, etc., a pre-processing step is required to obtain the raw text representation only. To achieve this, we leverage a third-party tool, namely the
Given the use case corpus, we first extract the complete set of verbs through a standard NLP pipeline: tokenization, lemmatization and POS tagging. POS information is required to identify verbs, while lemmas are needed to build the ranking.
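The concrete NLP components are not prescribed by our approach; as a rough illustration only, the verb extraction step could be implemented as in the following minimal sketch, which assumes spaCy's Italian model rather than the specific toolkit used in our pipeline:

```python
import spacy

# Minimal sketch of the verb extraction step; spaCy's Italian model is an
# assumption for illustration, not necessarily the tool used in our pipeline.
nlp = spacy.load("it_core_news_sm")

def extract_verb_lemmas(documents):
    """Return the list of verb lemmas observed in the corpus."""
    lemmas = []
    for doc in nlp.pipe(documents):          # tokenization happens here
        for token in doc:
            if token.pos_ == "VERB":         # POS information identifies verbs
                lemmas.append(token.lemma_)  # lemmas are needed for the ranking
    return lemmas
```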
The unordered set of extracted verbs needs to undergo a further analysis, which aims at discovering the most representative verbs with respect to the corpus. As a matter of fact, the lexicon (and thus the LUs) in a text is typically distributed according to Zipf's law,17
https://en.wikipedia.org/w/index.php?title=Zipf%27s_law&oldid=737144288
Two measures are leveraged to generate a score for each verb lemma. We first compute the term frequency-inverse document frequency (TF-IDF) of each verb lexicalization
The ranking is publicly available in the code repository.18
https://github.com/dbpedia/fact-extractor/blob/master/resources/stdevs-by-lemma.json
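The file name in the repository suggests that the ranking relies on the standard deviation of per-document TF-IDF scores; the following is a minimal sketch under that assumption (not the exact implementation), using scikit-learn:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def rank_verbs(lemmatized_documents, verb_lemmas):
    """Rank verb lemmas by the standard deviation of their TF-IDF scores
    across documents (an assumption suggested by stdevs-by-lemma.json)."""
    vectorizer = TfidfVectorizer(vocabulary=sorted(set(verb_lemmas)))
    tfidf = vectorizer.fit_transform(lemmatized_documents).toarray()
    stdev_by_lemma = {
        lemma: float(np.std(tfidf[:, index]))
        for lemma, index in vectorizer.vocabulary_.items()
    }
    return sorted(stdev_by_lemma.items(), key=lambda item: item[1], reverse=True)
```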
Among the top 50 LUs that emerged from the corpus analysis phase, we manually selected a subset of 5 items to facilitate the full implementation of our pipeline. Once the approach has been tested and evaluated, it can scale up to the whole ranking (cf. Section 12 for more observations). First, we performed a set of random choices, alternating between the top 10 and the worst 10 LUs, with the purpose of assessing the validity of the corpus analysis module. Second, we checked whether each random choice fitted the use case domain, and discarded generic ones accordingly, until we reached 5 satisfactory items. Consequently, we picked the following LUs: esordire (to start out), giocare (to play), perdere (to lose), rimanere (to stay, remain), and vincere (to win).
The next step consists of finding a language resource (i.e., a frame repository) to suitably represent the use case domain. Given a resource, we first need to define a relevant subset, then verify that both its frame and FE definitions are a relevant fit. After an investigation of FrameNet and Kicktionary, and to the best of our knowledge, no suitable domain-specific Italian FrameNet or Kicktionary is publicly available, in the sense that neither LU sets nor annotated sentences for the Italian language match our purposes: FrameNet is too coarse-grained to encode our domain knowledge. On the other hand, Kicktionary is too specific, since it is built to model the speech transcriptions of football matches. While it indeed contains some in-scope frames such as
Therefore, we adopted a custom frame repository, maximizing the reuse of the available ones, thus serving as a hybrid between FrameNet and Kicktionary. Moreover, we tried to provide a challenging model for the classification task, prioritizing FE overlap among frames and LU ambiguity (i.e., focusing on very fine-grained semantics with subtle sense differences). We believe this applies not only to machines, but also to humans: we can view it as a stress test for both the machine learning and the crowdsourcing parts. A total of 6 frames and 15 FEs are modeled with Italian labels as follows:
Supervised fact extraction
The first stage involves the creation of the training set: we leverage the crowdsourcing platform
Both frame and FE recognition are cast as multi-class classification tasks: while the former can be related to text categorization, the latter should answer questions such as
Given as input an unknown sentence, the full frame classification workflow involves the following tasks: tokenization, POS tagging, EL, FE classification, and frame classification.
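Conceptually, the workflow chains these components as in the following sketch; every component is a placeholder for the corresponding module, not the actual implementation:

```python
# High-level sketch of the classification workflow for an unseen sentence;
# every callable below stands in for the corresponding module.
def classify_sentence(sentence, tokenize, pos_tag, link_entities,
                      classify_fes, classify_frame):
    tokens = tokenize(sentence)
    pos_tags = pos_tag(tokens)
    entities = link_entities(sentence)          # EL provides candidate chunks
    fes = classify_fes(tokens, pos_tags, entities)
    frame = classify_frame(sentence, fes)
    return frame, fes
```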
The sentence selection procedure allows us to harvest meaningful sentences from the input corpus and to feed the classifier. Therefore, its outcome is two-fold: building a representative training set and extracting relevant sentences for classification. We experimented with multiple strategies, as follows. They all share the same base constraint, i.e., each sentence must contain an LU lexicalization.
First, we note that all the strategies but the baseline entail an evident cost overhead in terms of language resource availability and engineering. Furthermore, given the soccer use case input corpus of
Consequently, we decided to leverage the baseline for the sake of simplicity and for compliance with our contribution claims. We set the interval to
it is known that crowdsourced NLP tasks should be as simple as possible [54]. Hence, it is vital to maximize accessibility, otherwise the job would be too confusing and frustrating, with a considerable impact on quality and execution time;
frame annotation is a particularly complex task [4], even for expert linguists. Therefore, the inter-annotator agreement is expected to be fairly low. Compact sentences minimize disagreement, as corroborated by the average score we obtained in the gold standard (cf. Section 11.1, Tables 4 and 5).
since we aim at populating a KB, we prioritize the precision of statements over recall, for the sake of data quality. As a result, we focus on atomic factual information to reduce the risk of noise;
in light of the above points, EL acts as a surrogate of syntactic parsing, thus complying with our initial claim.
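A minimal sketch of the baseline selection strategy described above follows; the LU lexicalizations and the token interval bounds are placeholders, not the actual values used in the system:

```python
import re

# Illustrative inflected forms of the selected LUs; the real lexicalizations
# are derived from the corpus analysis phase.
LU_LEXICALIZATIONS = {"esordisce", "esordì", "gioca", "giocò", "perse", "vinse"}

def select_sentences(sentences, min_tokens=5, max_tokens=25):
    """Keep sentences containing an LU lexicalization and whose length falls
    within the given token interval (bounds are placeholders)."""
    selected = []
    for sentence in sentences:
        tokens = re.findall(r"\w+", sentence.lower())
        if min_tokens <= len(tokens) <= max_tokens \
                and LU_LEXICALIZATIONS & set(tokens):
            selected.append(sentence)
    return selected
```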
Comparative results of the Syntactic sentence extraction strategy against the Sentence Splitter one, over a uniform sample of a corpus gathered from 53 Web sources, with estimates over the full corpus
Comparative results of the
We still foresee further investigation of the other strategies for scaling beyond the use case. Specifically, we believe that refining the chunker grammar would be the most beneficial approach: POS tagging is already part of the system architecture, thus allowing us to concentrate the engineering costs on the grammar only.
We apply a one-step, bottom-up approach to let the crowd perform a full frame annotation over a set of training sentences. In frame semantics, lexical ambiguity is represented by the number of frames that an LU may trigger. For instance, vincere (to win) conveys
The training set is randomly sampled from the input corpus and contains

Worker interface example.
Cheating is a widespread pitfall of crowdsourcing services: workers are usually rewarded a very low monetary amount (i.e., a few cents) for jobs that can be finalized with a single mouse click. Therefore, the results are likely to be excessively contaminated by random answers. CrowdFlower tackles the problem via
We ask the crowd to (a) read the given sentence, (b) focus on the “topic” (i.e., the potential frame that disambiguates the LU) written above it, and (c) assign the correct “label” (i.e., the FE) to each “word” (i.e., unigram) or “group of words” (i.e., n-grams) from the multiple choices provided below each n-gram. Figure 3 displays the front-end interface of a sample sentence, with Fig. 4 being its English translation.

Worker interface example translated in English.
During the preparation phase of the task input data, the main challenge is to automatically provide the crowd with relevant candidate FE text chunks, while minimizing the production of noisy ones. To tackle this, we experimented with the following chunking strategies:

We surprisingly observed that the full-stack pipeline outputs a large amount of noisy chunks, besides being the slowest strategy. On the other hand, the custom chunker was the fastest one, but still too noisy to be crowdsourced. EL resulted in the best trade-off, and we adopted it for the final task.
The task parameters are as follows:
we set 3 judgments per sentence to enable the computation of an agreement based on majority vote;
the pay amounts to 5$ cents per page, where one page contains 5 sentences;
we limit the task to Italian native speakers only by targeting the Italian country and setting the required language skills to Italian;
the minimum worker accuracy is set to on account of a personal calibration;
the minimum time per page threshold is set to 30 seconds, which allows us to automatically discard a contributor when triggered;
we set the maximum number of judgments per contributor to 280, in order to prevent each contributor from answering more than once on a given sentence, while avoiding the removal of proficient contributors from the task.
The outcomes are summarized in Table 3.
Finally, the crowdsourced annotation results are processed and translated into a suitable format to serve as input training data for the classifier.
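The exact report format is CrowdFlower-specific; the following sketch, with hypothetical column names, only illustrates the kind of post-processing involved (majority vote per chunk, grouped by sentence):

```python
import csv
from collections import Counter

def build_training_set(report_path):
    """Turn aggregated crowd judgments into (sentence, frame, FE chunks)
    training examples; all column names are hypothetical."""
    examples = {}
    with open(report_path, newline="", encoding="utf-8") as report:
        for row in csv.DictReader(report):
            key = row["sentence_id"]
            sentence, frame, chunk = row["sentence"], row["frame"], row["chunk"]
            votes = row["fe_answers"].split("|")
            label, _count = Counter(votes).most_common(1)[0]  # majority vote
            examples.setdefault(key, (sentence, frame, []))[2].append((chunk, label))
    return list(examples.values())
```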
Training set crowdsourcing task outcomes. Cf. Section 6.2.1 for explanations of CrowdFlower-specific terms
We train our classifiers with the following linguistic features, in the form of bag-of-features vectors:
Numerical expressions normalization
During the pilot crowdsourcing annotation experiments, we noticed a low agreement on numerical FEs. This is likely to stem from the interpretation of the FE labels: workers got particularly confused by
The task is not formulated as a classification one; rather, it is carried out via pairs of matching and transformation rules. Nevertheless, we argue it is relevant for the completeness of the extracted facts. Given for instance the input expression tra il 1920 e il 1925 (between 1920 and 1925), our normalizer first matches it through a regular expression rule, then applies a transformation rule complying with the XML Schema datatypes24 We use the
All rule pairs are defined with the programming language-agnostic
https://github.com/dbpedia/fact-extractor/blob/master/date_normalizer/regexes.yml
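As an illustration of the rule-pair mechanism (the actual rules live in regexes.yml and may differ in format and output serialization), a sketch could look as follows:

```python
import re
import yaml

# Illustrative rule pairs; the real ones are defined in regexes.yml and the
# output serialization may differ.
RULES_YAML = """
- regex: 'tra il (\\d{4}) e il (\\d{4})'
  transform: 'start={0}^^xsd:gYear end={1}^^xsd:gYear'
- regex: 'nel (\\d{4})'
  transform: '{0}^^xsd:gYear'
"""

def normalize(expression, rules=yaml.safe_load(RULES_YAML)):
    """Match the expression against each regex rule and apply the paired
    transformation, yielding an XML Schema datatype value."""
    for rule in rules:
        match = re.search(rule["regex"], expression)
        if match:
            return rule["transform"].format(*match.groups())
    return None

print(normalize("tra il 1920 e il 1925"))
# start=1920^^xsd:gYear end=1925^^xsd:gYear
```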
The integration of the extraction results into DBpedia requires their conversion to a suitable data model, i.e., RDF. Frames intrinsically bear n-ary relations through FEs, while RDF naturally represents binary relations. Hence, we need a method to express FE relations in RDF, namely standard reification,28
http://www.w3.org/TR/2004/REC-rdf-primer-20040210/#reification
n-ary relations,29 or named graphs.30
Standard reification is too verbose, since it would require applying Pattern 1 of the aforementioned W3C working group note to
n-ary relations would allow us to build
named graphs can be used to encode provenance or context metadata, e.g., the article URI from where a fact was extracted. In our case, however, the fourth element of the quad would be the frame (which represents the context), thus boiling down to minting
A recent overview [29] highlighted that all the mentioned strategies are similar with respect to query performance. Given as input
We opted for the least verbose strategy, namely n-ary relations. Given sentence (1), classified as a
We add an extra instance type triple to assign an ontology class to the reified frame, as well as a provenance triple to indicate the original sentence:
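For illustration only, mirroring the example triples from Section 1, the n-ary encoding of sentence (1) could be materialized as in the rdflib sketch below; the URIs, namespaces, and property names are hypothetical, not the ones actually minted by the system:

```python
from rdflib import Graph, Literal, Namespace, RDF

RES = Namespace("http://it.dbpedia.org/resource/")   # hypothetical namespaces
ONT = Namespace("http://it.dbpedia.org/ontology/")

g = Graph()
defeat = RES["Defeat_01"]                             # reified frame instance
g.add((RES["Germany_national_football_team"], ONT["defeat"], defeat))
g.add((defeat, ONT["winner"], RES["Denmark_national_football_team"]))
g.add((defeat, ONT["loser"], RES["Germany_national_football_team"]))
g.add((defeat, ONT["score"], Literal("0–2")))
g.add((defeat, ONT["competition"], RES["UEFA_Euro_1992"]))
# extra instance type triple assigning an ontology class to the reified frame
g.add((defeat, RDF.type, ONT["Defeat"]))
# provenance triple pointing back to the original sentence
g.add((defeat, ONT["extractedFrom"],
       Literal("In Euro 1992, Germany reached the final, but lost 0–2 to Denmark.")))

print(g.serialize(format="turtle"))
```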
In this way, the generated statements amount to
It is not trivial to decide on the subject of the main frame statement, since not all frames are meant to have exactly one core FE that would serve as a plausible logical subject candidate: most have many, e.g.,
Besides the fact datasets, we also keep track of confidence scores and generate additional datasets accordingly. Therefore, it is possible to filter out facts that are not considered confident by setting a suitable threshold. When processing a sentence, our pipeline outputs two different scores for each FE, stemming from EL and from the supervised classifier. We merge both signals by calculating the F-score between them, as if they represented precision and recall, in a fashion similar to the standard classification metrics. The global fact score can then be produced via an aggregation of the single FE scores in multiple ways, namely: (a) arithmetic mean; (b) weighted mean based on core FEs (i.e., they have a higher weight than extra ones); (c) harmonic mean, also weighted on core FEs.
The reader may refer to Section 12.5 for a distributional analysis of these scores over the output dataset.
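A minimal sketch of this scoring scheme follows; function names and the core-FE weight are illustrative, not taken from the actual codebase:

```python
def fe_score(linker_score, classifier_score):
    """Merge the EL and classifier confidences as an F-score (harmonic mean)."""
    if linker_score + classifier_score == 0:
        return 0.0
    return 2 * linker_score * classifier_score / (linker_score + classifier_score)

def fact_score(fe_scores, core_flags, strategy="arithmetic", core_weight=2.0):
    """Aggregate per-FE scores into a global fact confidence.
    core_flags marks core FEs, which get a higher (illustrative) weight."""
    weights = [core_weight if core else 1.0 for core in core_flags]
    if strategy == "arithmetic":
        return sum(fe_scores) / len(fe_scores)
    if strategy == "weighted":
        return sum(w * s for w, s in zip(weights, fe_scores)) / sum(weights)
    if strategy == "harmonic":
        nonzero = [(w, s) for w, s in zip(weights, fe_scores) if s > 0]
        if not nonzero:
            return 0.0
        return sum(w for w, _ in nonzero) / sum(w / s for w, s in nonzero)
    raise ValueError("unknown strategy: " + strategy)
```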
Baseline classifier
To enable a performance comparison with the supervised method, we developed a rule-based algorithm that handles the full frame and FE annotation. The main intuition is to map the FEs defined in the frame repository to ontology classes of the target KB: such a mapping serves as a set of rule pairs
Besides that, we exploit the notion of core FEs: this would cater for the frame disambiguation part. Since a frame may contain at least one core FE, we proceed with a

Rule-based baseline classifier
It is expected that the relaxed assignment strategy will not handle the overlap of FEs across competing frames that are evoked by a single LU. Therefore, if at least one core FE is detected in multiple frames, the baseline makes a random assignment for the frame. Furthermore, the method is not able to perform FE classification in case different FEs share the same ontology class (e.g., both
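The following sketch captures the baseline logic described above; the FE-to-class mapping and the core FE sets are hypothetical examples:

```python
import random

# Hypothetical FE-to-ontology-class mapping and core FE definitions.
FE_TO_CLASS = {"Winner": "SoccerClub", "Loser": "SoccerClub",
               "Competition": "SportsEvent"}
FRAME_CORE_FES = {"Defeat": {"Winner", "Loser"}, "Victory": {"Winner"}}

def baseline_classify(linked_entities):
    """linked_entities: (surface form, ontology class) pairs produced by EL."""
    # 1. FE assignment: match each linked entity's class against the mapping.
    fes = []
    for surface, onto_class in linked_entities:
        candidates = [fe for fe, cls in FE_TO_CLASS.items() if cls == onto_class]
        if candidates:
            # FEs sharing the same class cannot be told apart by this baseline.
            fes.append((surface, candidates[0]))
    # 2. Relaxed frame assignment: any frame with at least one detected core FE;
    #    ties across competing frames are broken at random.
    detected = {fe for _, fe in fes}
    frames = [f for f, core in FRAME_CORE_FES.items() if core & detected]
    frame = random.choice(frames) if frames else None
    return frame, fes
```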
We assess our main research contributions through the analysis of the following aspects:
classification performance; T-Box property coverage extension; A-Box statements addition; final fact correctness.
Classification performance
We assess the overall performance of the baseline and the supervised systems over a gold standard dataset. We randomly sampled 500 sentences containing at least one occurrence of our use case LU set from the input corpus. We first outsourced the annotation to the crowd, as for the training set construction, and the results were then manually validated twice by the authors. CrowdFlower provides a report including an agreement score for each answer, computed via majority vote weighted by worker trust: we calculated the average over the whole evaluation set, obtaining a value of 0.916.
With respect to the FEs classification task, we proceed with 2 evaluation settings, depending on how FE text chunks are treated, namely:
Table 4 illustrates the outcomes. FE measures are computed as follows: (1) a true positive is triggered if the predicted label is correct and the predicted text chunk matches the expected one (according to each setting); chunks that should not be labeled are marked with an “O” and (2) are not counted as true positives when predicted correctly, but (3) are indeed counted as false positives in the opposite case. The high frequency of “O” occurrences (circa
Frame elements (FEs) classification performance evaluation over a gold standard of 500 random sentences from the Italian Wikipedia corpus. The average crowd agreement score on the gold standard amounts to 0.916
The frame classification task does not need to undergo chunk assessment, since it copes with the whole input sentence. Therefore, the lenient and strict settings are not applicable, and we proceed with a standard evaluation. The results are reported in Table 5.
Frame classification performance evaluation over a gold standard of 500 random sentences from the Italian Wikipedia corpus. The average crowd agreement score on the gold standard amounts to 0.916

Supervised FE classification normalized confusion matrix, lenient evaluation setting. The color scale corresponds to the ratio of predicted versus actual classes. Normalization means that the sum of elements in the same row must be 1.0.

Supervised frame classification normalized confusion matrix. The color scale corresponds to the ratio of predicted versus actual classes. Normalization means that the sum of elements in the same row must be 1.0.

Supervised FE classification precision and recall breakdown, lenient evaluation setting.

Supervised frame classification precision and recall breakdown.
Furthermore, frames with no detected FEs are classified as “O”, and are thus considered wrong even when the frame prediction is correct.
Figures 7 and 8 respectively plot the FE and frame classification performance, broken down to each label.
Lexicographical analysis of the Italian Wikipedia soccer player sub-corpus
One of our main goals is to extend the target KB ontology with new properties on existing classes. We focus on the use case and argue that our approach will have a remarkable impact if we manage to identify non-existing properties. This would serve as a proof of concept which can ideally scale up to all kinds of input. In order to assess such potential impact in discovering new relations, we need to address the following question:
The corpus analysis phase (cf. Section 4) yielded a ranking of LUs evoking the frames
http://mappings.dbpedia.org/server/ontology/classes/SoccerClubSeason
http://mappings.dbpedia.org/server/ontology/classes/SoccerLeagueSeason
For each of the 7 aforementioned DBPO classes, we computed the amount and frequency of ontology and raw infobox properties by querying the Italian DBpedia endpoint. Results (in ascending order of frequency) are publicly available,33
http://it.dbpedia.org/downloads/fact-extraction/soccer_statistics/
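The exact queries are not reproduced here; one possible way to obtain such per-property usage counts from the Italian DBpedia SPARQL endpoint is sketched below (shown for a single class, which may differ from the queries actually used):

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Sketch of one possible query for per-property usage counts on a DBPO class;
# the statistics in the paper may have been computed with different queries.
endpoint = SPARQLWrapper("http://it.dbpedia.org/sparql")
endpoint.setReturnFormat(JSON)
endpoint.setQuery("""
SELECT ?p (COUNT(*) AS ?usage) WHERE {
  ?s a <http://dbpedia.org/ontology/SoccerPlayer> ;
     ?p ?o .
  FILTER(STRSTARTS(STR(?p), "http://dbpedia.org/ontology/"))
}
GROUP BY ?p
ORDER BY DESC(?usage)
""")
for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["p"]["value"], row["usage"]["value"])
```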

Italian DBpedia soccer property statistics.
First, we observe a lack of ontology property usage in 4 out of 7 DBPO classes, probably due to missing mappings between Wikipedia template attributes and DBPO. On the other hand, the ontology properties have a more homogeneous distribution compared to the raw ones: this serves as an expected proof of concept, since the main purpose of DBPO and the ontology mappings is to merge heterogeneous and multilingual Wikipedia template attributes into a unique representation. On average, most raw properties are concentrated below coverage and frequency threshold values of 0.8 and 4 respectively: this means that roughly
In light of the two analyses discussed above, it is clear that our approach would result in a larger variety and finer granularity of facts than those encoded into Wikipedia infoboxes and DBPO classes. Moreover, we believe the lack of dependence on infoboxes would enable more flexibility for future generalization to sources beyond Wikipedia.
Subsequent to the use case implementation, we manually identified the following mappings from frames and FEs to DBPO properties:
Frames: (
FEs: (
Our system would undeniably benefit from a property matching facility to discover more potential mappings, although a research contribution in ontology alignment is out of scope for this work. In conclusion, we claim that 3 out of 6 frames and 12 out of 15 FEs represent novel T-Box properties.
Relative A-Box population gain compared to pre-existing T-Box property assertions in the Italian DBpedia chapter
Our methodology enables a simultaneous T-Box and A-Box augmentation: while frames and FEs serve as T-Box properties, the extracted facts feed the A-Box part. Out of
To assess the domain coverage gain, we can exploit two signals: (a) the amount of produced novel data with respect to pre-existing T-Box properties, and (b) the overlap with already extracted assertions, regardless of their origin (i.e., whether they stem from the raw infobox or the ontology-based extractors). Given the same Italian Wikipedia dump input dated 21 January 2015, we ran both the baseline and the supervised fact extraction, as well as the DBpedia extraction framework to produce an Italian DBpedia chapter release, thus enabling the coverage comparison.
Table 7 describes the analysis of signal (a) over the 3 frames that are mapped to DBPO properties. For each property and dataset, we computed the amount of available assertions and reported the gain relative to the fact extraction datasets. Although we considered the whole Italian DBpedia KB in these calculations, we observe that it has a generally low coverage with respect to the analyzed properties, probably due to missing ontology mappings. For instance, the amount of assertions is always zero if we analyze the use case subset only, as no specific relevant mappings (e.g., Carriera_sportivo35
https://it.wikipedia.org/w/index.php?title=Template:Carriera_sportivo&oldid=80131828
Table 8 shows the results for signal (b). To obtain them, we proceed as follows.
slice the use case DBpedia subset;
gather the subject-object patterns from all datasets. Properties are not included, as they are not comparable;
compute the patterns overlap between DBpedia and each of the fact extraction datasets (including the confident subsets);
compute the gain in terms of novel assertions relative to the fact extraction datasets.
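A sketch of the pattern overlap computation follows; dataset loading is assumed to yield (subject, property, object) triples, and the gain formula reflects one plausible reading of "relative to the fact extraction datasets":

```python
def so_patterns(triples):
    """Drop the property, keeping only (subject, object) patterns."""
    return {(s, o) for s, _p, o in triples}

def overlap_and_gain(dbpedia_triples, extracted_triples):
    dbpedia_so = so_patterns(dbpedia_triples)
    extracted_so = so_patterns(extracted_triples)
    overlap = len(dbpedia_so & extracted_so)
    novel = len(extracted_so - dbpedia_so)
    # one plausible reading of the relative gain: share of novel patterns
    gain = novel / len(extracted_so) if extracted_so else 0.0
    return overlap, novel, gain
```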
The A-Box enrichment is clearly visible from the results, given the low overlap and high gain in all approaches, despite the rather large size of the DBpedia use case subset, namely
Overlap with pre-existing assertions in the Italian DBpedia chapter and relative gain in A-Box population
We estimate the overall correctness of the generated statements via an empirical evaluation over a sample of the output dataset. In this way, we are able to conduct a more comprehensive error analysis, thus isolating the performance of those components that play a key role in the extraction of facts: the frame semantics classifier, the numerical expression normalizer, and an external yet crucial element, i.e., the entity linker.
To do so, we randomly selected 10 instances for each frame from the supervised dataset and retrieved all the related triples. We excluded instance type triples (cf. Section 8), which are directly derived from the reified frame ones. Then, we manually assessed the validity of each triple element and assigned it to the component responsible for its generation. Finally, we checked the correctness of the whole triple.
More formally, given the evaluation set of triples
Fact correctness evaluation over 132 triples randomly sampled from the supervised output dataset. Results indicate the ratio of correct data for the whole fact (
generic dates appearing without years (as in the 13th of August) are resolved to their Wikipedia page.36
https://en.wikipedia.org/w/index.php?title=August_13&oldid=738125874
country names, e.g., Sweden, are often linked to their national soccer team or to the major national soccer competition. This seems to mislead the classifier, which assigns a wrong role to the entity, instead of
the generic adjective Nazionale (national) is always linked to the Italian national soccer team, even though the sentence often contains enough elements to understand the correct country;
some yearly intervals, e.g., 2010–2011 are linked to the corresponding season of the major Italian national soccer competition.
Unfortunately, the linker tends to assign fairly high confidence to these matches, and so does the classifier, which assumes correct entity linking. This leads to many assertions with undeservedly high scores and underlines how important EL is in our pipeline.
We pinpoint and discuss here a list of notable aspects of this work.
LU ambiguity
We acknowledge that the number of frames per LU in our use case repository may not be sufficient to cover the potentially higher LU ambiguity. For instance, giocare (to play) may trigger an additional frame depending on the context (as in the sentence to play as a defender); esordire (to start out) may also trigger the frame
Manual intervention costs
Despite its low cost, we admit that crowdsourcing does not conceptually bypass the manual effort needed to create the training set: workers are indeed human annotators. However, we argue that the price can decrease even further by virtue of automated communication with the CrowdFlower API. This is already accomplished in the ongoing
Even though we recognize that the use case frame repository is hand-curated, we would like to emphasize that (a) it is intended as a test bed to assess the validity of our approach, and (b) its generalization should instead maximize the reuse of available resources. This is currently implemented in StrepHit, where we fully leverage FrameNet to look up relevant frames given a set of LUs.
NLP pipeline design
On account of our initial claim on the use of a shallow NLP machinery, we motivate below the choice of stopping at the POS layer. The decision essentially stems from (1) the sentence selection phase, where we investigated several strategies, and (2) the construction of the crowdsourcing jobs, where we concurrently (2a) maximized simplicity to smooth the way for lay workers, and (2b) automatically generated the candidate annotation chunks.
Simultaneous T-Box and A-Box augmentation
The Fact Extractor is conceived to extract factual information from text: as such, its primary output is a set of assertions that naturally feed the target KB A-Box. The T-Box enrichment is an intrinsic consequence of the A-Box one, since the latter provides evidence of new properties for the former. In other words, we adopt a data-driven method, which implies a bottom-up direction for populating the target KB. It is the duty of the corpus analysis module (Section 4) to understand the most meaningful relations between entities from the very bottom, i.e., the corpus. After that, the system proceeds upwards and translates the classification results into A-Box statements. These are already structured to ultimately carry the properties into the top layer of the KB, i.e., the T-Box.
Cumulative confidence scores distribution over the gold standard
Table 10 presents the cumulative (i.e., all FEs and frames aggregated) statistical distribution of confidence scores as observed in the gold standard. If we dig into single scores, we notice that the classifier usually outputs very high values for “O” and
Overall, due to the high presence of “O” chunks (circa
Scaling up
Our approach has been tested on the Italian language, a specific domain, and with a small frame repository. Hence, we may consider the use case implementation as a monolingual closed-domain information extraction system. We outline below the points that need to be addressed for scaling up to multilingual open information extraction. With respect to the language, we rely on training data availability for POS tagging and lemmatization. Moreover, the LUs automatically extracted through the corpus analysis phase should be projected to a suitable frame repository. Concerning the domain, the baseline system requires a mapping between FEs and target KB ontology classes. The supervised classifier needs financial resources for the crowdsourced training set construction, on average 4.79$ cents per annotated sentence; furthermore, it necessitates an adaptation of the query to generate the gazetteer.
Crowdsourcing generalization
With the Wikidata commitment in mind (cf. Section 1), we aim at expanding our approach towards a corpus of non-Wikimedia Web sources and a broader domain. This entails the generalization of the crowdsourcing step. Overall, it has been shown that laymen can execute natural language tasks with reasonable performance [54]. Specifically, crowdsourcing frame semantics annotation has recently been shown to be feasible by [32]. Furthermore, [4] stressed the importance of eliciting non-expert annotators to avoid the high recruitment cost of linguistics experts. In [23], we further validated the results obtained by [32], and reported satisfactory accuracy as well. Finally, [11] proposed an approach to successfully scale up frame disambiguation.
In light of the above references, we argue that the requirement can indeed be satisfied: as a proof of concept, we are working in this direction with StrepHit, where we have switched to a more extensive and heterogeneous input corpus. Here, we focus on a larger set
Related work
We locate our effort at the intersection of the following research areas:
information extraction; KB construction; open information semantification.
Information extraction
Although the borders are blurred, nowadays we can distinguish two information extraction procedures that focus on the discovery of relations holding between entities: relation extraction (RE) and open information extraction (OIE). While they both share the same purpose, their difference lies in the size of the relation set, which is either fixed or potentially unbounded. In other words, the former is based on a pre-defined schema, while the latter is schema-agnostic. It is commonly argued that the main OIE drawback is the generation of noisy data [15,57], while RE is usually more accurate, but requires expensive supervision in terms of language resources [2,55,57].
Relation extraction
RE traditionally takes as input a finite set
Open information extraction
OIE is defined as a function
In general, most efforts have focused on English, due to the high availability of language resources. Approaches such as [20] explore multilingual directions, by leveraging English as a source and applying statistical machine translation (SMT) for scaling up to target languages. Although the authors claim that their system does not directly depend on language resources, we argue that SMT still heavily relies on them. Furthermore, all the above efforts concentrate on binary relations, while we generate n-ary ones: under this perspective,
Knowledge base construction
Under a different perspective, [41] builds on [12] and illustrates a general-purpose methodology to translate FrameNet into a fully compliant Linked Open Data KB via the
Likewise,
OIE output can indeed be considered structured data compared to free text, but it still lacks a disambiguation facility: extracted facts generally do not employ unique identifiers (i.e., URIs), thus suffering from the intrinsic polysemy of natural language (e.g., Jaguar may refer to the animal or to a well-known car brand).
To tackle the issue, [16] propose a framework that clusters OIE facts and maps them to elements of a target KB. Similarly to us, they leverage EL techniques for disambiguation and choose DBpedia as the target KB. Nevertheless, the authors focus on A-Box population, while we also cater for the T-Box part. Moreover, OIE systems are used as black boxes, in contrast to our full implementation of the extraction pipeline. Finally, their relations are still binary, as opposed to our n-ary ones.
The main intuition behind
FRED is a machine reader that harnesses several NLP techniques to produce RDF graphs out of free text. It is conceived as a domain-independent middleware enabling the implementation of specific applications. As such, its scope diverges from ours: we instead deliver datasets that are directly integrated into a target KB. In a fashion similar to our work, it encodes knowledge based on frame semantics and employs EL to mint unambiguous URIs for entities and properties. Furthermore, it relies on the same design pattern for expressing n-ary relations in RDF [30]. As opposed to us, it also encodes NLP tools output via standard formats, i.e.,
Semantic role labeling
In broad terms, the semantic role labeling (SRL) NLP task targets the identification of arguments attached to a given predicate in natural language utterances. From a frame semantics perspective, such activity translates into the assignment of FEs. This applies to efforts such as [34], and tools like
All the work mentioned above (and SRL in general) builds upon preceding layers of NLP machinery, i.e., POS tagging and syntactic parsing: the importance of the latter is especially stressed in [50]. This stands in strong contrast to our approach, where we propose fully bypassing the expensive syntactic step.
Conclusion
In a Web where the profusion of unstructured data limits its automatic interpretation, the necessity of
In this paper, we presented a system that puts into practice our fourfold research contribution: first, we perform (1)
Our work concurrently bears the advantages and leaves out the weaknesses of relation extraction and open information extraction: although we assess it in a closed-domain fashion via a use case (Section 2), the corpus analysis module (Section 4) allows us to discover an exhaustive set of relations in an open-domain way. In addition, we overcome the supervision cost bottleneck through crowdsourcing. Therefore, we believe our approach can represent a trade-off between open-domain high noise and closed-domain high cost.
The
We estimate the validity of our approach by means of a use case in a specific domain and language, i.e., soccer and Italian. Out of roughly
We have started to expand our approach under the Wikidata umbrella, where we feed the
https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References
https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool#How_to_use
For future work, we foresee to progress towards multilingual open information extraction, thus paving the way to (a) its full deployment into the DBpedia Extraction Framework, and to (b) a thorough referencing system for Wikidata.
Acknowledgements
The
