Abstract
Knowledge Graphs are repositories of information that gather data from a multitude of domains and sources in the form of semantic triples, serving as a source of structured data for various crucial applications in the modern web landscape, from Wikipedia infoboxes to search engines. Such graphs mainly serve as secondary sources of information and depend on well-documented and verifiable provenance to ensure their trustworthiness and usability. However, their ability to systematically assess and assure the quality of this provenance, most crucially whether it properly supports the graph’s information, relies mainly on manual processes that do not scale with size. ProVe aims to remedy this: it is a pipelined approach that automatically verifies whether a Knowledge Graph triple is supported by text extracted from its documented provenance. ProVe is intended to assist information curators and consists of four main steps involving rule-based methods and machine learning models: text extraction, triple verbalisation, sentence selection, and claim verification. ProVe is evaluated on a Wikidata dataset, achieving promising results overall and excellent performance on the binary classification task of detecting support from provenance, with
Introduction
A Knowledge Graph (KG) is a large network of interconnected entities, representing their semantic types, properties, and relationships to one another [10,17,25]. The information stored in most KGs can be seen as a set of semantic triples, each formed by a subject, a predicate, and an object [17,67]. KGs internally represent both concrete and abstract entities as labelled and uniquely identifiable nodes, such as
Developed and maintained by ontology experts, data curators, and even anonymous volunteers, KGs have massively grown in size and adoption in the last decade [8,10,17], including as secondary sources of information [62]. That is, they do not originate new information, but take it from authoritative and reliable sources which are explicitly referenced. As such, KGs depend on well-documented and verifiable provenance to ensure they are regarded as trustworthy and usable [68].
Processes to assess and assure the quality of information provenance are thus crucial to KGs, especially measuring and maintaining verifiability, i.e. the degree to which consumers of KG triples can attest these are truly supported by their sources [39,64,68]. However, such processes are currently performed mostly manually [35], with few automation options, and do not scale with size. Manually ensuring high verifiability on vital KGs such as Wikidata and DBpedia is prohibitive due to their sheer size [39]. On Wikidata, for instance, the creation of a provenance verification framework is highly desired by its product management, contributors, and users, and is actively tracked in its Phabricator page.1
ProVe (Provenance Verification) is proposed to assist data curators and editors in handling the upkeep of KG verifiability. It is an automated approach that leverages state of the art Natural Language Processing (NLP) models, public datasets on data verbalisation and fact verification, as well as rule-based methods. ProVe is a pipeline that automatically verifies whether a KG triple is supported by a web page that is documented as its provenance. ProVe first extracts text passages from the triple’s reference. Then, it verbalises the KG triple and ranks the extracted passages according to their relevance to the triple. The most relevant passages have their stances towards the KG triple determined (i.e. supporting, refuting, neither), and finally ProVe estimates whether the whole reference supports the triple.
This task of KG provenance verification is a specific application of Automated Fact Checking (AFC). AFC is a currently well-explored topic of research with several published papers, surveys, and datasets [15,16,30,31,33,46,51,55–57,69–71]. It is generally defined as the verification of a natural language claim by collecting and reasoning over evidence extracted from text documents or structured data sources. Both the verification verdict and the collected evidence are its main outputs. We define AFC on KGs as AFC where verified claims are KG triples. While AFC in general, and also AFC on KGs, mostly take a claim and a searchable evidence base as inputs [14,37,41,52,53], KG provenance verification is further defined by us as AFC on KGs where the evidence source is the triple’s documented textual provenance.
Approaches tackling AFC on KGs through textual evidence are very few. The only works of note in a similar direction, as far as we know, are DeFacto [14,28] and its successors FactCheck [53] and HybridFC [41]. Unlike ProVe, however, they all rely on a searchable document base instead of a given reference and judge triples on a true-false spectrum instead of verifiability. ProVe is also amongst the first approaches to tackle AFC on KG triples with large pre-trained Language Models (LMs), which can be expanded to work in languages other than English and can benefit from Active Learning scenarios.
ProVe is evaluated on an annotated dataset of Wikidata triples and their references, combining multiple types of properties and web domains. ProVe achieves promising results overall (
In summary, this paper’s main contributions are:
A novel pipelined approach to evidence-based Automated Provenance Verification on KGs based on large LMs. This is our main contribution, as the usage of LMs to tackle AFC on KGs as provenance verification, as well as fine-tuning with adjacent datasets and tasks, is novel despite relying on existing models.
A benchmarking dataset of Wikidata triples and references for AFC on KGs, covering a variety of information domains as well as a balanced sample of diverse web domains.
Novel crowdsourcing task designs that facilitate repeatable, quick, and large-scale collection of human annotations on passage relevance and textual entailment at good agreement levels.
These contributions directly aid KG curators, editors, and researchers in improving KG provenance. Properly deployed, ProVe can do so in multiple ways. Firstly, by assisting in the detection of verifiability issues in existing references, bringing them to the attention of humans. Secondly, given a triple and its reference, it can promote re-usability of the reference by verifying it against neighbouring triples. Finally, given a new KG triple entered by editors or suggested by KG completion processes, it can analyse and suggest references. In this paper, ProVe’s applicability to the first use case is tested, with the remaining two tackled in future work.
The remainder of this paper is structured as follows. Section 2 explores related work on KG data quality, mainly verifiability, as well as related approaches to AFC. Section 3 presents ProVe’s formulation and covers each of its modules in detail. Section 4 presents an evaluation dataset consisting of triple-reference pairs, including its generation and its annotation. Section 5 details the results of ProVe’s evaluation. Finally, Section 6 delivers discussions around this work and final conclusions. All code and data used in this paper are available on Figshare2
ProVe attempts to solve the task of AFC on KGs, more specifically KG provenance verification, with the purpose of assisting data curators in improving the verifiability of KGs. Thus, to understand how ProVe approaches this task, it is important to first understand how the data quality dimension of verifiability is currently defined and measured in KGs. We will then explore how state of the art approaches to AFC in general and AFC on KGs tackle these tasks and how ProVe learns or differs from them.
Verifiability in KGs
In order to properly evaluate the degree to which ProVe adequately predicts verifiability, this dimension first needs to be well defined and a strategy needs to be established to measure it given an evaluation dataset. Verifiability in the context of KGs, whose information is mainly secondary, is defined as the degree to which consumers of KG triples can attest these are truly supported by their sources [68]. It is an essential aspect of trustworthiness [40,67,68], yet is amongst the least explored quality dimensions [40,68], with most measurements carried out superficially, unlike those of correctness or consistency [1,23,40,45,48].
For instance, Färber et al. [67] measure verifiability only by considering whether any provenance is provided at all. Flouris et al. [12] look deeper into sources’ contents, but only verify specific and handcrafted irrationalities, such as a city being founded before it had citizens. Algorithmic indicators are not suited to directly measure verifiability, as sources are varied and natural language understanding is needed. As such, recent works [2,39] measure KG verifiability through crowdsourced manual verification, giving crowdworkers direct access to triples and references. Crowdsourcing allows for more subjective and nuanced metrics to be implemented, as well as for natural text comprehension [7,65].
Thus, this paper employs crowdsourcing in order to measure verifiability metrics of individual triple-reference pairs. By comparing a pair’s metrics with ProVe’s outputs given said pair as input, ProVe and its components can be evaluated. Like similar crowdsourcing studies [2,39], multiple quality assurance techniques are implemented to ensure collected annotations are trustworthy [11]. To the best of our knowledge, this is the first work to use crowdsourcing as a tool to measure the relevance and stance of references in regard to KG triples at levels varying from whole references to individual text passages.
Automated fact checking on knowledge graphs
AFC in general
AFC is a topic of several works of research, datasets, and surveys [15,16,30,31,33,46,51,55–57,69–71]. AFC is commonly defined in the NLP domain as a broader category of tasks and subtasks [15,55,69] whose goal is to, given a claim, verify its veracity or support by automatically collecting and reasoning over pieces of evidence. Such claims are often textual, but can also be subject-predicate-object triples [15,55], while evidence is extracted from searchable document or data corpora, such as collections of web pages or KGs. The collected evidence constitutes AFC’s output alongside the claim’s verdict [15,55,69]. While a detailed exploration of individual AFC state of the art approaches is out of this paper’s scope, it is crucial to define their general framework in order to properly cover ProVe’s architecture.
A general framework for AFC has been identified by recent surveys [15,69], and can be seen in Fig. 1. Zeng et al. [69] define it as a multi-step process where each step can be tackled as a subtask. Firstly, a

Overview of a general AFC pipeline. White diamond blocks are documents and objects, and grey square blocks are AFC subtasks. Specific formulations and implementations might of course differ.
AFC mostly deals with text, both as claims to be verified and as evidence documents, due to recent advances in this direction being greatly facilitated by textual resources like the FEVER shared task [57] and its associated large-scale benchmark FEVER dataset [56]. Still, some works in AFC take semantic triples as verifiable claims, either from KGs [22,49] or by extracting them from text. Some also utilise KGs as reasoning structures from where to draw evidence [9,15,50,54,58]. For instance, Thorne and Vlachos [54] directly map claims found in text to triples in a KG to verify numerical values. Both Ciampaglia et al. [9] and Shiralkar et al. [50] use entity paths in DBpedia to verify triples extracted from text. Other approaches based on KG embeddings associate the likelihood of a claim being true to that of it belonging as a new triple to a KG [4,19].
KGCleaner [37] uses string matching and manual mappings to retrieve sentences relevant to a KG triple from a document, using embeddings and handcrafted features to predict the triple’s credibility. Leopard [52] validates KG triples for three specific organisation properties, using specifically designed extractions from HTML content. Both approaches require manual work, cover a limited amount of predicates, and do not provide human-readable evidence.
DeFacto [14] and its successors FactCheck [53] and HybridFC [41] represent the current lineage of state of the art systems on AFC on KGs given a wide searchable textual evidence base. They verbalise KG triples using text patterns and use these verbalisations to retrieve web pages with related content. HybridFC converts retrieved evidence sentences into numeric vectors using a sentence embedding model. It additionally relies on graph embeddings and paths. These approaches score sentences based on relevance to the claim and use a supervised classifier to classify the entire web page as evidence. Despite their good performance, the first two approaches depend on string matching, which might miss more nuanced verbalisations and also entails considerable overhead for unseen predicates. Additionally, all three depend on large searchable document bases from where to retrieve evidence and rely on training on the KG predicates the user wants them to verify. ProVe, on the other hand, covers any non-ontological predicate by using pre-trained LMs that leverage context and meaning to infer verbalisations. We define ontological predicates as those whose meaning serves to structure or describe the ontology itself, such as
Due to its specific application scenario, approaches tackling AFC on KGs differ from the general AFC framework [15,69] seen in Fig. 1. A
Additionally, KG triples are often not understood by the components’ main labels alone. Descriptions, alternative labels, editor conventions, and discussion boards help define their proper usage and interpretation, rendering their meaning not trivial. As such, approaches tackling AFC on KGs rely on transforming KG triples into natural sentences [14,28,53] through an additional
Lastly, evidence document corpora normally used in general AFC tend to have a standard structure or come from a specific source. Both FEVER [56] and VitaminC [47] take their evidence sets from Wikipedia, which has a text-centered and text-rich layout, with FEVER’s even coming pre-segmented as clean individual sentences. Vo and Lee [59] use web articles from snopes.com and politifact.com only. KGs, however, accept provenance from potentially any web domain. As such, unlike general AFC approaches, ProVe employs a
Large pre-trained language models on AFC
Advances towards textual evidence-based AFC, particularly the
Tackling FEVER through pre-trained LMs [30,33,51] and graph networks [31,70,71] represents the current state of the art. While approaches using graph networks (such as KGAT [31], GEAR [71], and DREAM [70]) for
On
As a subtask in AFC on KGs,
Table 1 shows a comparison of ProVe to other AFC approaches mentioned in this section grouped by specific tasks. It showcases the particular subtasks each approach targets, as well as the datasets used as a basis for their evaluation. AFC on KGs through textual evidence is amongst the least researched topics within AFC, and this paper is the first to do so through a KG provenance verification approach. ProVe tackles it through fine-tuned LMs that adapt to unseen KG predicates and is evaluated on a dataset consisting of Wikidata triples with multiple non-ontological predicates.
Comparison between ProVe and others within AFC
KGE = KG Embeddings, DR = Document Retrieval, SS = Sentence Selection, PS = Path Selection, CV = Claim Verification, RE = Relation Extraction, EL = Embedding Learning, CVb = Claim Verbalisation, TA = Trustworthiness Analysis, TR = Text Retrieval.
ProVe consists of a pipeline for AFC on KGs. It takes as inputs a KG triple that is non-ontological in nature and its documented provenance in the form of a web page or text document. It then automatically verifies whether the provenance textually supports the triple, retrieving from it relevant text passages that can be used as evidence. This section presents an overview of ProVe and its task, as well as detailed descriptions of its modules and the subtasks they target.
Overview
From a KG’s set of non-ontological triples
Figure 2 shows a KG triple (taken from Wikidata), its reference to a web page, and ProVe’s processing according to the definitions provided in this section. ProVe extracts the text from

An example of the inputs and outputs of ProVe when applied to a Wikidata triple and its provenance. A triple’s (

Overview of ProVe’s pipeline. The white blocks are artefacts, while the green ones are modules, further detailed in the subsections indicated in the circles.
A modular view of ProVe’s pipeline can be seen in Fig. 3. Its inputs, as previously stated, are a KG triple (
Pairs consisting of the verbalised claim (
ProVe’s workflow differs from the AFC framework seen in Fig. 1. This is due to the particular task ProVe tackles, i.e. the verification of KG triples using text from documented provenance, where triples can have any non-ontological predicate and such provenance can come from varied sources. As detailed in Section 2 and evidenced in Table 1, this is currently a little-studied task compared to others in AFC, posing distinct problems and requiring specific subtasks to be solved. For instance, ProVe does not need to perform either claim detection or document retrieval, as both the claims and the sources are given to it as inputs.
On the other hand, ProVe handles both KG triples and unstructured text with the same model architectures and does not make use of KG paths for evidence. Thus, it needs to convert the triples into text through a claim verbalisation module, akin to most other approaches in this task [14,28,37,53]. As its input references can lead to web pages having any HTML layout and their text does not come pre-segmented (such as with FEVER [56]), ProVe needs a non-trivial text retrieval module so that it can identify informative passages. Finally, KGs are often secondary sources of information and triples should not include conclusions or interpretations from editors. Hence, ProVe needs to consider pieces of evidence in isolation; this lowers ProVe’s reliance on multi-sentence reasoning, as concluding a triple from a combination of multiple text passages should not constitute support in this task. Thus, ProVe first identifies the stances of retrieved evidence individually, aggregating them into a final verdict afterwards. Each of ProVe’s modules is further detailed in the remainder of this section.
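To make this workflow concrete, the sketch below expresses the pipeline’s control flow in Python. Every name and signature here is an illustrative placeholder, not ProVe’s actual API; each module is assumed to be provided as a callable.

```python
# Minimal sketch of ProVe's control flow; all names are hypothetical
# placeholders, not ProVe's actual API.

def prove(triple, reference_url, verbalise, extract_passages,
          score_relevance, classify_stance, aggregate, top_k=5):
    """Predict whether the page at `reference_url` supports `triple`."""
    claim = verbalise(triple)                      # triple -> sentence
    passages = extract_passages(reference_url)     # web page -> passages
    # Sentence selection: rank passages by relevance to the claim.
    scored = sorted(((score_relevance(claim, p), p) for p in passages),
                    reverse=True)[:top_k]
    # Each piece of evidence is assessed in isolation (RTE step) ...
    evidence = [(s, p, classify_stance(claim, p)) for s, p in scored]
    # ... and the individual stances are aggregated into a final verdict.
    return aggregate(evidence), evidence
```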
Claim verbalisation
KG entities and properties have natural language labels that help clarify their meanings, with KGs like Wikidata and Yago also containing multiple aliases and alternative names. However, these entities and predicates are often created prioritising data organisation rather than human understanding, and thus rely heavily on descriptions in order to set out proper usage rules. Many serve as abstract concepts that unite related but not identical concepts, using a very broad main label and more specific aliases. One example of such is Wikidata’s
ProVe’s claim verbalisation module takes as input a KG triple
The function
Amaral et al. [3] use this exact same model to produce the WDV dataset, which consists of verbalised Wikidata triples. They then evaluate its quality with human annotations on both fluency and adequacy. Such evaluations are covered in more detail in Section 5. In WDV, main labels are used as preferred labels for all triple components, despite aliases often representing better choices. ProVe allows editors to manually define the behaviour of
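As a sketch of what this module does, the snippet below verbalises a triple with an off-the-shelf T5 checkpoint from the transformers library. The checkpoint name, task prefix, and triple linearisation format are assumptions made for illustration; ProVe’s actual model is a T5 fine-tuned on WebNLG.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# "t5-base" is a stand-in; ProVe uses a T5 fine-tuned on WebNLG, and
# the linearisation format below is an assumption, not its real input.
tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def verbalise(subject, predicate, obj):
    # Linearise the triple into a flat source string for the encoder.
    source = f"translate Graph to English: <H> {subject} <R> {predicate} <T> {obj}"
    inputs = tokenizer(source, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_length=64, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(verbalise("Douglas Adams", "educated at", "St John's College"))
```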
Text retrieval
In KGs, provenance is documented per individual triple and is often presented as URLs to web pages or text documents. Such references form the basis of KG verifiability and should point to sources of evidence that support their associated KG triples. Additionally, they can come from a huge variety of domains as long as they adhere to verifiability criteria, that is, humans can understand the information they contain. As humans are excellent at making sense of structured and unstructured data combined, KG editors do not need to worry much about how references express their information. Images, charts, tables, headers, infoboxes, and unstructured text can all serve as evidence for the information contained in a KG triple. However, this complicates the automated extraction of such evidence in a standard format that LMs can understand.

Illustration of the text extraction module’s workflow, taking a reference
Rather than only free-flowing text, referenced web pages can have multiple sections, layouts, and elements, making it non-trivial to automatically segment their textual contents into passages. Thus, ProVe employs a combination of rule-based methods and pre-trained sentence segmenters to extract passages. Figure 4 details this process. The module takes as input a reference
As a last step, multiple
ProVe concatenates text segments in this fashion for two reasons. Firstly, meaning can often be spread between sequential segments, e.g. in the case of anaphora. Secondly, HTML layouts might separate text in ways ProVe’s general rules were not able to join, e.g. a paragraph describing an entity in the header. For a trade-off between coverage and sentence length, ProVe defines
Current best approaches to extracting textual content from web pages, often called boilerplate removal, are based on supervised classification [29,60]. As no classification is perfect, these might miss relevant text. ProVe’s text retrieval module aims at maximizing recall by retrieving all reachable text from the web page and arranging it into separate passages by following a set of rules based on the HTML structure. The sentence selection module is later responsible for performing relevance-based filtering. ProVe’s rule-based method can easily be updated with ad hoc cleaning techniques to help treat difficult cases, such as linearisation or summarisation of tables, automated image and chart descriptions, converting hierarchical header-paragraph structures and infoboxes into linear text, etc.
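A simplified sketch of this module is shown below. It assumes a static HTML page (ProVe renders pages with a crawler, which this sketch omits) and NLTK’s sentence tokenizer as the segmenter; the block-tag rules are a rough stand-in for ProVe’s actual rule set.

```python
import requests
from bs4 import BeautifulSoup
from nltk.tokenize import sent_tokenize  # requires nltk.download("punkt")

BLOCK_TAGS = ["p", "li", "td", "th", "h1", "h2", "h3", "blockquote"]

def extract_passages(url, max_window=2):
    """Pull text from block-level HTML elements, segment it into
    sentences, then build sliding windows of up to `max_window`
    consecutive segments (a stand-in for ProVe's rule set)."""
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()                      # drop non-content elements
    segments = []
    for node in soup.find_all(BLOCK_TAGS):
        text = " ".join(node.get_text(" ", strip=True).split())
        if text:
            segments.extend(sent_tokenize(text))
    # Single sentences plus concatenations of consecutive segments,
    # to capture meaning spread across them (e.g. anaphora).
    passages = []
    for n in range(1, max_window + 1):
        for i in range(len(segments) - n + 1):
            passages.append(" ".join(segments[i:i + n]))
    return passages
```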
Sentence selection
As ProVe’s text extraction module extracts all the text in a web page in the form of passages
Following on KGAT’s [31] and DREAM’s [70] approach to FEVER’s sentence selection subtask, ProVe employs a large pre-trained BERT transformer. ProVe’s sentence selection BERT is fine-tuned on the FEVER dataset by adding to it a dropout and a linear layer, as well as a final hyperbolic tangent activation, making outputted scores range from -1 to 1.
This fine-tuned module is thus used to assign scores ranging from -1 to 1 to each claim-passage pair.
ProVe ranks all passages by their relevance scores, keeping the five highest-scoring ones as the evidence set.
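A sketch of the scoring interface is given below. It loads an untrained single-logit BERT head purely to illustrate the input/output contract; ProVe’s actual scorer is fine-tuned on FEVER, and the checkpoint name here is a placeholder.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Untrained single-logit head over bert-base-uncased, for illustration
# only; ProVe's scorer is fine-tuned on FEVER.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1)
model.eval()

def relevance_score(claim, passage):
    """Score a claim-passage pair in (-1, 1) via tanh over the logit."""
    inputs = tokenizer(claim, passage, return_tensors="pt",
                       truncation=True, max_length=256)
    with torch.no_grad():
        logit = model(**inputs).logits.squeeze()
    return torch.tanh(logit).item()

def select_evidence(claim, passages, k=5):
    return sorted(passages, key=lambda p: relevance_score(claim, p),
                  reverse=True)[:k]
```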
Claim verification
As discussed in Section 2.2, claim verification is a crucial subtask in AFC, central to various approaches [15,55], and consists of assigning a final verdict to a claim, be it on its veracity or support, given retrieved relevant evidence. ProVe’s claim verification relies on two steps: first, a pre-trained BERT model fine-tuned on data from FEVER [56] performs RTE to detect the stances of individual pieces of evidence, and then an aggregation considers the stances and relevance scores of all evidence to define a final verdict. As ProVe also uses a BERT model for sentence selection, its approach is similar to that of Soleimani et al. [51]: two fine-tuned BERT models, one for sentence selection and another for claim verification. Although task-specific graph-based approaches [31,70] outperform Soleimani et al.’s, they do so by less than a percentage point (on FEVER score), while explainability for such generalist pre-trained LMs is increasingly researched [6,26,43] by the NLP community.
Recognizing textual entailment
Like sentence selection, claim verification is a well defined subtask of AFC, supported by both the FEVER shared task and the FEVER dataset. FEVER annotates claims as belonging to one of three RTE classes: those supported by their associated evidence (‘SUPPORTS’), those refuted by it (‘REFUTES’), and those wherein the evidence is not enough to reach a conclusion (‘NEI’). As previously mentioned, ProVe is meant to handle KGs as secondary sources of information. Thus, it assesses evidence first in isolation, aggregating assessments afterwards in a similar fashion to other works [33,51].
ProVe’s RTE step is a BERT model fine-tuned on a multiclass RTE classification task. It consists of identifying a piece of evidence’s stance towards a claim using the three classes from FEVER. To fine-tune this model, a labelled training dataset of claim-evidence pairs is built out of FEVER. For each claim in FEVER labelled as ‘SUPPORTS’, all sentences annotated as relevant to it are paired with the claim; such pairs are labelled as ‘SUPPORTS’. The same is done for all claims in FEVER labelled as ‘REFUTES’, generating pairs classified as ‘REFUTES’. For claims labelled as ‘NEI’, FEVER does not annotate any sentence as relevant to them. Thus, ProVe’s sentence selection module is applied to documents deemed relevant to such claims (retrieved in a similar fashion to KGAT [31]) and each claim is paired with all sentences that have relevance scores greater than 0 in regard to them. All such pairings are labelled ‘NEI’. Fine-tuning was carried out for 2 epochs with an AdamW [32] optimizer with 0.01 weight decay. Population Based Training was used to tune learning rate (
Thus, for a verbalisation
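The input/output contract of this step can be sketched as follows; again, the untrained three-label head and checkpoint name are placeholders for the FEVER-fine-tuned model.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["SUPPORTS", "REFUTES", "NEI"]
# Placeholder head; ProVe's RTE model is fine-tuned on pairs built
# from FEVER as described above.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
rte = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3)
rte.eval()

def stance_probabilities(claim, evidence):
    """Return (P_SUPPORTS, P_REFUTES, P_NEI) for one piece of evidence."""
    inputs = tok(claim, evidence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = rte(**inputs).logits
    return torch.softmax(logits, dim=-1).squeeze().tolist()
```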
Stance aggregation
After classifying the stance and relevance of each individual piece of evidence, ProVe aggregates them into a final verdict for the whole reference. Three aggregation strategies are considered:
A weighted sum of the stance probabilities of each piece of evidence, weighted by their relevance scores;
The rule-based strategy adopted by Malon et al. [33] and Soleimani et al. [51], which classifies a triple-reference pair based on the stances of its individual pieces of evidence;
A classifier, trained on an annotated set of triple-reference pairs.
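As an illustration, a minimal version of the weighted-sum strategy (the first method above) might look as follows; the exact weighting and decision thresholds used by ProVe are assumptions here.

```python
def weighted_sum_verdict(stances, relevances, threshold=0.5):
    """Sketch of the weighted-sum aggregation. `stances` holds one
    (P_SUPPORTS, P_REFUTES, P_NEI) tuple per piece of evidence and
    `relevances` the matching relevance scores in (-1, 1). The
    weighting and threshold are illustrative assumptions."""
    weights = [(r + 1) / 2 for r in relevances]  # shift into [0, 1]
    total = sum(weights) or 1.0
    support = sum(w * s[0] for w, s in zip(weights, stances)) / total
    refute = sum(w * s[1] for w, s in zip(weights, stances)) / total
    if support >= threshold and support > refute:
        return "SUPPORTS", support
    if refute >= threshold:
        return "REFUTES", support
    return "NEI", support
```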
Reference evaluation dataset
This section presents and describes the dataset used to evaluate ProVe: Wikidata Textual References (WTR). WTR is mined from Wikidata, a large scale multilingual and collaborative KG, produced by voluntary anonymous editors and bots, and maintained by the Wikimedia Foundation [61]. In Wikidata, triples should, except for rare exceptions, have one or more references to their provenance [61]. Over
Each triple-reference pairing in WTR is annotated both at evidence-level and at reference-level. Evidence-level annotations are provided by crowd-workers and describe the stance that specific text passages from the reference display towards the triple. Reference-level annotations are provided by the authors and describe the stance the whole referenced web page displays towards the triple. Evaluation, described in Section 5, consists of comparing ProVe’s final class (
Dataset construction
Wikidata has been chosen as the source for ProVe’s evaluation dataset, as it contains over a billion triples that explicitly state their provenance, pertain to various domains, and are accompanied by aliases and descriptions that greatly aid annotators. Since many references in Wikidata can be automatically verified through API calls, as showcased by Amaral et al. [2], WTR is built focusing on those that cannot. Furthermore, to prevent biases towards frequently used web domains, such as Wikipedia and The Peerage, WTR is built to represent a variety of web domains with equal amounts of samples from each.
Selecting references
WTR is constructed from the Wikidata dumps from March 2022. First, all nearly 90M reference identifiers are extracted. Those associated with at least one triple and which lead to a web page by either an external URL (through the
Next, each extracted reference has its initial web URL defined. For references with direct URLs (
The extracted set of references and their respective initial URLs is then filtered. References that are inadequate to the scenario in which ProVe will be used are removed. Three criteria were used:
References with URLs to domains that have APIs or content available as structured data (e.g. JSON or XML), as these can be automatically checked through APIs, e.g. PubMed, VIAF, UniProt, etc.
References with URLs linking to files such as CSV, ZIP, PDF, etc., as parsing these file formats is outside of ProVe’s scope.
References with URLs to pages that have very little to no information in textual format, such as images, document scans, slides, and those consisting only of infoboxes, e.g. nga.gov, expasy.org, Wikimedia commons, etc.
As shown by Amaral et al. [2], a substantial number of references fall under the first criterion, with an estimated over
Close to 7M references are left after these removals, wherein English Wikipedia alone represents over
Pairing references with triples
Given the total number of references contemplated (7M), a sample size of 385 represents a 95% confidence level with a 5% margin of error. References are thus sampled evenly across web domains and each is paired with a triple fit for evaluation, i.e. one that:
Is not deprecated by Wikidata;
Has an object that is neither the “novalue” nor the “somevalue” special node;
Has an object that is not of an unverbalisable type, i.e. URLs, globe coordinates, external IDs, and images;
Has a predicate that is not ontological, e.g.
These steps produce a stratified representative sample of references, including their unique identifiers, resolved URLs, web domains, and HTTP response attributes. Finally, for each sampled reference, a triple-reference pair is formed by extracting from Wikidata a random triple associated with that reference and fit for evaluation. The triple’s unique identifiers, object data types, main labels, aliases, and descriptions are all kept. This construction process creates WTR, ensuring it is composed of triples carrying non-ontological meaning and verifiable through their associated references, and thus useful for evaluating ProVe. It also ensures that meaning and context understanding are evaluated, rather than mere string matching, e.g. in the case of URLs, globe coordinates, and IDs. As for image data, tackling such multimodal scenarios is outside ProVe’s scope.
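The sampling itself can be pictured as the sketch below, assuming a pandas DataFrame of filtered references with a `domain` column; the column names and the even per-domain allocation are assumptions about WTR’s construction scripts, not their actual code.

```python
import pandas as pd

def sample_references(references: pd.DataFrame, total=385, seed=42):
    """Stratified sampling sketch: an (assumed) even allocation of the
    385-reference budget across web domains."""
    per_domain = max(1, total // references["domain"].nunique())
    return (references.groupby("domain", group_keys=False)
            .apply(lambda g: g.sample(min(len(g), per_domain),
                                      random_state=seed)))
```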
Dataset annotation
As described in Section 3 (see Fig. 2), ProVe tackles its AFC task as a sequence of text extraction, ranking, classification, and aggregation subtasks. Given a triple-reference pair, ProVe extracts text passages from the reference, ranks them according to relevance to the triple, and individually classifies them according to their stance towards the triple. Then, triple-reference pairs are classified according to the overall stance of the reference towards the triple.
To allow for a fine-grained evaluation of ProVe’s subtasks, WTR receives three sets of annotations: (1) on the stance of individual pieces of evidence towards the triple, (2) on the collective stance of all evidence, and (3) on the overall stance of the entire reference. The first two sets of annotations are deemed evidence-level annotations, while the last is reference-level. Crowdsourcing is used to collect evidence-level annotations, due to the large number of annotations needed (six per triple-reference pair) in combination with the simplicity of the task, which requires workers only to read short passages of text. Reference-level annotations are less numerous (one per triple-reference pair), much more complex, and hence manually annotated by the authors.
Crowdsourcing evidence-level annotations
Collecting evidence-level annotations for all retrievable sentences of each triple-reference pair in WTR, in order to account for different rankings that can be outputted by ProVe, would be prohibitively expensive and inefficient. Thus, evidence-level annotations are only provided for the five most-relevant passages in each reference, i.e. the collected evidence. First, ProVe’s text retrieval (Section 3.3) and sentence selection (Section 3.4) modules are applied to each reference, and the five most relevant passages of each are collected as evidence. Since often only a couple of passages among the five tend to be relevant, this does not severely bias the annotation towards highly-relevant text passages. It actually allows for a more even collection of both relevant and irrelevant text passages. Then, evidence-level annotations for each individual piece of evidence, as well as for the whole evidence set, are collected through crowdsourcing, totalling 6 annotation tasks per triple-reference pair.
Execution times for each task in the pilot were measured and used to define payment in USD, proportional to double the US federal minimum hourly wage (USD 7.25): USD 0.50 for tasks in T1 and USD 1.00 for tasks in T2, calculated based on the higher of the mean and median execution times. 500 tasks were generated for T1 and 91 tasks for T2, assigned to about 200 and 140 unique workers, respectively. Workers needed to have finished at least 1000 tasks on AMT with at least
To assure annotation aggregation was trustworthy, inter-annotator agreement was measured through kappa values, achieving 0.56 for tasks in T1 and 0.33 for tasks in T2. According to Landis and Koch [27], these results show moderate and fair agreement, respectively. Several factors that contribute to lower inter-annotator agreement [5] are present in this crowdsourcing setting: subjectivity inherent to natural language interpretation, a high number of annotators who also lack domain expertise, and class imbalance. On individual annotations for T1, the majority of passages (
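For reference, agreement of this kind can be computed with statsmodels’ Fleiss’ kappa implementation, as in the toy example below; the label scheme and worker counts shown are illustrative, not WTR’s actual annotation matrix.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Toy annotation matrix: rows are items, columns are workers, entries
# are category labels (illustrative only, not WTR data).
annotations = np.array([
    [1, 1, 1, 0, 1],
    [0, 0, 1, 0, 0],
    [2, 1, 2, 2, 2],
    [1, 1, 1, 1, 1],
])
table, _ = aggregate_raters(annotations)  # items x categories counts
print(f"Fleiss' kappa: {fleiss_kappa(table):.2f}")
```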
Gathering reference-level annotations
WTR has reference-level annotations for each triple-reference pair. They define a reference’s overall stance towards its associated triple, and are manually provided by the authors. These annotations are crucial in order to provide a ground truth for an evaluation of the entire pipeline’s performance when taking the whole web page into consideration. They consider a reference’s full meaning and context, and not only what was captured and processed by the modules as evidence. Differently from sentence-level annotations, the mental load and task complexity of interacting with the page to inspect all information (e.g. in text, infoboxes, images, charts) is too high for cost-effective crowdsourcing. Thankfully, with one annotation per triple-reference pair, it is feasible for manual annotations to be created by the authors. The authors have thus annotated the over 400 references into different categories and sub-categories, which are a more detailed version of the three stance classes used at evidence-level annotations (and by FEVER):
Supporting References (directly maps to the ‘SUPPORTS’ class):
Support explicitly stated as text, as natural language sentences
Support explicitly stated as text, but not as natural language sentences
Support explicitly stated, but not as text
Support implicitly stated
Non-supporting References
Reference refutes claim (directly maps to the ‘REFUTES’ class)
Reference neither supports nor refutes the claim (directly maps to the ‘NEI’ class)
These six subclasses allow WTR to aid in evaluating the overall performance of ProVe in both ternary (the three sentence-level stances) and binary (supporting vs. not supporting) classifications. They also allow us to investigate which presentations of supporting information are better captured by the pipeline.
WTR contains 416 Wikidata triple-reference pairs, representing 32 groups of text-rich web domains commonly used as sources, as well as 76 distinct Wikidata properties.
Evaluation
This section covers the evaluation of ProVe’s performance by applying it to the evaluation dataset WTR, described in Section 4, and comparing ProVe’s outputs with WTR’s annotations. These inspections and comparisons provide insights into the pipeline’s execution and results at its different stages and modules. Each module in ProVe is covered in a following subsection. The overall classification performance of ProVe is indicated by the outputs of the claim verification module’s aggregation step and is covered at the end of the section. There, we also compare ProVe to other approaches to AFC on KGs using WTR.
Claim verbalisation
Given that ProVe’s verbalisation module is the exact same model used to create the Wikidata triple verbalisations found in the WDV dataset [3], this section first reports the relevant evaluation results obtained by WDV’s authors. It then analyses the claim verbalisation module’s execution on the WTR evaluation dataset, looking at the quality of its outputs.
Model validation
ProVe’s verbalisation module consists of a pre-trained T5 model [42] fine-tuned on the WebNLG dataset [13]. To confirm that fine-tuning was properly carried out, the authors measure the BLEU [38] scores of its verbalisations on WebNLG data. The scores are 65.51, 51.71, and 59.41 on the testing portion of the
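Such scores can be reproduced with standard tooling, e.g. sacrebleu, as in the toy example below; the sentences here are made up, and WebNLG’s actual references would be used in practice.

```python
import sacrebleu

# Made-up hypothesis/reference pair, purely to show the computation.
hypotheses = ["Douglas Adams was educated at St John's College."]
references = [["Douglas Adams studied at St John's College."]]
print(f"BLEU: {sacrebleu.corpus_bleu(hypotheses, references).score:.2f}")
```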
Amaral et al. [3] use this exact same fine-tuned model to create multiple Wikidata triple verbalisations, which compose the WDV dataset, and evaluate them with human annotators. WDV consists of a large set of Wikidata triples, alongside their verbalisations, whose subject entities come from three distinct groups (partitions) of Wikidata entity classes. The first partition consists of 10 classes that thematically map to the 10 categories in WebNLG’s
WDV’s verbalisations were evaluated by Amaral et al. through aggregated crowdsourced human annotations of fluency and adequacy dimensions, as defined in Section 3.2. Fluency scores range from 0, i.e. very poor grammar and unintelligible sentences, to 5, i.e. perfect fluency and natural text. Adequacy scores consisted of
Execution on WTR
The verbalisation module was applied to all 416 triple-reference pairs in WTR. For each triple
The process of defining preferred labels for verbalisation, represented in Section 3.2 through function
Out of the 416 verbalisations generated by ProVe through main labels,
Finally, 7 of the verbalised claims ended up having both identical URLs and verbalisations, and were thus dropped from the downstream evaluation, as they would yield the exact same results. This results from ProVe not taking claim qualifiers into consideration, which is further discussed in Section 6.3.
Text extraction
Due to the complexity of defining metrics that measure success in text extraction, this section instead first defines metrics that can be used for an indirect evaluation of the text extraction module. It then explores insightful descriptive metrics obtained from executing the module on WTR.
Indirect evaluation
ProVe’s text extraction module essentially performs a full segmentation of the referenced web page’s textual content without excluding any text. The module cannot be directly evaluated through annotations due to the sheer quantity of ways in which one can segment all references contained in the evaluation dataset. Annotating such text extractions would require manually analysing entire web pages to find all textual content relevant to the claim, inspecting all the text extractable from each page, and segmenting it such that boundaries are properly placed in terms of syntax and relevant passages are kept unbroken. One would then need to compare one or multiple ideal extractions to the extraction performed by ProVe. It is neither trivial to simplify this process for crowdworkers, nor efficient for the authors to carry it out by hand. It is also not trivial to define what constitutes ‘well-placed’ sentence boundaries, nor how and if one can break relevant passages of text.
Thus, instead of a direct evaluation, the performances of the subsequent sentence selection and final aggregation steps are used as indirect indicators of ProVe’s text extraction. A correlation between ProVe’s relevance scores (
Likewise, a good final classification performance, measured against WTR’s reference-level annotations, indicates ProVe’s capacity of extracting useful sentences. Classification metrics for ternary and binary classification tasks, such as accuracy and f-scores, are shown in Section 5.4. Still, the direct evaluation of ProVe’s text extraction module, encompassing sentence segmentation and meaning extraction from unstructured and semi-structured textual content, is intended as future work.
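As a sketch of this indirect check, a point-biserial correlation between continuous relevance scores and binary crowd judgements can be computed as below; the choice of statistic and the data shown are illustrative assumptions, not the paper’s reported figures.

```python
from scipy.stats import pointbiserialr

# Illustrative stand-in data, not WTR annotations.
relevance_scores = [0.91, 0.85, -0.40, 0.10, -0.75, 0.66]
crowd_relevant = [1, 1, 0, 0, 0, 1]  # majority-voted binary labels
r, p = pointbiserialr(crowd_relevant, relevance_scores)
print(f"point-biserial r = {r:.2f} (p = {p:.3f})")
```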
Execution on WTR
The text extraction module was applied to each of the 416 triple-reference pairs
These metrics confirm that extracted textual content mainly varies based on web domain. It indicates ProVe’s extraction depends heavily on particular web layouts, e.g. having difficulty segmenting the contents of specific domains like bioguide.congress.gov, due to their textual content being contained in a single paragraph (
Sentence selection
ProVe’s sentence selection module contains a BERT model fine-tuned on FEVER’s training partition, as described in Section 3.4. This section first performs a sanity check by evaluating the model on FEVER’s validation and testing partitions, measuring standard classification metrics to ensure the model has properly fine-tuned to FEVER. Afterwards, the entire module is applied to WTR and its performance is measured by relying on the crowdsourced annotations.
Model validation
For each claim-sentence pair in FEVER’s validation and testing partitions, the sentence selection module’s BERT model outputs a relevance score between -1 and 1.
Inputs to the sentence selection module consist of the verbalisations

Relevance scores distributions across and within different web domains. ‘ALL’ stands for the combined distribution of the subsequent 32 groups. Data values here are the averages of the top 5 passages’ relevance scores for each reference.

Distributions of relevance scores given by the module divided by the percentage of crowd annotations deeming that passage as ‘relevant’ (either ‘supports’ or ‘refutes’).

Relevance score distributions of passages majority-voted as ‘relevant’ and of those voted ‘irrelevant’.
Such strong correlation can also be seen with aggregated annotations, as shown in Fig. 7, which compares the relevance score distributions of the group of passages majority-voted as ‘relevant’ and of those voted ‘irrelevant’. These metrics and distributions indicate that ProVe’s sentence selection module produces scores that are well-aligned with human judgements of relevance.
Claim verification
The first step of ProVe’s claim verification module consists of an RTE classification task, resolved by a fine-tuned BERT model, as described in Section 3.5. This RTE model is used to classify the stances of individual pieces of evidence (
Model validation
ProVe’s claim verification module has been applied to FEVER’s validation and testing partitions. For the validation partition, the sentences pre-annotated as relevant to the claim being judged were used as evidence. The first RTE step is performed to calculate the individual stance probabilities of each piece of evidence. The second aggregation step is then carried out to define the claim’s final verdict. The same process is carried out for the testing partition; however, since its sentences do not come pre-annotated, ProVe’s sentence selection module was used instead.
For the final aggregation step, only methods
Execution on WTR
ProVe’s claim verification module is tested on WTR by comparing its outputs to WTR’s annotations, both at evidence level and reference level. Such annotations, as described in Section 4.2, consist of: crowd annotations denoting the stances of individual pieces of evidence towards KG triples (from crowdsourcing tasks T1), crowd annotations denoting the collective stances of sets of evidence towards KG triples (from crowdsourcing tasks T2), and author annotations denoting the stance of the entire reference towards its associated KG triple.

Stance classes predicted for single claim-reference pairs (obtained through argmax) vs. the crowd’s aggregated annotations.
ProVe’s pipeline focuses on the classification of entire references rather than individual sentences, and the RTE class probabilities are merely features for the final aggregation. Still, RTE classifications for individual pieces of evidence can be greatly improved by, rather than argmax, using a classifier with the three RTE class probabilities (
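A minimal sketch of such a classifier is given below, using scikit-learn’s random forest over the three RTE class probabilities (plus, as an assumption, the relevance score) as features; the random stand-in data only illustrates the setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Random stand-in features/labels, not WTR data. Each row holds the
# three RTE probabilities plus (assumed) the relevance score.
rng = np.random.default_rng(0)
X = rng.random((200, 4))      # (P_SUPPORTS, P_REFUTES, P_NEI, relevance)
y = rng.integers(0, 2, 200)   # binary crowd label
clf = RandomForestClassifier(n_estimators=100, random_state=0)
print(cross_val_score(clf, X, y, cv=5, scoring="f1").mean())
```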

Binary stance classes of single claim-reference pairs predicted by an RFC vs. the crowd’s aggregated annotations.

ROC curve for the simplified binary stance classification performed by an RFC.
Classification performance of each of the three aggregation methods on both ternary and binary collective stance RTE classification formulations. Majority-voted annotations obtained in T2 are used as true labels. Results from method 3 were cross-validated with
As detailed in Section 4.2, reference-level annotations consist of six categories which directly map to the three RTE labels used in crowd annotations. Figure 11 shows WTR’s reference-level annotation distribution. On the ternary classification task, labels
Figure 12 shows a complete comparison between evidence-level collective stance annotations from the crowd and reference-level author labels. Crowd annotators very successfully judge references in

Distribution of reference-level author labels of the evaluation dataset.

Comparison between reference-level author label and sentence-level collective stance annotations from the crowd.
ProVe is evaluated on the entire WTR by using the reference-level annotations as ground-truth and adopting ProVe’s best performing aggregation method, the classifier (method
ProVe’s binary classification performance on all WTR and per type of textual support. Reference-level annotations were used as ground-truth, and a classifier as aggregation method. Values obtained through cross-validation (
As ProVe requires the provenance document as input, it is not possible to test it on FactBench, the benchmark dataset of the DeFacto lineage. However, if we assume that all retrievable information for a KG triple’s fact checking is a single pre-selected provenance document associated to it, and that its support is the same as a declaration of truthfulness, then DeFacto, FactCheck, and HybridFC can all be applied to the WTR dataset.
DeFacto relies on BOA patterns which are no longer available or maintained, thus only FactCheck and HybridFC can be evaluated. HybridFC consists of a mix of fact-checking methods, two of which rely on KG structures surrounding the input triple rather than solely on the provenance document. Since we do not consider the KG to hold a singular truth and are limiting ourselves to the provenance, we will not employ those two methods of HybridFC.
All these approaches rely on phrase patterns for document retrieval. For both FactCheck and HybridFC, documents and sentences are retrieved through an Elasticsearch7
ProVe’s results compared to FactCheck and HybridFC. Metrics for ProVe are based on the entirety of WTR. Metrics for FactCheck and HybridFC are based on the test partition
In this section, aspects and limitations of the implementation and evaluation results of ProVe are further discussed. Additionally, future directions of research are pointed out and final conclusions are drawn.
ProVe for fact verification
Fact checking as a tool to assist users in discerning between factual and non-factual information has a myriad of applications, formulations, approaches, and, overall, considerably ambiguous results. The effects of fact-checking interventions, while significant, are substantially weakened by their targets’ preexisting beliefs and knowledge [63]. Their effectiveness depends heavily on many variables, such as the type of scale used and whether facts can be partially checked, as well as on whether the checks align with or go against a person’s ideology. This further motivates ProVe’s standpoint of judging support instead of veracity. Triples are evaluated not as factual or non-factual, but based on their documented provenance, passing the onus of providing trustworthy and authoritative sources to the graph’s curators. This keeps ProVe’s judgements from clashing with the ideologies of its users, as the pipeline passes linguistic, not factual, judgement. Additionally, by using only two to three levels of verdict and not including graphical elements, the presence of elements that compromise fact-checking [63] is reduced. The authors’ focus for future research lies in increasing ProVe’s explainability in order to increase trust and understanding.
The results achieved by ProVe, especially on text-rich references, are considered by the authors as more than satisfactory, representing an excellent addition to a family of approaches that currently has very few members, e.g. DeFacto [14], FactCheck [53], and HybridFC [41]. ProVe clearly outperforms this line of systems on the provenance verification task presented by the WTR dataset. FactCheck and HybridFC suffered from being limited to a single document as a source of evidence, as well as from the lack of semantic capabilities during sentence retrieval. For
ProVe’s use as a tool can greatly benefit from an active learning scenario which would further enhance the models and techniques it employs. At the same time, users of the tool inherently introduce a bias based on their demographics, with the same being valid for the crowdsourced evaluation of ProVe’s pipeline. Being aware of this bias is crucial to the proper deployment of such approaches.
Text extraction add-ons
ProVe’s text extraction module has three essential steps: rendering web pages with a web crawler, using rule-based methods to convert content inside HTML tags into text, and sentence segmentation. While the rule-based methods presented in this paper are simple, they are quite effective, as shown in Section 5. Better and more specialized rules and methods to detect and extract text from specific HTML layouts, such as turning tabular structures or sparse infoboxes into sequential and syntactically correct sentences, can be seamlessly integrated into ProVe. Both supervised and unsupervised approaches can also be applied.
In order to properly assess such added methods, as well as to provide more insight into the text extraction module in general, a direct evaluation of its performance would be extremely helpful. Although good performance on downstream tasks is a good indicator, it does not show where to improve text extraction. Be it through descriptive statistics or comparison against golden data, this is a focus of future research alongside model explainability.
Usage of qualifiers
Triples in KGs such as Wikidata are often accompanied by qualifiers that further detail them. A triple such as the one seen in Fig. 2 (
Ribeiro et al. [44] show transformers can verbalise multiple triples into a single sentence. Hence, adding qualifiers as secondary triples is possible, generating more detailed verbalisations. However, ProVe’s sentence selection and claim verification modules contain models fine-tuned on FEVER, whose vast majority of sentences contain only a main piece of information with little to no additional details, e.g. ‘Adrienne Bailon is an accountant’ and ‘The Levant was ruled by the House of Lusignan’. In order to make proper use of qualifiers during verbalisation, there needs to be an assurance that downstream modules can properly handle more complex sentences by either different or augmented training data.
Detecting refuting sources with FEVER
The FEVER dataset presents claims that are normally short and direct in nature, from multiple domains, and associated evidence extracted directly from Wikipedia. ProVe shows it is possible to use FEVER to train pipeline modules to detect supportive and non-supportive sources of evidence. However, as seen in Section 5.4, detecting refuting sources is hard for ProVe and is believed to be due to how FEVER generates refuted claims through artificial alterations. Claims labelled by FEVER as ‘REFUTES’ are those generated by annotators who alter claims that would otherwise be supported by its associated evidence. Alterations follow six types: paraphrasing, negation, entity/relationship substitution, and making the claim more general/specific. This leads to claims that, while meaningful and properly annotated, would never be encoded in a KG triple, such as “As the Vietnam War raged in 1969, Yoko Ono and her husband John Lennon
While useful for other tasks, these refuted claims are very different from refutable triples occurring naturally in KGs, which mainly consist of triples whose objects have different values in the provenance. One such example is “Robert Brunton was born on 23/03/1796”, whose reference actually mentions the “10th of February 1796”. In order to better detect KG provenance that refutes its associated triples, ProVe’s claim verification module needs re-training on a fitting subset of FEVER, or on a new dataset containing non-artificial refuted claims.
Conclusions
KGs are widespread secondary sources of information. Their data is extremely useful and available in a semantic format that covers a myriad of domains. Ensuring the verifiability of this data through documented provenance is a task that is crucial to the upkeep of their usability. This provenance verification task should be actively supported by automated and semi-automated tools to help data curators and editors cope with the sheer volume of information. However, as of now, there are no such tools deployed at large scale KGs and only a very small family of approaches tackle this task from a research standpoint.
This paper proposes, describes, and evaluates ProVe, a pipelined approach to support the upkeep of KG triple verifiability through their documented provenance. ProVe leverages large pre-trained LMs, rule-based methods, and classifiers, to provide automated assistance to the activity of creating and maintaining references in a KG. ProVe’s pipeline aims at extracting relevant textual information from references and evaluating whether or not they support an associated KG triple. This provides its users with support classification, support probability, as well as relevance and textual entailment metrics for the evidence used. Deployed correctly, ProVe can help detect verifiability issues in existing references, as well as improve the reuse of good sources. Additionally, the approach can be expanded to work in a multilingual setting.
ProVe has been evaluated on WTR, a dataset of triple-reference pairs extracted directly from Wikidata and annotated by both crowdworkers and the authors. ProVe achieves
In addition to the application of ProVe on reference re-usability and recommendation, future work mainly lies in exploring techniques to improve ProVe’s explainability, with a focus on its sentence selection and claim verification steps. Other directions can include expanding the size and the distinct predicate coverage of the benchmarking dataset WTR, as well as a direct evaluation of text extraction and segmentation techniques.
Acknowledgements
This research received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement no. 812997.
Crowdsourcing task designs
WTR dataset format
WTR is available at Figshare8
