Abstract
Knowledge graphs (KGs) are crucial in human-centered AI as they provide large labeled machine learning datasets, enhance retrieval-augmented generation, and generate explanations. However, knowledge graph construction has evolved into a complex, semi-automatic process that increasingly relies on black-box deep learning models and heterogeneous data sources to scale. The knowledge graph lifecycle is not transparent, accountability is limited, and there are no accounts of, or methods to determine, how fair a knowledge graph is in downstream applications. KGs are thus at odds with AI regulation, for instance, the EU’s AI Act, and with efforts elsewhere in AI to audit and debias data and algorithms. This article reports on work towards designing explainable human-in-the-loop knowledge graph construction pipelines. Our work is based on a systematic literature review, in which we study tasks in knowledge graph construction that are often automated, as well as methods to explain how they work and their outcomes, and an interview study with 13 experts from the knowledge engineering community. To analyze the literature, we introduce use cases, related goals for explainable AI (XAI) methods in knowledge graph construction, and the gaps in each use case. To understand the role of XAI models in practice and reveal requirements for improving current methods, we designed interview questions covering broad transparency topics, along with example discussion sessions using examples from the review. From practical knowledge engineering experience, we identify user requirements, propose design blueprints, and outline directions for future research.
Keywords
Introduction
To reach its potential, AI needs data and context. Without the right (amounts of) data, machine learning (ML) cannot identify patterns or make predictions. Without a deeper understanding of context, AI applications cannot engage people in a meaningful way. Knowledge graphs (KGs) (Hogan et al., 2020; Peng et al., 2023), a term coined by Google in 2012 to refer to its general-purpose knowledge base, are critical to both: they reduce the need for large labeled ML datasets (Chen et al., 2021), enhance pre-trained language models (PLMs) (Lewis et al., 2020; Yang et al., 2024), and generate explanations (Tiddi & Schlobach, 2022). KGs are routinely used alongside ML in many applications, including search, question answering, recommendation (Guo et al., 2022) and, in industry contexts, enterprise data management, digital twins, supply chain management, procurement, and regulatory compliance (Sequeda & Lassila, 2021). Moreover, with the rise of large language models (LLMs) such as the GPT (Brown et al., 2020; Radford et al., 2019) and Llama series (Touvron et al., 2023a, 2023b), KGs and LLMs have come to influence each other in both directions: LLMs for KGs (using LLMs for KG construction and maintenance) and KGs for LLMs (using KGs to train, prompt, augment, and evaluate LLMs) (Pan et al., 2023; Petroni et al., 2019; Razniewski et al., 2021).
As AI applications produce and consume more data, engineering KGs has evolved into a complex, semi-automatic process that increasingly relies on opaque deep-learning models and vast collections of heterogeneous data sources to scale to graphs with millions of entities and billions of statements (Hofer et al., 2024; Hur et al., 2021; Tamašauskaitė & Groth, 2023; Weikum et al., 2020). The KG lifecycle is not transparent (Wolf, 2020), accountability is limited, and accounts of how biased a KG is (Abián et al., 2022), or of how fair the downstream applications that use it are, remain patchy (Fisher et al., 2020). In recent works, KGs themselves are meant to make ML models explainable (Tiddi & Schlobach, 2022) and hence facilitate such compliance tasks, but that would imply that the KG lifecycle abides by the same rules.
We argue that this is not yet the case. As noted in our previous work (Zhang et al., 2023), questions regarding the user-centric aspects of knowledge engineering are not yet fully answered, such as users’ tasks and goals and the ways they interact with KGs, KG construction (KGC) tools, and KG-related applications (Groth et al., 2023). Up-to-date comparative surveys of the scale, complexity, and degree of automation of today’s KG construction systems are needed. User-centric design and empirical methods should be established for transparent KG construction to ensure that human-centric challenges are not overlooked.
With this article, we would like to advance the field of explainable, human-in-the-loop knowledge graph construction.
Our article follows recent work that explores emergent neuro-symbolic AI architectures from a system-design perspective. van Bekkum et al. (2021) propose a taxonomy of hybrid (i.e., learning and reasoning) systems and discuss common architecture patterns and use cases. Building on their insights, Breit et al. (2023) carried out a comprehensive literature review to add details to those patterns in terms of inputs, outputs, processing units, types of ML models and their training, types of knowledge representation and reasoning, as well as transparency and auditability. One of their main findings is that most system designers do not consider these latter aspects at all, or, when they do, they do not evaluate them sufficiently. A third paper, by Tamašauskaitė and Groth (2023), draws from a survey of system papers to define a canonical KG construction process. Our work continues where they left off: starting from their KG construction process, we follow one of their main recommendations to map models and techniques for each step to provide additional guidance to researchers and developers.
Thus, we put forth the following research questions:
We analyze the KG lifecycle to identify tasks that are commonly automated with AI and those that still require human input and oversight and could potentially benefit from AI assistance. This work builds upon our previous study (Zhang et al., 2023), in which we surveyed the state of the art in explainable AI (XAI) to inform the design of XAI approaches that are practically useful for KG stakeholders such as knowledge engineers, subject domain experts, and users. Furthermore, to extend our methodology, we conducted an interview study involving 13 knowledge engineers and researchers from the knowledge engineering community. The interviews explored topics such as participants’ degree of understanding of the models and techniques they use, the degree of automation, transparency and explainability requirements, and various usage scenarios. Our main findings are:
There are tasks in KG construction, for instance, knowledge acquisition, where automation is routinely used with promising results. At the same time, there are opportunities to use AI to assist other tasks, including ontology reuse, ontology evolution, ontology evaluation, and documentation, where (the latest) AI capabilities have remained under-explored.
While tasks around knowledge acquisition, taxonomy building, and data ingestion are often automated, human oversight is still needed to improve performance, establish trust, or comply with the law. In our review, we found little evidence of the integration of AI capabilities beyond basic automation, whatever their level of interpretability, into standard knowledge-engineering tools and practices.
Furthermore, our understanding of human-in-the-loop KG construction remains limited, with implications for user experience. Comprehensive evaluations of XAI methods are lacking, with most studies focusing on simple ML models in lab settings, with mixed results (Poursabzi-Sangdeh et al., 2018; Smith-Renner et al., 2020; Wang & Yin, 2022). The KG community, just like the rest of AI, needs a better understanding of how people react to and use explanations to build trust and boost technology adoption.
Knowledge engineers have varying levels of understanding of the models and techniques they use, with many expressing concerns over the opaqueness of black-box models. Data provenance and lineage tracking are recognized as critical, yet there are still gaps in the comprehensiveness and standardization of these practices. Evaluation relies heavily on human effort, highlighting the need for more robust and scalable methods. Additionally, effective communication of tool functionality and results to diverse stakeholders remains a significant challenge, requiring tailored approaches to bridge knowledge gaps and align expectations.
Current XAI solutions often fail to meet practical requirements, as their explanations tend to be insufficiently informative, overly complex, and lacking in stability and coverage. Findings from the interview study further highlight the need for explanations that are both clear and confidence-indicating, with a strong preference for natural language representations.
The remainder of this article is structured as follows: Section 2 provides the background and related work, including an introduction to the KG lifecycle. Section 3 outlines the research methodologies, presenting the two-dimensional XAI taxonomy and use cases for literature analysis, as well as the foundation of the interview study. Section 4 explores the key findings from both the literature review and the interview study, with Sections 4.1 through 4.4 addressing research questions 1 to 4, respectively. In Section 5, we propose a blueprint for the design of explainable knowledge engineering models. Finally, Section 6 concludes the article. To facilitate further research, we maintain a public repository 2 .
Transparency and Explainability of ML Methods
Transparency as an AI design principle stands for the need to clearly document and explain how an AI system makes decisions, how the data is collected, used, and governed, and how the system is evaluated and audited (Ehsan et al., 2021; Kaur et al., 2022; Larsson & Heintz, 2020). Achieving transparency in machine learning (ML) models can be accomplished through explainability. Although some ML models, like decision trees, are naturally interpretable, larger models, such as language models, are too complex to comprehend in the same way. To address this issue, researchers and practitioners have proposed many XAI frameworks, guidelines, and standards (Schwalbe & Finzel, 2021), techniques (Lundberg & Lee, 2017; Ribeiro et al., 2016), and evaluation metrics (Hase & Bansal, 2020) for various models within the context of trustworthy AI. Typically, surveys on XAI models and techniques focus on aspects like problem formulation, taxonomies and classification, evaluation metrics, challenges, and future directions (Arrieta et al., 2019; Minh et al., 2022; Mohseni et al., 2021; Schwalbe & Finzel, 2021; Vilone & Longo, 2020). Among works more closely related to ours, Danilevsky et al. (2020) surveyed state-of-the-art XAI models in natural language processing, covering tasks that overlap with our work, such as named entity recognition and relation extraction. In the area of XAI and KGs, researchers have suggested using KGs to provide explanations. Tiddi and Schlobach’s systematic literature review (Tiddi & Schlobach, 2022) focused on the integration of KGs into explainable machine learning, where KGs serve as domain knowledge for explanations. Beyond the technical perspective, Miller’s review (Miller, 2019) provided a thorough examination of explainable AI through a sociotechnical lens, drawing from fields such as philosophy, cognitive science, and social psychology. Although previous studies have covered some KG construction tasks and applications, a thorough review of the transparency and explainability of KG construction is still missing.
User Studies on Explainable AI
A deep understanding of end-user requirements is essential for designing trustworthy explanations, as explainability is a human-centric property (Mittelstadt et al., 2019; Rong et al., 2024). Preece et al. (2018) analyzed stakeholders in XAI by examining the concerns of various stakeholder communities and their differing intents and requirements. Ras et al. (2018) divided users of deep learning models into two groups and discussed their concerns: expert users, the engineers and developers who build and maintain the systems, and lay users, the end users and stakeholders. Liao et al. (2020) interviewed UX and design practitioners working on various AI products and proposed a question-driven approach to designing explanations. Notably, there is a lack of user studies on XAI involving knowledge engineers and KG stakeholders as end users, and consequently no consensus among design disciplines for XAI in these domains. With intents similar to ours, Dhanorkar et al. (2021) conducted an interview study on XAI with AI researchers and stakeholders in industrial AI projects, focusing on the AI lifecycle. Rong et al. (2024) surveyed user studies along characteristics including trust, fairness, understanding, usability, and human-AI collaboration performance, and provided guidelines for both XAI researchers and practitioners on designing and conducting user studies. Similar to our interview study, Kim et al. (2023) conducted an interactive feedback session in their interview study with the objective of understanding how explainability can support human-AI interaction. They mocked up explanations that could potentially accompany AI application outputs in the field of computer vision to assess participants’ perception of existing XAI approaches and how participants use explanations during their collaboration with the AI. Automated and transferable evaluation, benchmarking, and comparison of XAI approaches remain open challenges, as explainability is often seen as a subjective property, necessitating auditing from multiple aspects (Nauta et al., 2022). At the same time, human-centered XAI evaluations that take an HCI perspective remain critical, and rigorous evaluation procedures still need to be established (Chromik & Schuessler, 2020).
Human-Centric Knowledge Engineering
Knowledge engineering, the branch of AI concerned with building and managing knowledge-based systems (Schreiber, 2000; Studer et al., 1998), has changed dramatically with the latest innovations in machine learning, natural language processing, and computer vision. The process of constructing a KG can take various forms, but it usually involves acquiring knowledge, processing it, and deploying the KG (Fensel et al., 2020; Hogan et al., 2020; Tamašauskaitė & Groth, 2023). And yet, as the most recent advances in natural language processing (especially LLMs) and generative AI demonstrate, the question of how to capture and encode domain knowledge into a computational representation remains as challenging as ever (Sarker et al., 2021). The technologies and end-user tools supporting core knowledge-engineering tasks such as knowledge acquisition have advanced significantly to meet the scale requirements of modern KGs and to leverage the generative ability of sequence-to-sequence frameworks (Schneider et al., 2022; Ye et al., 2022). AI copilots, which leverage LLMs, have also become involved in the KG lifecycle through conversational interactions (Zhang et al., 2024), assisting knowledge engineers and users in a wide range of tasks. At the same time, the most effective approaches to knowledge representation still require human oversight at various levels (Simperl & Luczak-Rösch, 2014; Simsek et al., 2022), although human input increasingly takes the form of enhancing or validating algorithmic suggestions (Tamašauskaitė & Groth, 2023). Knowledge engineering tasks require humans in the loop to varying extents and are considered human-centric (Groth et al., 2023; van Harmelen & ten Teije, 2019; Witschel et al., 2021). These developments have resulted in improved methods and techniques to support the knowledge engineering process, with a growing group of participants and stakeholders, including knowledge engineers and domain experts (Simperl & Luczak-Rösch, 2014). Witschel et al. (2021) identified human-in-the-loop patterns in hybrid learning and knowledge engineering activities, encapsulating them in two boxologies in which human agents function either as feedback providers or feedback consumers. Since Holsapple and Joshi (2002) introduced the first collaborative approach to ontology design, various collaborative ontology engineering methodologies have been proposed, covering tasks such as ontology design and construction (Auer & Herre, 2007; Braun et al., 2007; Debruyne et al., 2013; Kotis & Vouros, 2006), ontology evolution (Auer & Herre, 2007; de Moor et al., 2006; Kotis & Vouros, 2006; Vrandečić et al., 2005), and ontology evaluation (Guarino & Welty, 2004; Poveda-Villalón et al., 2014). Ontology engineering tasks continue to rely heavily on manual labor, and many of the reviewed works are outdated and pre-date the era of deep learning. There are evident challenges in improving the methodologies used in this process and adapting them to meet the requirements of automation, scalability, and transparency.
The KG Lifecycle
Building on the process from Tamašauskaitė and Groth (2023), Figure 1 shows that the KG lifecycle today consists of four stages with a mix of automated and manual capabilities and contributions from several stakeholder groups: knowledge engineering and machine learning specialists, subject domain experts, online volunteers, and crowdsourcing services, as well as developers of applications using KGs.

The Knowledge Graph Lifecycle Today.
As the figure illustrates, KGs interact with AI capabilities in complex ways, involving multiple groups of people collaborating both with each other and with machines. Human-in-the-loop tasks in the KG lifecycle increasingly use ML models with varying levels of interpretability. On the left side of the figure, at stage A, an entry point and essential step of the KG lifecycle, knowledge engineers and KG stakeholders (e.g., domain experts) first determine the scope of work and the success criteria (Kendall & McGuinness, 2019). At the second stage, KG construction, knowledge engineers and other specialists (potentially) reuse standard ontologies and build KGs from scratch through data lifting and knowledge extraction. Multiple data sources, structured and unstructured, are lifted into KGs using ML for named entity recognition (Yadav & Bethard, 2019), relation extraction (Lin et al., 2016), entity reconciliation (Sevgili et al., 2020), and many other tasks. The ontology organizing the KG can be provided upfront or derived from the data itself, depending on whether there is a clear domain or available structured data with predefined types of entities and relations (Tamašauskaitė & Groth, 2023). In this context, Wolf (2020) discusses the need for more transparency with respect to data provenance and currency; both affect whether application developers and end users can use the KG with confidence as a source of reliable, complete, unbiased, and up-to-date information. KGs can also be created on a larger scale through human collaboration, utilizing crowdsourcing platforms, collaborative-editing platforms, and so on (Hogan et al., 2020). Crowd workers and volunteer editors have important roles in the KG lifecycle, especially in KG creation and updates, where annotation tasks such as quizzes and voting are often designed to leverage their background knowledge (Acosta et al., 2013; Kou et al., 2022; Revenko et al., 2018). While KGs constructed using these approaches may exhibit quality issues such as errors (Piscopo & Simperl, 2019; Shenoy et al., 2021), disagreement (Koutsiana et al., 2023), and bias (Hogan et al., 2020), crowdsourcing for supervised ML may face transparency challenges similar to those of the algorithms it complements. This is because the digital services commonly used for this purpose, e.g., Prolific and Mechanical Turk, are black-box, proprietary platforms with limited means to replicate or reproduce results (Qarout et al., 2019). Educating crowd workers while they perform crowdsourcing tasks is also nontrivial (Revenko et al., 2018). Interleaving explanations during this process could aid in educating crowd workers, enhancing their comprehension of the task, and ultimately improving output quality.
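As a concrete, simplified illustration of the automated extraction step in stage B, the sketch below lifts a sentence into entities and candidate triples using an off-the-shelf NER model and a deliberately naive co-occurrence heuristic for relations. It assumes the spaCy library and its small English model are installed, and it stands in for the far more sophisticated extraction and reconciliation models cited above.

```python
# Minimal sketch of knowledge extraction (stage B): entities via off-the-shelf NER,
# relations via a naive co-occurrence heuristic. Real pipelines use trained relation
# extractors and entity reconciliation; this only illustrates the shape of the output.
from itertools import combinations
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def extract_candidate_triples(text: str):
    doc = nlp(text)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    triples = []
    for sent in doc.sents:
        # Naive heuristic: any two entities in the same sentence are "related_to".
        ents_in_sent = [e for e in doc.ents if e.sent.start == sent.start]
        for e1, e2 in combinations(ents_in_sent, 2):
            triples.append((e1.text, "related_to", e2.text))
    return entities, triples

entities, triples = extract_candidate_triples(
    "Google introduced the Knowledge Graph in 2012 to improve its search engine."
)
print(entities)  # e.g. [('Google', 'ORG'), ('2012', 'DATE')]
print(triples)   # e.g. [('Google', 'related_to', '2012')]
```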
The result of knowledge acquisition is shown in the middle of the figure, where KGs are often linked to third-party data, reuse standard ontologies and identifiers, and are encoded as RDF, JSON, or other formats. On the right-hand side of the figure, KG maintenance (stage C) is prompted by source updates from stage B and by requirements, audits, and assessments from stage D. To further increase their completeness, correctness, and utility, KGs are refined through completion tasks such as link prediction, as well as error detection and correction (Cimiano & Paulheim, 2017; Rossi et al., 2020; Zamini et al., 2022; Zhang et al., 2022). At stage D, there is a range of use cases for KGs alongside other forms of AI. KGs are used as knowledge bases to query and reason upon, for instance in search (Wiegmann et al., 2022), question answering (Chowdhery et al., 2022; Guo et al., 2022), and retrieval-augmented generation (Gao et al., 2023; Lewis et al., 2020). Information can be obtained from a graph through deductive methods (e.g., logical rules) and inductive methods (e.g., continuous graph embeddings) (Hogan et al., 2020). Both need to be transparent and accountable to the user (Bianchi et al., 2020; Rossi et al., 2022) to be trustworthy and compliant with laws.
To address our four research questions, we employed a mixed methodology of systematic review and interview study. The systematic review involved collecting and analyzing literature on explainable AI in the context of knowledge engineering to gain insight into its current development. The interview study allowed us to directly explore the role of explainable AI in broader contexts, understand the needs of knowledge engineering and KG stakeholders for explanations, identify potential gaps and challenges in this field, and provide valuable insights for further research.
Literature Review
The PRISMA-guided Review
Following the discussion of the lifecycle, we carried out a PRISMA (Page et al., 2021) literature review on databases including the ACM Digital Library, IEEE Xplore, ScienceDirect, arXiv, SpringerLink, and Google Scholar. We searched for queries combining, on the one side, keywords related to trustworthiness (mainly transparent and explainable/interpretable) and, on the other side, keywords related to KG construction tasks, as shown in Table 1. The search initially encompassed all keywords related to KG construction tasks, as depicted in Figure 1. We conducted a prototype search by examining the top 20 results generated by these keyword patterns and subsequently eliminated keywords associated with tasks that did not yield hits within the top 20 results, thereby streamlining the review process. The search took place from October to December 2022 and resulted in more than 735K hits. We then took the top 50 hits per query, which led to around 4,000 papers with duplicates 3 . The workflow of paper selection is shown in Figure 2. We first assessed relevance based on titles, abstracts, and keywords, and in a second step reviewed the full text to select only those papers that proposed a solution for transparent and explainable KG construction, either as a whole process or for individual tasks. We discarded papers that only mentioned transparency and related concepts rather than putting forward a solution. The final corpus consisted of 84 papers. The papers were all published in the past ten years, which was to be expected given that the term ‘‘knowledge graph’’ was coined in 2012, and is in line with other recent knowledge-graph surveys (Schneider et al., 2022; Tamašauskaitė & Groth, 2023).
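The query construction described above is mechanical and can be reproduced with a few lines of code. The sketch below pairs abbreviated, illustrative keyword lists from the two groups (not the full Table 1) into search strings with wildcard suffixes.

```python
# Illustrative enumeration of search queries: each trustworthiness keyword (with a
# wildcard suffix, as in Table 1) is paired with each KG construction task keyword.
# Both keyword lists are abbreviated placeholders rather than the full Table 1.
from itertools import product

trust_keywords = ["transparen*", "explainab*", "interpretab*"]
task_keywords = ["knowledge graph construction", "entity extraction",
                 "relation extraction", "entity resolution", "link prediction"]

queries = [f'"{trust}" AND "{task}"' for trust, task in product(trust_keywords, task_keywords)]
for query in queries:
    print(query)  # e.g. "explainab*" AND "link prediction"
```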

The PRISMA Flow Diagram for Systematic Review.
Keywords for the Literature Search Query. Keywords from Two Groups Were Combined for Query Construction. ‘*’ Represents a Wildcard that Can Match Any Word Suffix in the Search.
In addition to reviewing the existing work categorized in Section 4.1.1, we adopt use cases as an orthogonal dimension for literature analysis inspired by Amarasinghe et al. (2023), recognizing that explanations serve diverse end users with varying needs across different scenarios and stages of the KG lifecycle. To derive XAI use cases in the KG lifecycle, we first examine use cases in the broader AI lifecycle and specific domain applications (Adhikari et al., 2022; Amarasinghe et al., 2023). We then map these use cases to practical scenarios and task spaces within the KG lifecycle, as illustrated in Figure 1. Specifically, we identify and present four key use cases, along with their objectives, in Table 2.
Summary of Use Cases of XAI Methods in Knowledge Graph Construction Process and Their Related Objectives. The Use Cases Are Intentionally Defined with More Flexibility Than the Taxonomy, as their Primary Purpose is not to Serve as a Rigid Classification System but to link the Reviewed Works to Practical, Illustrative Scenarios.
Use Case 1: ML Model Selection and Building
When ML is incorporated into knowledge engineering, ML and knowledge engineers must select the proper models and build them. To help users evaluate and select suitable ML models, explanations should reveal the characteristics and limitations of the model, potential risks associated with its use, and its specialization for particular data or domains. In particular, they should address questions about the model’s capabilities, its strengths and weaknesses, and how well it fits the data at hand. It is also important to determine whether the model exhibits bias toward specific groups of data sources.
Use Case 2: ML Model Debugging
One of the purposes of providing explanations for ML models is to facilitate debugging by allowing knowledge engineers to identify inaccuracies and flawed predictions and providing them with actionable information to correct them.
Use Case 3: Understanding Performance and Contributing Factors
To ensure a thorough comprehension of performance, explainable KG construction pipelines should include the following elements: (1) a clear understanding of the inference/reasoning process, which can be represented as rules, paths, etc.; (2) identification and highlighting of the factors, important features, and supporting evidence that contribute to the final predictions; and (3) provision of counterfactual interpretation through perturbation/permutation.
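As a rough illustration of element (3), the sketch below applies a generic perturbation loop to a toy scoring function: each input feature is dropped in turn and the resulting score change is recorded, ranking the features whose absence most affects the prediction. The feature names and scorer are invented for illustration and do not correspond to any method in the review.

```python
# Generic perturbation-based (counterfactual-style) explanation: remove one input
# feature at a time and measure how much the prediction score changes.
# The scoring function is a toy stand-in for an arbitrary black-box model.
from typing import Callable, Dict

def toy_score(features: Dict[str, float]) -> float:
    # Pretend model: a weighted sum of whatever features are present.
    weights = {"name_similarity": 0.6, "shared_neighbors": 0.3, "type_match": 0.1}
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def perturbation_explanation(features: Dict[str, float],
                             score: Callable[[Dict[str, float]], float]):
    base = score(features)
    deltas = {}
    for name in features:
        reduced = {k: v for k, v in features.items() if k != name}
        deltas[name] = base - score(reduced)  # positive = feature supported the prediction
    return sorted(deltas.items(), key=lambda item: abs(item[1]), reverse=True)

print(perturbation_explanation(
    {"name_similarity": 0.9, "shared_neighbors": 0.4, "type_match": 1.0}, toy_score))
```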
Use Case 4: Managing Updates
Explainability is crucial for KG maintenance. When updates occur in data sources and contextual information, the KG can be updated by rerunning the construction pipeline, executing update or modification models, and so on.
To validate the use cases, we compared those derived from the literature review with the ones collected from the interview study and found that the former are largely reflected in the latter; the interview study also surfaced new use cases, which we discuss further in Section 4.3. While we acknowledge that the identified use cases are not exhaustive, they are intended to be adaptable and expandable as new requirements and application areas emerge.
After identifying the use cases, we conducted further investigations into the capabilities of existing works with respect to them. There are two main aspects to consider for this purpose. First, we need to determine whether the reviewed methods have been applied in real-world scenarios of the given use case or could be adapted to suit them. Additionally, we need to consider whether the models have been trained and tested on real-world data. In the domain of KG construction, benchmarks and datasets are usually close to real-world KGs, such as Wikidata, DBpedia, and Freebase. The second aspect is whether the explanations provided are understandable and satisfactory to the intended audience for the given use case. This can be determined by whether the work has conducted comprehensive evaluations, including both metrics and human evaluations. Thus, we evaluate the capabilities of the existing methods based on the following criteria:
☆: The method has potential for the given use case.
Besides the literature review, we conducted semi-structured interviews with the objectives of (1) acquiring a basic understanding of the current status of knowledge engineering models and techniques, including transparency issues and obstacles, (2) identifying gaps between existing solutions and practical knowledge engineering scenarios, (3) collecting practical requirements for explainable capabilities, and (4) capturing insights for designing automated explainable knowledge engineering pipelines. Table 3 lists all participants and their background information 4 . In total, we interviewed 13 researchers and knowledge engineers from August to November 2023. All participants were recruited via contact lists of research events, a hackathon, and mailing lists hosted by W3C 5 . We maintained a balanced gender distribution among participants (6 females and 7 males) and ensured diverse coverage in experience, domain, and tasks. In terms of sector representation, 7 participants are affiliated with universities, indicating a relatively stronger academic background, while 2 are from research institutes and 4 from companies; the latter 6 participants are considered to have a stronger industry background with a focus on industry-related scenarios. Each interview lasted 35 to 50 minutes and was conducted via an online video call involving the authors and the participant. Ethical clearance was granted by the Research Ethics Office of King’s College London under ethics registration confirmation reference number MRSP-22/23-34456.
Background Information About Interview Study Participants Includes the Sector, Experience Working With KGs and Knowledge Engineering in Years, Domains of KGs, and Tasks Involved in the KG Lifecycle.
Table 4 presents all the interview questions organized by topics and the order in which they were asked. The questions addressed various topics, including the understanding level of models and techniques, degree of automation, data provenance and lineage, trust, evaluation and human intervention, explainability, and associated risks. The design of interview questions incorporated multiple factors, drawing from previous interview studies on explainable AI in other fields (Dhanorkar et al., 2021; Kim et al., 2023), taxonomies and surveys of transparency and explainability (Rong et al., 2024), and the Explanation Ontology (Confalonieri et al., 2024) to ensure comprehensiveness. We adapted these trustworthy factors to the context of KG construction. Firstly, we asked questions about the research background, including experience and domain, to acquire demographic information. Next, we asked about the participants’ experience and understanding of the models and techniques they use. This foundation allowed us to assess the extent to which transparency is an issue and its impact on their practical work. Given the importance of data provenance as a dimension of transparency (Firmani et al., 2019), we include questions specifically about this. To examine the human role in knowledge engineering and gain insight into human factors, we asked questions related to the evaluation of results and how humans interact with the pipeline, providing oversight and intervention. Inspired by Dhanorkar et al. (2021), we designed questions about explanation scenarios and use cases. These questions delved into scenarios where participants explain results or models to their stakeholders, seeking to identify explainability concerns, challenges, and requirements. Finally, we addressed risk concerns that might arise if transparency and explainability are provided with current models and techniques, ensuring a comprehensive understanding of potential issues.
The List of Interview Questions. The XAI Example Discussions are Accompanied by Slides Introducing and Showing Explanations, and the Follow-Up Questions in this Part are Mostly Prompted by the Responses of Participants.
XAI Example Discussion
Furthermore, by selecting examples from the literature review, we designed XAI examples and facilitated discussions on their usefulness, faithfulness, and acceptance. This approach directly connects stakeholders in the context of knowledge engineering with existing methods from the literature, highlighting the pros and cons of current explainable solutions and the gaps between these solutions and practical needs, given the limited application of existing XAI approaches in real-world knowledge engineering scenarios. The XAI examples were selected directly from the reviewed papers. We first identified papers that provided examples of explanations, such as visualizations of attention weights, graph paths, and tables of reasoning rules. We then randomly selected two papers per task as examples for participants to discuss. During the XAI example discussions, participants were first asked to select one task (or two, if time permitted) that they were familiar with. We then presented two examples, each from a different explainable approach to the selected task. Each example was presented on a slide, consisting of the input, output, and explanations as provided by the original publication. Table 5 lists the examples we selected, along with their representations and citations. After reviewing the examples, participants discussed the usefulness and acceptance of the explanations, such as whether they found the explanations helpful and whether they would accept them in their work scenarios or expose them to stakeholders, such as domain experts and users. Moreover, they were encouraged to identify defects in the explanations and suggest improvements or alternative solutions to make the explanations more acceptable. During this process, participants were free to ask questions about the provided examples, and we responded based on the original publication.
User Acceptance Count of Explainable Examples, ‘
The interviews were recorded using Microsoft Teams and transcribed with its automatic transcription service. The transcripts were then cleaned and edited by the authors to remove repeated words, pauses, and filler words, and to correct errors such as misrecognized software names and abbreviations. The edited transcripts were coded into keywords and patterns, consisting of phrases and sentences. We employed three levels of coding strategies for different types of questions. First, for questions related to background information, domain and tasks, and status, we used in vivo coding, extracting the exact words from the transcripts. For questions on data provenance and lineage, evaluation and human intervention, explanation scenarios, and requirements, we extracted phrases and identified patterns such as operations, methods, and examples. Finally, for questions on understanding, XAI example discussions, and risks, we extracted patterns such as comments and suggestions, and coded attitudes and beliefs towards the explainable examples. To analyze the coded data, we grouped identical and similar content into clusters of thoughts and insights, and counted the occurrence of each cluster. We also highlighted quotes to provide important supporting evidence, insights, and original ideas.
Findings
The Status of Explainable Automated Knowledge Engineering
The State-of-the-Art Explainable Models
We classified the papers reviewed with respect to the KG construction tasks they addressed and their approach to explainability, starting with categories widely used in the literature. For explainability, we started with what is explained:
The results are presented in Figure 3 and visualized using a Sankey diagram in Figure 4. At a glance, the papers do not cover the entire KG lifecycle. Most papers are concerned with knowledge acquisition via entity extraction (as a source of classes and instances in KGs) and relation extraction (as a source of property classes, but more importantly connecting entities to each other through properties), or with curation and maintenance via entity resolution (consolidating the data that refers to the same entities) and link prediction (suggesting missing or emerging facts). Besides the four core tasks in the bottom half of the figure, we found one paper dealing with the evolution of the KG schema or ontology (Meroño Peñuela et al., 2021) and another about detecting and explaining inconsistency in KGs (Tran et al., 2020). We note that link prediction was by far the most popular task and that a majority of papers dealt with curation and maintenance rather than building a KG for a particular purpose. This is somewhat concerning, as many applications of KGs are in enterprise contexts (Sequeda & Lassila, 2021), where the first step is to build a computational representation of the enterprise’s data, which is stored across various systems and modalities. For the tasks not covered by the review, we argue that there are several potential reasons why almost no papers were found. Many of these tasks still rely heavily on manual work and human oversight and have not yet been automated, as we later verify based on the interview results. This includes tasks such as ontology reuse and ontology design. Additionally, there are tasks where automation, such as the use of LLMs, has been employed, like ontology alignment (He et al., 2021) and data lifting from databases, but explanations have not been considered.

Taxonomy of Explainable Knowledge Graph Construction. A Summary Table of the Reviewed Papers is Also Available in Our Previous Work (Zhang et al., 2023) and the GitHub Repository.

Sankey Diagram Illustrating the Categorization of Methods. The Left Column Represents KGC Tasks and the Right Column Represents XAI Taxonomy. The Total Count for Each Category is Indicated Next to Its Label.
A second high-level observation is the balanced split in the chosen format for explanations. Methods based on input and generated features use attention weights (Jung et al., 2021; Zhou et al., 2020), words (Lee et al., 2021; Lin et al., 2020), attributes (Barlaug, 2021), and so on to generate explanations, which can be numerical, textual, or visual. By contrast, methods based on human-understandable background knowledge provide explanations in formats such as logical rules (Rocktäschel & Riedel, 2017), reasoning paths (Lei et al., 2020), and structured contextual information (Shahbazi et al., 2020). Given that we are interested in explanations that are accessible to knowledge engineers and subject domain experts, it would be interesting to evaluate whether their familiarity with knowledge representation and/or the subject domain impacts how useful knowledge-based explanations are compared to feature-based ones, which sometimes require an understanding of machine learning. At the same time, explanations are generated differently for each of the four core KG construction tasks in the bottom half of the figure.
Entity Extraction
For entity extraction, explanations often leverage contextual cues such as triggers (Lee et al., 2021; Lin et al., 2020) and word patterns (Hedderich et al., 2021), utilizing attention mechanisms (Vaswani et al., 2017) and saliency-map techniques. One notable work is myDIG (Kejriwal, 2021), a human-in-the-loop system that compiles sophisticated rules written by domain experts into SpaCy rules for backend execution. This reduces the barrier for domain experts to interact with the machine and minimizes training effort. Additionally, myDIG records extraction provenance, allowing users to explore the downstream effects of their specifications. Another type of explanation used for entity extraction is example-based explanation, which relies on training instances (Plumb et al., 2018). Ouchi et al. (2020) compute similarities between candidate terms and training instances, returning the term with the highest derived label probability.
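The idea behind such example-based explanations can be reduced to a nearest-neighbour lookup: the candidate mention is compared with labelled training instances in an embedding space, and the most similar instances are returned both to derive the label and to serve as the explanation. The sketch below uses random vectors and invented instances purely to show the mechanism; it is not Ouchi et al.'s implementation.

```python
# Toy sketch of example-based explanation for entity typing: return the training
# instances most similar to a candidate mention and use their labels as the prediction.
# Embeddings are random placeholders for real contextual representations.
import numpy as np

rng = np.random.default_rng(0)
train_instances = [("Paris", "LOCATION"), ("Berlin", "LOCATION"),
                   ("Marie Curie", "PERSON"), ("Alan Turing", "PERSON")]
train_vecs = rng.normal(size=(len(train_instances), 32))

def explain_by_examples(candidate_vec: np.ndarray, k: int = 2):
    sims = train_vecs @ candidate_vec / (
        np.linalg.norm(train_vecs, axis=1) * np.linalg.norm(candidate_vec) + 1e-9)
    top = np.argsort(-sims)[:k]
    neighbours = [(train_instances[i], float(sims[i])) for i in top]
    predicted_label = train_instances[top[0]][1]
    return predicted_label, neighbours  # the neighbours double as the explanation

label, evidence = explain_by_examples(rng.normal(size=32))
print(label, evidence)
```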
Relation Extraction
For relation extraction, explanations frequently employ contextual information from the input, such as words and sentences, similar to entity extraction. The attention mechanism is a prominent principle among relation extraction methods, with 4 out of 9 studies using attention weights and their associated input context to generate explanations. For instance,
Entity Resolution
There are two primary types of explanations for entity resolution: entity matching (EM) rules (Paganelli et al., 2019; Qian et al., 2019; Singh et al., 2017; Yao et al., 2021) and (ranked) attributes of the entity pair with relevant scores (Baraldi et al., 2021; Barlaug, 2021; Di Cicco et al., 2019; Ebaid et al., 2019; Teofili et al., 2022). EM rules, represented in forms such as disjunctive normal form and general boolean formula, are commonly used in EM systems to enhance interpretability (Singh et al., 2017). For automatic EM rule-based models, Yao et al. proposed a framework consisting of Heterogeneous Information Fusion for learning feature representation from unlabeled data and Key Attribute Tree for interpretable EM decision making (Yao et al., 2021). This framework translates decision trees into EM rules, making explanations more accessible to domain experts.
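The translation from a learned decision tree to readable EM rules can be illustrated generically: fit a shallow tree on attribute-similarity features of record pairs and print its branches, which read as matching rules. The sketch below uses scikit-learn on synthetic similarity scores and is only a schematic stand-in for Yao et al.'s framework.

```python
# Generic illustration of turning a decision tree over attribute similarities into
# human-readable entity-matching rules. Data and thresholds are synthetic.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(42)
feature_names = ["name_sim", "address_sim", "phone_sim"]

# Synthetic record pairs: matches tend to have high name and phone similarity.
X = rng.uniform(size=(200, 3))
y = ((X[:, 0] > 0.8) & (X[:, 2] > 0.6)).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Each branch reads like an EM rule, e.g. "name_sim > 0.80 AND phone_sim > 0.60 -> match".
print(export_text(tree, feature_names=feature_names))
```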
Link Prediction
Most explainable link prediction methods leverage the topology and reasoning capabilities of KGs. Rule- and path-based methods have become the predominant forms of explanations, achieved through various approaches such as random walk-based methods (Lin et al., 2023; Liu et al., 2022; Meilicke et al., 2020), reinforcement learning agents (Das et al., 2017; Xiong et al., 2017), and perturbation-based methods (Pezeshkpour et al., 2019; Rossi et al., 2022). A significant body of work utilizes RL for reasoning over KGs and searching for paths to explain link prediction results (Bhowmik & de Melo, 2020; Das et al., 2017; Fu et al., 2019; Hildebrandt et al., 2020; Lei et al., 2020; Sun et al., 2021; Xia et al., 2022; Xiong et al., 2017). These models typically comprise KG environments and policy network agents. The KG environment transitions elements within the graphs (e.g., entities, relations, queries) into RL agent elements, where states are usually entities (in practical terms, embeddings) and queries (subject entities and relations); actions are typically outgoing edges/relations; transitions map current entities and their outgoing edges to their neighboring nodes; and rewards are heuristic indicators, awarding 1 when the agent reaches the correct target entities. Policy networks then maximize the expected reward to perform path finding. Variations exist in environment transitions, rewards, and the parameterization of the policy function. For example, R2D2 (Hildebrandt et al., 2020) and RuleGuider (Lei et al., 2020) employ multi-agent architectures. R2D2 uses two agents, with one arguing the fact is true and the other arguing it is false, feeding their arguments into a judge network. RuleGuider uses a relation agent and an entity agent that interact to generate paths fed into a rule miner. Perturbation-based methods are also applied in link prediction, similar to those used in entity resolution. CRIAGE (Pezeshkpour et al., 2019) introduces graph perturbation by removing a neighboring link from the target fact to assess the influence of the fact and by adding a new, fake fact to evaluate model robustness and sensitivity. Another prevalent method in explainable link prediction models is the attention mechanism, used in 16 out of 53 total link prediction works. For instance, XTransE employs attention values on items to reveal the relevance between different property-value pairs and the current prediction, which are then ranked to identify the most relevant triples (Zhang et al., 2020). In xERTE, Han et al. propose a temporal relational graph attention layer that calculates query-dependent attention scores for each edge (Han et al., 2020). These scores propagate to each node’s prior neighbors, pruning the inference graph using edge contribution scores. The pruned graph, with node attention scores and edge contribution scores, is used to produce the explanations.
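Stripped of the learned policy, the shared environment interface of these RL approaches fits in a few lines: states are the current entity (plus the query), actions are outgoing edges, the reward is 1 when the agent stops at the correct answer, and the traversed path is kept as the explanation. The toy graph and random policy below only illustrate that interface and are not drawn from any particular published agent.

```python
# Toy KG environment in the style used by RL-based explainable link prediction:
# states = current entity, actions = outgoing edges, reward = 1 on reaching the target.
# A real system replaces the random policy with a learned policy network.
import random

kg = {  # adjacency list: entity -> list of (relation, neighbour)
    "Marie_Curie": [("born_in", "Warsaw"), ("field", "Physics")],
    "Warsaw": [("capital_of", "Poland")],
    "Poland": [("part_of", "Europe")],
}

def rollout(start, target, max_hops=3):
    """Random-policy rollout; the returned path doubles as the explanation."""
    entity, path = start, []
    for _ in range(max_hops):
        if entity == target:
            return 1, path
        actions = kg.get(entity, [])
        if not actions:
            break
        relation, next_entity = random.choice(actions)
        path.append((entity, relation, next_entity))
        entity = next_entity
    return int(entity == target), path

random.seed(1)
print(rollout("Marie_Curie", "Poland"))
# e.g. (1, [('Marie_Curie', 'born_in', 'Warsaw'), ('Warsaw', 'capital_of', 'Poland')])
```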
Human-in-the-Loop
There are very few papers considering human input or oversight, which are critical in trustworthy AI frameworks and guidance (Dignum, 2019). In the few cases of human-in-the-loop systems, human input often involves the provision or revision of rules for tasks such as entity extraction (Kejriwal et al., 2019) and entity resolution (Paganelli et al., 2019; Qian et al., 2019). In myDIG (Kejriwal et al., 2019), a GUI-based rule specification system is provided for domain experts to input expressive entity extraction rule sets without programming. SystemER (Qian et al., 2019), which adopts an active learning methodology, learns explainable entity resolution logical rules and offers functionalities for domain experts, both with and without programming backgrounds, to verify and customize the learned models in feature engineering to ensure extensibility. For generating entity resolution rules, TuneR (Paganelli et al., 2019) involves developers (i.e., coders, scientists, and domain experts) in tuning rule sets by defining the contribution of optimization metrics. The framework defines interpretability-related metrics as the preference between the number of rules in the rule set and their overlap. All three approaches use an ensemble of rules to achieve high precision. Several factors influence the success of these human-in-the-loop approaches, some of which have been considered in these three systems. One critical factor is balancing the amount of training against the extent of human intervention. More human intervention can reduce the training effort, which would otherwise require feeding in more data and thus extend training time. Conversely, increased training effort can reduce human intervention, thereby minimizing unnecessary human labor and avoiding time-consuming and error-prone trial-and-error processes. Another factor is the degree of operational freedom given to users. The complexity of functions and the freedom of operations provided to users affect the time required to educate them. The design of functions should enable users to maximize their input to produce high-quality work while minimizing the time needed to familiarize themselves with the tool. Providing too few intervention options might hinder users from fully expressing the correct input, thereby increasing human effort. These factors are crucial when designing human-in-the-loop systems, and more user studies, especially with knowledge engineers and KG stakeholders, are needed to explore them further.
Evaluation of Explanations
We also collected and analyzed the evaluation of explanations. A primary observation is that most XAI approaches have not been thoroughly or comprehensively evaluated. The majority of methods (58 out of 84) do not perform any evaluation of their explanations, or rely only on anecdotal evidence, visualizing and intuitively commenting on a limited number of explanation cases. There are efforts to design metrics to evaluate explanations: seventeen works adopted metrics to evaluate their explanations, most of them task-dependent. Shahbazi et al. (2020) created a ground-truth explanation set and computed Kendall Tau correlations for the sentence importance scores on the annotated test set. approxSemanticCrossE (d’Amato et al., 2022) proposed explanation evaluation metrics targeting link prediction, which calculate the ratio of triples for which the model can generate explanations (recall) and the average number of explanations per prediction (average support). In gradient rollback (Lawrence et al., 2020), Lawrence et al. adopted the ‘‘RemOve And Retrain (ROAR)’’ (Hooker et al., 2019) evaluation paradigm to evaluate the faithfulness of the explanations.
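The recall and average-support metrics described for approxSemanticCrossE are simple to operationalize. The sketch below computes one plausible reading of those definitions over a hypothetical mapping from predictions to their generated explanations; it illustrates the metrics, not the authors' code.

```python
# Explanation recall = fraction of predictions with at least one generated explanation;
# average support = mean number of explanations over the explained predictions.
# The explanation map below is hypothetical illustrative data.
def explanation_metrics(explanations_per_prediction: dict):
    explained = [exps for exps in explanations_per_prediction.values() if exps]
    recall = len(explained) / len(explanations_per_prediction)
    avg_support = sum(len(exps) for exps in explained) / len(explained) if explained else 0.0
    return recall, avg_support

example = {
    ("Marie_Curie", "born_in", "Warsaw"): ["path_1", "path_2"],
    ("Warsaw", "capital_of", "Poland"): ["rule_7"],
    ("Poland", "part_of", "Asia"): [],  # no explanation generated
}
print(explanation_metrics(example))  # (0.666..., 1.5)
```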
Twelve studies use human evaluation, detailed in Table 6. We identified five types of evaluation tasks commonly adopted in these studies. The most frequent tasks involve asking participants to compare model-generated explanations with those from baseline models and to judge the relevance and correctness of a set of examples. Various metrics are used in human evaluations. One approach is to have participants rate the usability, reliability, and trust of explanations in a survey. A notable example in this group is SQUIRE (Bai et al., 2022), which annotates BIMR-based interpretability scores (Lv et al., 2021) for paths generated by their models and baseline models. Another group of methods measures the accuracy or precision of user predictions with or without provided explanations. The backgrounds of human evaluators vary, including domain experts, such as e-commerce experts in Zhang et al. (2022) and linguists in Emboot (Zupon et al., 2019), people with technical backgrounds, and laypeople recruited via crowdsourcing platforms.
Works That Use Human Evaluation to Analyze Explanations. ‘*’ indicates that no Group Label is Provided, but other Detailed Background Information of Participants is Reported. ‘/’ means ‘not Reported’ in the Paper.
From the above observations, we identified several issues with the evaluation methods. First, reporting a limited number of examples selected based on the researchers’ intuition can be biased and is not sufficient for robust verification (Leavitt & Morcos, 2020; Nauta et al., 2022). Since not all results come with satisfactory explanations, another issue is that the ratio of results for which the model can generate satisfactory explanations is rarely reported. In our interview study, we found this to be a crucial factor that might influence users’ trust in XAI models.
The capability of various explainable techniques for each use case is shown in Table 7. In general, the reviewed literature indicates that global post-hoc methods, especially model-agnostic ones, have the potential to address all use cases. Local post-hoc methods have demonstrated similar potential across all use cases. Although no global self-explaining methods were identified for the first two use cases, this does not imply that these methods lack potential for model selection, construction, and debugging; rather, their global assessment capabilities make them suitable for providing model-level analysis. Among the use cases, all except understanding performance and contributing factors have received comparatively little attention and research. This could pose challenges when integrating the developed methods into real-world applications, making it essential to address these gaps.
Capabilities of XAI Methods in Knowledge Graph Construction. Symbols are Referenced from Section 3.1.2:
: Applicability to the Given Use Case is Unclear; ☆: Method shows Potential for the Given use Case;
: Method Has Been Applied to the use Case but is not yet Integrated into Toolkits or Real-World Applications. Explanations Provided by the Method Have Not Been Evaluated Through User Studies or Any Other Evaluation Methods;
: Method is Integrated into Toolkits in Real-World Scenarios, and Its Explanations Have Been Tested through Real-World Studies With the Target Audience.
Use Case 1: ML Model Selection and Building
Most model-agnostic methods, such as explainers designed for KG embedding models, and some model-specific methods have the advantage of providing explanations across different models and facilitating comparison. While some of the reviewed works have demonstrated their applicability in this use case, most have not emphasized addressing concerns related to model selection and comparison. A notable example that covers this use case is ExplainER (Ebaid et al., 2019), which offers a mechanism for model analysis. The analysis engine of ExplainER comprises multiple explanation models and techniques (LIME (Ribeiro et al., 2016), Anchors (Ribeiro et al., 2018), BRL (Letham et al., 2015), and Skater (Choudhary et al., 2018)) that are independent of any entity resolution model. For link prediction, explainable methods such as CPM (Stadelmaier & Padó, 2019) and Kelpie (Rossi et al., 2022) can be used with any embedding-based link prediction model, allowing for comparison across different embedding models. The main gap for current models in this use case relates not only to model design and architecture but also to documentation. One potential solution is to provide an interactive model card (Crisan et al., 2022) that lists all the necessary information regarding explainability. For instance, for explainable link prediction models, this could include the ratio of faithful and correct explanations generated for each embedding model and a comparison of generated explanations for the same input.
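As a sketch of what such documentation could look like, the dictionary below records explainability-related facts next to the usual model-card fields for a hypothetical link prediction explainer. All field names and numbers are invented for illustration; no standard schema is implied.

```python
# Hypothetical explainability section of a model card for a link prediction explainer.
# Field names and values are illustrative assumptions, not an established schema.
import json

explainability_card = {
    "explainer_type": "post-hoc, model-agnostic (embedding-based link prediction)",
    "explanation_formats": ["paths", "rules"],
    "supported_embedding_models": ["TransE", "ComplEx", "ConvE"],
    "explanation_recall_per_model": {"TransE": 0.81, "ComplEx": 0.77, "ConvE": 0.74},
    "faithfulness_evaluation": "ROAR-style retraining on a held-out split",
    "human_evaluation": "none to date",
    "known_limitations": ["explanations unstable for low-degree entities"],
}
print(json.dumps(explainability_card, indent=2))
```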
Use Case 2: ML Model Debugging
Some works provide analyses of errors. For example, the instance-based explainable method performed error analysis using relevant examples to identify factors causing model confusion (Ouchi et al., 2020). ExplainER visualized representative explanations to highlight where the model fails (Ebaid et al., 2019). D-REX conducted error analysis on explanations alongside model predictions, further revealing the model’s error detection capabilities (Albalak et al., 2021). Pezeshkpour et al. demonstrated the potential application of CRIAGE for automated detection of erroneous triples in KGs. Their approach focused on identifying triples with the least influence on the model’s prediction of the training data (Pezeshkpour et al., 2019). Similarly, Rossi et al. highlighted the ability of Kelpie to uncover bias and imbalance in data, enabling researchers to correct it. However, although these works provided analyses of errors, most did not offer actionable steps for rectifying the identified issues. This could be achieved by providing options to adjust parameters and model architectures, or to leverage external sources such as human knowledge. Human-in-the-loop methods exemplify approaches for correcting errors and improving model output, such as domain experts manually correcting rules in rule-based explainable systems (Kejriwal et al., 2019; Paganelli et al., 2019). One approach following this line is to offer locally actionable information, such as suggestions for correcting predictions directly. A future direction in designing local explainable methods would be to help users identify error cases and enable corrections at the data-point level.
Use Case 3: Understanding Performance and Contributing Factors
The majority of the reviewed works address this use case well across various tasks. As detailed in Section 4.1.1, a range of representations are employed to understand the inner workings of models and the factors contributing to their outputs. For knowledge extraction tasks, such as entity and relation extraction, models provided supporting evidence from the source data (e.g., text) to aid in predictions (Lee et al., 2021; Lin et al., 2020). Similarly, for knowledge integration tasks like entity linking, attributes of entities were selected through mechanisms such as matching or non-matching votes, as demonstrated in Baraldi et al. (2021) and Barlaug (2021). Explainable link prediction models offered rules (Lei et al., 2020; Meilicke et al., 2020; Singh et al., 2017) and paths (Das et al., 2017; Fu et al., 2019; Hildebrandt et al., 2020; Xiong et al., 2017) to illustrate the reasoning process, as well as subgraphs (Du et al., 2023) to measure the influence of nodes and edges. Notably, rule-based methods are prevalent across all tasks due to their concise and straightforward representation and their ability to generalize to new data.
Use Case 4: Managing Updates
Global explainable methods such as rule-based methods (Liu et al., 2022; Paganelli et al., 2019) can potentially express model evolution through modifications in their global explanations. Similarly, visualization-based explanations (Ding et al., 2021; Wang et al., 2022), where users can compare different versions of visualizations, can also provide valuable insights when managing updates to KGs. Models that provide local explanations, such as inductive models (Sadeghian et al., 2019; Sun et al., 2021; Wang et al., 2021) and perturbation-based models (Pezeshkpour et al., 2019), could track differences for specific instances or groups of instances. Very few of the models directly implement this capability, but most could potentially be extended to support this use case. For rule-based explainable methods, a straightforward way to manage updates is to use the generalization ability of existing rules and perform inductive reasoning. For instance, the authors of TLogic (Liu et al., 2022) state that the temporal rules it generates are applicable to any new dataset that covers common relations, even when new entities appear. Zhang et al. (2022) also emphasized the benefits of transferable rules: their model generates reusable rules to accelerate the deployment of a KG to new tasks or systems. In addition to directly transferring rules to new data, rules can also be updated. For example, RNNLogic (Qu et al., 2020) used an EM-based algorithm to update rules. Once the explanation rule sets are updated, users can compare the two sets of rules to see what changes the new data has introduced. Similar strategies can be applied to other explanation types, such as visualizations of attention weights and embeddings.
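As a minimal illustration of the rule-set comparison described above, the sketch below diffs the explanation rules learned before and after a KG update; the string-based rule representation and the example rules are assumptions chosen for brevity.

```python
# Minimal sketch: comparing explanation rule sets across two KG versions to see
# what the new data has changed. Rules are represented as strings purely for
# illustration; any hashable rule object would work the same way.
def diff_rule_sets(old_rules: set, new_rules: set) -> dict:
    return {
        "added": new_rules - old_rules,       # rules learned only after the update
        "removed": old_rules - new_rules,     # rules no longer supported by the data
        "retained": old_rules & new_rules,    # stable rules across versions
    }

old = {"nationality(X,Y) <- born_in(X,C), capital_of(C,Y)"}
new = {"nationality(X,Y) <- born_in(X,C), capital_of(C,Y)",
       "nationality(X,Y) <- citizen_of(X,Y)"}
print(diff_rule_sets(old, new))
```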
We now report on the interviews. We first present the current status of knowledge engineering tools in practical scenarios, focusing on the degree of automation and the level of understanding that knowledge engineers have of these tools, as well as aspects including data provenance and lineage, evaluation, and human intervention. By addressing a series of sub-questions, we aim to gain a basic understanding of these critical transparency factors from our interview study. This foundation will enable us to delve deeper into identifying the desired properties of explainable models and techniques. A summary of the key findings from the interview study is presented at the end of this section (see Table 8).
Summary of Key Findings From the Interview Study.
How much human effort is leveraged in the knowledge graph lifecycle?
Among the participants, the majority engage in manual (38.5% of participants) and semi-automatic (38.5%) work, while a minority (23.1%) exclusively use automation for the tasks they work on. For ontology engineering tasks, participants predominantly employed manual and/or semi-automatic methodologies. These approaches necessitate extensive communication and collaboration among knowledge engineers, domain experts, and stakeholders, often facilitated through semi-structured interviews. Conversely, for tasks related to knowledge extraction and completion, participants preferred automated models and techniques. Methods for tasks such as data transformation, which lift other data formats into RDF triples through RML mappings and tools like SPARQL Anything (Asprino et al., 2022), always involved the manual creation of the mappings. One participant assessed the performance of language models in generating such mappings. Language models have increased automation in knowledge engineering due to their user-friendly nature, characterized by simple natural language input and output, which requires fewer specialist skills. However, their opacity and tendency to hallucinate affect their trustworthiness. When evaluating the outcomes of models, such as the triples generated by knowledge extraction models, human evaluation is always necessary. This is particularly crucial when dealing with new domains and data, where datasets are lacking.
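For readers unfamiliar with the lifting step mentioned by the participants, the sketch below illustrates the same idea (turning tabular data into RDF triples) using rdflib in Python; it is not the RML- or SPARQL-Anything-based pipeline the participants described, and the file name, columns, and namespace are hypothetical.

```python
# Illustrative sketch of "lifting" tabular data into RDF triples with rdflib.
# The CSV file, column names, and namespace are hypothetical.
import csv
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/")
g = Graph()

with open("people.csv", newline="") as f:          # assumed columns: id, name, employer
    for row in csv.DictReader(f):
        person = EX[f"person/{row['id']}"]
        g.add((person, RDF.type, EX.Person))
        g.add((person, EX.name, Literal(row["name"])))
        g.add((person, EX.worksFor, EX[f"org/{row['employer']}"]))

print(g.serialize(format="turtle"))
```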
What is the level of understanding of the models and techniques?
Participants had varying opinions regarding the impact of the opaqueness of models and techniques and the necessity to thoroughly understand them. 46.2% of them felt that opaqueness did impact their work and emphasized the importance of understanding the models. As participant
Data Provenance and Lineage
Do knowledge engineers know where the data comes from?

Distribution of Participant Responses to Four Questions: (a) Where Does the Data Come From? (b) How Do You Evaluate the Results? (c) How Do You Explain the Results? (d) When Do You Perform the Intervention?
How do people keep track of data provenance and lineage?
Among the participants flagging data provenance as essential, 69.2% actively tracked it in their tasks. Notably, all participants, from industry and academia alike, recognized the importance of data provenance and lineage and have established methods for documenting these aspects, given that their data primarily comes from partners and customers. The interviews revealed a list of (semi-)automatic techniques either currently in use or planned for adoption by the participants to manage data provenance and lineage, including the PROV Ontology, RDF-star, metadata, OpenRefine, Data Version Control (DVC), data catalogs, the NLP Interchange Format (NIF) (Hellmann et al., 2013), and blockchain. These tools document details such as the creation time, the personnel involved, operation timelines, the algorithms used to create the data, and potentially even the parameterization of these algorithms. Data provenance is tracked at different levels of granularity, from the model level (e.g., entire ontologies) to the data level (e.g., individual ontology elements). The availability of a wide range of tools offers knowledge engineers flexibility in fitting their specific pipelines. However, challenges and requirements remain. For instance, participant
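To illustrate the kind of provenance record such tools produce, the following sketch attaches PROV-O statements (generating activity, responsible agent, and timestamp) to a KG artifact using rdflib; the resource names are hypothetical and the sketch is not drawn from any participant's pipeline.

```python
# Illustrative sketch: recording provenance for a KG artifact with PROV-O.
# Resource names (kg_v2, extraction_run_42, alice) are hypothetical.
from datetime import datetime, timezone
from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import PROV, XSD

EX = Namespace("http://example.org/")
g = Graph()

dataset, activity, engineer = EX.kg_v2, EX.extraction_run_42, EX.alice

g.add((dataset, RDF.type, PROV.Entity))
g.add((activity, RDF.type, PROV.Activity))
g.add((engineer, RDF.type, PROV.Agent))

# Which activity produced the data, who was responsible, and when it finished.
g.add((dataset, PROV.wasGeneratedBy, activity))
g.add((dataset, PROV.wasAttributedTo, engineer))
g.add((activity, PROV.wasAssociatedWith, engineer))
g.add((activity, PROV.endedAtTime,
       Literal(datetime.now(timezone.utc).isoformat(), datatype=XSD.dateTime)))

print(g.serialize(format="turtle"))
```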
How do knowledge engineers evaluate the results?
As shown in Figure 5(b),
Besides manual evaluation, 46.2% of participants also attempt to
What do people do when they find the results incorrect?
Similar to human evaluation, the interviews revealed a consensus that human intervention is essential to compensate for the limitations of machines at various stages and levels of detail in the KG construction process. We categorized human intervention based on the stages at which it occurs, as shown in Figure 5(d). The first stage is
How do people explain to others their models and results?
92.3% of participants have experience explaining models and results to others. 38.5% of participants explained their models and results to stakeholders who may not have a technical background, typically domain experts. Eight participants explained their work to ontologists and knowledge engineers with a similar technical background, usually project partners and team members. Additionally, two participants mentioned producing explanations for educational purposes, targeting university students. This indicates that
The methods used for explanations are summarized in Figure 5(c). For now, there are no standardized methods for explaining the models and outputs in the KG lifecycle.
The most frequent method (used by six participants) is to select examples, including corner cases and errors, to explain the model’s functionality, the relevance between input and output, the difficulties of the problems, and the range of the model’s abilities. Three participants explained the pipelines and models through lectures and conceptual introductions to the technical components, often providing high-level overviews of the algorithms and models. The other two participants adopted visualization methods. Participant
Using the same taxonomy adopted in the literature review in Section 3, we categorized the explanation methods collected from the interviews into two categories: contrastive explanations and example-based explanations as local post-hoc methods, and visualization, plain explanations, and introduction to inner workings as global post-hoc methods. Our analysis reveals that, out of the 14 responses regarding explanation methods, half of the responses are local post-hoc methods, while the other half are global post-hoc methods. Notably, no self-explaining methods were reported. In contrast, the literature review indicates that a substantial proportion of explainable methods consist of local self-explaining (59.5%) and local post-hoc (19%) methods. We posit that several factors contribute to this discrepancy. Self-explaining methods are preferred in academia-developed models because researchers often work on implementing models from scratch or improving models by adjusting components or integrating additional components for better performance. This objective aligns with the design of self-explaining models. Among the 50 local self-explaining papers reviewed, 37 pertain to link prediction models, which typically incorporate explanation mechanisms into their developed models, enhance existing models by making components explainable, or reformulate problems in an interpretable manner. For practitioners, however, implementing self-explaining methods poses challenges. Post-hoc explanations of model output offer greater flexibility, allowing practitioners to customize supporting evidence, visualize this evidence, and adapt explanations into other languages that are more comprehensible to their stakeholders.
Participants reported several challenges.
Gaps and Challenges in Explainable KGC Solutions and Practical Usage
Use Cases From Interview Study
What are practical use cases of XAI models?
We first compared the use cases in Section 3.1.2 with those collected from the interview study. We found that the use cases in Section 3.1.2 were largely reflected through the interview study, which also provided new insights and additional use cases. The most prominent use case, highlighted by 76.9% of participants, is understanding the model output and its inner workings. This includes providing supporting evidence, mapping results to the original input, and explaining how the models generate the output. This aligns with the previously identified use case of understanding performance and contributing factors. The second common use case, mentioned by 38.5% of participants, is debugging models and assisting in rectifying and adjusting them. This extends the previous use case of model debugging by indicating where the machine fails or is unstable, identifying systematic error patterns and problematic parts of data sets, and understanding mistakes and errors and their causes.
XAI Example Discussion
Do current explainable solutions meet the requirements for practical use cases?
During the example discussion session, participants provided feedback on various tasks: 5 commented on relation extraction, 4 on entity extraction, and 2 each on entity resolution, link prediction, and inconsistency detection (Table 5).
Requirements for Explainable Approaches
What are characteristics of an explainable method that knowledge engineers and researchers expected?
From the example discussions, we identified two key requirements. First, 30.8% of participants emphasized the need for a
Secondly, the representation of explanations largely depends on the task and user. Although explanation formats like visualization and logic rules received varying levels of acceptance, the most acceptable representation for participants was
From the requirement elicitation questions, we also identified two common requirements from participants more directly. First, 30.8% of participants highlighted that the type of information users most require in explanations is
Moreover, 30.8% of participants envisioned a solution involving a ‘‘hybrid pipeline’’ where people and machines work in cooperation, providing ways of
Explanation Design Blueprint
Based on the findings from the literature review and interview study, we propose a set of guidelines that are consolidated into a blueprint for designing explainable solutions in knowledge engineering tasks that are both usable and trustworthy for target users, as illustrated in Figure 6. The figure presents a workflow for designing and maintaining XAI methods, beginning with requirement analysis and incorporating an evaluation–feedback loop to continuously refine and update the models and techniques.

Blueprint of the XAI Method Design Workflow. The Boxes Represent Key Stages in the Design and Development Process. The Arrows Indicate the Flow of Inputs, Outputs, and Feedback Between Stages.
The first step in designing explainable models involves XAI requirement analysis, which collects design insights and creates goals for explanations. Several factors must be carefully considered and investigated to capture the scope and objectives of explainable models.
The most important factor is the
The second part of XAI requirement analysis focuses on the
The third factor is the
Moreover, this list of factors can be expanded to reflect real-world scenarios. Additional considerations, such as AI regulations discussed in the Introduction, may also play a role in the requirements analysis. This analysis helps guide the selection and implementation of XAI methods to ensure they align with practical applications.
XAI methods can then be implemented based on the identified end users, use cases, requirements, and other factors. After implementation, the workflow involves iterative loops for maintaining and continuously improving the methods. One loop (top-right corner of Figure 6) focuses on the evaluation and assessment of explanations (Di Bonaventura et al., 2024). Evaluation should go beyond anecdotal evidence by selecting appropriate metrics or designing dedicated evaluation paradigms. Another iterative loop (bottom-right corner of Figure 6), derived from the ‘‘hybrid pipeline’’ requirements in Section 4.4, aims to improve explainable models and explanations in practical scenarios. Users who consume the explanations provide feedback and example explanations, which can be used in various ways to enhance the XAI model: creating datasets of explanations for training and fine-tuning XAI models, providing few-shot examples, or even abstracting improvement directions for architecture-level adjustments. Both evaluation results and user feedback are integrated into the implementation stage, providing critical insights that guide ongoing updates and refinements of the XAI methods, as represented by the arrows from the evaluation and user feedback stages to the implementation and update block.
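A minimal sketch of this user-feedback loop is given below: feedback on explanations is logged, and the highest-rated, user-corrected explanations are reused, for example as few-shot examples or fine-tuning data for the XAI method; the data model, rating scheme, and function names are illustrative assumptions rather than a prescribed implementation.

```python
# Illustrative sketch of the feedback loop in the blueprint. All names and the
# rating scheme are assumptions; no specific system is implied.
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class ExplanationFeedback:
    model_input: str                              # what the construction model was asked to do
    explanation: str                              # the explanation shown to the user
    rating: int                                   # e.g., 1 (unhelpful) to 5 (very helpful)
    corrected_explanation: Optional[str] = None   # user-provided improvement, if any

feedback_log: List[ExplanationFeedback] = []

def record_feedback(item: ExplanationFeedback) -> None:
    feedback_log.append(item)

def few_shot_examples(k: int = 3) -> List[Tuple[str, str]]:
    """Return the k highest-rated, user-corrected explanations as (input,
    explanation) pairs that could seed the next iteration of the XAI method
    (e.g., as few-shot prompts or fine-tuning data)."""
    corrected = [f for f in feedback_log if f.corrected_explanation]
    best = sorted(corrected, key=lambda f: f.rating, reverse=True)[:k]
    return [(f.model_input, f.corrected_explanation) for f in best]
```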
Conclusion
In this article, we adopted a mixed methodology, conducting a literature review on explainable methods within the domain of KG construction and an interview study on the same topic with 13 participants to capture how XAI methods support knowledge engineering. We performed the analysis along three dimensions: tasks related to KG construction, the taxonomy of XAI methods, and the use cases of XAI methods in KG construction. We observed that most effort has been directed towards automation and explainability in entity extraction, relation extraction, entity linking, and link prediction. Additionally, we considered the use cases in explainable automatic KG construction, such as ML model selection and building, ML model debugging, understanding performance and contributing factors, and managing updates. The interview study largely corroborated the considered use cases, adding new insights and highlighting additional use cases, including enhancing human–machine interactions and providing new insights from unexpected results. We found that the reviewed models primarily focused on explaining the performance and contributing factors to the outcome while neglecting other use cases, such as error detection and correction, which could help establish trust with users. The interview study revealed that while current knowledge engineering models and techniques exhibit varying degrees of automation and understanding, significant challenges remain in data provenance, evaluation methods, and providing clear explanations to stakeholders. The current explainable solutions often fell short of participants’ requirements, with concerns about their informativeness, complexity, and reliability. These insights established a foundational understanding of critical transparency factors, enabling the development of a blueprint for designing explainable methods for knowledge engineering tasks.
In summary, we addressed RQ 1 by reviewing the state-of-the-art XAI models and techniques for KG construction and analyzing them across multiple dimensions. RQ 2 was answered through our interview study, which provided insights into users' perspectives and expectations. RQ 3 was addressed by synthesizing findings from RQ 1 and RQ 2, revealing a clear gap between current XAI models and techniques and user needs. RQ 4 was partially answered by identifying key user requirements, which informed preliminary design considerations for XAI methods. We acknowledge that additional requirements may emerge, especially in particular application scenarios.
Future Work
We identified five future directions for research on explainable automatic KG construction. First, going back to prior literature on knowledge engineering methodologies (Kendall & McGuinness, 2019; Schreiber, 2000; Studer et al., 1998; Suárez-Figueroa et al., 2011), there are many
Second, as we noted earlier, the fewest approaches look at
Thirdly, our research flagged the need for
Fourthly, our research revealed an imbalance in the distribution of use cases identified in the study. There was a strong emphasis on understanding the inner workings, performance, and contributing factors of models, while relatively few efforts were made to address other use cases also demanded by the community, such as model debugging, model updating, and human–AI interaction. However, our example discussions indicated that the reviewed explanations often failed to meet these requirements, and participants expressed low confidence in using them in their work or providing them to users. A future direction, reflected in our study and requested by the community, involves
Finally, although our research provided a blueprint for designing XAI methods,
Footnotes
Funding
This research is supported by the King's College London Research Training Student Grant and co-funded by SIEMENS AG and the Institute for Advanced Study, Technical University of Munich, Germany.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
