
Abstract

Recently, there has been significant progress in the development of robust and highly scalable neurosymbolic description logic reasoners. However, the field faces challenges arising from diverse design strategies and evaluation methods. We address the latter challenge by emphasizing the critical requirement for a comprehensive benchmark framework tailored to the unique evaluation needs of neurosymbolic description logic reasoners. In this paper, we address barriers that must be overcome to facilitate the effective evaluation of these reasoners and outline a potential methodology for designing the benchmark framework. This work contributes toward a more systematic and principled evaluation framework for neurosymbolic reasoning, highlighting the broader role of benchmarks in advancing the field.

Keywords

benchmark, neurosymbolic AI, reasoning, description logics, ontology, neural network

1. Introduction

Neurosymbolic artificial intelligence (AI; d’Avila Garcez et al., 2015; Sarker et al., 2021) is a promising field that aims to bridge the gap between traditional symbolic logic and modern neural network-based machine learning. The idea is to combine the strengths of both approaches while overcoming their weaknesses. The focus of this paper lies within the realm of neurosymbolic reasoning. At its core, neurosymbolic reasoning involves integrating symbolic reasoning, which relies on structured logic and formal knowledge representation, with neural network-based methods known for their capacity to process large-scale, unstructured data and learn complex patterns from it. This fusion holds the potential for developing systems with enhanced performance, explainability, and generalization abilities (Ott et al., 2023). It is important to note that these approaches, unlike traditional reasoning methods, are not necessarily sound and complete. Instead, they strike a balance between approximating the precise reasoning capabilities of symbolic systems and harnessing the robust learning capabilities of machine learning techniques.

However, progress in this field faces significant challenges because neurosymbolic reasoning is still an emerging area, in contrast to other areas with extensive research and well-established benchmarks. For instance, several models (graph neural networks (GNNs; Scarselli et al., 2009) and logic tensor networks (Badreddine et al., 2022b)), methodologies (inductive logic programming; Sen et al., 2022), and innovative ideas (explainable AI (Xu et al., 2019) and zero-shot learning (Chen et al., 2020)) enrich this field. As a result, existing works exhibit diversity in techniques, and hence different methods and criteria are used to evaluate the performance of neurosymbolic reasoning systems (see Table 1). The lack of a standardized approach makes it difficult to compare these systems and to measure progress in the field. Furthermore, based on the reciprocal relationships between neural and symbolic components and how they benefit each other, neurosymbolic reasoning systems, and neurosymbolic AI systems in general, can be categorized into one of six distinct categories, as discussed by Henry Kautz (Kautz, 2022).

– Symbolic Neuro Symbolic: In this category, the input and output are represented symbolically, such as with words or sequences of words. These symbols are converted into vectors using methods such as word2vec (Mikolov et al., 2013) and then fed into a neural network for processing.
– Symbolic[Neuro]: Symbolic solvers use neural models internally for some functions, as seen in systems such as AlphaGo (Silver et al., 2016).
– Neuro $|$ Symbolic: This category involves a refined integration of neural and symbolic approaches, where both systems collaborate to enhance specific tasks, such as in the case of the Neuro-Symbolic Concept Learner (Mao et al., 2019).
– Neuro:Symbolic $\to$ Neuro: These approaches take symbolic rules as input and compile them during training, effectively integrating symbolic knowledge into the structure of neural models, as demonstrated in Deep Learning For Symbolic Mathematics (Lample & Charton, 2020).
– $\text{Neuro}_{\text{Symbolic}}$: This category involves transforming symbolic rules into templates for structures within the neural network, such as Logic Tensor Network (Badreddine et al., 2022a).
– Neuro[Symbolic]: Refers to the embedding of symbolic reasoning inside a neural engine, such as GNNs (Scarselli et al., 2009).

Each of these categories represents a unique approach to neurosymbolic AI, adding an extra layer of diversity to the advancements in this field.

Table 1.
Overview of Variations in Neurosymbolic Reasoning and Evaluation Approaches.

| Paper | Logic | Reasoning task | Datasets used | Metrics | Summary of approaches used |
| --- | --- | --- | --- | --- | --- |
| ELEm (Kulmanov et al., 2019) | $\mathcal{EL}^{++}$ | Subsumption | GO | Hits@n, AUC, mean rank | To capture entity relationships, embeddings were created by representing concepts as $n$-balls and the relations as translation vectors between the centers of each concept ball. The embeddings were utilized to predict protein–protein interactions. |
| EmEL$^{++}$ (Mondal et al., 2021) | $\mathcal{EL}^{++}$ | Subsumption | SNOMED CT, anatomy, GO, Galen | Hits@n, AUC, median rank, 90th percentile rank | Extended ELEm with relation inclusion and role chains. Also introduced negative samples for training. |
| EmEL-V (Mohapatra et al., 2021) | $\mathcal{EL}^{++}$ | Subsumption | SNOMED CT, GO, Galen | Top@n, median rank, 90th percentile rank | Extended EmEL$^{++}$ to include many-to-many relationships. |
| BoxEL (Xiong et al., 2022) | $\mathcal{EL}^{++}$ | Subsumption | Anatomy, GO, Galen | Hits@n, AUC, mean rank | To capture entity relationships, maps concepts to boxes and addresses the limitations of $n$-ball-based embeddings (Kulmanov et al., 2019; Mohapatra et al., 2021; Mondal et al., 2021). |
| Box$^{2}$EL (Jackermeier et al., 2024) | $\mathcal{EL}^{++}$ | Subsumption, role assertion, and deductive reasoning | Anatomy, GO, Galen | Hits@n, AUC, median rank, mean rank | Maps both concepts and roles as boxes, and models inter-concept relationships using a bumping mechanism. |
| Leemhuis et al. (2020) | $\mathcal{ALC}$ | Concept membership | NA | NA | Embeds concepts in the ontology as convex regions in vector spaces. |
| E2R (Garg et al., 2019) | $\mathcal{ALC}$ | Concept membership | LUBM | Hits@n, mean rank, MRR | Aiming to preserve the logical structure, proposed embeddings in the quantum space. |
| Makni and Hendler (2019) | RDFS | Entailment reasoning | LUBM and scientist dataset created from DBpedia | Precision, recall, and F1 score | The evaluation focused on assessing noise tolerance by employing an encoder–decoder architecture to translate input RDF graph embeddings into corresponding inference graph embeddings. |
| Ebrahimi et al. (2018) | RDFS | Query-based classification | Created from Linked Data Cloud and Data Hub websites | Precision, recall, and F1 score | Explored the capabilities of end-to-end memory networks. The model's capability for multihop reasoning is demonstrated. The use of normalized embeddings supports transfer. |
| Ebrahimi et al. (2021) | RDFS and $\mathcal{EL}^{+}$ | Entailment reasoning | Synthetic data and LUBM | Exact matching accuracy | Utilized pointer networks for learning the sequential application of inference rules used in many deductive reasoning algorithms. |
| Hohenecker and Lukasiewicz (2020) | OWL 2 RL | Entailment reasoning | Claros, DBpedia, UMLS, and synthetic data | Accuracy | Developed a deep learning-based model called recursive reasoning networks (RNNs). |
| Eberhart et al. (2020) | $\mathcal{EL}^{+}$ | Ontology completion (concept inclusions and existential restrictions) | Synthetic data and SNOMED | Precision, recall, and F1 score | Showcases completion reasoning behavior using various LSTM neural networks to learn reasoning patterns, employing three distance measures to assess prediction accuracy. |
| Makni et al. (2020) | RDFS | Explainable entailment reasoning | LUBM and real-world scholarly dataset | Accuracy | Built upon the previous work (Makni & Hendler, 2019) to generate explanations for the derived conclusions by taking the RDF graph and inferred triples as input and the explanations as the target. |
| Hohenecker and Lukasiewicz (2017) | RDF | Concept membership and relation prediction | LUBM, UOBM, Claros, DBpedia | F1 score and accuracy | Proposed RTN. Embeddings of the individuals are computed by applying RTNs on the directed acyclic graph representation of the ontology (including the inferences). |
| Farzana et al. (2023) | RDF | Query pruning and complete query rewriting | Created from user search logs from eBay Inc. | Precision, recall, F score, and query accuracy | Proposes a KG-enhanced approach for query rewriting in e-commerce, leveraging RDF2Vec entity embeddings, entity types, category information, and entity frequency extracted from a product KG. |
| OWL2Vec* (Chen et al., 2021) | $\mathcal{SROIQ}$ | Concept membership and concept subsumption | HeLis, FoodOn, GO | MRR and Hits@n | Ontologies are transformed into RDF graphs, and Word2Vec is utilized on the resulting paths. The training dataset comprises three documents: structural, lexical, and a combination of both, enhancing entity interrelation understanding compared to earlier Word2Vec methodologies. |

Note. GO = gene ontology; AUC = area under the curve; LUBM = Lehigh University Benchmark; UOBM = University Ontology Benchmark; MRR = mean reciprocal rank; FoodOn = food ontology; KG = knowledge graph; RNN = recursive reasoning network; RTN = relational tensor network; LSTM = long short-term memory.

Drawing inspiration from Jim Gray’s pioneering work (Gray, 1993) on domain-specific benchmarks for databases, our goal is to tackle the challenge of benchmarking neurosymbolic reasoners. The primary purpose of such a benchmark is two-fold. Firstly, it serves as a tool to identify the performance bottlenecks, enabling targeted improvements in the systems where algorithms are still evolving. Secondly, benchmarks facilitate meaningful comparisons between various systems, offering insights into their relative strengths and weaknesses. While this paper does not put forth an alternative benchmark, we highlight the strong need for such benchmarks, including their features, and explain why they are essential for moving the field forward.

In Section 2, we delve into the recent advancements in neurosymbolic reasoning, highlighting the challenges in evaluating and comparing the existing state-of-the-art neurosymbolic reasoners. Subsequently, in Section 3, we address the barriers that must be overcome to facilitate the effective evaluation of neurosymbolic reasoners. Finally, in Section 4, we outline a potential methodology for designing the benchmark.

2. Neurosymbolic Reasoning for Description Logics

In recent years, there have been significant advancements in developing neurosymbolic reasoners for description logics (DLs; Baader et al., 2003), the formal underpinning of the Web Ontology Language 2 (OWL 2; Grau et al., 2008). While most of these works focus on classification and consistency checking (Hitzler et al., 2010; Makni et al., 2021; Singh et al., 2023), other reasoning tasks, such as instance retrieval, query rewriting, materialization, abduction, and explanation generation, remain relatively unexplored. The complexity of these tasks varies significantly, and studying them offers a promising avenue for further research.

Research in this domain takes an alternative approach to traditional reasoning tasks such as classification and consistency, breaking them into class subsumption, class membership, and satisfiability tasks. Various techniques are employed, such as geometric embeddings (Kulmanov et al., 2019; Mohapatra et al., 2021; Mondal et al., 2021; Xiong et al., 2022), which map ontological relationships to geometric spaces, and the emulation of logical reasoning through machine learning (Eberhart et al., 2020; Ebrahimi et al., 2021; Makni & Hendler, 2019). A comprehensive overview and detailed insights into the state-of-the-art neurosymbolic reasoning landscape are given in Makni et al. (2021) and Singh et al. (2023). Regarding other categories, a limited amount of work, such as that for e-commerce search (Farzana et al., 2023), merges neurosymbolic reasoning with query rewriting. This involves a knowledge graph (KG; Hogan et al., 2022) enhanced neural network approach that integrates auxiliary knowledge from a product KG, enhancing semantic understanding of user queries and improving query reformulation.
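
The geometric-embedding idea can be illustrated with a minimal sketch, assuming concepts have already been embedded as $n$-balls (a center vector and a radius). Under that assumption, a subsumption $C \sqsubseteq D$ is predicted when the ball of $C$ lies inside the ball of $D$. The function names and the `margin` parameter are illustrative only and are not taken from any of the cited systems.

```python
import math


def ball_subsumed(center_c, radius_c, center_d, radius_d, margin=0.0):
    """Return True if the ball of concept C lies inside the ball of concept D.

    Containment holds when the distance between the centers plus the radius
    of C does not exceed the radius of D (plus an optional tolerance).
    """
    distance = math.dist(center_c, center_d)
    return distance + radius_c <= radius_d + margin


def subsumption_violation(center_c, radius_c, center_d, radius_d, margin=0.0):
    """Hinge-style penalty that is zero exactly when containment holds."""
    distance = math.dist(center_c, center_d)
    return max(0.0, distance + radius_c - radius_d - margin)
```

For example, a ball of radius 1.0 centered at (0.5, 0.0) is contained in a ball of radius 2.0 centered at the origin, so the first concept would be predicted to be subsumed by the second, while the reverse direction yields a positive penalty. Because such a prediction is approximate, it is not guaranteed to agree with a sound and complete reasoner.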

The existing traditional benchmarks such as Lehigh University Benchmark (LUBM; Guo et al., 2005), University Ontology Benchmark (UOBM; Ma et al., 2006), and OWL2Bench (Singh et al., 2020) lack suitability for evaluating neurosymbolic reasoners due to their narrow focus on conventional reasoning tasks. Traditional evaluations of reasoning systems often rely on metrics such as reasoning time, which may not align well with the evaluation requirements of neurosymbolic reasoners. Although the ontologies of these benchmarks, along with those from the OWL reasoner evaluation competition (Parsia et al., 2017), can serve as initial datasets for the proposed neurosymbolic benchmark framework, these datasets fall short of addressing the distinct challenges posed by neurosymbolic reasoning. To our knowledge, no benchmarks or evaluation frameworks have been designed to evaluate and compare neurosymbolic reasoning systems. Most reasoner evaluations are performed on different publicly available ontologies, including but not restricted to SNOMED CT,¹ gene ontology (GO),² and Galen,³ as well as other ontologies available in public repositories such as DBpedia (Lehmann et al., 2014), YAGO (Suchanek et al., 2007), Wikidata (Vrandečić, 2012), Claros,⁴ NCBO Bioportal,⁵ and AgroPortal.⁶ However, these offer a limited set of ontologies for evaluation, which does not cover the full spectrum of possible scenarios.

As discussed in Section 1, neurosymbolic approaches encompass a range of evaluation methodologies and reasoning techniques. This diversity is evident in Table 1 and highlights the need for a dedicated benchmark to systematically and comprehensively assess the performance of neurosymbolic reasoning systems. The table reveals the use of subsets of description logics, such as $\mathcal{ALC}$ and $\mathcal{EL}^{++}$, and of various OWL 2 profiles such as EL and RL (Motik et al., 2012). Some works also incorporate RDF and RDFS into their reasoning techniques, underlining the diversity in the supported ontology languages and profiles, which implies that existing works handle different levels of complexity. Furthermore, the table showcases the variety of reasoning tasks undertaken, the different datasets used, and the diverse metrics employed for evaluating each approach. The summary column in Table 1 highlights the differences in the techniques used by each work. Note that this paper does not aim to provide an exhaustive list of all existing work; instead, it emphasizes the variations in reasoning and evaluation approaches.

The table also reveals that similar works may differ significantly in the metrics and datasets they use to evaluate their contributions. For instance, consider the works of Makni and Hendler (2019) and Ebrahimi et al. (2021). Both studies focus on RDFS entailment reasoning, aiming to replicate deductive reasoning processes. However, the former reports precision, recall, and F1 score on LUBM and a scientist dataset created from DBpedia, whereas the latter reports exact matching accuracy on synthetic data and LUBM. Such variations in evaluation criteria can lead to diverse insights and perspectives on the contributions within the field, but they also make the results hard to compare directly. Taken together, these observations highlight the pressing need for a standardized benchmark to facilitate fair and consistent comparisons, thereby advancing the progress of neurosymbolic reasoning research.

To further highlight the diversity in the current approaches, we classify each work listed in Table 1 into one of the six categories discussed in Section 1. Chen et al. (2021) convert the symbolic input (ontologies and RDF graphs) to vectors (Symbolic Neuro Symbolic). Makni and Hendler (2019), Ebrahimi et al. (2018), Ebrahimi et al. (2021), Eberhart et al. (2020), Hohenecker and Lukasiewicz (2017), and Makni et al. (2020) take symbolic reasoning rules as input and compile them during training (Neuro:Symbolic $\to$ Neuro), integrating symbolic knowledge into neural models. Kulmanov et al. (2019), Mondal et al. (2021), Mohapatra et al. (2021), Xiong et al. (2022), Leemhuis et al. (2020), Garg et al. (2019), and Jackermeier et al. (2024) embed symbolic reasoning inside neural engines, representing symbolic information in geometric or vector spaces and employing neural methods for reasoning tasks (Neuro[Symbolic]). Farzana et al. (2023) fall into the category involving a refined integration of neural and symbolic approaches to enhance query rewriting (Neuro $|$ Symbolic).

3. Desiderata for Benchmarking Neurosymbolic Reasoners

Creating an effective benchmark demands careful consideration of critical principles such as simplicity for accessibility, portability for impartial assessments across various approaches, scalability to accommodate diverse system sizes, and relevance to reflect practical challenges in benchmark scenarios (Gray, 1993). However, the evaluation of neurosymbolic reasoners presents its own set of distinctive challenges. Given the field's novelty, state-of-the-art solutions do not approach such challenges systematically. Therefore, we outline below the issues that should be prioritized in constructing a fair neurosymbolic reasoning benchmark.

1. Diverse Benchmark Scenarios

To effectively evaluate neurosymbolic reasoners, the benchmark must incorporate diverse scenarios that mirror the complexity and variety encountered in real-world applications. This approach ensures a thorough assessment of the reasoners’ capabilities across different contexts. Key aspects to consider include:

– Variety of Ontologies: The benchmark should encompass a range of ontologies differing in size, profile, and axiom types. This includes:
  * Size and Complexity: Include ontologies with varying sizes and complexities in both assertional knowledge (ABox) and terminological knowledge (TBox) to evaluate how reasoners handle different levels of detail and scope.
  * OWL 2 Profiles: Use ontologies that adhere to various OWL 2 profiles (such as EL, QL, RL, and DL) to test the reasoners' ability to handle different levels of expressiveness.
  * Axiom Types: Incorporate different types of axioms (such as subclass relations and property restrictions) and their combinations to assess how well the reasoners manage diverse logical constructs.

– Specific and Generic Reasoning Tasks: Benchmark scenarios should include both specific reasoning tasks and generic information needs:
  * Specific Reasoning Tasks: Design tasks that test particular reasoning capabilities, such as classification, consistency checking, and instance retrieval. These tasks enable micro-benchmarking and provide insights into the strengths and limitations of individual reasoners.
  * Generic Information Needs: Include tasks that assess the reasoners' ability to handle complex and broader reasoning scenarios, such as evaluating how well the reasoners can address multistep queries that involve integrating diverse information sources and applying both symbolic rules and neural network-derived insights. This includes testing the reasoners' ability to synthesize and leverage contextual information to generate coherent and relevant responses.

– Real-World Applicability: Ensure that benchmark scenarios reflect real-world use cases:
  * Real-World Ontologies: Analyze and incorporate real-world ontologies and existing benchmarks to capture practical challenges and scenarios.
  * Scalability and Realism: Design scenarios that not only address current requirements but also scale beyond them to foster technological advancement and future-proof the evaluation process.

2. Introducing Controlled Inconsistencies

Incorporating controlled inconsistencies into benchmark design presents a significant challenge but is essential for evaluating the robustness of neurosymbolic reasoners. Controlled inconsistencies should be introduced in a deterministic manner to assess how well the systems handle and resolve contradictions. Key aspects to consider include:

– Types of Inconsistencies:
  * Structural Inconsistency: Introduce structural inconsistencies by creating contradictions in the ontological hierarchy or relationships. For example, if the ontology specifies that "Male" and "Female" are disjoint classes, add instances in the ABox that are classified as both "Male" and "Female." This tests the system's ability to detect and resolve structural conflicts. Similarly, create inconsistencies by defining contradictory class hierarchies or property restrictions that violate the logical constraints of the ontology.
  * Semantic Inconsistency: Introduce semantic inconsistencies by modifying entity names or attributes to introduce ambiguity or slight deviations. For example, change names or attributes in a way that creates near-identical but distinct instances, such as altering "John" to "Jonh," to test how well the system identifies and resolves semantic conflicts. Introduce synonyms or typographical errors that may lead to semantic ambiguities and test how the system manages these issues.

– Reproducibility and Control:
  * Deterministic Generation: Ensure that the process of generating inconsistencies is deterministic, allowing for consistent reproduction of test scenarios. This is crucial for evaluating the effectiveness of the system's handling of inconsistencies.
  * Controlled Environment: Design mechanisms to introduce inconsistencies in a controlled manner, avoiding randomness that could obscure the evaluation of specific reasoning capabilities.

Note that existing benchmarks may lack the capability to introduce generic inconsistencies effectively or in a contextually relevant manner. This highlights the need for novel approaches to benchmark design. Traditional generative AI models, such as Large Language Models, may not be well-suited for creating controlled inconsistencies. This underscores the unique requirements for benchmark design that effectively simulates real-world contradictions. Incorporating controlled inconsistencies into the benchmark will provide a deeper understanding of a reasoner’s robustness and its ability to manage and resolve conflicts, reflecting the complexity of real-world scenarios where inconsistencies are prevalent.
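
The two kinds of controlled inconsistency described above can be sketched as follows. This is a minimal illustration under stated assumptions: the ABox is a set of (individual, class) pairs, disjointness axioms are given as a list of class pairs, and selection is driven by a hash of a `seed` value so that the same inputs always yield the same injected errors. All function names are hypothetical and do not come from an existing benchmark.

```python
import hashlib


def _stable_key(seed, *parts):
    """Hash-based sort key: the same seed and parts always give the same key."""
    text = "|".join([str(seed), *parts])
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def inject_disjointness_violations(abox, disjoint_pairs, k, seed=0):
    """Return (new_abox, injected) with up to k disjointness-violating assertions.

    abox: set of (individual, class_name) class assertions.
    disjoint_pairs: list of (class_a, class_b) pairs declared disjoint.
    """
    candidates = set()
    for individual, cls in abox:
        for a, b in disjoint_pairs:
            if cls == a and (individual, b) not in abox:
                candidates.add((individual, b))
            elif cls == b and (individual, a) not in abox:
                candidates.add((individual, a))
    # Order candidates by a seeded hash so the choice is reproducible.
    ordered = sorted(candidates, key=lambda c: _stable_key(seed, *c))
    injected = ordered[:k]
    return set(abox) | set(injected), injected


def swap_adjacent_letters(name, seed=0):
    """Create a near-duplicate name by swapping two adjacent characters.

    If the two chosen characters are identical, the name is returned unchanged.
    """
    if len(name) < 2:
        return name
    i = int(_stable_key(seed, name), 16) % (len(name) - 1)
    return name[:i] + name[i + 1] + name[i] + name[i + 2:]
```

Calling `inject_disjointness_violations` with an ABox containing `("alice", "Female")` and the disjoint pair `("Male", "Female")` adds `("alice", "Male")`, reproducing the structural example from the text. Returning the injected assertions alongside the new ABox keeps a record of exactly which contradictions a reasoner is expected to detect.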

3. Input Representation for Benchmarking

A critical aspect of benchmarking neurosymbolic reasoners is the representation of input data. This involves ensuring that ontological knowledge, both ABox (assertional knowledge) and TBox (terminological knowledge), is formatted in a manner that various reasoning systems can effectively process. This flexibility ensures comprehensive and realistic evaluation conditions, enabling the assessment of reasoning systems across the spectrum of neurosymbolic methodologies.

– Ontology Formats: The benchmark should support multiple ontology formats such as RDF/XML,⁷ Turtle,⁸ and Manchester OWL Syntax,⁹ among others. This ensures compatibility with a wide range of systems that may require specific formats.
– Axiom Format: While some neurosymbolic systems may utilize embedding techniques to transform ontological entities and relationships into continuous vector spaces, others might directly process axioms in their logical form. For instance, certain systems might require axioms in a normalized form as per the $\mathcal{EL}^{++}$ profile. Additionally, some approaches may require axioms to be in triple format. Therefore, the benchmark should accommodate these varying requirements by providing tools for transforming and normalizing ontological data as needed.
– Pre-embedded Entities: Some systems may necessitate entities represented as embeddings, using models such as BERT (Devlin et al., 2019) or other neural embeddings such as TransE (Bordes et al., 2013). The benchmark should offer pre-embedded entity representations, ensuring compatibility with these methods and enabling comprehensive evaluation across different representation techniques.
– Dataset Splits: The benchmark should facilitate the generation of dataset splits tailored to diverse testing needs, such as train–test–validation splits. This enables a thorough evaluation of a system's learning and generalization capabilities across different segments of data. Properly managed splits ensure that the performance metrics accurately reflect the system's ability to handle unseen data and prevent overfitting.
– Domain-Agnostic Datasets: To assess a system's understanding of logical semantics independently of specific domain knowledge, the benchmark should have the capability to generate domain-agnostic datasets. This allows for evaluation focused on the system's ability to interpret and apply logical rules universally rather than relying on domain-specific information.
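
Two of the items above, dataset splits and domain-agnostic datasets, can be sketched in a few lines. The sketch assumes axioms are available in triple format as (subject, relation, object) tuples of strings. The function names, the `E0, E1, ...` naming scheme, and the 80/10/10 default ratios are illustrative assumptions and not a proposed standard.

```python
import hashlib


def anonymize(axioms):
    """Replace entity names with opaque identifiers, consistently across axioms.

    The relation vocabulary (e.g. "subClassOf") is kept as-is because it
    carries the logical meaning; only domain-specific names are removed.
    """
    mapping = {}

    def rename(entity):
        if entity not in mapping:
            mapping[entity] = f"E{len(mapping)}"
        return mapping[entity]

    renamed = [(rename(s), r, rename(o)) for s, r, o in axioms]
    return renamed, mapping


def assign_split(axiom, train=0.8, valid=0.1):
    """Assign an axiom to 'train', 'valid', or 'test' from a hash of its content.

    The assignment depends only on the axiom itself, so regenerating the
    splits always produces the same partition.
    """
    digest = hashlib.sha256("|".join(axiom).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    if bucket < train:
        return "train"
    if bucket < train + valid:
        return "valid"
    return "test"
```

With the input `[("Cat", "subClassOf", "Mammal"), ("Mammal", "subClassOf", "Animal")]`, `anonymize` returns `[("E0", "subClassOf", "E1"), ("E1", "subClassOf", "E2")]`, which preserves the logical structure while hiding the domain vocabulary. Whether splits should be drawn over asserted axioms, over inferences, or over whole ontologies is a design decision this sketch leaves open.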

4. Assessment of the Deductive Capabilities of Existing Approaches

In the trajectory toward developing a new generation of reasoners that effectively harness the potential of both neural networks and logical reasoning, a foundational requirement involves conducting an equitable assessment of state-of-the-art solutions. This assessment provides insights into the present capabilities of these approaches and illuminates the trajectory of the field's future development. Evaluating these aspects ensures that the reasoners can generalize beyond specific datasets and apply logical rules consistently across different domains. The key points of this assessment include:

– Soundness and Completeness: Traditional deductive reasoners are sound and complete, meaning they produce correct and exhaustive inferences based on given axioms. Evaluating whether neurosymbolic reasoners can achieve similar standards is critical.
– Generalization Capabilities: Deductive reasoners should be able to work across any ontology from any domain. This includes verifying that the reasoners can generalize logical rules universally and not be confined to specific datasets or domains. For instance, a rule stating "if A is a subclass of B and B is a subclass of C, then A is a subclass of C" should apply universally, irrespective of the specific terms involved. This ensures the systems can apply logical rules consistently across different domains.
– Scalability and Efficiency: Assessing the scalability and efficiency of these reasoners in handling large and complex ontologies is essential. Traditional deductive reasoners are designed to handle extensive datasets and intricate logical structures, which serve as a benchmark for emerging neurosymbolic systems. Understanding how these models perform under varying degrees of complexity and scale can guide the development of more robust and versatile reasoning systems.
– Handling Noise and Inconsistent Data: When evaluating the deductive capabilities of these reasoners, it is crucial to consider how well they handle noise and inconsistent data. Real-world applications often involve datasets with inaccuracies, ambiguities, and inconsistencies that can challenge the reasoning process. Assessing a system's ability to manage and mitigate the impact of such issues provides valuable insights into its robustness and practical applicability. This includes evaluating how well the system maintains soundness and completeness in the presence of noisy or conflicting data and its effectiveness in adapting to varying degrees of data quality and integrity.
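
As a rough illustration of how soundness and completeness could be estimated for the subclass rule mentioned above, the sketch below computes the transitive closure of asserted subclass pairs as the reference answer and compares a system's predictions against it. It assumes subsumptions are represented as (sub, super) pairs and covers only this single rule; a real assessment would use a full deductive reasoner as the reference. The function names are hypothetical.

```python
def transitive_closure(subclass_pairs):
    """Apply 'A ⊑ B and B ⊑ C implies A ⊑ C' until no new pair is derived."""
    closure = set(subclass_pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def soundness_completeness(predicted, asserted):
    """Return (soundness, completeness) of predictions against the closure.

    soundness: fraction of predicted subsumptions that are actually entailed.
    completeness: fraction of entailed subsumptions that were predicted.
    """
    gold = transitive_closure(asserted)
    predicted = set(predicted)
    correct = predicted & gold
    soundness = len(correct) / len(predicted) if predicted else 1.0
    completeness = len(correct) / len(gold) if gold else 1.0
    return soundness, completeness
```

Because the closure depends only on the structure of the pairs, renaming every class consistently leaves both scores unchanged for a reasoner that truly applies the rule. A drop in the scores after such a renaming would suggest reliance on domain-specific names rather than on logical structure.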

When assessing deductive reasoning capabilities and comparing them with conventional deductive reasoners, it is advantageous to also include neural-based approaches, such as large language models, in the evaluation framework. While neural methods may not always excel in every aspect of deductive reasoning, incorporating them as a baseline can offer valuable comparative insights. This approach not only underscores the benefits of neurosymbolic methods, which integrate both neural and symbolic reasoning, but also provides a more comprehensive understanding of the strengths and potential synergies between different reasoning paradigms.
5. Success Metrics and Key Performance Indicators

In order to accurately measure the performance of neurosymbolic reasoners, the benchmark must support a range of metrics and key performance indicators that capture various aspects of system performance.

Standard metrics commonly used in evaluation include:

– Accuracy, Precision, Recall, and F1 Score: These metrics capture the classification performance of neural components by comparing the system’s predicted inferences with the correct ones. Precision reflects how far the system produces only correct inferences (soundness), and recall reflects whether it generates all correct inferences entailed by the given axioms (completeness).
– Mean Reciprocal Rank (MRR) and Hits@K: These are used to evaluate ranking tasks, measuring the position of correct answers in a ranked list of predictions. They help assess the effectiveness of the system in retrieving relevant information.
– Scalability Metrics: Metrics such as processing time and memory usage, which evaluate how well the system handles large and complex datasets.
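As a minimal sketch of how the first two groups of metrics could be computed, the following code compares a set of predicted inferences with a reference set (for example, the output of a sound and complete reasoner) and scores ranked predictions. The function names and the toy data are assumptions made for illustration only.

```python
def precision_recall_f1(predicted, reference):
    """Set-based precision, recall, and F1 of predicted inferences."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def mrr_and_hits_at_k(rankings, answers, k):
    """Mean reciprocal rank and Hits@K over a list of ranked predictions."""
    if not answers:
        return 0.0, 0.0
    reciprocal_ranks, hits = [], 0
    for ranking, answer in zip(rankings, answers):
        if answer in ranking:
            rank = ranking.index(answer) + 1  # ranks start at 1
            reciprocal_ranks.append(1.0 / rank)
            hits += 1 if rank <= k else 0
        else:
            reciprocal_ranks.append(0.0)  # answer not retrieved at all
    return sum(reciprocal_ranks) / len(answers), hits / len(answers)


# Toy subclass inferences: two of three predictions are correct.
reference = {("A", "B"), ("B", "C"), ("A", "C")}
predicted = {("A", "B"), ("B", "C"), ("C", "A")}
precision, recall, f1 = precision_recall_f1(predicted, reference)

# Toy rankings: the correct answers appear at ranks 2 and 3.
rankings = [["B", "C", "D"], ["D", "C", "B"]]
answers = ["C", "B"]
mrr, hits_at_2 = mrr_and_hits_at_k(rankings, answers, k=2)

print(precision, recall, f1, mrr, hits_at_2)
```

Read against a reference reasoner, a precision below 1 signals unsound inferences and a recall below 1 signals missing ones.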

While these standard metrics are essential for evaluating traditional aspects of system performance, there remains a need for new metrics tailored to the unique characteristics of neurosymbolic reasoning. Current benchmarks might not fully capture critical aspects such as:

– Robustness to Noise and Inconsistent Data: Evaluating how well the system manages inaccuracies, ambiguities, and inconsistencies in real-world data. This requires metrics that measure the impact of noisy or conflicting data on performance and the system’s ability to maintain robustness.
– Inference Generation Efficiency: Metrics that assess whether the system can generate all inferences in a single run while remaining sound, together with its computational efficiency and the number of iterations required.
– Explanatory Capabilities: Evaluating the quality and usefulness of explanations provided by the system for its inferences. This includes measuring the clarity and completeness of explanations, which is crucial for transparency and user trust.
– Generalization Across Domains: Metrics to assess how well the system transfers reasoning capabilities across different domains, ensuring consistent performance and applicability in varied contexts.
– Embedding Quality: For systems that use embeddings, metrics that evaluate how well the embeddings capture logical relationships and nuances. This includes assessing the embeddings’ effectiveness in preserving logical structures and supporting accurate inferences.

The inclusion of these novel metrics, alongside traditional ones, ensures a comprehensive evaluation of neurosymbolic reasoners. This approach provides deeper insights into the performance and limitations of current systems, guiding future improvements and research directions.
6. Adaptability

In the rapidly evolving field of neurosymbolic reasoning, the benchmark’s adaptability is crucial for ensuring its relevance and effectiveness. The benchmark should be designed to accommodate the following aspects:

– Continuous Updates: The benchmark should be capable of integrating new tasks and methodologies as they emerge. This involves regularly updating the benchmark to reflect the latest advancements and challenges in neurosymbolic reasoning.
– Ongoing Review and Feedback: Regular reviews and updates based on the latest research and feedback from the community are essential to keep the benchmark aligned with current practices and real-world needs.

4. Proposed Methodology for Designing the Benchmark

In this section, we propose one possible methodology for designing a benchmark that meets the objectives outlined in Section 3. The methodology involves the following key steps:

Generating Diverse Benchmark Scenarios: The initial step toward creating a benchmark involves curating datasets that cover a wide range of benchmarking scenarios. One approach is to start with a study of existing neurosymbolic description logic reasoners, beginning with basic ontology profiles such as RDFS and gradually progressing to more complex ones such as OWL 2 EL and OWL 2 DL. This process includes evaluating existing datasets across various models and systems, comparing results with traditional reasoning systems such as Konclude (Steigmiller et al., 2014), and identifying performance variations. This analysis can provide insights about the datasets that prove critical for existing systems. We can then focus on generating synthetic datasets that replicate these patterns at various scales and complexities, ensuring coverage of diverse ontology constructs and reasoning tasks.
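The synthetic generation step can be illustrated with a deliberately simple sketch. It assumes that only subclass axioms are needed and that a random tree is an acceptable shape; a real generator would instead replicate the patterns found critical in the preceding analysis and cover many more constructs. The name `generate_hierarchy` and its parameters are hypothetical.

```python
import random


def generate_hierarchy(num_classes, seed=0):
    """Return subclass axioms (child, parent) forming a random tree."""
    rng = random.Random(seed)  # seeded so that datasets are reproducible
    names = [f"C{i}" for i in range(num_classes)]
    axioms = []
    for i in range(1, num_classes):
        # The parent always has a smaller index, so no cycle can arise.
        parent = names[rng.randrange(i)]
        axioms.append((names[i], parent))
    return axioms


# Generate the same kind of ontology at several scales.
for size in (10, 100, 1000):
    print(size, len(generate_hierarchy(size)))
```

Fixing the seed keeps each scale reproducible, so that different systems are compared on identical inputs.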

Introduction of Controlled Inconsistencies: After generating the datasets, the next step is to introduce controlled inconsistencies, similar to those discussed in desideratum 2 of Section 3. This makes it possible to evaluate how effectively a system handles and resolves these inconsistencies.
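One way such a step might look is sketched below, under strong simplifying assumptions: class assertions are (individual, class) pairs, disjointness axioms are given as class pairs, and an inconsistency is created by asserting an individual into a class that is disjoint with one it already belongs to. The function `inject_clashes` is hypothetical. It returns the injected assertions separately so that the benchmark retains the ground truth about what was corrupted.

```python
import random


def inject_clashes(assertions, disjoint_pairs, num_clashes, seed=0):
    """Add up to num_clashes assertions that violate a disjointness axiom."""
    rng = random.Random(seed)
    disjoint_with = {}
    for a, b in disjoint_pairs:  # disjointness is symmetric
        disjoint_with.setdefault(a, []).append(b)
        disjoint_with.setdefault(b, []).append(a)

    existing = set(assertions)
    # Only assertions whose class takes part in a disjointness axiom qualify.
    candidates = sorted((ind, cls) for ind, cls in existing if cls in disjoint_with)
    rng.shuffle(candidates)

    injected = []
    for individual, cls in candidates:
        if len(injected) == num_clashes:
            break
        clash = (individual, rng.choice(disjoint_with[cls]))
        if clash not in existing:
            existing.add(clash)
            injected.append(clash)
    return list(assertions) + injected, injected


# Toy data chosen purely for illustration.
assertions = [("tom", "Cat"), ("rex", "Dog"), ("ann", "Person")]
disjoint_pairs = [("Cat", "Dog")]
noisy, injected = inject_clashes(assertions, disjoint_pairs, num_clashes=1, seed=42)
print(injected)
```

Varying `num_clashes` gives a controlled noise level, and the returned `injected` list allows repair or noise-tolerance results to be scored exactly.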

Input Formats: One essential benchmarking feature could be support for various OWL and RDF serialization formats, such as RDF/XML¹⁰ or OWL/XML.¹¹ This capability would allow the tool to handle ontologies in any input format and convert them into the required format, facilitating seamless integration and testing. Additionally, incorporating options for generating different dataset splits and profile-specific features, such as generating axioms in normal form for the $\mathcal{EL}^{++}$ profile, can further enhance the tool’s versatility and effectiveness in benchmarking neurosymbolic reasoning systems.
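The dataset-split option mentioned above could be as simple as the following sketch, which assumes that axioms are sortable tuples and that a seeded random split into training, validation, and test portions is sufficient. The name `split_axioms` and the default ratios are illustrative assumptions, not part of any existing tool.

```python
import random


def split_axioms(axioms, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split axioms into train/validation/test lists, reproducibly."""
    items = sorted(axioms)  # fixed order so the seed fully determines the split
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * ratios[0])
    n_valid = int(len(items) * ratios[1])
    train = items[:n_train]
    valid = items[n_train:n_train + n_valid]
    test = items[n_train + n_valid:]  # the remainder goes to the test set
    return train, valid, test


# Ten toy subclass axioms, purely for illustration.
axioms = [(f"C{i}", f"C{i + 1}") for i in range(10)]
train, valid, test = split_axioms(axioms, seed=7)
print(len(train), len(valid), len(test))
```

A purely random split may leak entailed axioms between portions, so a real benchmark would probably need a split that takes the inference structure into account.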

Evaluation of Deductive Capabilities: Traditional reasoners often struggle with inconsistent ontologies, which underlines the importance of starting with consistent ones. Generating inconsistencies is therefore kept as a separate step in the benchmarking process. At the same time, it is crucial to evaluate the features attributed to the neural aspect of the system, such as learning capabilities, repair abilities, and scalability. This involves assessing how the system handles ontological complexity, how it scales, and how it performs overall compared with traditional reasoning systems. Evaluating performance again after introducing controlled inconsistencies then shows how the system behaves in their presence, highlighting both its deductive capabilities and the specific contribution of the neural component.

Metric Design: Standard learning metrics such as accuracy, precision, and F1 score give only an overall picture of a system’s efficacy. A thorough analysis should also show which areas a system supports well and which need improvement, for example how it handles different ontological constructs. The benchmark’s metrics should therefore cover not only deductive capabilities but also adaptability to diverse scenarios and overall efficacy on complex neurosymbolic reasoning tasks.
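A finer-grained analysis of this kind could start from a per-construct breakdown of precision and recall, sketched below. It assumes that each inference is tagged with the ontological construct it exercises; the tags, the function name `per_construct_scores`, and the toy data are all hypothetical.

```python
from collections import defaultdict


def per_construct_scores(predicted, reference):
    """Precision and recall per construct from (construct, axiom) pairs."""
    predicted_by, reference_by = defaultdict(set), defaultdict(set)
    for construct, axiom in predicted:
        predicted_by[construct].add(axiom)
    for construct, axiom in reference:
        reference_by[construct].add(axiom)

    scores = {}
    for construct in sorted(set(predicted_by) | set(reference_by)):
        pred, ref = predicted_by[construct], reference_by[construct]
        true_positives = len(pred & ref)
        scores[construct] = {
            "precision": true_positives / len(pred) if pred else 0.0,
            "recall": true_positives / len(ref) if ref else 0.0,
        }
    return scores


# Toy example: subclass inferences are right, the existential one is wrong.
reference = [
    ("subClassOf", ("A", "C")),
    ("subClassOf", ("B", "C")),
    ("existential", ("A", "r", "D")),
]
predicted = [
    ("subClassOf", ("A", "C")),
    ("subClassOf", ("B", "C")),
    ("existential", ("A", "r", "E")),
]
scores = per_construct_scores(predicted, reference)
print(scores)
```

An aggregate score over the same data would hide the fact that every error falls on a single construct, which is exactly the kind of detail the breakdown makes visible.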

This methodology outlines a foundational approach to benchmark design that can be adapted and expanded to include more expressive profiles. It provides a systematic starting point for addressing the challenges in neurosymbolic reasoning, with the flexibility to evolve and incorporate additional complexity and features as the field progresses.

5. Conclusion

We highlighted the significant need for a comprehensive benchmark framework to tackle the challenges tied to evaluating neurosymbolic description logic reasoning systems. Merging symbolic logic and neural network-based machine learning brings great promise, but the lack of common evaluation methods has held back progress in the field. By underlining the importance of creating benchmarks, our aim for the future is to establish a structured way of evaluating these systems that can drive the field forward.

Footnotes

Acknowledgment

Gunjan Singh and Raghava Mutharaju would like to acknowledge the partial support of the Infosys Centre for Artificial Intelligence (CAI), IIIT-Delhi, in this work.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

ORCID iDs

Gunjan Singh

Riccardo Tommasini

Sumit Bhatia

Raghava Mutharaju

Notes

References

Baader, Calvanese, McGuinness, D. L., Nardi, & Patel-Schneider, P. F. (2003). The description logic handbook: Theory, implementation, and applications. Cambridge University Press.

Badreddine, d’Avila Garcez, A. S., Serafini, & Spranger (2022a). Logic tensor networks. Artificial Intelligence, 303, 103649. https://doi.org/10.1016/j.artint.2021.103649

Badreddine, d’Avila Garcez, A. S., Serafini, & Spranger (2022b). Logic tensor networks. Artificial Intelligence, 303, 103649. https://doi.org/10.1016/j.artint.2021.103649

Bordes, Usunier, Garcia-Durán, Weston, & Yakhnenko (2013). Translating embeddings for modeling multi-relational data. In Proceedings of the 26th international conference on neural information processing systems – volume 2, NIPS’13 (pp. 2787–2795). Curran Associates Inc.

Chen, Jiménez-Ruiz, Holter, O. M., Antonyrajah, & Horrocks (2021). OWL2Vec*: Embedding of OWL ontologies. Machine Learning, 110(7), 1813–1845. https://doi.org/10.1007/s10994-021-05997-6

Chen, Lécué, Geng, Pan, J. Z., & Chen (2020). Ontology-guided semantic composition for zero-shot learning. In D. Calvanese, E. Erdem & M. Thielscher (Eds.), Proceedings of the 17th International conference on principles of knowledge representation and reasoning, KR 2020, Rhodes, Greece, September 12–18, 2020 (pp. 850–854).

d’Avila Garcez, A. S., Besold, T. R., Raedt, L. D., Földiák, Hitzler, Icard, Kühnberger, Lamb, L. C., Miikkulainen, & Silver, D. L. (2015). Neural-symbolic learning and reasoning: Contributions and challenges. In 2015 AAAI spring symposia, Stanford University, Palo Alto, California, USA, March 22–25, 2015 (pp. 18–21). AAAI Press. http://www.aaai.org/ocs/index.php/SSS/SSS15/paper/view/10281

Devlin, Chang, M.-W., Lee, & Toutanova (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In J. Burstein, C. Doran & T. Solorio (Eds.), Proceedings of the 2019 conference of the North American Chapter of the Association for Computational Linguistics: Human language technologies, volume 1 (long and short papers) (pp. 4171–4186). Association for Computational Linguistics.

Eberhart, Ebrahimi, Zhou, Shimizu, & Hitzler (2020). Completion reasoning emulation for the description logic $\mathcal{EL}^{+}$. In A. Martin, K. Hinkelmann, H. Fill, A. Gerber, D. Lenat, R. Stolle & F. van Harmelen (Eds.), Proceedings of the AAAI 2020 spring symposium on combining machine learning and knowledge engineering in practice, AAAI-MAKE 2020, Palo Alto, CA, USA, March 23–25, 2020, Volume I, volume 2600 of CEUR workshop proceedings. CEUR-WS.org. http://ceur-ws.org/Vol-2600/paper5.pdf

Ebrahimi, Eberhart, & Hitzler (2021). On the capabilities of pointer networks for deep deductive reasoning. CoRR, abs/2106.09225. https://arxiv.org/abs/2106.09225

Ebrahimi, Sarker, M. K., Bianchi, Xie, Doran, & Hitzler (2018). Reasoning over RDF knowledge bases using deep learning. CoRR, abs/1811.04132. http://arxiv.org/abs/1811.04132

Farzana, Zhou, & Ristoski (2023). Knowledge graph-enhanced neural query rewriting. In Companion proceedings of the ACM web conference 2023, WWW 2023, Austin, TX, USA, 30 April 2023–4 May 2023 (pp. 911–919). ACM. https://doi.org/10.1145/3543873.3587678

Garg, Ikbal, Srivastava, S. K., Vishwakarma, Karanam, H. P., & Subramaniam, L. V. (2019). Quantum embedding of knowledge for reasoning. In H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox & R. Garnett (Eds.), Advances in neural information processing systems 32: Annual conference on neural information processing systems 2019, NeurIPS 2019, December 8–14, 2019, Vancouver, BC, Canada (pp. 5595–5605). Red Hook, NY: Curran Associates Inc. https://proceedings.neurips.cc/paper/2019/hash/cb12d7f933e7d102c52231bf62b8a678-Abstract.html

Grau, B. C., Horrocks, Motik, Parsia, Patel-Schneider, P. F., & Sattler (2008). OWL 2: The next step for OWL. Journal of Web Semantics, 6(4), 309–322. https://doi.org/10.1016/j.websem.2008.05.001

Gray (1993). The benchmark handbook for database and transaction systems (2nd ed.). Morgan Kaufmann.

Guo, Pan, & Heflin (2005). LUBM: A benchmark for OWL knowledge base systems. Journal of Web Semantics, 3(2–3), 158–182. https://doi.org/10.1016/j.websem.2005.06.005

Hitzler, Krötzsch, & Rudolph (2010). Foundations of semantic web technologies. Chapman and Hall/CRC Press. http://www.semantic-web-book.org/

Hogan, Blomqvist, d’Amato, Cochez, de Melo, Gutierrez, Kirrane, Gayo, J. E. L., Navigli, Neumaier, Ngomo, A. N., Polleres, Rashid, S. M., Rula, Schmelzeisen, Sequeda, J. F., Staab, & Zimmermann (2022). Knowledge graphs. ACM Computing Surveys, 54(4), 71:1–71:37. https://doi.org/10.1145/3447772

Hohenecker & Lukasiewicz (2017). Deep learning for ontology reasoning. CoRR, abs/1705.10342. http://arxiv.org/abs/1705.10342

Hohenecker & Lukasiewicz (2020). Ontology reasoning with deep neural networks. Journal of Artificial Intelligence Research, 68, 503–540. https://doi.org/10.1613/jair.1.11661

Jackermeier, Chen, & Horrocks (2024). Dual box embeddings for the description logic EL++. Association for Computing Machinery.

Kautz, H. A. (2022). The third AI summer: AAAI Robert S. Engelmore memorial lecture. AI Magazine, 43(1), 93–104. https://doi.org/10.1609/aimag.v43i1.19122

Kulmanov, Liu-Wei, Yan, & Hoehndorf (2019). EL Embeddings: Geometric construction of models for the Description Logic $\mathcal{EL}^{++}$. In S. Kraus (Ed.), Proceedings of the twenty-eighth international joint conference on artificial intelligence, IJCAI 2019, Macao, China, August 10–16, 2019 (pp. 6103–6109). ijcai.org. https://doi.org/10.24963/ijcai.2019/845

Lample & Charton (2020). Deep learning for symbolic mathematics. In 8th International conference on learning representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30, 2020. OpenReview.net. https://openreview.net/forum?id=S1eZYeHFDS

Lehmann, Isele, Jakob, Jentzsch, Kontokostas, Mendes, Hellmann, Morsey, Van Kleef, Auer, & Bizer (2014). DBpedia – A large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web Journal, 6(2), 167–195. https://doi.org/10.3233/SW-140134

Yang, Qiu, Xie, Pan, & Liu (2006). Towards a complete OWL ontology benchmark. In The semantic web: Research and applications (pp. 125–139). Springer.

Makni, Abdelaziz, & Hendler, J. A. (2020). Explainable deep RDFS reasoner. CoRR, abs/2002.03514. https://arxiv.org/abs/2002.03514

Makni, Ebrahimi, Gromann, & Eberhart (2021). Neuro-symbolic semantic reasoning. In P. Hitzler & M. K. Sarker (Eds.), Neuro-symbolic artificial intelligence: The state of the art, vol. 342 of frontiers in artificial intelligence and applications (pp. 253–279). IOS Press. https://doi.org/10.3233/FAIA210358

Makni & Hendler, J. A. (2019). Deep learning for noise-tolerant RDFS reasoning. Semantic Web, 10(5), 823–862. https://doi.org/10.3233/SW-190363

Mao, Gan, Kohli, & Tenenbaum, J. B. (2019). The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision. In 7th International conference on learning representations, ICLR 2019, New Orleans, LA, USA, May 6–9, 2019. OpenReview.net. https://openreview.net/forum?id=rJgMlhRctm

Mikolov, Chen, Corrado, G. S., & Dean (2013). Efficient estimation of word representations in vector space. In 1st international conference on learning representations, ICLR 2013, Scottsdale, Arizona, USA, May 2–4, 2013, Workshop Track Proceeding (pp. 1–12). https://api.semanticscholar.org/CorpusID:5959482

Mohapatra, Bhatia, Mutharaju, & Srinivasaraghavan (2021). Why settle for just one? Extending $\mathcal{EL}^{++}$ ontology embeddings with many-to-many relationships. CoRR, abs/2110.10555. https://arxiv.org/abs/2110.10555

Mondal, Bhatia, & Mutharaju (2021). EmEL$^{++}$: Embeddings for $\mathcal{EL}^{++}$ description logic. In A. Martin, K. Hinkelmann, H. Fill, A. Gerber, D. Lenat, R. Stolle & F. van Harmelen (Eds.), Proceedings of the AAAI 2021 spring symposium on combining machine learning and knowledge engineering (AAAI-MAKE 2021), Stanford University, Palo Alto, California, USA, March 22–24, 2021, vol. 2846 of CEUR workshop proceedings. CEUR-WS.org. http://ceur-ws.org/Vol-2846/paper19.pdf

Motik, Grau, B. C., Horrocks, Fokoue, & Lutz (2012). OWL 2 Web ontology language profiles (2nd ed.). https://www.w3.org/TR/owl2-profiles/

Ott, Ledaguenel, Hudelot, & Hartwig (2023). How to think about benchmarking neurosymbolic AI? In A. S. d’Avila Garcez, T. R. Besold, M. Gori & E. Jiménez-Ruiz (Eds.), Proceedings of the 17th international workshop on neural-symbolic learning and reasoning, La Certosa di Pontignano, Siena, Italy, July 3–5, 2023, vol. 3432 of CEUR workshop proceedings (pp. 248–254). CEUR-WS.org. https://ceur-ws.org/Vol-3432/paper22.pdf

Özçep, Ö. L., Leemhuis, & Wolter (2020). Cone semantics for logics with negation. In C. Bessiere (Ed.), Proceedings of the twenty-ninth international joint conference on artificial intelligence, IJCAI 2020 (pp. 1820–1826). ijcai.org. https://doi.org/10.24963/ijcai.2020/252

Parsia, Matentzoglu, Gonçalves, R. S., Glimm, & Steigmiller (2017). The OWL reasoner evaluation (ORE) 2015 competition report. Journal of Automated Reasoning, 59(4), 455–482. https://doi.org/10.1007/s10817-017-9406-8

Sarker, M. K., Zhou, Eberhart, & Hitzler (2021). Neuro-symbolic artificial intelligence. AI Communications, 34(3), 197–209. https://doi.org/10.3233/AIC-210084

Scarselli, Gori, Tsoi, A. C., Hagenbuchner, & Monfardini (2009). The graph neural network model. IEEE Transactions on Neural Networks, 20(1), 61–80. https://doi.org/10.1109/TNN.2008.2005605

Sen, de Carvalho, B. W. S., Riegel, & Gray, A. G. (2022). Neuro-symbolic inductive logic programming with logical neural networks. In Thirty-sixth AAAI conference on artificial intelligence, AAAI 2022, thirty-fourth conference on innovative applications of artificial intelligence, IAAI 2022, the twelveth symposium on educational advances in artificial intelligence, EAAI 2022 virtual event, February 22–March 1, 2022 (pp. 8212–8219). AAAI Press. https://ojs.aaai.org/index.php/AAAI/article/view/20795

Silver, Huang, Maddison, C. J., Guez, Sifre, van den Driessche, Schrittwieser, Antonoglou, Panneershelvam, Lanctot, Dieleman, Grewe, Nham, Kalchbrenner, Sutskever, Lillicrap, T. P., Leach, Kavukcuoglu, Graepel, & Hassabis (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484–489. https://doi.org/10.1038/nature16961

Singh, Bhatia, & Mutharaju (2020). OWL2Bench: A Benchmark for OWL 2 Reasoners. In J. Z. Pan, V. A. M. Tamma, C. d’Amato, K. Janowicz, B. Fu, A. Polleres, O. Seneviratne & L. Kagal (Eds.), The semantic web – ISWC 2020 – 19th international semantic web conference, Athens, Greece, November 2–6, 2020, proceedings, part II, vol. 12507 of lecture notes in computer science (pp. 81–96). Springer. https://doi.org/10.1007/978-3-030-62466-8_6

Singh, Bhatia, & Mutharaju (2023). Neuro-symbolic RDF and description logic reasoners: The state-of-the-art and challenges. In P. Hitzler & M. K. Sarker (Eds.), Compendium of neurosymbolic artificial intelligence, vol. 369 of frontiers in artificial intelligence and applications (pp. 29–63). IOS Press. https://doi.org/10.3233/FAIA230134

Steigmiller, Liebig, & Glimm (2014). Konclude: System description. Journal of Web Semantics, 27, 78–85. https://doi.org/10.1016/j.websem.2014.06.003

Suchanek, F. M., Kasneci, & Weikum (2007). YAGO: A core of semantic knowledge unifying WordNet and Wikipedia. In Proceedings of the 16th international conference on world wide web, WWW ’07 (pp. 697–706). Association for Computing Machinery. https://doi.org/10.1145/1242572.1242667

Vrandečić (2012). Wikidata: A new platform for collaborative data collection. In Proceedings of the 21st international conference on world wide web, WWW ’12 companion (pp. 1063–1064). Association for Computing Machinery. https://doi.org/10.1145/2187980.2188242

Xiong, Potyka, Tran, Nayyeri, & Staab (2022). Faithful embeddings for $\mathcal{EL}^{++}$ knowledge bases. In U. Sattler, A. Hogan, C. M. Keet, V. Presutti, J. P. A. Almeida, H. Takeda, P. Monnin, G. Pirrò & C. d’Amato (Eds.), The semantic web – ISWC 2022 – 21st international semantic web conference, virtual event, October 23–27, 2022, proceedings, vol. 13489 of lecture notes in computer science (pp. 22–38). Springer. https://doi.org/10.1007/978-3-031-19433-7_2

Uszkoreit, Fan, Zhao, & Zhu (2019). Explainable AI: A brief survey on history, research areas, approaches and challenges. In J. Tang, M. Kan, D. Zhao, S. Li & H. Zan (Eds.), Natural language processing and Chinese computing – 8th CCF international conference, NLPCC 2019, Dunhuang, China, October 9–14, 2019, proceedings, part II, vol. 11839 of lecture notes in computer science (pp. 563–574). Springer. https://doi.org/10.1007/978-3-030-32236-6_51

Benchmarking Neurosymbolic Description Logic Reasoners: Existing Challenges and a Way Forward
