Abstract
Recently, there has been significant progress in the development of robust and highly scalable neurosymbolic description logic reasoners. However, the field faces challenges arising from diverse design strategies and evaluation methods. We address the latter challenge by emphasizing the critical requirement for a comprehensive benchmark framework tailored to the unique evaluation needs of neurosymbolic description logic reasoners. In this paper, we address barriers that must be overcome to facilitate the effective evaluation of these reasoners and outline a potential methodology for designing the benchmark framework. This work contributes toward a more systematic and principled evaluation framework for neurosymbolic reasoning, highlighting the broader role of benchmarks in advancing the field.
Introduction
Neurosymbolic artificial intelligence (AI; d’Avila Garcez et al., 2015; Sarker et al., 2021) is a promising field that aims to bridge the gap between traditional symbolic logic and modern neural network-based machine learning. The idea is to combine the strengths of both approaches while overcoming their weaknesses. The focus of this paper lies within the realm of neurosymbolic reasoning. At its core, neurosymbolic reasoning involves integrating symbolic reasoning, which relies on structured logic and formal knowledge representation, with neural network-based methods known for their capacity to process large-scale, unstructured data and learn complex patterns from it. This fusion holds the potential for developing systems with enhanced performance, explainability, and generalization abilities (Ott et al., 2023). It is important to note that these approaches, unlike traditional reasoning methods, are not necessarily sound and complete. Instead, they strike a balance between approximating the precise reasoning capabilities of symbolic systems and harnessing the robust learning capabilities of machine learning techniques.
However, progress in this field faces significant challenges because neurosymbolic reasoning is still an emerging area, in contrast to other areas with extensive research and well-established benchmarks. For instance, several models (graph neural networks (GNNs; Scarselli et al., 2009) and logic tensor networks (Badreddine et al., 2022b)), methodologies (inductive logic programming; Sen et al., 2022), and innovative ideas (explainable AI (Xu et al., 2019) and zero-shot learning (Chen et al., 2020)) enrich this field. As a result, existing works exhibit diversity in techniques, and hence different methods and criteria are used to evaluate the performance of neurosymbolic reasoning systems (see Table 1). The lack of a standardized approach makes it difficult to compare these systems and make progress in the field. Furthermore, based on the reciprocal relationships between neural and symbolic components and how they benefit each other, neurosymbolic reasoning systems, and neurosymbolic AI systems in general, can be categorized into one of six distinct categories, as discussed by Henry Kautz (Kautz, 2022).
Symbolic Neuro Symbolic: In this category, the input and output are represented symbolically, such as with words or sequences of words. These symbols are converted into vectors using methods such as word2vec (Mikolov et al., 2013) and then fed into a neural network for processing.
Symbolic[Neuro]: Symbolic solvers use neural models internally for some functions, as seen in systems such as AlphaGo (Silver et al., 2016).
Neuro Neuro:Symbolic
Neuro[Symbolic]: Refers to the embedding of symbolic reasoning inside a neural engine, such as GNNs (Scarselli et al., 2009).
Each of these categories represents a unique approach to neurosymbolic AI, adding an extra layer of diversity to the advancements in this field.
Table 1. Overview of Variations in Neurosymbolic Reasoning and Evaluation Approaches.
Note. GO = gene ontology; AUC = area under the curve; LUBM = Lehigh University Benchmark; UOBM = University Ontology Benchmark; MRR = mean reciprocal rank; FoodOn = food ontology; KG = knowledge graph; RNN = recursive reasoning network; RTN = relational tensor network; LSTM = long short-term memory.
Drawing inspiration from Jim Gray’s pioneering work (Gray, 1993) on domain-specific benchmarks for databases, our goal is to tackle the challenge of benchmarking neurosymbolic reasoners. The primary purpose of such a benchmark is two-fold. Firstly, it serves as a tool to identify the performance bottlenecks, enabling targeted improvements in the systems where algorithms are still evolving. Secondly, benchmarks facilitate meaningful comparisons between various systems, offering insights into their relative strengths and weaknesses. While this paper does not put forth an alternative benchmark, we highlight the strong need for such benchmarks, including their features, and explain why they are essential for moving the field forward.
In Section 2, we delve into the recent advancements in neurosymbolic reasoning, highlighting the challenges in evaluating and comparing the existing state-of-the-art neurosymbolic reasoners. Subsequently, in Section 3, we address the barriers that must be overcome to facilitate the effective evaluation of neurosymbolic reasoners. Finally, in Section 4, we outline a potential methodology for designing the benchmark.
In recent years, there have been significant advancements in developing neurosymbolic reasoners for description logics (DLs; Baader et al., 2003), a formal underpinning for the Web Ontology Language 2 (OWL 2; Grau et al., 2008). While most of these works predominantly focus on classification and consistency checking (Hitzler et al., 2010; Makni et al., 2021; Singh et al., 2023), the other reasoning tasks, such as instance retrieval, query rewriting, materialization, abduction, and explanation generation, remain relatively unexplored. The intricacy of these tasks varies significantly, and delving into their complexities offers a promising avenue for further exploration.
Research in this domain takes an alternative approach to traditional reasoning tasks such as classification and consistency, breaking them into class subsumption, class membership, and satisfiability tasks. Various techniques are employed, such as geometric embeddings (Kulmanov et al., 2019; Mohapatra et al., 2021; Mondal et al., 2021; Xiong et al., 2022) that map ontological relationships to geometric spaces and emulating logical reasoning through machine learning (Eberhart et al., 2020; Ebrahimi et al., 2021; Makni & Hendler, 2019). A comprehensive overview and detailed insights into the state-of-the-art neurosymbolic reasoning landscape are discussed in Makni et al. (2021); Singh et al. (2023). Regarding other categories, a limited amount of work, such as that for e-commerce search (Farzana et al., 2023), merges neurosymbolic reasoning with query rewriting. This involves a knowledge graph (KG; Hogan et al., 2022) enhanced neural network approach that integrates auxiliary knowledge from a product KG, enhancing semantic understanding of user queries and improving query reformulation.
The existing traditional benchmarks such as the Lehigh University Benchmark (LUBM; Guo et al., 2005), the University Ontology Benchmark (UOBM; Ma et al., 2006), and OWL2Bench (Singh et al., 2020) lack suitability for evaluating neurosymbolic reasoners due to their narrow focus on conventional reasoning tasks. Traditional evaluations of reasoning systems often rely on metrics such as reasoning time, which may not align well with the evaluation requirements of neurosymbolic reasoners. Although the ontologies of these benchmarks, along with those from the OWL reasoner evaluation competition (Parsia et al., 2017), can serve as initial datasets for the proposed neurosymbolic benchmark framework, these datasets fall short of addressing the distinct challenges posed by neurosymbolic reasoning. To our knowledge, no benchmarks or evaluation frameworks have been designed to evaluate and compare neurosymbolic reasoning systems. Most reasoner evaluations are performed on different publicly available ontologies, including but not restricted to SNOMED CT, gene ontology (GO), and Galen, as well as other ontologies available in public repositories such as DBpedia (Lehmann et al., 2014), YAGO (Suchanek et al., 2007), Wikidata (Vrandečić, 2012), Claros, NCBO Bioportal, and AgroPortal. However, these offer a limited set of ontologies for evaluation, which does not cover the full spectrum of possible scenarios.
As discussed in Section 1, neurosymbolic approaches encompass a range of evaluation methodologies and reasoning techniques. This diversity becomes evident in Table 1, highlighting the necessity for a dedicated benchmark to systematically and comprehensively assess the performance of neurosymbolic reasoning systems. The table reveals the utilization of subsets of description logics, such as
To further highlight the diversity in the current approaches, we classify the works mentioned in Table 1 into one of the six distinct categories discussed in Section 1. Chen et al. (2021) convert the symbolic input (ontologies and RDF graphs) to vectors (Symbolic Neuro Symbolic). Makni and Hendler (2019), Ebrahimi et al. (2018), Ebrahimi et al. (2021), Eberhart et al. (2020), Hohenecker and Lukasiewicz (2017), and Makni et al. (2020) take symbolic reasoning rules as input and compile them during training (Neuro:Symbolic
Desiderata for Benchmarking Neurosymbolic Reasoners
Creating an effective benchmark demands careful consideration of critical principles such as simplicity for accessibility, portability for impartial assessments across various approaches, scalability to accommodate diverse system sizes, and relevance to reflect practical challenges in benchmark scenarios (Gray, 1993). However, the evaluation of neurosymbolic reasoners presents its own set of distinctive challenges. Given the field’s novelty, state-of-the-art solutions do not approach such challenges systematically. Therefore, we outline below the issues that should be prioritized in constructing a fair neurosymbolic reasoning benchmark.
To effectively evaluate neurosymbolic reasoners, the benchmark must incorporate diverse scenarios that mirror the complexity and variety encountered in real-world applications. This approach ensures a thorough assessment of the reasoners’ capabilities across different contexts. Key aspects to consider include:
Variety of Ontologies: The benchmark should encompass a range of ontologies differing in size, profile, and axiom types. This includes:
Size and Complexity: Include ontologies with varying sizes and complexities in both assertional knowledge (ABox) and terminological knowledge (TBox) to evaluate how reasoners handle different levels of detail and scope.
OWL 2 Profiles: Use ontologies that adhere to various OWL 2 profiles (such as EL, QL, RL, and DL) to test the reasoners’ ability to handle different levels of expressiveness.
Axiom Types: Incorporate different types of axioms (such as subclass relations and property restrictions) and their combinations to assess how well the reasoners manage diverse logical constructs.

Specific and Generic Reasoning Tasks: Benchmark scenarios should include both specific reasoning tasks and generic information needs:
Specific Reasoning Tasks: Design tasks that test particular reasoning capabilities, such as classification, consistency checking, and instance retrieval. These tasks enable micro-benchmarking and provide insights into the strengths and limitations of individual reasoners.
Generic Information Needs: Include tasks that assess the reasoners’ ability to handle complex and broader reasoning scenarios, such as evaluating how well the reasoners can address multistep queries that involve integrating diverse information sources and applying both symbolic rules and neural network-derived insights. This includes testing the reasoners’ ability to synthesize and leverage contextual information to generate coherent and relevant responses.

Real-World Applicability: Ensure that benchmark scenarios reflect real-world use cases:
Real-World Ontologies: Analyze and incorporate real-world ontologies and existing benchmarks to capture practical challenges and scenarios.
Scalability and Realism: Design scenarios that not only address current requirements but also scale beyond them to foster technological advancement and future-proof the evaluation process.
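One way to make the scenario dimensions above concrete is to enumerate them as a grid. The following is a minimal sketch, not part of any existing benchmark: the profile and task names come from the text, while the ABox sizes, the dictionary keys, and the function name `scenario_grid` are illustrative assumptions.

```python
from itertools import product

# OWL 2 profiles and reasoning tasks named in the text.
PROFILES = ["EL", "QL", "RL", "DL"]
TASKS = ["classification", "consistency_checking", "instance_retrieval"]
# Hypothetical ABox sizes (number of assertions); chosen only for illustration.
ABOX_SIZES = [1_000, 10_000, 100_000]


def scenario_grid(profiles=PROFILES, tasks=TASKS, sizes=ABOX_SIZES):
    """Enumerate every (profile, task, size) combination as one scenario."""
    return [
        {"profile": p, "task": t, "abox_size": n}
        for p, t, n in product(profiles, tasks, sizes)
    ]


if __name__ == "__main__":
    for scenario in scenario_grid()[:3]:
        print(scenario)
```

A full grid grows multiplicatively with each added dimension (for example, axiom types), so a practical benchmark would likely sample from it rather than run every cell.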
Incorporating controlled inconsistencies into benchmark design presents a significant challenge but is essential for evaluating the robustness of neurosymbolic reasoners. Controlled inconsistencies should be introduced in a deterministic manner to assess how well the systems handle and resolve contradictions. Key aspects to consider include:
Types of Inconsistencies:
Structural Inconsistency: Introduce structural inconsistencies by creating contradictions in the ontological hierarchy or relationships. For example: If the ontology specifies that the entities “Male” and “Female” are disjoint classes, add instances in the ABox that are classified as both “Male” and “Female.” This tests the system’s ability to detect and resolve structural conflicts. Similarly, create inconsistencies by defining contradictory class hierarchies or property restrictions that violate the logical constraints of the ontology.
Semantic Inconsistency: Introduce semantic inconsistencies by modifying entity names or attributes to introduce ambiguity or slight deviations. For example: Change names or attributes in a way that creates near-identical but distinct instances, such as altering “John” to “Jonh” to test how well the system identifies and resolves semantic conflicts. Introduce synonyms or typographical errors that may lead to semantic ambiguities and test how the system manages these issues.

Reproducibility and Control:
Deterministic Generation: Ensure that the process of generating inconsistencies is deterministic, allowing for consistent reproduction of test scenarios. This is crucial for evaluating the effectiveness of the system’s handling of inconsistencies.
Controlled Environment: Design mechanisms to introduce inconsistencies in a controlled manner, avoiding randomness that could obscure the evaluation of specific reasoning capabilities.

Note that existing benchmarks may lack the capability to introduce generic inconsistencies effectively or in a contextually relevant manner. This highlights the need for novel approaches to benchmark design. Traditional generative AI models, such as Large Language Models, may not be well-suited for creating controlled inconsistencies. This underscores the unique requirements for benchmark design that effectively simulates real-world contradictions. Incorporating controlled inconsistencies into the benchmark will provide a deeper understanding of a reasoner’s robustness and its ability to manage and resolve conflicts, reflecting the complexity of real-world scenarios where inconsistencies are prevalent.
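The two kinds of controlled inconsistency described above can be sketched as follows. This is an illustrative sketch under simplifying assumptions: the ABox is modeled as a list of (individual, class) pairs, disjointness as a list of class pairs, and the function names `inject_disjointness_violations` and `perturb_name` are hypothetical. A seeded pseudo-random generator stands in for "deterministic generation": the same seed always yields the same injected errors.

```python
import random


def inject_disjointness_violations(abox, disjoint_pairs, n, seed=0):
    """Add up to n class assertions that violate declared disjointness.

    abox: list of (individual, class_name) assertions.
    disjoint_pairs: list of (class_a, class_b) declared disjoint in the TBox.
    Returns (noisy_abox, injected) so the injected errors remain known.
    """
    rng = random.Random(seed)  # fixed seed: identical output on every run
    existing = set(abox)
    members = {}
    for individual, cls in abox:
        members.setdefault(cls, []).append(individual)
    # Candidate violation: a member of class A is also asserted to be in B.
    candidates = sorted(
        {
            (ind, b)
            for a, b in disjoint_pairs
            for ind in members.get(a, [])
            if (ind, b) not in existing
        }
    )
    injected = rng.sample(candidates, min(n, len(candidates)))
    return list(abox) + injected, injected


def perturb_name(name, seed=0):
    """Swap two adjacent characters, e.g. turning 'John' into a near-duplicate."""
    if len(name) < 2:
        return name
    i = random.Random(seed).randrange(len(name) - 1)
    chars = list(name)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


if __name__ == "__main__":
    abox = [("alice", "Female"), ("bob", "Male"), ("carol", "Female")]
    disjoint = [("Female", "Male"), ("Male", "Female")]
    noisy, injected = inject_disjointness_violations(abox, disjoint, n=2, seed=42)
    print(injected)
    print(perturb_name("John", seed=1))
```

Returning the injected assertions alongside the noisy ABox keeps the ground truth available, so a reasoner's detections can later be scored against exactly what was introduced.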
A critical aspect of benchmarking neurosymbolic reasoners is the representation of input data. This involves ensuring that ontological knowledge, both ABox (assertional knowledge) and TBox (terminological knowledge), is formatted in a manner that various reasoning systems can effectively process. This flexibility ensures comprehensive and realistic evaluation conditions, enabling the assessment of reasoning systems across the spectrum of neurosymbolic methodologies.
Ontology Formats: The benchmark should support multiple ontology formats such as RDF/XML, Turtle, and Manchester OWL Syntax, among others. This ensures compatibility with a wide range of systems that may require specific formats.
Axiom Format: While some neurosymbolic systems may utilize embedding techniques to transform ontological entities and relationships into continuous vector spaces, others might directly process axioms in their logical form. For instance, certain systems might require axioms in a normalized form as per the
Preembedded Entities: Some systems may necessitate entities represented as embeddings, using models such as BERT (Devlin et al., 2019) or other neural embeddings such as TransE (Bordes et al., 2013). The benchmark should offer pre-embedded entity representations, ensuring compatibility with these methods and enabling comprehensive evaluation across different representation techniques.
Dataset Splits: The benchmark should facilitate the generation of dataset splits tailored to diverse testing needs, such as train–test–validation splits. This enables a thorough evaluation of a system’s learning and generalization capabilities across different segments of data. Properly managed splits ensure that the performance metrics accurately reflect the system’s ability to handle unseen data and prevent overfitting.
Domain-Agnostic Datasets: To assess a system’s understanding of logical semantics independently of specific domain knowledge, the benchmark should have the capability to generate domain-agnostic datasets. This allows for evaluation focused on the system’s ability to interpret and apply logical rules universally rather than relying on domain-specific information.
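As an illustration of the dataset-split requirement, the sketch below partitions a set of axioms into train, validation, and test portions reproducibly. It is a minimal sketch under stated assumptions: axioms are modeled as plain tuples of strings, the 80/10/10 ratios are arbitrary defaults, and the function name `split_axioms` is hypothetical.

```python
import random


def split_axioms(axioms, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split axioms into disjoint train/valid/test lists, reproducibly.

    axioms: iterable of hashable, sortable items, e.g. ("SubClassOf", "A", "B").
    ratios: fractions for train and valid; the remainder goes to test.
    """
    items = sorted(set(axioms))  # canonical order, independent of input order
    random.Random(seed).shuffle(items)  # seeded shuffle for reproducibility
    n_train = int(len(items) * ratios[0])
    n_valid = int(len(items) * ratios[1])
    return {
        "train": items[:n_train],
        "valid": items[n_train:n_train + n_valid],
        "test": items[n_train + n_valid:],
    }


if __name__ == "__main__":
    axioms = [("SubClassOf", f"C{i}", f"C{i + 1}") for i in range(10)]
    splits = split_axioms(axioms, seed=7)
    print({name: len(part) for name, part in splits.items()})
```

A uniform random split like this can leak entailments between partitions (a test axiom may follow from training axioms), which may or may not be desired depending on whether the benchmark is testing memorization, deduction, or generalization.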
In the trajectory toward developing a new generation of reasoners that effectively harness the potential of both neural networks and logical reasoning, a foundational requirement involves conducting an equitable assessment of state-of-the-art solutions. This assessment provides insights into the present capabilities of these approaches and illuminates the trajectory of the field’s future development. Evaluating these aspects ensures that the reasoners can generalize beyond specific datasets and apply logical rules consistently across different domains. The key points of this assessment include:
Soundness and Completeness: Traditional deductive reasoners are sound and complete, meaning they produce correct and exhaustive inferences based on given axioms. Evaluating whether neurosymbolic reasoners can achieve similar standards is critical.
Generalization Capabilities: Deductive reasoners should be able to work across any ontology from any domain. This includes verifying that the reasoners can generalize logical rules universally and not be confined to specific datasets or domains. For instance, a rule stating “if A is a subclass of B and B is a subclass of C, then A is a subclass of C” should apply universally, irrespective of the specific terms involved. This ensures the systems can apply logical rules consistently across different domains.
Scalability and Efficiency: Assessing the scalability and efficiency of these reasoners in handling large and complex ontologies is essential. Traditional deductive reasoners are designed to handle extensive datasets and intricate logical structures, which serve as a benchmark for emerging neurosymbolic systems. Understanding how these models perform under varying degrees of complexity and scale can guide the development of more robust and versatile reasoning systems.
Handling Noise and Inconsistent Data: When evaluating the deductive capabilities of these reasoners, it is crucial to consider how well they handle noise and inconsistent data. Real-world applications often involve datasets with inaccuracies, ambiguities, and inconsistencies that can challenge the reasoning process. Assessing a system’s ability to manage and mitigate the impact of such issues provides valuable insights into its robustness and practical applicability. This includes evaluating how well the system maintains soundness and completeness in the presence of noisy or conflicting data and its effectiveness in adapting to varying degrees of data quality and integrity.
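The subclass transitivity rule quoted above can serve as a small, domain-agnostic gold standard. The sketch below is an illustration only, restricted to told subclass pairs and transitivity (a real DL reasoner handles far more): it computes the entailed subsumptions and then scores a set of predicted subsumptions, using precision as a proxy for soundness and recall as a proxy for completeness. The function names are hypothetical.

```python
def subsumption_closure(told):
    """Transitive closure of told (sub, sup) pairs; reflexive pairs are omitted."""
    closure = set(told)
    changed = True
    while changed:  # repeat until no new pair can be derived
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                # A ⊑ B and B ⊑ D entail A ⊑ D.
                if b == c and a != d and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def soundness_completeness(predicted, entailed):
    """Return (precision, recall) of predicted pairs against entailed pairs."""
    predicted, entailed = set(predicted), set(entailed)
    correct = predicted & entailed
    precision = len(correct) / len(predicted) if predicted else 1.0
    recall = len(correct) / len(entailed) if entailed else 1.0
    return precision, recall


if __name__ == "__main__":
    told = {("A", "B"), ("B", "C"), ("C", "D")}
    gold = subsumption_closure(told)
    predicted = {("A", "B"), ("A", "C"), ("A", "E")}
    print(soundness_completeness(predicted, gold))
```

Because the closure depends only on the structure of the pairs, renaming the classes (for example, replacing "A" with an arbitrary identifier) leaves the gold standard structurally unchanged, which is one simple way to probe whether a learned reasoner relies on term names rather than logical form.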
When assessing deductive reasoning capabilities and comparing them with conventional deductive reasoners, it is advantageous to also include neural-based approaches, such as large language models, in the evaluation framework. While neural methods may not always excel in every aspect of deductive reasoning, incorporating them as a baseline can offer valuable comparative insights. This approach not only underscores the benefits of neurosymbolic methods, which integrate both neural and symbolic reasoning, but also provides a more comprehensive understanding of the strengths and potential synergies between different reasoning paradigms.
In order to accurately measure the performance of neurosymbolic reasoners, the benchmark must support a range of metrics and key performance indicators that capture various aspects of system performance. Standard metrics commonly used in evaluation include:
Accuracy, Precision, Recall, and F1 Score: These metrics provide insights into the classification performance of neural components, assessing how well the system identifies correct versus incorrect predictions. That is, they indicate how accurately the system produces only correct inferences (soundness) and whether it generates all possible correct inferences based on the given axioms (completeness).
Mean Reciprocal Rank (MRR) and Hits@K: These are used to evaluate ranking tasks, measuring the position of correct answers in a ranked list of predictions. They help assess the effectiveness of the system in retrieving relevant information.
Scalability Metrics: Metrics such as processing time and memory usage evaluate how well the system handles large and complex datasets.

While these standard metrics are essential for evaluating traditional aspects of system performance, there remains a need for developing new metrics tailored to the unique characteristics of neurosymbolic reasoning. Current benchmarks might not fully capture critical aspects such as:
Robustness to Noise and Inconsistent Data: Evaluating how well the system manages inaccuracies, ambiguities, and inconsistencies in real-world data. This requires metrics that measure the impact of noisy or conflicting data on performance and the system’s ability to maintain robustness.
Inference Generation Efficiency: Metrics that assess the system’s capability to generate all inferences in a single run, while ensuring system soundness, measuring computational efficiency and the number of iterations required.
Explanatory Capabilities: Evaluating the quality and usefulness of explanations provided by the system for its inferences. This includes measuring the clarity and completeness of explanations, which is crucial for transparency and user trust.
Generalization Across Domains: Metrics to assess how well the system transfers reasoning capabilities across different domains, ensuring consistent performance and applicability in varied contexts.
Embedding Quality: For systems that use embeddings, metrics to evaluate how well embeddings capture logical relationships and nuances. This includes assessing the embeddings’ effectiveness in preserving logical structures and supporting accurate inferences.

The inclusion of these novel metrics, alongside traditional ones, ensures a comprehensive evaluation of neurosymbolic reasoners. This approach provides deeper insights into the performance and limitations of current systems, guiding future improvements and research directions.
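For reference, the two ranking metrics named above can be computed as in the following sketch. It assumes one correct answer per query and that each system output is an ordered list of candidates; the data layout (dictionaries keyed by query) and function names are illustrative choices, not a prescribed interface.

```python
def mean_reciprocal_rank(rankings, gold):
    """Average of 1/rank of the correct answer; 0 if it is not ranked at all.

    rankings: dict mapping query -> list of candidates, best first.
    gold: dict mapping query -> the single correct answer.
    """
    total = 0.0
    for query, ranked in rankings.items():
        target = gold[query]
        if target in ranked:
            total += 1.0 / (ranked.index(target) + 1)  # ranks start at 1
    return total / len(rankings)


def hits_at_k(rankings, gold, k):
    """Fraction of queries whose correct answer is in the top k candidates."""
    hits = sum(1 for q, ranked in rankings.items() if gold[q] in ranked[:k])
    return hits / len(rankings)


if __name__ == "__main__":
    # Hypothetical subsumption queries: "which class is the superclass of q?"
    rankings = {
        "q1": ["B", "C", "D"],
        "q2": ["C", "B", "D"],
        "q3": ["D", "C", "A"],
    }
    gold = {"q1": "B", "q2": "B", "q3": "E"}
    print(mean_reciprocal_rank(rankings, gold))
    print(hits_at_k(rankings, gold, k=1), hits_at_k(rankings, gold, k=2))
```

In this toy example the correct answers sit at ranks 1 and 2 and one is missing, so the MRR is (1 + 0.5 + 0) / 3 = 0.5.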
In the rapidly evolving field of neurosymbolic reasoning, the benchmark’s adaptability is crucial for ensuring its relevance and effectiveness. The benchmark should be designed to accommodate the following aspects:
Continuous Updates: The benchmark should be capable of integrating new tasks and methodologies as they emerge. This involves regularly updating the benchmark to reflect the latest advancements and challenges in neurosymbolic reasoning.
Ongoing Review and Feedback: Regular reviews and updates based on the latest research and feedback from the community are essential to keep the benchmark aligned with current practices and real-world needs.
In this section, we propose one possible methodology for designing a benchmark that meets the objectives outlined in Section 3. The methodology involves the following key steps:
Generating Diverse Benchmark Scenarios: The initial step toward creating a benchmark involves curating datasets that cover a wide range of benchmarking scenarios. One approach is to start with a study of existing neurosymbolic description logic reasoners, beginning with basic ontology profiles such as RDFS and gradually progressing to more complex ones such as OWL 2 EL and OWL 2 DL. This process includes evaluating existing datasets across various models and systems, comparing results with traditional reasoning systems such as Konclude (Steigmiller et al., 2014), and identifying performance variations. This analysis can provide insights about the datasets that prove critical for existing systems. We can then focus on generating synthetic datasets that replicate these patterns at various scales and complexities, ensuring coverage of diverse ontology constructs and reasoning tasks.
Introduction of Controlled Inconsistencies: After generating the datasets, the next step is to introduce controlled inconsistencies, similar to those discussed in the second desideratum of Section 3. This approach allows for the evaluation of how effectively the system handles and resolves these inconsistencies.
Input Formats: One essential benchmarking feature could be support for various OWL and RDF serialization formats, such as RDF/XML or OWL/XML. This capability would allow the tool to handle ontologies in any input format and convert them into the required format, facilitating seamless integration and testing. Additionally, incorporating options for generating different dataset splits and profile-specific features, such as generating axioms in normal form for the
Evaluation of Deductive Capabilities: Traditional reasoners often struggle with inconsistent ontologies, underlining the importance of starting with consistent ontologies. Therefore, generating inconsistencies is kept as a separate step in the benchmarking process. Simultaneously, it is crucial to evaluate the features attributed to the neural aspect of the system, such as learning capabilities, repair abilities, and scalability. This involves assessing performance in terms of handling ontological complexity, scalability, and overall results compared to traditional reasoning systems. Evaluating performance after introducing controlled inconsistencies shows what a system gains in the presence of inconsistencies, bringing out both its deductive capabilities and the unique contributions of the neural component, and thus gives a comprehensive picture of the system’s overall capabilities.
Metric Design: The existing standard learning metrics, such as accuracy, precision, and F1 score, only provide an overall idea of the efficacy of the systems. However, these systems need a thorough analysis, emphasizing areas well-supported by systems and areas needing improvement, such as handling different ontological constructs. These metrics should encompass not only deductive capabilities but also adaptability to diverse scenarios and overall efficacy in handling complex neurosymbolic reasoning tasks.
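The metric-design step above calls for analysis finer than a single aggregate score. One simple way to do this, sketched below under the assumption that each evaluated inference is tagged with the ontological construct it exercises, is to report accuracy per construct. The construct labels and the function name `accuracy_by_construct` are illustrative, not taken from an existing tool.

```python
from collections import defaultdict


def accuracy_by_construct(examples):
    """Compute accuracy separately for each ontological construct.

    examples: iterable of (construct, predicted_label, gold_label) triples,
    where construct is a tag such as "subclass" or "existential_restriction".
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for construct, predicted, gold in examples:
        totals[construct] += 1
        correct[construct] += int(predicted == gold)
    return {c: correct[c] / totals[c] for c in totals}


if __name__ == "__main__":
    # Hypothetical results: True/False means "the axiom is entailed".
    examples = [
        ("subclass", True, True),
        ("subclass", True, True),
        ("subclass", False, True),
        ("subclass", True, True),
        ("existential_restriction", True, False),
        ("existential_restriction", True, True),
        ("disjointness", False, False),
    ]
    for construct, acc in sorted(accuracy_by_construct(examples).items()):
        print(f"{construct}: {acc:.2f}")
```

A breakdown like this makes it visible when a high overall score is driven by easy constructs while harder ones (for example, existential restrictions) are handled poorly.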
This methodology outlines a foundational approach to benchmark design that can be adapted and expanded to include more expressive profiles. It provides a systematic starting point for addressing the challenges in neurosymbolic reasoning, with the flexibility to evolve and incorporate additional complexity and features as the field progresses.
Conclusion
We highlighted the significant need for a comprehensive benchmark framework to tackle the challenges tied to evaluating neurosymbolic description logic reasoning systems. Merging symbolic logic and neural network-based machine learning brings great promise, but the lack of common evaluation methods has held back progress in the field. By underlining the importance of creating benchmarks, our aim for the future is to establish a structured way of evaluating these systems that can drive the field forward.
Acknowledgment
Gunjan Singh and Raghava Mutharaju would like to acknowledge the partial support of the Infosys Centre for Artificial Intelligence (CAI), IIIT-Delhi, in this work.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
