Abstract
Recently, there has been significant progress in the development of robust and highly scalable neurosymbolic description logic reasoners. However, the field faces challenges arising from diverse design strategies and evaluation methods. We address the latter challenge by emphasizing the critical need for a comprehensive benchmark framework tailored to the unique evaluation requirements of neurosymbolic description logic reasoners. In this paper, we identify the barriers that must be overcome to enable effective evaluation of these reasoners and outline a potential methodology for designing such a benchmark framework. This work contributes toward a more systematic and principled evaluation framework for neurosymbolic reasoning, highlighting the broader role of benchmarks in advancing the field.
Introduction
Neurosymbolic artificial intelligence (AI; d’Avila Garcez et al., 2015; Sarker et al., 2021) is a promising field that aims to bridge the gap between traditional symbolic logic and modern neural network-based machine learning. The idea is to combine the strengths of both approaches while overcoming their weaknesses. The focus of this paper lies within the realm of neurosymbolic reasoning. At its core, neurosymbolic reasoning involves integrating symbolic reasoning, which relies on structured logic and formal knowledge representation, with neural network-based methods known for their capacity to process large-scale, unstructured data and learn complex patterns from it. This fusion holds the potential for developing systems with enhanced performance, explainability, and generalization abilities (Ott et al., 2023). It is important to note that these approaches, unlike traditional reasoning methods, are not necessarily sound and complete. Instead, they strike a balance between approximating the precise reasoning capabilities of symbolic systems and harnessing the robust learning capabilities of machine learning techniques.
However, progress in this field faces significant challenges because neurosymbolic reasoning is still emerging, in contrast to more mature areas with extensive research and well-established benchmarks. For instance, a variety of models (graph neural networks (GNNs; Scarselli et al., 2009) and logic tensor networks (Badreddine et al., 2022b)), methodologies (inductive logic programming; Sen et al., 2022), and innovative ideas (explainable AI (Xu et al., 2019) and zero-shot learning (Chen et al., 2020)) enrich this field. As a result, existing works exhibit considerable diversity in techniques, and different methods and criteria are used to evaluate the performance of neurosymbolic reasoning systems (see Table 1). The lack of a standardized approach makes it difficult to compare these systems and make progress in the field. Furthermore, based on the reciprocal relationship between the neural and symbolic components and how they benefit each other, neurosymbolic reasoning systems, and neurosymbolic AI systems in general, can be categorized into one of six distinct categories, as proposed by Henry Kautz (Kautz, 2022).
Symbolic Neuro Symbolic: In this category, the input and output are represented symbolically, such as with words or sequences of words. These symbols are converted into vectors using methods such as word2vec (Mikolov et al., 2013) and then fed into a neural network for processing.
Symbolic[Neuro]: Symbolic solvers use neural models internally for some functions, as seen in systems such as AlphaGo (Silver et al., 2016).
Neuro | Symbolic: A neural system and a symbolic reasoner are coupled in a cooperative pipeline, with each component handling the subtasks it is best suited for.
Neuro: Symbolic → Neuro: Symbolic knowledge, such as reasoning rules, is compiled into the training procedure of a neural network.
Neuro_Symbolic: Symbolic knowledge is translated into templates that shape the structure of the neural model itself, as in logic tensor networks (Badreddine et al., 2022b).
Neuro[Symbolic]: Refers to the embedding of symbolic reasoning inside a neural engine, such as GNNs (Scarselli et al., 2009).
Each of these categories represents a unique approach to neurosymbolic AI, adding an extra layer of diversity to the advancements in this field.
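To make the first category concrete, the following is a minimal sketch, assuming PyTorch, of how symbolic inputs (e.g., the tokens of an axiom or sentence) might be mapped to vectors and passed through a small neural network. The vocabulary, dimensions, and classification head are illustrative assumptions, not taken from any of the cited systems.

```python
import torch
import torch.nn as nn

# Illustrative vocabulary of symbols (in practice, built from the corpus or ontology).
vocab = {"Student": 0, "subClassOf": 1, "Person": 2, "enrolledIn": 3, "Course": 4}

class SymbolicNeuroSymbolic(nn.Module):
    """Symbols in -> vectors -> neural network -> symbolic label out."""
    def __init__(self, vocab_size: int, dim: int = 32, num_labels: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)   # word2vec-style lookup table
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, num_labels))

    def forward(self, symbol_ids: torch.Tensor) -> torch.Tensor:
        vectors = self.embed(symbol_ids)             # (seq_len, dim)
        pooled = vectors.mean(dim=0)                 # simple sequence pooling
        return self.mlp(pooled)                      # e.g., "entailed" vs. "not entailed"

# Encode the symbolic statement "Student subClassOf Person" and score it.
ids = torch.tensor([vocab["Student"], vocab["subClassOf"], vocab["Person"]])
logits = SymbolicNeuroSymbolic(len(vocab))(ids)
```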
Table 1. Overview of Variations in Neurosymbolic Reasoning and Evaluation Approaches.
Drawing inspiration from Jim Gray’s pioneering work (Gray, 1993) on domain-specific benchmarks for databases, our goal is to tackle the challenge of benchmarking neurosymbolic reasoners. The primary purpose of such a benchmark is twofold. First, it serves as a tool to identify performance bottlenecks, enabling targeted improvements in systems whose algorithms are still evolving. Second, benchmarks facilitate meaningful comparisons between various systems, offering insights into their relative strengths and weaknesses. While this paper does not put forth a concrete benchmark, we highlight the strong need for such benchmarks, describe their desired features, and explain why they are essential for moving the field forward.
In Section 2, we delve into the recent advancements in neurosymbolic reasoning, highlighting the challenges in evaluating and comparing the existing state-of-the-art neurosymbolic reasoners. Subsequently, in Section 3, we address the barriers that must be overcome to facilitate the effective evaluation of neurosymbolic reasoners. Finally, in Section 4, we outline a potential methodology for designing the benchmark.
In recent years, there have been significant advancements in developing neurosymbolic reasoners for description logics (DLs; Baader et al., 2003), the formal underpinning of the Web Ontology Language (OWL 2; Grau et al., 2008). While most of these works focus on classification and consistency checking (Hitzler et al., 2010; Makni et al., 2021; Singh et al., 2023), other reasoning tasks, such as instance retrieval, query rewriting, materialization, abduction, and explanation generation, remain relatively unexplored. The intricacy of these tasks varies significantly, and delving into their complexities offers a promising avenue for further exploration.
Research in this domain takes an alternative approach to traditional reasoning tasks such as classification and consistency checking, breaking them into class subsumption, class membership, and satisfiability tasks. Various techniques are employed, such as geometric embeddings (Kulmanov et al., 2019; Mohapatra et al., 2021; Mondal et al., 2021; Xiong et al., 2022) that map ontological relationships to geometric spaces, and approaches that emulate logical reasoning through machine learning (Eberhart et al., 2020; Ebrahimi et al., 2021; Makni & Hendler, 2019). A comprehensive overview of the state-of-the-art neurosymbolic reasoning landscape is provided in Makni et al. (2021) and Singh et al. (2023). In other categories, a limited amount of work, such as that on e-commerce search (Farzana et al., 2023), merges neurosymbolic reasoning with query rewriting: a knowledge graph (KG; Hogan et al., 2022) enhanced neural network integrates auxiliary knowledge from a product KG to improve the semantic understanding of user queries and their reformulation.
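As an illustration of the geometric-embedding idea, the sketch below is loosely inspired by ball-style class embeddings such as those of Kulmanov et al. (2019): each class is represented as an n-ball (center, radius), and a subsumption C ⊑ D is scored by how far the ball of C protrudes outside the ball of D. The loss form and dimensions here are illustrative assumptions rather than the published formulation.

```python
import numpy as np

def subsumption_loss(center_c, radius_c, center_d, radius_d):
    """Penalty for C ⊑ D: zero when ball(C) lies entirely inside ball(D)."""
    gap = np.linalg.norm(center_c - center_d) + radius_c - radius_d
    return max(gap, 0.0)

# Toy 2-D class embeddings (in practice, learned by gradient descent from the ontology).
classes = {
    "Student": (np.array([0.2, 0.1]), 0.3),
    "Person":  (np.array([0.0, 0.0]), 1.0),
}
c, rc = classes["Student"]
d, rd = classes["Person"]
print(subsumption_loss(c, rc, d, rd))  # 0.0 -> Student ⊑ Person is satisfied geometrically
```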
Existing traditional benchmarks such as the Lehigh University Benchmark (LUBM; Guo et al., 2005), the University Ontology Benchmark (UOBM; Ma et al., 2006), and OWL2Bench (Singh et al., 2020) are not well suited for evaluating neurosymbolic reasoners due to their narrow focus on conventional reasoning tasks. Traditional evaluations of reasoning systems often rely on metrics such as reasoning time, which may not align well with the evaluation requirements of neurosymbolic reasoners. Although the ontologies of these benchmarks, along with those from the OWL reasoner evaluation competition (Parsia et al., 2017), can serve as initial datasets for the proposed neurosymbolic benchmark framework, these datasets fall short of addressing the distinct challenges posed by neurosymbolic reasoning. To our knowledge, no benchmarks or evaluation frameworks have been designed to evaluate and compare neurosymbolic reasoning systems. Most reasoner evaluations are performed on different publicly available ontologies, including but not restricted to SNOMED CT, the gene ontology (GO), and Galen, as well as other ontologies available in public repositories such as DBpedia (Lehmann et al., 2014), YAGO (Suchanek et al., 2007), Wikidata (Vrandečić, 2012), Claros, NCBO BioPortal, and AgroPortal. However, these offer a limited set of ontologies for evaluation, which does not cover the full spectrum of possible scenarios.
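For reference, a traditional reasoning-time measurement over one of these ontologies might look like the sketch below, assuming a local copy of a LUBM-style ontology and the owlready2 package (which invokes the bundled HermiT reasoner). The file name and setup are illustrative.

```python
import time
from owlready2 import get_ontology, sync_reasoner

# Hypothetical local copy of a benchmark ontology (e.g., generated with the LUBM tools).
onto = get_ontology("file://./lubm_univ1.owl").load()

start = time.perf_counter()
with onto:
    sync_reasoner()  # classification and consistency checking via HermiT
elapsed = time.perf_counter() - start

print(f"Classification time: {elapsed:.2f} s")
```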
As discussed in Section 1, neurosymbolic approaches encompass a range of evaluation methodologies and reasoning techniques. This diversity becomes evident in Table 1, highlighting the necessity for a dedicated benchmark to systematically and comprehensively assess the performance of neurosymbolic reasoning systems. The table also reveals that these systems target different subsets of description logics and are evaluated on different datasets and metrics.
To further highlight the diversity in the current approaches, we classify the works mentioned in Table 1 into the six distinct categories discussed in Section 1. Chen et al. (2021) convert the symbolic input (ontologies and RDF graphs) to vectors (Symbolic Neuro Symbolic). Makni and Hendler (2019), Ebrahimi et al. (2018), Ebrahimi et al. (2021), Eberhart et al. (2020), Hohenecker and Lukasiewicz (2017), and Makni et al. (2020) take symbolic reasoning rules as input and compile them during training (Neuro: Symbolic → Neuro).
Desiderata for Benchmarking Neurosymbolic Reasoners
Creating an effective benchmark demands careful consideration of critical principles such as simplicity for accessibility, portability for impartial assessments across various approaches, scalability to accommodate diverse system sizes, and relevance to reflect practical challenges in benchmark scenarios (Gray, 1993). However, the evaluation of neurosymbolic reasoners presents its own set of distinctive challenges, and given the field’s novelty, state-of-the-art solutions do not approach these challenges systematically. Therefore, we outline below the issues that should be prioritized when constructing a fair neurosymbolic reasoning benchmark.
To effectively evaluate neurosymbolic reasoners, the benchmark must incorporate diverse scenarios that mirror the complexity and variety encountered in real-world applications. This approach ensures a thorough assessment of the reasoners’ capabilities across different contexts. Key aspects to consider include:
Incorporating controlled inconsistencies into benchmark design presents a significant challenge but is essential for evaluating the robustness of neurosymbolic reasoners. Controlled inconsistencies should be introduced in a deterministic manner to assess how well the systems handle and resolve contradictions. Key aspects to consider include:
Note that existing benchmarks may lack the capability to introduce generic inconsistencies effectively or in a contextually relevant manner, which highlights the need for novel approaches to benchmark design. Traditional generative AI models, such as large language models, may not be well suited for creating controlled inconsistencies that faithfully simulate real-world contradictions. Incorporating controlled inconsistencies into the benchmark will provide a deeper understanding of a reasoner’s robustness and its ability to manage and resolve conflicts, reflecting the complexity of real-world scenarios where inconsistencies are prevalent.
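As a small illustration of what a controlled, deterministic injection could mean, the sketch below, assuming owlready2, introduces a single contradiction by asserting an individual into two explicitly disjoint classes. A benchmark generator would parameterize how many such seeds are planted and where; the class and individual names are illustrative.

```python
from owlready2 import get_ontology, Thing, AllDisjoint

onto = get_ontology("http://example.org/controlled-inconsistency.owl")

with onto:
    class Student(Thing): pass
    class Course(Thing): pass
    AllDisjoint([Student, Course])       # Student and Course cannot overlap

    alice = Student("alice")
    alice.is_a.append(Course)            # controlled contradiction: alice is in both classes

# A sound DL reasoner must now report the ontology as inconsistent; a neurosymbolic
# reasoner can be scored on whether (and how gracefully) it detects or tolerates this seed.
```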
A critical aspect of benchmarking neurosymbolic reasoners is the representation of input data. This involves ensuring that ontological knowledge, both the TBox (terminological knowledge) and the ABox (assertional knowledge), is available in formats that various reasoning systems can effectively process. Such flexibility in input representation ensures comprehensive and realistic evaluation conditions, enabling the assessment of reasoning systems across the spectrum of neurosymbolic methodologies.
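One pragmatic way to expose both views of the data is sketched below using rdflib: the ontology file is parsed once and its triples are split into rough TBox and ABox portions that can then be serialized in whatever format a given reasoner expects. The heuristic split and file names are illustrative assumptions, not a complete OWL-aware partition.

```python
from rdflib import Graph
from rdflib.namespace import RDF, RDFS, OWL

g = Graph()
g.parse("benchmark_ontology.owl", format="xml")  # hypothetical RDF/XML input file

TBOX_PREDICATES = {RDFS.subClassOf, RDFS.domain, RDFS.range,
                   RDFS.subPropertyOf, OWL.equivalentClass, OWL.disjointWith}

tbox, abox = Graph(), Graph()
for s, p, o in g:
    if p in TBOX_PREDICATES or (p == RDF.type and o in (OWL.Class, OWL.ObjectProperty)):
        tbox.add((s, p, o))           # terminological knowledge
    else:
        abox.add((s, p, o))           # assertional knowledge (class/property assertions)

tbox.serialize("tbox.ttl", format="turtle")
abox.serialize("abox.nt", format="nt")
```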
Developing a new generation of reasoners that effectively harnesses the potential of both neural networks and logical reasoning requires, as a foundation, an equitable assessment of state-of-the-art solutions. Such an assessment provides insights into the present capabilities of these approaches and illuminates the trajectory of the field’s future development. It also ensures that the reasoners can generalize beyond specific datasets and apply logical rules consistently across different domains. The key points of this assessment include:
When assessing deductive reasoning capabilities and comparing them with conventional deductive reasoners, it is advantageous to also include neural-based approaches, such as large language models, in the evaluation framework. While neural methods may not always excel in every aspect of deductive reasoning, incorporating them as a baseline can offer valuable comparative insights. This approach not only underscores the benefits of neurosymbolic methods, which integrate both neural and symbolic reasoning, but also provides a more comprehensive understanding of the strengths and potential synergies between different reasoning paradigms.
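A lightweight way to slot such a baseline into the same evaluation loop is sketched below; `ask_llm` is a hypothetical helper standing in for whatever LLM API is available, and the entailment questions are assumed to come from the benchmark's gold standard (e.g., verdicts produced by a conventional deductive reasoner).

```python
def ask_llm(question: str) -> bool:
    """Hypothetical wrapper around an LLM API; returns its yes/no answer."""
    raise NotImplementedError  # to be filled in with the chosen model/API

def evaluate_entailment_baseline(gold: dict) -> float:
    """gold maps natural-language entailment questions to the deductive reasoner's verdict."""
    correct = sum(ask_llm(question) == answer for question, answer in gold.items())
    return correct / len(gold)

# Example gold entry derived from a deductive reasoner's output:
# {"Is every Student a Person in this ontology?": True, ...}
```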
In order to accurately measure the performance of neurosymbolic reasoners, the benchmark must support a range of metrics and key performance indicators that capture various aspects of system performance. Standard metrics commonly used in evaluation include:
While these standard metrics are essential for evaluating traditional aspects of system performance, there remains a need for developing new metrics tailored to the unique characteristics of neurosymbolic reasoning. Current benchmarks might not fully capture critical aspects such as:
The inclusion of these novel metrics, alongside traditional ones, ensures a comprehensive evaluation of neurosymbolic reasoners. This approach provides deeper insights into the performance and limitations of current systems, guiding future improvements and research directions.
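For the standard metrics, one concrete formulation that works for any task producing a set of inferred axioms is to compare that set against the gold-standard entailments computed by a sound and complete reasoner, as in the minimal sketch below; real benchmark runs would additionally record reasoning time, memory usage, and the novel metrics discussed above.

```python
def axiom_level_metrics(inferred: set, gold: set) -> dict:
    """Precision/recall/F1 over inferred axioms vs. gold-standard entailments."""
    true_pos = len(inferred & gold)
    precision = true_pos / len(inferred) if inferred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Example: axioms serialized as strings for comparison.
gold = {"Student ⊑ Person", "GradStudent ⊑ Student", "GradStudent ⊑ Person"}
inferred = {"Student ⊑ Person", "GradStudent ⊑ Person", "Course ⊑ Person"}
print(axiom_level_metrics(inferred, gold))  # precision ≈ 0.67, recall ≈ 0.67
```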
In the rapidly evolving field of neurosymbolic reasoning, the benchmark’s adaptability is crucial for ensuring its relevance and effectiveness. The benchmark should be designed to accommodate the following aspects:
In this section, we propose one possible methodology for designing a benchmark that meets the objectives outlined in Section 3. The methodology involves the following key steps:
This methodology outlines a foundational approach to benchmark design that can be adapted and expanded to include more expressive profiles. It provides a systematic starting point for addressing the challenges in neurosymbolic reasoning, with the flexibility to evolve and incorporate additional complexity and features as the field progresses.
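To make this concrete, the skeleton below sketches how such a benchmark harness might be organized: each reasoner is run on each task, and the results are collected uniformly. The interfaces and names (Reasoner, BenchmarkTask, run_benchmark) are illustrative assumptions about one possible design rather than a prescribed architecture.

```python
from dataclasses import dataclass
from typing import Protocol
import time

class Reasoner(Protocol):
    name: str
    def infer(self, tbox_path: str, abox_path: str) -> set: ...

@dataclass
class BenchmarkTask:
    name: str               # e.g., "class subsumption", "consistency with seeded conflicts"
    tbox_path: str
    abox_path: str
    gold_entailments: set

def run_benchmark(reasoners, tasks):
    """Run every reasoner on every task and collect per-run metrics."""
    results = []
    for task in tasks:
        for reasoner in reasoners:
            start = time.perf_counter()
            inferred = reasoner.infer(task.tbox_path, task.abox_path)
            elapsed = time.perf_counter() - start
            correct = len(inferred & task.gold_entailments)
            results.append({
                "task": task.name,
                "reasoner": reasoner.name,
                "time_s": elapsed,
                "recall": correct / len(task.gold_entailments) if task.gold_entailments else 0.0,
            })
    return results
```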
Conclusion
We highlighted the significant need for a comprehensive benchmark framework to tackle the challenges tied to evaluating neurosymbolic description logic reasoning systems. Merging symbolic logic and neural network-based machine learning holds great promise, but the lack of common evaluation methods has held back progress in the field. By underlining the importance of creating such benchmarks, we aim to establish a structured way of evaluating these systems that can drive the field forward.
Acknowledgment
Gunjan Singh and Raghava Mutharaju would like to acknowledge the partial support of the Infosys Centre for Artificial Intelligence (CAI), IIIT-Delhi, in this work.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
