
Abstract

While remarkable recent developments in deep neural networks have significantly contributed to advancing the state-of-the-art in computer vision (CV), several studies have also shown their limitations and defects. In particular, CV models often make systematic errors on important subsets of data called slices, which are groups of data sharing a set of attributes. A slice discovery method (SDM) is meant to detect semantically meaningful slices on which the model performs poorly, called rare slices. We propose a modular neurosymbolic SDM whose distinctive advantage is the extraction, via inductive logic programming, of human-readable logical rules describing rare slices, thereby enhancing the explainability of CV models. To this end, a methodology for inducing the occurrence of rare slices in a model is presented. We validate the SDM approach on both the synthetic Super-CLEVR and real-world ImageNet datasets. Our experiments demonstrate the complete pipeline: first, we successfully induce targeted rare slices using our taxonomy-based heuristic; second, our neurosymbolic SDM correctly identifies these slices and produces precise, human-readable logical rules to describe them; and finally, these rules are used to guide a data augmentation process that successfully mends model behaviour and improves its predictive performance.¹

Keywords

neurosymbolic AI, slice discovery, inductive logic programming

1. Introduction

Computer vision (CV) (Szeliski, 2022) is a field of artificial intelligence (AI) that enables computer systems to obtain semantic information from digital images and videos. Following the remarkable recent developments of deep neural networks, significant achievements have been made in advancing state-of-the-art performance in various CV tasks (Krizhevsky et al., 2017), among which it is crucial to mention safety-critical applications, such as autonomous driving (M. Zhang et al., 2018).

However, empirical studies, for example Recht et al. (2019), show that CV models struggle to generalise to new data slightly different from those on which they were initially trained and tested. A related problem is the presence of important subsets of data, called slices, for which deep learning models often make systematic errors (Eyuboglu et al., 2022). A slice is defined as a group of data sharing a set of attributes. For instance, one study found that some object recognition models systematically underperform in identifying common household items from non-Western countries and low-income communities (DeVries et al., 2019). This underperformance likely stems from variations in the objects themselves and the different contexts in which they appear.

Accurately detecting underperforming slices, called rare slices, allows one to carefully analyse such prediction errors and subsequently improve the model. Expectedly, identifying rare slices is a complex task, especially for high-dimensional and unstructured data, for example, images, where such slices often manifest as subtle, non-obvious patterns that are difficult to spot and extract. Furthermore, it is non-trivial to understand what makes slices rare. In view of this, the slice discovery problem (Eyuboglu et al., 2022) has been described as mining unstructured input data for semantically meaningful slices on which the model performs poorly.

In this work, we propose to tackle the slice discovery problem with a neurosymbolic AI approach (Hitzler & Sarker, 2021), combining the capabilities of machine learning (ML), and in particular deep learning (DL), for unstructured data classification with those of knowledge representation and reasoning (KRR) methods for transparent logical inference and explainability. In particular, we provide a framework for testing, on different datasets, the effectiveness of our inductive logic programming-based slice discovery method (SDM). This framework allows us to evaluate our SDM according to the semantic quality of the extracted rules in describing rare slices and the effect of such rules in reducing model misclassifications. To this end, we leverage Super-CLEVR (Z. Li et al., 2023), a well-known synthetic dataset with a data generator for images of vehicles organised in hierarchical classes, and ImageNet (Krizhevsky et al., 2017), a large-scale real-world image dataset organised according to the WordNet (Pedersen et al., 2004) hierarchy. The main contributions of our work are summarised as follows:

We present our modular neurosymbolic framework for slice discovery, which consists of a closed loop that involves data generation (or subsampling), object detection (or image classification), scene graph generation that describes the semantic contents of images, rule learning to detect rare slices, and neural network model mending. First, we provide image datasets with rare slices leveraging Super-CLEVR (via generation) and ImageNet (via controlled subsampling), on which we train YOLOv5 (Redmon et al., 2016) models. We then translate the images classified by YOLOv5 into scene graphs in the language of inductive logic programming (ILP) (Cropper & Dumancic, 2022). Depending on the ground truth, these scene graphs constitute the positive and negative ILP examples, that is, those in which the neural network classified the image incorrectly and correctly, respectively. Subsequently, we use three different ILP systems, Popper (Cropper & Morel, 2021), FOLD-R++ (Wang & Gupta, 2022), and FastLAS (Law et al., 2020), to obtain succinct logical rules that reveal which images are hard for the model to classify. Finally, training of the neural network model is resumed from its checkpoint with further data generated using these rules.

In order to test the proposed approach on various slice discovery settings, we focus on generating datasets with rare slices. Closest to our work, Eyuboglu et al. (2022) considered the generation of rare slices in the context of the hierarchical class structure, but did not consider class taxonomies other than the default one, which makes their method ill-suited to the scenarios we consider. In contrast, a taxonomy-based approach is pursued in this work, and a methodology for building datasets with rare slices is presented.

We provide an implementation along with experimental results for both datasets to test the effectiveness of rare slice generation, rule extraction on the classification results of the neural network model, and model mending. The results show that our approach could reliably generate rare slices and that rule learning delivered meaningful rules describing rare slices. Furthermore, feeding the neural network with additional training data generated according to such rules resulted in a significant performance improvement, as misclassifications decreased considerably.

With our framework, we can generate controlled rare slices in datasets to then test the model behaviour on them. Furthermore, it allows the automatic mining via ILP of human-readable logical rules that pinpoint the deficiencies of a classification model and benefit the user’s intuition for model mending. The transparent nature of logical rules makes them highly interpretable and provides a basis for finding model explanations from possible background information.

This article extends our previous work (Collevati et al., 2024) with (i) a more detailed and extended related work section, (ii) the use of further ILP systems, (iii) a more rigorous experimental evaluation using an additional real-world dataset, and (iv) an improved implementation of the proposed SDM framework.

The remainder of the article is organised as follows. In Section 2, we provide a review of related work on SDMs. Section 3 presents an introduction to the Super-CLEVR and ImageNet datasets and the ILP systems. Section 4 describes the proposed neurosymbolic framework for slice discovery. Section 5 presents a taxonomy-based methodology for generating datasets containing rare slices. Section 6 describes the experimental setup and presents an overview of the obtained results. The experimental results are discussed in Section 7. Finally, conclusions and future work are provided in Section 8.

2. Related Work

Several studies (Buolamwini & Gebru, 2018; DeVries et al., 2019; Koenecke et al., 2020; Oakden-Rayner et al., 2020) have shown that neural network models often make systematic errors on data slices. The impact of such errors is especially pronounced for critical application areas, such as medical diagnostics (Olesen et al., 2024) and fraud detection (Kalid et al., 2024), where accurate identification of rare slices positively influences essential decision-making. Consequently, recent research has proposed automated SDMs aimed at identifying semantically meaningful slices in which the model exhibits prediction errors. An optimal SDM should automatically detect data slices containing coherent instances that closely correspond to a concept understandable by humans (Johnson et al., 2023) and on which the model underperforms.

Previous research has addressed the slice discovery problem by focussing on datasets with metadata or structured (e.g., tabular) data. In Chung et al. (2019) the Slice Finder system is proposed, which employs two different automated data slicing methods, viz. decision tree training and lattice searching. In Sagadeeva and Boehm (2021), the authors present SliceLine, an exact yet fast and practical enumeration algorithm to find problematic data slices leveraging monotonicity properties and upper bounds for effective pruning. On the other hand, the Premise algorithm (Hedderich et al., 2022) heuristically discovers those feature-value combinations (i.e., patterns) that provide clear insight into the systematic errors of NLP classifiers.

Dealing with the slice discovery problem becomes particularly challenging for unstructured data, such as images and audio. Recent studies have proposed methods for identifying slices in this context. Several of them embed the data in a representation space and then use clustering or dimensionality reduction techniques. The Domino SDM (Eyuboglu et al., 2022) exploits cross-modal embeddings and an error-aware Gaussian mixture model to discover and describe coherent slices, while the Spotlight method (d’Eon et al., 2022) for finding systematic errors is based on the idea that similar inputs tend to have similar representations in the final hidden layer of a neural network. Spotlight exploits this similarity by focussing on this representation space, aiming to identify contiguous regions where the model underperforms. In Sohoni et al. (2020), the authors describe a two-step method, called George, for identifying underperforming slices without requiring access to slice labels. In the first step, slice labels are estimated by training a model and splitting each class into estimated slices through unsupervised clustering in the model feature space. In the second step, these estimated slices are used to train a new model, optimising for worst-case performance over all estimated slices via a robust optimisation technique (Sagawa et al., 2020).

The recent explosion of generative AI has seen various works considering the use of such models to address the slice discovery problem. The PromptAttack procedure (Metzen et al., 2023) identifies systematic errors by exploiting a text-to-image model to synthesise images, conditioned on a prompt that encodes information about subgroups and classes. In Gao et al. (2023), a human-in-the-loop tool is proposed, called AdaVision, which uses GPT-3 (Brown et al., 2020) to suggest coherent but potentially underperforming slices to explore, and CLIP (Radford et al., 2021) to retrieve relevant images to improve slice identification. In Boreiko et al. (2023), the authors present the SCROD pipeline for slice discovery in object detectors applied to synthetic street scenes. Such a pipeline consists of several generative models to synthesise images with fine-grained control in a fully automated and scalable way. The interactive VLSlice system (Slyman et al., 2023) is designed to test vision-and-language models by discovering their slices from unlabelled image datasets. In Luo et al. (2024), the SSD-LLM framework is proposed for automatic subpopulation structure discovery using a large language model (LLM) (Brown et al., 2020). Such a framework is based on the idea of generating informative image captions via a multimodal LLM (Wu et al., 2023), and then analysing and summarising the subpopulation structure of datasets through an LLM. SSD-LLM can be combined with subsequent operations to tackle various tasks better, including slice discovery.

Distinguishing between positive and negative examples is central to our method. Prior work has leveraged linear temporal logic over finite traces (LTLf) to separate temporal classes (Francescomarino et al., 2024), and used learning from interpretation transitions (LFIT) to explain black-box behaviour (Tello et al., 2023). Other approaches generate interpretable signal temporal logic (STL) formulas for time-series classification (Yan et al., 2022), integrate symbolic reasoning with neural models via abductive inference (Dai et al., 2019), or learn differentiable rule sets from continuous features (W. Zhang et al., 2023). Our method follows this line of research but focuses on ILP for discovering logical rules in the context of slice discovery for high-dimensional visual data.

While several prior works described above have tackled the problem of identifying data slices on which models underperform, they typically focus on black-box or subsymbolic techniques. A key challenge in this area concerns the lack of interpretability in the discovered slices. Our work directly addresses this gap by introducing a neurosymbolic approach that extracts human-readable rules to describe underperforming slices. Furthermore, we demonstrate that the rules can also be effectively applied to mend the CV model, thereby providing their direct practical application.

3. Preliminaries

In this section, we provide an overview of both the Super-CLEVR and ImageNet datasets, as well as a brief overview of ILP and the systems we use.

3.1. Super-CLEVR

Inspired by the seminal work on CLEVR (Johnson et al., 2017), the Super-CLEVR dataset was designed to test the visual reasoning capabilities of AI systems. It comprises images featuring classes of vehicles, such as motorcycles, cars, and aeroplanes. The classes are further divided into vehicle subclasses, which makes the dataset hierarchical, a crucial characteristic for inducing the occurrence of rare slices within our SDM framework. For example, the “motorcycle” class contains “chopper”, “sportbike”, “dirtbike”, and “scooter” subclasses, also referred to as “shapes”. The hierarchical structure of vehicle classes and their corresponding subclasses (shapes) defined in the original Super-CLEVR dataset is shown in Table 1. Vehicles have four attributes, that is, colour, size, material, and texture. Each image is accompanied by a set of questions designed to test various aspects of visual reasoning, including types such as counting, existence, comparison, attribute identification, and spatial relationships, as shown in Figure 1. Super-CLEVR contains about 30k images and 10 question-answer pairs for each of them.

Figure 1.

The figure on the left shows images from Super-CLEVR (Z. Li et al., 2023) of vehicles made up of their parts characterised by four attributes, that is, colour, size, material, and texture. The middle and right figures show examples of Super-CLEVR renderings with generated questions.

Table 1.

Hierarchical Structure of Vehicle Classes and Their Corresponding Subclasses (Shapes) Defined in the Original Super-CLEVR Dataset.

Vehicle Class	Vehicle Subclasses
Aircraft	Private Jet, Fighter Jet, Biplane, Airliner
Bicycle	Road Bike, Mountain Bike, Utility Bike, Tandem Bike
Bus	Transit Bus, Double Bus, Articulated Bus, School Bus
Car	SUV, Pickup Truck, Station Wagon, Minivan, Sedan
Motorcycle	Dirtbike, Sportbike, Chopper, Scooter

The Super-CLEVR dataset generator employs an algorithm that uses Blender (Blender, 2018) to create a diverse set of images and corresponding questions. Each image is generated by randomly placing vehicles in a three-dimensional scene. The attributes of these objects are also randomly assigned within predefined categories. Spatial relationships are managed to ensure objects do not overlap unrealistically. Once an image is composed, the generator creates questions based on different types of reasoning tasks. The questions are formulated by randomly selecting objects and their attributes in the image and constructing queries that require an understanding of the objects and their relations. This procedural generation ensures a wide variety of questions and scenes, overcoming possible human biases when creating datasets.

Example 1 Running example

Throughout the article, we use a running example to illustrate our methodology: a rare slice from the Super-CLEVR dataset involving the “utility bike” subclass. This slice is statistically rare, appearing with a very low occurrence frequency in the training data, and it is visually similar to the “mountain bike” subclass, making it a candidate for model misclassification and an ideal test case for the proposed SDM pipeline.

3.2. ImageNet

To validate our SDM framework on a real-world benchmark, we use the well-known ImageNet (Krizhevsky et al., 2017) dataset. Unlike the synthetic Super-CLEVR, ImageNet is a large-scale image dataset consisting of real-world images organised according to the WordNet (Pedersen et al., 2004) hierarchy. It contains over 14 million annotated images representing more than 20,000 categories, making it one of the most widely used benchmarks in CV. For our experiments, we did not use the entire, vast WordNet hierarchy. Instead, to create a realistic but controlled setting for evaluating the proposed SDM, we defined a custom taxonomy based on a curated subset of vehicles from WordNet. This allowed us to apply our taxonomy-based heuristic to induce the generation of specific rare slices in a focussed and challenging experimental environment to test our methodology.

ImageNet is primarily designed for image classification, where each image is associated with a single class label corresponding to the main object in the scene. It does not provide the detailed, object-level bounding box annotations found in object detection datasets. Furthermore, as a dataset of real-world images, ImageNet lacks a procedural image generator. Therefore, creating rare slices or augmenting the data for model mending are achieved through controlled subsampling of the existing dataset or using external data augmentation techniques.

3.3. Inductive Logic Programming

Inductive logic programming (Muggleton, 1991) is a subfield at the intersection of ML and KRR that aims to find patterns in data by learning logical descriptions, utilising background knowledge ($B$) and sets of positive ($E^{+}$) and negative ($E^{-}$) ground examples. The learning process in ILP aims to find a hypothesis $h$ from a hypothesis space $H$, such that $B \cup h \models E^{+}$ and $B \cup h \not\models E^{-}$, that is, the background $B$ plus the hypothesis $h$ entails each positive example while it does not entail any negative example. It typically involves, starting from the known facts and relations contained in $B$, generating hypotheses consistent with $E^{+}$, testing them against $E^{-}$ to ensure that no negative example is entailed, and refining them until they entail all the positive examples and none of the negative ones. If no hypothesis satisfies this condition, the learning process terminates without a solution. While classical ILP systems often assume a single, global background knowledge $B$ shared across all examples, our framework operates on context-dependent examples. In this setting, each example is accompanied by its own scene-specific background knowledge derived from the corresponding image.
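The generate-test-refine loop described above can be illustrated with a minimal sketch. It assumes a deliberately simplified setting that does not reproduce the semantics of any of the ILP systems used in this work: each example carries its own context (a list of objects given as attribute dictionaries), and a hypothesis is a single conjunction of attribute-value conditions that covers an example if some object in its context satisfies all of them. The names `covers` and `learn` are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Optional

Obj = Dict[str, str]          # one object: attribute name -> value
Context = List[Obj]           # scene-specific background knowledge of one example
Hypothesis = Dict[str, str]   # conjunction of attribute = value conditions


def covers(h: Hypothesis, context: Context) -> bool:
    """True if some object in the context satisfies every condition of h."""
    return any(all(obj.get(k) == v for k, v in h.items()) for obj in context)


def learn(pos: List[Context], neg: List[Context], max_len: int = 2) -> Optional[Hypothesis]:
    """Enumerate hypotheses by increasing length and return the first one that
    covers every positive example and no negative example (None if there is none)."""
    # Candidate conditions: attribute-value pairs observed in the positive examples.
    literals = sorted({(k, v) for ctx in pos for obj in ctx for k, v in obj.items()})
    for n in range(1, max_len + 1):
        for combo in combinations(literals, n):
            h = dict(combo)
            if len(h) < n:  # two values for the same attribute: skip
                continue
            if all(covers(h, c) for c in pos) and not any(covers(h, c) for c in neg):
                return h
    return None


pos = [
    [{"shape": "utility", "colour": "yellow"}],
    [{"shape": "utility", "colour": "red"}, {"shape": "sedan", "colour": "blue"}],
]
neg = [[{"shape": "mountain", "colour": "yellow"}]]

print(learn(pos, neg))  # {'shape': 'utility'}
```

Enumerating shorter hypotheses first mirrors the preference for succinct rules; real ILP systems prune this search far more aggressively than the brute-force loop shown here.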

ILP has applications in various fields, among them robotics (Youssef & Müller, 2023), bioinformatics, for example, protein structure discovery (Turcotte et al., 1998), medicine, for example, drug design (Enot & King, 2003; Finn et al., 1998), and ECG waveform learning (Kókai et al., 1997), to mention a few; see Bratko and Muggleton (1995) and Lavrac and Dzeroski (1994) for more of them.

A number of ILP approaches and tools are available; for a comprehensive survey on ILP, we refer to Cropper et al. (2022). This work tests the following three ILP systems as symbolic reasoning components within the proposed neurosymbolic architecture for slice discovery:

Popper (Cropper & Morel, 2021) is a state-of-the-art first-order ILP system that implements the learning from failures approach by combining Answer Set Programming (ASP) (Lifschitz, 2019) and Prolog (Bratko, 2012). It supports infinite problem domains, reasoning about lists and numbers, learning textually minimal programs, and learning recursive programs. Furthermore, Popper can learn minimal description length logic programs as hypotheses from noisy data.

FOLD-R++ (Wang & Gupta, 2022) is, in terms of efficiency and scalability, an improvement of the FOLD-R first-order inductive learning algorithm (Shakerin et al., 2017), which serves to learn answer set programs from mixed (numerical and categorical) data for classification tasks. The three main improvements of FOLD-R++ are the following: (i) it uses the prefix sum algorithm to speed up computation; (ii) it allows negated literals in the default portion of the learnt rules; (iii) it introduces the hyper-parameter exception ratio, which is the threshold of the ratio of false-positive examples (i.e., exceptions) to true-positive examples that a rule can imply.

FastLAS (Law et al., 2020) is a recent first-order ILP system designed to perform learning tasks in the context of ASP, based on the context-dependent learning-from-answer-sets framework used by the ILASP (Law et al., 2015) system. FastLAS comes with several restrictions, that is, it is not as general as ILASP, but it is significantly more scalable. Furthermore, FastLAS has the advantage of taking as input a customised scoring function for hypotheses that allows the user to express domain-specific optimisation criteria. Such a scoring function defines the cost of a rule. Finally, a key feature of FastLAS is its capability to handle noisy data by introducing a penalty mechanism. This mechanism assigns a penalty to each example, representing the cost of not covering that example. The penalties are defined by a user-specified weight parameter $\lambda$ (where $\lambda > 0$ and $\lambda \in \mathbb{N}$), which allows fine-tuned control over the relative importance of different examples. The FastLAS learning process then computes an optimal solution by optimising a combined objective function that minimises the sum of the given scoring function (such as hypothesis length) and the total penalty cost from uncovered examples. This formulation naturally balances model simplicity against example coverage, as examples with higher penalties exert stronger pressure to be covered by the learned program, while examples with lower penalties may be left uncovered when doing so enables simpler hypotheses.
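The trade-off between hypothesis length and example penalties described for FastLAS can be made concrete with a small sketch. It is not FastLAS itself and does not use its input language: hypotheses are assumed to be simple conjunctions of attribute-value tests, the scoring function is assumed to be the number of conditions, and an example counts as uncovered when a positive example is not matched or a negative example is matched. The names `matches` and `cost` are hypothetical.

```python
from typing import Dict, List, Tuple

Hypothesis = Dict[str, str]                 # conjunction of attribute = value tests
Example = Tuple[Dict[str, str], bool, int]  # (object attributes, is_positive, penalty)


def matches(h: Hypothesis, attrs: Dict[str, str]) -> bool:
    """True if the object satisfies every condition of h."""
    return all(attrs.get(k) == v for k, v in h.items())


def cost(h: Hypothesis, examples: List[Example]) -> int:
    """Hypothesis length plus the penalties of the examples that h gets wrong."""
    penalty = sum(p for attrs, is_pos, p in examples if matches(h, attrs) != is_pos)
    return len(h) + penalty


examples = [
    ({"shape": "utility", "colour": "yellow"}, True, 10),
    ({"shape": "utility", "colour": "red"}, True, 10),
    ({"shape": "mountain", "colour": "yellow"}, False, 10),
    ({"shape": "utility", "colour": "blue"}, False, 1),  # low penalty: may be left uncovered
]
candidates = [
    {"shape": "utility"},
    {"colour": "yellow"},
    {"shape": "utility", "colour": "yellow"},
]

best = min(candidates, key=lambda h: cost(h, examples))
print(best, cost(best, examples))  # {'shape': 'utility'} 2
```

In this toy instance the shortest hypothesis wins even though it leaves the low-penalty negative example uncovered, because covering it would require a longer hypothesis or sacrifice a high-penalty positive example.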

The ILP systems we use, with the exception of FOLD-R++, are guided by a mode bias specification, which is a set of declarations that constrains the structure of the hypotheses the system can learn. These declarations specify, for example, which predicates can appear in the head or body of a rule, their argument types, and whether negation is permitted. This syntactic specification is crucial for pruning the vast hypothesis space and focussing the search on meaningful rule templates.

The ability of these three ILP systems to learn from noisy data is fundamental to the functioning of our SDM. Indeed, a rare slice can be interpreted as a set of exceptions on which a classifier underperforms. In order to find the pattern that characterises such a set of exceptions, we use the same idea as in Shakerin et al. (2017), which is to consider misclassifications as positive examples and correct classifications as negative examples to obtain rules describing rare slices.

Example 2 Continued

Continuing our example, suppose the trained model frequently confuses “utility bike” with “mountain bike” and misclassifies it as “sports bicycle” instead of “urban bicycle”. After feeding these misclassifications (positive examples) and correct classifications (negative examples) into an ILP system, it might produce the following logical rule:

This rule provides a precise, human-readable diagnosis. It has learned that “a scene V0 is hard for the model to classify if it contains an object V1 whose shape is utility bike, colour is yellow, material is rubber, and direction is south”. The sce_id(V0) and obj_id(V1) predicates simply bind the variables to the scene and object identifiers, respectively. This symbolic output is the key to understanding the rare slice and is used to guide the subsequent model mending process.
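To make the reading of such a rule concrete, the sketch below evaluates the conditions described above against a few toy scenes. It is an illustration only: the scenes are invented, the attribute names `shape`, `colour`, `material`, and `direction` are assumed encodings of the conditions in the text, and the function `is_hard` is hypothetical rather than part of the described pipeline.

```python
from typing import Dict, List

# Conditions of the rule as described in the text (attribute names are assumed).
RULE = {"shape": "utility", "colour": "yellow", "material": "rubber", "direction": "south"}


def is_hard(scene: List[Dict[str, str]], rule: Dict[str, str] = RULE) -> bool:
    """A scene is flagged if it contains an object satisfying every rule condition."""
    return any(all(obj.get(k) == v for k, v in rule.items()) for obj in scene)


# Invented toy scenes: each scene is a list of objects with their attributes.
scenes = {
    "s1": [{"shape": "utility", "colour": "yellow", "material": "rubber", "direction": "south"}],
    "s2": [{"shape": "utility", "colour": "yellow", "material": "metal", "direction": "south"}],
    "s3": [
        {"shape": "sedan", "colour": "blue", "material": "metal", "direction": "north"},
        {"shape": "utility", "colour": "yellow", "material": "rubber", "direction": "south"},
    ],
}

flagged = sorted(sid for sid, objs in scenes.items() if is_hard(objs))
print(flagged)  # ['s1', 's3']
```

A scene is flagged as soon as one of its objects matches, which corresponds to the existential reading of the object variable V1 in the rule.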

4. Neurosymbolic Framework for Slice Discovery

In order to construct a neurosymbolic SDM approach, we propose an architecture of a system as shown in Figure 2. The system comprises several modules, shown as boxes, which process inputs in a pipeline. From configuration files or available data sources, datasets containing rare slices are constructed (either through generation or subsampling) on which a neural network model is trained and evaluated. Then, a semantic description of the images is produced, from which rules for detecting rare slices are extracted. Finally, the rules are used to generate further training data to mend the neural network model, thus closing the loop of model learning. In the following, we describe the tasks in the processing pipeline in more detail.

Figure 2.

Overview of the proposed neurosymbolic SDM architecture. The solid arrows show the data flow, while the dashed arrow represents user input in selecting extracted logical rules to generate additional training data to improve classification performance.

According to Kautz’s taxonomy of neurosymbolic systems (Sarker et al., 2021), our SDM approach aligns with the [Neuro $\to$ Symbolic] paradigm, where the outputs of a neural system (here, a vision model) are post-processed by a symbolic module to derive interpretable logical rules.

4.1. Data Preparation

The first step in the pipeline is concerned with data preparation, that is, producing datasets containing rare slices. We provide a methodology for this purpose, which is detailed in Section 5.

CV encompasses a wide range of tasks, among which image classification and object detection are two of the most prominent. Image classification assigns a single label to an entire image, identifying the most prominent object or scene within it. In contrast, object detection involves identifying and localising multiple objects within an image by predicting both their classes and bounding boxes. Our framework handles both tasks, but requires adjusting the processing pipeline accordingly.

At an abstract level, the task involves creating a labelled dataset $D$ , where elements are labelled with their ground-truth annotations. In our general framework, $D$ consists of pairs $(I, L)$ , where $I$ is an image and $L$ is its label. For Super-CLEVR, $L$ includes both the class and bounding box annotations, whereas for ImageNet, $L$ contains only the class label.

For illustration, in the Super-CLEVR setting $L = \{(b_{1}, h_{1}), \dots, (b_{n}, h_{n})\}$, where each $b_{i}$ is a bounding box of some object $o_{i}$ in $I$, and $h_{i}$ is the root class, in the underlying hierarchy $H$, of the subclass of $o_{i}$, for $1 \leq i \leq n$. For example, the vehicle subclass (or “shape”) “dirtbike” may have as its root class “land vehicle”, “motorcycle”, or another class depending on the hierarchy $H$ under consideration. A bounding box $b_{i}$ is a tuple $(x^{-}, y^{-}, x^{+}, y^{+})$, where $(x^{-}, y^{-})$ is the top-left corner point and $(x^{+}, y^{+})$ is the bottom-right corner point. Objects are identified by their bounding boxes, that is, we can view $b_{i}$ as an object ID. In the ImageNet setting, by contrast, each image has a single class label $L$, following the ImageNet standard. The dataset does not include bounding boxes or scene graphs, so any such annotations must be generated using external tools.
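The dependence of the label $h_{i}$ on the chosen hierarchy $H$ can be sketched as follows. The data structures are assumptions made for illustration: `BBox`, `DEFAULT_H`, `COARSE_H`, and `label_image` are hypothetical names, and the two hierarchies are small excerpts, the first following Table 1 and the second an invented coarser taxonomy.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class BBox:
    """Axis-aligned box: (x_min, y_min) is the top-left corner, (x_max, y_max) the bottom-right."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


# Two hypothetical hierarchies H mapping a subclass (shape) to its root class.
DEFAULT_H = {"dirtbike": "motorcycle", "sedan": "car"}
COARSE_H = {"dirtbike": "land vehicle", "sedan": "land vehicle"}


def label_image(objects: List[Tuple[BBox, str]], hierarchy: Dict[str, str]) -> List[Tuple[BBox, str]]:
    """Build L = {(b_i, h_i)} by replacing each subclass with its root class in H."""
    return [(box, hierarchy[subclass]) for box, subclass in objects]


objects = [(BBox(10, 20, 110, 90), "dirtbike"), (BBox(150, 40, 300, 120), "sedan")]

print([h for _, h in label_image(objects, DEFAULT_H)])  # ['motorcycle', 'car']
print([h for _, h in label_image(objects, COARSE_H)])   # ['land vehicle', 'land vehicle']
```

Because the bounding boxes are frozen dataclass instances, they are hashable and can serve directly as object IDs, in line with the convention stated above.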

For datasets with a synthetic generator, such as Super-CLEVR, we directly modify the generator that renders images to control the distribution of objects. By adjusting the occurrence frequency of specific objects according to a given hierarchy $H$, we can create a new dataset that contains controlled rare slices. The dataset is split into a training and a validation set, based on information provided in configuration files, for example, whether or not each split contains rare slices. The generator produces a labelled dataset $D = \{(I_{1}, L_{1}), \dots, (I_{N_{s}}, L_{N_{s}})\}$ where the number of images $N_{s}$ per split $s$ is approximately $n_{s} / \beta$, with $n_{s}$ being the total number of objects per split and $\beta$ the average number of objects per image. For real-world datasets where no generator is available, such as ImageNet, we simulate this process. We achieve the same outcome by performing a controlled subsampling of the original dataset to construct new training and validation splits with the desired distribution of rare slices.
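For the subsampling case, a minimal sketch of how a rare slice might be induced is given below. It is an assumption-laden illustration rather than the procedure used in the experiments: samples are reduced to (image ID, subclass) pairs, a single subclass is made rare, and the function name `induce_rare_slice` and the parameter `keep_fraction` are hypothetical.

```python
import random
from collections import Counter
from typing import List, Tuple

Sample = Tuple[str, str]  # (image id, subclass label)


def induce_rare_slice(samples: List[Sample], rare_subclass: str,
                      keep_fraction: float, seed: int = 0) -> List[Sample]:
    """Keep all samples of other subclasses and only a fraction of the rare one."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rare = [s for s in samples if s[1] == rare_subclass]
    rest = [s for s in samples if s[1] != rare_subclass]
    n_keep = round(keep_fraction * len(rare))
    return rest + rng.sample(rare, n_keep)


# Toy data: 100 images for each of two visually similar subclasses.
samples = [(f"u{i}", "utility") for i in range(100)] + [(f"m{i}", "mountain") for i in range(100)]
train = induce_rare_slice(samples, "utility", keep_fraction=0.05)

counts = Counter(label for _, label in train)
print(counts["mountain"], counts["utility"])  # 100 5
```

The validation split would be built without this reduction, so that the induced imbalance affects only what the model sees during training.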

The Super-CLEVR generator produces further data for the images, such as questions about them (which we disregard, as not needed) and scene descriptions consisting of object attributes (e.g., colour and size). This enriched description serves as the ground truth for the images, which can be utilised for synthetic scene graph generation and fine-grained classification. For ImageNet, which lacks such built-in annotations, we obtain comparable semantic descriptions through automated scene graph generation methods applied to the subsampled images, as later described in Section 6.3.1.

4.2. Object Detection and Image Classification

Once a dataset is prepared, we train and evaluate a neural network model to produce classification results that will be analysed for the discovery of rare slices. This process involves a standard training and validation cycle, followed by a specific step to categorise the results for our SDM pipeline. First, a model (e.g., YOLOv5) is trained on the training split of the dataset $D$, which contains the induced rare slices. The trained model is then run on the validation split, where the object distribution is balanced to fairly evaluate the model’s performance. The model prediction format depends on the CV task associated with each dataset. For image classification (e.g., on ImageNet), the model returns a single class label $\hat{h}$ for a given image $I$. For object detection (e.g., on Super-CLEVR), it returns a set of pairs $(\hat{b}, \hat{h})$, where each pair consists of a predicted bounding box $\hat{b}$ and its corresponding class label $\hat{h}$ from the hierarchy $H$ with its associated confidence score. Finally, to prepare the data for rule extraction, we analyse the model performance on the validation set and, for each class $h \in H$, partition the validation images into two sets:

$E_{h}^{+}$ (positive examples): This set contains images where the model failed. For object detection, this means any image where at least one object of class $h$ was misclassified. For image classification, it is any image of class $h$ that received an incorrect label.

$E_{h}^{-}$ (negative examples): This set contains images where the model succeeded. For object detection, this means all objects of class $h$ in the image were correctly classified. For image classification, it is any image of class $h$ that was correctly labelled.

These sets of positive (failure) and negative (success) examples constitute the input for the subsequent scene graph generation and rule extraction steps.
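The partition described above can be illustrated with a minimal Python sketch. It assumes that each ground-truth object has already been matched to a model prediction (for object detection, for instance, by bounding-box overlap); the `ObjectResult` record and the `partition_examples` function are hypothetical names introduced here for illustration and are not part of the pipeline's published code. For image classification, each image simply contributes a single record.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectResult:
    """One ground-truth object and the class the model assigned to it (hypothetical record)."""
    image_id: int
    true_class: str
    predicted_class: str


def partition_examples(results):
    """Split image ids into failure (E+) and success (E-) sets for each class h."""
    # For every (class, image) pair, track whether all objects of that class were correct.
    all_correct = defaultdict(lambda: True)
    for r in results:
        key = (r.true_class, r.image_id)
        all_correct[key] = all_correct[key] and (r.predicted_class == r.true_class)

    positives = defaultdict(set)  # E_h^+: at least one object of class h was misclassified
    negatives = defaultdict(set)  # E_h^-: every object of class h was correctly classified
    for (h, image_id), ok in all_correct.items():
        (negatives if ok else positives)[h].add(image_id)
    return dict(positives), dict(negatives)


# Toy validation results (made-up values, for illustration only).
results = [
    ObjectResult(19, "urban bicycle", "sports bicycle"),
    ObjectResult(19, "air vehicle", "air vehicle"),
    ObjectResult(32, "urban bicycle", "urban bicycle"),
]
pos, neg = partition_examples(results)
print(pos)
print(neg)
```

Note that an image can be a positive example for one class and a negative example for another, since the partition is computed separately for each class $h$.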

4.3. Scene Graph Generation

Scene Graph Generation (SGG) involves creating a semantic graph representation from an input image. A scene graph is a (labelled) directed graph $G = (V, E)$ comprising object nodes, attribute nodes, and relation nodes. Each object is typically associated with a bounding box, a class label, and attributes such as colour or size. Relations capture connections between object pairs, for example, spatial relations like behind or next to. Scene graphs offer a powerful semantic abstraction of visual data that can be leveraged to identify patterns associated with misclassifications and facilitate the extraction of logical rules. Various SGG methods exist, most of which are based on deep neural networks; however, symbolic or hybrid approaches are also possible.

In synthetic datasets like Super-CLEVR, ground-truth scene graphs can be directly obtained from the dataset generator, as rich annotations are available for all objects, attributes, and relations. These ground-truth graphs provide a perfect basis for constructing logical examples for rule extraction. In contrast, for real-world datasets like ImageNet, which lack such ground-truth annotations, scene graphs must be generated through external tools (e.g., automated SGG methods or additional annotation pipelines (H. Li et al., 2024)). These tools are able to derive the necessary object nodes, attribute nodes, and relation nodes, the latter also deducible from spatial configurations. In both settings, we convert validation examples in $E_{h}^{+}$ and $E_{h}^{-}$ into scene graphs $G_{E_{h}^{+}}$ and $G_{E_{h}^{-}}$ , respectively: for Super-CLEVR, these are constructed using ground-truth annotations; for ImageNet, they are generated using additional tools as described in Section 6. Regardless of the source, each image in the validation set is converted into this structured graph format, ensuring a consistent semantic representation suitable for the subsequent ILP-based rule extraction step.

4.4. Rule Extraction Via Inductive Logic Programming

We define an instance of the rule extraction problem in the ILP language to detect rare slices. To this end, we translate the scene graphs in $G_{E_{h}^{+}}$ and $G_{E_{h}^{-}}$ of class $h$ into their logical representations. These representations are suitable for assembling the sets $\mathit{ILP}_{E_{h}^{+}}$ and $\mathit{ILP}_{E_{h}^{-}}$ of positive and negative examples, respectively, and the background knowledge $B$ describing the semantic information about objects in the images. This step involves converting the objects and their attributes represented in the scene graphs into logical facts.

Context-Dependent Background Knowledge. We clarify that, while classical ILP systems often assume a single, global background knowledge $B$ shared across all examples, in our setting the examples are context-dependent: each example includes its own scene-specific information derived from the corresponding image. This is the case for FastLAS, where the background facts describing object attributes are included directly in each example (see Figure 3). For Popper, the background knowledge is provided separately from the examples, but it is still paired uniquely with each example via identifiers. In the case of FOLD-R++, the background knowledge is explicitly represented in the tabular format defined by CSV columns as features. Formally, for a set of examples $E$ , for each $e_{i} \in E$ there exists a distinct background knowledge $B_{i}$ specifically related to $e_{i}$ . For ease of notation, we will refer to this context-dependent background knowledge generally as $B$ throughout the rest of the article. Thus, while the notion of $B$ is preserved, it is virtually instantiated per-example in different forms across the ILP systems we evaluate.

Figure 3.

The left side of the figure shows excerpts of positive and negative examples with their context-dependent background knowledge, in the language of FastLAS. The right side of the figure shows an excerpt of the mode bias.

Then, $\mathit{ILP}_{E_{h}^{+}}$, $\mathit{ILP}_{E_{h}^{-}}$, and $B$ are fed into a rule extraction system. Notably, the positive examples $\mathit{ILP}_{E_{h}^{+}}$ represent the input images for which the model made an incorrect classification, as we look for an explanation of why the model fails. The rule extraction system, for which we envisage using an ILP system, then outputs a set of rules as a hypothesis for rare slice detection.

As an example of ILP encoding, Figure 3 shows excerpts of positive and negative examples and part of the mode bias used for Super-CLEVR. The representation is in the language of FastLAS, a state-of-the-art ILP system. Its expressive language allows the description of data and the specification of parameters and mode declarations to shape the search space. Specifically, FastLAS allows for a penalty to be set for each example by coding sX@Y, where $X$ is a scene ID and $Y$ is the cost for not covering that example. The positive example, denoted by #pos, is entailed by its background knowledge $B$, which is the third set {contains(19, 0)…} listed, combined with the hypothesis $h$ if there is at least one answer set that includes the ground atom hard(19). The negative example, denoted by #neg, is not entailed if there is no answer set of its background knowledge $B$ combined with $h$ that includes the hard(32) atom. Background knowledge $B$ for each example is derived from the scene graphs of the images. The hard/1 predicate was explicitly introduced as a rule head to represent in which case a scene is difficult for the classifier, that is, it contains rare slices, depending on its composition of objects and attributes. The mode bias specifies that the hard(X) predicate must only appear as a rule head, where $X$ is a scene ID. Conversely, the contains(X,Z) predicate, where $X$ is a scene ID and $Z$ is an object ID, can only appear in the rule body. The same applies to the shape(Z,sha), colour(Z,col), size(Z,siz), direction(Z,dir), and material(Z,mat) predicates, where $Z$ is an object ID and $\mathit{sha}$, $\mathit{col}$, $\mathit{siz}$, $\mathit{dir}$, and $\mathit{mat}$ are constants of the respective domains. Furthermore, four predicates, that is, contains, size, direction, and material, can also appear as negative literals in the rule body. In the FastLAS mode bias, only a subset of predicates is specified as negative to tailor the search space. Finally, FastLAS allows the specification of the maximum number of variables per rule via the #maxv directive, and the scoring function via #bias("penalty…").

Example 3 Continued

Consider a scene where “a yellow, rubber utility bike facing south” is misclassified by the object detector. Its corresponding scene graph is translated into a positive example for the ILP system for rule extraction. In the FastLAS syntax, each #pos or #neg block defines one example. For instance, in the positive example #pos(s19@4, {hard(19)}, {}, {contains(19, 0). …}). in Figure 3, the first argument is the unique example identifier plus its penalty, the second is the set of atoms that the learned rules must prove (in this case, hard(19)), the third specifies atoms the rules must not prove (unused in our examples), and the fourth block contains the context-dependent background knowledge, a set of facts describing this specific scene. In this case, the fact contains(19, 0) links object 0 to scene 19, while the subsequent facts, such as shape(0, utility) and colour(0, yellow), define its attributes. In contrast, a scene containing a correctly classified vehicle, like a “tandem bike” in Figure 3, would be added as a negative example #neg(…)..

The mode bias then defines the structure for the rules the ILP system can learn. For example, #modeh(hard(var(sce_id))). declares that the head of any learned rule must be of the form hard(SceneID), identifying difficult scenes for the model, while #modeb lists all admissible object predicates and their values for the rule body.

Given the complete input in Figure 3, FastLAS produces the following hypothesis $h$ :

Informally, these rules express that a scene is considered difficult for the model to classify if it contains a “utility bike” with specific attributes. For example, the first rule says that whenever there is a scene with a “large rubber utility bike facing south”, the neural network model will likely make a misclassification error on such an object.
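To make the example format described above more concrete, the following Python sketch renders one scene as a FastLAS-style example string with its context-dependent background knowledge. It is an illustration under stated assumptions rather than the encoder used in the experiments: the function name `scene_to_fastlas_example` and the dictionary-based scene description are hypothetical, the default penalties (4 for positive and 2 for negative examples) are the values reported in Section 6.2.1, and the argument layout of #neg is assumed to mirror that of #pos.

```python
def scene_to_fastlas_example(scene_id, objects, misclassified,
                             pos_penalty=4, neg_penalty=2):
    """Render one scene as a FastLAS-style #pos/#neg example string.

    `objects` is a list of dicts mapping an attribute predicate (e.g. "shape")
    to its constant (e.g. "utility"); list positions serve as object IDs.
    """
    facts = []
    for obj_id, attributes in enumerate(objects):
        # Link the object to its scene, then state each of its attributes.
        facts.append(f"contains({scene_id}, {obj_id}).")
        for predicate, value in attributes.items():
            facts.append(f"{predicate}({obj_id}, {value}).")

    # Failures become positive examples; successes become negative examples.
    kind, penalty = ("pos", pos_penalty) if misclassified else ("neg", neg_penalty)
    context = " ".join(facts)
    return (f"#{kind}(s{scene_id}@{penalty}, "
            f"{{hard({scene_id})}}, {{}}, {{{context}}}).")


# The misclassified "yellow, rubber utility bike facing south" from the running example.
scene_19 = [{"shape": "utility", "colour": "yellow",
             "material": "rubber", "direction": "south"}]
print(scene_to_fastlas_example(19, scene_19, misclassified=True))
```

Keeping the scene facts inside each example, rather than in one global file, mirrors the per-example instantiation of $B$ discussed earlier.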

The Popper encoding is very similar in structure to the FastLAS encoding since it consists of a file for specifying the examples, one for the background knowledge, and one for the mode bias. In contrast, the FOLD-R++ encoding is more simplified as it only consists of a CSV file of tabular data, where the first row specifies the feature names for each column, and subsequent rows provide all the examples. The complete encodings for the Popper, FOLD-R++, and FastLAS systems used in the experiments are available in the online repository, with excerpts provided in Appendix B.

4.5. Model Mending

The final step in our SDM pipeline is model mending, where the extracted rules discovered in the previous stage are used to correct model deficiencies. In particular, we note that while ILP systems automatically extract logical rules from validation examples, our current SDM implementation requires user input for hypothesis formation and rule selection. The user analyses the extracted rules to identify common patterns and formulate generalised candidate rules for model mending. This human-in-the-loop approach enables the selection of the most appropriate rules to characterise rare slices. Model mending is then achieved by augmenting the original training data with new images that specifically target the identified rare slices. For a synthetic dataset like Super-CLEVR, we use the data generator to procedurally create new images that precisely match the conditions specified by the logical rules. For a real-world dataset like ImageNet, while generative models offer one possible path for data augmentation, our approach instead relies on curated subsampling. We identify and select images from the complete ImageNet dataset that match the rule conditions to effectively augment the training data without generating synthetic images. In both cases, the model is then retrained on this enriched dataset to improve its robustness and performance on rare slices.

Example 4 Continued

Figure 4 shows the effectiveness of our model mending process: before the intervention, the model misclassifies the “utility bike” rare slice as “sports bicycle”, whereas after retraining it with data guided by the extracted rules, it correctly classifies the “utility bike” as “urban bicycle”.

Figure 4.

The left figure shows a scene (based on the VT: $H$ 4 hierarchy defined in Section 6.2.1) in which vehicles corresponding to the “utility bike” and “articulated bus” rare slices are misclassified by YOLOv5 into the “sports bicycle” and “regular bus” classes, respectively. In contrast, the right figure shows the same scene in which such vehicles are correctly classified, after model mending, into their “urban bicycle” and “specialised bus” classes, respectively.

5. Rare Slice Generation Methodology

Motivated by the limitations of existing rare slice generation methods in our context and the need for a reliable testbed for our SDM, we present a taxonomy-based methodology to induce the occurrence of controlled rare slices in a model.

In our framework, following the characterisation introduced by Domino (Eyuboglu et al., 2022), we define a rare slice as an object subclass that appears infrequently in the dataset and on which the model underperforms. Therefore, a rare slice has two key properties:

Statistical Rarity: The slice appears with a very low frequency in the dataset.

Functional Rarity: The model systematically underperforms on the slice.

While statistical rarity can be directly controlled (e.g., by setting a low occurrence probability via the Super-CLEVR generator, or via controlled subsampling of the ImageNet dataset), functional rarity is a model-dependent property. To reliably induce functional rarity, we propose a heuristic that is based on the intuition that a CV model is more likely to fail when forced to distinguish between visually similar objects that belong to different target classes. Therefore, our heuristic is to intentionally design custom taxonomies that separate these similar object subclasses into distinct target classes (e.g., placing “dirtbike” in the “motorcycle” class and the visually similar “mountain bike” in the “bicycle” class). By then making one of these subclasses statistically rare, we induce the generation of a controlled rare slice in the classification model. This approach can be applied to any dataset, including Super-CLEVR (by defining custom class hierarchies for the generator) and ImageNet (by grouping and re-mapping classes from its native WordNet hierarchy).

To formally identify these underperforming slices, we use a main performance metric for each task as a proxy for difficulty. A target class is flagged as possibly containing a rare slice if its metric falls below a dataset-dependent target class threshold $τ_{c}$ . Specifically, we use per-class recall for the Super-CLEVR object detection task and Top-1 accuracy for the ImageNet image classification task. This threshold allows us to systematically identify underperforming slices and study them in controlled settings.

Our methodology for generating rare slices begins with a given class hierarchy $H = \{h_{1} : \bar{s}_{1}, \dots, h_{m} : \bar{s}_{m}\}$, where each root class $h_{i}$ contains a set of subclasses $\bar{s}_{i}$. We then identify a set of pairs $P = \{(c_{1}, c_{2}), \dots, (c_{n-1}, c_{n})\}$, $n \geq 1$, of visually similar subclasses, each belonging to some $\bar{s}_{i}$, that a CV model is likely to confuse (e.g., “mountain bike” and “dirtbike”). Finally, we construct a dataset $D$ such that, for a given class $h_{i} \in H$, one of its subclasses $c_{i}$ appearing in a pair of $P$ occurs with a very low occurrence probability $α$ in $D$. We implement this methodology through a configurable, step-by-step process. While we illustrate it here for a synthetic dataset like Super-CLEVR, the procedure is analogous for a real-world dataset like ImageNet, where “generation” is simulated via controlled subsampling of the original dataset.

Define Rare Slice Candidates: We first create a slice configuration file that specifies the set $S = \{c_{i} \mid c_{i} \text{ appears in a pair of } P\}$ of visually similar subclasses intended to be rare. For example, in Super-CLEVR, $c_{i}$ can be the “dirtbike” subclass, which is paired with “mountain bike” in $P$ because they are visually similar. In the same file, each subclass $c_{i}$ may be restricted by specifying any combination of attribute values that makes the respective slice more specific, such as by fixing a particular colour and material. For example, a rare slice can be defined as the “dirtbike” subclass with colour “red” and material “metal”. These user-specified attributes are exhaustively combined with all values of the remaining attributes. For example, if the attribute “size” is not specified, then the rare slice “dirtbike-red-metal” will include all possible values of “size”, that is, “dirtbike-red-metal-small” and “dirtbike-red-metal-large”. Non-rare slices consist of all remaining combinations of subclasses and attribute values that are not rare slices. For example, the combination “dirtbike-blue-metal-small” is a non-rare slice because the user has restricted the rare slice to the colour “red”. In summary, all $c_{i} \in S$, together with their user-specified attribute values, are defined as rare slices, while every other combination of subclass and attribute values is a non-rare slice.

Set Occurrence Probability: We assign to each rare slice a low occurrence probability $α$; the lower $α$ is, the less likely an object of that slice is to be created.

Configure Data Splits: A second configuration file defines the total number $n_{s}$ of objects per split $s \in \{\mathit{train}, \mathit{validation}\}$ and whether the split should contain rare slices. We typically generate the training split with rare slices and the validation split with a uniform distribution of all objects for fair evaluation.

Rare and Non-Rare Slice Configuration: A third configuration file is generated containing the complete specification of all rare and non-rare slices as defined in step 1 according to the class hierarchy $H$ .

Generate or Subsample: The configuration files are used to either guide the modified Super-CLEVR image generator or the ImageNet subsampling script to produce the final data splits. In both cases, the goal is to create data splits with the desired object distributions. In Super-CLEVR, the generator computes the target number of rare objects for each subclass $c_{i}$ in a split $s$ as $n_{c_{i}, s} = α \cdot n_{s}$, rounded to an integer. If $n_{c_{i}, s} = 0$, the configuration is invalid, and the process stops with an error message stating that no rare objects can be generated with the given $n_{s}$, prompting the user to increase it. Otherwise, $n_{c_{i}, s}$ ($\geq 1$) rare objects are randomly distributed among the rendered images, with the remaining slots filled by $n_{s} - \sum_{c_{i} \in S} n_{c_{i}, s}$ uniformly distributed non-rare objects.
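Two of the steps above lend themselves to a short illustration: the expansion of a partially specified rare slice into all of its full attribute combinations, and the computation of the target number of rare objects $n_{c_{i}, s} = α \cdot n_{s}$. The Python sketch below is a minimal illustration only. The attribute domains, the function names, and the use of Python's built-in `round` (which rounds halves to the nearest even integer) are assumptions made here; the text states only that the value is rounded to an integer.

```python
from itertools import product

# Hypothetical attribute domains; the actual generator defines its own values.
ATTRIBUTE_DOMAINS = {
    "colour": ["red", "blue"],
    "material": ["metal", "rubber"],
    "size": ["small", "large"],
}


def expand_rare_slice(subclass, fixed):
    """List every full attribute combination covered by a partially specified rare slice."""
    names = list(ATTRIBUTE_DOMAINS)
    # A fixed attribute contributes one value; an unspecified one contributes its whole domain.
    choices = [[fixed[n]] if n in fixed else ATTRIBUTE_DOMAINS[n] for n in names]
    return ["-".join((subclass, *combo)) for combo in product(*choices)]


def rare_object_count(alpha, n_split):
    """Target number of rare objects for one subclass: alpha * n_s, rounded to an integer."""
    count = round(alpha * n_split)
    if count == 0:
        # Mirrors the invalid-configuration case described in the text.
        raise ValueError(
            f"No rare objects can be generated with n_s={n_split}; increase n_s."
        )
    return count


print(expand_rare_slice("dirtbike", {"colour": "red", "material": "metal"}))
print(rare_object_count(alpha=0.01, n_split=1000))
```

With these toy domains, leaving “size” unspecified yields exactly the two slices named in the text, “dirtbike-red-metal-small” and “dirtbike-red-metal-large”.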

Following the above steps, specific rare slices can be generated depending on the taxonomy under consideration. Furthermore, for each generated image, the Super-CLEVR generator produces a corresponding description that consists of the objects in the scene, their attributes, and the relationships between them. These descriptions allow scene graphs to be readily derived and then encoded into ILP examples, as shown in Figure 3. For ImageNet, which provides a single class label per image without detailed object annotations, the necessary scene graphs are generated using external tools, as detailed in Section 6.

6. Experiments

The proposed SDM approach was evaluated in a series of experiments aimed at assessing the effectiveness of rare slice generation, rule extraction, and model mending. To demonstrate the versatility of our approach, we performed the evaluation on two distinct benchmarks: the synthetic Super-CLEVR dataset for an object detection task, and the real-world ImageNet dataset for an image classification task. This section describes the evaluation platform, outlines the experimental setup for both benchmarks, and presents the results from our analysis. All data and details are available in the online repository.

6.1. Evaluation Platform

The evaluation platform is a server running Ubuntu 22.04.2 LTS (kernel version 6.8) with two Intel Xeon Silver 4314 CPUs (each having 16 cores at 2.40GHz, 2 threads per core, and 24MB of cache), 1,024GB of DRAM, four NVIDIA RTX A5000 GPUs (each having 24GB of VRAM), and the CUDA 12.2 API.

6.2. Super-CLEVR Experiments

This section details the experimental setup and presents the results from our evaluation of the proposed SDM architecture for the object detection task on the generated Super-CLEVR dataset. Specifically, we built a challenging and imbalanced training set and used it to train several YOLOv5 models for object detection, each based on a different set of target classes from our custom taxonomies. Afterwards, we iteratively evaluated, diagnosed, and improved these models on the validation set. We outline the data taxonomies, dataset composition, the neural network architecture, and the iterative process of slice discovery and model mending in our pipeline.

6.2.1. Experimental Setup

In the following, we describe the experimental setup for each module of our SDM architecture.

Taxonomies. In our experiments, we used two custom taxonomies based on vehicle subclasses available in Super-CLEVR: airliner, biplane, fighter jet, private jet, sedan, minivan, station wagon, pickup truck, SUV, school bus, articulated bus, double bus, transit bus, scooter, chopper, sportbike, dirtbike, tandem bike, utility bike, mountain bike, and road bike. First, we identified five pairs of vehicle subclasses as visually similar: (“dirtbike”, “mountain bike”), (“articulated bus”, “transit bus”), (“utility bike”, “mountain bike”), (“pickup truck”, “sedan”), and (“private jet”, “airliner”). Then, we defined two Super-CLEVR taxonomies according to the proposed heuristic presented in Section 5, separating the vehicle subclasses of the pairs into different target classes. Specifically, as we descend toward the bottom of a taxonomy, more pairs are separated into distinct classes. In this way, we induced the generation of five rare slices to test the SDM implementation. To investigate rare slice generation, we defined from these taxonomies a total of five sets of target classes, referred to as hierarchies, each serving as training data labels to train a separate YOLOv5 model. The two taxonomies are listed and described below:

Vehicle Type (VT) classifies vehicles according to their type in a multilevel taxonomy, where target classes become more and more specific at each level. The name of a class suggests the vehicles it contains. For example, the “air vehicle” class contains air vehicles such as “airliner” and “biplane”. In contrast, the “scooter” and “mountain bike” vehicles are classified as “land vehicle”, but also fall into the class below, called “two-wheeler”, since they have only two wheels. However, “scooter” belongs to the more specific “motorcycle” class while “mountain bike” is in the “bicycle” class. The same applies to the other classes and vehicle subclasses, as shown in Figure 7 Appendix A, where vehicle subclasses are the leaves of the taxonomy. In the following, the four different hierarchies considered in the experiments to evaluate and compare the influence of rare slices on classification performance are reported. For each hierarchy of the VT taxonomy, the corresponding classes constitute the targets for training a neural network model.

Hierarchy 1 (VT: $H$ 1): “air vehicle” and “land vehicle”.

Hierarchy 2 (VT: $H$ 2): “air vehicle”, “two-wheeler”, and “multi-wheeler”.

Hierarchy 3 (VT: $H$ 3): “air vehicle”, “bicycle”, “motorcycle”, “bus”, and “car”. These are the classes that constitute the original hierarchy used in Super-CLEVR.

Hierarchy 4 (VT: $H$ 4): “air vehicle”, “sports bicycle”, “urban bicycle”, “sports motorcycle”, “urban motorcycle”, “regular bus”, “specialised bus”, “offroad car”, and “urban car”.

Note that we purposely designed VT: $H$ 1 and VT: $H$ 2 not to satisfy the heuristic criterion, that is, no previously defined pair of vehicle subclasses was separated into different target classes, so that they serve as base cases. In contrast, for VT: $H$ 3 and VT: $H$ 4, we defined the classes based on the heuristic criterion. In particular, VT: $H$ 3 only separates the (“dirtbike”, “mountain bike”) pair, and VT: $H$ 4 separates four pairs: (“dirtbike”, “mountain bike”), (“articulated bus”, “transit bus”), (“utility bike”, “mountain bike”), and (“pickup truck”, “sedan”). This design allowed us to assess the effectiveness of the proposed heuristic in generating rare slices in the YOLOv5 models trained for each hierarchy.
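Whether a hierarchy satisfies the heuristic criterion can be checked mechanically: a pair is separated when its two subclasses map to different target classes. The Python sketch below illustrates this check for VT: $H$ 1 and VT: $H$ 3. The subclass-to-class maps are partial and reconstructed here from the class names and the description above, so they should be read as assumptions (the full taxonomy is given in Appendix A); the function `separated_pairs` is a hypothetical helper.

```python
# The five visually similar pairs identified in the text.
SIMILAR_PAIRS = [
    ("dirtbike", "mountain bike"),
    ("articulated bus", "transit bus"),
    ("utility bike", "mountain bike"),
    ("pickup truck", "sedan"),
    ("private jet", "airliner"),
]

# Partial, assumed subclass-to-target-class maps (only subclasses that occur in a pair).
VT_H1 = {
    "dirtbike": "land vehicle", "mountain bike": "land vehicle",
    "utility bike": "land vehicle", "articulated bus": "land vehicle",
    "transit bus": "land vehicle", "pickup truck": "land vehicle",
    "sedan": "land vehicle", "private jet": "air vehicle", "airliner": "air vehicle",
}
VT_H3 = {
    "dirtbike": "motorcycle", "mountain bike": "bicycle",
    "utility bike": "bicycle", "articulated bus": "bus",
    "transit bus": "bus", "pickup truck": "car",
    "sedan": "car", "private jet": "air vehicle", "airliner": "air vehicle",
}


def separated_pairs(hierarchy, pairs):
    """Return the pairs whose two subclasses fall into different target classes."""
    return [(a, b) for a, b in pairs if hierarchy[a] != hierarchy[b]]


print(separated_pairs(VT_H1, SIMILAR_PAIRS))
print(separated_pairs(VT_H3, SIMILAR_PAIRS))
```

Under these assumed maps, VT: $H$ 1 separates no pair and VT: $H$ 3 separates only (“dirtbike”, “mountain bike”), in line with the design described above.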

Primary Purpose (PP) classifies vehicles according to their primary use, as shown in Figure 8 Appendix A. For example, “scooter” is in the “urban vehicle” class, which contains vehicles intended for urban transportation, while “dirtbike” is in the “offroad vehicle” class. For the PP taxonomy, we considered only one hierarchy, referred to as PP: $H$ 1, with five target classes – “urban vehicle”, “offroad vehicle”, “specialised vehicle”, “high-speed vehicle”, and “recreational vehicle” – designed to separate four of the visually similar pairs of vehicle subclasses: (“articulated bus”, “transit bus”), (“utility bike”, “mountain bike”), (“pickup truck”, “sedan”), and (“private jet”, “airliner”).

Dataset. For the two taxonomies, we generated a single training set of 10,000 images using the Super-CLEVR generator. Each image contains between three and six vehicles from the vehicle subclasses listed in Table 1. Vehicle attributes taken into account in image generation include: “materials” (e.g., “metal”), “colours” (e.g., “gray”), “sizes” (e.g., “small”), and “directions” (e.g., “southwest”). To create rare slices, we introduced data imbalance in the training set by manipulating the occurrence probability $α$ of specific vehicle subclasses, without restricting them by specifying any combination of attribute values. Specifically, we selected one vehicle subclass from each pair mentioned above – “dirtbike”, “articulated bus”, “utility bike”, “pickup truck”, and “private jet” – as a potential rare slice by setting its occurrence probability $α$ to $0.05\%$ of the total number $n_{s}$ of vehicles in the training set. Depending on the hierarchy used in neural network training, these vehicle subclasses are potential rare slices; the remaining vehicle subclasses were uniformly distributed. Last, to fairly evaluate model performance, we generated a single validation set of 2,500 images with a balanced distribution, where each of the 21 vehicle subclasses is uniformly represented.

Neural Network. For each of the five hierarchies (VT: $H$ 1 – VT: $H$ 4 and PP: $H$ 1), a YOLOv5 model (version yolov5s²) was trained on the training set for 80, 160, and 320 epochs using an image size of $640 \times 640$ pixels and a batch size of 16. The default YOLOv5 hyperparameters were used, including the SGD optimiser, initial learning rate of 0.01, final learning rate factor of 0.01, momentum of 0.937, and weight decay of $5.0 \times 10^{-4}$. Then, each trained model was evaluated on the validation set, and the results were inspected.

Rule Extraction and Selection. For the rule extraction module, we employ Popper, FOLD-R++, and FastLAS to identify rare slices within underperforming target classes of each hierarchy. The ILP systems extract rules based on the scene graphs generated for each image; an example is shown in Figure 5. These rules consist of a combination of vehicle attributes, described earlier in Section 3.1. The process begins by identifying problematic target classes with recall at or below a predefined target class threshold $τ_{c}$ . The value for $τ_{c}$ is empirically determined by analysing the per-class model performance on the validation set. By placing the threshold within an observed performance gap, a criterion is established to distinguish between underperforming classes that require intervention and those that are well-performing. For each problematic class, the ILP systems generate a set of rules describing rare slices that the model struggles to classify correctly. Then, we analyse these extracted rules to find unifying patterns and simplify them into more general candidate hypotheses derived by selecting the vehicle attributes that occurred more often in the extracted rules. For a candidate rule to be considered formally as a description of a potential rare slice, we introduce a rare slice hypothesis threshold $τ_{h}$ . This threshold is also empirically determined and serves as a criterion to ensure that a candidate rule for a rare slice is supported by a significant percentage of the extracted rules. Its purpose is to filter out spurious or overly specific rules that might only be supported by a small fraction of the ILP hypotheses. The rationale behind this human-in-the-loop strategy is supported by the fact that if most of the extracted rules agree on the choice of a vehicle attribute, it means that such an attribute is more likely to be the most appropriate to characterise positive examples, that is, rare slices. 
To ensure the robustness of our findings, we test each ILP system across various hyperparameter settings with a timeout of 3,600 s:

For Popper, we used the noisy mode, which allows it to learn the minimal description length program from noisy data. Furthermore, we varied the sample size of the validation set, using $25\%$, $50\%$, and $100\%$, to study the scalability as the amount of available data changed.

For FOLD-R++, we tested nine configurations by combining the three sample sizes ($25\%$, $50\%$, $100\%$) with three different exception ratios (0.25, 0.50, 0.75). This hyperparameter represents the threshold of the ratio of false-positive examples (i.e., exceptions) to true-positive examples that a rule can entail.

For FastLAS, we used the opl mode, which runs the original FastLAS1 algorithm. Furthermore, to utilise it in the mode that supports noisy data, we assigned a penalty to each example, representing the cost of violating it. As there are significantly fewer positive examples than negative ones and positive examples are more important to cover because they characterise rare slices, we set the penalty values for positive and negative examples to 4 and 2, respectively. To narrow down the hypothesis space, the maximum number of variables per rule was limited to 2. We also tested nine configurations for FastLAS, combining the three sample sizes with three rule head penalties (10, 20, 30) for the scoring function, which charges such penalty values for each extracted rule head, to observe how they affect the quality of the output result.
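The configuration grids described above can be enumerated with a few lines of Python. This is only a bookkeeping sketch: the dictionary keys are hypothetical labels chosen here and do not correspond to actual command-line options of the ILP systems.

```python
from itertools import product

SAMPLE_SIZES = (0.25, 0.50, 1.00)       # fraction of the validation set used
EXCEPTION_RATIOS = (0.25, 0.50, 0.75)   # FOLD-R++ exception ratios
HEAD_PENALTIES = (10, 20, 30)           # FastLAS rule head penalties
TIMEOUT_SECONDS = 3600                  # timeout applied to every run

# Popper: noisy mode, varying only the sample size.
popper_configs = [{"sample_size": s, "noisy": True} for s in SAMPLE_SIZES]

# FOLD-R++: sample size x exception ratio.
foldrpp_configs = [
    {"sample_size": s, "exception_ratio": r}
    for s, r in product(SAMPLE_SIZES, EXCEPTION_RATIOS)
]

# FastLAS: sample size x rule head penalty, with fixed example penalties and variable limit.
fastlas_configs = [
    {"sample_size": s, "head_penalty": p,
     "pos_penalty": 4, "neg_penalty": 2, "max_variables": 2}
    for s, p in product(SAMPLE_SIZES, HEAD_PENALTIES)
]

print(len(popper_configs), len(foldrpp_configs), len(fastlas_configs))
```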

All hyperparameter values mentioned were empirically fine-tuned by exploratory experimentation. Specifically, we selected several reasonable values to test different configurations of ILP systems in extracting meaningful rules describing rare slices. All other system hyperparameters use default values. A set of rules was obtained for each experimental configuration based on the hierarchy, target class, ILP system, and hyperparameter values considered. This comprehensive evaluation allows us to assess the effectiveness, speed, and verbosity of each ILP system and to verify that the identified rare slices are consistent across different configurations.
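The role of the rare slice hypothesis threshold $τ_{h}$ in the selection step can be illustrated with a small Python sketch. It assumes that each extracted rule body and each candidate rule are represented as sets of (attribute, value) literals, and that an extracted rule agrees with a candidate when it contains all of the candidate's literals; both the representation and the function `select_candidates` are assumptions made here for illustration, since in the described pipeline this analysis is carried out by the user. The rules below are toy data, not experimental results.

```python
def select_candidates(extracted_rules, candidates, tau_h):
    """Keep candidate rules supported by at least a fraction tau_h of the extracted rules."""
    if not extracted_rules:
        return {}
    support = {}
    for name, literals in candidates.items():
        # An extracted rule agrees with a candidate if it contains all candidate literals.
        agreeing = sum(1 for rule in extracted_rules if literals <= rule)
        fraction = agreeing / len(extracted_rules)
        if fraction >= tau_h:
            support[name] = fraction
    return support


# Toy extracted rule bodies (illustrative only).
extracted = [
    frozenset({("shape", "utility"), ("direction", "north")}),
    frozenset({("shape", "utility"), ("direction", "south")}),
    frozenset({("shape", "utility"), ("direction", "south"), ("size", "large")}),
]
candidates = {
    "utility bike facing north": frozenset({("shape", "utility"), ("direction", "north")}),
    "utility bike facing south": frozenset({("shape", "utility"), ("direction", "south")}),
    "small utility bike": frozenset({("shape", "utility"), ("size", "small")}),
}
print(select_candidates(extracted, candidates, tau_h=0.3333))
```

Candidates supported by only a small fraction of the extracted rules are dropped, which is how the threshold filters out spurious or overly specific rules.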

Figure 5.

Rules extracted by Popper for the “urban bicycle” class in VT: $H$ 4, for sample size $100\%$. The rules correctly detect the rare slices “utility bike facing north” and “utility bike facing south”.

Model Mending. We proceed with the model mending step after selecting candidate rules that describe rare slices. We augment the original training set with new images to address data imbalance and further train the model. These images, generated with the Super-CLEVR data generator, specifically contain the vehicle attributes based on the rare slices identified by the selected rules. As for the neural network hyperparameters, the initial learning rate is modified according to the specific needs of each model mending iteration, as we will discuss below. All other neural network hyperparameters remain as in the initial model training (see Section 6.2.1). The validation set remains unchanged for all iterations of our SDM pipeline to ensure a fair and consistent evaluation of performance improvements.

6.2.2. Experimental Results

In the following, we present the experimental results for rare slice generation, rule extraction, and model mending from the iterative application of our SDM architecture.

Rare Slice Generation and Initial Model Training. Our method successfully generates rare slices on which YOLOv5 models underperform, as revealed by the experimental results described below. More precisely, confusion matrices from model validation confirm the effectiveness of our taxonomy-based approach in inducing the presence of rare slices within the neural network models trained on hierarchies satisfying the proposed heuristic. Furthermore, rare slices degraded model performance across all epoch values considered (i.e., 80, 160 and 320), highlighting the persistence of rare slices even as the number of training epochs increases. The YOLOv5 training process saves the model weights that achieve the highest performance on the validation set. Hence, 160-epoch models that performed best were designated as our baseline defective models. In our object detection task, model performance is measured by its recall metric, also known as true positive rate (TPR), on the target classes of the validation set. Recall represents the proportion of all actual positives that were correctly classified as positive. To diagnose the models, we inspected the recall of each target class. To identify underperforming classes, we set the target class threshold $τ_{c}$ to $95.00\%$. Any target class performing at or below this threshold is considered problematic and is inspected via our SDM. As expected, models trained on VT: $H$ 1 and VT: $H$ 2, which did not employ the subclass separation heuristic, showed no evidence of rare slices, achieving $100\%$ recall for all target classes and thus exceeding $τ_{c}$, as shown in Figures 10 and 13 Appendix C. In contrast, applying this heuristic to the hierarchy design consistently resulted in models exhibiting potential rare slices, a finding substantiated by their confusion matrix recall values:

For VT: $H$ 3, the “motorcycle” class recall dropped to $94.00 %$ , as shown in Figure 16 Appendix C, falling below our performance bar $τ_{c}$ and thus being marked as problematic. The other target classes in this hierarchy – “air vehicle”, “bicycle”, “bus”, and “car” – all achieved $100.00 %$ recall.

VT: $H$ 4 was the most challenging case. Four of its target classes fell significantly below the threshold $τ_{c}$ , as shown in Figure 22 Appendix C: “urban bicycle” at $80.00 %$ , “sports motorcycle” at $86.00 %$ , “specialised bus” at $91.00 %$ , and “offroad car” at $92.00 %$ . These were all identified as problematic classes for rule extraction. In contrast, the other target classes performed well: the “regular bus” class achieved $99.00 %$ recall, while the “air vehicle”, “sports bicycle”, “urban motorcycle”, and “urban car” classes all reached $100.00 %$ recall.

For PP: $H$ 1, four target classes performed at or below the threshold $τ_{c}$ , as shown in Figure 31 Appendix C, making them targets for diagnosis: “high-speed vehicle” at $92.00 %$ , “specialised vehicle” at $95.00 %$ , “urban vehicle” at $95.00 %$ , and “offroad vehicle” at $95.00 %$ . In contrast, the “recreational vehicle” class achieved $100.00 %$ recall.
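The diagnosis step above reduces to a simple comparison of per-class recall against $τ_{c}$. The following is a minimal sketch of that check, using the recall values reported in the text; the function names and the true-positive and false-negative counts in the recall example are illustrative assumptions, not part of the original pipeline.

```python
# Target class threshold from the text (in percent).
TAU_C = 95.00

# Per-class recall (in percent) of the baseline models, as reported above.
reported_recall = {
    "VT:H3": {"air vehicle": 100.0, "bicycle": 100.0, "bus": 100.0,
              "car": 100.0, "motorcycle": 94.0},
    "VT:H4": {"urban bicycle": 80.0, "sports motorcycle": 86.0,
              "specialised bus": 91.0, "offroad car": 92.0,
              "regular bus": 99.0, "air vehicle": 100.0,
              "sports bicycle": 100.0, "urban motorcycle": 100.0,
              "urban car": 100.0},
    "PP:H1": {"high-speed vehicle": 92.0, "specialised vehicle": 95.0,
              "urban vehicle": 95.0, "offroad vehicle": 95.0,
              "recreational vehicle": 100.0},
}


def recall(true_positives, false_negatives):
    """Recall (TPR) in percent: TP / (TP + FN)."""
    return 100.0 * true_positives / (true_positives + false_negatives)


def problematic_classes(recall_by_class, tau_c=TAU_C):
    """Classes performing at or below the threshold, worst first."""
    flagged = [name for name, value in recall_by_class.items() if value <= tau_c]
    return sorted(flagged, key=lambda name: recall_by_class[name])


flagged = {h: problematic_classes(r) for h, r in reported_recall.items()}
n_problematic = sum(len(classes) for classes in flagged.values())

# Hypothetical counts: 94 detected and 6 missed instances give 94% recall.
example_recall = recall(94, 6)

for hierarchy, classes in flagged.items():
    print(hierarchy, classes)
print("problematic classes in total:", n_problematic)
```

Because the comparison is "at or below", the three PP: $H$ 1 classes sitting exactly at $95.00 %$ are flagged, which is how the total of nine problematic classes arises.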

First Rule Extraction and Selection Iteration. The poor performance of the nine target classes suggests investigating them in search of rare slices. To this end, we employed our rule extraction module, tasking ILP systems to find the rules that identify rare slices within each problematic class of the models. After extracting the rules, we analysed them to identify underlying patterns. As previously mentioned, these rules consist of a combination of vehicle attributes. Our analysis revealed that the vehicle subclass was the primary feature in the extracted rules, often appearing with specific secondary vehicle attributes that further defined the potential rare slice. Therefore, we simplified all these rules into more general candidate hypotheses, such as “an image is difficult for the object detection model if it contains a dirtbike facing north”. To formalise which of these candidate rules to consider as descriptions of potential rare slices, we set the rare slice hypothesis threshold $τ_{h}$ to $33.33 %$. Consequently, a candidate rule is retained only if the percentage of extracted rules that agree with it is greater than or equal to $τ_{h}$. The results for Popper, FOLD-R++, and FastLAS, summarised in Tables 2–4, revealed patterns across all nine problematic classes. In the tables, an entry is marked with a ✓ to denote that at least one extracted rule agrees with a candidate rule for that target class, while an ✗ indicates that no extracted rule does. Furthermore, $T$ denotes that the ILP system has timed out. Each mark is accompanied by a tuple where the first value is the runtime in seconds, the second is the total number of extracted rules, the third is the number of those rules that agree with the first candidate rule, the fourth is the number of those rules that agree with the second candidate rule, and so on. We now provide a comparison of the results in these tables between the ILP systems in terms of effectiveness, speed, and verbosity.
Overall, FOLD-R++ was the most effective and robust system. It successfully identified candidate rules for all nine problematic classes across the three hierarchies and for most hyperparameter configurations. In contrast, FastLAS demonstrated high potential but was less consistent. While it successfully identified several rare slices, it was prone to timeouts on larger sample sizes (e.g., for the “urban bicycle” and “high-speed vehicle” classes) and was highly sensitive to its rule head penalty hyperparameter. It was also generally the slowest system. Popper was the fastest and least verbose system, typically producing a small set of rules when successful. However, it was also the least effective, failing to identify three of the nine slices (“offroad car” in VT: $H$ 4, and “offroad vehicle” and “specialised vehicle” in PP: $H$ 1) and often requiring larger sample sizes to succeed on others. Despite these individual differences, the combined evidence from all three ILP systems strongly pointed towards the same underlying vehicle subclasses and their respective attributes, giving us high confidence in the subsequent hypothesis. A breakdown of the rules extracted for each underperforming class is given below:

Table 2.
Rule Extraction Results of the First Iteration of the Popper System on the Models for Super-CLEVR.

	Sample Size
	25%	50%	100%
VT: $H$ 3
M	✗ (2.94, 0, 0)	✓ (13.29, 1, 1)	✓ (29.77, 1, 1)
VT: $H$ 4
UB	✓ (11.34, 2, 1, 1)	✓ (29.44, 3, 1, 1)	✓ (71.48, 3, 1, 1)
SM	✓ (7.26, 1, 1)	✓ (26.19, 2, 1)	✓ (49.67, 1, 1)
OC	✗ (1.72, 0, 0)	✗ (7.09, 0, 0)	✗ (19.15, 0, 0)
SB	✗ (2.96, 0, 0)	✓ (8.80, 1, 1)	✓ (29.39, 2, 2)
PP: $H$ 1
UV	✓ (11.22, 2, 1, 1)	✓ (25.98, 2, 1, 1)	✓ (56.64, 3, 1, 1)
OV	✗ (3.58, 0, 0)	✗ (7.97, 0, 0)	✗ (30.55, 0, 0)
SV	✗ (1.95, 0, 0)	✗ (6.13, 0, 0)	✗ (17.74, 0, 0)
HV	✗ (7.22, 0, 0)	✗ (17.63, 0, 0)	✓ (46.47, 3, 3)

Table 3.

Rule Extraction Results of the First Iteration of the FOLD-R++ System on the Models for Super-CLEVR.

Table 4.

Rule Extraction Results of the First Iteration of the FastLAS System on the Models for Super-CLEVR.
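The tuple notation used in these tables can be decoded mechanically. As a minimal sketch, the snippet below re-enters the Popper entries of Table 2 and derives the ✓/✗ mark from the agreement counts; the helper names are assumptions introduced for illustration, and the timeout marker $T$ is not handled because it does not occur in Table 2.

```python
# Table 2 entries for sample sizes 25%, 50%, 100%:
# (runtime in s, total rules, rules agreeing with candidate 1, candidate 2, ...)
popper = {
    "M":  [(2.94, 0, 0), (13.29, 1, 1), (29.77, 1, 1)],
    "UB": [(11.34, 2, 1, 1), (29.44, 3, 1, 1), (71.48, 3, 1, 1)],
    "SM": [(7.26, 1, 1), (26.19, 2, 1), (49.67, 1, 1)],
    "OC": [(1.72, 0, 0), (7.09, 0, 0), (19.15, 0, 0)],
    "SB": [(2.96, 0, 0), (8.80, 1, 1), (29.39, 2, 2)],
    "UV": [(11.22, 2, 1, 1), (25.98, 2, 1, 1), (56.64, 3, 1, 1)],
    "OV": [(3.58, 0, 0), (7.97, 0, 0), (30.55, 0, 0)],
    "SV": [(1.95, 0, 0), (6.13, 0, 0), (17.74, 0, 0)],
    "HV": [(7.22, 0, 0), (17.63, 0, 0), (46.47, 3, 3)],
}


def mark(entry):
    """Return the check mark if any extracted rule agrees with a candidate rule."""
    _runtime, _total, *agreeing = entry
    return "✓" if any(count > 0 for count in agreeing) else "✗"


marks = {cls: [mark(e) for e in entries] for cls, entries in popper.items()}

# Classes for which Popper found no agreeing rule at any sample size.
never_found = sorted(cls for cls, row in marks.items() if "✓" not in row)

# Aggregates over the three sample sizes for one class.
runtime_m = round(sum(e[0] for e in popper["M"]), 2)
rules_m = sum(e[1] for e in popper["M"])

print(marks["M"], never_found, runtime_m, rules_m)
```

The three classes without any ✓ (OC, OV, SV) are the three slices that Popper is reported to have missed, and the aggregated runtime and rule count for M match the Popper row of Table 5.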

VT: $H$ 3: For the “motorcycle” class, we found that $97.30 %$ of rules involved the “dirtbike” subclass, and $89.19 %$ also specified the “north” direction (Table 5). Since both these percentages exceed $τ_{h}$, this led to the selection of the more specific candidate rule for model mending:

Motorcycle first (and only)³ candidate rule:

Table 5.

Rule Extraction Results of the “motorcycle” Class of the First Iteration on the VT: $H$ 3 Model for Super-CLEVR.

	Total Runtime (s)	Total No. Rules	Total No. Rules per Vehicle Subclass
Popper	46.00	2	Dirtbike: 2 (100.00%) [North: 2 (100.00%)]
FOLD-R++	1,108.93	12	Dirtbike: 12 (100.00%) [North: 9 (75.00%)]
FastLAS	8,586.13	23	Dirtbike: 22 (95.65%) [North: 22 (95.65%)],
			Sedan: 1 (4.35%)
Total	9,741.06	37	Dirtbike: 36 (97.30%) [North: 33 (89.19%)],
			Sedan: 1 (2.70%)

hard(V0) :- contains(V0,V1), dirtbike(V1), north(V1).

where V1 denotes a vehicle in an image V0.
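The selection of this rule can be reproduced from the totals in Table 5. The sketch below is an illustration under stated assumptions: the representation of a hypothesis as a tuple of predicates and the tie-breaking by "most predicates" are choices made here to mirror the phrase "the more specific candidate rule", not a description of the authors' implementation.

```python
# Rare slice hypothesis threshold from the text (in percent).
TAU_H = 33.33

# Totals from Table 5 ("motorcycle" class, VT:H3 model).
TOTAL_RULES = 37
agreeing_rules = {
    ("dirtbike",): 36,
    ("dirtbike", "north"): 33,
    ("sedan",): 1,
}


def agreement_pct(n_agree, n_total):
    """Percentage of extracted rules that agree with a hypothesis."""
    return round(100.0 * n_agree / n_total, 2)


shares = {h: agreement_pct(n, TOTAL_RULES) for h, n in agreeing_rules.items()}

# Keep hypotheses supported by at least TAU_H percent of the extracted rules.
retained = [h for h, pct in shares.items() if pct >= TAU_H]

# Among the retained hypotheses, prefer the most specific one.
selected = max(retained, key=len)

body = ", ".join(f"{predicate}(V1)" for predicate in selected)
rule = f"hard(V0) :- contains(V0,V1), {body}."
print(shares)
print(rule)
```

Both the "dirtbike" hypothesis and its "north" refinement clear the threshold, while the single "sedan" rule does not, so the refinement is the one written out as a rule.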

VT: $H$ 4: For the four problematic classes, the percentage of rules identifying the rare slice exceeded the threshold $τ_{h}$ in all cases.

For “specialised bus”, $95.92 %$ of rules identified “articulated bus”, with $59.18 %$ specifying “north” direction (Table 6).

For “offroad car”, $84.38 %$ of rules identified “pickup truck” and “rubber” material (Table 18 Appendix C).

For “sports motorcycle”, $86.36 %$ of rules identified “dirtbike”, with $66.67 %$ also specifying “north” direction (Table 19 Appendix C).

For “urban bicycle”, $98.68 %$ of rules identified “utility bike”, with directions “north” ($35.53 %$) and “south” ($34.21 %$) being the most common secondary attributes (Table 20 Appendix C).

Table 6.

Rule Extraction Results of the “specialised bus” Class of the First Iteration on the VT: $H$ 4 Model for Super-CLEVR.

	Total Runtime (s)	Total No. Rules	Total No. Rules per Vehicle Subclass
Popper	41.15	3	Articulated Bus: 3 (100.00%) [North: 3 (100.00%)]
FOLD-R++	4,325.68	33	Articulated Bus: 33 (100.00%) [North: 19 (57.58%)],
			Tandem Bike: 1 (3.03%), Utility Bike: 1 (3.03%)
FastLAS	7,427.78	13	Articulated Bus: 11 (84.62%) [North: 7 (53.85%)],
			Pickup Truck: 1 (7.69%), Biplane: 1 (7.69%),
			Dirtbike: 1 (7.69%), Road Bike: 1 (7.69%),
			Fighter Jet: 1 (7.69%)
Total	11,794.61	49	Articulated Bus: 47 (95.92%) [North: 29 (59.18%)],
			Pickup Truck: 1 (2.04%), Biplane: 1 (2.04%),
			Dirtbike: 1 (2.04%), Road Bike: 1 (2.04%),
			Fighter Jet: 1 (2.04%), Tandem Bike: 1 (2.04%),
			Utility Bike: 1 (2.04%)

This led to the selection of the following candidate rules for model mending:

Specialised bus first candidate rule:

hard(V0) :- contains(V0,V1), articulated_bus(V1), north(V1).

Offroad car first candidate rule:

hard(V0) :- contains(V0,V1), pickup_truck(V1), rubber(V1).

Sports motorcycle first candidate rule:

hard(V0) :- contains(V0,V1), dirtbike(V1), north(V1).

Urban bicycle first candidate rule:

hard(V0) :- contains(V0,V1), utility_bike(V1), north(V1).

Urban bicycle second candidate rule:

hard(V0) :- contains(V0,V1), utility_bike(V1), south(V1).

where V1 denotes a vehicle in an image V0.
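To make the reading of these rules concrete, the following sketch checks whether an image satisfies `hard(V0)` under the VT: $H$ 4 candidate rules. It is a plain Python re-implementation of the rule semantics rather than a call into any ILP system; the set-based scene representation and the two example vehicles are hypothetical.

```python
# Each rule body is the set of unary predicates one vehicle V1 must satisfy.
# A class with two candidate rules has two alternative bodies.
RULES = {
    "specialised bus": [{"articulated_bus", "north"}],
    "offroad car": [{"pickup_truck", "rubber"}],
    "sports motorcycle": [{"dirtbike", "north"}],
    "urban bicycle": [{"utility_bike", "north"}, {"utility_bike", "south"}],
}


def is_hard(image_vehicles, rule_bodies):
    """hard(V0) holds if some vehicle contained in the image satisfies
    every predicate of at least one rule body."""
    return any(body <= vehicle
               for vehicle in image_vehicles
               for body in rule_bodies)


# Hypothetical scene: each vehicle is the set of predicates true of it.
scene = [
    {"utility_bike", "south", "small", "metal"},
    {"sedan", "east", "large", "rubber"},
]

hard_for = sorted(cls for cls, bodies in RULES.items() if is_hard(scene, bodies))
print(hard_for)
```

The rubber sedan does not trigger the “offroad car” rule because all predicates in a body must hold for the same vehicle, which is what the shared variable V1 expresses.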

PP: $H$ 1: The ILP systems also found strong evidence for rare slices in this hierarchy, with all candidate hypotheses surpassing the threshold $τ_{h}$.

For “high-speed vehicle”, $97.87 %$ of rules identified “private jet”, with $93.62 %$ specifying “large” and “metal” attributes (Table 7).

For “offroad vehicle”, $89.19 %$ of rules identified “pickup truck”, with $86.49 %$ also specifying “rubber” material (Table 21 Appendix C).

For “specialised vehicle”, $96.67 %$ of rules identified “articulated bus”, with $73.33 %$ specifying “north” direction (Table 22 Appendix C).

For “urban vehicle”, $95.77 %$ of rules identified “utility bike”, frequently with “north” ($43.66 %$) and “south” ($42.25 %$) directions (Table 23 Appendix C).

Table 7.

Rule Extraction Results of the “high-speed vehicle” Class of the First Iteration on the PP: $H$ 1 Model for Super-CLEVR.

	Total Runtime (s)	Total No. Rules	Total No. Rules per Vehicle Subclass
Popper	71.32	3	Private Jet: 3 (100.00%) [Large-Metal: 3 (100.00%)],
			Chopper: 1 (33.33%)
FOLD-R++	3,874.76	33	Private Jet: 33 (100.00%) [Large-Metal: 33 (100.00%)],
			Sportbike: 5 (15.15%), School Bus: 5 (15.15%)
FastLAS	15,610.72	11	Private Jet: 10 (90.91%) [Large-Metal: 8 (72.73%)],
			Sedan: 1 (9.09%)
Total	19,556.80	47	Private Jet: 46 (97.87%) [Large-Metal: 44 (93.62%)],
			Sportbike: 5 (10.64%), School Bus: 5 (10.64%),
			Sedan: 1 (2.13%), Chopper: 1 (2.13%)

This led to the selection of the following candidate rules for model mending:

High-speed vehicle first candidate rule:

hard(V0) :- contains(V0,V1), private_jet(V1), large(V1), metal(V1).

Offroad vehicle first candidate rule:

hard(V0) :- contains(V0,V1), pickup_truck(V1), rubber(V1).

Specialised vehicle first candidate rule:

hard(V0) :- contains(V0,V1), articulated_bus(V1), north(V1).

Urban vehicle first candidate rule:

hard(V0) :- contains(V0,V1), utility_bike(V1), north(V1).

Urban vehicle second candidate rule:

hard(V0) :- contains(V0,V1), utility_bike(V1), south(V1).

where V1 denotes a vehicle in an image V0.

First Model Mending Iteration. To address the data imbalance without introducing catastrophic forgetting⁴, we augmented the original training set with new images produced by the Super-CLEVR generator. The new images follow the rules selected in the previous step, each of which specifies a rare vehicle subclass together with its most common secondary attributes. For each hierarchy, the respective defective model (the best-performing 160-epoch version) was then retrained on its newly balanced dataset for 20, 40, and 80 epochs, using the same hyperparameters as before. By comparing the outcomes of the three retraining runs for each model, we determined the most effective model mending for each specific hierarchy. The results for the most effective number of retraining epochs are detailed below.

For VT: $H$ 3, the original training set was augmented with 500 new images of the “dirtbike” rare slice adhering to its secondary vehicle attribute of facing “north”. The most effective model retraining was the 20-epoch one, which successfully addressed the deficiency in the “motorcycle” class by increasing its recall from $94.00 %$ to $99.00 %$ , as shown in Figure 17 Appendix C. The remaining classes – “air vehicle”, “bicycle”, “bus”, and “car” – all maintained their $100.00 %$ recall.

For VT: $H$ 4, the training set was augmented with 500 new images for each of the four identified rare slices, adhering to their secondary vehicle attributes: “utility bike” facing “north” or “south”, “dirtbike” facing “north”, “articulated bus” facing “north”, and “pickup truck” made of “rubber”. The 20-epoch retraining was the most effective, leading to substantial improvements. The recall for the “urban bicycle” class rose from $80.00 %$ to $94.00 %$ , “sports motorcycle” from $86.00 %$ to $96.00 %$ , “specialised bus” from $91.00 %$ to $96.00 %$ , and “offroad car” from $92.00 %$ to $98.00 %$ , as shown in Figure 23 Appendix C. The latter three classes became well-detected, and the “urban bicycle” class was significantly improved. However, “urban bicycle” is the only class to fall below our target class threshold $τ_{c}$ of $95.00 %$ , marking it as the target for a second SDM iteration. Finally, the target classes that already performed well were not negatively impacted; “regular bus” recall improved from $99.00 %$ to $100.00 %$ , while “air vehicle” maintained its $100.00 %$ recall. The other classes saw only a negligible $1.00 %$ drop in recall, indicating that the mending process did not cause significant catastrophic forgetting.

For PP: $H$ 1, the training set was augmented with 500 new images for each of its four rare slices defined by their primary and secondary vehicle attributes: “utility bike” facing “north” or “south”, “articulated bus” facing “north”, “pickup truck” made of “rubber”, and “private jet” that is both “large” and “metal”. The most effective model retraining was the 40-epoch one, which achieved notable performance gains. The recall for the “urban vehicle” and “offroad vehicle” classes both rose from $95.00 %$ to $97.00 %$. Similarly, the “specialised vehicle” recall increased from $95.00 %$ to $98.00 %$, and “high-speed vehicle” from $92.00 %$ to $96.00 %$, as shown in Figure 32 Appendix C. Finally, the already well-performing “recreational vehicle” class maintained its $100.00 %$ recall.

The mending process was successful across all hierarchies, substantially improving the recall of the target classes; notably, the overall performance of the model for VT: $H$ 4 improved substantially. The three previously underperforming classes, that is, “sports motorcycle”, “specialised bus”, and “offroad car”, then reported recalls meeting the target class threshold $τ_{c}$ of $95.00 %$, indicating that their initial rare slices had been successfully resolved. However, we conducted a second iteration of the SDM pipeline to investigate any remaining deficiencies.

Second Rule Extraction and Selection Iteration for VT: $H$ 4. Despite a significant improvement from $80.00 %$ to $94.00 %$, the “urban bicycle” class was the only one that still failed to satisfy our threshold $τ_{c}$. We again employed our rule extraction module, tasking ILP systems to find rules that identify a potential rare slice within this problematic class. As in the previous iteration, we analysed the rules to identify underlying patterns and used the same hypothesis formation process and rare slice hypothesis threshold $τ_{h}$ of $33.33 %$. The results of this second rule extraction iteration for the “urban bicycle” class are summarised in Tables 8–10. The ILP systems showed a greater divergence in performance in this iteration. FOLD-R++ was once again the most effective system. It successfully found candidate rules across all nine hyperparameter configurations. It was also the most verbose, generating a total of 46 rules, 41 of which agreed with our candidate hypothesis across all configuration settings. In contrast, Popper and FastLAS were far less effective. Popper only succeeded in one configuration (at $100.00 %$ sample size) and was the least verbose, generating only a single rule. FastLAS also only succeeded in one configuration (at $100.00 %$ sample size with a rule head penalty of 10) and failed in all others, generating just 5 rules in total. This suggests that the remaining performance issue was more subtle to identify. A detailed breakdown of the rules extracted for the underperforming “urban bicycle” class is provided in Table 11. The evidence pointed to the “utility bike” subclass as the primary source of the problem. In particular, $88.46 %$ of all extracted rules involved the “utility bike” subclass, and $84.62 %$ also specified the “north” direction. We hypothesised that this problem resulted from the difficulty of the model in distinguishing the “utility bike” vehicle, when facing “north”, from other visually similar subclasses (e.g., “mountain bike”), a confusion that was not completely resolved by the initial data augmentation. Since both these percentages exceed $τ_{h}$, this led to the selection of the more specific candidate rule for the second model mending:

Urban bicycle first candidate rule:

Table 8.

Rule Extraction Results of the Second Iteration of the Popper System on the VT: $H$ 4 Model for Super-CLEVR.

	Sample Size
	25%	50%	100%
VT: $H$ 4
UB	✗ (1.38, 0, 0)	✗ (4.33, 0, 0)	✓ (11.37, 1, 1)

Table 9.

Rule Extraction Results of the Second Iteration of the FOLD-R++ System on the VT: $H$ 4 Model for Super-CLEVR.

Table 10.

Rule Extraction Results of the Second Iteration of the FastLAS System on the VT: $H$ 4 Model for Super-CLEVR.

Table 11.

Rule Extraction Results of the “urban bicycle” Class of the Second Iteration on the VT: $H$ 4 Model for Super-CLEVR.

	Total Runtime (s)	Total No. Rules	Total No. Rules per Vehicle Subclass
Popper	17.08	1	Utility Bike: 1 (100.00%) [North: 1 (100.00%)]
FOLD-R++	1,612.38	46	Utility Bike: 43 (93.48%) [North: 41 (89.13%)],
			Tandem Bike: 7 (15.22%), Sedan: 2 (4.35%), Scooter: 2 (4.35%),
			Sportbike: 1 (2.17%), Private Jet: 1 (2.17%),
			Station Wagon: 1 (2.17%)
FastLAS	2,615.94	5	Utility Bike: 2 (40.00%) [North: 2 (40.00%)],
			Road Bike: 1 (20.00%), Pickup Truck: 1 (20.00%),
			School Bus: 1 (20.00%)
Total	4,245.40	52	Utility Bike: 46 (88.46%) [North: 44 (84.62%)],
			Tandem Bike: 7 (13.46%), Sedan: 2 (3.85%), Scooter: 2 (3.85%),
			Sportbike: 1 (1.92%), Private Jet: 1 (1.92%),
			Station Wagon: 1 (1.92%), Road Bike: 1 (1.92%),
			Pickup Truck: 1 (1.92%), School Bus: 1 (1.92%)

hard(V0) :- contains(V0,V1), utility_bike(V1), north(V1).

where, as before, V1 denotes a vehicle in an image V0.

Second Model Mending Iteration for VT: $H$ 4. For the second mending iteration, we augmented the training data by generating 500 new images with the Super-CLEVR generator for the “utility bike” rare slice, adhering to its secondary vehicle attribute of facing “north”. We then retrained the best-performing model from the first iteration (the one that was mended over 20 epochs) for an additional 20, 40, and 80 epochs. To refine the model without risking degrading its performance for classes that still rely heavily on previous training, we employed fine-tuning with a lower initial learning rate of 0.001.

The model retrained for an additional 20 epochs proved to be the most effective. This second intervention successfully resolved the persistent slice, as shown in Figure 24 Appendix C. The recall for the problematic “urban bicycle” class rose significantly from $94.00 %$ to $98.00 %$ , finally surpassing our target class threshold $τ_{c}$ . The other classes maintained their high performance, with some showing minor fluctuations representing an acceptable trade-off for the significant improvement in the underperforming class: “sports motorcycle” recall remained at $96.00 %$ , while “specialised bus” saw a slight single-point increase to $97.00 %$ . The “offroad car” class saw a negligible decrease from $98.00 %$ to $97.00 %$ , and the “sports bicycle” class also saw a slight decrease from $99.00 %$ to $98.00 %$ . The “urban motorcycle” class improved from $99.00 %$ to $100.00 %$ recall, “air vehicle” and “regular bus” maintained their $100.00 %$ recall, and “urban car” remained stable at $99.00 %$ . This confirms that the iterative mending process was highly successful in correcting a specific deficiency without causing significant degradation elsewhere. Therefore, the process was terminated at this stage.
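The stopping decision and the check for forgetting across the two mending iterations can be traced with a few lines of code. The sketch below replays the VT: $H$ 4 recall values reported in the text; the data layout and helper names are assumptions made for illustration.

```python
# Target class threshold from the text (in percent).
TAU_C = 95

# Recall in percent per class: (baseline, after first mending, after second mending),
# as reported in the text for the VT:H4 model.
recall_history = {
    "urban bicycle": (80, 94, 98),
    "sports motorcycle": (86, 96, 96),
    "specialised bus": (91, 96, 97),
    "offroad car": (92, 98, 97),
    "regular bus": (99, 100, 100),
    "air vehicle": (100, 100, 100),
    "sports bicycle": (100, 99, 98),
    "urban motorcycle": (100, 99, 100),
    "urban car": (100, 99, 99),
}


def flagged(stage):
    """Classes at or below the threshold at the given stage (0, 1 or 2)."""
    return [c for c, r in recall_history.items() if r[stage] <= TAU_C]


def largest_drop(before, after):
    """Most negative recall change between two stages, in percentage points."""
    return min(r[after] - r[before] for r in recall_history.values())


for stage, label in enumerate(("baseline", "first mending", "second mending")):
    print(label, flagged(stage))

print("largest drop, iteration 1:", largest_drop(0, 1))
print("largest drop, iteration 2:", largest_drop(1, 2))
```

After the second iteration no class remains at or below $τ_{c}$, and no class loses more than one percentage point in either iteration, which is the basis for terminating the process.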

6.3. ImageNet Experiments

This section details the experimental setup and presents the results from our evaluation of the proposed SDM architecture for the image classification task on a curated subset of the ImageNet dataset. Specifically, we built a challenging and imbalanced training set and used it to train a YOLOv5 model for image classification. Afterwards, we iteratively evaluated, diagnosed, and improved this model on the respective validation set. As for Super-CLEVR, we outline the data taxonomy, dataset composition, the neural network architecture, and the iterative process of slice discovery and model mending in our pipeline.

6.3.1. Experimental Setup

In the following, we describe the experimental setup for each module of our SDM architecture.

Taxonomy. In our experiment, we used a reduced list of 11 vehicle subclasses from ImageNet: tandem bicycle, motorhome, moped, scooter, mountain bike, jeep, pickup truck, station wagon, convertible, minivan, and moving van. First, we identified the following four pairs of vehicle subclasses as visually similar: (“tandem bicycle”, “mountain bike”), (“moped”, “mountain bike”), (“jeep”, “minivan”), and (“station wagon”, “minivan”). Then, as for Super-CLEVR, we defined a taxonomy according to the proposed heuristic presented in Section 5, separating the vehicle subclasses of the pairs into different target classes. This taxonomy, which we refer to as the Vehicle (VE) taxonomy, classifies vehicles according to their type as illustrated in Figure 9 Appendix A. For example, the “scooter” subclass is in the “motorcycle” class, while the “pickup truck” subclass is in the “offroad vehicle” class. To investigate rare slice generation, we defined from the VE taxonomy a single set of target classes, referred to as Hierarchy 1 (VE: $H$ 1), serving as training data labels to train a YOLOv5 model. VE: $H$ 1 comprises the following classes: “leisure vehicle” (LV), “motorcycle” (M), “offroad vehicle” (OV), “passenger car” (PC), and “van” (V). We specifically structured these classes to create challenging classification scenarios by separating all previous pairs of vehicle subclasses. In this way, we induced the generation of rare slices to test the SDM implementation.

Dataset. We built our training and validation sets using a subset of ImageNet. The training set consists of 3,634 images distributed across the five classes of our VE taxonomy. To create rare slices, we intentionally varied the number of images per vehicle subclass, introducing data imbalance. Specifically, we designated four vehicle subclasses – “tandem bicycle”, “moped”, “jeep”, and “station wagon” – as rare slices. These subclasses were chosen from each of the four visually similar pairs mentioned above. The occurrence probability $α$ for each of these subclasses in the training set was set to $5 %$ of the respective target class, thus making them potential rare slices. For this training set, rare slices were defined without considering specific values for vehicle attributes, such as colour or position. All remaining subclasses were uniformly distributed, with each represented by 500 images. To fairly evaluate model performance, we created a separate validation set of 2,200 images with a balanced distribution, where each of the 11 vehicle subclasses is uniformly represented. For the ILP systems, we generated scene graphs for each image with the help of the GPT-4.1⁵ VLM and then manually curated the results to ensure accuracy. This assisted annotation process allowed us to capture key attributes for both the scene environment and the vehicles. Environment attributes include “number of persons” (e.g., 2), “background” (e.g., “rural outdoor”), “snow” (e.g., “false”), and “time of day” (e.g., “daytime”). Vehicle attributes include “colour” (e.g., “black”), “orientation” (e.g., “side view”), “position” (e.g., “foreground”), “type” (e.g., “tandem bicycle”) and “visibility” (e.g., “fully visible”).
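As an illustration of how such a scene graph could be handed to an ILP system, the sketch below flattens one annotated image into Prolog-style facts. The encoding is an assumption: only `contains/2` and the unary subclass predicates (such as `tandem_bicycle/1`) appear in the rules reported in this paper, whereas the binary attribute predicates, the identifiers, and the dictionary layout are hypothetical, and each ILP system expects its own input format.

```python
def atom(value):
    """Turn a value such as 'rural outdoor' into a lowercase Prolog-style atom."""
    return str(value).strip().lower().replace(" ", "_").replace("-", "_")


def scene_graph_to_facts(image_id, scene):
    """Flatten one scene graph into a list of fact strings (hypothetical encoding)."""
    facts = []
    # Environment attributes become binary facts about the image.
    for key, value in scene["environment"].items():
        facts.append(f"{atom(key)}({image_id},{atom(value)}).")
    for index, vehicle in enumerate(scene["vehicles"], start=1):
        vehicle_id = f"{image_id}_v{index}"
        facts.append(f"contains({image_id},{vehicle_id}).")
        # The vehicle type becomes a unary predicate, as in the extracted rules.
        facts.append(f"{atom(vehicle['type'])}({vehicle_id}).")
        for key, value in vehicle.items():
            if key != "type":
                facts.append(f"{atom(key)}({vehicle_id},{atom(value)}).")
    return facts


# Example built from the attribute values quoted in the text.
example = {
    "environment": {"number of persons": 2, "background": "rural outdoor",
                    "snow": "false", "time of day": "daytime"},
    "vehicles": [{"colour": "black", "orientation": "side view",
                  "position": "foreground", "type": "tandem bicycle",
                  "visibility": "fully visible"}],
}

facts = scene_graph_to_facts("img1", example)
print("\n".join(facts))
```

With this encoding, a rule such as `hard(V0) :- contains(V0,V1), tandem_bicycle(V1).` can be matched directly against the generated facts.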

Neural Network. For the VE: $H$ 1 hierarchy, a YOLOv5 model version yolov5s-cls⁶ was built on the training set running $20$ , $40$ , and $80$ epochs using an image size of $224 \times 224$ pixels and a batch size of $16$ . The default YOLOv5 hyperparameters were used, including the Adam optimiser, initial learning rate of $0.001$ , final learning rate factor of $0.01$ , momentum of $0.9$ , and weight decay of $5.0 \times 10^{- 5}$ . Then, each trained model was evaluated on the validation set, and the results were inspected.

Rule Extraction and Selection. For the rule extraction module, we employ the same ILP systems (Popper, FOLD-R++, and FastLAS) and a similar methodology as described in the Super-CLEVR experiments to identify rare slices within underperforming target classes. The process again begins by identifying problematic target classes, for which the ILP systems then extract rules based on the scene graphs generated for each image. These rules consist of a combination of vehicle and environment attributes, described in Section 6.3.1. The differences in this setup are as follows:

Problematic target classes are those with a Top-1 accuracy at or below a predefined target class threshold $τ_{c}$ .

While the hyperparameter settings for Popper and FOLD-R++ remain the same as in Super-CLEVR, for FastLAS we tested nine configurations combining the three sample sizes ( $25 %$ , $50 %$ , $100 %$ ) with three different rule head penalty values (1, 5, 10). These values were empirically fine-tuned based on exploratory experimentation specifically for the ImageNet dataset. We observed during initial runs that the higher penalty values used in the Super-CLEVR experiments were too restrictive in the ImageNet context, often preventing FastLAS from learning any rules at all.

Then, the extracted rules are analysed to form candidate hypotheses, which are formally selected if they meet a predefined rare slice hypothesis threshold $τ_{h}$. As before, this comprehensive evaluation measures the effectiveness, speed, and verbosity of each ILP system, while also verifying the consistency of the identified rare slices.

Model Mending. We proceed with the model mending step using the same procedure as in the Super-CLEVR experiments. After selecting candidate rules that describe rare slices, we augment the original training set with new images to address the data imbalance and then further train the model to mend its behaviour. The primary difference in this setup is the source of the new data. Instead of using a generator, the new images are sourced from the ImageNet dataset, specifically chosen based on the rare slices identified by the selected rules. As before, the initial learning rate is modified according to the specific needs of each model mending iteration, while all other neural network hyperparameters remain the same as those used in the initial model training and described in Section 6.3.1.

6.3.2. Experimental Results

In the following, we present the experimental results for rare slice generation, rule extraction, and model mending from the iterative application of our SDM architecture.

Rare Slice Generation and Initial Model Training. In our image classification task, model performance is measured by its Top-1 accuracy, which represents the percentage of validation images where the main prediction of the model matches the correct label. The YOLOv5 training process saves the model weights that achieve the highest Top-1 accuracy on the validation set. We trained the model for 20, 40, and 80 epochs; the results are shown in Table 12. The model trained for 40 epochs yielded the best initial result, achieving an overall Top-1 accuracy of $80.32 %$ on the validation set, compared to $79.23 %$ for 20 epochs and $77.73 %$ for 80 epochs, respectively. This model was designated as our baseline defective model. To diagnose the model, we inspected the Top-1 accuracy of each of the five target classes, as shown in Table 12. To identify underperforming classes, we set the target class threshold $τ_{c}$ to $86.00 %$ . Any target class performing at or below this threshold is considered problematic and is inspected via our SDM. Our analysis confirmed that four target classes fell below this performance bar: “leisure vehicle” at $62.25 %$ , “motorcycle” at $83.50 %$ , “offroad vehicle” at $85.00 %$ , and “passenger car” at $80.75 %$ . In contrast, the “van” class exceeded the threshold with an accuracy of $88.00 %$ .

Table 12.
Top-1 Accuracy of the VE: $H$ 1 Model on the ImageNet Validation Set after the Initial Model Training.

First Rule Extraction and Selection Iteration. The poor performance of the four target classes suggests investigating them in search of rare slices. To this end, we employed our rule extraction module, tasking ILP systems to find the rules that identify rare slices in each problematic class of the model. After extracting the rules, we analysed them to identify underlying patterns. As previously mentioned, these rules consist of a combination of vehicle and environment attributes. However, the vehicle subclass was the unifying feature in most rules for each potential rare slice. Therefore, we simplified these observations into more general candidate hypotheses, such as “an image is difficult for the model to classify if it represents a tandem bicycle”. To formalise which of these candidate rules to consider as descriptions of potential rare slices, we set the rare slice hypothesis threshold $τ_{h}$ to $33.33 %$ . Consequently, only candidate rules that agree with a percentage of extracted rules greater than or equal to $τ_{h}$ are retained. The results for Popper, FOLD-R++, and FastLAS, summarised in Tables 13-15, revealed patterns across all four problematic classes. We now compare the results in these tables for the ILP systems in terms of effectiveness, speed, and verbosity.

Table 13.

Rule Extraction Results of the First Iteration of the Popper System on the VE: $H$ 1 Model for ImageNet.

	Sample Size
	25%	50%	100%
VE: $H$ 1
LV	✓ (0.95, 1, 1)	✓ (3.12, 1, 1)	✓ (5.49, 1, 1)
M	✗ (0.75, 0, 0)	✗ (2.24, 0, 0)	✗ (2.53, 0, 0)
OV	✗ (1.25, 0, 0)	✗ (2.73, 0, 0)	✗ (4.62, 0, 0)
PC	✗ (0.65, 0, 0)	✗ (2.40, 0, 0)	✗ (3.96, 0, 0)

Table 14.

Rule Extraction Results of the First Iteration of the FOLD-R++ System on the VE: $H$ 1 Model for ImageNet.

Table 15.

Rule Extraction Results of the First Iteration of the FastLAS System on the VE: $H$ 1 Model for ImageNet.

FastLAS was the most effective and robust system, successfully identifying candidate rules for all four problematic classes across almost all hyperparameter settings. However, it was also the slowest and by far the most verbose, generating the most “noisy” output. FastLAS produced a total of 408 rules across the four classes, many of which were non-contributing rules. In contrast, FOLD-R++ was extremely fast, moderately verbose, but slightly less effective, finding candidate rules for three of the four problematic classes while generating a total of 90 rules. It completely missed the “offroad vehicle” class. Popper performed the worst, identifying only the candidate rule for the “leisure vehicle” class and failing on the other three. It was also the least verbose system, generating a mere 3 rules in total. Despite these individual differences, the combined evidence from all three ILP systems strongly pointed towards the same underlying vehicle subclasses, giving us high confidence in the subsequent hypothesis. A detailed breakdown of the rules extracted for each underperforming class is provided in Table 16 and Appendix D. In particular, for the “leisure vehicle” class, we found that $72.97 %$ of the rules involved the “tandem bicycle” subclass. Similarly, “moped” was present in $79.61 %$ of the rules for the “motorcycle” class, “jeep” in $61.00 %$ for the “offroad vehicle” class, and “station wagon” in $74.34 %$ for the “passenger car” class. Since each of these percentages exceeds $τ_{h}$, this led to the selection of the following candidate rules for model mending:

Table 16.

Rule Extraction Results of the “leisure vehicle” Class of the First Iteration on the VE: $H$ 1 Model for ImageNet.

	Total Runtime (s)	Total No. Rules	Total No. Rules per Vehicle Subclass
Popper	9.56	3	Tandem Bicycle: 3 (100.00%)
FOLD-R++	0.20	31	Tandem Bicycle: 22 (70.97%)
FastLAS	104.44	151	Tandem Bicycle: 110 (72.85%), Motorhome: 13 (8.61%), Mountain Bike: 6 (3.97%), Pickup Truck: 2 (1.32%), Station Wagon: 1 (0.66%)
Total	114.20	185	Tandem Bicycle: 135 (72.97%), Motorhome: 13 (7.03%), Mountain Bike: 6 (3.24%), Pickup Truck: 2 (1.08%), Station Wagon: 1 (0.54%)

Leisure vehicle first candidate rule:

hard(V0) :- contains(V0,V1), tandem_bicycle(V1).

Motorcycle first candidate rule:

hard(V0) :- contains(V0,V1), moped(V1).

Offroad vehicle first candidate rule:

hard(V0) :- contains(V0,V1), jeep(V1).

Passenger car first candidate rule:

hard(V0) :- contains(V0,V1), station_wagon(V1).

where V1 denotes a vehicle in an image V0.
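The selection step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `select_candidates` and the dictionary layout are assumptions made for illustration, while the counts are the “Total” row of Table 16 for the “leisure vehicle” class and the threshold is the stated $τ_{h}$ of $33.33 %$.

```python
# Minimal sketch of the threshold-based candidate selection (illustrative only).
# Counts come from the "Total" row of Table 16; all names are hypothetical.

TAU_H = 0.3333  # rare slice hypothesis threshold (33.33 %)

total_rules = 185  # rules extracted for "leisure vehicle" by all three ILP systems
rules_per_subclass = {
    "tandem_bicycle": 135,
    "motorhome": 13,
    "mountain_bike": 6,
    "pickup_truck": 2,
    "station_wagon": 1,
}


def select_candidates(counts, total, threshold):
    """Return {subclass: share} for subclasses whose share of rules is >= threshold."""
    shares = {name: n / total for name, n in counts.items()}
    return {name: share for name, share in shares.items() if share >= threshold}


selected = select_candidates(rules_per_subclass, total_rules, TAU_H)
for name, share in selected.items():
    # Emit the candidate rule in the same form as the rules listed above.
    print(f"hard(V0) :- contains(V0,V1), {name}(V1).  % {share:.2%} of extracted rules")
```

With these counts only “tandem bicycle” passes the threshold, which matches the candidate rule retained for this class.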

First Model Mending Iteration. To address the data imbalance without introducing catastrophic forgetting, we augmented the original training set with new images taken from ImageNet according to the selected rules. This augmentation aimed to precisely balance the distribution of all vehicle subclasses to a target of 500 images each. The number of new images added for each vehicle subclass was therefore the exact amount needed to reach this target from their initial count in the imbalanced set. Specifically, we added 473 new images for “tandem bicycle”, 473 for “moped”, 447 for “jeep”, and 473 for “station wagon”. The defective model (the 40-epoch version) was then retrained on this newly balanced dataset for 10 and 20 epochs, using the same hyperparameters as before. As shown in Table 17, the model retrained for just 10 epochs achieved the highest Top-1 accuracy of $90.55 %$ , while 20 epochs yielded a slightly lower accuracy of $90.09 %$ . The intervention was highly effective, marking a significant improvement over the baseline. A detailed look at the per-class performance for the best model (retrained for 10 epochs), shown in Table 17 and exemplified in Figure 6, confirms this. The Top-1 accuracies for the four problematic classes rose substantially: “leisure vehicle” at $90.00 %$ , “motorcycle” at $96.00 %$ , “offroad vehicle” at $90.50 %$ , and “passenger car” at $90.25 %$ . In contrast, the “van” class deteriorated from $88.00 %$ to $85.75 %$ , probably due to the improvement in the accuracy of visually similar vehicles with which it is confused. This made “van” the new lowest-performing class and the only one to fall below our target class threshold $τ_{c}$ of $86.00 %$ , marking it as the target for a second SDM iteration. The details of this second iteration of our SDM pipeline are described in Appendix D.
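The balancing arithmetic behind these numbers can be sketched as follows. The initial per-subclass counts are not stated in the text; here they are inferred as the target minus the reported number of added images, so they should be read as an assumption. The helper name `images_needed` is hypothetical.

```python
# Sketch of the balancing arithmetic: add exactly enough images to reach the target.
TARGET_PER_SUBCLASS = 500

# Number of images added per rare subclass, as reported in the text.
added = {"tandem_bicycle": 473, "moped": 473, "jeep": 447, "station_wagon": 473}

# ASSUMPTION: initial counts are inferred as target - added; they are not given explicitly.
initial = {name: TARGET_PER_SUBCLASS - n for name, n in added.items()}


def images_needed(initial_count, target=TARGET_PER_SUBCLASS):
    """Images required to bring a subclass up to the target (never negative)."""
    return max(target - initial_count, 0)


for name, count in initial.items():
    print(f"{name}: {count} initial images, {images_needed(count)} to add")

total_added = sum(added.values())
print(f"total new images: {total_added}")
```

Under this reading, each rare subclass started with a few dozen images, against a target of 500 for every vehicle subclass.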

Table 17.

Top-1 Accuracy of the VE: $H$ 1 Model on the ImageNet Validation set after the First Model Mending Iteration.

Figure 6.

The left figure shows a scene, based on VE: $H$ 1, in which the vehicle corresponding to the “tandem bicycle” rare slice is misclassified by YOLOv5 into the “offroad vehicle” class. In contrast, the right figure shows the same scene in which such a vehicle is correctly classified, after model mending, into its “leisure vehicle” class.

7. Discussion

Our experiments, conducted on both the synthetic Super-CLEVR and real-world ImageNet datasets, demonstrate that the proposed SDM is highly effective at identifying rare slices in CV models. By systematically training, diagnosing, and mending models for both object detection (for Super-CLEVR) and image classification (for ImageNet) tasks, we validated the general efficacy of our neurosymbolic approach. The taxonomy-based heuristic at the core of our approach consistently and successfully induced challenging, hard-to-detect rare slices in the trained models. This allowed for a rigorous evaluation of the SDM pipeline. The subsequent application of ILP systems not only identified underperforming slices, but also extracted interpretable logical rules that pinpointed the specific data attributes causing the model to underperform. These rules then guided a targeted data augmentation and model mending process, which led to significant and consistent performance improvements across all tested hierarchies. In the following, we compare the performance of the integrated ILP systems, discuss the impact of model mending, and briefly compare our work with some existing SDMs. Finally, we acknowledge current limitations.

7.1. Comparison of ILP Systems

Our comparative analysis among the three ILP systems – Popper, FOLD-R++, and FastLAS – reveals the differences in their performance:

Popper was the fastest and least verbose system, but also the least effective. It failed to identify several key rare slices, and its success was highly dependent on having a large sample size of data.

FOLD-R++ was a very reliable and robust system with respect to hyperparameters. It successfully identified the underlying rare slices in nearly all problematic classes across both Super-CLEVR and ImageNet experiments, even with smaller data samples. The exception ratio in FOLD-R++ had a minimal impact on its rule extraction, indicating limited sensitivity to this hyperparameter in our context.

FastLAS was the most effective and expressive system, capable of identifying subtle rare slices, as demonstrated in the second iteration of the ImageNet experiments where other systems failed. However, this expressiveness comes at a cost. FastLAS was consistently the slowest system, and highly sensitive to its rule head penalty hyperparameter. Lower penalty values consistently produced meaningful rules, whereas higher values sometimes prevented the system from discovering any rules.

In addition to requiring less data for slice discovery, smaller samples have the advantage of significantly reducing the running time of ILP systems. In particular, using smaller samples was necessary for FastLAS in the Super-CLEVR experiments. One possible reason why FastLAS is slower may be its penalty mechanism and scoring function, which make the optimisation problem more challenging to solve. On the other hand, one probable explanation for the speed of Popper is the lack of negation in its extracted rules. Indeed, it is important to consider that both FOLD-R++ and FastLAS provide rules with negation, which greatly widens the hypothesis space. However, rules with negation allow for alternative and more concise descriptions of slices by specifying which vehicle attributes should not be present. The ability to express rules with negation may be one reason why FOLD-R++ and FastLAS have succeeded more than Popper in identifying rare slices. Despite their individual differences, the ILP systems showed a crucial consistency; when multiple systems succeeded, they invariably pointed to the same root cause (e.g., a specific vehicle subclass), reinforcing the validity of our findings. This convergence gives us high confidence in the identified slices and the subsequent mending strategies. Interestingly, while larger sample sizes generally improved the likelihood of success for all systems, both FOLD-R++ and FastLAS were often effective with as little as $25.00 %$ of the validation data, highlighting the potential for efficient application in resource-constrained scenarios. Finally, these findings emphasise the importance of achieving a good trade-off between speed, effectiveness, and robustness in ILP-based slice discovery.
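To give a rough feel for how negation widens the hypothesis space, the following sketch counts candidate rule bodies in a simplified propositional setting. It is an illustration under stated assumptions, not a model of how Popper, FOLD-R++, or FastLAS actually search: a body is taken to be a conjunction of at most `max_literals` distinct attributes, each attribute may appear positively or, if negation is allowed, negated, and the values of `n_attributes` and `max_literals` are hypothetical.

```python
from math import comb


def count_bodies(n_attributes, max_literals, allow_negation):
    """Count conjunctions of 1..max_literals distinct attributes.

    Without negation each chosen attribute appears positively (1 option);
    with negation each chosen attribute is positive or negated (2 options).
    """
    signs = 2 if allow_negation else 1
    return sum(comb(n_attributes, k) * signs**k for k in range(1, max_literals + 1))


# Hypothetical setting: 20 attributes, rule bodies with at most 3 literals.
n_attributes, max_literals = 20, 3
without_neg = count_bodies(n_attributes, max_literals, allow_negation=False)
with_neg = count_bodies(n_attributes, max_literals, allow_negation=True)
print(without_neg, with_neg, round(with_neg / without_neg, 2))
```

Even in this small setting, allowing negation multiplies the number of candidate bodies several times over, and the factor grows with the maximum body length.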

7.2. Impact of Model Mending

The model mending phase, guided by the rules extracted via our SDM, proved highly effective across all hierarchies. By augmenting the training set with new images specifically targeting the identified rare slices, we achieved substantial improvements in model performance. For instance, in the challenging VT: $H$ 4 hierarchy, the recall for the “urban bicycle” class jumped from $80.00 %$ to $94.00 %$ after the first mending iteration. The iterative nature of our SDM pipeline proved to be very effective. The second mending iteration for VT: $H$ 4 further improved the “urban bicycle” recall to $98.00 %$ , demonstrating the capacity of our SDM to solve progressively more subtle performance issues. This iterative refinement also highlights its ability to enhance model robustness without causing catastrophic forgetting, as the performance of already well-behaving classes remained high.

YOLOv5 models achieved high overall performance on the Super-CLEVR dataset, with mAP@0.5 values approaching 1.0 in all experiments. However, the goal of this work was not to maximise general object detection metrics, but rather to diagnose and correct highly specific, induced failures known as rare slices. For this purpose, per-class recall serves as a more precise diagnostic tool than a global metric like mAP. While a high mAP score confirms the good overall model performance, it can mask the poor performance on a specific, underrepresented slice of data, as the error is averaged out. By focussing on the recall of the problematic classes, we can directly measure the impact of the slice and, more importantly, verify the success of the mending process in a targeted manner. While the main analysis focuses on recall to clearly illustrate the diagnosis and repair of rare slices, a more comprehensive set of performance metrics is provided for completeness. We have included detailed results in Appendix C, which contains the confusion matrices, F1-Confidence curves, and other model training and validation performance metrics (e.g., mAP@0.5) for all Super-CLEVR hierarchies, both before and after model mending. This supplementary data confirms that the targeted improvements in recall discussed in the main text are accompanied by corresponding positive gains in the F1-score, reinforcing the overall efficacy of the proposed SDM pipeline.
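The masking effect described above can be shown with a small numerical sketch. The counts below are hypothetical and are not taken from the experiments, and a pooled (micro-averaged) recall is used as a simple stand-in for a global metric; it is not mAP.

```python
def recall(true_positives, false_negatives):
    """Recall = TP / (TP + FN)."""
    return true_positives / (true_positives + false_negatives)


# HYPOTHETICAL validation counts per class: (correctly detected, missed).
counts = {
    "class_a": (980, 20),
    "class_b": (970, 30),
    "class_c": (975, 25),
    "class_with_rare_slice": (40, 10),  # small class that suffers from a rare slice
}

per_class = {name: recall(tp, fn) for name, (tp, fn) in counts.items()}

# Pooled recall over all instances: the small class barely moves the average.
total_tp = sum(tp for tp, _ in counts.values())
total = sum(tp + fn for tp, fn in counts.values())
overall = total_tp / total

worst = min(per_class, key=per_class.get)
print(f"overall recall: {overall:.3f}")
print(f"worst class: {worst} ({per_class[worst]:.3f})")
```

The pooled figure stays above 0.97 even though one class sits at 0.80, which is why the per-class view is the more useful diagnostic here.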

Finally, the results across both the Super-CLEVR and ImageNet experiments indicate that model mending allows for significant improvements without extensive retraining. For the Super-CLEVR hierarchies, a relatively small number of additional training epochs – between 20 and 40 – was sufficient to integrate the new data and correct the identified rare slices. This efficiency was even more pronounced in the ImageNet experiments, where a brief retraining of just 10 epochs yielded effective results for both mending iterations.

7.3. Comparison With Existing Methods

As detailed in Section 2, the state-of-the-art can be broadly categorised, and our neurosymbolic architecture offers a distinct alternative that emphasises logical rule extraction. Several methods in slice discovery and rare data mining, such as Domino (Eyuboglu et al., 2022), Spotlight (d’Eon et al., 2022), George (Sohoni et al., 2020), and Talisman (Kothawade et al., 2022), have introduced strategies to identify rare or underperforming data regions by operating largely in embedding spaces or latent distributions. Another relevant method by Jiang et al. (2022) proposed density-based rare example mining using normalizing flows over learned detection features in a 3D object detection setting. Although this approach significantly improves performance on rare intraclass instances, it does not provide semantic explanations of errors or insight into the nature of failure modes. In fact, the common limitation of these approaches is a lack of interpretability. A rare slice is typically identified as a cluster of data points, not a human-understandable concept in a semantic, logical format. More recent methods, such as PromptAttack (Metzen et al., 2023), AdaVision (Gao et al., 2023), and SSD-LLM (Luo et al., 2024), leverage the power of large-scale generative and multimodal models. These approaches can explore and structure datasets to suggest potential areas of underperformance. However, the final rare slice description often remains a textual prompt or a collection of images, rather than a formal, verifiable rule.

Our neurosymbolic SDM contrasts with these methods by prioritising interpretability and targeted causality. The main technical difference is the use of ILP to move from systematic model errors to a set of compact, human-readable, and formal logical rules describing them. This provides several advantages:

Transparency: Logical rules offer a clear semantic explanation of the failure condition of the model (e.g., shape(utility) and direction(north)).

Editability: Rules are not only descriptive, but also prescriptive. They provide a precise, actionable specification that can directly guide the model mending process.

Debugging: Rules serve as a valuable tool for model debugging, allowing ML practitioners to understand the specific visual attributes that confuse the model.
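As an illustration of how a rule can act as an actionable specification, the sketch below evaluates a rule of the form shape(utility) and direction(north) against a toy scene graph collection and returns the matching images. The scene graph layout, the image identifiers, and the helper names are assumptions made for this example and do not reflect the authors' data format.

```python
# HYPOTHETICAL scene graphs: image id -> list of objects, each a dict of attributes.
scene_graphs = {
    "img_001": [{"shape": "utility", "direction": "north", "color": "yellow"}],
    "img_002": [{"shape": "utility", "direction": "south", "color": "red"}],
    "img_003": [
        {"shape": "sedan", "direction": "north", "color": "blue"},
        {"shape": "utility", "direction": "north", "color": "green"},
    ],
}


def matches(obj, conditions):
    """True if a single object satisfies every attribute condition of the rule body."""
    return all(obj.get(attr) == value for attr, value in conditions.items())


def hard(objects, conditions):
    """Mirror of hard(V0) :- contains(V0,V1), <conditions on V1>: some object must match."""
    return any(matches(obj, conditions) for obj in objects)


# Rule body taken from the example in the text: shape(utility) and direction(north).
rule = {"shape": "utility", "direction": "north"}

slice_ids = sorted(img for img, objs in scene_graphs.items() if hard(objs, rule))
print(slice_ids)
```

Note that both conditions must hold for the same object, so an image with a north-facing sedan and a south-facing utility bike would not be selected. The same filter could be applied to a pool of unlabelled candidates to pick images for augmentation.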

In summary, our work addresses the fundamental challenge of making slice discovery interpretable and directly actionable for targeted model correction.

7.4. Limitations

For each CV task, our SDM approach builds on the availability of scene graph representations to extract interpretable logical rules describing “hidden” rare slices, that is, underperforming subsets of data that are not explicitly labelled and are difficult to spot in unstructured data such as images. These representations provide the rich semantic structure necessary for our method. However, scene graphs are generally not readily available for datasets, especially in real-world scenarios. Nevertheless, rapid advances in Vision Language Models (VLMs) are making it increasingly feasible to generate such semantic representations from image data semi-automatically (curated generation). In our ongoing work, we are systematically exploring the integration of VLMs to automate the scene graph generation step, as we have already tried here for real-world images from the ImageNet dataset. Consequently, our SDM is directly applicable whenever appropriate semantic representations can be obtained from image data, making automated semantic extraction a promising direction for our future work.

A second limitation is the current reliance on manual, exploratory tuning for the hyperparameters of the ILP systems. While our experiments show that robust rules can be found by testing a range of configurations, this process can be time-consuming and may require domain expertise.

Furthermore, ILP systems can be computationally intensive and may scale poorly, especially with large validation sets or with a complex hypothesis space defined by numerous attributes and predicates. As observed in our experiments, FastLAS was prone to timeouts when analysing the entire validation set in Super-CLEVR, indicating that ILP system performance can be a bottleneck.

Finally, the current implementation of our SDM focuses on discovering rare slices defined by object attributes (e.g., “a yellow rubber utility bike facing south”). A limitation of this approach is that it does not yet consider slices defined by the relationships between objects (e.g., “a bicycle next to a car”). Extending the proposed SDM to incorporate object relations would allow for the discovery of more complex rare slices, providing deeper insights into model failures.

8. Conclusion and Future Work

In this work, we have presented a neurosymbolic approach to address the slice discovery problem. In particular, we have provided a modular architecture and an implementation that connects dataset generation (or subsampling), model classification, and rule extraction via ILP to identify misclassified rare slices. Our experiments, conducted on both the synthetic Super-CLEVR and real-world ImageNet datasets, demonstrate the effectiveness of our methodology. The proposed taxonomy-based heuristic reliably generated datasets with predictable rare slices, validating our approach for inducing specific model failures. The ILP systems proved effective at producing useful logical rules describing rare slices. Further training the models with new data guided by these rules resulted in significant performance improvements on the problematic classes for both object detection and image classification tasks. Our SDM approach can also be extended to the multi-label classification task, thus dealing with taxonomies structured as directed acyclic graphs in addition to tree-shaped ones. Furthermore, our results underscore the effectiveness of the iterative nature of the SDM pipeline. We showed that further iteration can successfully diagnose and resolve more subtle and persistent errors, demonstrating the ability of the SDM to progressively refine model performance and address increasingly difficult deficiencies.

The results obtained are encouraging and demonstrate the potential of simultaneously exploiting DL and KRR methods for slice discovery. In this way, compact and human-readable logical rules can be extracted that improve the interpretability and explainability of a CV model under examination, also paving the way to advanced concepts such as causal and contrastive explanations.

Ongoing and Future Work. Although our experiments confirm that rule-based augmentation helps improve classification performance, we acknowledge that the overall effectiveness of model mending may vary depending on the specific rules extracted. A systematic analysis of the sensitivity of model mending to rule quality, granularity, and specificity remains a promising issue for future work. In addition, while our current SDM implementation relies on user input to analyse and select candidate rules from the ILP system outputs, developing fully automated methods for hypothesis formation and rule selection represents an important direction for future research. Furthermore, to make the SDM pipeline more accessible and efficient, automated methods for setting the optimal hyperparameters for the various ILP systems could be explored, reducing the need for manual, exploratory tuning. Our ImageNet experiments relied on a VLM to generate the necessary scene graphs, as mentioned in Section 6.3.1. A key future direction is to systematically explore and integrate state-of-the-art VLMs to fully automate this step, rather than being provided with ground truth scene graphs, as is the case with Super-CLEVR. Creating a robust pipeline for generating high-quality semantic representations directly from input images will make our SDM framework scalable and applicable to any visual dataset. Using such tools for our SDM presents an interesting research challenge. Thus far, our work has focussed on object attributes. We plan to extend the framework to incorporate relationships between objects. This will allow for the discovery of more specific rare slices (e.g., “a bicycle next to a car”), providing deeper insights into model failures at the expense of higher computational cost. 
Moreover, exploring the integration of further ILP systems, such as recent neurosymbolic ones, for example, $∂$ILP (Evans & Grefenstette, 2018) and $α$ILP (Shindo et al., 2023), as well as other rule learning approaches, such as those provided by Statistical Relational Learning, e.g. LERND (Merkys, 2020), is on our agenda.

Footnotes

Funding

The project leading to this research has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 101034440. Additionally, this research was funded in whole or in part by the Austrian Science Fund (FWF) 10.55776/COE12, and it was supported by Bosch Center for AI (BCAI) in Renningen, Germany.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.

ORCID iDs

Michele Collevati

Thomas Eiter

Nelson Higuera

References

Blender (2018). Blender – A 3D modelling and rendering package. Blender Foundation, Stichting Blender Foundation, Amsterdam. http://www.blender.org

Boreiko, Hein, Metzen, J. H. (2023). Identifying systematic errors in object detectors with the SCROD pipeline. In Proc. IEEE/CVF International Conference on Computer Vision (pp. 4090–4099).

Bratko (2012). Prolog programming for artificial intelligence (4th ed.). Addison-Wesley.

Bratko, Muggleton, S. H. (1995). Applications of inductive logic programming. Communications of the ACM, 38(11), 65–70. 10.1145/219717.219771

Brown, T. B., Mann, Ryder, Subbiah, Kaplan, Dhariwal, Neelakantan, Shyam, Sastry, Askell, Agarwal, Herbert-Voss, Krueger, Henighan, Child, Ramesh, Ziegler, D. M., Winter, Amodei (2020). Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, & H. Lin (Eds.), Advances in neural information processing systems 33: Annual conference on neural information processing systems 2020, NeurIPS 2020, December 6–12, 2020, virtual (pp. 1877–1901). https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html

Buolamwini, Gebru (2018). Gender shades: Intersectional accuracy disparities in commercial gender classification. In S. A. Friedler & C. Wilson (Eds.), Conference on fairness, accountability and transparency, FAT 2018, February 23–24, 2018, New York, NY, USA (pp. 77–91). Proceedings of Machine Learning Research, Vol. 81. PMLR. http://proceedings.mlr.press/v81/buolamwini18a.html

Chung, Kraska, Polyzotis, Tae, K. H., Whang, S. E. (2019). Slice finder: Automated data slicing for model validation. In 35th IEEE international conference on data engineering, ICDE 2019, Macao, China, April 8–11, 2019 (pp. 1550–1553). IEEE. 10.1109/ICDE.2019.00139

Collevati, Eiter, Higuera (2024). Leveraging neurosymbolic AI for slice discovery. In T. R. Besold, A. d’Avila Garcez, E. Jiménez-Ruiz, R. Confalonieri, P. Madhyastha, & B. Wagner (Eds.), Neural-symbolic learning and reasoning – 18th international conference, NeSy 2024, Barcelona, Spain, September 9–12, 2024, Proceedings, Part I (pp. 403–418). LNCS, Vol. 14979. Springer. 10.1007/978-3-031-71167-1_22

Cropper, Dumancic (2022). Inductive logic programming at 30: A new introduction. The Journal of Artificial Intelligence Research, 74, 765–850. 10.1613/JAIR.1.13507

Cropper, Dumancic, Evans, Muggleton, S. H. (2022). Inductive logic programming at 30. Machine Learning, 111(1), 147–172. 10.1007/S10994-021-06089-1

Cropper, Morel (2021). Learning programs by learning from failures. Machine Learning, 110(4), 801–856. 10.1007/S10994-020-05934-Z

Dai, Zhou (2019). Bridging machine learning and logical reasoning by abductive learning. In H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, & R. Garnett (Eds.), Advances in neural information processing systems 32: Annual conference on neural information processing systems 2019, NeurIPS 2019, December 8–14, 2019, Vancouver, BC, Canada (pp. 2811–2822). https://proceedings.neurips.cc/paper/2019/hash/9c19a2aa1d84e04b0bd4bc888792bd1e-Abstract.html

d’Eon, Wright, J. R., Leyton-Brown (2022). The spotlight: A general method for discovering systematic errors in deep learning models. In FAccT ’22: 2022 ACM conference on fairness, accountability, and transparency, Seoul, Republic of Korea, June 21–24, 2022 (pp. 1962–1981). ACM. 10.1145/3531146.3533240

DeVries, Misra, Wang, van der Maaten (2019). Does object recognition work for everyone? In IEEE conference on computer vision and pattern recognition workshops, CVPR Workshops 2019, Long Beach, CA, USA, June 16–20, 2019 (pp. 52–59). Computer Vision Foundation/IEEE. http://openaccess.thecvf.com/content_CVPRW_2019/html/cv4gc/de_Vries_Does_Object_Recognition_Work_for_Everyone_CVPRW_2019_paper.html

Enot, D. P., King, R. D. (2003). Application of inductive logic programming to structure-based drug design. In N. Lavrac, D. Gamberger, H. Blockeel, & L. Todorovski (Eds.), Knowledge discovery in databases: PKDD 2003, 7th European conference on principles and practice of knowledge discovery in databases, Cavtat-Dubrovnik, Croatia, September 22–26, 2003, Proceedings (pp. 156–167). LNCS, Vol. 2838. Springer. 10.1007/978-3-540-39804-2_16

Evans, Grefenstette (2018). Learning explanatory rules from noisy data. The Journal of Artificial Intelligence Research, 61, 1–64. 10.1613/JAIR.5714

Eyuboglu, Varma, Saab, K. K., Delbrouck, Lee-Messer, Dunnmon, Zou, Ré (2022). Domino: Discovering systematic errors with cross-modal embeddings. In The tenth international conference on learning representations, ICLR 2022, Virtual Event, April 25–29, 2022 (pp. 1–28). OpenReview.net. https://openreview.net/forum?id=FPCMqjI0jXN

Finn, P. W., Muggleton, S. H., Page, Srinivasan (1998). Pharmacophore discovery using the inductive logic programming system PROGOL. Machine Learning, 30(2-3), 241–270. 10.1023/A:1007460424845

Francescomarino, C. D., Donadello, Ghidini, Maggi, F. M., Rizzi, Tessaris (2024). Making sense of temporal event data: A framework for comparing techniques for the discovery of discriminative temporal patterns. In G. Guizzardi, F. M. Santoro, H. Mouratidis, & P. Soffer (Eds.), Advanced information systems engineering – 36th international conference, CAiSE 2024, Limassol, Cyprus, June 3–7, 2024, Proceedings (pp. 423–439). LNCS, Vol. 14663. Springer. 10.1007/978-3-031-61057-8_25

French, R. M. (1999). Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4), 128–135. 10.1016/S1364-6613(99)01294-2

Gao, Ilharco, Lundberg, S. M., Ribeiro, M. T. (2023). Adaptive testing of computer vision models. In IEEE/CVF international conference on computer vision, ICCV 2023, Paris, France, October 1–6, 2023 (pp. 3980–3991). IEEE. 10.1109/ICCV51070.2023.00370

Hedderich, M. A., Fischer, Klakow, Vreeken (2022). Label-descriptive patterns and their application to characterizing classification errors. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvári, G. Niu, & S. Sabato (Eds.), International conference on machine learning, ICML 2022, July 17–23, 2022, Baltimore, Maryland, USA (pp. 8691–8707). Proceedings of Machine Learning Research, Vol. 162. PMLR. https://proceedings.mlr.press/v162/hedderich22a.html

Hitzler, Sarker, M. K. (Eds.) (2021). Neuro-symbolic artificial intelligence: The state of the art. Frontiers in artificial intelligence and applications, Vol. 342. IOS Press. 10.3233/FAIA342

Jiang, C. M., Najibi, C. R., Zhou, Anguelov (2022). Improving the intra-class long-tail in 3D detection via rare example mining. In S. Avidan, G. J. Brostow, M. Cissé, G. M. Farinella, & T. Hassner (Eds.), Computer vision – ECCV 2022 – 17th European conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part X (pp. 158–175). LNCS, Vol. 13670. Springer. 10.1007/978-3-031-20080-9_10

Johnson, Hariharan, van der Maaten, L., Fei-Fei, Zitnick, C. L., Girshick, R. B. (2017). CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In 2017 IEEE conference on computer vision and pattern recognition, CVPR 2017, Honolulu, HI, USA, July 21–26, 2017 (pp. 1988–1997). IEEE Computer Society. 10.1109/CVPR.2017.215

Johnson, Cabrera, Á. A., Plumb, Talwalkar (2023). Where does my model underperform? A human evaluation of slice discovery algorithms. In Proceedings of AAAI conference on human computation and crowdsourcing (Vol. 11, pp. 65–76).

Kalid, S. N., Khor, K. H., Tong (2024). Detecting frauds and payment defaults on credit card data inherited with imbalanced class distribution and overlapping class problems: A systematic review. IEEE Access, 12, 23636–23652. 10.1109/ACCESS.2024.3362831

Koenecke, Nam, Lake, Nudell, Quartey, Mengesha, Toups, Rickford, J. R., Jurafsky, Goel (2020). Racial disparities in automated speech recognition. Proceedings of the National Academy of Sciences of the United States of America, 117(14), 7684–7689. 10.1073/pnas.1915768117

Kókai, Alexin, Gyimóthy (1997). Application of inductive logic programming for learning ECG waveforms. In E. T. Keravnou, C. Garbay, R. H. Baud, & J. C. Wyatt (Eds.), Artificial intelligence medicine, 6th conference on artificial intelligence in medicine in Europe, AIME’97, Grenoble, France, March 23–26, 1997, Proceedings (pp. 126–129). LNCS, Vol. 1211. Springer. 10.1007/BFb0029443

Kothawade, Ghosh, Shekhar, Xiang, Iyer, R. K. (2022). Talisman: Targeted active learning for object detection with rare classes and slices using submodular mutual information. In S. Avidan, G. J. Brostow, M. Cissé, G. M. Farinella, & T. Hassner (Eds.), Computer vision – ECCV 2022 – 17th European conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXVIII (pp. 1–16). LNCS, Vol. 13698. Springer. 10.1007/978-3-031-19839-7_1

Krizhevsky, Sutskever, Hinton, G. E. (2017). ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84–90. 10.1145/3065386

Lavrac, Dzeroski (1994). Inductive logic programming – Techniques and applications. Ellis Horwood series in artificial intelligence. Ellis Horwood.

Law, Russo, Bertino, Broda, Lobo (2020). FastLAS: Scalable inductive logic programming incorporating domain-specific optimisation criteria. In The thirty-fourth AAAI conference on artificial intelligence, AAAI 2020, New York, NY, USA, February 7–12, 2020 (pp. 2877–2885). AAAI Press. 10.1609/aaai.v34i03.5678

Law, Russo, Broda (2015). The ILASP system for learning Answer Set Programs. https://www.ilasp.com/

Li, Zhu, Zhang, Jiang, Dang, Hou, Shen, Zhao, Shah, S. A. A., Bennamoun (2024). Scene graph generation: A comprehensive survey. Neurocomputing, 566, 127052. 10.1016/j.neucom.2023.127052

Li, Wang, Stengel-Eskin, Kortylewski, Durme, B. V., Yuille, A. L. (2023). Super-CLEVR: A virtual benchmark to diagnose domain robustness in visual reasoning. In IEEE/CVF conference on computer vision and pattern recognition, CVPR 2023, Vancouver, BC, Canada, June 17–24, 2023 (pp. 14963–14973). IEEE. 10.1109/CVPR52729.2023.01437

Lifschitz (2019). Answer set programming. Springer. 10.1007/978-3-030-24658-7

Luo, Zou, Tang, Liu, Zhang (2024). LLM as dataset analyst: Subpopulation structure discovery with large language model. In A. Leonardis, E. Ricci, S. Roth, O. Russakovsky, T. Sattler, & G. Varol (Eds.), Computer vision – ECCV 2024 – 18th European conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part XXXIII (pp. 235–252). LNCS, Vol. 15091. Springer. 10.1007/978-3-031-73414-4_14

Merkys (2020). crunchiness/lernd: LERND – Implementation of $∂$ILP. https://github.com/crunchiness/lernd

Metzen, J. H., Hutmacher, Hua, N. G., Boreiko, Zhang (2023). Identification of systematic errors of image classifiers on rare subgroups. In IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1–6, 2023 (pp. 5041–5050). IEEE. 10.1109/ICCV51070.2023.00467

Muggleton, S. H. (1991). Inductive logic programming. New Generation Computing, 8(4), 295–318. 10.1007/BF03037089

42.

Oakden-Rayner

Dunnmon

Carneiro

Ré

(2020). Hidden stratification causes clinically meaningful failures in machine learning for medical imaging. In M. Ghassemi (Ed.), ACM CHIL ’20: ACM conference on health, inference, and learning, Toronto, Ontario, Canada, April 2–4, 2020 [delayed] (pp. 151–159). ACM. 10.1145/3368555.3384468

43.

Olesen

Weng

Feragen

Petersen

(2024). Slicing through bias: Explaining oerformance gaps in medical image analysis using slice discovery methods. In E. Puyol-Antón, G. Zamzmi, A. Feragen, A. P. King, V. Cheplygina, M. Ganz-Benjaminsen, E. Ferrante, B. Glocker, E. Petersen, J. S. H. Baxter, I. Rekik, & R. Eagleson (Eds.), Ethics and fairness in medical imaging -- Second international workshop on fairness of AI in medical imaging, FAIMI 2024, and third international Wworkshop on ethical and philosophical issues in medical imaging, EPIMI 2024, Held in Conjunction with MICCAI 2024, Marrakesh, Morocco, October 6–10, 2024, Proceedings (pp. 3–13). LNCS, Vol. 15198. Springer. 10.1007/978-3-031-72787-0_1

44. OpenAI (2025). Introducing GPT-4.1 in the API. https://openai.com/index/gpt-4-1/

45. Pedersen, Patwardhan, & Michelizzi (2004). WordNet::Similarity – Measuring the relatedness of concepts. In D. L. McGuinness & G. Ferguson (Eds.), Proc. Nineteenth National Conference on Artificial Intelligence, Sixteenth Conference on Innovative Applications of Artificial Intelligence, July 25–29, 2004, San Jose, CA, USA (pp. 1024–1025). AAAI Press/The MIT Press. http://www.aaai.org/Library/AAAI/2004/aaai04-160.php

46. Radford, Kim, J. W., Hallacy, Ramesh, Goh, Agarwal, Sastry, Askell, Mishkin, Clark, Krueger, & Sutskever (2021). Learning transferable visual models from natural language supervision. In M. Meila & T. Zhang (Eds.), Proc. 38th International Conference on Machine Learning, ICML 2021, 18–24 July 2021, Virtual Event (pp. 8748–8763). Proceedings of Machine Learning Research, Vol. 139. PMLR. http://proceedings.mlr.press/v139/radford21a.html

47. Recht, Roelofs, Schmidt, & Shankar (2019). Do ImageNet classifiers generalize to ImageNet? In K. Chaudhuri & R. Salakhutdinov (Eds.), Proc. 36th international conference on machine learning, ICML 2019, 9–15 June 2019, Long Beach, California, USA (pp. 5389–5400). Proceedings of Machine Learning Research, Vol. 97. PMLR. http://proceedings.mlr.press/v97/recht19a.html

48. Redmon, Divvala, S. K., Girshick, R. B., & Farhadi (2016). You only look once: Unified, real-time object detection. In 2016 IEEE Conference on computer vision and pattern recognition, CVPR 2016, June 27–30, 2016, Las Vegas, NV, USA (pp. 779–788). IEEE Computer Society. 10.1109/CVPR.2016.91

49. Sagadeeva & Boehm (2021). SliceLine: Fast, linear-algebra-based slice finding for ML model debugging. In G. Li, Z. Li, S. Idreos, & D. Srivastava (Eds.), SIGMOD ’21: International conference on management of data, virtual event, China, June 20–25, 2021 (pp. 2290–2299). ACM. 10.1145/3448016.3457323

50. Sagawa, Koh, P. W., Hashimoto, T. B., & Liang (2020). Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. In 8th International Conference on Learning Representations, ICLR 2020 (pp. 1–19). https://www.scopus.com/inward/record.uri?eid=2-s2.0-85150613153&partnerID=40&md5=9500e6ad6b1c2fc3da67c2f61683471f

51. Sarker, M. K., Zhou, Eberhart, & Hitzler (2021). Neuro-symbolic artificial intelligence. AI Communications, 34(3), 197–209. 10.3233/AIC-210084

52. Shakerin, Salazar, & Gupta (2017). A new algorithm to automate inductive learning of default theories. Theory and Practice of Logic Programming, 17(5–6), 1010–1026. 10.1017/S1471068417000333

53. Shindo, Pfanschilling, Dhami, D. S., & Kersting (2023). αILP: Thinking visual scenes as differentiable logic programs. Machine Learning, 112(5), 1465–1497. 10.1007/s10994-023-06320-1

54. Slyman, Kahng, & Lee (2023). VLSlice: Interactive vision-and-language slice discovery. In IEEE/CVF international conference on computer vision, ICCV 2023, Paris, France, October 1–6, 2023 (pp. 15245–15255). IEEE. 10.1109/ICCV51070.2023.01403

55. Sohoni, N. S., Dunnmon, Angus, & Ré (2020). No subclass left behind: Fine-grained robustness in coarse-grained classification problems. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, & H. Lin (Eds.), Advances in neural information processing systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6–12, 2020, virtual (pp. 19339–19352). https://proceedings.neurips.cc/paper/2020/hash/e0688d13958a19e087e123148555e4b4-Abstract.html

56. Szeliski (2022). Computer vision – Algorithms and applications (2nd ed.). Texts in computer science. Springer. 10.1007/978-3-030-34372-9

57. Tello, de la Cruz, Ribeiro, Fiérrez, Morales, Tolosana, Alonso, C. L., & Ortega (2023). Symbolic AI (LFIT) for XAI to handle biases. In R. Calegari, A. A. Tubella, G. González-Castañé, V. Dignum, & M. Milano (Eds.), Proc. 1st Workshop on fairness and bias in AI co-located with 26th European conference on artificial intelligence (ECAI 2023), Kraków, Poland, October 1st, 2023 (pp. 1–20). CEUR Workshop Proceedings, Vol. 3523. CEUR-WS.org. https://ceur-ws.org/Vol-3523/paper10.pdf

58. Turcotte, Muggleton, S. H., & Sternberg, M. J. E. (1998). Application of inductive logic programming to discover rules governing the three-dimensional topology of protein structure. In D. Page (Ed.), Inductive logic programming, 8th international workshop, ILP-98, Madison, Wisconsin, USA, July 22–24, 1998, Proceedings (pp. 53–64). LNCS, Vol. 1446. Springer. 10.1007/BFb0027310

59. Wang & Gupta (2022). FOLD-R++: A scalable toolset for automated inductive learning of default theories from mixed data. In M. Hanus & A. Igarashi (Eds.), Functional and logic programming – 16th international symposium, FLOPS 2022, Kyoto, Japan, May 10–12, 2022, Proceedings (pp. 224–242). LNCS, Vol. 13215. Springer. 10.1007/978-3-030-99461-7_13

60.

Gan

Chen

Wan

P. S.

(2023). Multimodal large language models: A survey. In J. He, T. Palpanas, X. Hu, A. Cuzzocrea, D. Dou, D. Slezak, W. Wang, A. Gruca, J. C. Lin, & R. Agrawal (Eds.), IEEE international conference on big data, BigData 2023, Sorrento, Italy, December 15–18, 2023 (pp. 2247–2256). IEEE. 10.1109/BigData59044.2023.10386743

61. Yan, Fokoue, Chang, & Julius (2022). Neuro-symbolic models for interpretable time series classification using temporal logic description. In X. Zhu, S. Ranka, M. T. Thai, T. Washio, & X. Wu (Eds.), IEEE international conference on data mining, ICDM 2022, Orlando, FL, USA, November 28–December 1, 2022 (pp. 618–627). IEEE. 10.1109/ICDM54844.2022.00072

62. Youssef, Y. M., & Müller, M. E. (2023). A review of inductive logic programming applications for robotic systems. In E. Bellodi, F. A. Lisi, & R. Zese (Eds.), Inductive logic programming – 32nd international conference, ILP 2023, Bari, Italy, November 13–15, 2023, Proceedings (pp. 154–165). LNCS, Vol. 14363. Springer. 10.1007/978-3-031-49299-0_11

63. Zhang, Liu, & Khurshid (2018). DeepRoad: GAN-based metamorphic testing and input validation framework for autonomous driving systems. In M. Huchard, C. Kästner, & G. Fraser (Eds.), Proceedings of 33rd ACM/IEEE international conference on automated software engineering, ASE 2018, Montpellier, France, September 3–7, 2018 (pp. 132–142). ACM. 10.1145/3238147.3238187

64. Zhang, Liu, & Wang (2023). Learning to binarize continuous features for neuro-rule networks. In Proceedings of thirty-second international joint conference on artificial intelligence, IJCAI 2023, 19th–25th August 2023, Macao, SAR, China (pp. 4584–4592). ijcai.org. 10.24963/ijcai.2023/510

Leveraging Neurosymbolic AI for Slice Discovery

Table 12. Top-1 Accuracy of the VE: $H_1$ Model on the ImageNet Validation Set after the Initial Model Training.