Abstract

This work compares large language models (LLMs) and neuro-symbolic approaches in solving Raven’s progressive matrices (RPMs), a visual abstract reasoning test that involves the understanding of mathematical rules such as progression or arithmetic addition. Providing the visual attributes directly as textual prompts, which assumes an oracle visual perception module, allows us to measure the model’s abstract reasoning capability in isolation. Despite providing such compositionally-structured representations from the oracle visual perception and advanced prompting techniques, both GPT-4 and Llama-3 70B cannot achieve perfect accuracy on the center constellation of the I-RAVEN dataset. Our analysis reveals that the root cause lies in the LLM’s weakness in understanding and executing arithmetic rules. As a potential remedy, we analyze the Abductive Rule Learner with Context-awareness (ARLC), a neuro-symbolic approach that learns to reason with vector-symbolic architectures. Here, concepts are represented with distributed vectors such that dot products between encoded vectors define a similarity kernel, and element-wise vector operations perform addition/subtraction on the encoded values. We find that ARLC achieves almost perfect accuracy on the center constellation of I-RAVEN, demonstrating a high fidelity in arithmetic rules. To stress the length generalization capabilities, we extend the RPM tests to larger matrices (3 $\times$ 10 instead of typical 3 $\times$ 3) and larger dynamic ranges of the attribute values (from 10 up to 1000). We find that the LLM’s accuracy of solving arithmetic rules drops to sub-10%, especially as the dynamic range expands, while ARLC can maintain a high accuracy due to emulating symbolic computations on top of distributed representations.¹

Keywords

analogical reasoning, large language models, vector-symbolic architectures, reasoning benchmarks

1. Introduction

Abstract reasoning is often regarded as a core feature of human intelligence. This cognitive process involves abstracting rules from observed patterns in a source domain, and applying them in an unseen target domain. With the ultimate aim of achieving human-level intelligence, abstract reasoning tasks have sparked the interest of many in machine learning research. Thanks to the availability of large datasets (Barrett et al., 2018; Hu et al., 2021; Zhang et al., 2019), various learning-based methods, ranging from pure connectionist (Benny et al., 2021; Wu et al., 2020) to neuro-symbolic (Camposampiero et al., 2024; Hersche et al., 2023a, 2023b; Sun et al., 2025; Zhang et al., 2021, 2022) approaches, achieved promising results in this domain.

Figure 1.

This work compares the abstract reasoning capabilities of large language models (LLMs) and neuro-symbolic abductive rule learner with context-awareness (ARLC) on Raven’s progressive matrices (RPMs) tests. (a) An RPM example taken from the center constellation of I-RAVEN. The task is to find the empty panel at the bottom-right of the context matrix by selecting one of the answer candidates. (b) Solving RPMs through LLM prompting. Visual attribute values are extracted from the I-RAVEN dataset and assembled to individual per-attribute text-only prompts. LLMs are prompted to predict the attribute of the empty panel. Finally, the attribute predictions are compared with the answer candidates, whereby the best-matching answer is selected as the final answer. (c) Solving RPMs with neuro-symbolic ARLC that relies on distributed similarity-preserving representations and manipulates them via dimensionality-preserving operations; it learns rule-formulations as a differentiable assignment problem.

More recently, the zero- and few-shot capabilities of large language models (LLMs) and their multi-modal variants have been tested on various abstract reasoning tasks such as verbal (Gendron et al., 2024; Lewis & Mitchell, 2025; Stevenson et al., 2023; Webb et al., 2023) or visual (Ahrabian et al., 2024; Camposampiero et al., 2023; Cao et al., 2024; Hu et al., 2023; Jiang et al., 2024; Latif et al., 2024; Lewis & Mitchell, 2025; Mitchell et al., 2024; Webb et al., 2023; Wüst et al., 2024; Zhang et al., 2024) analogies. One natural approach towards zero-shot visual abstract reasoning is to leverage multi-modal LLM’s vision capabilities to solve the task end-to-end. However, these multi-modal models perform significantly worse than their text-only version (Mitchell et al., 2024), which might stem from a missing fine-grained compositional feature comprehension (Cao et al., 2024). As an additional help, LLMs have been provided with text-only inputs by giving them access to an oracle perception, that is, providing perfectly disentangled representations (Hu et al., 2023; Webb et al., 2023). While this generally improves their reasoning abilities, LLMs still fail to achieve perfect accuracy on many simple tasks. One example is represented by Raven’s progressive matrices (RPMs) (Raven et al., 1938), a benchmark that tests visual abstract reasoning capabilities by measuring the fluid intelligence of humans. Here, the state-of-the-art (SOTA) LLM-based approach (Hu et al., 2023) achieves only 86.4% accuracy in the center constellation of I-RAVEN (Hu et al., 2021), which we observe to be a gate-keeper for this task (see Section 2.1). In contrast, recent neuro-symbolic approaches showed not only almost perfect accuracy on the center constellation of I-RAVEN, but also demonstrated high fidelity in out-of-distribution (OOD) settings. 
For instance, the Abductive Rule Learner with Context-awareness (ARLC) represents attribute values with high-dimensional, distributed representations based on vector-symbolic architectures (VSAs) (Gayler, 2003; Kanerva, 2009; Plate, 1995, 2003). Learning the RPMs rules boils down to a differentiable assignment problem of high-dimensional panel representations in a series of binding and unbinding operations, which can be solved with unconstrained optimization algorithms such as stochastic gradient descent (SGD). ARLC outperformed the SOTA LLM-based approach (Hu et al., 2023) both on in-distribution and OOD, thanks to relying on structured and similarity-preserving representations based on fractional power encoding (FPE) (Plate, 2003).

This article extends the initial work on ARLC (Camposampiero et al., 2024) by comparing its abstract reasoning capability with two prominent LLMs (GPT-4 (Achiam et al., 2024) and Llama-3 70B (Dubey et al., 2024)) (see Figure 1). Circumventing the perception by providing ground-truth attribute labels to the models allows us to measure their analogical and mathematical reasoning capabilities in isolation. Hence, we evaluate the reasoning capabilities of LLMs under conditions that play to their strengths, namely, language understanding, when such compositionally structured (i.e., disentangled) representations are provided. Our comprehensive prompting efforts lead to very high accuracy for Llama-3 70B (85.0%) and GPT-4 (93.2%), where the latter notably outperforms previous reports with GPT-3 (Hu et al., 2023) (86.4%) and GPT-4 o1-preview (Latif et al., 2024) (18.00%). The LLMs’ imperfect accuracy on the isolated task motivated us to further analyze their capability of detecting and executing different rules. In both GPT-4 and Llama-3 70B, we find a notable weakness in performing arithmetic rules that require row-wise additions or subtractions (e.g., see the last prompt in Figure 2). To gain more insight into this behavior, we set up a new RPM dataset (I-RAVEN-X) that increases the grid size from 3 $\times$ 3 to 3 $\times$ 10, additionally allowing for a configurable dynamic range for the arithmetic computations. Here, too, we observe a notable weakness in the arithmetic rule, which is amplified further by an increasing dynamic range. On the other hand, ARLC demonstrates high accuracy on larger grid sizes and allows increasing the dynamic range without further retraining, thanks to the capability of adjusting the underlying structured FPE representations.

Figure 2.

(a) Individual per-attribute text-only prompts to solve Raven’s progressive matrices (RPMs) tasks from I-RAVEN. (b) Example prompts from our novel configurable I-RAVEN-X dataset of size 3 $\times$ 10 with a value range of $m = 1000$ . In both the I-RAVEN and I-RAVEN-X examples, the LLM (GPT-4) errs on the arithmetic rules.

2. Datasets

2.1. I-RAVEN

We test the models on the center constellation of I-RAVEN (Hu et al., 2021) (see Figure 1). The test consists of a 3 $\times$ 3 context matrix, where the bottom-right panel is missing. The task is then to select the correct answer from eight candidate panels to complete the matrix. Each panel contains an object, which is characterized by different attributes (shape, size, and color). The relation between each attribute’s value in different panels is governed by a well-defined set of rules:

– constant: This rule keeps the attribute value constant within the row.
– progression: The attribute value monotonically increases or decreases in a row by a value of 1 or 2.
– arithmetic: The attribute values of the first two panels are either added (arithmetic plus) or subtracted (arithmetic minus), yielding the attribute value of the third panel in the row.
– distribute three: This rule involves the fact that three different values of an attribute appear in the three panels of every row (with distinct permutations of the values in different rows). The same holds with respect to the columns.
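
The row-wise rules above can be sketched as simple integer checks. The following Python snippet is a minimal illustration, not the I-RAVEN generator: the function names (`row_rules`, `consistent_rules`, `predict_missing`) and the default value range `m = 10` are assumptions made for this example, and the column condition of distribute three is omitted for brevity.

```python
from typing import List, Set


def row_rules(row: List[int]) -> Set[str]:
    """Return every rule label that a complete three-panel row satisfies."""
    a, b, c = row
    rules = set()
    if a == b == c:
        rules.add("constant")
    step = b - a
    if step == c - b and step in (-2, -1, 1, 2):
        # Keep the step in the label so that all rows must agree on it.
        rules.add(f"progression {step:+d}")
    if a + b == c:
        rules.add("arithmetic plus")
    if a - b == c:
        rules.add("arithmetic minus")
    return rules


def consistent_rules(rows: List[List[int]]) -> Set[str]:
    """Rules that hold in all rows of a complete 3x3 attribute matrix."""
    common = set.intersection(*(row_rules(r) for r in rows))
    # distribute three (rows only): the same three distinct values in every row.
    same_values = len({frozenset(r) for r in rows}) == 1
    if same_values and all(len(set(r)) == 3 for r in rows):
        common.add("distribute three")
    return common


def predict_missing(context: List[List[int]], m: int = 10) -> List[int]:
    """Values for the bottom-right panel that make some rule hold in all rows.

    `context` holds two complete rows followed by the two known values
    of the last row.
    """
    *full_rows, last = context
    return [v for v in range(m) if consistent_rules(full_rows + [last + [v]])]


# Rows 1 and 2 follow arithmetic plus, so the last row should as well.
print(predict_missing([[2, 3, 5], [1, 4, 5], [3, 3]]))
```

Because a row such as (1, 2, 3) satisfies both progression and arithmetic plus, the sketch collects all matching rules per row and intersects them across rows instead of committing to the first match.
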
The task is to infer the rule governing each attribute in the context matrix and use it to determine the content of the missing (bottom-right) panel, selecting it within the eight candidate answers. Compared to other RPMs benchmarks that have been used to evaluate LLMs (Webb et al., 2023), I-RAVEN tests a more comprehensive range of logical and arithmetic skills.

While I-RAVEN provides tests in various constellations with more objects that may intuitively appear more arduous to solve, LLMs are more challenged with the seemingly simple constellations. For instance, GPT-3 achieved a higher accuracy on the 2x2 and 3x3 constellations (78.0% and 86.4%) than on center (77.2%) (Hu et al., 2023). Moreover, high accuracy can be maintained on the 2x2 and 3x3 constellations while only looking at the last row of the context matrix (Hu et al., 2023), effectively showing that no analogical reasoning is required to solve the test in these constellations. Hence, we opted to focus our evaluation on the center constellation only, using 500 samples from I-RAVEN’s test set.

Inspired by recent works (Hu et al., 2023; Webb et al., 2023), we simplify RPMs from a visual abstract reasoning test to a purely abstract reasoning test. Assuming a perfect perception, we extract the attribute values from I-RAVEN and use them to create the prompts for the model. This approach simplifies the RPM task as it not only provides the correct attributes but also filters irrelevant attributes. In a follow-up work (Camposampiero et al., 2025), we tested the robustness of LLMs against uncertainties in the perception. As expected, adding confounding attributes as well as a smoothened non-one-hot distribution degraded the LLMs’ performance considerably.

2.2. New I-RAVEN-X

To further evaluate the mathematical reasoning capabilities at scale, we introduce an extension of the I-RAVEN’s center constellation, called I-RAVEN-X. Our new benchmark maintains I-RAVEN’s four rules and three attributes but allows for a parameterizable number of columns ( $g$ ) and a dynamic range of attribute values ( $m$ ). When generating a new RPMs example, we uniformly sample from one of the available rules (constant, progression, arithmetic, and distribute three). Note that the attribute shape does not incur the arithmetic rule. In the following, we describe the generation process of the RPM context matrix of size $3 \times g$ for the individual rules. The overall goal is that the values stay in the range $[0, m - 1]$ .

– constant: For each row, we uniformly sample an integer from the set $\{0, 1, \dots, m - 1\}$ and duplicate it along the row.
– progression: First, we uniformly sample the progressive increment/decrement ( $δ$ ) from the set $\{- 2, - 1, + 1, + 2\}$ . In case of a positive increment, we first define the values of the right-most column by uniformly sampling from the set $\{(g - 1) \cdot δ, \dots, m - 1\}$ for each row. Then, the rest of the matrix is completed by applying the progression rule. For a negative $δ$ , the sampling is mirrored, starting from the first column.
– arithmetic: The attribute values of the first $g - 1$ panels are either added (arithmetic plus) or subtracted (arithmetic minus), yielding the attribute value of the last panel in the row. In arithmetic plus, we sequentially sample the values of the first $g - 1$ panels in the row. For each panel, we set the sampling range to $\{0, \dots, m - s\}$ , where $s$ is the sum of the already sampled panels in the row. Afterward, the first $g - 1$ panels are shuffled. Finally, the value of the last panel in each row is the sum of the first $g - 1$ ones. For arithmetic minus, we apply the same sampling strategy but leave the first column empty. The value of the first column is then defined as the sum of the other columns.
– distribute-n: We uniformly sample distinct values for the first row from $\{0, \dots, m - 1\}$ . The content of the remaining rows is defined by applying a circular shift per row (either right or left).
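
The generation steps above can be sketched in a few lines of Python. This is an illustrative reading of the description, not the released I-RAVEN-X generator: all function names are hypothetical, the progression sampler simply raises an error when $(g - 1) \cdot |δ|$ exceeds $m - 1$, and the arithmetic sampler uses the upper bound $m - 1 - s$ (one less than the bound given in the text) as an assumption so that the row sum also stays within $[0, m - 1]$.

```python
import random
from typing import List


def gen_constant(g: int, m: int, rng: random.Random) -> List[List[int]]:
    # One value per row, duplicated along the row.
    return [[rng.randrange(m)] * g for _ in range(3)]


def gen_progression(g: int, m: int, rng: random.Random) -> List[List[int]]:
    delta = rng.choice([-2, -1, 1, 2])
    span = (g - 1) * abs(delta)
    if span > m - 1:
        raise ValueError("value range too small for this grid size and step")
    rows = []
    for _ in range(3):
        anchor = rng.randint(span, m - 1)
        if delta > 0:
            # anchor is the right-most value; fill the row backwards.
            rows.append([anchor - (g - 1 - j) * delta for j in range(g)])
        else:
            # Mirrored case: anchor is the left-most value.
            rows.append([anchor + j * delta for j in range(g)])
    return rows


def _sample_addends(g: int, m: int, rng: random.Random) -> List[int]:
    values, s = [], 0
    for _ in range(g - 1):
        v = rng.randint(0, m - 1 - s)  # assumption: keeps the sum below m
        values.append(v)
        s += v
    rng.shuffle(values)
    return values


def gen_arithmetic(g: int, m: int, rng: random.Random, plus: bool) -> List[List[int]]:
    rows = []
    for _ in range(3):
        values = _sample_addends(g, m, rng)
        # plus: last panel is the sum; minus: first panel is the sum.
        rows.append(values + [sum(values)] if plus else [sum(values)] + values)
    return rows


def gen_distribute(g: int, m: int, rng: random.Random) -> List[List[int]]:
    first = rng.sample(range(m), g)   # g distinct values, requires m >= g
    step = rng.choice([-1, 1])        # shift direction
    rows = []
    for i in range(3):
        k = (step * i) % g
        rows.append(first[k:] + first[:k])
    return rows


rng = random.Random(0)
print(gen_arithmetic(g=10, m=1000, rng=rng, plus=True)[0])
```
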

Finally, we generate the candidate answers using I-RAVEN’s attribute bisection tree (Hu et al., 2021). The original RAVEN dataset had a flaw in the generation of the answer set. Each distractor in the answer set (i.e., a wrong answer candidate) was generated by randomly altering one attribute of the correct answer. As a result, one could predict the correct answer by taking the mode of the answer candidates without looking at the context matrix, therefore bypassing the actual reasoning task. As a remedy, the attribute bisection tree generates unbiased answers that are well balanced. Figure 2(b) shows example prompts generated from samples of our new dataset.

3. LLM-Based RPM Solving

3.1. Models

We focused our evaluations on text-only LLMs. There exist attempts (Ahrabian et al., 2024; Cao et al., 2024; Jiang et al., 2024; Mitchell et al., 2024; Zhang et al., 2024) that leverage vision support of multi-modal LLMs (e.g., GPT-4V) by directly feeding the models with visual RPMs data; however, they achieve consistently lower reasoning performance than with text-only prompting. The SOTA LLM-based abstract reasoning approach (Hu et al., 2023) relied on reading out GPT-3’s (text-davinci-002) token probabilities. However, this model is no longer accessible to users and its successive iterations do not allow the retrieval of prediction logits. Hence, we considered discrete classification approaches that are based on output strings rather than distribution over tokens. In particular, we investigated two SOTA LLMs: the proprietary GPT-4 (Achiam et al., 2024)² (gpt-4-0613) and the open-source Llama-3 70B (Dubey et al., 2024).³ More recent iterations of these models were not considered in our analysis for different reasons. Meta’s attribution requirement in their updated terms regarding naming conventions prevented us from testing Llama-3.1. During initial tests, GPT-4o yielded worse results than GPT-4; hence, we focused on GPT-4. Moreover, the evaluation of large reasoning models, such as DeepSeek’s R1 (Guo et al., 2025) or OpenAI’s o-series (OpenAI, 2024), is covered in our follow-up work (Camposampiero et al., 2025).

3.2. Prompting and Classification

Entangled and disentangled prompts

Following Hu et al. (2023), we use numerical descriptions of the attribute values, which have led to better performance than textual descriptions (Latif et al., 2024). Moreover, we evaluate two different prompting strategies, entangled and disentangled prompting. The entangled prompting provides all the attributes’ values in a single prompt (see Appendix A.1). The disentangled prompting, on the other hand, is a compositionally structured approach that queries the LLM for individual attribute prediction. Disentangled prompting simplifies the task, but increases the number of queries by 3 $\times$ .

Discriminative and predictive classification

Similarly to Gendron et al. (2024), we consider two approaches to solve RPM tests with LLMs. In the discriminative approach, we provide the attribute descriptions of both the context matrix and the answer candidates. The LLM is then asked to return the panel number of the predicted answer. Appendix A.2 provides an example prompt of the discriminative approach. In the predictive approach, we prompt the LLM only with the context matrix without the candidate answers. The LLM has to predict the value of the empty panel (see Figure 2). For selecting the final answer, we compare the predicted values with the answer panels and pick the one with the highest number of overlapping values. While the predictive approach may appear more difficult, it implicitly biases the LLM to approach the task as humans usually do, that is, first applying a generative process to abduce rules and execute them to synthesize a possible solution, and then discriminatively selecting the most similar answer from choices (Holyoak & Morrison, 2013). Moreover, the final answer selection is done without the intervention of the LLM, rendering phenomena like hallucinations less likely. Thus, the predictive classification can be seen as a more guided approach that helps the LLM solve the task.
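
The answer selection step of the predictive approach can be sketched as follows. The function name `select_answer`, the dictionary layout of the panels, and the tie-breaking rule (the first best-matching candidate wins) are assumptions for this example; the text does not specify how ties are resolved.

```python
from typing import Dict, List


def select_answer(predicted: Dict[str, int], candidates: List[Dict[str, int]]) -> int:
    """Return the index of the candidate sharing the most attribute values
    with the per-attribute predictions."""
    overlaps = [
        sum(candidate.get(attr) == value for attr, value in predicted.items())
        for candidate in candidates
    ]
    # max() returns the first index with the highest overlap (assumed tie-break).
    return max(range(len(candidates)), key=lambda i: overlaps[i])


predicted = {"shape": 3, "size": 2, "color": 7}
candidates = [
    {"shape": 3, "size": 2, "color": 1},
    {"shape": 3, "size": 2, "color": 7},
    {"shape": 0, "size": 2, "color": 7},
]
print(select_answer(predicted, candidates))
```
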

Self-consistency

As an optional extension, we employ self-consistency (Lewkowycz et al., 2022; Wang et al., 2023) by querying the model multiple times ( $n = 7$ times), sampling the next token from the distribution with a non-zero soft-max temperature. We find the optimal soft-max temperature for GPT-4 ( $T = 0.5$ ) and Llama-3 70B ( $T = 0.4$ ) via a grid search on a subset of 50 I-RAVEN problems. We did not explore the effect of other parameters, such as top-k or top-p, and set them to the default values. The final prediction is determined by the majority vote over the sampled outputs. The selection of an odd number of samples (i.e., $n = 7$ ) helps to prevent potential ties.
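
A minimal sketch of the majority vote is given below. The callable `sample_fn` is a hypothetical stand-in for one temperature-sampled LLM query that returns the predicted attribute value as a string; no real API is called here.

```python
from collections import Counter
from typing import Callable


def self_consistent_answer(sample_fn: Callable[[], str], n: int = 7) -> str:
    """Query the model n times and return the most frequent answer."""
    votes = Counter(sample_fn() for _ in range(n))
    return votes.most_common(1)[0][0]


# Stand-in for seven sampled completions of the same prompt.
fake_outputs = iter(["4", "5", "4", "4", "6", "5", "4"])
print(self_consistent_answer(lambda: next(fake_outputs)))
```

With more than two distinct answers, an odd $n$ does not rule out ties entirely; in this sketch, `Counter.most_common` then returns the tied answer that was encountered first.
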

In-context learning

For a better understanding of the RPM task, we optionally prefix 16 in-context examples to the prompt (Brown et al., 2020). In the predictive classification approach (where no answer candidates are provided), we simply provide complete example RPMs. The in-context samples are randomly selected from I-RAVEN’s training set. Examples that had the same context matrix as the actual task are discarded and re-sampled to prevent shortcut solutions.

4. ARLC: Learning Abductive Reasoning Using VSA Distributed Representations

This section presents the ARLC, which performs neuro-symbolic reasoning with distributed VSA representations (see Figure 3). ARLC projects each panel’s attribute value (or distributions of values) into a high-dimensional VSA space. The resulting VSA vectors preserve the semantic similarity between attribute values: the dot products between corresponding VSA encoded vectors define a similarity kernel (Frady et al., 2022; Plate, 2003). Moreover, simple component-wise operations on these vectors, binding and unbinding, perform addition and subtraction, respectively, on the encoded values. For rule learning, ARLC introduces a generic rule template with several terms forming a series of binding and unbinding operations between vectors. The problem of learning the rules from data is reduced to a differentiable assignment problem between the terms of the general rule template and the VSA vectors encoding the contents of the panels, which can be learned with standard SGD. ARLC was initially presented by Camposampiero et al. (2024); this work mainly compares it to the reasoning capabilities of LLMs on I-RAVEN, and demonstrates its extension to larger grid sizes and dynamic ranges on our novel I-RAVEN-X.

Figure 3.

ARLC architecture. ARLC maps attribute values, or distributions of values, to distributed VSA representations, where the semantic similarity between values is preserved via a notion of kernel. Learnable rules ( $r_{1}, \dots, r_{R}$ ) predict the VSA representation of the empty panel ( ${\hat{v}}_{a, r}^{(3, 3)}$ ) together with a confidence value ( $s_{r}$ ). The closest answer to the predicted soft-selected prediction ( ${\hat{v}}_{a}^{(3, 3)}$ ) is chosen as the final answer.

4.1. From Visual Attributes to Distributed VSA Representations

ARLC’s key concept is to represent attribute values with high-dimensional, distributed VSA vectors that preserve the semantic similarity between the attribute values thanks to an introduced kernel notion. We start by defining a VSA that equips the space with dimensionality-preserving vector operations. Bundling ( $\oplus$ ) is a similarity-preserving operation that creates a superposition of the operands, that is, the resulting vector will have a high similarity with the two operands. Binding ( $\otimes$ ) associates two elements, effectively encoding a relationship between two vectors. For example, binding the attribute “color” with the value “red” produces a new vector that represents this pair. Importantly, this operation destroys similarity: the result is dissimilar to both operands. Unbinding ( $⊘$ ) is the inverse operation. Given a bound pair and one of its components (e.g., the attribute), unbinding retrieves the other (e.g., the value). This allows for structured information retrieval. The main difference between members of the VSA family is the specific realization of the bundling, binding, and vector space (see Kleyko et al., 2023).

Specifically, ARLC uses binary generalized sparse block codes (GSBCs) (Hersche et al., 2024) as a particular VSA instance. In binary GSBCs, the $D$ -dimensional vectors are divided into $B$ blocks of equal length, $L = D / B$ , where only one (randomly selected) element per block is set to 1 ( $D = 1024$ and $B = 4$ ). The algebraic operations of binary GSBCs are defined in Table 1. The choice of binary GSBCs is motivated by better retrieval accuracy when encoding probability mass functions (PMFs), compared to other alternatives such as Fourier holographic reduced representations (FHRRs) (Hersche et al., 2023b). See Appendix B for a detailed background on VSA.

Table 1.
Supported VSA Operations and Their Equivalent in $R$ .

Operation	Binary GSBCs with FPE	Equivalent in $R$
Binding ( $\otimes$ )	Block-wise circular convolution	Addition $+$
Unbinding ( $⊘$ )	Block-wise circular correlation	Subtraction $-$
Bundling ( $\oplus$ )	Sum and normalization	–
Similarity ( $⊙$ )	Cosine similarity ( $cos (\cdot, \cdot)$ )	–

VSA = vector-symbolic architecture; GSBCs = generalized sparse block codes; FPE = fractional power encoding.

Next, we define a mapping $z : Z^{+} \to R^{D}$ that enables the projection of input RPM attributes into a corresponding high-dimensional, semantically rich feature space. Note that this work focuses on mapping integer values as the attribute values in I-RAVEN are integer-valued too. However, generalizing this approach to real-valued domain mappings is possible using FHRR (Plate, 1995). Leveraging fractional power encoding (FPE) (Plate, 2003), a value $v \in Z^{+}$ is encoded as follows:

$$z (v) = z^{v} = \bigotimes_{n = 1}^{v} z,$$

where $z \in R^{D}$ is a randomly drawn binary GSBC vector. This mapping yields a similarity kernel between neighboring vector representations (Frady et al., 2022), as shown in Figure 4.

Figure 4.

Similarity kernel in VSA. Mapping two values ( $v_{1}$ and $v_{2}$ ) to a VSA space (i.e., GSBC in ARLC) that uses FPE and computing their similarity in the VSA space yields the shown similarity kernel $K (v_{1} - v_{2})$ .

Let us assume two variables with values $v_{1}$ and $v_{2}$ , which are represented with two VSA vectors ( $z (v_{1}) = z^{v_{1}}$ and $z (v_{2}) = z^{v_{2}}$ ). Binding the two vectors yields $z (v_{1}) \otimes z (v_{2}) = z^{v_{1}} \otimes z^{v_{2}} = z^{v_{1} + v_{2}}$ . Hence, binding in the VSA space is equivalent to addition in $R$ . In other words, the FPE initialization establishes a semantic equivalence between high-dimensional vectors and real numbers. This property is consistently exploited in ARLC’s framework, as it allows the analogies in the RPM puzzles to be solved as simple algebraic operations in the domain of real numbers. For example, by computing the similarity between the bound representation and a third projected variable ( $sim (z^{v_{1} + v_{2}}, z^{v_{3}})$ ), we can evaluate whether $v_{1} + v_{2} \overset{?}{=} v_{3}$ , which represents the arithmetic plus rule in RPMs.
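
The equivalence between binding and addition can be checked numerically. The following NumPy sketch is an illustration under stated assumptions rather than ARLC's implementation: it computes the block-wise circular convolution and correlation in the Fourier domain, realizes the fractional power encoding as an element-wise power of the spectrum, and uses hypothetical helper names (`random_base`, `bind`, `unbind`, `fpe`, `sim`).

```python
import numpy as np

D, B = 1024, 4          # dimensionality and number of blocks, as in the text
L = D // B              # block length


def random_base(rng: np.random.Generator) -> np.ndarray:
    """Binary GSBC vector of shape (B, L): one randomly placed 1 per block."""
    z = np.zeros((B, L))
    z[np.arange(B), rng.integers(0, L, size=B)] = 1.0
    return z


def bind(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Block-wise circular convolution, computed via the FFT of each block."""
    spectrum = np.fft.rfft(x, axis=-1) * np.fft.rfft(y, axis=-1)
    return np.fft.irfft(spectrum, n=L, axis=-1)


def unbind(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Block-wise circular correlation, which undoes binding with y."""
    spectrum = np.fft.rfft(x, axis=-1) * np.conj(np.fft.rfft(y, axis=-1))
    return np.fft.irfft(spectrum, n=L, axis=-1)


def fpe(z: np.ndarray, v: int) -> np.ndarray:
    """Encode the integer v as z^v (the binding of v copies of z)."""
    return np.fft.irfft(np.fft.rfft(z, axis=-1) ** v, n=L, axis=-1)


def sim(x: np.ndarray, y: np.ndarray) -> float:
    """Cosine similarity between two (B, L) vectors."""
    return float(np.sum(x * y) / (np.linalg.norm(x) * np.linalg.norm(y)))


z = random_base(np.random.default_rng(0))
v1, v2 = 3, 4
bound = bind(fpe(z, v1), fpe(z, v2))
# Binding adds the encoded values, unbinding subtracts them.
print(sim(bound, fpe(z, v1 + v2)))
print(sim(unbind(bound, fpe(z, v2)), fpe(z, v1)))
```
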

One advantage of performing reasoning with distributed VSA representations is its capability to represent perceptual uncertainty in the variable values. Connecting to the previous example, let us assume that the first variable takes value $v_{1}$ with probability $p$ and value $v_{1}^{'}$ with probability $p^{'} = 1 - p$ . The distribution can be encoded as the weighted superposition of the two corresponding codewords: $p \cdot z^{v_{1}} + p^{'} \cdot z^{v_{1}^{'}}$ . The similarity computation between the bound representation and a third variable would then yield

$$sim ((p \cdot z^{v_{1}} + p^{'} \cdot z^{v_{1}^{'}}) \otimes z^{v_{2}}, z^{v_{3}}) = sim (p \cdot z^{v_{1}} \otimes z^{v_{2}} + p^{'} \cdot z^{v_{1}^{'}} \otimes z^{v_{2}}, z^{v_{3}}) \tag{1}$$

$$\approx p \cdot sim (z^{v_{1}} \otimes z^{v_{2}}, z^{v_{3}}) + p^{'} \cdot sim (z^{v_{1}^{'}} \otimes z^{v_{2}}, z^{v_{3}}), \tag{2}$$

where the first equality uses the linearity of the binding operation, and the second approximation requires linearity of the similarity metric.⁴ Overall, this formulation allows the validation of multiple solutions (in this case two) using only a single binding and similarity computation.
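
As a hedged numerical illustration of equations (1) and (2), consider values chosen purely for this example ( $p = 0.8$ , $v_{1} = 2$ , $v_{1}^{'} = 3$ , $v_{2} = 4$ , $v_{3} = 6$ ) and assume that the codewords of distinct values are approximately orthogonal, so that $sim (z^{7}, z^{6}) \approx 0$ :

$$sim ((0.8 \cdot z^{2} + 0.2 \cdot z^{3}) \otimes z^{4}, z^{6}) \approx 0.8 \cdot sim (z^{6}, z^{6}) + 0.2 \cdot sim (z^{7}, z^{6}) \approx 0.8 \cdot 1 + 0.2 \cdot 0 = 0.8 .$$

Under these assumptions, the single similarity score reflects the probability mass for which $v_{1} + v_{2} = v_{3}$ holds.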

In the RPM application, each panel’s label is translated to a PMF $p_{a}^{(i, j)}$ , where $a$ is the attribute, $i$ is the row index and $j$ is the column index of the panel. The panel’s PMF is then projected into the VSA space as follows:

$$v_{a}^{(i, j)} = \sum_{k = 1}^{m} p_{a}^{(i, j)} [k] \cdot z^{k},$$

where $m$ is the number of possible values that the attribute $a$ can assume. Overall, this yields eight VSA vectors for each attribute $a$ (one for each panel of the input RPM), represented by

$$V_{a} := (v_{a}^{(1, 1)}, v_{a}^{(1, 2)}, \dots, v_{a}^{(3, 2)}) . \tag{3}$$

Note that the basis vectors are pre-computed and stored in a dictionary $C = \{z^{k}\}_{k = 1}^{m}$ containing $m$ elements.
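
The projection of a panel's PMF into the VSA space can be sketched with NumPy as follows. This is an illustration, not the authors' code: the helper names are hypothetical, the base vector uses hand-picked odd block offsets so that the codewords in this small example are exactly orthogonal, and the `readout` function (a scaled dot product with every codeword) is added only to inspect the result.

```python
import numpy as np

B, L = 4, 256                      # D = B * L = 1024, as in the text


def fpe(z: np.ndarray, v: int) -> np.ndarray:
    """Encode the integer v as z^v via a power of the block-wise spectrum."""
    return np.fft.irfft(np.fft.rfft(z, axis=-1) ** v, n=L, axis=-1)


def build_dictionary(z: np.ndarray, m: int) -> np.ndarray:
    """Pre-compute the codewords z^1, ..., z^m; result has shape (m, B, L)."""
    return np.stack([fpe(z, k) for k in range(1, m + 1)])


def encode_pmf(pmf: np.ndarray, dictionary: np.ndarray) -> np.ndarray:
    """Weighted superposition sum_k pmf[k] * z^k; result has shape (B, L)."""
    return np.tensordot(np.asarray(pmf, dtype=float), dictionary, axes=1)


def readout(v: np.ndarray, dictionary: np.ndarray) -> np.ndarray:
    """Dot product of v with every codeword, scaled by 1/B."""
    return dictionary.reshape(len(dictionary), -1) @ v.ravel() / B


# Demo base vector (assumption): since L = 256 is a power of two, odd block
# offsets keep the m codewords mutually orthogonal as long as m <= L.
z = np.zeros((B, L))
z[np.arange(B), [3, 17, 101, 201]] = 1.0

m = 10
C = build_dictionary(z, m)
pmf = np.zeros(m)
pmf[[1, 2]] = [0.7, 0.3]           # 0-based index k - 1 holds the mass of value k
v = encode_pmf(pmf, C)
print(np.round(readout(v, C), 3))
```
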

4.2. Learning RPM Rules as an Assignment Problem

Here we introduce a general framework for interpreting RPM rule learning as an assignment problem, where VSA vectors are mapped to placeholders in a rule expression composed of binding and unbinding operations. The previous example demonstrates that executing the arithmetic rule requires addition computations, which can be efficiently performed in the VSA space using the binding operation. Indeed, we find that other RPM rules (constant, progression, distribute three) can be described with one or multiple additions and subtractions as well, which can be represented in the VSA space using binding and unbinding operations, respectively (see Appendix D). Hence, the rules used in RPM can be generally framed as a series of binding and unbinding operations:

$$r = (c_{1} \otimes c_{2} \otimes c_{3} \otimes c_{4} \otimes c_{5} \otimes c_{6}) ⊘ (c_{7} \otimes c_{8} \otimes c_{9} \otimes c_{10} \otimes c_{11} \otimes c_{12}) . \tag{4}$$

Here each placeholder $c_{i}$ can either assume the value of a context panel $v_{a}^{(i, j)}$ or the identity vector $e$ . The assignments between each placeholder $c_{i}$ and its value are learned during training (or programmed) and depend on the specific rule. For instance, during the inference of the arithmetic plus rule on the third row of the context matrix, the assignments would correspond to:

$$c_{1} = v_{a}^{(3, 1)}, \quad c_{2} = v_{a}^{(3, 2)}, \quad c_{i} = e \;\; \forall i \in \{3, 4, \dots, 12\},$$

where $v_{a}^{(3, 1)}$ and $v_{a}^{(3, 2)}$ are the vector representations of the first and second panel of the bottom row, respectively. In this setting, learning RPM rules can be interpreted as an assignment problem between VSA vectors and the terms in equation (4).
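
To make the rule template concrete, the sketch below evaluates equation (4) with hard (programmed) assignments. It is a hedged illustration: the helper names are hypothetical, the operations are realized with FFTs over one-hot blocks, and the arithmetic minus assignment shown at the end is one possible choice made for this example, not one taken from the text.

```python
from functools import reduce

import numpy as np

B, L = 4, 256


def bind(x, y):
    """Block-wise circular convolution."""
    return np.fft.irfft(np.fft.rfft(x, axis=-1) * np.fft.rfft(y, axis=-1), n=L, axis=-1)


def unbind(x, y):
    """Block-wise circular correlation."""
    return np.fft.irfft(np.fft.rfft(x, axis=-1) * np.conj(np.fft.rfft(y, axis=-1)), n=L, axis=-1)


def fpe(z, v):
    """Encode the integer v as z^v."""
    return np.fft.irfft(np.fft.rfft(z, axis=-1) ** v, n=L, axis=-1)


def apply_rule(terms):
    """Evaluate (c1 ⊗ ... ⊗ c6) ⊘ (c7 ⊗ ... ⊗ c12) for a list of 12 terms."""
    assert len(terms) == 12
    return unbind(reduce(bind, terms[:6]), reduce(bind, terms[6:]))


# Demo base vector with hand-picked block offsets (assumption).
z = np.zeros((B, L))
z[np.arange(B), [3, 17, 101, 201]] = 1.0
e = fpe(z, 0)                      # identity: a 1 at position 0 of every block

# Arithmetic plus on the last row: c1 and c2 hold the two known panels.
v31, v32 = fpe(z, 2), fpe(z, 5)
plus_terms = [v31, v32] + [e] * 10
print(np.allclose(apply_rule(plus_terms), fpe(z, 2 + 5), atol=1e-8))

# One possible arithmetic minus assignment: the second panel goes to c7.
minus_terms = [v32] + [e] * 5 + [v31] + [e] * 5
print(np.allclose(apply_rule(minus_terms), fpe(z, 5 - 2), atol=1e-8))
```
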

Motivated by works in cognitive sciences and psychology that argue for the importance of context in the solution of analogies for humans (Chalmers et al., 1992; Cheng, 1990), ARLC uses a general formulation of the soft-assignment problem which relies on the notion of context:

$$c_{k} = \sum_{i = 1}^{I} w_{k}^{i} \cdot x_{i} + \sum_{j = 1}^{J} u_{k}^{j} \cdot o_{j} + v_{k} \cdot e, \tag{5}$$

where $w$ , $u$ , and $v$ are the learned parameters, which are subject to the following constraints:

$$\sum_{i = 1}^{I} w_{k}^{i} + \sum_{j = 1}^{J} u_{k}^{j} + v_{k} = 1, \quad 0 \leq w_{k}^{i} \leq 1 \;\; \forall i, \quad 0 \leq u_{k}^{j} \leq 1 \;\; \forall j, \quad 0 \leq v_{k} \leq 1, \quad \forall k .$$

Here $X = \{x_{1}, \dots, x_{I}\}$ is the set of attributes that define the current sample, that is, the description of the problem for which we infer a solution. $O = \{o_{1}, \dots, o_{J}\}$ is the set of attributes that define the context for that sample, which could be interpreted as a working memory from which additional information to infer the answer can be retrieved. For predicting the empty panel in the last row, the context ( $O$ ) corresponds to the first two rows and the current samples ( $X$ ) to the last row (see Figure 5(c)). Formally, we set $X = \{v_{a}^{(3, 1)}, v_{a}^{(3, 2)}\}$ and $O = \{v_{a}^{(1, 1)}, v_{a}^{(1, 2)}, v_{a}^{(1, 3)}, v_{a}^{(2, 1)}, v_{a}^{(2, 2)}\}$ for predicting the rightmost panel in the last row. We augment this standard prediction with two more permutations, which aim to predict the rightmost panel of the first and second row (see Figure 5(a) and (b)). The knowledge of the rightmost panels in the first two rows allows us to compute a rule confidence by comparing the rule’s prediction with the actual panel representation via the cosine similarity.
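
The soft assignment of equation (5) can be sketched as a convex combination of the candidate operands. The text does not state how the constraints on the weights are enforced; the softmax parameterization below is an assumption made for this example, as are the function names and the toy vectors.

```python
import numpy as np


def softmax(a):
    """Numerically stable softmax over a 1-D array."""
    a = np.asarray(a, dtype=float)
    ex = np.exp(a - a.max())
    return ex / ex.sum()


def soft_assign(logits, x, o, e):
    """Compute c_k = sum_i w_i x_i + sum_j u_j o_j + v e.

    The I + J + 1 weights come from a softmax over `logits` (assumption),
    so they are non-negative and sum to 1.
    """
    weights = softmax(logits)
    operands = np.stack(list(x) + list(o) + [e])
    return np.tensordot(weights, operands, axes=1)


rng = np.random.default_rng(0)
x = [rng.normal(size=16) for _ in range(2)]   # current samples (I = 2)
o = [rng.normal(size=16) for _ in range(5)]   # context panels (J = 5)
e = rng.normal(size=16)                       # stand-in for the identity vector

# Logits that strongly favour the first current sample select it almost exactly.
logits = np.array([50.0, 0, 0, 0, 0, 0, 0, 0])
c = soft_assign(logits, x, o, e)
print(np.allclose(c, x[0], atol=1e-8))
```
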

Figure 5.

Visualization of current samples ( $X = \{x_{1}, x_{2}\}$ , in yellow) and context ( $O = \{o_{1}, \dots, o_{5}\}$ , in green) panels when predicting the third panel for different rows, namely the first row (left), second row (center), and third row (right). Black objects represent the panels that are not used for the computation, while the question mark represents the unknown test panel, which is unavailable during inference.

4.3. Executing and Selecting the Learned Rules

ARLC learns a set of $R$ different rules with rule-specific weights ( $w_{r}, u_{r}, v_{r}$ ). Inference with the learned rule set is a two-step process: an execution step (where all the rules are applied in parallel to the input) and a selection step (where a prediction for the missing panel is generated). The application of each rule $r$ to an RPM example generates a tuple of three VSA vectors $({\hat{v}}_{a, r}^{(i, 3)})_{i = 1}^{3}$ , which corresponds to the result of the rule execution on the three rows of the RPM, together with a rule confidence value $s_{r}$ . The confidence value is computed as the sum of the cosine similarities between the predicted VSA vectors and their respective ground-truth vector,

$$s_{r} = \sum_{i = 1}^{3} \cos (v_{a}^{(i, 3)}, {\hat{v}}_{a, r}^{(i, 3)}) . \tag{6}$$

Note that the ground-truth value for the last row ( $v_{a}^{(3, 3)}$ ) is unknown during inference, since the RPM task is to predict this panel. Hence, we omit the last term of the sum ( $i = 3$ ) in the inference. The answer is finally produced by taking a linear combination of the VSA vectors generated by executing all the rules, weighted by their respective confidence scores (normalized to a valid probability distribution using a softmax function). More formally, if we define $s = [s_{1}, \dots, s_{R}]$ to be the concatenation of all rules’ confidence score and ${\hat{V}}_{a}^{(3, 3)} = [{\hat{v}}_{a, 1}^{(3, 3)}, \dots, {\hat{v}}_{a, R}^{(3, 3)}]$ to be the concatenation of all rules’ predictions for the missing panel, the final VSA vector predicted by the model for the attribute $a$ becomes

$${\hat{v}}_{a}^{(3, 3)} = softmax (s) \cdot {\hat{V}}_{a}^{(3, 3)} . \tag{7}$$

The use of the weighted combination can be understood as a soft-selection mechanism between rules and was found to be more effective compared to the hard-selection mechanism provided by sampling (Hersche et al., 2023a).
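The execution and selection steps can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: real-valued random vectors stand in for the VSA vectors, and the names `soft_select`, `preds`, and `targets` are hypothetical. The confidence follows equation (6) with the $i = 3$ term omitted, and the output follows equation (7).

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two real-valued vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def softmax(s):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(s - np.max(s))
    return e / e.sum()

def soft_select(preds, targets):
    """
    preds:   array (R, 3, D); preds[r, i] is rule r's prediction for the
             third panel of row i + 1.
    targets: array (2, D); known third panels of rows 1 and 2.
    Returns the confidence-weighted prediction for the missing panel
    and the softmax weights over the R rules.
    """
    R = preds.shape[0]
    # Eq. (6) at inference: only rows 1 and 2 contribute, v^(3,3) is unknown.
    s = np.array([
        sum(cosine(targets[i], preds[r, i]) for i in range(2))
        for r in range(R)
    ])
    w = softmax(s)
    # Eq. (7): linear combination of the rules' predictions for row 3.
    return w @ preds[:, 2, :], w

# Toy data (assumption): rule 0 reproduces rows 1 and 2 exactly.
rng = np.random.default_rng(0)
R, D = 5, 512
preds = rng.standard_normal((R, 3, D))
targets = preds[0, :2].copy()
v_hat, w = soft_select(preds, targets)
```

In this toy setup, rule 0 receives the largest weight because it is the only rule whose predictions match the first two rows; the remaining rules still contribute to `v_hat`, which is what distinguishes soft selection from picking a single rule.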

4.4. Training Loss and Other Implementation Aspects

We follow the training recipe provided by Learn-VRF (Hersche et al., 2023a). The model is trained using SGD with a learning rate $lr = 0.01$ for 25 epochs. The training loss is defined as the inverse cosine similarity between the three predicted panels and their corresponding ground truth

$$L = 1 - \sum_{i = 1}^{3} \cos (v_{a}^{(i, 3)}, {\hat{v}}_{a}^{(i, 3)}) . \tag{8}$$

As in Learn-VRF, we set the number of rules to $R = 5$ . A single set of rules is instantiated and shared between all RPM attributes.
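As a small illustration, the loss in equation (8) can be written directly in NumPy. The sketch assumes real-valued vectors of shape `(3, D)` for the ground-truth and predicted third panels; the function name `training_loss` is hypothetical, and the gradient-based optimization itself is not shown.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two real-valued vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def training_loss(v_true, v_pred):
    """Eq. (8): one minus the summed cosine similarity over the three rows.

    v_true, v_pred: arrays of shape (3, D), one vector per row of the RPM.
    """
    return 1.0 - sum(cosine(v_true[i], v_pred[i]) for i in range(3))

# Toy check (assumption): random vectors as stand-ins for panel encodings.
rng = np.random.default_rng(0)
v_true = rng.standard_normal((3, 256))
loss_perfect = training_loss(v_true, v_true.copy())
loss_opposite = training_loss(v_true, -v_true)
```

With the equation exactly as printed, a perfect prediction gives a loss of $-2$ and a fully anti-aligned prediction gives $4$; the constant offset does not affect the gradients used for training.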

4.5. Applying ARLC on I-RAVEN-X

While ARLC was initially designed for I-RAVEN, it can be seamlessly extended to our I-RAVEN-X with minor modifications. First, the number of binding/unbinding terms in equation (4) is increased, for example, from 12 to 22 to support the larger grid size of $g = 10$ . Moreover, we increase the number of entries in the dictionary ( $C$ ) to support the larger dynamic range ( $m$ ). Notably, only varying the dynamic range at constant grid size does not require retraining: we can simply replace the dictionary in order to support OOD generalization. Indeed, we could demonstrate that ARLC trained on a dynamic range of $m = 45$ can favorably generalize to a dynamic range of $m = 1000$ .
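The dictionary swap can be illustrated with a toy VSA. The sketch below assumes a complex phasor encoding in which value $k$ is represented by $e^{\mathrm{i} k \theta}$ for a fixed random phase vector $\theta$; this is one common VSA choice picked for illustration and is not claimed to be the encoding used by ARLC, nor does it reproduce equation (4). The names `make_dictionary` and `decode` are hypothetical. Element-wise multiplication adds the encoded values, multiplication with the complex conjugate subtracts them, and supporting a larger dynamic range only requires building a larger dictionary from the same $\theta$.

```python
import numpy as np

def make_dictionary(theta, m):
    """Codebook with one phasor vector per value 0..m (row k encodes k)."""
    values = np.arange(m + 1)
    return np.exp(1j * np.outer(values, theta))   # shape (m + 1, D)

def decode(v, dictionary):
    """Index of the codebook entry most similar to v."""
    sims = np.real(dictionary.conj() @ v) / v.size
    return int(np.argmax(sims))

rng = np.random.default_rng(0)
D = 1024
theta = rng.uniform(-np.pi, np.pi, size=D)   # base phases, fixed once

small = make_dictionary(theta, 50)      # dictionary for a small range
large = make_dictionary(theta, 1000)    # swapped in; theta is unchanged

# Binding adds the encoded values; binding with the conjugate subtracts.
total = large[400] * large[350]
diff = large[400] * large[350].conj()

sum_small = decode(small[20] * small[15], small)
sum_large = decode(total, large)
diff_large = decode(diff, large)
```

Here `sum_small`, `sum_large`, and `diff_large` should decode to 35, 750, and 50, respectively. The arithmetic itself is independent of the dictionary size, which is the property that makes replacing the dictionary sufficient in this toy setting.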

5. Results

5.1. Main Results on I-RAVEN

Table 2 compares our LLM results with ARLC on the center constellation of I-RAVEN, considering also a range of neuro-symbolic and connectionist baselines. For the LLMs, we show the results with the corresponding best prompting techniques (see the ablation in Section 5.2). Moreover, we present results for three different versions of ARLC: ${ARLC}_{progr}$ , where the model’s weights are manually programmed with RPM rules ( $R = 4$ , since constant can be considered as a special case of progression), ${ARLC}_{p \mapsto l}$ , where the model is initialized with the programmed rules and then trained with gradient descent, and ${ARLC}_{learn}$ , where the rules are learned from scratch from data.

Table 2.
Task Accuracy (%) on the Center Constellation of I-RAVEN.

Method	Parameters	Accuracy
MLP (Hersche et al., 2023a)	300 k	97.6
SCL (Wu et al., 2020)	961 k	${99.9}^{\pm 0.0}$
PrAE (Zhang et al., 2021)	n.a.	${83.8}^{\pm 3.4}$
NVSA (Hersche et al., 2023b)	n.a.	${99.8}^{\pm 0.2}$
Learn-VRF (Hersche et al., 2023a)	20 k	${97.7}^{\pm 4.1}$
GPT-3 (Hu et al., 2023)	175 b	86.4
Llama-3	70 b	85.0
GPT-4	unk.	93.2
${ARLC}_{progr}$	n.a.	${99.6}^{\pm 0.0}$
${ARLC}_{p \mapsto l}$	480	${99.6}^{\pm 0.0}$
${ARLC}_{learn}$	480	${98.4}^{\pm 1.5}$

Among the baselines, we replicate Learn-VRF (Hersche et al., 2023a); the other results are taken from Hersche et al. (2023b). The standard deviations are reported over 10 random seeds. Llama-3 and GPT-4 are queried with the corresponding best prompting technique (see Table 3). ARLC’s weights are either manually programmed ( ${ARLC}_{progr}$ ), learned from scratch ( ${ARLC}_{learn}$ ), or learned after manual programming ( ${ARLC}_{p \mapsto l}$ ). The number of parameters for GPT-4 is not publicly available. The reasoning backends of PrAE, NVSA, and our ${ARLC}_{progr}$ do not have trainable parameters. Learn-VRF = Learning VSA Rule Formulations; ARLC = Abductive Rule Learner with Context-awareness.

Among the LLM approaches, our GPT-4-based approach achieved the highest accuracy (93.2%), notably outperforming the previous SOTA LLM-based abstract reasoning approach on this benchmark (86.4%) (Hu et al., 2023). Yet, all LLM approaches fall behind the tailored connectionist and neuro-symbolic solutions. Notably, with only 480 learnable parameters, ${ARLC}_{learn}$ achieves a high accuracy of 98.4%. Moreover, we show that post-programming training ( ${ARLC}_{p \mapsto l}$ , 99.6%) maintains the knowledge of the model, rather than completely erasing it as shown in other settings (Wu et al., 2019).

5.2. Ablation of LLM Prompting Techniques

Table 3 shows the task accuracy on I-RAVEN using GPT-4 and Llama-3 70B in various prompting configurations. Overall, both models benefit from the additional guidance provided by our prompting techniques. Concretely, using a predictive approach and querying for individual disentangled attributes already yields high accuracies (91.4% and 83.2% for GPT-4 and Llama-3 70B, respectively). Introducing self-consistency further improves the accuracy for both models. Llama-3 70B’s performance can be pushed further (to 85.0%) by combining self-consistency with in-context learning. In contrast, GPT-4 cannot make use of the additional in-context samples and yields a lower accuracy instead. Indeed, recent work on LLM reasoning models (Guo et al., 2025) made a similar observation, where “few-shot prompting consistently degrades its performance.” To test the potential impact of instruction-tuning, we conducted experiments with Llama 3 70B Instruct. We found that the instruction-tuned model generally performs worse, achieving 64.6% and 79.2% with and without in-context learning, respectively. We leave the search for an optimized set and sequence of in-context examples, which has been shown to improve the performance of instruction-tuned models (Liu et al., 2024), for future work.

Table 3.
Ablation Study Considering Various LLM Prompting Techniques. We Report the Task Accuracy (%) on the Center Constellation of I-RAVEN.

Predictive/ discriminative	Disentangled queries per attribute (3 $\times$ queries)	Self-consistency (n=7)	In-context learning (s=16)	GPT-4	Llama-3 70B
Discriminative				46.8	22.8
Discriminative	✓			63.0	22.4
Predictive				74.8	79.0
Predictive	✓			91.4	83.2
Predictive	✓	✓		93.2	84.8
Predictive	✓		✓	85.4	84.8
Predictive	✓	✓	✓	86.4	85.0

LLM = large language model.

5.3. LLMs Show Weakness in Arithmetic Rule

Even though both LLMs achieve a reasonable overall task accuracy, they fail in some instances. We shed more light on the reasoning capability of the two models by analyzing the accuracy of predicting the correct value for a given rule. As shown in Table 4, both models perform well on the constant, progression, and distribute three rules, whereas the accuracy notably drops for the arithmetic rule. One explanation for the accuracy drop could be the LLMs’ tendency toward (short-sighted) relational reasoning, instead of performing relational mapping, which requires understanding the first two rows before applying a rule to the last row (Stevenson et al., 2023). We analyze this hypothesis in Appendix C, where we attempt to explain the LLMs’ wrong predictions by rules that may have been inferred from the last row. For GPT-4, 32 out of 68 errors can be explained by rules that might have been inferred from a partial context matrix, for example, a constant or progression rule based on the last row.

Table 4.
Accuracy (%) of Predicting the Correct Attribute Value.

Model	Disentangled queries per attribute (3 $\times$ queries)	Constant	Progression	Distribute three	Arithmetic
GPT-4	No	100	98.0	91.6	27.1
	Yes	100	100	99.5	73.6
Llama-3 70B	No	100	97.2	99.3	31.0
	Yes	100	100	96.6	45.0

Self-consistency (n=7) is used. Results are averaged across all attributes.
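To make the partial-context hypothesis concrete, the following sketch contrasts the arithmetic rule with two guesses that only look at the last row. It is a hypothetical illustration: the example values and the function names are assumptions, and the two short-sighted functions are one possible reading of “a constant or progression rule based on the last row.”

```python
def arithmetic(x1, x2, sign=+1):
    """Third value under the arithmetic rule: x1 + x2 (or x1 - x2)."""
    return x1 + sign * x2

def constant_from_last_row(x1, x2):
    """Short-sighted guess: repeat the last seen value."""
    return x2

def progression_from_last_row(x1, x2):
    """Short-sighted guess: continue the step seen within the last row."""
    return x2 + (x2 - x1)

# Hypothetical attribute values: the first two rows follow x3 = x1 + x2.
rows = [(1, 2, 3), (2, 4, 6)]
last = (3, 2)  # third value is missing

# Using the full context, the arithmetic rule explains both complete rows.
rule_fits = all(arithmetic(a, b) == c for a, b, c in rows)

correct = arithmetic(*last)
guesses = {
    "constant": constant_from_last_row(*last),
    "progression": progression_from_last_row(*last),
}
```

In this example, the arithmetic rule gives 5, whereas the constant and progression guesses give 2 and 1. Neither short-sighted guess uses the first two rows, which is why such predictions can be wrong even when the rule is clearly identifiable from the full context.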

5.4. Results on Our Novel I-RAVEN-X

Finally, we conduct experiments on our novel I-RAVEN-X test, which allows us to configure the matrix size and the dynamic range of the attribute values. We fix the grid size to $3 \times 10$ and vary the dynamic range between 50, 100, and 1000. As shown in Table 5, the LLMs’ accuracy not only drops due to the larger grid size but also generally degrades with an increasing dynamic range. At the same time, our ARLC maintains a high accuracy across the board, while only being trained at a dynamic range of 50 and reconfigured for the higher ranges. Investigating the performance on the arithmetic rule in Table 6 explains the overall accuracy degradation: the arithmetic accuracy drops below 10% for both LLMs at the highest dynamic range (1000).

Table 5.
Task Accuracy (%) on I-RAVEN and Our Novel I-RAVEN-X.

	I-RAVEN	I-RAVEN-X
	$3 \times 3$	$3 \times 10$
Dynamic range ( $m$ )	5–10	50	100	1000
Llama-3 70B	85.0	76.8	73.0	74.2
GPT-4	93.2	82.2	79.6	76.6
${ARLC}_{progr}$	99.6	100.0	100.0	99.7
${ARLC}_{learn}$	99.1/98.6	94.6/86.3	95.1/88.0	91.6/82.8

The large language models (LLMs) use self-consistency (n=7). For ${ARLC}_{learn}$ , we report max/mean evaluation accuracies over five different training seeds.

Table 6.

Arithmetic Accuracy (%) on I-RAVEN and Our Novel I-RAVEN-X.

	I-RAVEN	I-RAVEN-X
	$3 \times 3$	$3 \times 10$
Dynamic range ( $m$ )	5–10	50	100	1000
Llama-3 70B	45.0	1.5	2.6	0.4
GPT-4	73.6	30.4	25.1	8.4
${ARLC}_{progr}$	100.0	99.8	100.0	99.5
${ARLC}_{learn}$	99.5/99.2	99.1/95.5	98.9/96.3	97.9/95.3

The large language models (LLMs) use self-consistency (n=7). For ${ARLC}_{learn}$ , we report max/mean evaluation accuracies over five different training seeds.

6. Conclusion

This work revealed LLMs’ limitations in recognizing and executing arithmetic rules in abstract reasoning tasks, despite being provided disentangled prompts with ground-truth visual attributes and using advanced prompting techniques. We further showed that these limitations become severe on a larger (3 $\times$ 10) RPM test. As a viable alternative, we presented a neuro-symbolic approach (ARLC) that achieves a high accuracy both on I-RAVEN and our I-RAVEN-X, thanks to learning to reason with distributed VSA representations and operators. Beyond accuracy, ARLC not only inherits advantages from symbolic methods (e.g., interpretability and programmability) but also advances efficiency and trainability. Yet, it is still tailored and trained to solve the given RPM task. In contrast, LLMs are more general but lack interpretability and require more computing resources. To combine the strengths of both methods, we see great potential in integrating ARLC into more general frameworks, for example, within a neuro-symbolic system where ARLC could execute (Pan et al., 2023) or validate (Kambhampati et al., 2024) reasoning steps from neural models (e.g., LLMs). Moreover, it would be interesting to tighten the integration between the two systems at the embedding level (Bounsi et al., 2024).

Footnotes

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Notes

Appendix A Prompting Details

This appendix provides more details on our prompting strategy. While the prompt design was mainly inspired by Hu et al. (2023), we extended it with predictive and discriminative classification and fine-tuned it for the different models. For example, we found that adding a prefix (“Only return the missing number”) helped to slightly improve GPT-4’s accuracy, whereas it reduced Llama-3 70B’s performance. Thus, we used individual prompts for the different models.

References

Achiam, Adler, Agarwal, Ahmad, Akkaya, Aleman, F. L., Almeida, Altenschmidt, Altman, Anadkat, Avila (2024). GPT-4 Technical Report. arXiv preprint arXiv:2303.08774.

Ahrabian, Sourati, Sun, Zhang, Jiang, Morstatter, Pujara (2024). The curious case of nonverbal abstract reasoning with multi-modal large language models. In First conference on language modeling (COLM).

Barrett, D. G. T., Hill, Santoro, Morcos, A. S., Lillicrap (2018). Measuring abstract reasoning in neural networks. In International conference on machine learning (ICML) (pp. 511–520).

Benny, Pekar, Wolf (2021). Scale-localized abstract reasoning. In 2021 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 12552–12560). IEEE.

Bounsi, Ibarz, Dudzik, Hamrick, J. B., Markeeva, Vitvitskyi, Pascanu, Veličković (2024). Transformers meet neural algorithmic reasoners. In CVPR 2024 multimodal algorithmic reasoning (MAR) workshop.

Brown, Mann, Ryder, Subbiah, Kaplan, J. D., Dhariwal, Neelakantan, Shyam, Sastry, Askell, Agarwal, Herbert-Voss, Krueger, Henighan, Child, Ramesh, Ziegler, Winter, Amodei (2020). Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan & H. Lin (Eds.), Advances in neural information processing systems (Vol. 33, pp. 1877–1901). Curran Associates, Inc.

Camposampiero, Hersche, Terzic, Wattenhofer, Sebastian, Rahimi (2024). Towards learning abductive reasoning using VSA distributed representations. In International conference on neural-symbolic learning and reasoning (NeSy) (pp. 370–385). Springer.

Camposampiero, Hersche, Wattenhofer, Sebastian, Rahimi (2025). Can large reasoning models do analogical reasoning under perceptual uncertainty? In International conference on neural-symbolic learning and reasoning (NeSy).

Camposampiero, Houmard, Estermann, Mathys, Wattenhofer (2023). Abstract visual reasoning enabled by language. In 2023 IEEE/CVF conference on computer vision and pattern recognition workshops (CVPRW) (pp. 2643–2647). IEEE.

Cao, Lai, Heintz, Chen, Cao, Rehg, J. M. (2024). What is the visual cognition gap between humans and multimodal LLMs? arXiv preprint arXiv:2406.10424.

Chalmers, D. J., French, R. M., Hofstadter, D. R. (1992). High-level perception, representation, and analogy: A critique of artificial intelligence methodology. Journal of Experimental & Theoretical Artificial Intelligence, 4(3), 185–211.

Cheng (1990). Context-dependent similarity. In Proceedings of the sixth annual conference on uncertainty in artificial intelligence (pp. 41–50). Elsevier Science Inc.

Dubey, Jauhri, Pandey, Kadian, Al-Dahle, Letman, Mathur, Schelten, Yang, Fan, Goyal (2024). The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783.

Frady, E. P., Kleyko, Kymn, C. J., Olshausen, B. A., Sommer, F. T. (2022). Computing on functions using randomized vector representations (in brief). In Neuro-inspired computational elements conference.

Gayler, R. W. (2003). Vector symbolic architectures answer Jackendoff’s challenges for cognitive neuroscience. In Proceedings of the joint international conference on cognitive science (pp. 133–138). ICCS/ASCS.

Gendron, Bao, Witbrock, Dobbie (2024). Large language models are not strong abstract reasoners. In Thirty-third international joint conference on artificial intelligence (IJCAI) (Vol. 7, pp. 6270–6278).

Guo, Yang, Zhang, Song, Zhang, Zhu, Wang, Zhang (2025). DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.

Hersche, di Stefano, Hofmann, Sebastian, Rahimi (2023a). Probabilistic abduction for visual abstract reasoning via learning rules in vector-symbolic architectures. In The 3rd workshop on mathematical reasoning and AI at NeurIPS’23.

Hersche, Terzic, Karunaratne, Langenegger, Pouget, Cherubini, Benini, Sebastian, Rahimi (2024). Factorizers for distributed sparse block codes. Neurosymbolic Artificial Intelligence.

Hersche, Zeqiri, Benini, Sebastian, Rahimi (2023b). A neuro-vector-symbolic architecture for solving Raven’s progressive matrices. Nature Machine Intelligence, 5(4), 363–375.

Holyoak, K. J., Morrison, R. G. (2013). The Oxford handbook of thinking and reasoning. OUP.

Hu, Liu, Wei, Bai (2021). Stratified rule-aware network for abstract visual reasoning. In Proceedings of the AAAI conference on artificial intelligence.

Hu, Storks, Lewis, Chai (2023). In-context analogical reasoning with pre-trained language models. In Proceedings of the 61st annual meeting of the association for computational linguistics (Volume 1: Long Papers) (pp. 1953–1969). Association for Computational Linguistics, Toronto, Canada.

Jiang, Zhang, Sun, Sourati, Ahrabian, Ilievski, Pujara (2024). MARVEL: Multidimensional abstraction and reasoning through visual evaluation and learning. In The thirty-eighth conference on neural information processing systems (NeurIPS).

Kambhampati, Valmeekam, Guan, Verma, Stechly, Bhambri, Saldyt, L. P., Murthy, A. B. (2024). Position: LLMs can’t plan, but can help planning in LLM-modulo frameworks. In Forty-first international conference on machine learning (ICML).

Kanerva (2009). Hyperdimensional computing: An introduction to computing in distributed representation with high-dimensional random vectors. Cognitive Computation, 1(2), 139–159.

Kleyko, Rachkovskij, D. A., Osipov, Rahimi (2023). A survey on hyperdimensional computing aka vector symbolic architectures, part I: Models and data transformations. ACM Computing Surveys, 55(6), 1–40.

Latif, Zhou, Guo, Gao, Shi, Nayaaba, Lee, Zhang, Bewersdorff, Fang, Yang (2024). A systematic assessment of OpenAI o1-preview for higher order thinking in education. arXiv preprint arXiv:2410.21287.

Lewis, Mitchell (2025). Evaluating the robustness of analogical reasoning in large language models. Transactions on Machine Learning Research.

Lewkowycz, Andreassen, Dohan, Dyer, Michalewski, Ramasesh, Slone, Anil, Schlag, Gutman-Solo, Neyshabur, Gur-Ari, Misra (2022). Solving quantitative reasoning problems with language models. In Advances in neural information processing systems (NeurIPS) (Vol. 35, pp. 3843–3857).

Liu, Shi, Cheng, Huang (2024). Let’s learn step by step: Enhancing in-context learning ability with curriculum learning. arXiv preprint arXiv:2402.10738.

Mitchell, Palmarini, A. B., Moskvichev (2024). Comparing humans, GPT-4, and GPT-4V on abstraction and reasoning tasks. In AAAI 2024 workshop on “Are large language models simply causal parrots?”.

OpenAI (2024). Learning to reason with LLMs.

Pan, Albalak, Wang (2023). Logic-LM: Empowering large language models with symbolic solvers for faithful logical reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2023 (pp. 3806–3824). Association for Computational Linguistics, Singapore.

Plate, T. A. (1995). Holographic reduced representations. IEEE Transactions on Neural Networks and Learning Systems, 6(3), 623–641.

Plate, T. A. (2003). Holographic reduced representations: Distributed representation for cognitive structures. Center for the Study of Language and Information, Stanford.

Raven, J. C., Court, J. H., Raven (1938). Raven’s progressive matrices. Oxford Psychologists Press.

Stevenson, C. E., ter Veen, Choenni, van der Maas, H. L. J., Shutova (2023). Do large language models solve verbal analogies like children do? arXiv preprint arXiv:2310.20384.

Sun, Z.-H., Zhang, R.-Y., Zhen, Wang, D.-H., Y.-J., Wan, You (2025). Systematic abductive reasoning via diverse relation representations in vector-symbolic architecture. arXiv preprint arXiv:2501.11896.

Wang, Wei, Schuurmans, Chi, E. H., Narang, Chowdhery, Zhou (2023). Self-consistency improves chain of thought reasoning in language models. In The eleventh international conference on learning representations (ICLR).

Webb, Holyoak, K. J. (2023). Emergent analogical reasoning in large language models. Nature Human Behaviour, 7(9), 1526–1541.

Wu, Dong, Grosse (2020). The scattering compositional learner: Discovering objects, attributes, relationships in analogical reasoning. arXiv preprint arXiv:2007.04212.

Wu, Zhang, Shu (2019). Cognitive deficit of deep learning in numerosity. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01), 1303–1310.

Wüst, Tobiasch, Helff, Dhami, D. S., Rothkopf, C. A., Kersting (2024). Bongard in Wonderland: Visual puzzles that still make AI go mad? In The first workshop on system-2 reasoning at scale, NeurIPS’24.

Zhang, Gao, Jia, Zhu, S.-C. (2019). RAVEN: A dataset for relational and analogical visual reasoning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR).

Zhang, Jia, Zhu, S.-C., Zhu (2021). Abstract spatial-temporal reasoning via probabilistic abduction and execution. In 2021 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 9731–9741). IEEE.

Zhang, Xie, Jia, Y. N., Zhu, S.-C., Zhu (2022). Learning algebraic representation for systematic generalization in abstract reasoning. In European conference on computer vision (ECCV) (pp. 692–709). Springer.

Zhang, Bai, Zhang, Zhai, Susskind, J. M., Jaitly (2024). How far are we from intelligent visual deductive reasoning? In ICLR 2024 workshop: How far are we from AGI.

Towards Learning to Reason: Comparing LLMs With Neuro-Symbolic on Arithmetic Relations in Abstract Reasoning
