Abstract
We present an initial automated test to evaluate LLMs’ capacity to perform inductive reasoning tasks. We use the GPT-3.5 and GPT-4 models to create a system which generates Python code as inductive reasoning hypotheses to transform sequences from the One-Dimensional Abstraction and Reasoning Corpus (1D-ARC) challenge. We experiment with three prompting techniques, namely standard prompting, Chain of Thought (CoT), and direct feedback. We provide results and an analysis of the cost versus success rate tradeoff and the benefit-cost ratio. Our best result is an overall 25% success rate with our CoT prompting on GPT-4, significantly surpassing the standard prompting approach. We assess the programming capabilities of the LLM by analysing the execution rate and errors of the code generated for inductive reasoning. We discuss potential avenues to improve our experiments, testing other strategies, and combining deductive reasoning with LLM-based inductive reasoning.
Introduction
Pre-trained large language models (LLMs) are approaching or have exceeded human-level performance in various tasks including abstractive summarization, question answering in some domains, and some coding tasks (Bommarito & Katz, 2022; OpenAI, 2023; Zhang et al., 2024). Moving towards more intelligent systems, or Artificial General Intelligence (AGI) (Morris et al., 2023), there is significant interest in the reasoning abilities of LLMs, reasoning being a fundamental aspect of human intelligence (Häggström, 2021). There are debates about whether LLMs can reason based on an understanding of truth and logic, which is closer to the “thinking” process of humans (Wang et al., 2023a). By leveraging prompting techniques (in-context few-shot learning), pre-trained LLMs can solve some reasoning tasks in different benchmarks, including math, commonsense, and games (Wei et al., 2022; Yao et al., 2023), but their performance heavily depends on the prompting approach.
Hypothesis
We are interested in solving the 1D-ARC challenge as a test for AGI. We hypothesise that general-purpose LLMs’ performance at inductive reasoning, measured by writing programs to solve 1D-ARC tasks, is poor. We developed a testing framework to evaluate this hypothesis with different prompting techniques and parameters, without providing any help, guidance, or fine-tuning of the LLM. We expect that an AGI would be able to write functions solving the simple inductive reasoning tasks of the 1D-ARC challenge autonomously.
GPT-3 showcased zero-shot inference ability, with limited ability to perform more complex reasoning tasks (Brown et al., 2020). Few-shot learning was then studied to promote “thinking” by showing several examples of intermediate reasoning steps, an approach named CoT (Wei et al., 2022). Zero-shot CoT was proposed through the simple additional prompt “Let’s think step-by-step” (Kojima et al., 2022). While this research focuses on probing reasoning ability through in-context learning, we would like to further understand the variations between different prompting techniques. We would also like to understand the impact of feedback on the thinking process of LLMs, and how a model corrects itself through feedback. In addition, there is a gap in benchmarking the cost of LLMs with different prompting techniques.
Inductive reasoning is a fundamental cognitive challenge that involves making generalizations from specific instances (Han et al., 2024). It requires inferring the principles from the observations and applying them to novel situations.
To provide a structured review of existing works on inductive reasoning with LLMs, we review the types of tasks addressed, the prompting methods employed, and the evaluation strategies used.
Tasks
Previous works on various types of reasoning tasks, including arithmetic (Imani et al., 2023; Wang et al., 2023b), commonsense (Zhao et al., 2024), and symbolic reasoning (Wang et al., 2024), have showcased the reasoning capabilities of LLMs to a certain degree. The Abstraction and Reasoning Corpus (ARC) is a particularly challenging inductive reasoning benchmark (Chollet, 2019), and has been used to test LLM’s inductive reasoning ability (Mirchandani et al., 2023; Wang et al., 2023c; Xu et al., 2023).
Prompting Methods
Given that LLMs are in-context few-shot learners (Brown et al., 2020), these studies on reasoning abilities have paved the way for novel prompting and searching techniques, such as CoT (Wei et al., 2022) and Tree of Thoughts (Yao et al., 2023). Inspired by the Bayesian learner, a hypothesis search method has been proposed to assist LLMs in solving complex inductive reasoning tasks, with hypothesis selection and testing loops (Wang et al., 2023c). Instead of prompting for answers directly, LLMs are used to generate natural language hypotheses from the problem description and examples, and then to translate the hypotheses into Python programs for execution and testing. This work demonstrated the contribution of hypothesis searching and computer programs in solving complex inductive reasoning problems. Note that human-annotated hypotheses are introduced as few-shot demonstrations for generating new hypotheses, and human annotators are also involved in selecting a subset of hypotheses, to avoid the huge computational load of testing all hypotheses. During the course of this research, significant breakthroughs in the reasoning abilities of LLMs have been reported in more recent models.
Evaluation
Existing works on LLMs’ inductive reasoning ability are evaluated on the number of solved tasks, the different LLM models applied, and adjustments to task dimensionality (Mirchandani et al., 2023; Xu et al., 2023). The hypothesis search method (Wang et al., 2023c) evaluated the success rate on 100 randomly selected ARC tasks and the number of execution feedback iterations. The quality of the generated code is not evaluated. None of these works discusses the cost of performing the 1D-ARC tasks.
We propose to perform a comprehensive analysis of different prompting techniques, as well as different ways for LLMs to produce results, with minimal human intervention in the reasoning process. As the study Wang et al. (2023c) has demonstrated the contribution of computer programs in solving complex inductive reasoning problems, we also ask LLMs to generate computer programs for easy generalisation and testing. Thus, this task is relevant to programme synthesis, a classical problem studied with inductive and deductive reasoning and more recently with neural networks and LLMs (Kalyan et al., 2018; Liu et al., 2023). We mainly test LLMs for inductive reasoning in this work and will discuss deductive reasoning at the end of the paper.
We evaluate the effect of CoT (Wei et al., 2022) and CoT variations (Kojima et al., 2022) on solving reasoning tasks from the 1D-ARC (Xu et al., 2023), which we compare with standard prompting and direct feedback. We use the OpenAI API and store every interaction with it for each solving method that we define. The goal of developing intelligent agents capable of understanding the world by interpreting their observations requires theoretically formalising a model of the real world. Towards this goal, we argue that observing a phenomenon, hypothesising and theorising about it requires the same skills as finding the transformation rules in the 1D-ARC tasks, and that code is a way of formalising a theory. We evaluate the inductive reasoning capacity of LLMs as intelligent agents by automatically generating prompts for 1D-ARC tasks, prompting for the code of transformation functions, and testing the correctness of the generated functions.
Extending our previous work (Mesnage et al., 2025), in this study we conduct a thorough analysis of the code generated by ChatGPT to assess its programming capability; in particular, we analyse the different kinds of errors in the generated programs and verify their execution. These additional results suggest that different prompting techniques lead to different runtime errors, which implies, for instance, that chain of thought greatly affects the code produced: with more complexity come more errors, but also a higher chance of producing code that successfully solves the task. Additionally, we make the generated code available along with our implementation for reproducibility (see “Data Availability”). This can serve as an important resource for programme synthesis analysis addressing the inductive reasoning task, 1D-ARC, using large language models.
The remainder of the paper is as follows: the “Methodology” section describes the 1D-ARC, prompting techniques, testing, and evaluation metrics; the “Performance Analysis” section summarises the results; the “Error Analysis and Runtime Verification” section assesses the programs generated for the task; and the “Discussion” and “Conclusion and Future Work” sections discuss the results and conclude the work with potential future research.
Methodology
This section describes the methods we developed to evaluate and compare different prompting methods. In order to evaluate the level of reasoning of current large language models, we develop a method to automatically assess the correctness of answers produced by an LLM. We choose the 1D-ARC as a dataset of tasks to perform since it provides us with a clear test to evaluate.
Abstraction and Reasoning Corpus
The 1D-ARC consists of 900 tasks grouped into 18 categories of 50 tasks each. For each task, we have 3 examples of input-to-output sequences to illustrate the transformation to be performed and one test set of one input and one output. Figure 1 is a graphical representation of each category. These tasks are simple for humans to complete. The sequences are represented as coloured squares: background squares are black and the squares of interest take other colours. For instance, a transformation can be to shift all coloured squares by one pixel to the right whilst keeping the black pixels identical. Another transformation is the mirror, which transforms the sequence into its reverse.

The 1D-ARC task categories and representative input-output pairs (Xu et al., 2023).
To evaluate the capability of LLMs to understand those transformations and complete the tasks by generating code, we generate a prompt from our prompt template for each 1D-ARC task, query the OpenAI application programming interface (API) programmatically with the GPT-3.5/4 models, and ask for Python code to be generated. We consequently evaluate the generated code in terms of performance in completing the task and in terms of programming errors. We describe this process in the following subsections.
Code Generation Process
First, we encode the sequences as strings of integers, with 0 representing black and other digits representing other colours. We prompt an LLM to produce a “transform(sequence)” Python function which returns the transformed sequence. The function is generated given the few transformation examples provided in the prompt.
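For illustration, the sketch below shows the digit-string encoding and two example transformations of the kind the generated function is expected to implement; the helper names and the specific sequences are ours and are not part of any model output.

```python
# Illustrative sketch (not model output): digit-string encoding and two
# example 1D-ARC-style transformations.

def shift_right(sequence: str) -> str:
    """Move every non-background (non-zero) pixel one position to the right."""
    out = ["0"] * len(sequence)
    for i, c in enumerate(sequence):
        if c != "0" and i + 1 < len(sequence):
            out[i + 1] = c
    return "".join(out)

def mirror(sequence: str) -> str:
    """Reverse the sequence."""
    return sequence[::-1]

print(shift_right("000333000"))  # -> "000033300"
print(mirror("003330000"))       # -> "000033300"
```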
Prompt Engineering
We experimented on a trial-and-error basis by interacting with the GPT-3.5/4 models directly in order to engineer a prompt which produces a usable function without too much unnecessary text. In fact, the GPT models tend to add many comments to their code even for very simple tasks, which increases the number of produced tokens and therefore the cost. The OpenAI API for GPT models requires a “system” input, in which we describe the purpose of the query and what we expect as an output, and a “user” entry in which we give the transformation examples. The API returns a response as JSON which includes the answer to our query, the number of input and output tokens, and as many choices as we asked for. We built three different prompting techniques, namely standard prompting, CoT, and direct feedback, which we describe next along with comparative results.
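The sketch below illustrates the system/user structure of these prompts. The exact wording we used is shown in Figures 2 to 4, so the strings here are placeholders; the zero-shot CoT suffix follows Kojima et al. (2022).

```python
# Minimal sketch of prompt assembly for the chat completion API.
# The wording is a placeholder; our actual prompts are shown in Figures 2-4.

def build_messages(examples, cot=False):
    system = (
        "You write a Python function `transform(sequence)` that maps each "
        "input sequence to its output sequence. Return only the code."
    )
    user = "\n".join(f"input: {i} output: {o}" for i, o in examples)
    if cot:
        user += "\nLet's think step-by-step."  # zero-shot CoT suffix (Kojima et al., 2022)
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

messages = build_messages([("000333000", "000033300"),
                           ("003300000", "000330000"),
                           ("030000000", "003000000")])
```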
Standard Prompting
Our first method is standard prompting, namely prompting the LLM solely for a function solving the task. Figure 2 is an example of the generated prompt for task 1dMove2p7, the response from GPT-4, and the test we perform. No help, clue, or hint is given to the LLM, only the sequences of the example transformations.

Standard prompting of the task 1dMove2p7, the answer from GPT-4 and the test we run.
Since this is the least costly of our methods, for each task we prompt both GPT-3.5-turbo and GPT-4 for 10 choices using standard prompting.
Chain of Thought
The idea of CoT is to prompt the LLM to produce reasoning steps or to decompose a task or question before giving an answer; since those produced tokens become part of the context, this is thought to increase the accuracy of the answer, and the technique has been used in various settings.
There are multiple ways to perform CoT: one could give an example of the expected reasoning process, prompt for a “step-by-step” list, or prompt for reasoning in a controlled or free-form manner. As shown in Figure 3, we chose to add the zero-shot CoT prompt “Let’s think step-by-step” (Kojima et al., 2022) to our standard prompting method.

Generated CoT prompt for the task 1dMove1p44, the response from GPT-4 and the test we run.
The investigated LLMs do produce a description of the transformation and of the task to complete before writing the Python function. We will see in the results section how this affects the success rate as well as the cost of running the experiment for all 900 tasks.
Direct Feedback
Seeing that many of the produced functions either do not run or fail to complete the tasks, we designed a different method called direct feedback, in which we iteratively query the GPT models, letting them know of previous functions for the same task which did not pass the test. Figure 4 shows an example of the prompt we generate on a second iteration once a function has failed. When a function succeeds at the task, we stop iterating. We also stop when we reach the maximum number of iterations, which is 5 in our experiments. Direct feedback is costly since we provide the previous responses as part of the new prompt, which is why we stop at 5 iterations.

Direct feedback iteration for the task 1dMove2p7, the answer from GPT-4 and the test we run.
The functions which failed the tasks are given in the prompt following the sentence: “
Being iterative, direct feedback is much more costly than standard prompting, especially as the size of the input grows with the number of previously failed functions (i.e., the number of iterations).
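In outline, the feedback loop can be sketched as follows; `query_model` and `passes_test` are hypothetical stand-ins for the API call and the test harness described in the next subsection.

```python
# Sketch of the direct feedback loop (maximum of 5 iterations).
# `query_model` and `passes_test` are hypothetical helper names.

MAX_ITERATIONS = 5

def solve_with_feedback(task):
    failed_functions = []
    for _ in range(MAX_ITERATIONS):
        code = query_model(task, failed_functions)  # previous failures go into the prompt
        if passes_test(code, task):
            return code                             # stop as soon as a function succeeds
        failed_functions.append(code)               # otherwise feed it back next iteration
    return None                                     # no solution within the iteration budget
```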
Testing Generated Code
To test the functions returned by the GPT models, we first process the response, removing any lines that are not code and removing comments. The code is then interpreted and, if interpretation succeeds, the function becomes available for execution. We then run this function and test whether the transformation is correct by comparing the returned sequence with the output sequence from the dataset. We apply guardrails when running the code; for instance, some functions run into infinite loops, which we kill after 1 second. We store all responses from GPT-3.5/4, results, and calculations on Zenodo (see the “Data Availability” section at the end of this manuscript). Errors are classified by the exception type raised by the Python interpreter.
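A minimal sketch of such a test harness is given below; it assumes the cleaned response defines a `transform` function and uses a subprocess with a 1-second timeout as the guardrail. It approximates, but is not identical to, our implementation.

```python
# Minimal sketch of the test harness: exec the candidate code, run
# `transform` with a 1-second timeout, and compare against the expected output.
import multiprocessing

def run_candidate(code, test_input, queue):
    namespace = {}
    exec(code, namespace)                  # may raise SyntaxError, NameError, ...
    queue.put(namespace["transform"](test_input))

def passes_test(code, test_input, expected):
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=run_candidate, args=(code, test_input, queue))
    proc.start()
    proc.join(timeout=1.0)                 # guardrail against infinite loops
    if proc.is_alive():
        proc.terminate()                   # kill runaway functions after 1 second
        return False
    return not queue.empty() and queue.get() == expected
```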
OpenAI API Parameters
When querying the chat completion API, we specify parameters which affect response generation. We set the number of choices to generate (how many samples to return, called choices in the API): when an LLM generates a completion, it selects the most likely next token according to a probability distribution over all tokens, and the API returns multiple choices sampled from those distributions. Temperature controls the randomness of responses; since we are looking for diverse responses among the samples, we set the temperature to 1 for all experiments, i.e. somewhat random, given that the parameter ranges from 0 to 2.
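The request below sketches how these parameters are passed, assuming the current openai Python client; the parameter names (`n`, `temperature`) are those of the chat completion API.

```python
# Sketch of a chat completion request with the parameters we vary,
# assuming the current openai Python client.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4",
    messages=messages,   # built as sketched in "Prompt Engineering"
    n=10,                # number of choices (samples) to return
    temperature=1,       # somewhat random; the allowed range is 0-2
)
codes = [choice.message.content for choice in response.choices]
usage = response.usage   # prompt/completion token counts used for cost calculations
```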
Evaluation Metrics
To evaluate and compare our different prompting strategies, we compute metrics on the data resulting from testing, i.e. the number of successful tasks per task category and per number of choices.
We define the mean success rate as the number of successfully solved tasks divided by the total number of tasks (900), expressed as a percentage.
We define the benefit-cost ratio as the mean success rate divided by the cost. We calculate it for each prompting strategy and number of choices/iterations.
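For clarity, the two metrics can be computed as in the following sketch; the example numbers are taken from the list-encoding experiment reported in the “Discussion” section.

```python
# Sketch of the evaluation metrics defined above.

def mean_success_rate(successful_tasks: int, total_tasks: int = 900) -> float:
    """Percentage of the 900 1D-ARC tasks solved by at least one generated function."""
    return 100.0 * successful_tasks / total_tasks

def benefit_cost_ratio(success_rate_percent: float, cost_usd: float) -> float:
    """Mean success rate divided by the API cost of the run."""
    return success_rate_percent / cost_usd

# e.g. 45 solved tasks at a cost of $1.11 (list encoding, GPT-3.5, 10 choices)
print(mean_success_rate(45))                                  # 5.0
print(benefit_cost_ratio(mean_success_rate(45), 1.11))        # ~4.5
```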
The following section analyses the results of our experiments.
Performance Analysis
We have queried the OpenAI API with both the GPT-3.5-turbo and GPT-4 models. The tables summarising the results are given in the appendix for clarity. With either model, CoT performs better for the same number of choices, but standard prompting reaches a similar success rate whilst remaining less costly. Even with GPT-4 and CoT, the LLM did not find a suitable solution for any task in several task categories, namely Pattern Copy, Pattern Copy Multicolor, and Mirror, even though those tasks do not seem too complex for a human to solve. The most successful task categories are not necessarily the simplest: since the LLM is more successful at the Move by 2 pixels tasks than at Move by 1 pixel, we think that chance plays a large role in solving the tasks. No clear correlation between success rate and task complexity emerges from the results that would reveal inductive reasoning by the LLM.
GPT-3.5-Turbo Results
Tables 1 to 3 give the number of successful tasks out of 50 per category. Tables 4 to 6 give the number of input and output tokens used per number of choices, as well as the mean success rate, the cost, and the benefit-cost ratio. The best
Standard Prompting Number of Successful Functions per Task Category out of 50 by the Number of Choices with GPT-3.5-turbo.
Note: The categories of Padded Fill, Move 2 Towards, Flip, Mirror, Denoise, Denoise Multicolor, Pattern Copy, Pattern Copy Multicolor, Recolor by Odd Even, Recolor by Size, Recolor by Size Comparison, and Scaling were omitted as none of the tasks were successful.
Bold values are calculated.
Direct Feedback Number of Successful Functions per Task Category out of 50 by the Number of Choices with GPT-3.5-turbo.
The categories of Move 2, Move 2 Towards, Move Dynamic, Fill, Padded Fill, Flip, Mirror, Denoise, Denoise Multicolor, Pattern Copy, Pattern Copy Multicolor, Recolor by Odd Even, Recolor by Size, Recolor by Size Comparison, and Scaling were omitted since none of those tasks were successful. Mean success rate is a percentage.
Bold values are calculated.
CoT Number of Successful Functions per Task Category out of 50 by the Number of Choices with GPT-3.5-turbo. The Categories of Pattern Copy, Mirror, Pattern Copy Multicolor, Recolor by Odd Even, Recolor by Size, Recolor by Size Comparison, and Scaling were Omitted Since None of Those Tasks were Successful. Mean Success Rate is a Percentage.
Bold values are calculated.
Standard Prompting used Tokens, Success Rate and Costs per Number of Choices with GPT-3.5-turbo. Mean Success Rate is a Percentage.
Bold values are calculated.
CoT used Tokens, Success Rate and Costs per Number of Choices with GPT-3.5-turbo. Mean Success Rate is a Percentage.
Bold values are calculated.
Direct Feedback used Tokens, Success Rate and Costs per Number of Choices with GPT-3.5-turbo. Mean Success Rate is a Percentage.
Bold values are calculated.
Figures 5(a) and 5(b) present the results visually. The success rates and costs in Figure 5(a) show tradeoffs for all prompting approaches. However, the slope of the tradeoff for the direct feedback approach is less steep than for the other approaches: even when the cost increases significantly, the increase in success rate for direct feedback is modest.

Analysis of ChatGPT-3.5-turbo results. (a) Tradeoff analysis between success rate and cost and (b) Benefit-cost ratio per number of choices.
GPT-4 Results
Tables 7 to 9 give the number of successful tasks out of 50 per category. Tables 10 to 12 give the number of input and output tokens used per number of choices, as well as the mean success rate, the cost, and the benefit-cost ratio.
Standard Prompting Number of Successful Functions per Task Category out of 50 by the Number of Choices with GPT-4.
Note: The categories of Pattern Copy, Pattern Copy Multicolor, Mirror, Recolor by Odd Even, Recolor by Size, and Recolor by Size Comparison were omitted since none of those tasks were successful. Mean success rate is a percentage.
CoT Number of Successful Functions per Task Category out of 50 by the Number of Choices with GPT-4. The Categories of Pattern Copy, Pattern Copy Multicolor, and Mirror were Omitted Since none of Those Tasks were Successful. Mean Success Rate is a Percentage.
Direct Feedback Number of Successful Functions per Task Category out of 50 by the Number of Choices with GPT-4. The Categories of Mirror, Pattern Copy, Pattern Copy Multicolor, Padded Fill, Recolor by Odd Even, Recolor by Size, and Recolor by Size Comparison were omitted since none of those tasks were successful. Mean success rate is a percentage.
Standard Prompting used Tokens, Success Rate and Costs per Number of Choices with GPT-4. Mean Success Rate is a Percentage.
CoT used Tokens, Success Rate and Costs per Number of Choices with GPT-4. Mean Success Rate is a Percentage.
Direct Feedback used Tokens, Success Rate and Costs per Number of Choices with GPT-4. Mean Success Rate is a Percentage.
Figures 6(a) and 6(b) present the results visually. In contrast to the GPT-3.5 model, the best

Analysis of GPT-4 results. (a) Tradeoff analysis between success rate and cost and (b) Benefit-cost ratio per number of choices.
Figure 6(a) shows that the tradeoff slope for standard prompting is the steepest, followed by the CoT and direct feedback prompting approaches. As with GPT-3.5, with GPT-4 a significant increase in cost does not guarantee a significant increase in success rate for the direct feedback approach.
Error Analysis and Runtime Verification
In the previous section, we looked at the performance of LLMs at solving inductive reasoning tasks by producing executable code and considered the number of tasks solved or unsolved. In this section, we dig deeper into the quality of the produced functions.
In this analysis, we look at the functions produced by standard prompting and chain of thought prompting, for which we generate 5 function samples per task: 4,504 (18,016/4) and 4,503 (18,012/4) functions respectively, as shown in Tables 13 and 14. Since we run each function on the 3 training examples and the test example of its task, we analyse a total of 36,030 function executions. Most functions load (99.8% for standard prompting, 99.4% for CoT), with a slightly higher rate of indentation errors with CoT and similar rates of syntax errors (see Figure 7). We focus on results from the GPT-4 model since these are the best results.

Efficiency and execution success in generated code. (a) Average number of trials to obtain one successful solution for a given 1D-ARC task and (b) The execution ratio, in other words the ratio of functions which run to the overall number of generated functions.
Number of Errors of the Programs Generated From Standard Prompting with GPT-4.
Note:
Number of Errors of the Programs Generated from Chain of Thought prompting with GPT-4.
Note: The meaning of
Tables 13 and 14 also provide a breakdown of the various errors encountered, allowing us to assess the quality of the generated solutions. In Tables 13 and 14, the row
Furthermore, the presence of
Figure 7(a) shows the trial ratio, which is the average number of trials to obtain one successful solution for a given 1D-ARC task, calculated by dividing the total number of generated functions by the number of successful solutions.
Figure 7(b) shows the execution ratio, defined as the number of generated functions that execute without error divided by the total number of generated functions.
Increasing the number of returned choices increases the chance of error as shown in Figure 7(a) as well as the chance of generating successful code. The number of unsuccessful trials is higher for standard prompting than for chain of thought prompting.
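Expressed as code, the two ratios of Figure 7 reduce to the following sketch (variable names are ours).

```python
# Sketch of the two ratios reported in Figure 7, as defined above.

def trial_ratio(total_generated: int, successful: int) -> float:
    """Average number of generated functions per successful solution."""
    return total_generated / successful

def execution_ratio(executed_without_error: int, total_generated: int) -> float:
    """Fraction of generated functions that run at all."""
    return executed_without_error / total_generated
```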
Figure 8 compares Standard Prompting and CoT in terms of the different types of errors at 5 choices. In general, CoT generates more errors, or about the same number, except for the

Number of functions producing different types of errors at 5 choices. SP: standard prompting. CoT: chain of thought.
Discussion
We observe generally low results with the LLMs GPT-4 and GPT-3.5-turbo. GPT-4 obtained the best success rate of around 25%, which is greatly better (about 10 times) than the results obtained with GPT-3.5-turbo. This suggests that the greater parameter size enhances the capability for inductive reasoning with these prompts. While the results are still low, the performance of GPT-4 is encouraging. This shows the potential of prompting (i.e., forward propagation, the process whereby an input prompt is passed through the network to produce a response) very large neural networks of billions of artificial neurons to approximate inductive reasoning. Further ways of learning with prompts, e.g., instruction tuning and reinforcement learning, may help improve the performance.
The generally low results are also likely due to our numeric representation (e.g. “000333000”) of the 1D-ARC tasks, given that the tokenisation process may split the numeric string (e.g., into “000”, “333”, “000”), which may distort the meaning of the original string. Future studies could improve the numeric representation when prompting the model, or use other open LLMs (e.g., the Llama series, Touvron et al., 2023) which process numeric strings differently.
For comparison, we have run the standard prompting experiment with a Python list-of-integers encoding of the sequences, e.g. [0,0,0,3,3,3,0,0,0] instead of “000333000”, and in fact there is an improvement: with GPT-3.5 the success rate at 10 choices is 5% (45 tasks out of 900) compared to 1.55%, for a cost of $1.11 as opposed to $0.74. This result remains low and therefore shows the limited inductive reasoning capability of LLMs.
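The two representations compared here are simply the digit string and its list-of-integers equivalent, e.g.:

```python
# The two input representations compared in this experiment for the same sequence.
sequence_as_string = "000333000"
sequence_as_list = [int(c) for c in sequence_as_string]  # [0, 0, 0, 3, 3, 3, 0, 0, 0]
```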
It might be possible to increase the performance of LLMs at coding for inductive reasoning by developing prompting techniques, and we have shown that zero-shot CoT does improve the success rate. Nevertheless, the transformer architecture of LLMs and the autoregressive approach do not enable effective inductive reasoning, and research in this direction is necessary.
Conclusion and Future Work
In this work, we have explored the capability of off-the-shelf LLMs, especially GPT-3.5 and GPT-4, for inductive reasoning using the 1D-ARC corpus. Results show great room for improvement for future systems that employ LLMs. Programme synthesis has been used as an intermediary task to solve the problem. While we mainly explored pure inductive reasoning with LLMs, the area of programme synthesis has been explored extensively with both inductive and deductive methods. These methods can be explored in the future to generate programmable hypotheses to solve tasks from the Abstraction and Reasoning Corpus.
Finally, the low performance of a pure LLM-based approach in this work may suggest the need for future studies to combine inductive and deductive methods with large neural networks like LLMs. For example, Retrieval Augmented Generation (RAG) (Lewis et al., 2020) together with symbolic representation in deductive reasoning (e.g., graph-based RAG) and other LLM adaptation methods like fine-tuning with preference learning may provide new avenues to combine neural-based inductive and deductive methods. Synthetic data generation, especially from meaningful deductive reasoning knowledge, followed by fine-tuning of the LLM, may also be useful to improve performance. While we looked at the execution of the functions, a further analysis would be to look at the similarity between functions and the ratio of unique functions generated to the overall number of functions; in other words, is ChatGPT producing the same functions several times for different inputs, or many unique functions? One might want to perform a corpus analysis on the 13,500 generated functions that we made available on our GitHub repository.
Footnotes
Acknowledgements
We are thankful to the Institute for Data Science and Artificial Intelligence at the University of Exeter for funding the OpenAI API requests.
Author Contribution
CM and XW originated the idea of evaluating GPT models’ ability at inductive reasoning on 1D-ARC. CM developed the querying and evaluation software. CM and XW produced the manuscript. All the authors, CM, XW, HD, and A, revised the manuscript of the paper.
Ethical Considerations
Not applicable.
Consent to Participate
Not applicable.
Consent for Publication
Not applicable.
Funding
The author(s) received no financial support for the research, authorship and/or publication of this article.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability
We store all responses from GPT-3.5/4, results, and calculations on Zenodo. Our implementation is available at https://zenodo.org/records/13735926 as well as on our GitHub, which includes the error analysis: https://github.com/cedricidsai/LLMDNT. The latter repository also contains the 1D-ARC dataset from https://github.com/khalil-research/1D-ARC (Xu et al., 2023) for reference.
