Abstract
Iterative prompt refinement is a practical approach for improving the reliability of large language models without weight updates. In this work, we study metatuning: a judge-guided prompt-refinement loop in which an evaluator critiques errors and provides targeted natural-language corrections or demonstrations that are incorporated into the prompt. We evaluate metatuning on axiomatic deductive reasoning (MATH-500), on combinations with chain-of-thought and self-reflection prompting, and on video-based physical reasoning (CLEVRER). Our results show that metatuning can improve baseline performance in static, rule-like domains, but offers limited benefit when paired with strong reasoning baselines and does not generalize to spatiotemporal video reasoning. Overall, we identify boundary conditions for judge-guided prompt refinement and motivate future work on integrating feedback at the level of reasoning traces.
Introduction
Large language models (LLMs) have become strong general-purpose systems, yet they still exhibit brittleness in complex reasoning and can fail systematically on recurring error patterns (Capitanelli & Mastrogiovanni, 2024). A practical family of approaches to improve reliability without weight updates is prompt refinement: iteratively modifying prompts and demonstrations based on observed failures. In this paper, we study metatuning, a judge-guided prompt-refinement loop in which a separate evaluator critiques errors and produces targeted corrections or demonstrations that are incorporated into the prompt.
In this work, we frame metatuning as a form of symbolic feedback in an operational sense: the feedback signal is expressed in discrete natural-language instructions and examples that can be inspected, edited, and reused. This perspective connects to neurosymbolic motivations (Colelough & Regli, 2024; Fang et al., 2024; Sheth et al., 2024), but our claims in this revision are deliberately empirical: we evaluate when judge-guided prompt refinement helps, when it becomes redundant, and when it fails to transfer across domains. Future work should complement this operational framing with direct representation probing to test stronger grounding/alignment hypotheses (Bisk et al., 2022; Blank & Piantadosi, 2023).
This leads to our central research question: Can judge-guided, error-driven prompt refinement reliably improve LLM reasoning, and what are its boundary conditions?
Under such a learning regime, there are direct analogs to the traditional learning setup—train–validation–test splits, number of training runs through the dataset (epochs), gradient accumulation (when to trigger a prompt revision), model saving (saving the history of interactions and prompt-revisions, along with the final prompt), and model loading at inference time on the test set (loading the same history and warm starting with a few sample generations before testing begins).
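To make these analogs concrete, the following is a minimal sketch of what "model saving" and "model loading" could look like under this regime. The MetatuningState structure and the JSON format are illustrative choices of ours, not part of a released implementation.

```python
# Sketch of the training-loop analogs described above (illustrative only).
import json
from dataclasses import dataclass, field, asdict

@dataclass
class MetatuningState:
    """The 'model' being trained: a prompt plus its revision history."""
    prompt: str                                    # current persistent prompt
    history: list = field(default_factory=list)    # past interactions
    revisions: list = field(default_factory=list)  # judge-driven prompt edits

def save_state(state: MetatuningState, path: str) -> None:
    """Analog of model saving: persist prompt, history, and revisions."""
    with open(path, "w") as f:
        json.dump(asdict(state), f, indent=2)

def load_state(path: str) -> MetatuningState:
    """Analog of model loading: warm-start inference from a saved state."""
    with open(path) as f:
        return MetatuningState(**json.load(f))
```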
We evaluate metatuning on axiomatic deductive reasoning (MATH-500), study its interaction with chain-of-thought (CoT) and self-reflection prompting, and test transfer to a multimodal, spatiotemporal video reasoning domain (CLEVRER). Our goal is to empirically characterize where iterative prompt refinement improves reliability and where it breaks down.
To further explore the boundaries of our framework, we investigate its interaction with other advanced reasoning techniques. Recently, methods such as CoT and self-reflection have been proposed to enhance LLM reasoning by guiding the model to produce a step-by-step reasoning trace. This raises a crucial question: Do the benefits of our metatuning approach complement these powerful prompting techniques, or do they become redundant when combined? Our experiments reveal that while CoT and self-reflection significantly boost baseline accuracy on mathematical problems, adding metatuning provides no further benefit to GPT-4o and, in some cases, even degrades performance for Gemini 1.5 Flash.
Furthermore, we extend our evaluation to a new domain: video-based physical reasoning. Using the CLEVRER dataset, we assess whether our metatuning approach can improve an LLM’s ability to answer descriptive, explanatory, predictive, and counterfactual questions about object trajectories and collisions in short video clips. We find that in this dynamic and visual domain, metatuning has a negligible effect on performance. These findings provide valuable insights into the limitations of metatuning and suggest that its effectiveness is highly dependent on the nature of the task and the symbolic domain (e.g., static, discrete logical rules vs. dynamic, continuous physical laws).
Thus, our main contributions are: (1) an empirical evaluation of metatuning on axiomatic deductive reasoning (MATH-500); (2) an analysis of its interaction with CoT and self-reflection prompting; (3) a transfer study to video-based physical reasoning (CLEVRER); and (4) a characterization of the boundary conditions of judge-guided prompt refinement.
Our work is also related to recent methods for automatic prompt optimization and meta-prompting, where prompts are improved via search or feedback loops rather than parameter updates. Examples include human-level prompt engineering via automatic prompt search (Zhou et al., 2023), optimizing prompts using LLMs as optimizers (Yang et al., 2024), and iterative refinement with self-feedback (Madaan et al., 2023). Metatuning differs in that it uses judge-driven critiques and error-driven example selection to construct a learned prompt artifact focused on specific failure modes.
The rest of the paper is organized as follows. Section 2 introduces the operational framing of symbolic feedback and situates metatuning relative to prompt optimization and in-context learning (ICL). Section 3 presents our judge-guided prompt-refinement loop and Algorithm 1. Section 4 compares our approach to conventional backpropagation training. Section 5 reports experiments on MATH-500, CoT/self-reflection prompting, and CLEVRER. Section 6 concludes, and the Appendix provides qualitative examples.
Background: Symbolic Feedback and Prompt Refinement
Operational View
We use symbolic feedback in an operational sense: feedback is communicated through discrete natural-language artifacts (instructions, critiques, and demonstrations) that can be appended to a prompt and directly inspected. This differs from weight updates, and it does not require claiming that the model is symbol-grounded in a philosophical or representational-probing sense (Bisk et al., 2022). Our goal is to study whether such feedback improves empirical task performance and how its effects depend on the task and prompting baseline.
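As an illustration of this operational view, the following is a minimal sketch of feedback as discrete, inspectable artifacts. The FeedbackArtifact type and its three kinds simply mirror the instructions, critiques, and demonstrations named above; they are our own illustrative names.

```python
# Sketch: symbolic feedback as inspectable natural-language artifacts.
from dataclasses import dataclass

@dataclass
class FeedbackArtifact:
    kind: str  # "instruction", "critique", or "demonstration"
    text: str  # the natural-language payload

def apply_feedback(prompt: str, artifacts: list) -> str:
    """Append artifacts to the prompt; every update stays human-readable."""
    blocks = [f"[{a.kind.upper()}] {a.text}" for a in artifacts]
    return prompt + "\n\n" + "\n".join(blocks)
```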
Relation to Prompt Optimization and ICL
Metatuning is closely related to prompt optimization and meta-prompting methods that search over prompts or iteratively refine them based on feedback (Madaan et al., 2023; Yang et al., 2024; Zhou et al., 2023). It also resembles ICL at inference time because it conditions on demonstrations; however, metatuning emphasizes error-driven construction of those demonstrations via judge-guided correction, rather than static or similarity-retrieved examples. We make this distinction explicit in our CLEVRER experiments (Section 5.3).
Method: Metatuning (Judge-Guided Prompt Refinement)
Illustrative Example
Imagine an LLM-based agent in a text-based adventure game (a simple "world"). The agent's policy is given by an LLM, but we also maintain a symbolic memory of facts the agent has discovered (e.g., a natural-language description of the game's map, items, etc.), and perhaps a similar description of explicit goals or rules (e.g., "you must not harm innocents" as a rule in the game). As the agent acts, an external prompt-based probe/judge model (another LLM) could check its actions against these rules and the known facts of the world. If the agent attempts something against the rules or logically inconsistent with its knowledge, the evaluator can intervene, for instance by giving natural-language feedback ("You recall that harming innocents is against your code.") or by adjusting the agent's state (inserting a reminder into the agent's context window). The agent (LLM) thus receives symbolic interactions (in this case, a textual message that encodes a rule or a fact) that alter its subsequent processing.
In this learning scenario, the agent refines its internal model based on such interactions. Note that this does not involve directly tweaking weights each time; it instead involves an iterative procedure where each episode of interaction produces a trace that is used to slightly adjust the model's state (its current prompt, history of interactions, prompt revisions, and judge critiques). Over time, the model internalizes the rules so that it no longer needs the intervention. This viewpoint reframes symbolic feedback as the learning signal of an iterative loop rather than a one-off correction.
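As a hypothetical sketch of this example, the following shows one way the judge's check and intervention could be wired together. Here agent_llm and judge_llm stand in for any text-completion callable; none of these names come from a released implementation.

```python
def step(agent_llm, judge_llm, context: list, rules: list, observation: str) -> str:
    """One environment step with a judge that enforces symbolic rules."""
    context.append(f"Observation: {observation}")
    action = agent_llm("\n".join(context))
    verdict = judge_llm(
        "Rules:\n" + "\n".join(rules)
        + f"\nProposed action: {action}\n"
        + "If the action violates a rule, reply with a short reminder; else reply OK."
    )
    if verdict.strip() != "OK":
        # Symbolic intervention: the reminder becomes part of the agent's state.
        context.append(f"Reminder: {verdict}")
        action = agent_llm("\n".join(context))  # retry with the reminder in context
    context.append(f"Action: {action}")
    return action
```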
Task Learning Algorithm
We propose an iterative learning paradigm for judge-guided prompt refinement that mirrors gradient-based optimization but uses symbolic feedback and intervention (expressed in natural language) to update the prompt state. The loop can be summarized at a high level in four steps: (1) the candidate model attempts a task instance; (2) a judge model evaluates the response against the ground truth; (3) for incorrect responses, the judge produces a targeted natural-language critique, correction, or demonstration; and (4) these artifacts are incorporated into the persistent prompt before the next pass.
This cycle repeats for a fixed number of iterations (or until the judge indicates that no further improvement is being made). The process is formally described in Algorithm 1.
Algorithm 1 details our proposed perspective on learning. This iterative cycle aims to progressively reduce recurring error patterns by accumulating targeted corrections in the prompt state.
The update mechanism acts on effective model behavior: behavior is modulated by providing a better prompt or adding a memory of previous corrections. For example, we can use a persistent prompt that accumulates instructions (a form of prompt tuning, with the model operating in a closed-loop system). This can be interpreted as a kind of supervised training loop in which the new examples derived from corrections serve as training data, with the judge acting as an oracle that labels errors and supplies the corrections.
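To make the loop concrete, the following is a minimal sketch of judge-guided refinement consistent with the four steps above (our reading of Algorithm 1, which is not reproduced in code elsewhere). Here candidate and judge are assumed chat-completion callables, and the prompt wording is our own illustration.

```python
def metatune(candidate, judge, train_set, prompt: str, max_epochs: int = 3) -> str:
    """Judge-guided prompt refinement; returns the learned prompt artifact."""
    for _ in range(max_epochs):                      # analog of training epochs
        corrections = []
        for problem, truth in train_set:             # 1) candidate attempts the task
            answer = candidate(f"{prompt}\n\nProblem: {problem}")
            verdict = judge(                         # 2) judge evaluates vs. ground truth
                f"Problem: {problem}\nAnswer: {answer}\nGround truth: {truth}\n"
                "If the answer is wrong, write a targeted correction; otherwise say CORRECT."
            )
            if "CORRECT" not in verdict:             # 3) collect targeted corrections
                corrections.append(f"Problem: {problem}\nCorrection: {verdict}")
        if not corrections:                          # judge signals no further improvement
            break
        # 4) update step: fold corrections into the persistent prompt
        prompt += "\n\nLearn from these corrected examples:\n" + "\n\n".join(corrections)
    return prompt
```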
Comparison to Conventional Backpropagation Training
We compare our paradigm to standard gradient-based training with backpropagation. Under the analogy sketched in the Introduction, a forward pass corresponds to generating an answer, the loss signal to the judge's verdict, the gradient to the judge's natural-language critique, and the weight update to a revision of the persistent prompt state; checkpointing corresponds to saving the prompt, interaction history, and revisions.
Experiments and Discussion
Metatuning With Zero-shot Prompting
In this section, we evaluate the impact of metatuning on the performance of an LLM using the MATH-500 dataset. We begin by selecting a subsample of 100 problems from the dataset. As illustrated in Figures 1 to 3, we assess the model's zero-shot performance by prompting it to generate answers without any metatuning or added context. The generated responses, along with the corresponding ground-truth answers, are then evaluated by an LLM-based judge. The subsample spans problems of varying difficulty, from Level 1 to Level 5; one example from each level is given in Figure 2.
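As a sketch of this zero-shot evaluation step under the stated setup (candidate prompted for reasoning plus a final answer, LLM judge grading against ground truth), one minimal implementation follows; the prompt wording is our own assumption.

```python
def zero_shot_accuracy(candidate, judge, problems) -> float:
    """problems: iterable of (question, ground_truth) pairs."""
    correct, total = 0, 0
    for question, truth in problems:
        pred = candidate(
            f"Solve the following problem. Provide your reasoning and a final answer.\n{question}"
        )
        verdict = judge(
            f"Question: {question}\nPredicted answer: {pred}\nGround truth: {truth}\n"
            "Is the predicted final answer equivalent to the ground truth? Answer yes or no."
        )
        correct += verdict.strip().lower().startswith("yes")
        total += 1
    return correct / total
```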
Following this, we implement a train–test split on the dataset. For the training set, we identify instances where the LLM’s initial responses were incorrect. For these incorrect cases, we construct a solution-infused chat history by incorporating the correct answers and their corresponding solutions. This enriched context is then provided to the model during inference on the test set. Finally, we compare the model’s zero-shot accuracy with its performance after metatuning. The results highlight the effectiveness of metatuning in enhancing the model’s ability to solve mathematical problems by leveraging solution-infused contextual learning.
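A minimal sketch of how such a solution-infused chat history could be assembled is shown below, assuming an OpenAI-style message schema; the exact message format used in our experiments may differ.

```python
def build_solution_infused_history(failures) -> list:
    """failures: iterable of (problem, correct_answer, worked_solution) triples
    drawn from training items the model initially got wrong."""
    messages = []
    for problem, answer, solution in failures:
        messages.append({"role": "user", "content": problem})
        messages.append({
            "role": "assistant",
            "content": f"Solution: {solution}\nFinal answer: {answer}",
        })
    return messages

# At test time, this history is prepended to the test question's messages,
# so the model conditions on the corrected examples during inference.
```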
Initial experiments were conducted with smaller language models, such as LLaMA 3.2 (1B parameters), served locally via Ollama. However, these models exhibited extremely low baseline accuracy, making them unsuitable for the study. Furthermore, given the critical role of the judge LLM, we found that employing a large, state-of-the-art (SOTA) model as the judge is essential: if the judge's evaluations lack high fidelity, the entire metatuning process becomes unreliable.
Therefore, this study focuses exclusively on SOTA models. Future work could explore the impact of metatuning on reasoning-focused models compared to non-reasoning models, using both as candidate and judge LLMs. In this study, all models are non-reasoning models, but the candidate LLMs are explicitly prompted to provide both a reasoning process and a final solution. In our experiments, the candidate LLMs are GPT-4o and Gemini 1.5 Flash, and the judge is Gemini 2.0 Flash.
Benchmarking Results
We conducted experiments on the 100-problem MATH-500 subsample described above, comparing each candidate model's zero-shot accuracy with its accuracy after metatuning at several context sizes (numbers of solution-infused training examples). Results for GPT-4o and Gemini 1.5 Flash are reported in Tables 1 and 2.

Figure 1. Distribution of problem levels in the dataset.

Figure 2. Dataset examples (one problem from each level).

Figure 3. Workflow for evaluating metatuning on MATH-500.
Table 1. Performance of GPT-4o With and Without Metatuning.
Table 2. Performance of Gemini 1.5 Flash With and Without Metatuning.
Analysis
From the results, we observe that metatuning improves the accuracy of both models in most cases.
These results highlight that metatuning can be beneficial for improving model accuracy but may exhibit diminishing returns or even slight regressions depending on context size and model architecture.
Metatuning With Chain-of-Thought and Self-Reflection Prompting
In this section, we extend the experimentation on the same subsample of the dataset with advanced prompting instead of zero-shot prompting as the baseline. We use a CoT prompt with two steps: (1) ask the LLM to think step by step, and (2) ask it to reflect on its own answer from Step 1. We find a large jump in accuracy from zero-shot to this CoT-based prompting, but a negligible change in accuracy (or degrading accuracy) when metatuning is incorporated with CoT, as can be seen in Tables 3 and 4. A sketch of the two-step prompt follows.
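The following is a minimal sketch of this two-step procedure; the exact prompt wording used in our experiments is an assumption here.

```python
def cot_with_reflection(candidate, problem: str) -> str:
    """Two-step prompting: step-by-step reasoning, then self-reflection."""
    draft = candidate(
        f"Problem: {problem}\nLet's think step by step, then state a final answer."
    )
    revised = candidate(
        f"Problem: {problem}\nYour previous answer was:\n{draft}\n"
        "Reflect on this reasoning, fix any mistakes, and state a final answer."
    )
    return revised
```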
Table 3. Performance of GPT-4o With and Without Metatuning With Chain-of-Thought (CoT).
Table 4. Performance of Gemini 1.5 Flash With and Without Metatuning With Chain-of-Thought (CoT).
Analysis
The results suggest that metatuning is not reliably additive with strong reasoning prompts. For GPT-4o, metatuning yields small gains or plateaus once CoT is enabled. For Gemini 1.5 Flash, adding metatuning to CoT consistently reduces accuracy, indicating a boundary condition rather than a stable improvement. A plausible explanation is context saturation: the model must attend to both the injected metatuning demonstrations (which encode symbolic corrections) and a long CoT + self-reflection trace; the additional context may become redundant, compete for attention, or introduce noise that disrupts reasoning.
Finally, our current implementation uses the judge to evaluate and correct the final answer rather than the intermediate CoT trace. This may limit how metatuning can complement CoT, since the feedback does not target flawed intermediate reasoning steps. A key direction for future work is to incorporate trace-level critiques so that the judge can correct the reasoning process, not only the outcome. Another important direction is to replicate these metatuning + CoT interactions across additional candidate models.
Metatuning on Video-Based Physical Reasoning (CLEVRER)
To assess the generalizability of our metatuning approach beyond text-based mathematical problems, we conducted an experiment on a video-based physical reasoning task. For this, we used the CLEVRER dataset, which consists of short video clips depicting interactions between various 3D objects (e.g., cubes and spheres) and a set of associated questions for each video. The questions fall into four categories: descriptive (e.g., “How many spheres are moving?”), explanatory (e.g., “Which of the following is responsible for the collision between the gray object and the cube?”), predictive (e.g., “What will happen next?”), and counterfactual (e.g., “What will happen if the gray sphere is removed?”).
For our methodology, we selected a total of 50 videos from the dataset, each containing multiple questions, for a total of 620 questions. We used Gemini 2.0 Flash as the candidate model. In the baseline condition, we provided the video and its questions to the model without any additional examples in the prompt. For the metatuning condition, we first ran a short sweep to identify questions the model answered incorrectly and used the judge to obtain corrected solutions; we then prepended two representative failure cases (question + corrected solution) as demonstrations before asking the remaining questions. The model’s answers were then evaluated by a judge against the ground-truth solutions.
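A hypothetical sketch of this condition is given below. Here ask_video and judge stand in for the video-capable candidate (Gemini 2.0 Flash in our runs) and the judge model; the prompt wording and helper names are illustrative, not a released API.

```python
def clevrer_metatune(ask_video, judge, video, qa_pairs, n_demos: int = 2):
    """qa_pairs: list of (question, ground_truth) for one video."""
    demos, remaining = [], []
    for question, truth in qa_pairs:
        if len(demos) < n_demos:                     # short training sweep
            pred = ask_video(video, question)
            ok = judge(
                f"Question: {question}\nAnswer: {pred}\nGround truth: {truth}\n"
                "Is the answer correct? Reply yes or no."
            )
            if not ok.strip().lower().startswith("yes"):
                fix = judge(
                    f"Write a corrected solution.\nQuestion: {question}\n"
                    f"Ground truth: {truth}"
                )
                demos.append(f"Example question: {question}\nCorrected solution: {fix}")
                continue                             # consumed as a demonstration
        remaining.append((question, truth))
    prefix = "\n\n".join(demos)
    # Remaining questions are asked with the failure demonstrations prepended.
    return [ask_video(video, f"{prefix}\n\nQuestion: {q}") for q, _ in remaining]
```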
Although the resulting inference-time prompt resembles standard ICL, the key distinction is how the examples are constructed. In standard ICL, demonstrations are typically static or retrieved by similarity. In metatuning, the demonstrations are a learned artifact derived from the model’s prior errors and the judge’s critiques during a training sweep (consistent with Algorithm 1). This distinction matters because it shifts the method from “using examples” to “using error-driven example selection.”
The results of this experiment are summarized in Table 5 and Figure 4.
Table 5. Performance on CLEVRER Video Reasoning Task.

Figure 4. CLEVRER data distribution and accuracy across each question type.
Analysis
The central finding from this experiment is that metatuning had a negligible impact on the model's performance, resulting in only a marginal improvement of +0.16%. This outcome stands in contrast to our findings on the mathematical reasoning tasks, where metatuning improved zero-shot accuracy in most cases.
This result leads us to a key insight regarding the limitations of our metatuning framework. Our initial hypothesis was that providing solved examples would help the LLM identify patterns and improve its reasoning, much like it did for the logical and arithmetic problems. However, the nature of the “symbols” and “rules” in the video domain is fundamentally different. In mathematical reasoning, the symbols (e.g., numbers and variables) are discrete, and the rules (e.g., logical identities and algebraic formulas) are static and well-defined. The model can effectively learn to apply these explicit, high-level rules from a few examples in the context window.
In contrast, the symbols in the CLEVRER dataset (e.g., the position, velocity, and interactions of objects) are continuous and dynamic. The underlying “rules” are the laws of physics, which are far more complex and only implicitly reflected in the model’s learned representations. The two examples provided via metatuning are likely insufficient to teach the model a complex new physical law or to correct a fundamental misunderstanding of spatiotemporal dynamics. This suggests that representation-mediated prompt steering, while effective for discrete linguistic error patterns, may be too weak or noisy for metatuning to be effective in domains requiring complex, continuous reasoning. The context-based learning of our approach appears to be most effective when the task-relevant knowledge can be distilled into clear, symbolic, and rule-based examples.
Conclusion
In this work, we presented an empirical study of iterative prompt refinement using symbolic feedback expressed in natural language. We investigated metatuning, a judge-guided loop that accumulates targeted corrections and demonstrations into the model's prompt, and evaluated when this style of symbolic feedback helps—and when it does not.
Across axiomatic deductive reasoning (MATH-500), metatuning improved baseline performance most clearly at smaller context sizes, consistent with the intuition that a small number of carefully selected examples can help correct recurring error patterns. However, when combined with strong prompting strategies such as CoT and self-reflection, metatuning provided little additional benefit and, for some models, degraded performance. These negative results suggest a boundary condition: the added metatuning context can become redundant with, or compete for attention against, a long reasoning trace (a form of context saturation).
We also evaluated metatuning on a video-based physical reasoning task (CLEVRER), where gains were negligible. This further emphasizes that prompt-based symbolic feedback is highly task-dependent and may fail to transfer to domains requiring rich spatiotemporal understanding and continuous dynamics.
Overall, our results highlight boundary conditions for metatuning: it is best viewed as a tool for improving weaker baselines and for domains with crisp, symbolic structure, rather than a universal method that consistently improves strong reasoning traces or dynamic visual reasoning. A key direction for future work is to incorporate feedback at the level of the reasoning process—for example, having the judge critique and correct intermediate CoT traces rather than only the final answer—and to replicate these findings across additional candidate models. We also see representation probing as an important future step for evaluating stronger claims about grounding/alignment that go beyond the operational framing adopted here.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Appendix A. LLM Reasoning: Pre- and Postmetatuning
This appendix presents examples of problems along with the corresponding reasoning and answers generated by GPT-4o and Gemini 1.5 Flash, both in a zero-shot setting and after metatuning with a limited set of 10 training examples. The 10-example training context was selected arbitrarily for demonstration. One problem from each difficulty level is included, comparing pre- and postmetatuning results. Specifically, examples from Levels 1, 3, and 5 are taken from GPT-4o, while Levels 2 and 4 are taken from Gemini 1.5 Flash. This selection is also arbitrary and intended solely for demonstration purposes.
The distribution of problems for which the 10-example training context produced the correct result only after metatuning is shown in Figure 1.
Note that neither GPT-4o nor Gemini 1.5 Flash is a reasoning model. In the prompt, both models were asked to provide their reasoning as well as the final answer.
Appendix B. Discussion
These examples illustrate how metatuning can align LLMs with correct mathematical reasoning, helping them arrive at the correct solution by leveraging a small set of training examples in the context window.
