Sage Journals: Discover world-class research

Abstract

This article reviews two case studies in which AI systems were evaluated for abstraction and analogy-making capabilities and compared with those of humans. These studies illustrate how AI systems should be evaluated not only for accuracy on benchmark tasks but also for robustness to task variations and for insight into how the system is solving the tasks. These studies also illuminate the need for transparency, interpretability, and scientifically informed experimental methodology in AI evaluations.

Keywords

artificial intelligence, abstraction, analogy, evaluation, benchmarks

On Evaluating Abstraction and Analogy in Humans and Machines
