This article reviews two case studies in which the abstraction and analogy-making capabilities of AI systems were evaluated and compared with those of humans. The studies illustrate that AI systems should be evaluated not only for accuracy on benchmark tasks but also for robustness to task variations and for insight into how the system solves the tasks. They also highlight the need for transparency, interpretability, and scientifically informed experimental methodology in AI evaluations.