Abstract
This article reviews two case studies in which AI systems were evaluated for abstraction and analogy-making capabilities and compared with humans. These studies illustrate that AI systems should be evaluated not only for accuracy on benchmark tasks but also for robustness to task variations and for insight into how the systems solve the tasks. They also illuminate the need for transparency, interpretability, and scientifically informed experimental methodology in AI evaluations.