Abstract
As artificial intelligence (AI), machine learning (ML), and other forms of advanced automation are increasingly considered for deployment in safety-critical industries, there is an urgent need for evaluation methods which reliably identify risks of deployment prior to people being harmed. In this narrative review, we discuss the benefits and drawbacks of 11 major methodological decisions underpinning evaluations of AI-infused technologies from the perspective of cognitive systems engineering (CSE) and naturalistic decision making (NDM). These methodological decisions are organized around four aspirations central to the perspective of CSE and NDM: evaluations of AI-infused technologies should be (1) integrated, (2) naturalistic, (3) grounded, and (4) pattern-centered. We use these aspirations to interpret common human-AI evaluation methods and discuss new evaluation challenges for emerging AI-infused technologies. This narrative review is meant to guide both current methods and future research toward safe and effective strategies for evaluating AI-infused technologies, especially in safety-critical settings.