Abstract
The peer review process is fundamental to scientific advancement, fostering quality publications through constructive feedback, yet identifying helpful reviewers remains challenging for journals facing increasing submission volumes. While Large Language Models (LLMs) show promise in manuscript evaluation and can reduce reviewer burden, they still have limitations, including potential biases, vague feedback, and context constraints, that require significant human oversight. Our study collected 9 submissions with 33 human reviews and used Claude 3.5 Sonnet to build two zero-shot systems: an AI submission reviewer and an AI review reviewer, the latter evaluating both AI-generated and human reviews on word count, coverage, and quality. Analysis shows that AI reviewers produce better-structured reviews, though humans and AI rate differently, with human ratings showing greater variance. Reviews create meaningful interactions between authors and reviewers, with the human element contributing domain perspectives and personal flair. AI could further improve reviews through a review-critiquing system that actively engages reviewers rather than relegating them to passive consumers of automation. This study suggests that AI can augment human reviewers rather than replace them, integrating AI-generated holistic reviews with nuanced human insights to improve conference quality.
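For concreteness, the AI review reviewer can be read as a single zero-shot rubric prompt. The sketch below is a minimal illustration assuming the Anthropic Python SDK; the rubric dimensions (word count, coverage, quality) come from the study, but the prompt wording, scoring scale, and the helper name `evaluate_review` are hypothetical, not the authors' actual implementation.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Illustrative zero-shot rubric; the exact prompt used in the study is not given.
RUBRIC_PROMPT = """You are evaluating a peer review of a manuscript.
Rate the review on each dimension from 1 (poor) to 5 (excellent):
- Coverage: does it address the manuscript's key sections and claims?
- Quality: is the feedback specific, constructive, and actionable?
Also report the review's word count.

Manuscript abstract:
{abstract}

Review to evaluate:
{review}

Respond as JSON with keys: word_count, coverage, quality, rationale."""

def evaluate_review(abstract: str, review: str) -> str:
    """Score a single human- or AI-written review with a zero-shot prompt."""
    message = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": RUBRIC_PROMPT.format(abstract=abstract, review=review),
        }],
    )
    return message.content[0].text  # JSON string with the scores and rationale
```

Applying the same rubric to AI-generated and human reviews alike is what allows the two to be compared on a common scale, which is the comparison the abstract reports.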
