Sage Journals: Discover world-class research

Abstract

This study explores the use of artificial intelligence (AI) as a complementary tool for grading essay-type questions in higher education, focusing on its consistency with human grading and its potential to reduce bias. Using 70 handwritten exams from an introductory sociology course, we evaluated the performance of generative pretrained transformer (GPT) models in transcribing and scoring students’ responses. The models were tested under various settings for both the transcription and the grading tasks. Results show high similarity between human and GPT transcriptions, with GPT-4o-mini outperforming GPT-4 in accuracy. For grading, GPT scores correlated strongly with the human grader’s scores, especially when template answers were provided. However, discrepancies remained, which points to a role for GPT as a “second grader” that flags inconsistencies for review rather than as a full replacement for human evaluation. This study contributes to the growing literature on AI in education by demonstrating its potential to enhance fairness and efficiency in grading essay-type questions.

Keywords

AI-assisted grading, higher education, bias reduction, essay-type assessment, generative pretrained transformers
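The “second grader” idea described in the abstract can be illustrated with a minimal sketch. It is not the authors’ procedure: the scores below are invented for illustration, the function names `pearson` and `flag_for_review` are hypothetical, and the tolerance of 1.0 point is an assumed threshold. The sketch computes the correlation between human and AI scores and lists the exams whose scores differ by more than the tolerance, so that a human can review them.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def flag_for_review(human, ai, tolerance):
    """Return indices of exams where the two graders differ by more than tolerance."""
    return [
        i for i, (h, a) in enumerate(zip(human, ai))
        if abs(h - a) > tolerance
    ]


# Invented example data (not from the study): one score per exam.
human_scores = [8.0, 6.5, 9.0, 4.0, 7.5]
ai_scores = [7.5, 6.0, 9.0, 6.5, 7.0]

agreement = pearson(human_scores, ai_scores)
to_review = flag_for_review(human_scores, ai_scores, tolerance=1.0)

print(f"Correlation between graders: {agreement:.2f}")
print(f"Exams flagged for human review: {to_review}")
```

In this sketch the AI score never overrides the human score; a large gap only triggers a second look, which matches the complementary role the abstract describes.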




Assessing Instructor-AI Cooperation for Grading Essay-Type Questions in an Introductory Sociology Course
