Abstract

Study Design

A comparative analysis of AI-generated vs human-authored personal statements for spine surgery fellowship applications.

Objective

To assess whether evaluators could differentiate between ChatGPT- and human-authored personal statements and determine if AI-generated statements could outperform human-authored ones in quality metrics.

Summary of Background Data

Personal statements are key in fellowship admissions, but the rise of AI tools like ChatGPT raises concerns about their use. While previous studies have examined AI-generated residency statements, their role in spine fellowship applications remains unexplored.

Methods

Nine personal statements (4 ChatGPT-generated, 5 human-authored) were evaluated by 8 blinded reviewers (6 attending spine surgeons and 2 fellows). ChatGPT-4o was prompted to create statements focused on 4 unique experiences. Evaluators rated each for readability, originality, quality, and authenticity (0-100 scale), determined AI authorship, and indicated interview recommendations.

Results

ChatGPT-authored statements scored higher in readability (65.69 vs 56.40, P = 0.016) and quality (63.00 vs 51.80, P = 0.004) but showed no differences in originality (P = 0.339) or authenticity (P = 0.256). Reviewers could not reliably distinguish AI from human authorship (P = 1.000). Interview recommendations favored ChatGPT-generated statements (84.4% vs 62.5%, OR: 3.24 [1.08-11.17], P = 0.045).

Conclusion

ChatGPT can produce high-quality spine fellowship personal statements that reviewers cannot distinguish from human-authored ones and that increase interview likelihood. These findings highlight the need for nuanced guidelines regarding AI use in application processes, particularly considering its potential role in expanding access to high-quality writing assistance and editing.

Keywords

ChatGPT, artificial intelligence, personal statement, spine surgery, fellowship

Introduction

Applying for spine surgery fellowship is a highly competitive process for orthopaedic and neurosurgical residents.¹ Personal statements play an important role in fellowship admissions, allowing applicants to showcase their unique personal experiences and characteristics which help assess their fit for the program.² Many applicants seek external guidance on their personal statements, from mentors in the field to commercial writing services.^2,3 With the increasing availability of artificial intelligence (AI) tools, such as large language models (LLMs) like ChatGPT, these technologies have become popular among applicants due to their versatility and ease of use.⁴ LLMs are designed for advanced language processing tasks, and can assist with various functions including brainstorming, grammatical editing, and even generating entire essays.⁴ Advantages of using these AI tools include reducing time burden on busy applicants, providing support for those less proficient in English, and reducing the financial burden of professional writing services.⁵

However, the use of AI and LLMs in academic writing has sparked significant debate.^6,7 Residency admissions directors have expressed concerns that AI usage may obscure an applicant’s voice and limit the evaluation of written communication skills – attributes that are highly valued in prospective residents.⁷ Additionally, ethical concerns have been raised regarding the integrity of using AI, with many considering AI-generated content as a form of plagiarism.⁷ The lack of reliable AI detection tools makes it challenging for reviewers to distinguish between AI-generated and human-written essays, complicating efforts to enforce AI usage guidelines.⁸

Previous research has shown that personal statements for residency applications generated by LLMs were often indistinguishable from those written by applicants and have received similar scores on quality metrics.^9-11 However, these studies used limited prompting, testing only the very basic capabilities of LLMs. Additionally, no research has specifically examined the use of LLMs in generating personal statements for spine fellowship applications, which generally require more specialized content than residency applications.

In this study, we assessed whether evaluators could reliably differentiate between ChatGPT- and human-authored statements for spine surgery fellowship applications. Additionally, we used an extensive prompt designed to improve the narrative quality of ChatGPT-generated essays to explore whether ChatGPT-generated essays could potentially outperform human-authored ones.

Materials and Methods

Personal Statement Generation

A total of 9 (4 ChatGPT, 5 applicant) personal statements were compiled. For the AI-generated essays, ChatGPT (version 4o, OpenAI, San Francisco, California) was given the prompt, “Please compose a personal statement for my application to spine fellowship programs. I am currently in my fourth year of orthopedic surgery residency in the United States. The statement should adopt an academic tone without being pretentious, maintaining clarity over sophistication, and avoiding predictable transitions and extraneous wording. Emphasize cohesiveness and a smooth narrative flow. The essay should be narrative-driven, showcasing a unique and original story revolving around [fill in the blank]. Avoid simply listing achievements from my resume. The statement should not exceed 600 words.” To ensure the 4 responses were unique, the software was prompted to focus on 1 of 4 unique experiences: global health, family member injury, residency experience, and technology advancement. To evaluate the independent capabilities of ChatGPT, no additional information beyond the above prompt (eg, an applicant’s CV) was given to the chatbot. Each essay was generated in a separate chat session to minimize AI self-referencing and to promote heterogeneity between personal statements.

For the applicant essays, 5 de-identified essays were chosen from applicants who were granted interviews at our institution. Essays were chosen at random from 2017-2022, before ChatGPT was publicly available.

Evaluation Process

The personal statements were randomly ordered and reviewed by 8 blinded evaluators, composed of 6 spine surgery attending physicians with extensive experience in fellowship admissions and 2 current spine surgery fellows. Participants were instructed to grade each essay on a scale of 0-100 for readability, originality, quality, and authenticity. Additionally, participants were asked if they thought AI was used in the creation of the essay, and if they would grant this applicant an interview based on the personal statement.

Statistical Analysis

Quality metrics between applicant-authored and ChatGPT-authored personal statements were compared using independent samples t-tests. Univariate logistic regression was performed to evaluate the odds of receiving an interview recommendation and to assess whether reviewers could accurately identify ChatGPT-authored personal statements. Each reviewer’s evaluation of each personal statement was treated as a separate data point for statistical analysis. Inter-rater reliability for continuous variables was assessed using a Two-Way Mixed-Effects Model Intraclass Correlation Coefficient (ICC), while reliability for categorical variables was evaluated using Fleiss’ Kappa. All statistical analyses were conducted using SPSS (Version 29.0.2.0; Armonk, NY: IBM Corp.). P < 0.05 was considered statistically significant.
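As an illustration of the categorical reliability measure named above, the following is a minimal sketch of Fleiss' kappa in plain Python. It is not the SPSS procedure used in the study. The function name `fleiss_kappa` and the rating counts are hypothetical and are not study data; each row holds, for one essay, how many of the raters chose each category.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of rating counts.

    counts[i][j] = number of raters who assigned subject i to category j.
    Every subject must be rated by the same number of raters.
    """
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])

    # Observed agreement for each subject, then averaged over subjects.
    per_subject = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_subject) / n_subjects

    # Agreement expected by chance, from the overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (n_subjects * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in proportions)

    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical example: 3 essays, 8 raters, categories = (AI, human).
toy_counts = [[6, 2], [3, 5], [4, 4]]
toy_kappa = fleiss_kappa(toy_counts)
```

Values near 0 indicate agreement no better than chance, and values near 1 indicate near-complete agreement.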

Results

Five applicant-authored and 4 ChatGPT-authored personal statements were evaluated by 8 reviewers, for a total of 40 and 32 responses, respectively. The readability (65.69 ± 14.68 vs 56.40 ± 17.25, P = 0.016) and quality rating (63.00 ± 15.82 vs 51.80 ± 15.93, P = 0.004) scores were significantly higher for ChatGPT-authored statements compared to applicant-authored statements. In contrast, there were no significant differences in originality (56.94 ± 18.28 vs 52.78 ± 18.11, P = 0.339) or authenticity (62.09 ± 22.87 vs 56.43 ± 17.97, P = 0.256) scores between ChatGPT-authored and applicant-authored statements. Quality metrics can be seen in Table 1 and Figure 1.
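The group comparisons above can be approximately re-derived from the reported means, standard deviations, and response counts. The sketch below computes a Welch (unequal-variance) t statistic and its degrees of freedom. This is an assumption: the text specifies independent samples t-tests but not whether equal variances were assumed, and the function name `welch_t` is illustrative. Converting the statistic to a P value requires a t-distribution, which is not in the Python standard library and is omitted here.

```python
import math


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom
    computed from summary statistics of two independent groups."""
    v1 = sd1 ** 2 / n1  # squared standard error of group 1 mean
    v2 = sd2 ** 2 / n2  # squared standard error of group 2 mean
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# ChatGPT-authored (n = 32) vs applicant-authored (n = 40), values as reported.
readability_t, readability_df = welch_t(65.69, 14.68, 32, 56.40, 17.25, 40)
quality_t, quality_df = welch_t(63.00, 15.82, 32, 51.80, 15.93, 40)

print(f"readability: t = {readability_t:.2f}, df = {readability_df:.1f}")
print(f"quality:     t = {quality_t:.2f}, df = {quality_df:.1f}")
```

Because each reviewer's rating is treated as a separate data point, the counts used here are the 32 and 40 reviewer responses, not the 4 and 5 essays.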

Table 1.

Comparison of Quality Metrics Between Applicant-Authored and ChatGPT-Authored Personal Statements.

	Applicant-Authored	ChatGPT-Authored	P-Value
	n = 40	n = 32
Readability	56.40 ± 17.25	65.69 ± 14.68	0.016
Originality	52.78 ± 18.11	56.94 ± 18.28	0.339
Quality	51.80 ± 15.93	63.00 ± 15.82	0.004
Authenticity	56.43 ± 17.97	62.09 ± 22.87	0.256

Continuous data presented as mean ± standard deviation. Statistical significance defined as P < 0.05.

Figure 1.

Comparison of quality metrics between applicant-authored and ChatGPT-Authored personal statements.

Reviewers were unable to identify whether a personal statement was AI or human authored, with 37.5% of both the ChatGPT-authored statements (12/32 reviewer responses) and applicant-authored statements (15/40 reviewer responses) being identified as AI-generated (P = 1.000). However, reviewers were statistically more likely to recommend granting an interview for ChatGPT-authored personal statements (84.4%, 27/32 reviewer responses) compared to applicant-authored ones (62.5%, 25/40 reviewer responses) (OR: 3.24 [1.08-11.17], P = 0.045). A complete analysis of reviewer perceptions and interview recommendations is presented in Table 2.
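The odds ratios above follow directly from the reported response counts. The sketch below recomputes them from the 2 × 2 counts and adds a Wald 95% confidence interval. The Wald interval is an assumption chosen for simplicity, so its bounds need not match the intervals reported from the SPSS logistic regression, and the function name `odds_ratio_wald` is illustrative.

```python
import math


def odds_ratio_wald(events1, non_events1, events2, non_events2, z=1.96):
    """Odds ratio of group 1 vs group 2 with a Wald confidence interval.

    The interval is built on the log-odds scale using the standard error
    sqrt(1/a + 1/b + 1/c + 1/d) and then exponentiated.
    """
    odds_ratio = (events1 * non_events2) / (non_events1 * events2)
    se = math.sqrt(
        1 / events1 + 1 / non_events1 + 1 / events2 + 1 / non_events2
    )
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper


# Interview recommended: ChatGPT 27 of 32 responses, applicant 25 of 40.
interview = odds_ratio_wald(27, 5, 25, 15)

# Identified as AI: ChatGPT 12 of 32 responses, applicant 15 of 40.
identified = odds_ratio_wald(12, 20, 15, 25)

print(f"interview OR = {interview[0]:.2f}")       # 3.24
print(f"identified as AI OR = {identified[0]:.2f}")  # 1.00
```

The point estimates reproduce the reported odds ratios of 3.24 and 1.00. The interval bounds depend on the method used and are best read from the reported results.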

Table 2.

Reviewer Perception of Applicant-Authored and ChatGPT-Authored Personal Statements.

	Applicant-Authored	ChatGPT-Authored	P-Value	OR	95% CI
	n = 40	n = 32
Identified as AI	15 (37.5%)	12 (37.5%)	1.000	1.00	[0.38-2.62]
Granted interview	25 (62.5%)	27 (84.4%)	0.045	3.24	[1.08-11.17]

Abbreviations: AI = artificial intelligence, OR = odds ratio, 95% CI = 95% confidence interval. Categorical data presented as n (%). Statistical significance defined as P < 0.05.

Inter-rater reliability was poor for readability (ICC: 0.28 [0.75-0.81]), but moderate for originality (ICC: 0.72 [0.31-0.93]), overall quality (ICC: 0.68 [0.23-0.92]), and authenticity (ICC: 0.66 [0.17-0.91]). There was poor inter-rater reliability for identifying ChatGPT-generated personal statements (κ: 0.16 [0.04-0.29]) and for recommending a personal statement for an interview (κ: 0.15 [0.03-0.27]).

Discussion

This study comparing ChatGPT- and human-authored personal statements for spine surgery fellowship applications revealed that ChatGPT can produce essays that outperform authentic submissions in certain evaluation criteria. Evaluators failed to distinguish between ChatGPT and human authors and rated the ChatGPT-written statements higher for readability, quality, and likelihood of earning interview recommendations.

Previous studies have also reported that reviewers typically cannot reliably distinguish between ChatGPT-generated and human-authored statements, with both scoring similarly across quality metrics in contexts such as general surgery, orthopaedic surgery, plastic surgery, and anesthesia residency applications.^9-13 However, our study is the first to analyze personal statements within the fellowship application process and to demonstrate that ChatGPT-generated essays can potentially exceed human-written content on key evaluation criteria. We attribute these results to our unique prompting strategy. Whereas earlier studies used either basic, single-sentence prompts or more detailed, content-focused instructions, our approach employed a comprehensive prompt that emphasized humanistic writing style and allowed ChatGPT greater flexibility to craft content within defined parameters. This strategic shift in prompt design likely contributed to the significantly higher readability and quality scores observed in our study.

It is perhaps unsurprising that ChatGPT can generate high-quality, humanistic personal statements, given that LLMs are trained on vast datasets of human-written content and established quality benchmarks. However, not all studies align with our results. For example, Nair et al.’s research on internal medicine residency personal statements found that reviewers could identify AI-generated content.¹⁴ Their use of an exaggerated prompt, such as “you want an outstanding and unique personal statement that will captivate and wow the readers on the admission committee”, may have resulted in an overly dramatic writing style that signaled AI authorship.¹⁴ This contrast underscores the pivotal role of prompt engineering in shaping AI-generated outputs and likely explains the large differences between study results. Furthermore, while prior research predominantly employed ChatGPT-3.5 or -4, our study utilized the latest iteration, ChatGPT-4o, highlighting the rapid advancements in generative AI technology. These findings highlight the powerful capabilities of current generative AI technology and introduce key considerations for future application processes in medical education.

The use of generative AI in medical education applications has created significant debate within medical admissions committees, with concerns centered on authenticity, essay homogenization, and the loss of the applicant’s voice.⁷ Research by Whitrock et al. reveals strong opposition among faculty and residents as 66.7% oppose AI use in personal statement writing, and 80% said AI usage would impact their opinion of an applicant.¹² This issue is further complicated by limitations in AI detection tools and reviewers’ ongoing difficulty in identifying AI-generated content, as seen in this study.⁸

The discourse surrounding AI in applications warrants a nuanced approach that considers its diverse capabilities rather than treating it as a binary choice. Many agree that using AI-generated content as one’s own work is a form of plagiarism. While this study investigated the most extreme use of AI in personal statement writing, complete ghostwriting, generative AI can be used in ways that do not challenge the accuracy, authenticity, and integrity of applications.^4,7 For example, the United Kingdom’s shared admissions service for higher education endorses AI’s use for essay organization, grammar checking, and improving readability.¹⁵ Furthermore, AI-powered tools like text correctors and suggestion features have become so seamlessly embedded in daily workflows that they are often overlooked as standard functionalities and not acknowledged as “products” of AI.

Applicants have long sought external assistance in writing personal statements, whether through multiple rounds of mentor feedback or expensive professional editing services costing hundreds to thousands of dollars.^2,3 In this context, publicly available free AI platforms could expand access to high-quality writing assistance, potentially benefiting applicants with limited access to mentorship or financial resources.⁷ Kouam et al further argued that AI tools can help bridge the fluency gap for those with English as a second language, helping compose a compelling narrative while retaining the authenticity of the applicant’s experiences.⁷ This study shows that, when prompted effectively, current AI models can generate engaging and effective narratives, offering a powerful editing tool for applicants to convey their experiences. Chen et al suggest that if the primary goal of a personal statement is to communicate unique information about applicants not covered elsewhere in the application, then generative AI can help achieve this.⁴

In 2024, the AAMC updated its personal statement guidelines, shifting from the requirement that personal statements “must be your own work and not the work of another author or the product of artificial intelligence” to “the use of AI tools is acceptable for brainstorming, proofreading, or editing the personal statement, but the final submission should represent your own work.”¹⁶ This change signals a growing acceptance of limited AI use in medical education.

As AI technologies continue to advance, it is imperative for fellowship and residency matching organizations to develop clear and well-considered guidelines regarding the appropriate use of AI in application processes. Given the expanding capabilities of LLMs and their integration into everyday workflows, outright prohibition of these tools is both impractical and unenforceable, particularly in the absence of reliable detection methods. However, it is crucial to establish explicit definitions for the ethical use of LLMs within higher education, as current policy development lags behind the rapid pace of technological advancement. Future research should focus on assessing the prevalence of AI usage in personal statement writing, exploring program directors’ views on various AI applications after they have been fully informed about the technology’s capabilities and limitations, and determining where boundaries should be set for acceptable and ethical use.

Limitations

This study has several limitations. Our use of a single prompt for generating personal statements likely underestimates AI’s potential, as optimal performance typically involves iterative prompting with human intervention and editing. The evaluation metrics such as “overall quality” were intentionally broad but susceptible to individual reviewer interpretation. While we used the free version of ChatGPT due to its widespread accessibility, other LLMs such as Claude, Perplexity, and Gemini may offer enhanced creative writing capabilities. In the essay review process, we included current spine surgery fellows who do not have extensive experience in fellowship admissions, leading to potential inaccuracies in grading. ChatGPT was not provided with real applicant information, such as a CV, and instead generated fictional candidates from scratch based around a theme. While this could be seen as unfair, since ChatGPT could fabricate any story, it aligns with the study’s goal of assessing whether it can produce an adequate personal statement without specific applicant input. Furthermore, because personal statements are often judged on subjective elements like personal motivations and experiences rather than on the quality metrics or achievements that can be seen on an applicant’s CV, this approach provided a reasonable basis for comparison between cohorts.² Finally, the study’s small sample size of 9 personal statements, limited number of reviewers, and exclusive focus on ChatGPT constrain the generalizability of the findings and prevent definitive conclusions about ChatGPT or other LLMs. This is particularly relevant given that some results were borderline significant (eg, recommendation for interview: P = 0.045). Nonetheless, this research lays the groundwork for future investigations into the capabilities and ethical usage of ChatGPT and other LLMs in this context.

Conclusion

This study demonstrates that ChatGPT can produce personal statements for spine surgery fellowship applications that are indistinguishable from human-written essays and may also surpass human-written essays in quality, readability, and likelihood of securing interviews, highlighting the capabilities of current AI technology. While concerns about authenticity and academic integrity persist, the medical education community must provide nuanced guidelines that acknowledge both the benefits and limitations of AI in application processes, particularly its role in expanding access to high-quality writing and application assistance, while balancing its use with human creativity and applicant voice (Supplemental Material).

Supplemental Material

Supplemental Material - Artificial Intelligence vs Human Authorship in Spine Surgery Fellowship Personal Statements: Can ChatGPT Outperform Applicants?

Supplemental Material for Artificial Intelligence vs Human Authorship in Spine Surgery Fellowship Personal Statements: Can ChatGPT Outperform Applicants? by William J. Karakash, BS, Henry Avetisian, MS, Jonathan M. Ragheb, BA, Jeffrey C. Wang, MD, Raymond J. Hah, MD, and Ram K. Alluri, MD in Global Spine Journal.

Footnotes

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the William Karakash, Henry Avetisian, and Jonathan M. Ragheb have nothing to disclose. Jeffrey C. Wang has received intellectual property royalties from Zimmer Biomet, NovApproach, SeaSpine, and DePuy Synthes. Raymond J. Hah has received grant funding from SI bone, consulting fees from NuVasive, and support from the North American Spine Society to attend meetings. Ram K. Alluri has received grant funding from NIH, consulting fees from HIA Technologies, and payment from Eccentrial Robotics for lectures and presentations.

Institutional Review Board

This study was exempt from institutional review board (IRB) review.

ORCID iDs

William J. Karakash

Henry Avetisian

Jonathan M. Ragheb

Raymond J. Hah

Ram K. Alluri

Supplemental Material

Supplemental material for this article is available online.

References

1. Satin, Rogers-LaVanne, Albert, Hilibrand, Schoenfeld, Guyer. Best practices for the spine fellowship match: interview statistics, applicant perspectives, what programs look for in fellowship applicants, and guidance. Spine. 2023;48(6):371-375.

2. Jones, Pittman Jr, Manning. Ten steps for writing an exceptional personal statement. J Grad Med Educ. 2022;14(5):522-525.

3. Johnstone, Vallejo, Zakowski. Improving residency applicant personal statements by decreasing hired contractor involvement. J Grad Med Educ. 2022;14(5):526-528.

4. Chen, Bowe, Deng. Residency applications in the era of generative artificial intelligence. J Grad Med Educ. 2024;16(3):254-256.

5. Woodfin. The personal statement in the age of artificial intelligence. Acad Med. 2023;98(8):869.

6. van de Ridder JMM, Shoja, Rajput. Finding the place of ChatGPT in medical education. Acad Med. 2023;98(8):867.

7. Kouam, Pak, Montelongo Hernandez. Ethics of using artificial intelligence for medical residency personal statements. Acad Psychiatry. 2024;49:46-47. doi:10.1007/s40596-024-02047-w

8. Weber-Wulff, Anohina-Naumeca, Bjelobaba, et al. Testing of detection tools for AI-generated text. arXiv [csCL]. Published online June 21, 2023. https://arxiv.org/abs/2306.15666

9. Patel, Deleonibus, Wells, Bernard, Schwarz. Distinguishing authentic voices in the age of ChatGPT: comparing AI-generated and applicant-written personal statements for plastic surgery residency application. Ann Plast Surg. 2023;91(3):324-325.

10. Johnstone, Neely, Sizemore. Artificial intelligence software can generate residency application personal statements that program directors find acceptable and difficult to distinguish from applicant compositions. J Clin Anesth. 2023;89:111185.

11. Crawford, Hendzlik, Lam, et al. Digital ink and surgical dreams: perceptions of artificial intelligence-generated essays in residency applications. J Surg Res. 2024;301:504-511.

12. Whitrock, Pratt, Carter, et al. Does using artificial intelligence take the person out of personal statements? We can’t tell. Surgery. 2024;176:1610-1616. doi:10.1016/j.surg.2024.08.018

13. Lum, Guntupalli, Saiz, et al. Can artificial intelligence fool residency selection committees? Analysis of personal statements by real applicants and generative AI, a randomized, single-blind multicenter study. JB JS Open Access. 2024;9(4):e24.00028. doi:10.2106/JBJS.OA.24.00028

14. Nair, Nayak, Ahuja, et al. Comparing IM residency application personal statements generated by GPT-4 and authentic applicants. J Gen Intern Med. 2025;40(1):124-126.

15. A guide to using AI and ChatGPT with your personal statement. UCAS. 2023. https://www.ucas.com/applying/applying-university/writing-your-personal-statement/guide-using-ai-and-chatgpt-your-personal-statement. Accessed November 13, 2024.

16. Personal Statement. Students & residents. https://students-residents.aamc.org/applying-residencies-eras/publication-chapters/personal-statement. Accessed November 13, 2024.