Can ChatGPT Fool the Match? Artificial Intelligence Personal Statements for Plastic Surgery Residency Applications: A Comparative Study

Abstract

Introduction: Personal statements can be decisive in Canadian residency applications. With the rise of AI technology, ethical concerns regarding authenticity and originality have become more pressing. This study explores the capability of ChatGPT to produce personal statements for plastic surgery residency that match the quality of statements written by successful applicants.

Methods: ChatGPT was used to generate a cohort of personal statements for CaRMS (Canadian Resident Matching Service) to compare with previously successful Plastic Surgery applications. Each AI-generated and human-written statement was randomized and anonymized prior to assessment. Two retired members of the plastic surgery residency selection committee from the University of British Columbia evaluated the statements on a 0 to 10 scale and provided a binary response judging whether each statement was AI or human written. Statistical analysis included Welch 2-sample t tests and Cohen's Kappa for agreement.

Results: Twenty-two personal statements (11 AI-generated by ChatGPT and 11 human-written) were evaluated. The overall mean scores assigned by the 2 evaluators were 7.48 (SD 0.932) and 7.68 (SD 0.716), respectively, with no significant difference between AI and human groups (P = .4129). The average accuracy in distinguishing between human and AI letters was 65.9%. The Cohen's Kappa value was 0.374.

Conclusions: ChatGPT can generate personal statements for plastic surgery residency applications with quality indistinguishable from human-written counterparts, as evidenced by the lack of significant scoring difference and the moderate accuracy of discrimination by experienced surgeons. These findings highlight the evolving role of AI and the need for updated evaluative criteria or guidelines in the residency application process.

Keywords

residency artificial intelligence ChatGPT machine learning postgraduate medical education

Introduction

The Canadian Resident Matching Service (CaRMS) serves as the national annual residency selection process in Canada, offering medical students a centralized platform to apply for postgraduate training programs.¹ This process involves the submission of a CaRMS application package, followed by a thorough file review by the programs, and ultimately, candidate interviews.¹

Personal statements have traditionally been held to be potentially decisive in residency applications, as relying solely on objective data for selection has been shown both to predict clinical performance inadequately and to leave gaps in a candidate's full portrait.²,³ These letters offer insights into applicants’ qualities, experiences, and motivations beyond their academic achievements.¹ A recent survey study revealed that Canadian program directors of Plastic Surgery residency programs rate the value of personal statements at 3.1 on a 5-point Likert scale, equating their utility to that of medical school transcripts.⁴ Similarly, program directors and selection committee members of Canadian urology residency programs rated the importance of the personal statement at 3.65 on the same scale.⁵ In the United States, between 74% and 78% of program directors consider personal statements important in selecting interview candidates, and 48% to 54% rely on them when determining final rank lists.³,⁶ Studies have further demonstrated a correlation between applicant traits found in the personal statement and decreased resident attrition.⁷,⁸ Nonetheless, despite their utility in candidate selection, personal statement scoring remains highly subjective, and the frequent lack of objective criteria results in poor interrater reliability.⁹,¹⁰

Residency applicants invest substantial time in crafting and refining their personal statements, a process that causes a significant level of anxiety in 80% of these applicants.¹¹ Recently, advanced AI language models, such as ChatGPT (OpenAI, San Francisco), have shown potential in automating the generation of personalized and persuasive narratives, including personal statements, given their high accuracy in natural language generation and their ability to write essays of reasonable quality on a particular topic.¹² Other evidence demonstrates the utility of ChatGPT in writing patient education handouts of acceptable quality.¹³ However, outside the context of medical writing, it is important to consider the limitations of ChatGPT in other writing circumstances, such as lack of context or personalization, ethical concerns, and inaccurate information.¹² To address these issues, human experts may provide guidance, context, and verification to enhance ChatGPT's output.¹² Even so, one controversial study found that human reviewers correctly identified only 68% of ChatGPT-generated research paper abstracts.¹⁴ Whether these limitations extend to the writing of personal statements for residency applications remains underinvestigated.

This study aims to explore the capability of AI to produce convincing personal statements for plastic surgery residency that match the quality of statements written by successful applicants. If ChatGPT-generated statements prove to be equally persuasive and impactful, this prompts the question of how much weight should be placed on personal statements in the selection process today. Understanding the potential of AI in assisting with personal statement writing can inform selection committees across Canada of the need for more balanced and fair selection criteria, particularly in an era of machine learning. The primary outcome of this study is the subjective score assigned to each group by 2 plastic surgery faculty members from the University of British Columbia who are experienced in CaRMS selection. The secondary outcome is a binary judgment assessing whether the plastic surgeon evaluators can tell if a personal statement was written by an applicant or by ChatGPT.

Methods

Study Design

The aim of this study is to assess the ability of ChatGPT-4 to generate personal statements for plastic surgery residency applications, simulating the circumstances of a medical student. ChatGPT was selected because it is openly accessible, and only universally accessible online resources were used to facilitate this simulation. A guide published in the Journal of Graduate Medical Education, authored by clerkship and residency program directors with a combined experience of 50 years, has underscored the role of personal statements in highlighting an applicant's strengths and achievements.³ This guide further suggests incorporating personal anecdotes and patient stories as a means to showcase personal traits, the journey toward or enthusiasm for the chosen discipline, and professional growth.³ These elements, along with a structured framework, were woven into the ChatGPT prompt to ensure sufficient variety.³

ChatGPT-4 was queried with the following stem on November 18, 2023:

I am a final year medical student in Canada applying for a plastic surgery resident program at the University of British Columbia. Please write me a personal statement with no fixed structure of a maximum of 500 words incorporating specific and compelling stories, experiences, or something that introduces the applicant and makes the reader want to know more, and essential details that a program must know about the applicant and their proudest accomplishments, and specific strengths related to the specialty of choice and leadership experiences. Please do not use placeholders.

To avoid any potential influence from prior inputs, each query was executed in a fresh ChatGPT session. Furthermore, to account for variation in response quality, the query was repeated 11 times, and each response was evaluated and rated individually. All identifiable information was removed and replaced with variables.

Human-written personal statements submitted to CaRMS from 2018 to 2023 were collected, with permission, from current UBC plastic surgery residents by EB and anonymized prior to assessment.

Data Collection and Grading

Each ChatGPT-generated and human-written statement was assessed blindly by 2 plastic surgeons who had recently retired from the UBC CaRMS selection committee. In total, 11 successful human-written statements were compared with 11 ChatGPT-generated statements, each scored on a subjective scale (0-10) that replicates the residency application process at the University of British Columbia. The secondary outcome, a binary response discriminating past successful from AI-generated personal statements, was collected in the same manner.

Statistical Analysis

The primary outcome was the difference in the surgeons’ rating scores (scale: 0-10) between human- and AI-derived personal statements. We refer to the surgeons as “Surgeon A” and “Surgeon B.” The secondary outcome was the likelihood of correctly classifying a personal statement as human- or AI-derived. Descriptive statistics were reported as mean (standard deviation; SD) or median (interquartile range; IQR), where applicable. Inferential statistics were conducted using independent (Welch 2-sample) t tests to determine significant between-group differences, using an α of 0.05. For each surgeon, we calculated the sensitivity, specificity, and accuracy of distinguishing human-written from AI-written letters. We employed Cohen's Kappa to determine agreement between the surgeons’ classifications of human versus AI. Agreement on continuous quality scoring was assessed using a Bland-Altman plot. Where indicated, correlation analysis was conducted using Spearman methodology. Analysis was completed by BT in RStudio (version 2023.06.1 + 524).
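As a rough illustration of the Bland-Altman agreement analysis named above, the following Python sketch computes the mean difference between two raters (the bias) and the conventional 95% limits of agreement (bias ± 1.96 SD of the paired differences). It is not the authors' RStudio code, and the paired scores are hypothetical placeholders, because the per-statement ratings are not reported in this article.

```python
import statistics


def bland_altman(scores_a, scores_b):
    """Return the bias, 95% limits of agreement, and plot coordinates.

    scores_a and scores_b are paired ratings of the same items by two raters.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("Both raters must score the same items")
    # x-axis of the plot: the mean of the two ratings for each item
    means = [(a + b) / 2 for a, b in zip(scores_a, scores_b)]
    # y-axis of the plot: the difference between the two ratings
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample SD of the differences
    lower, upper = bias - 1.96 * sd, bias + 1.96 * sd
    # indices of items whose difference falls outside the limits
    outside = [i for i, d in enumerate(diffs) if d < lower or d > upper]
    return {"means": means, "diffs": diffs, "bias": bias,
            "lower": lower, "upper": upper, "outside": outside}


# HYPOTHETICAL scores for illustration only (not study data).
surgeon_a_scores = [7, 8, 6.5, 7.5, 9, 7, 8, 6]
surgeon_b_scores = [7.5, 8, 7, 7, 8.5, 7.5, 8, 7]

result = bland_altman(surgeon_a_scores, surgeon_b_scores)
print(f"bias = {result['bias']:.3f}")
print(f"limits of agreement = ({result['lower']:.3f}, {result['upper']:.3f})")
print(f"items outside limits: {result['outside']}")
```

A bias near zero with most differences inside the limits is usually read as acceptable agreement; plotting `means` against `diffs` gives the familiar scatter.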

Ethical Considerations

This study adheres to the principles of the Declaration of Helsinki. In accordance with Article 2.4 of the Tri-Council Policy Statement for research ethics in Canada, institutional review board approval was not required as the data were obtained from publicly available sources. No personally identifiable information was generated or used in this study.

Results

Across 22 personal statements (11 from ChatGPT and 11 human-written), Surgeons A and B gave overall mean (SD) scores of 7.48 (0.932) and 7.68 (0.716), respectively. Table 1 depicts both surgeons’ scoring when stratified by human- versus AI-derived letters. When both raters’ scores were pooled into a summary estimate, there was no significant difference in scoring between AI and human groups (P = .4129).

Table 1.

Surgeons’ Scoring of Human- Versus AI-Derived Letters.

Rater	Human-derived letters, mean (SD)	AI-derived letters, mean (SD)
Surgeon A	7.5 (0.866)	7.45 (1.04)
Surgeon B	7.91 (0.701)	7.45 (0.688)
Pooled Scoring	7.70 (0.697)	7.45 (0.705)
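For readers who wish to see how a Welch 2-sample t test works from summary statistics, the Python sketch below applies the standard formulas to the pooled row of Table 1. It is an illustration, not the authors' RStudio analysis. It assumes that each pooled group contains 11 values (one per statement), which the article does not state explicitly, and it obtains the P value by numerically integrating the t density rather than calling a statistics library. Under those assumptions the result should land close to the reported P = .4129.

```python
import math


def welch_t_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom."""
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2  # squared standard errors
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t, df, steps=2000):
    """Two-sided P value via Simpson's rule on [0, |t|] (steps must be even)."""
    b = abs(t)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = 2 * total * h / 3  # P(-|t| < T < |t|)
    return 1 - central_area


# Pooled row of Table 1; n = 11 per group is an ASSUMPTION.
t_stat, df = welch_t_from_summary(7.70, 0.697, 11, 7.45, 0.705, 11)
p_value = two_sided_p(t_stat, df)
print(f"t = {t_stat:.3f}, df = {df:.2f}, P = {p_value:.4f}")
```

Because the Welch test does not assume equal variances, the degrees of freedom are estimated from the two sample variances rather than fixed at n1 + n2 − 2.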

With respect to discriminating between human- and AI-derived letters, Surgeon A exhibited a sensitivity, specificity, and accuracy of 0.636, 0.818, and 0.727, respectively. Comparatively, Surgeon B exhibited a sensitivity, specificity, and accuracy of 0.636, 0.545, and 0.591, respectively. Regarding interrater reliability in discerning human-written from AI-written material, Surgeons A and B shared a Cohen's Kappa of 0.374, indicating fair agreement. Figure 1 depicts a Bland-Altman plot to visualize the intersurgeon agreement on their continuous variable scoring of each letter. The overall mean difference of scoring between raters was near zero, with all but one of the individual differences lying within the 95% confidence interval of the pooled mean difference, thereby suggesting acceptable agreement.
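The classification metrics above can be reproduced from simple counts. The Python sketch below is an illustration rather than the authors' code. The counts are back-calculated from the reported proportions on the assumption of 11 statements per class (for example, 0.636 ≈ 7/11 and 0.818 ≈ 9/11); the article does not state which class was treated as "positive," and the 2 × 2 agreement table used for Cohen's Kappa is a reconstruction consistent with those counts and the reported value of 0.374, not published data.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, and accuracy from a 2 x 2 confusion table."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


def cohens_kappa(both_pos, a_only_pos, b_only_pos, both_neg):
    """Cohen's Kappa for two raters making a binary call on the same items."""
    n = both_pos + a_only_pos + b_only_pos + both_neg
    observed = (both_pos + both_neg) / n  # proportion of items agreed on
    a_pos = both_pos + a_only_pos         # items rater A labeled positive
    b_pos = both_pos + b_only_pos         # items rater B labeled positive
    # agreement expected by chance, from each rater's marginal totals
    expected = (a_pos * b_pos + (n - a_pos) * (n - b_pos)) / n ** 2
    return (observed - expected) / (1 - expected)


# Counts INFERRED from the reported proportions (assumed 11 per class).
surgeon_a = diagnostic_metrics(tp=7, fn=4, tn=9, fp=2)
surgeon_b = diagnostic_metrics(tp=7, fn=4, tn=6, fp=5)
mean_accuracy = (surgeon_a[2] + surgeon_b[2]) / 2

# RECONSTRUCTED agreement table: A labels 9 positive, B labels 12 positive.
kappa = cohens_kappa(both_pos=7, a_only_pos=2, b_only_pos=5, both_neg=8)

for name, m in (("Surgeon A", surgeon_a), ("Surgeon B", surgeon_b)):
    print(f"{name}: sens={m[0]:.3f} spec={m[1]:.3f} acc={m[2]:.3f}")
print(f"mean accuracy = {mean_accuracy:.3f}, kappa = {kappa:.3f}")
```

Kappa corrects the raw agreement (here 15 of 22 statements under the reconstruction) for the agreement expected by chance, which is why it is much lower than the raw proportion.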

Figure 1.

Bland-Altman plot for intersurgeon agreement on scoring letters.

Discussion

This study suggests AI is capable of generating convincing, high-quality personal statements for Plastic Surgery residency applications with minimal guidance. Our results contribute new and objective data to the timely discussion around the complex ethical challenge of distinguishing between AI-written and human-written academic pieces.¹⁵ Our comparative analysis revealed no significant difference in average ratings between AI and human groups. This finding compares unaltered AI output, without any post-generation editing, against human-written statements, and may therefore underestimate the potential strength of a hybrid AI-human approach, such as sequential refinement of query stems. This study adds to the question of the role of AI in enhancing research and academic productivity, and of its impact on the traditional functions of academic institutions, with the possible blurring of lines between human creativity and machine efficiency.¹⁶

The ethical and practical implications of AI use in academia are profound and multifaceted. The utilization of AI in academic pursuits, especially in contexts such as lesser-known journals, graduate theses, and grant applications, brings forth issues of authenticity and academic integrity.¹⁷ A notable example, the admission of a Japanese literature prize winner who utilized ChatGPT, illustrates the complex interplay of AI in creative and academic fields.¹⁸ This underscores the need for clear guidelines and best practices in AI usage. The CaRMS Privacy Statement emphasizes the responsibility of applicants to ensure the accuracy and completeness of their personal information.¹⁹ However, there is a notable absence of specific directives regarding AI assistance in this context. To mitigate potential issues such as the homogenization of statements, perpetuation of bias, and ethical dilemmas, it is imperative to define the role of AI in residency applications more clearly before establishing such guidelines.²⁰ While current AI-detection tools (OpenAI, CrossPlag, GPTZero, and more) offer some capability to distinguish between human and AI-generated content, their performance is inconsistent, leading to the suggestion that they should not be the sole determinant in questions of academic integrity.²¹ Unlike plagiarism detection tools in widespread use throughout academia, AI screening may not yet be sufficiently robust to ensure a fair judgment. Thus, a broader, more nuanced approach is required to address the challenges posed by AI in academic settings.

The demonstrated success of AI in crafting personal statements of course risks diminishing their value, potentially shifting emphasis toward more quantifiable metrics like academic performance.²⁰ CaRMS does not yet afford a standardized academic selection tool; the inevitable disparity between Medical School evaluation standards, together with the universally high grades presented by Plastic Surgery aspirants, makes it inherently difficult to discriminate between them. Furthermore, the recent shift in Canadian medical education toward pass-fail and practical clinical evaluations, with less emphasis on competition and academic ranking, complicates the selection process.⁴ In highly competitive specialties, application volume exceeds the ability to meaningfully interview all candidates. Academic performance alone may correlate with postsecondary success but is limited in addressing suitability for a surgical career. References are highly subjective and often based on only momentary exposure.²² Personal statements offer a holistic view of the applicant's unique experiences and journeys that are not objectively measurable, and have therefore historically formed an integral part of the residency application.²³ Yet, the results from our study call for transparency about how programs will evaluate applications in this new AI-influenced landscape.²⁴

As AI technology advances and its capability to generate increasingly sophisticated personal statements grows, ethical concerns regarding authenticity and originality become more pressing. The debate about the future role and relevance of personal statements in the context of emerging AI technologies is ongoing. As we are unlikely to stop the tide, will traditional personal statements become obsolete? Optimistically, a timely published guideline warning against the overt or inappropriate use of AI in CaRMS applications or policies regarding AI usage published by individual programs in their program descriptions may be sufficient. Additionally, publication of this and similar studies alerts applicants to selection committee self-awareness. This allows for bilateral acceptance of AI as a valid academic tool within ethical boundaries. The determined student will exploit AI while honing its contribution with a personal slant. Meanwhile, the shrewd reviewer will become more discerning against generic content and AI flags, rewarding effort invested in authenticity and originality. Nevertheless, our study implies that AI will probably outpace the evolving capability of human subjectivity, and ongoing prospective validation of all application tools should be promoted to preserve their discriminatory value.

Strengths and Limitations

The limitations of this study include its small sample size, which may not represent a broader range of personal statements. The subjective nature of the evaluation process and potential biases are also concerns, especially considering that our faculty evaluators, while chosen for their retired status, may not fully capture the range of experiences of a current file reviewer, particularly in terms of familiarity with AI. Nonetheless, the reviewers’ more than 20 years of experience in the CaRMS selection process remains a notable strength of this study. Future work could build on these findings by evaluating whether the results generalize across a greater number of raters and by further exploring the implications of AI in academic settings.

Footnotes

Acknowledgements

The authors would like to thank our reviewers, Drs. Douglas Courtemanche and Cindy Verchere, for their expertise and time to evaluate the personal statements.

Author Contributions

JC, BT, and EB conceived of the research question. BT completed the statistical analysis. All authors discussed the results and contributed to the final manuscript.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Ethical Statement

Informed consent was obtained from all individual participants included in the study.

ORCID iD

Jeffrey Chen

References

1. Gawad, Younan, Towaij, Raiche. Comparing 2 approaches for the file review of residency applications. J Grad Med Educ. 2021;13(2):240-245. doi:10.4300/jgme-d-20-00619.1

2. Swanson, Sawhill, Holtzman, et al. Relationship between performance on part I of the American Board of Orthopaedic Surgery Certifying Examination and Scores on USMLE steps 1 and 2. Acad Med. 2009;84(10 Suppl):S21-S24. doi:10.1097/ACM.0b013e3181b37fd2

3. Jones, Pittman Jr, Manning. Ten steps for writing an exceptional personal statement. J Grad Med Educ. 2022;14(5):522-525. doi:10.4300/jgme-d-22-00331.1

4. Krauss, Bezuhly, Williams. Selecting the best and brightest: a comparison of residency match processes in the United States and Canada. Plast Surg (Oakv). 2015;23(4):225-230. doi:10.4172/plastic-surgery.1000943

5. Nguyen, Lee, Domes, et al. Survey of Canadian urology programs: which aspects of the Canadian Residency Matching Service (CaRMS) application are the most important? Can Urol Assoc J. 2020;14(6):169-173. doi:10.5489/cuaj.6191

6. Naples, French, Lipman, Prabhu, Aiello, Park. Personal statements in general surgery: an unrecognized role in the ranking process. J Surg Educ. 2020;77(6):e20-e27. doi:10.1016/j.jsurg.2020.03.021

7. Naylor, Reisch, Valentine. Factors related to attrition in surgery residency based on application data. Arch Surg. 2008;143(7):647-51; discussion 651-2. doi:10.1001/archsurg.143.7.647

8. Kelz, Mullen, Kaiser, et al. Prevention of surgical resident attrition by a novel selection strategy. Ann Surg. 2010;252(3):537-1; discussion 541-3. doi:10.1097/SLA.0b013e3181f27a50

9. Dirschl. Scoring of orthopaedic residency applicants: is a scoring system reliable? Clin Orthop Relat Res. 2002;399:260-264. doi:10.1097/00003086-200206000-00033

10. White, Sadoski, Thomas, Shabahang. Is the evaluation of the personal statement a reliable component of the general surgery residency application? J Surg Educ. 2012;69(3):340-343. doi:10.1016/j.jsurg.2011.12.003

11. Campbell, Havas, Derse, Holloway. Creating a residency application personal statement writers workshop: fostering narrative, teamwork, and insight at a time of stress. Acad Med. 2016;91(3):371-375. doi:10.1097/acm.0000000000000863

12. Bhatia. ChatGPT for academic writing: a game changer or a disruptive tool? J Anaesthesiol Clin Pharmacol. 2023;39(1):1-2. doi:10.4103/joacp.joacp_84_23

13. Tao, Handzic, Hua, Vosoughi, Margolin, Micieli. Utility of ChatGPT for automated creation of patient education handouts: An application in neuro-ophthalmology. J Neuroophthalmol. 2024;44(1):119-124.

14. Else. Abstracts written by ChatGPT fool scientists. Nature. 2023;613(7944):423. doi:10.1038/d41586-023-00056-7

15. Dergaa, Chamari, Zmijewski, Ben Saad. From human writing to artificial intelligence generated text: Examining the prospects and potential threats of ChatGPT in academic writing. Biol Sport. 2023;40(2):615-622. doi:10.5114/biolsport.2023.125623

16. Chubb, Cowling, Reed. Speeding up to keep up: exploring the use of AI in the research process. AI Soc. 2022;37(4):1439-1457. doi:10.1007/s00146-021-01259-0

17. Miao, Thongprayoon, Suppadungsuk, Garcia Valencia, Qureshi, Cheungpasitporn. Ethical dilemmas in using AI for academic writing and an example framework for peer review in nephrology academia: a narrative review. Clin Pract. 2023;14(1):89-105. doi:10.3390/clinpract14010008

18. Author admits she used ChatGPT to write parts of prize-winning novel. The Telegraph; January 18, 2024. Accessed January 30, 2024. https://www.telegraph.co.uk/world-news/2024/01/18/author-used-chatgpt-ai-to-write-prize-winning-novel-japan/

19. Policies. Accessed January 30, 2024. https://www.carms.ca/policies/

20. Woodfin. The personal statement in the age of artificial intelligence. Acad Med. 2023;98(8):869. doi:10.1097/acm.0000000000005266

21. Elkhatat, Elsaid, Almeer. Evaluating the efficacy of AI content detection tools in differentiating between human and AI-generated text. Int J Educ Integr. 2023;19(1):17. doi:10.1007/s40979-023-00140-5

22. Report of the UC Academic Council Standardized Testing Task Force (STTF). 2020. https://senate.ucsd.edu/media/424154/sttf-report-rev-2-14-20.pdf

23. Johnstone, Vallejo, Zakowski. Improving residency applicant personal statements by decreasing hired contractor involvement. J Grad Med Educ. 2022;14(5):526-528. doi:10.4300/jgme-d-22-00226.1

24. Zumsteg, Junn. Will ChatGPT match to your program? Am J Phys Med Rehabil. 2023;102(6):545-547. doi:10.1097/phm.0000000000002238