ChatGPT-4’s capability in addressing multiple-choice questions within the primary examination of the Australian and New Zealand College of Anaesthetists

Abstract

The rapid advancement in artificial intelligence has been the subject of much discussion and interest. Chat Generative Pre-Trained Transformer (ChatGPT), a powerful large language model with advanced reasoning ability, has shown promise in medical education and assessment. It has successfully navigated not only standard medical licensing examinations^1,2 but also examinations in different specialties^3–8 across multiple countries and languages.^9,10

In anaesthesia, evidence examining the effectiveness of ChatGPT in tackling multiple-choice questions (MCQs) is emerging. Three sets of MCQs have been analysed across four studies: two studies on the examination for the Fellowship of the Royal College of Anaesthetists (FRCA) and two on that of the American Board of Anesthesiology (ABA). Of the FRCA studies, one assessed 3705 MCQs from a question bank (AnaesthesiaUK Primary FRCA)¹¹ and the other examined 27 sample MCQs from the Royal College of Anaesthetists website.¹² Both found that ChatGPT-3.5 scored within the passing range; however, the latter noted a significant improvement with ChatGPT-4. Echoing this, in the assessment of 1231 questions from an ABA preparation book,¹³ ChatGPT-3.5 fell just short of a pass, whereas ChatGPT-4 achieved a passing score.¹⁴

For trainees undergoing specialist training with the Australian and New Zealand College of Anaesthetists (ANZCA), two examinations are compulsory. The first, titled ‘Primary Examination’, assesses candidates in the subjects of applied physiology, pharmacology, anatomy, measurement and equipment. This encompasses both a written and an oral (viva) component. The written aspect comprises 150 MCQs and 15 short answer questions (SAQs). To progress to the viva component, the trainee must pass the MCQs as well as score a minimum of 40% in the SAQs. To achieve a pass in the entire exam, an average mark of 50% or more is required across both the written SAQs and viva components.

The MCQ grading system transitioned to a pass/fail model in 2017. The average score required to pass the MCQ barrier was 60.2%¹⁵ for the years 2017 to 2022. Overall, the proportion of candidates passing the entire primary examination stood at 60.5% during the same period.

Previous investigations of ChatGPT-4’s capability to correctly answer anaesthesia-related MCQs have used question sets that originated from an examination preparation book, an unofficial question bank or a very small set of official questions. In this paper, we appraise ChatGPT-4’s performance on official MCQs only, using a considerably larger set. To closely emulate the ANZCA examination, only questions officially published by the college were used. Unlike the SAQs, for which all past papers have been published, only one publicly available primary examination MCQ paper exists. The questions in this ‘practice’ paper do not arise from one sitting; rather, the paper is a composite of 120 questions from the two sittings of 2018.¹⁶ The official MCQs of other sittings are unavailable, though some exist on personal websites or in question banks set up privately by previous candidates.

In July 2023, ChatGPT-4 was tasked with addressing each question in accordance with ANZCA examination instructions. The same prompt, as seen in the examination, was used for all questions: ‘Please answer the following questions. Each question has five answers. Choose the ONE best answer to each question.’ The answers were then marked by the first author with reference to ANZCA’s recommended reading list of textbooks.¹⁷ Where the answers were difficult to discern or the information was conflicting, the latest journal articles were consulted. In the rare instances where significant doubt remained, expert opinion was sought. Of note, nine of the MCQs involved graphical input, which was beyond the capability of ChatGPT-4 at the time of evaluation.

Overall, ChatGPT-4 achieved a score of 73% with 87 correct answers out of 120 questions (Table 1). If graphical questions were excluded, the mark improved to 78% (87/111 questions).

Table 1.

ChatGPT-4 score by subject, with and without graphic questions.

Subject	All questions	Graphic questions excluded
Physiology	75% (35/47)	85% (35/41)
Pharmacology	76% (45/59)	76% (45/59)
Clinical measurement	50% (5/10)	63% (5/8)
Statistics	100% (2/2)	100% (2/2)
Anatomy	0% (0/2)	0% (0/1)
Overall	73% (87/120)	78% (87/111)
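The arithmetic behind Table 1 can be checked with a short sketch. The counts below are transcribed from the table; the number of graphic questions per subject is an inference, taken as the difference between the two denominators in each row, and the names `TABLE_1`, `pct` and `summarise` are illustrative only. Percentages are given to one decimal place, so they can differ from the whole-number figures in the table by rounding.

```python
# Counts transcribed from Table 1: subject -> (correct, total, graphic).
# ASSUMPTION: the graphic counts are inferred from the difference between
# the two denominators in each row (e.g. 47 - 41 = 6 for physiology).
TABLE_1 = {
    "Physiology": (35, 47, 6),
    "Pharmacology": (45, 59, 0),
    "Clinical measurement": (5, 10, 2),
    "Statistics": (2, 2, 0),
    "Anatomy": (0, 2, 1),
}


def pct(correct, total):
    """Percentage correct, rounded to one decimal place."""
    return round(100 * correct / total, 1)


def summarise(table):
    """Return subject -> (score on all questions, score without graphic questions)."""
    rows = {}
    for subject, (correct, total, graphic) in table.items():
        # No graphic question was answered correctly, so only the
        # denominator changes when graphic questions are excluded.
        rows[subject] = (pct(correct, total), pct(correct, total - graphic))
    correct = sum(c for c, _, _ in table.values())
    total = sum(t for _, t, _ in table.values())
    graphic = sum(g for _, _, g in table.values())
    rows["Overall"] = (pct(correct, total), pct(correct, total - graphic))
    return rows


if __name__ == "__main__":
    for subject, (all_q, text_q) in summarise(TABLE_1).items():
        print(f"{subject:<22}{all_q:>6.1f}%{text_q:>8.1f}%")
```

Summing the rows reproduces the overall denominators of 120 and 111 questions and the 87 correct answers, which is a quick consistency check on the table.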

Among the five subjects examined, pharmacology accounted for the highest number of questions, with 49% of the total (59 out of 120). Physiology questions followed closely behind, accounting for 39% (47/120). Other subjects, namely clinical measurement/equipment (8% or 10/120), statistics (2% or 2/120) and anatomy (2% or 2/120), featured considerably fewer questions.

In the key categories of pharmacology and physiology, ChatGPT-4 performed well. In physiology, it accurately answered 75% of the questions (35/47), which rose to 85% (35/41) if graphic questions were excluded. For pharmacology, the model achieved a mark of 76% (45/59). Unlike physiology, there were no graphic questions in pharmacology.

In the remaining topics, performance varied. In clinical measurement and equipment, 50% of the questions were answered correctly (5/10); when graphic questions were excluded, the score improved to 63% (5/8). Both statistics questions were answered correctly. Conversely, both anatomy questions were answered incorrectly.

The questions posed here were drawn from the two examination sittings held in 2018. To overcome the MCQ barrier in the first sitting, candidates required a score of 53% (80/150), while in the second sitting, they required 57% (86/150). In 2018, 69% and 84% of actual examinees, respectively, achieved this mark. The specific details regarding ad hoc standardisation and scaling were not publicly disclosed. ChatGPT-4 easily cleared the passing threshold of these MCQ examinations, demonstrating good performance in the core categories of physiology and pharmacology. When graphic questions were excluded, its performance was consistent across all categories, particularly in statistics, clinical measurement and equipment.
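As a rough illustration of the comparison above, the sketch below sets ChatGPT-4's raw proportion against the two 2018 pass marks. It assumes that raw proportions are directly comparable, which may not hold: the practice paper is a 120-question composite and the standardisation and scaling details were not disclosed. The names `PASS_MARKS`, `CHATGPT4_SCORE` and `margin_in_points` are illustrative only.

```python
from fractions import Fraction

# Raw pass marks reported for the two 2018 sittings (out of 150 questions).
PASS_MARKS = {
    "first sitting": Fraction(80, 150),
    "second sitting": Fraction(86, 150),
}
# ChatGPT-4's result on the 120-question practice paper.
CHATGPT4_SCORE = Fraction(87, 120)


def margin_in_points(score, pass_mark):
    """Percentage-point gap between a score and a pass mark."""
    return float((score - pass_mark) * 100)


if __name__ == "__main__":
    for sitting, pass_mark in PASS_MARKS.items():
        gap = margin_in_points(CHATGPT4_SCORE, pass_mark)
        print(f"{sitting}: pass mark {float(pass_mark):.1%}, margin {gap:+.1f} points")
```

Exact fractions are used so that the comparison does not depend on how the individual percentages were rounded.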

The answers generated not only selected the best option but often offered detailed explanations of the rationale behind the selection, and why other options were deemed unsuitable. Not infrequently, the generated answers appeared to display a level of nuance and depth of thought (Figure 1).

Figure 1.

Two example answers generated by ChatGPT-4 to multiple-choice questions, appearing to display higher-order thinking.

ChatGPT-4 has exhibited considerable aptitude, approaching or attaining the passing standard in anaesthesia exams in both the UK and the US.^11–14 In the FRCA question sets, ChatGPT-3.5 accurately answered 69.7% of 3705 questions from the question bank,¹¹ while ChatGPT-4 answered 63.6% correctly from the 27 sample questions from the Royal College of Anaesthetists.¹² In the ABA exam preparation book, ChatGPT-3.5 achieved a score of 56.2%,¹³ while ChatGPT-4 outperformed it with a correct rate of 72.1%¹⁴ from a pool of just over 1300 questions. The question sets have mostly been sourced through unofficial channels such as a preparation book, with the only official source being the 27 questions from the Royal College of Anaesthetists. In our appraisal of ChatGPT-4, utilising an expanded pool of official questions, it not only passed this section of the examination comfortably but also compared favourably with the performance of past candidates.

We tried to replicate the exam experience as closely as possible in both the testing and grading aspects. Nevertheless, testing a chatbot in a simulated examination setting is a relatively recent concept, prompting a need to discuss its limitations.

A total of 120 questions were presented to ChatGPT-4, as these were the sole source of official questions. This number is fewer than the 150 questions that are administered in each examination sitting and substantially fewer than the 3705 and 1319 questions assessed in the FRCA^11,13 and ABA^13,14 examinations. The absence of a larger collection of questions may result in a sample that is not fully representative, reducing validity.
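One way to gauge how much the sample size matters is to attach an interval to the observed proportion. The sketch below uses a Wilson score interval, which is an illustrative choice and not part of the original analysis; it also treats the 120 questions as independent draws from a larger pool, an assumption that a curated practice paper may not satisfy. The function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half_width, centre + half_width


if __name__ == "__main__":
    low, high = wilson_interval(87, 120)
    print(f"87/120 correct: {low:.1%} to {high:.1%}")
```

For a fixed proportion the interval narrows as the number of questions grows, which is the sense in which a larger official question set would give a more stable estimate.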

Since the release of the official questions from 2018, the format of the ANZCA Primary examination has been modified.¹⁶ Key changes include a reduction from five options to four in the single best response format, a more refined standardisation scheme and an enhanced metareview process. These substantive modifications mean that the performance reported in this paper might not carry through to the new format.

Since this evaluation of MCQs, ChatGPT-4 has seen significant improvement. Whereas its base model was previously unable to process graphic content, it can now both interpret images and generate them in response to questions.¹⁸ ChatGPT-4 scored zero for visual questions in this study, but it now has the capability to address them.

Our study illustrates the ability of a large language model such as ChatGPT-4 to comfortably clear the MCQ threshold of the ANZCA Primary exam. It would be prudent for candidates, however, not to rely solely on ChatGPT-4. It not infrequently generates factual errors and tends to deliver answers confidently even when they are wrong; in this study, 27% of the questions were answered incorrectly. Despite its shortcomings, this easily accessible chatbot could offer advantages to primary examination candidates as an alternative study aid, providing immediate, personalised, problem-oriented solutions at any time of the day or night. Improvements are expected with the evolution of artificial intelligence models. Furthermore, the notion that an untrained, general-purpose chatbot was able to pass the MCQs might inspire further research, leading to the development of innovative alternative approaches to anaesthesia learning and assessment.

Footnotes

Author Contribution(s)

Steven C Cai: Conceptualization; Data curation; Methodology; Project administration; Writing – original draft; Writing – review & editing.

Alpha MS Tung: Conceptualization; Methodology; Project administration; Supervision; Writing – review & editing.

Acknowledgement

Declaration of generative AI and AI-assisted technologies in the writing process: in reviewing this work, the authors used ChatGPT to enhance readability and language. The authors reviewed and edited the content and assume full responsibility for the content of the publication.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Steven C Cai

References

1. Gilson, Safranek, Huang, et al. How does ChatGPT perform on the United States Medical Licensing Examination? The implications of large language models for medical education and knowledge assessment. JMIR Med Educ 2023; 9: e45312. doi:10.2196/45312

2. Kung, Cheatham, Medenilla, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLoS Digit Health 2023; 2: e0000198. doi:10.1371/journal.pdig.0000198

3. Bhayana, Krishna, Bleakney RR. Performance of ChatGPT on a radiology board-style examination: Insights into current strengths and limitations. Radiology 2023; 307: e230582. doi:10.1148/radiol.230582

4. Gupta, Herzog, Park, et al. Performance of ChatGPT on the Plastic Surgery Inservice Training Examination. Aesthet Surg J 2023; 43: NP1078–NP1082. doi:10.1093/asj/sjad128

5. Hopkins, Nguyen, Dallas, et al. ChatGPT versus the neurosurgical written boards: A comparative analysis of artificial intelligence/machine learning performance on neurosurgical board-style questions. J Neurosurg 2023; 139: 904–911. doi:10.3171/2023.2.Jns23419

6. Humar, Asaad, Bengur, et al. ChatGPT is equivalent to first-year plastic surgery residents: Evaluation of ChatGPT on the Plastic Surgery In-Service Examination. Aesthet Surg J 2023; 43: NP1085–NP1089. doi:10.1093/asj/sjad130

7. Lum ZC. Can artificial intelligence pass the American Board of Orthopaedic Surgery examination? Orthopaedic residents versus ChatGPT. Clin Orthop Relat Res 2023; 481: 1623–1630. doi:10.1097/corr.0000000000002704

8. Suchman, Garg, Trindade AJ. Chat generative pretrained transformer fails the multiple-choice American College of Gastroenterology self-assessment test. Am J Gastroenterol 2023; 118: 2280–2282. doi:10.14309/ajg.0000000000002320

9. Cohen, Alter, Lessans, et al. Performance of ChatGPT in Israeli Hebrew OBGYN national residency examinations. Arch Gynecol Obstet 2023; 308: 1797–1802. doi:10.1007/s00404-023-07185-4

10. Toyama, Harigai, Abe, et al. Performance evaluation of ChatGPT, GPT-4, and Bard on the official board examination of the Japan Radiology Society. Jpn J Radiol 2023; 42(2): 201–207. doi:10.1007/s11604-023-01491-2

11. Birkett, Fowler, Pullen. Performance of ChatGPT on a primary FRCA multiple choice question bank. Br J Anaesth 2023; 131: e34–e35. doi:10.1016/j.bja.2023.04.025

12. Aldridge, Penders. Artificial intelligence and anaesthesia examinations: Exploring ChatGPT as a prelude to the future. Br J Anaesth 2023; 131: e36–e37. doi:10.1016/j.bja.2023.04.033

13. Shay, Kumar, Bellamy, et al. Assessment of ChatGPT success with specialty medical knowledge using anaesthesiology board examination practice questions. Br J Anaesth 2023; 131: e31–e34. doi:10.1016/j.bja.2023.04.017

14. Shay, Kumar, Redaelli, et al. Could ChatGPT-4 pass an anaesthesiology board examination? Follow-up assessment of a comprehensive set of board examination practice questions. Br J Anaesth 2023; 132(1): 172–174. doi:10.1016/j.bja.2023.10.025

15. ANZCA. Primary Examination Reports 2016–2022, https://learn.anzca.edu.au/ (accessed 3 July 2023).

16. ANZCA. 2022.2 Primary Examination Report, https://learn.anzca.edu.au/ (accessed 3 July 2023).

17. ANZCA. ANZCA primary exam (PEx): Recommended reading list, https://libguides.anzca.edu.au/primary (accessed 3 July 2023).

18. OpenAI. ChatGPT can now see, hear, and speak, https://openai.com/blog/chatgpt-can-now-see-hear-and-speak (accessed 12 December 2023).