Abstract
Introduction
ChatGPT has shown remarkable performance in medical licensing examinations such as the United States Medical Licensing Examination. However, limited research exists regarding its performance on national medical licensing exams in low-income countries. In Nepal, where nearly half of the candidates fail the national medical licensing exam, ChatGPT has the potential to contribute to medical education.
Objective
To evaluate ChatGPT's (GPT-4) performance on the Nepal Medical Council Licensing Medical Examination (NMCLE).
Methods
The NMCLE-May 2024 dataset, comprising 900 multiple-choice questions, was used to assess ChatGPT's performance. After excluding 8 questions that contained figures or were otherwise incompatible with text-only input, 892 questions were analyzed. A standardized prompt, including a background description, the question, and the answer choices, was entered for each item. Each response generated by ChatGPT was compared against reference answers provided by experienced clinicians. Descriptive statistics were used to present the results, and regression analysis was employed to determine the association of set, question type, pattern, and subject with incorrect responses.
Results
GPT-4 generated 783 correct responses to 892 questions, an accuracy rate of 87.8%. Incorrect responses were more likely for questions requiring logical reasoning (odds ratio 14.7, 95% confidence interval [CI] 8.94-24.16).
Conclusions
ChatGPT-4 performs at a standard comparable to or above that of medical graduates on the Nepalese undergraduate medical licensing examination. Incorrect responses occurred mainly in questions requiring logical reasoning, underscoring the need for caution when relying on its outputs in such domains. These findings are encouraging and highlight the need for further studies to evaluate its role as an educational resource in Nepalese medical education.
Introduction
There has been growing interest in exploring large language models (LLMs) for biomedical question answering and automated dialogue generation within the medical field.1 Among the many existing models, ChatGPT (Chat Generative Pre-trained Transformer) has demonstrated exceptional language comprehension and generation capabilities, with applications across various domains, including the generation of clinical responses and images, clinical decision support, medical education, literature retrieval, scientific writing, and peer review.2–7 It has passed the United States Medical Licensing Examination (USMLE),8 a Radiology Board-style Examination,9 the Polish Nephrology Specialty Certificate Examination,10 and the Plastic Surgery In-Service Exam,11 achieving outcomes comparable to those of experts. However, failures have also been reported, for example, in the Family Medicine Board Exam12 and in medical, pharmacist, and nurse licensing examinations.1 Recent discussions have emphasized ChatGPT's role in improving access to high-quality education. Assessing ChatGPT's performance on national medical licensing examinations serves as an important proxy for how well it can support medical education.13
The Nepal Medical Council Licensing Examination (NMCLE) is a national-level, professional requirement for medical graduates to practice medicine in Nepal. It is a computer-based test (CBT) totaling 180 marks, primarily based on vignettes related to common diseases and health issues prevalent in Nepal in clinical, surgical, and public health areas, and requiring a minimum score of 90 to pass. Eligible candidates must have completed an MBBS (Bachelor of Medicine and Bachelor of Surgery) program and a 6-month internship. It has a pass rate notably lower than comparable international medical licensing exams. For instance, the pass rate for the January-February 2024 NMCLE was 56.68% for MBBS candidates.14 Undergraduate medical students must work through a large volume and variety of resources to become competent clinicians. Very often, this abundance can overwhelm students and, paradoxically, leave them engaging with none. ChatGPT's ability to process vast amounts of medical literature and generate context-specific explanations could help bridge these gaps. In this context, the application of ChatGPT presents prospective advantages for enhancing medical education in Nepal.
Comparing ChatGPT's performance in the NMCLE to international licensing exams provides insights into its adaptability across diverse educational systems. While ChatGPT's strong performance in USMLE and UK-based exams reflects alignment with Western curricula, its application in Nepal's context, characterized by different cultural and curricular emphases, is yet to be explored. While Western exams emphasize standardized guidelines, ethical scenarios, and evidence-based practice, Nepal's licensing exam assesses locally relevant public health concerns, for example, infectious disease, endemic diseases, community medicine, and context-specific clinical scenarios. To anticipate and prepare for the role of artificial intelligence (AI) in Nepalese medical education, it is imperative to systematically evaluate the potential benefits and inherent risks of ChatGPT in this unique setting. We aimed to assess ChatGPT-4's performance on the NMCLE and identify factors associated with incorrect responses.
Materials and Methods
Study Design
This study utilized a cross-sectional design. The assessment was carried out using GPT-4, specifically the GPT-4-turbo model, an enhanced version of GPT-4 developed by OpenAI. All model testing was conducted using the October 26, 2024 version of ChatGPT, accessed via manual entry on the official OpenAI web interface. This study was conducted from October to November 2025.
Medical Examination Datasets: Inclusion and Exclusion Criteria
Our data source comprised questions from 5 sets (A-E) of the NMCLE, May 2024. These 5 sets were distributed over 3 days to different groups of students. All 5 sets were selected to ensure the inclusivity of questions, making a single year's data representative and providing generalizability to the NMCLE. The dataset, consisting of 180 multiple-choice questions from each set, was subsequently uploaded to a Google Spreadsheet. Eight questions were excluded because they contained visual elements (figures or tables) that could not be accurately assessed through ChatGPT's text-only interface. Thus, a total of 892 questions were included in the final analysis.
Question Format Classification
The questions were categorized into 2 formats: Multiple Choice Questions (MCQs): These were single-stem questions typically testing recall or direct application of knowledge. Example: “What is the most common cause of community-acquired pneumonia in adults?” and Case Study Questions (CSQs): These were vignette-based items that presented a brief clinical scenario, requiring interpretation and decision-making. Example: “A 5-year-old boy presents with fever, sore throat, and a sandpaper-like rash. On examination, there is cervical lymphadenopathy and a ‘strawberry’ tongue. What is the most likely diagnosis?”
Prompt Design
Given the influence of prompt structure on LLM performance, a standardized prompt format was developed and used across all questions. Each prompt included a brief contextual introduction, the full question with answer choices, and an explicit instruction asking the model to return only one best answer with a brief explanation. This structured prompting approach was informed by recent literature highlighting the significant influence of prompt design on LLM performance in medical domains. 15
The general structure was: “This is a multiple-choice question from the Nepal Medical Council Licensing Exam. Please analyze the options and return only the best answer with a brief explanation. Only one option is correct. Question: A 22-year-old female presents with fatigue, pallor, and exertional dyspnea. Lab results reveal Hb: 8.2 g/dL, MCV: 72 fL, serum ferritin: 5 ng/mL. What is the most likely diagnosis? A. Vitamin B12 deficiency B. Iron deficiency anemia C. Aplastic anemia D. Anemia of chronic disease”
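The standardized structure above can be expressed as a simple template. The following is an illustrative sketch only; the study entered prompts manually via the web interface, and the function and variable names here are hypothetical, not part of the study's procedure.

```python
# Hypothetical sketch of the standardized prompt template; the fixed header
# text is quoted from the study, but the helper itself is illustrative.
PROMPT_HEADER = (
    "This is a multiple-choice question from the Nepal Medical Council "
    "Licensing Exam. Please analyze the options and return only the best "
    "answer with a brief explanation. Only one option is correct."
)

def build_prompt(question, options):
    """Assemble one standardized prompt from a question stem and its options."""
    option_text = " ".join(f"{label}. {text}" for label, text in options.items())
    return f"{PROMPT_HEADER}\nQuestion: {question} {option_text}"

prompt = build_prompt(
    "A 22-year-old female presents with fatigue, pallor, and exertional "
    "dyspnea. Lab results reveal Hb: 8.2 g/dL, MCV: 72 fL, serum ferritin: "
    "5 ng/mL. What is the most likely diagnosis?",
    {"A": "Vitamin B12 deficiency", "B": "Iron deficiency anemia",
     "C": "Aplastic anemia", "D": "Anemia of chronic disease"},
)
print(prompt)
```

Using a single template for all 892 items keeps the instruction wording constant, so that differences in model performance reflect the questions rather than prompt variation.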
Evaluation
Board-certified clinicians (who completed residency after their undergraduate degree) in the respective subjects evaluated the responses generated by ChatGPT to determine the selected answer, which was compared with the correct answer. A score of 1 was assigned when ChatGPT's answer matched the correct answer, and a score of 0 when it did not. Questions were classified according to set, type, pattern (MCQ or CSQ), and subject.
Qualitative Evaluation of ChatGPT Responses
Each response was reviewed to identify the selected choice, and ChatGPT responses were evaluated as correct or incorrect. Each response was further classified based on the following criteria.16 First, logical reasoning assessed whether the explanation demonstrated a clear rationale for choosing the answer, applying reasoning to the information provided. Example: A 65-year-old man with long-standing hypertension presents with sudden, severe chest pain radiating to the back. On examination, his blood pressure is markedly different in the 2 arms. What is the most likely diagnosis? Options: (A) Acute myocardial infarction, (B) Aortic dissection, (C) Pulmonary embolism, and (D) Pericarditis. Second, internal information evaluated whether the response used details from the question itself to support the answer. Example: A patient presents with yellow discoloration of the sclera. What is it called? Options: (A) Cyanosis, (B) Pallor, (C) Jaundice, and (D) Erythema. Third, external information examined whether the response incorporated relevant knowledge beyond the question stem to justify the choice. Example: A 24-year-old primigravida at 10 weeks of gestation comes for routine antenatal care. Which live vaccine is contraindicated during pregnancy? Options: (A) Influenza (inactivated), (B) Tetanus toxoid, (C) Hepatitis B, and (D) Varicella. The responses were organized into a Google Spreadsheet for further analysis.
For each incorrect response, we assigned one of the following error categories. A logical error occurred when the response identified the relevant information but failed to apply it correctly to reach an appropriate conclusion. Example: A 60-year-old smoker presents with hematuria, and ultrasound shows a renal mass; the question asks for the most likely diagnosis, and the model selects “renal stone” instead of “renal cell carcinoma,” despite the malignant features described. An information error occurred when the response failed to identify a crucial piece of information, whether from the question stem or from external knowledge, that should have been considered. Example: A 25-year-old woman presents with amenorrhea and a positive urine pregnancy test; the question asks for the most appropriate next investigation, and the model chooses “FSH/LH levels” instead of recognizing that the pregnancy test result was already provided and that “ultrasound pelvis” was the appropriate next step. A statistical error referred to an arithmetic mistake, either direct or indirect, such as an inaccurate estimation of disease prevalence. Example: Asked for the most common cause of community-acquired pneumonia in adults, the model answers “Klebsiella pneumoniae” rather than the correct “Streptococcus pneumoniae,” reflecting an error in estimating disease frequency.
Variable and Outcome
The primary outcome was ChatGPT's performance on the NMCLE, assessed by evaluating whether its responses were correct or incorrect, with a correct response defined as alignment with the answers provided by board-certified clinicians (described above). The independent variables included the type of question, categorized into logical reasoning, internal information, or external knowledge. Other independent variables were the specific dataset, the question format MCQs or CSQs, and the subject. Subjects were classified into basic sciences, internal medicine, surgery, community medicine or public health, forensic medicine, obstetrics and gynecology, and pediatrics.
Statistical Analysis
Data were analyzed using SPSS version 26 (IBM Corp., Armonk, NY, USA). Descriptive statistics were reported as frequency distributions and percentages for categorical variables. To identify predictors of incorrect responses generated by GPT-4, binary logistic regression analysis was performed. The dependent variable was response correctness, dichotomized as incorrect = 1 and correct = 0, so that model coefficients directly reflected the odds of an incorrect response. Independent variables included question type and subject. All predictors were entered simultaneously into the model using the Enter method to assess their independent contributions. Regression coefficients were expressed as odds ratios (ORs) with corresponding 95% confidence intervals (CIs). For categorical predictors, indicator (dummy) coding was applied, with the last category designated as the reference. Model fit was evaluated using the −2 log likelihood statistic and Nagelkerke R². A P-value < .05 was considered statistically significant.
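The odds ratios reported by the logistic regression can be illustrated with a hand calculation. The counts below are hypothetical reconstructions chosen only to match the reported error proportions (51.6% incorrect for logical-reasoning questions, 5.7% for external-information questions); they are not the study's actual cell counts, and this unadjusted 2x2-table OR will differ from the adjusted regression estimate (14.7).

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald 95% CI from a 2x2 table:
    a = exposed & incorrect,   b = exposed & correct,
    c = unexposed & incorrect, d = unexposed & correct."""
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of ln(OR)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi

# Illustrative counts: 48/93 logical-reasoning questions incorrect (51.6%)
# vs 30/530 external-information questions incorrect (5.7%).
or_, lo, hi = odds_ratio_ci(48, 45, 30, 500)
print(f"OR = {or_:.2f}, 95% CI {lo:.2f}-{hi:.2f}")
```

With indicator coding, the regression's OR for a category is exp(its coefficient) relative to the reference category, which is what the table-based formula above computes in the single-predictor case.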
Ethical Considerations
This study adhered to the Helsinki Declaration and is reported in accordance with STROBE guidelines. 17 No humans were involved during the study. Therefore, evaluation by the ethics committee was not considered necessary.
Results
Overall Performance
A total of 892 responses were analyzed, with 783 (87.8%) correct responses.
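The headline accuracy figure follows directly from the reported counts, as a quick arithmetic check confirms:

```python
# Sanity check of the reported accuracy: 783 correct of 892 analyzed questions.
correct, total = 783, 892
accuracy_pct = round(100 * correct / total, 1)
print(accuracy_pct)  # 87.8
```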
Factors Associated With Incorrect Responses
The distribution of correct and incorrect responses across different sets (Set A-E) showed significant variation (P = .004). Incorrect responses ranged from 7.2% in Set E to 17.8% in Set B. Logical reasoning questions had a significantly higher proportion of incorrect responses (51.6%) than external information questions (5.7%) (P < .05; Figure 1). Analysis by subject revealed significant differences (P = .007). Pediatrics had the highest proportion of incorrect responses (22%) (Figure 2). There was no significant difference based on the pattern of questions (Table 1).

Performance of ChatGPT on Different Question Types in the Nepal Medical Council Licensing Medical Examination.

ChatGPT's Performance Across Different Subjects in the Nepal Medical Council Licensing Medical Examination (y-Axis Frequency).
Distribution of GPT-4 Responses by Set, Type of Question, Pattern, and Subject.
Abbreviation: CSQ, Case Study Questions; MCQ, Multiple Choice Question.
Logistic regression with incorrect response as the outcome showed that question type was a significant independent predictor (χ² = 112.3, P < .001, Nagelkerke R² = 0.266). Logical reasoning questions had markedly higher odds of being answered incorrectly compared with external-information questions (OR = 14.7, 95% CI: 8.94-24.16, P < .001). Subject area was not an independent predictor after adjusting for question type (Table 2).
Logistic Regression Analysis of Predictors of Incorrect Responses by GPT-4.
*Significant at P < .05.
Discussion
This study is the first of its kind to assess ChatGPT's performance in a low-resource, English-medium setting where access to advanced technology is not routine. First, we demonstrated that ChatGPT performed at or above the standard of medical graduates on the NMCLE. Second, we found that GPT-4's incorrect responses were associated with questions requiring logical reasoning, indicating potential areas for further research to explore why ChatGPT struggles with certain types of questions, contributing to a deeper understanding of the behavior of LLMs. Furthermore, the NMCLE includes a unique mix of context-specific public health, community medicine, and endemic disease content, offering a novel benchmark to test ChatGPT's adaptability to localized medical priorities.
ChatGPT-4-turbo generated correct responses to 87.8% of the NMCLE questions. As a threshold of 50% is considered the passing standard for the NMCLE, ChatGPT performs at the level expected of a medical graduate in our setting. These results are consistent with conclusions drawn from prior studies in developing nations.13,18 Likewise, the findings resonate with those from developed countries: Yang et al19 reported that ChatGPT-4 answered USMLE questions involving images with 90.7% accuracy, exceeding the passing threshold of approximately 60%. A previous study reported that an accuracy rate of >95% would make ChatGPT a reliable educational tool.20
However, GPT-4-Turbo has previously failed examinations administered in languages other than English, such as the Spanish, Japanese, and Chinese national medical licensing examinations.21–23 Compared with earlier versions of GPT, newer versions perform better on medical licensing exams.24 Several studies have attributed discrepancies in ChatGPT's performance on licensing exams across the globe to language and cultural context. ChatGPT's training data is primarily in English, which likely contributes to poorer performance on licensing exams administered in other languages.1 Additional factors include differences in medical policies, legal requirements, and management practices in each country. ChatGPT's strong performance on the NMCLE likely reflects the exclusive use of English in Nepal's medical education and examination.1 In addition, the sets contained more science-based questions than questions related to policies or local law.
Compared with a meta-analysis by Levin et al24 that found an overall accuracy of 61.1%, ChatGPT's higher success rate in the NMCLE is probably because the test relied more on recall-based knowledge than on logical reasoning or problem-solving. This is consistent with earlier findings that ChatGPT performs better on tests requiring standardized knowledge bases, such as Basic Sciences and Community Medicine,16 than on subjects such as Pediatrics, which call for clinical reasoning.25,26 Half of the NMCLE errors occurred in questions that required logical reasoning. This limitation, in line with results from earlier research, emphasizes the model's difficulty with complex problem-solving.27–29 ChatGPT frequently offers plausible justifications for its responses even when they are inaccurate, which may mislead less knowledgeable users. Furthermore, when asked for references, the model frequently supplied titles, DOIs, and links that were nonexistent or irrelevant, highlighting the need for supervision when using AI-generated citations in academic settings.24,30
ChatGPT's adaptability across question formats, as in our study, highlights its versatility as an educational tool. This adaptability suggests that ChatGPT could be integrated into diverse learning scenarios, exam preparedness, mnemonic creation, rephrasing questions, and the overall learning experience.31,32 In settings with limited faculty availability or restricted access to up-to-date textbooks and journal articles, ChatGPT could help bridge educational gaps by being a one-stop solution for summarizing key concepts, offering explanations, and simulating interactive discussions.26,33 ChatGPT is convenient and accessible for everyone. Additionally, ChatGPT could be integrated into existing medical curricula as a self-assessment tool, allowing students to test their knowledge through AI-generated quizzes. Such an approach may be particularly beneficial in rural medical schools or institutions with faculty shortages, where AI models can provide immediate feedback and explanations. Moreover, structured AI-assisted learning platforms could complement clinical training by providing case-based learning simulations, reinforcing diagnostic reasoning, and offering management guidelines. Additionally, Nepalese students pursuing medical education in non-English settings such as Egypt, China, or the Philippines can integrate ChatGPT early in their studies by using English-language prompts to clarify complex concepts, rehearse clinical scenarios, and review local guidelines, which can accelerate mastery of core knowledge and improve preparedness for both national and international exams.
ChatGPT's errors in logical reasoning raise concerns regarding its applicability in clinical decision making. Therefore, professional judgment remains indispensable, and in clinical practice, the need for careful oversight of AI-generated recommendations is paramount. 27 The observed tendency of ChatGPT to provide plausible but incorrect answers highlights the risk of over-reliance on AI-generated information.
Students and trainees must be aware of these limitations and critically analyze AI-generated responses rather than accepting them at face value. One potential strategy to mitigate this issue is to develop AI models explicitly trained in stepwise clinical reasoning, incorporating real-world patient cases and decision trees. Additionally, integrating ChatGPT into medical education should emphasize critical thinking, ensuring students understand not just what the model suggests but also why certain responses may be flawed. AI tools like ChatGPT should complement, rather than replace, traditional educational approaches, emphasizing a partnership between human expertise and machine capabilities to optimize outcomes.25,27,34
Limitations
To the best of our knowledge, this is the first study to evaluate the performance of GPT-4 on a medical examination in the Nepalese context. This study has several limitations. The NMCLE questions were reconstructed entirely from examinees' memory recall; thus, clinical vignettes might not be fully represented. Questions were not classified by difficulty level. In addition, a sample size was not calculated before the study; instead, we included all questions of the test during the year. The exclusion of questions with figures and tables may introduce selection bias by limiting the dataset to text-based questions, potentially underrepresenting the role of visual interpretation in clinical decision making. Future studies should explore LLMs beyond ChatGPT, such as Microsoft's Bing Chat, Google's Bard, and Meta's LLaMA, to compare their efficacy in medical education.
Conclusions
ChatGPT-4 performs at a standard comparable to or above that of medical graduates on the Nepalese undergraduate medical licensing examination. Incorrect responses were mainly in questions requiring logical reasoning, underscoring the need for caution when relying on its outputs in such domains. These findings are encouraging and highlight the need for further studies to evaluate ChatGPT-4's role as an educational resource in Nepalese medical education.
Footnotes
Acknowledgments
The authors acknowledge the role of all individuals involved in the recall of the questions for this research.
Authors’ Contributions
PL and SP were involved in conceptualization and data curation; DU and AY in formal analysis and resources; PL, SP, DU, AY, and GJ in funding acquisition, investigation and visualization; PL, SP, and AY in methodology; PL, SP, and DU in project administration; DU, AY, and GJK in software; PL in supervision; validation in PL, SP, and GJK; writing—original draft in SP and DU; and PL, AY, and GJK in writing—review and editing.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
The data supporting this study are available from the corresponding author upon reasonable request.
