Abstract
Introduction
ChatGPT has shown remarkable performance in medical licensing examinations such as the United States Medical Licensing Examination. However, limited research exists regarding its performance on national medical licensing exams in low-income countries. In Nepal, where nearly half of the candidates fail the national medical licensing exam, ChatGPT has the potential to contribute to medical education.
Objective
To evaluate ChatGPT's (GPT-4) performance on the Nepal Medical Council Licensing Medical Examination (NMCLE).
Methods
The NMCLE-May 2024 dataset, comprising 900 multiple-choice questions, was used to assess ChatGPT's performance. After excluding 8 questions that contained figures or were otherwise incompatible with text-only input, 892 questions were analyzed. A standardized prompt, including a background description, the question, and the answer choices, was entered for each item. Each response generated by ChatGPT was compared against reference answers provided by experienced clinicians. Descriptive statistics were used to present the results, and regression analysis was employed to determine the association of set, question type, pattern, and subject with incorrect responses.
Results
GPT-4 generated 783 correct responses to 892 questions, an accuracy rate of 87.8%. Incorrect responses were more likely for questions requiring logical reasoning (odds ratio 14.7, 95% confidence interval [CI] 8.94-24.16).
Conclusions
ChatGPT-4 performs at a standard comparable to or above that of medical graduates on the Nepalese undergraduate medical licensing examination. Incorrect responses occurred mainly in questions requiring logical reasoning, underscoring the need for caution when relying on its outputs in such domains. These findings are encouraging and highlight the need for further studies to evaluate its role as an educational resource in Nepalese medical education.
Introduction
There has been growing interest in exploring large language models (LLMs) for biomedical question answering and automated dialogue generation within the medical field.1 Among the many existing models, ChatGPT (Chat Generative Pre-trained Transformer) has demonstrated exceptional language comprehension and generation capabilities, with applications across various domains, including the generation of clinical responses and images, clinical decision support, medical education, literature retrieval, scientific writing, and peer review.2–7 It has passed the United States Medical Licensing Examination (USMLE),8 a Radiology Board-style Examination,9 the Polish Nephrology Specialty Certificate Examination,10 and the Plastic Surgery In-Service Exam,11 achieving outcomes comparable to those of experts. However, failures have also been reported, for example, in the Family Medicine Board Exam12 and in medical, pharmacist, and nurse licensing examinations.1 Recent discussions have emphasized ChatGPT's role in improving access to high-quality education. Assessing ChatGPT's performance on national medical licensing examinations serves as an important proxy for how well it can support medical education.13
The Nepal Medical Council Licensing Examination (NMCLE) is a national-level, professional requirement for medical graduates to practice medicine in Nepal. It is a computer-based test (CBT) totaling 180 marks, primarily based on vignettes related to common diseases and health issues prevalent in Nepal in clinical, surgical, and public health areas, and requiring a minimum score of 90 to pass. Eligible candidates must have completed an MBBS (Bachelor of Medicine and Bachelor of Surgery) program and a 6-month internship. It has a pass rate notably lower than comparable international medical licensing exams. For instance, the pass rate for the January-February 2024 NMCLE was 56.68% for MBBS candidates.14 Undergraduate medical students must work through a large volume and variety of resources to become competent clinicians. Very often, this abundance can overwhelm students and, paradoxically, leave them engaging with none. ChatGPT's ability to process vast amounts of medical literature and generate context-specific explanations could help bridge these gaps. In this context, the application of ChatGPT presents prospective advantages for enhancing medical education in Nepal.
Comparing ChatGPT's performance in the NMCLE to international licensing exams provides insights into its adaptability across diverse educational systems. While ChatGPT's strong performance in USMLE and UK-based exams reflects alignment with Western curricula, its application in Nepal's context, characterized by different cultural and curricular emphases, is yet to be explored. While Western exams emphasize standardized guidelines, ethical scenarios, and evidence-based practice, Nepal's licensing exam assesses locally relevant public health concerns, for example, infectious disease, endemic diseases, community medicine, and context-specific clinical scenarios. To anticipate and prepare for the role of artificial intelligence (AI) in Nepalese medical education, it is imperative to systematically evaluate the potential benefits and inherent risks of ChatGPT in this unique setting. We aimed to assess ChatGPT-4's performance on the NMCLE and identify factors associated with incorrect responses.
Materials and Methods
Study Design
This study utilized a cross-sectional design. The assessment was carried out using GPT-4, specifically the GPT-4-turbo model, an enhanced version of GPT-4 developed by OpenAI. All model testing was conducted using the October 26, 2024 version of ChatGPT, accessed via manual entry on the official OpenAI web interface. This study was conducted from October to November 2025.
Medical Examination Datasets: Inclusion and Exclusion Criteria
Our data source comprised questions from 5 sets (A-E) of the NMCLE, May 2024. These 5 sets were distributed over 3 days to different groups of students. All 5 sets were selected to ensure the inclusivity of questions, making a single year's data representative and providing generalizability to the NMCLE. The dataset, consisting of 180 multiple-choice questions from each set, was subsequently uploaded to a Google Spreadsheet. Eight questions were excluded because they contained visual elements (figures or tables) that could not be accurately assessed through ChatGPT's text-only interface. Thus, a total of 892 questions were included in the final analysis.
Question Format Classification
The questions were categorized into 2 formats: Multiple Choice Questions (MCQs): These were single-stem questions typically testing recall or direct application of knowledge. Example: “What is the most common cause of community-acquired pneumonia in adults?” and Case Study Questions (CSQs): These were vignette-based items that presented a brief clinical scenario, requiring interpretation and decision-making. Example: “A 5-year-old boy presents with fever, sore throat, and a sandpaper-like rash. On examination, there is cervical lymphadenopathy and a ‘strawberry’ tongue. What is the most likely diagnosis?”
Prompt Design
Given the influence of prompt structure on LLM performance, a standardized prompt format was developed and used across all questions. Each prompt included a brief contextual introduction, the full question with answer choices, and an explicit instruction asking the model to return only one best answer with a brief explanation. This structured prompting approach was informed by recent literature highlighting the significant influence of prompt design on LLM performance in medical domains. 15
The general structure was: “This is a multiple-choice question from the Nepal Medical Council Licensing Exam. Please analyze the options and return only the best answer with a brief explanation. Only one option is correct. Question: A 22-year-old female presents with fatigue, pallor, and exertional dyspnea. Lab results reveal Hb: 8.2 g/dL, MCV: 72 fL, serum ferritin: 5 ng/mL. What is the most likely diagnosis? A. Vitamin B12 deficiency B. Iron deficiency anemia C. Aplastic anemia D. Anemia of chronic disease”
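The standardized structure above can be expressed as a simple template. The following is an illustrative sketch only; the study entered prompts manually via the web interface, and the function and variable names here are hypothetical, not part of the study's procedure.

```python
# Hypothetical sketch of the standardized prompt template; the fixed header
# text is quoted from the study, but the helper itself is illustrative.
PROMPT_HEADER = (
    "This is a multiple-choice question from the Nepal Medical Council "
    "Licensing Exam. Please analyze the options and return only the best "
    "answer with a brief explanation. Only one option is correct."
)

def build_prompt(question, options):
    """Assemble one standardized prompt from a question stem and its options."""
    option_text = " ".join(f"{label}. {text}" for label, text in options.items())
    return f"{PROMPT_HEADER}\nQuestion: {question} {option_text}"

prompt = build_prompt(
    "A 22-year-old female presents with fatigue, pallor, and exertional "
    "dyspnea. Lab results reveal Hb: 8.2 g/dL, MCV: 72 fL, serum ferritin: "
    "5 ng/mL. What is the most likely diagnosis?",
    {"A": "Vitamin B12 deficiency", "B": "Iron deficiency anemia",
     "C": "Aplastic anemia", "D": "Anemia of chronic disease"},
)
print(prompt)
```

Using a single template for all 892 items keeps the instruction wording constant, so that differences in model performance reflect the questions rather than prompt variation.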
Evaluation
Board-certified clinicians (who completed residency after their undergraduate degree) in the respective subjects evaluated the responses generated by ChatGPT to determine the selected answer, which was compared with the correct answer. A score of 1 was assigned when ChatGPT's answer matched the correct answer, and a score of 0 when it did not. Questions were classified according to set, type, pattern (MCQ or CSQ), and subject.
Qualitative Evaluation of ChatGPT Responses
Each response was reviewed to identify the selected choice, and ChatGPT responses were evaluated as correct or incorrect. Each response was further classified based on the following criteria.16 First, logical reasoning assessed whether the explanation demonstrated a clear rationale for choosing the answer, applying reasoning to the information provided. Example: A 65-year-old man with long-standing hypertension presents with sudden, severe chest pain radiating to the back. On examination, his blood pressure is markedly different in the 2 arms. What is the most likely diagnosis? Options: (A) Acute myocardial infarction, (B) Aortic dissection, (C) Pulmonary embolism, and (D) Pericarditis. Second, internal information evaluated whether the response used details from the question itself to support the answer. Example: A patient presents with yellow discoloration of the sclera. What is it called? Options: (A) Cyanosis, (B) Pallor, (C) Jaundice, and (D) Erythema. Third, external information examined whether the response incorporated relevant knowledge beyond the question stem to justify the choice. Example: A 24-year-old primigravida at 10 weeks of gestation comes for routine antenatal care. Which live vaccine is contraindicated during pregnancy? Options: (A) Influenza (inactivated), (B) Tetanus toxoid, (C) Hepatitis B, and (D) Varicella. The responses were organized into a Google Spreadsheet for further analysis.
For each incorrect response, we assigned one of the following error categories. A logical error occurred when the response identified the relevant information but failed to apply it correctly to reach an appropriate conclusion. Example: A 60-year-old smoker presents with hematuria, and ultrasound shows a renal mass; the question asks for the most likely diagnosis, and the model selects “renal stone” instead of “renal cell carcinoma,” despite the malignant features described. An information error occurred when the response failed to identify a crucial piece of information, whether from the question stem or from external knowledge, that should have been considered. Example: A 25-year-old woman presents with amenorrhea and a positive urine pregnancy test; the question asks for the most appropriate next investigation, and the model chooses “FSH/LH levels” instead of recognizing that the pregnancy test result was already provided and that “ultrasound pelvis” was the appropriate next step. A statistical error referred to an arithmetic mistake, either direct or indirect, such as an inaccurate estimation of disease prevalence. Example: Asked for the most common cause of community-acquired pneumonia in adults, the model answers “Klebsiella pneumoniae” rather than the correct “Streptococcus pneumoniae,” reflecting an error in estimating disease frequency.
Variable and Outcome
The primary outcome was ChatGPT's performance on the NMCLE, assessed by evaluating whether its responses were correct or incorrect, with a correct response defined as alignment with the answers provided by board-certified clinicians (described above). The independent variables included the type of question, categorized into logical reasoning, internal information, or external knowledge. Other independent variables were the specific dataset, the question format MCQs or CSQs, and the subject. Subjects were classified into basic sciences, internal medicine, surgery, community medicine or public health, forensic medicine, obstetrics and gynecology, and pediatrics.
Statistical Analysis
Data were analyzed using SPSS version 26 (IBM Corp., Armonk, NY, USA). Descriptive statistics were reported as frequency distributions and percentages for categorical variables. To identify predictors of incorrect responses generated by GPT-4, binary logistic regression analysis was performed. The dependent variable was response correctness, dichotomized as incorrect = 1 and correct = 0, so that model coefficients directly reflected the odds of an incorrect response. Independent variables included question type and subject. All predictors were entered simultaneously into the model using the Enter method to assess their independent contributions. Regression coefficients were expressed as odds ratios (ORs) with corresponding 95% confidence intervals (CIs). For categorical predictors, indicator (dummy) coding was applied, with the last category designated as the reference. Model fit was evaluated using the −2 log likelihood statistic and Nagelkerke R². A P-value < .05 was considered statistically significant.
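The odds ratios reported by the logistic regression can be illustrated with a hand calculation. The counts below are hypothetical reconstructions chosen only to match the reported error proportions (51.6% incorrect for logical-reasoning questions, 5.7% for external-information questions); they are not the study's actual cell counts, and this unadjusted 2x2-table OR will differ from the adjusted regression estimate (14.7).

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald 95% CI from a 2x2 table:
    a = exposed & incorrect,   b = exposed & correct,
    c = unexposed & incorrect, d = unexposed & correct."""
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of ln(OR)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi

# Illustrative counts: 48/93 logical-reasoning questions incorrect (51.6%)
# vs 30/530 external-information questions incorrect (5.7%).
or_, lo, hi = odds_ratio_ci(48, 45, 30, 500)
print(f"OR = {or_:.2f}, 95% CI {lo:.2f}-{hi:.2f}")
```

With indicator coding, the regression's OR for a category is exp(its coefficient) relative to the reference category, which is what the table-based formula above computes in the single-predictor case.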
Ethical Considerations
This study adhered to the Helsinki Declaration and is reported in accordance with STROBE guidelines. 17 No humans were involved during the study. Therefore, evaluation by the ethics committee was not considered necessary.
Results
Overall Performance
A total of 892 responses were analyzed, with 783 (87.8%) correct responses.
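The headline accuracy figure follows directly from the reported counts, as a quick arithmetic check confirms:

```python
# Sanity check of the reported accuracy: 783 correct of 892 analyzed questions.
correct, total = 783, 892
accuracy_pct = round(100 * correct / total, 1)
print(accuracy_pct)  # 87.8
```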
Factors Associated With Incorrect Responses
The distribution of correct and incorrect responses across different sets (Set A-E) showed significant variation (P = .004). Incorrect responses ranged from 7.2% in Set E to 17.8% in Set B. Logical reasoning questions had a significantly higher proportion of incorrect responses (51.6%) than external information questions (5.7%) (P < .05; Figure 1). Analysis by subject revealed significant differences (P = .007). Pediatrics had the highest proportion of incorrect responses (22%) (Figure 2). There was no significant difference based on the pattern of questions (Table 1).

Performance of ChatGPT on Different Question Types in the Nepal Medical Council Licensing Medical Examination.

ChatGPT's Performance Across Different Subjects in the Nepal Medical Council Licensing Medical Examination (y-Axis Frequency).
Distribution of GPT-4 Responses by Set, Type of Question, Pattern, and Subject.
Abbreviation: CSQ, Case Study Questions; MCQ, Multiple Choice Question.
Logistic regression with incorrect response as the outcome showed that question type was a significant independent predictor (χ² = 112.3, P < .001, Nagelkerke R² = 0.266). Logical reasoning questions had markedly higher odds of being answered incorrectly compared with external-information questions (OR = 14.7, 95% CI: 8.94-24.16, P < .001). Subject area was not an independent predictor after adjusting for question type (Table 2).
Logistic Regression Analysis of Predictors of Incorrect Responses by GPT-4.
*Significant at P < .05.
Discussion
This study is the first of its kind to assess ChatGPT's performance in a low-resource, English-medium setting where access to advanced technology is not routine. First, we demonstrated that ChatGPT performed at or above the standard of medical graduates on the NMCLE. Second, we found that GPT-4's incorrect responses were associated with questions requiring logical reasoning, indicating potential areas for further research to explore why ChatGPT struggles with certain types of questions, contributing to a deeper understanding of the behavior of LLMs. Furthermore, the NMCLE includes a unique mix of context-specific public health, community medicine, and endemic disease content, offering a novel benchmark to test ChatGPT's adaptability to localized medical priorities.
ChatGPT-4-turbo generated correct responses to 87.8% of the NMCLE questions. As a threshold of 50% is considered the passing standard for the NMCLE, ChatGPT performs at the level expected of a medical graduate in our setting. These results are consistent with conclusions drawn from prior studies in developing nations.13,18 Likewise, the findings resonate with those from developed countries: Yang et al19 reported that ChatGPT-4 answered USMLE questions involving images with 90.7% accuracy, exceeding the passing threshold of approximately 60%. A previous study reported that an accuracy rate of >95% would make ChatGPT a reliable educational tool.20
However, GPT-4-Turbo has previously failed examinations administered in languages other than English, such as the Spanish, Japanese, and Chinese national medical licensing examinations.21–23 Compared with earlier versions of GPT, newer versions perform better on medical licensing exams.24 Several studies have attributed discrepancies in ChatGPT's performance on licensing exams across the globe to language and cultural context. ChatGPT's training data is primarily in English, which likely contributes to poorer performance on licensing exams administered in other languages.1 Additional factors include differences in medical policies, legal requirements, and management practices in each country. ChatGPT's strong performance on the NMCLE likely reflects the exclusive use of English in Nepal's medical education and examination.1 In addition, the sets contained more science-based questions than questions related to policies or local law.
Compared with a meta-analysis by Levin et al24 that found an overall accuracy of 61.1%, ChatGPT's higher success rate in the NMCLE is probably because the test relied more on recall-based knowledge than on logical reasoning or problem-solving. This is consistent with earlier findings that ChatGPT performs better on tests requiring standardized knowledge bases, such as Basic Sciences and Community Medicine,16 than on subjects such as Pediatrics, which call for clinical reasoning.25,26 Half of the NMCLE errors occurred in questions that required logical reasoning. This limitation, in line with results from earlier research, emphasizes the model's difficulty with complex problem-solving.27–29 ChatGPT frequently offers plausible justifications for its responses even when they are inaccurate, which may mislead less knowledgeable users. Furthermore, when asked for references, the model frequently supplied titles, DOIs, and links that were nonexistent or irrelevant, highlighting the need for supervision when using AI-generated citations in academic settings.24,30
ChatGPT's adaptability across question formats, as in our study, highlights its versatility as an educational tool. This adaptability suggests that ChatGPT could be integrated into diverse learning scenarios, exam preparedness, mnemonic creation, rephrasing questions, and the overall learning experience.31,32 In settings with limited faculty availability or restricted access to up-to-date textbooks and journal articles, ChatGPT could help bridge educational gaps by being a one-stop solution for summarizing key concepts, offering explanations, and simulating interactive discussions.26,33 ChatGPT is convenient and accessible for everyone. Additionally, ChatGPT could be integrated into existing medical curricula as a self-assessment tool, allowing students to test their knowledge through AI-generated quizzes. Such an approach may be particularly beneficial in rural medical schools or institutions with faculty shortages, where AI models can provide immediate feedback and explanations. Moreover, structured AI-assisted learning platforms could complement clinical training by providing case-based learning simulations, reinforcing diagnostic reasoning, and offering management guidelines. Additionally, Nepalese students pursuing medical education in non-English settings such as Egypt, China, or the Philippines can integrate ChatGPT early in their studies by using English-language prompts to clarify complex concepts, rehearse clinical scenarios, and review local guidelines, which can accelerate mastery of core knowledge and improve preparedness for both national and international exams.
ChatGPT's errors in logical reasoning raise concerns regarding its applicability in clinical decision making. Therefore, professional judgment remains indispensable, and in clinical practice, the need for careful oversight of AI-generated recommendations is paramount. 27 The observed tendency of ChatGPT to provide plausible but incorrect answers highlights the risk of over-reliance on AI-generated information.
Students and trainees must be aware of these limitations and critically analyze AI-generated responses rather than accepting them at face value. One potential strategy to mitigate this issue is to develop AI models explicitly trained in stepwise clinical reasoning, incorporating real-world patient cases and decision trees. Additionally, integrating ChatGPT into medical education should emphasize critical thinking, ensuring students understand not just what the model suggests but also why certain responses may be flawed. AI tools like ChatGPT should complement, rather than replace, traditional educational approaches, emphasizing a partnership between human expertise and machine capabilities to optimize outcomes.25,27,34
Limitations
To the best of our knowledge, this is the first study to evaluate the performance of GPT-4 on a medical examination in the Nepalese context. This study has several limitations. The NMCLE questions were reconstructed entirely from examinees' memory recall; thus, clinical vignettes might not be fully represented. Questions were not classified by difficulty level. In addition, a sample size was not calculated before the study; instead, we included all questions of the test during the year. The exclusion of questions with figures and tables may introduce selection bias by limiting the dataset to text-based questions, potentially underrepresenting the role of visual interpretation in clinical decision making. Future studies should explore LLMs beyond ChatGPT, such as Microsoft's Bing Chat, Google's Bard, and Meta's LLaMA, to compare their efficacy in medical education.
Conclusions
ChatGPT-4 performs at a standard comparable to or above that of medical graduates on the Nepalese undergraduate medical licensing examination. Incorrect responses were mainly in questions requiring logical reasoning, underscoring the need for caution when relying on its outputs in such domains. These findings are encouraging and highlight the need for further studies to evaluate ChatGPT-4's role as an educational resource in Nepalese medical education.
Footnotes
Acknowledgments
The authors acknowledge the role of all individuals involved in the recall of the questions for this research.
Authors’ Contributions
PL and SP were involved in conceptualization and data curation; DU and AY in formal analysis and resources; PL, SP, DU, AY, and GJ in funding acquisition, investigation and visualization; PL, SP, and AY in methodology; PL, SP, and DU in project administration; DU, AY, and GJK in software; PL in supervision; validation in PL, SP, and GJK; writing—original draft in SP and DU; and PL, AY, and GJK in writing—review and editing.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
The data supporting this study are available from the corresponding author upon reasonable request.
