Abstract
Background
Artificial intelligence (AI), particularly large language models like vision-capable ChatGPT 4.0, is increasingly shaping medical education. While these systems show promise for automated feedback and adaptive assessments, their performance in visually intensive, image-based disciplines remains insufficiently studied.
Objective
This cross-sectional comparative study aimed to compare the performance of ChatGPT 4.0 and undergraduate medical students on standardized, image-based multiple-choice questions in Anatomy, Pathology, and Pediatrics.
Methods
Standardized exams were administered to second-, third-, and fifth-year students, and the same questions were submitted to ChatGPT 4.0 using a two-step deterministic and stochastic protocol. Items with images that ChatGPT 4.0 failed to recognize were excluded. The statistical unit of analysis was the question, and all questions were analyzed as independent question-level observations within each domain. Paired t-tests or Wilcoxon signed-rank tests were used as appropriate, and subgroup analyses were restricted to questions with a discrimination index ≥ 0.1.
Results
Of 90 questions, 52 were eligible for the primary comparative analysis after exclusion of items in which ChatGPT 4.0 failed image recognition. ChatGPT 4.0 significantly underperformed students in Anatomy (mean difference = -0.387, p < 0.00001, Cohen’s d = 2.10) but outperformed students in Pediatrics (mean difference = +0.174, p = 0.00013, Cohen’s d = 0.81); these findings were similar or stronger in discrimination-based subgroup analyses. No Pathology items were eligible for comparative analysis because ChatGPT 4.0 failed image recognition for all Pathology images; therefore, no inference about comparative downstream reasoning can be made for Pathology under this protocol. In the global end-to-end analysis, which scored image-recognition failures as incorrect, ChatGPT 4.0 accuracy was 17.3% in Anatomy, 84.9% in Pediatrics, and 0% in Pathology.
Conclusion
These findings demonstrate marked variability in ChatGPT’s visual reasoning across medical domains, underscoring the need for multimodal integration and critical evaluation of AI applications before adoption in image-dependent medical education settings.
1. Introduction
Artificial intelligence (AI) is rapidly reshaping medical education by providing scalable and adaptive tools for instruction, assessment, and clinical reasoning.1,2 Among these innovations, large language models (LLMs) - such as vision-capable ChatGPT 4.0 3 - have demonstrated the capacity to generate automated feedback, deliver tailored recommendations, and simulate clinical scenarios, offering unique opportunities for advancing medical training. 4 While the adoption of AI holds considerable promise, concerns remain regarding over-reliance on algorithmic outputs, the propagation of bias, and potential disruption of established educational relationships. 5 Moreover, the integration of AI into educational environments introduces important questions about data privacy, ethical use, and regulatory oversight.6,7
In recent years, generative models like ChatGPT 4.0 have gained significant traction among both medical students and educators for information retrieval, examination preparation, and clinical case analysis. 8 Despite their increasing prevalence, little is known about the effectiveness of such models in domains that require the interpretation of complex visual information.9,10 Prior research has primarily compared AI models with practicing clinicians or residents, but direct head-to-head evaluations involving medical students, particularly in image-based assessments, remain limited.11-13
This study addresses this critical gap by systematically comparing the performance of ChatGPT 4.0 3 and undergraduate medical students of pre-clinical and clinical years on standardized, image-based multiple-choice examinations in anatomy, pathology, and pediatrics. We investigate ChatGPT 4.0 performance on image-based multiple-choice questions (MCQs) conditional on the model passing an image-recognition gate, rather than treating the result as an isolated measure of ‘visual reasoning’.
2. Methods
This was a cross-sectional comparative study evaluating ChatGPT 4.0 and medical students on standard image-based medical examinations in Anatomy, Pathology, and Pediatrics for the academic year 2024 – 2025. The study included 70 second-year students for Anatomy, 26 third-year students for Pathology, and 31 fifth-year students for Pediatrics at the European University Cyprus. The examination was conducted independently of formal coursework and did not influence academic records or course grades. Participation was open to all students in the respective academic years and was promoted as part of a broader research initiative exploring the role of artificial intelligence in medical education.
The examination questions were drawn from a combination of validated international question bank sources (e.g., AMBOSS) 14 and custom-designed items created by course coordinators. The latter were locally authored items created by course faculty specifically for the research protocol and aligned with the local curriculum. In the custom items, de novo question text was paired with faculty-selected examination images taken from medical textbooks and presentations used as teaching material in the medical school. These images were further edited and adapted - through annotation, cropping, or arrow labelling - such that the final assembled items were not simple reproductions of standard publicly available question-bank content. All questions were reviewed by two faculty members per course to ensure content relevance and alignment with the corresponding curriculum. Each question consisted of a clinical vignette and an accompanying medical image, with four answer options (A–D): one correct option (key) and three distractors.
Participation was organized by academic curriculum year and discipline: second-year students undertook Anatomy, third-year students completed Pathology, and fifth-year students completed Pediatrics. Anatomy and Pediatrics examinations were conducted online and completed by students at home, whereas Pathology examinations were administered in person under supervised classroom conditions. Each examination consisted of 15 questions completed within 15 minutes (60 seconds per question). Students were informed that these were official course examinations; however, participation was voluntary and performance did not contribute to final course grades. Students were instructed to complete the examinations independently and honestly. Student responses were anonymized and recorded per question as either correct or incorrect. All students in each cohort received the same examination; no randomization or test-order variation was applied.
No formal a priori power analysis was performed due to the exploratory nature of this study and the use of entire class cohorts as a convenience sample. All eligible students in the specified years and disciplines were invited to participate.
2.1. ChatGPT 4.0 Testing Protocol
ChatGPT 4.0 was accessed via the OpenAI web-based interface 3 through the latest stable release of Google Chrome 15 at the time of the study, and testing was performed in January 2025. Because the web interface does not provide a model build/version string, we report the model as ChatGPT 4.0 as labeled in the interface. Following student assessments, the same exams were submitted to ChatGPT 4.0 using a standardized two-step protocol: each evaluation began with Part A (image recognition), followed by Part B (multiple-choice question answering). Part A was treated as deterministic and conducted under standardized web-interface conditions: the temperature was manually set to 0, and a new session was used for each test. Exam materials (including the embedded images) were uploaded in batches as PDF files (e.g., test1.pdf, test2.pdf). For each item, image recognition was elicited using the following fixed prompt template: “You are assisting with an image-based medical exam. Task: IMAGE RECOGNITION ONLY. Carefully examine the attached image and identify what it depicts (e.g., anatomical structure, clinical finding, radiologic feature, or histopathologic diagnosis). Provide a concise identification in 1–2 sentences. Do NOT answer the multiple-choice question and do NOT provide differential diagnoses. If the image cannot be interpreted with confidence, reply exactly: UNABLE_TO_INTERPRET_IMAGE.” Questions for which ChatGPT 4.0 successfully identified the image structure in Part A were included in the subsequent stochastic analysis (Part B); questions that failed Part A were excluded from it.
More specifically, the images used in Part A had the following characteristics:
Anatomy questions featured imaging techniques (CT scans, chest x-rays) and anatomical schematics, with images typically of size ≥1024×768 pixels and resolution ≥300 dpi, available in grayscale or color as needed. Formats included DICOM screenshots and vector-based JPEG or PNG diagrams. These resources were used to assess anatomical localization, spatial reasoning, and clinical applications, such as identifying brachial plexus injuries, understanding hernia anatomy, or locating pacemaker placement.
Pediatrics questions included high-resolution clinical images (800×600 to 1600×1200 pixels) and pediatric radiographs in color JPEG/PNG format. Images were standardized for lighting and white balance to improve diagnostic accuracy for conditions like roseola, Mongolian spots, hemangiomas, or signs of croup.
Pathology questions featured high-definition histopathological photomicrographs (commonly 1600×1200 pixels or at 40× magnification) in JPEG or TIFF format, including hematoxylin & eosin and immunohistochemistry stains. These images highlighted intricate nuclear details vital for recognizing neoplastic, inflammatory, and infectious tissue alterations, such as squamous cell carcinoma, granulomas, and villous adenomas.
Across all exams, images were embedded beneath each clinical vignette in Microsoft Word format using standardized templates. Compression was avoided to preserve image clarity, and content was reviewed independently by two expert faculty members to ensure accuracy, relevance, and educational validity.
The exclusion of items that failed Part A image recognition was determined a priori, ensuring that stochastic performance measures reflected only those questions where the model meaningfully interpreted the visual input.
Accordingly, the primary outcome estimates multiple-choice performance conditional on successful image recognition (i.e., Part B accuracy among Part A–passing items). To contextualize this conditional estimate and provide an end-to-end measure across the full item set, we additionally computed a global (end-to-end) accuracy in which any Part A image-recognition failure was scored as incorrect (accuracy = 0). Global accuracy was computed at the question level by assigning each Part A-passing question its Part B mean accuracy (across the 120 runs) and assigning 0 to each Part A failure, then averaging across all questions within each domain. We report both conditional and global accuracy by domain alongside the Part A pass rate to clarify interpretation.
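The relationship between the Part A pass rate, conditional accuracy, and global (end-to-end) accuracy can be illustrated with a minimal sketch. The study's analyses were run in R; Python is used here only for illustration, and the function name `domain_accuracy`, the dictionary keys, and the toy values are hypothetical and do not come from the study data.

```python
from statistics import mean

def domain_accuracy(items):
    """Summarize one domain.

    items: list of dicts with 'passed_part_a' (bool) and 'part_b_accuracy'
    (mean Part B accuracy in [0, 1], or None when Part A failed).
    """
    n_total = len(items)
    passed = [it["part_b_accuracy"] for it in items if it["passed_part_a"]]
    pass_rate = len(passed) / n_total
    # Conditional accuracy: average over Part A-passing items only
    conditional = mean(passed) if passed else None
    # Global accuracy: Part A failures contribute 0, average over all items
    global_acc = sum(passed) / n_total
    return {"pass_rate": pass_rate, "conditional": conditional, "global": global_acc}

# Hypothetical toy domain: four items, one of which failed Part A
toy_items = [
    {"passed_part_a": True, "part_b_accuracy": 0.50},
    {"passed_part_a": True, "part_b_accuracy": 0.25},
    {"passed_part_a": True, "part_b_accuracy": 0.75},
    {"passed_part_a": False, "part_b_accuracy": None},
]
result = domain_accuracy(toy_items)
print(result)
```

Under this definition, global accuracy equals the pass rate multiplied by the conditional accuracy, so a domain with no Part A-passing items has a global accuracy of 0 and an undefined conditional accuracy.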
For the included questions, each exam was submitted to ChatGPT 4.0 for 120 independent runs, performed manually via the OpenAI web interface, with temperature = 1.0 specified in the prompt to simulate probabilistic variation. Each Part B run was performed in a fresh session with no prior conversational context to prevent carryover effects across runs. We recorded the model’s selected answer for each run and computed summary statistics, including mean accuracy, standard deviation (SD), and standard error (SE) per question. Accordingly, these repeated samples estimate expected accuracy under stochastic generation rather than a single best-effort deterministic attempt. This repeated-measures approach was selected to capture the inherent randomness of generative models and to yield stable performance estimates across diverse clinical image prompts. By employing a standardized sampling method, the analysis reduces the influence of outlier responses or idiosyncratic model behaviors.
A temperature setting of 1.0 was chosen to match the intended construct of this evaluation: expected single-response performance under stochastic generation, rather than maximal (“best-effort”) deterministic exam performance. Under this framework, the 120 repeated runs function as a Monte Carlo estimate of the probability that the model selects the correct option for a given item. Lower temperature settings (e.g., 0–0.2) would substantially reduce output variability and shift the construct toward near-deterministic decoding, whereas self-consistency approaches (e.g., majority vote across samples) represent a different, typically more favorable paradigm that is not directly comparable to the single-attempt condition under which students completed the examinations. This approach follows reproducibility frameworks used in recent large language model evaluation studies. 16
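The distinction between expected single-response accuracy and a majority-vote (self-consistency) score can be sketched as follows. This is an illustrative Python example, not the study's R code; `summarize_runs` and the toy answer distribution are hypothetical, and the SD is assumed to be the sample SD of the 0/1 correctness scores.

```python
from collections import Counter
from math import sqrt
from statistics import mean, stdev

def summarize_runs(answers, key):
    """Summarize repeated runs for one question.

    answers: option letters selected across the repeated runs.
    key: the correct option letter.
    """
    scores = [1 if a == key else 0 for a in answers]
    p_hat = mean(scores)           # Monte Carlo estimate of P(correct)
    sd = stdev(scores)             # sample SD of the 0/1 scores (assumption)
    se = sd / sqrt(len(scores))    # standard error of the mean accuracy
    majority = Counter(answers).most_common(1)[0][0]
    return {
        "mean_accuracy": p_hat,
        "sd": sd,
        "se": se,
        "majority_correct": majority == key,
    }

# Hypothetical distribution of 120 runs for one item whose key is "A"
toy_answers = ["A"] * 48 + ["B"] * 42 + ["C"] * 30
summary = summarize_runs(toy_answers, "A")
print(summary)
```

In this toy item the correct option is the most frequent answer, so a majority vote would score the item as correct, whereas the expected single-response accuracy is only 0.40. This is why the two paradigms are not interchangeable.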
2.2. Statistical Assessment of Question Distributions and Pooling
The statistical unit of analysis was the question, not the student. For each question, student accuracy was calculated as the proportion of students answering correctly and was then compared with the corresponding ChatGPT 4.0 accuracy for that same question within each domain.
In Anatomy, students were divided into two sections, and each section completed the same two test forms. Thus, four Anatomy administrations were conducted, but the question content was duplicated across sections: Test_I1 and Test_II1 contained the same 15 questions (Form 1), and Test_I2 and Test_II2 contained the same 15 questions (Form 2). Because these were identical question sets, student responses were aggregated across sections at the question level within each form to obtain a single per-question student accuracy estimate.
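One way to aggregate duplicated forms at the question level is to pool correct-response counts across sections before computing the proportion. The sketch below assumes this count-based pooling; the source does not state whether counts were pooled or section proportions were averaged, and the function names and numbers are hypothetical (Python is used for illustration only).

```python
def pooled_accuracy(section_counts):
    """Pool one question across sections.

    section_counts: list of (n_correct, n_students) tuples, one per section.
    """
    total_correct = sum(correct for correct, _ in section_counts)
    total_students = sum(n for _, n in section_counts)
    return total_correct / total_students

def mean_of_proportions(section_counts):
    """Unweighted alternative: average the per-section proportions."""
    props = [correct / n for correct, n in section_counts]
    return sum(props) / len(props)

# Hypothetical question answered by two sections of unequal size
counts = [(9, 10), (10, 40)]
pooled = pooled_accuracy(counts)            # 19 / 50
unweighted = mean_of_proportions(counts)    # (0.90 + 0.25) / 2
print(pooled, unweighted)
```

The two approaches coincide when sections are the same size and diverge otherwise; count-based pooling weights every student equally.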
In Pediatrics, students were not divided into sections and completed two distinct 15-question tests. Because the analytic unit was the question, all 30 Pediatrics questions were analyzed as independent question-level observations within the Pediatrics domain.
In Pathology, students completed a single 30-question domain-specific examination, which was analyzed at the question level without any duplicate-form aggregation.
2.3. Statistical Comparison Between Students and ChatGPT
All statistical preprocessing and analyses were performed with R version 4.3.2 (R Foundation for Statistical Computing, Vienna, Austria) using the dplyr, readxl, and ggplot2 packages for data manipulation and visualization. 17 The accuracy score for each question was calculated as the ratio of correct responses, separately for ChatGPT 4.0 and medical students, and compared on a per-question basis within the matched Anatomy, Pediatrics and Pathology domains.
Prior to hypothesis testing, the distribution of paired accuracy differences between ChatGPT 4.0 and student responses (ChatGPT - Student) was planned to be assessed for normality using the Shapiro-Wilk test and histogram visualization. 18 Given the modest number of questions per domain/form (e.g., ∼15), Shapiro-Wilk and histogram visualization were used as screening tools to guide test choice rather than as definitive proof of normality. When distributional assumptions were not supported, we used the Wilcoxon signed-rank test as a distribution-robust alternative. Parametric testing, when applied, was performed on paired per-question differences, and results were interpreted alongside effect sizes and confidence intervals to emphasize magnitude and uncertainty.
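For reference, the signed-rank statistic underlying the Wilcoxon test can be sketched as below. This is a simplified Python illustration with hypothetical names: it drops zero differences, assigns average ranks to ties, and returns only the rank sums. It does not compute a p-value, which in practice would come from statistical software.

```python
def wilcoxon_signed_rank(diffs):
    """Return (W_plus, W_minus) for a list of paired differences."""
    nonzero = [d for d in diffs if d != 0]       # zero differences are dropped
    ordered = sorted(nonzero, key=abs)           # rank by absolute value
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over a run of tied absolute values
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        avg_rank = (i + j) / 2 + 1               # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, ordered) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, ordered) if d < 0)
    return w_plus, w_minus

# Hypothetical integer differences (integers avoid floating-point tie issues)
w_plus, w_minus = wilcoxon_signed_rank([-3, 1, -2, 0, 2])
print(w_plus, w_minus)
```

With real accuracy differences, which are floating-point values, ties would typically be detected after rounding to a fixed number of decimals.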
Depending on the outcome of the normality assessment, either a paired t-test (for normally distributed differences) or the Wilcoxon signed-rank test (for non-normal distributions) would be employed to compare per-question accuracies between ChatGPT 4.0 and medical students within each subject domain. Additionally, Pearson correlation analysis was pre-specified to evaluate linear agreement in accuracy rates between the two groups. All statistical analyses - including normality checks, matched-pair testing, and data visualization - were conducted using a standardized, domain-consistent workflow implemented in R. Visualizations, such as scatter plots and Bland–Altman plots, were generated using the ggplot2 package to support interpretability of comparative performance across questions. 19
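The per-question paired quantities described above can be sketched in a few lines. This Python example is illustrative only (the study used R); the function names and toy accuracies are hypothetical, Cohen's d is assumed to be the paired form (mean difference divided by the SD of the differences), and the limits of agreement are assumed to be the mean difference ± 1.96 SD. No p-value is computed because the Python standard library has no t distribution.

```python
from math import sqrt
from statistics import mean, stdev

def paired_summary(chatgpt_acc, student_acc):
    """Paired summary of per-question accuracies (ChatGPT - Student)."""
    diffs = [c - s for c, s in zip(chatgpt_acc, student_acc)]
    n = len(diffs)
    bias = mean(diffs)                       # mean difference (Bland-Altman bias)
    sd = stdev(diffs)                        # sample SD of the differences
    return {
        "n": n,
        "bias": bias,
        "sd": sd,
        "t_stat": bias / (sd / sqrt(n)),     # paired t statistic, df = n - 1
        "cohen_d": bias / sd,                # paired effect size (assumption)
        "loa": (bias - 1.96 * sd, bias + 1.96 * sd),
    }

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical per-question accuracies for four items
chatgpt = [0.2, 0.3, 0.4, 0.1]
students = [0.6, 0.5, 0.9, 0.6]
summary = paired_summary(chatgpt, students)
r = pearson_r(chatgpt, students)
print(summary, r)
```

Under these assumptions, the Anatomy values reported in Section 3.1.3 appear internally consistent: limits of agreement of −0.748 to −0.026 around a bias of −0.387 correspond to an SD of about 0.184, which gives an effect size magnitude of about 2.10.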
2.4. Subgroup Analysis
Subgroup analyses were pre-specified for Anatomy, Pathology, and Pediatrics to evaluate performance on higher-quality exam items. These analyses included only questions with a discrimination index ≥ 0.1, as calculated by Blackboard’s item analysis tool. 20 The discrimination index measures how well a question distinguishes higher- from lower-performing students, typically using the Pearson correlation between the item score and the total test score. Items with values below 0.1 or negative are typically flagged for review due to poor discriminatory power. For each domain, student and ChatGPT accuracies on qualifying items were compared at the question level using paired tests, with the choice of paired t-test or Wilcoxon signed-rank test determined by the distribution of paired accuracy differences as described in Section 2.3.
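A correlation-based discrimination index and the ≥ 0.1 filter can be sketched as follows. This is a hedged Python illustration, not Blackboard's implementation: the function names are hypothetical, the total score is assumed to include the item itself, and items with no variance are treated as having an undefined index and are excluded.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation; returns None when either variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)

def select_items(responses, threshold=0.1):
    """Return indices of items whose discrimination index is >= threshold.

    responses: one row per student, each row a list of 0/1 item scores.
    """
    totals = [sum(row) for row in responses]    # total score includes the item
    keep = []
    for j in range(len(responses[0])):
        item = [row[j] for row in responses]
        index = pearson_r(item, totals)
        if index is not None and index >= threshold:
            keep.append(j)
    return keep

# Hypothetical responses: four students, three items
toy_responses = [
    [1, 1, 0],
    [1, 0, 1],
    [0, 0, 1],
    [1, 1, 1],
]
kept = select_items(toy_responses)
print(kept)
```

In this toy example the third item is uncorrelated with the total score and is therefore dropped from the subgroup.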
2.5. Ethical Considerations
The study protocol was originally submitted as part of the author’s MD thesis at the European University Cyprus. A detailed version of the protocol was also submitted to the National Bioethics Committee of Cyprus (NBCC) during the project’s ethical review process and is available upon reasonable request. No major methodological deviations were identified between the registered thesis protocol and the analyses reported in this manuscript.
This research received formal ethical approval from the National Bioethics Committee of Cyprus, with documentation issued under decision number EEBK/EP/2023.01.10, dated 23 January 2025. Approval was granted for the project titled: “Chat GPT Against Medical Students: A Comparative Analysis of Image-Based Medical Examination Results.” The manuscript title has since been revised (“ChatGPT-4.0 and Medical Students: A Recognition-Gated Comparative Evaluation on Image-Based Medical Examinations”) for a more neutral academic framing in response to editorial and reviewer feedback; this change does not alter the approved study protocol, methods, or scope.
All participants were 2nd, 3rd, or 5th year medical students at the European University Cyprus. Exams were conducted anonymously, with participation entirely voluntary and without academic consequences. An official informed consent form was distributed and signed prior to data collection, outlining the project scope, anonymity of responses, and data security provisions in accordance with GDPR requirements. Participants were informed of their right to withdraw at any time without penalty.
Data collection took place over six weeks at the European University Cyprus School of Medicine. All data were anonymized at the source, and no personally identifiable information was recorded or processed. The study was classified as minimal-risk educational research, involving no clinical interventions, and was approved for academic dissemination as part of the principal investigator’s MD thesis.
3. Results
To preserve anonymity and align with the non-interventional nature of the study, no demographic data (e.g., age, sex) were collected from student participants. While this precluded subgroup analyses by gender or age, it allowed an unbiased comparison of overall performance across academic years.
Each subject domain comprised 30 image-based MCQs (Anatomy n = 30; Pediatrics n = 30; Pathology n = 30). Across the 90 total questions, 52/90 (57.8%) passed Part A image recognition and were retained for Part B evaluation, while 38/90 (42.2%) were excluded due to image-recognition failure. Exclusion rates differed markedly by domain: Anatomy 8/30 (26.7%) excluded, Pediatrics 0/30 (0%) excluded, and Pathology 30/30 (100.0%) excluded. Accordingly, comparative performance estimates for Anatomy and Pediatrics reflect Part B accuracy conditional on Part A success, whereas no Pathology items met inclusion criteria.
Image-recognition gate performance and conditional vs global (end-to-end) accuracy by domain. Part A pass rate indicates the proportion of items in each domain for which ChatGPT 4.0 successfully identified the image (image-recognition gate). Conditional accuracy represents the mean per-question accuracy in Part B among Part A-passing items, computed from 120 independent runs at temperature = 1.0. Global (end-to-end) accuracy incorporates the full item set by scoring Part A failures as incorrect (accuracy = 0) and averaging across all items within each domain. SD denotes standard deviation.
3.1. Anatomy
3.1.1. Student Performance and Pooling
Student accuracy was evaluated across four Anatomy cohorts (Test_I1, Test_I2, Test_II1, Test_II2). Because the question sets were identical within forms (Form 1 and Form 2) and the score distributions were comparable, results were pooled within each form to obtain a single per-question student accuracy estimate.
3.1.2. ChatGPT 4.0 Performance
Anatomy Part A image-recognition failures by image type (item-level summary). Listed items are Anatomy questions that were excluded from Part B because ChatGPT 4.0 failed deterministic Part A image recognition. Failures are reported by test form (Anatomy 1 vs Anatomy 2), question identifier, and image type (CT scan vs schematic) to characterize failure modes and inform the educational implications of image-dependent assessment.
These questions were excluded as predefined, leaving 22 questions for the final performance analysis. Across these included items, ChatGPT 4.0 accuracy was generally low, with most per-question scores clustering between 0.20 and 0.40.
3.1.3. Comparison Between ChatGPT 4.0 and Students
Across the 22 included Anatomy questions, ChatGPT performed worse than students on most items. On average, the accuracy difference (ChatGPT - Student) was -0.387 (95% CI: -0.469 to -0.305), meaning ChatGPT scored 38.7 percentage points lower than students per question (paired t-test, p < 0.00001; Cohen’s d = 2.10). Agreement between ChatGPT 4.0 and student performance patterns across individual questions was weak (r = 0.167, p = 0.459), indicating that items students found easier were not necessarily easier for the model.
The Bland–Altman plot (Figure 1) shows this as a clear negative bias, with nearly all points below zero. Notably, mean per-question accuracy (averaged across ChatGPT 4.0 and students) clustered below 0.60, indicating low absolute performance for both groups across most Anatomy items.
Figure 1. Bland–Altman plot comparing ChatGPT 4.0 and student per-question accuracy in Anatomy. The x-axis shows the mean per-question accuracy averaged across ChatGPT 4.0 and students, and the y-axis shows the accuracy difference (ChatGPT − Student). The dashed blue line indicates the mean bias (−0.387), demonstrating that ChatGPT 4.0 scored 38.7 percentage points lower than students on average. The dotted red lines represent the 95% limits of agreement (−0.748 to −0.026). Nearly all points fall below zero, confirming a systematic negative bias in ChatGPT 4.0 performance. Notably, mean per-item accuracy values clustered below 0.60, indicating low absolute performance for both groups across most Anatomy items.
3.2. Pediatrics
3.2.1. Student Performance and Pooling
In Pediatrics, the statistical unit of analysis was the question. The domain comprised two distinct 15-question tests, and all 30 questions were analyzed as independent question-level observations within the Pediatrics domain.
3.2.2. ChatGPT 4.0 Performance
In contrast to Anatomy, ChatGPT 4.0 performed very strongly in Pediatrics. Per-question accuracy was consistently high, with most items between 0.90 and 0.95, indicating that the model answered most questions correctly with little variation across items.
3.2.3. Comparison Between ChatGPT 4.0 and Students
Overall, ChatGPT 4.0 outperformed students in Pediatrics. The average accuracy difference (ChatGPT − Student) was +0.174 (95% CI: +0.093 to +0.255), corresponding to a 17.4 percentage point advantage for ChatGPT 4.0 (paired t-test, p = 0.00013; Cohen’s d = 0.81). The Bland–Altman plot (Figure 2) shows that most points lie above zero, with the largest gains occurring on questions where students performed relatively poorly.
Figure 2. Bland–Altman plot for Pediatrics: mean per-question accuracy (x-axis) versus accuracy difference (ChatGPT − Student) (y-axis). The dashed blue line indicates the mean difference (+0.174), showing that ChatGPT 4.0 scored 17.4 percentage points higher than students on average. Dotted red lines denote the 95% limits of agreement (+0.093 to +0.255). Most points lie above zero, indicating that ChatGPT 4.0 exceeded student accuracy on the majority of Pediatrics questions, with variability in differences across items.
3.3. Pathology
The Pathology domain was excluded from the final comparative analysis. Across both administered exams (n = 30 questions), ChatGPT 4.0 failed to identify the image in every question (30/30, 100.0%) during the deterministic image-recognition step (Part A). Per the predefined protocol, these items were excluded from downstream analysis, so no Pathology items met inclusion criteria.
3.4. Subgroup Analysis Among Domains
3.4.1 Subgroup Analysis in Anatomy
A focused analysis was performed on 20 Anatomy questions with a discrimination index ≥ 0.1 (i.e., questions that better differentiated higher- from lower-performing students). In this subset, ChatGPT 4.0 again underperformed students, with a mean difference of −0.401 (95% CI: −0.487 to −0.315; p = 7.7 × 10⁻⁹). The effect size was large (Cohen’s d = 0.81). Compared with the overall Anatomy analysis (Cohen’s d = 2.10), the effect size attenuated in the discrimination-index subset while remaining large, consistent with the restricted item set and a different distribution/variability of per-question accuracy differences within higher-discrimination items (Figure 3).
Figure 3. Bland–Altman plot for accuracy comparison in the validated Anatomy subset (discrimination index ≥ 0.1). Each point represents a single Anatomy question meeting psychometric validation criteria. The x-axis denotes mean per-question accuracy, and the y-axis shows the difference in accuracy (ChatGPT − Student). The dashed blue line indicates the mean difference (−0.401), with dotted red lines marking the 95% limits of agreement (−0.761 to −0.041). Nearly all points lie below the zero line, indicating that ChatGPT 4.0 systematically underperformed students across the majority of higher-discrimination Anatomy items, with variability in the magnitude of differences across questions.
3.4.2. Subgroup Analysis in Pediatrics
A focused analysis was performed on 25 Pediatrics questions with a discrimination index ≥ 0.1. ChatGPT 4.0 continued to outperform students, with a mean accuracy difference of +0.140 (95% CI: 0.084 to 0.196; p = 2.9 × 10⁻⁵). The effect size was very large (Cohen’s d = 1.03). Figure 4 shows that most points lie above zero, indicating a consistent advantage for ChatGPT 4.0 across higher-discrimination pediatric items.
Figure 4. Bland–Altman plot: Pediatrics subgroup (discrimination index ≥ 0.1). Each point represents a Pediatrics question meeting the predefined item discrimination threshold (discrimination index ≥ 0.1), where higher-performing students were more likely to answer correctly than lower-performing students. The x-axis shows the mean per-question accuracy averaged across ChatGPT 4.0 and students, and the y-axis shows the accuracy difference (ChatGPT − Student). The dashed blue line indicates the mean bias (+0.14), demonstrating that ChatGPT 4.0 scored, on average, 14 percentage points higher than students. The dotted red lines represent the 95% limits of agreement (−0.127 to +0.406). Most points lie above zero, indicating an overall performance advantage for ChatGPT 4.0 across higher-discrimination pediatric items, with variability in the magnitude of differences across questions.
4. Discussion
This study evaluated vision-capable ChatGPT 4.0’s performance in image-based Anatomy, Pediatrics, and Pathology examinations by comparing it to undergraduate medical students. A total of 90 questions were initially administered but only 52 were retained for the primary (conditional) analysis because ChatGPT 4.0 failed to identify key visual features in 38 questions during the deterministic image-recognition phase (Part A). In line with our predefined protocol, these failed-recognition items were excluded from downstream Part B evaluation; therefore, we did not empirically characterize the content or “confidence” of the model’s responses on failed-recognition items. Nonetheless, these failures represent a clinically and educationally important capability boundary, as they indicate that the model cannot reliably extract the required visual features from certain medical image modalities. To contextualize this conditional analysis, we additionally report global (end-to-end) accuracy in which Part A failures are scored as incorrect (accuracy = 0), alongside Part A pass rates, to clarify interpretation across domains.
In Anatomy, ChatGPT 4.0 significantly underperformed compared to students, reflecting its difficulty in processing spatially complex anatomical structures.
In contrast, several Pediatrics items paired clinical vignettes with clinical photographs or radiographs, and expert review suggested that approximately 20 of 30 questions provided sufficient diagnostic context in the vignette text for the likely diagnosis to be inferred without sole reliance on image-feature recognition. Accordingly, ChatGPT 4.0 performance in this domain may reflect combined use of narrative cues and visual features rather than visual interpretation alone. Pathology was excluded entirely from analysis, as ChatGPT 4.0 failed to recognize all histological images, preventing any valid downstream comparison. Collectively, these findings underscore the domain- and modality-specific nature of LLM performance and reinforce the need for cautious, evidence-based deployment in image-dependent educational settings.
4.1. Interpretation of Findings
In the Anatomy assessment, ChatGPT underperformed relative to students by an average of 38.7 percentage points (Cohen’s d = 2.10), indicating poor performance on tasks requiring extraction and integration of visuospatial/anatomical relationships from image-based prompts. Although Part A recognition succeeded for included items, Part B performance remained low, consistent with limitations in using the provided visual information to support correct option selection. Furthermore, the lack of correlation between ChatGPT 4.0 and student accuracy implies a fundamental misalignment in how item difficulty is perceived and processed. These findings raise important questions about the model’s ability to emulate human diagnostic reasoning in domains that demand visuospatial integration and structural understanding.
Similarly, the results of the subgroup analysis of anatomy are in strong agreement with the main study results, reinforcing the conclusion that ChatGPT 4.0 underperforms in image-based anatomy tasks. By limiting the analysis to items with acceptable discrimination indices, this approach controls for potential biases introduced by poorly functioning questions and highlights that the model’s limitations are not restricted to flawed items.
These findings align with other recent studies. For instance, Yang et al observed that GPT-4 attained only 25% accuracy on a National Board of Medical Examiners (NBME)-style anatomy shelf exam, which was lower than its performance in other preclinical subjects. They attributed this to the model’s limitations in understanding visual-spatial content. 21 Likewise, Gilson et al noted that ChatGPT 4.0 faced significant difficulties with radiographic and anatomy questions on the United States Medical Licensing Examination (USMLE) Step 1, in contrast to systems-based or pathophysiology topics. 12
In contrast to its performance in Anatomy, ChatGPT 4.0 demonstrated a substantial advantage in the Pediatrics domain, outperforming students by an average of 17.4 percentage points (Cohen’s d = 0.81). This performance gap may be partially explained by the nature of the pediatric clinical vignettes, which were often more direct and text-driven. Such formats likely allowed the model to apply its language-based reasoning capabilities effectively, even when image interpretation was limited. The clarity and structure of these scenarios may have enhanced ChatGPT’s ability to generate accurate responses, particularly in cases with lower visual complexity. These findings are consistent with prior studies showing ChatGPT’s strong performance in text-based medical reasoning. Kung et al reported that GPT-4 correctly answered 89% of USMLE Step 1-style vignettes lacking images, 13 while Han et al found that GPT-4 outperformed most third-year students in pediatric case scenarios without visual components.
The subgroup analysis was also in full agreement with the primary study outcomes, confirming that ChatGPT’s advantage on Pediatrics items persists even after excluding items flagged for poor discrimination. The performance gap remained statistically and practically significant, with a mean improvement of 14 percentage points and a large effect size (Cohen’s d = 1.03). This reinforces the view that ChatGPT’s strengths in pediatric clinical reasoning are not dependent on flawed or ambiguous questions. Rather, its consistently high accuracy across validated test items highlights its potential utility in undergraduate medical education for image-based domains.
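To make the effect sizes quoted in this subsection concrete, the following minimal Python sketch computes a question-level mean difference and a paired-samples Cohen’s d (the mean of the paired differences divided by their standard deviation). It is an illustration under stated assumptions, not the analysis code used in this study: the text does not specify which variant of Cohen’s d was calculated, and the function name `paired_effect` and all accuracy values below are hypothetical.

```python
from statistics import mean, stdev


def paired_effect(model_acc, student_acc):
    """Return (mean difference, paired Cohen's d) for question-level accuracies.

    Each list holds one accuracy value (0 to 1) per question, in the same order.
    ASSUMPTION: d is computed as mean(diff) / SD(diff), one common paired variant.
    """
    if len(model_acc) != len(student_acc):
        raise ValueError("inputs must have one value per question")
    diffs = [m - s for m, s in zip(model_acc, student_acc)]
    if len(diffs) < 2:
        raise ValueError("at least two questions are needed")
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    if sd == 0:
        raise ValueError("differences have zero variance; d is undefined")
    mean_diff = mean(diffs)
    return mean_diff, mean_diff / sd


# Hypothetical per-question accuracies (NOT study data).
model = [0.2, 0.1, 0.4, 0.0]
students = [0.6, 0.7, 0.5, 0.6]

mean_diff, d = paired_effect(model, students)
print(f"mean difference = {mean_diff:.3f}, paired Cohen's d = {d:.2f}")
```

A negative mean difference indicates that the model scored below students on average; the sign of d follows the sign of the mean difference, so reported magnitudes such as d = 2.10 correspond to the absolute value.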
4.2. Domain Differences and Educational Context
The variation in performance highlights modality-specific limitations of ChatGPT 4.0 when used with image input. Anatomy questions require accurate extraction of visuospatial relationships and structural detail, whereas many Pediatrics items provide substantial diagnostic context within the vignette and may be less dependent on fine-grained spatial localization. Overall, these results suggest that general-purpose multimodal LLMs may perform inconsistently across medical image modalities, supporting the need for careful, domain-specific validation prior to educational deployment. Initial findings from models such as Med-PaLM M suggest that vision-language pretraining may help address this challenge. 22
4.3. Pathology Domain Performance
In contrast to Anatomy and Pediatrics, ChatGPT’s performance in Pathology was undermined at the image-recognition stage. In the deterministic step, the model did not identify any of the 30 histological or microscopic images used in the two Pathology exams, so no questions were retained for the stochastic evaluation. This result underscores a significant limitation of current language models, such as ChatGPT 4.0, when dealing with histopathological images.
This is not an isolated finding. Recent studies indicate that although LLMs can reason about pathological processes in text, they do not yet interpret histological images reliably. Ding et al reported that GPT-4 achieved an F1-score (the harmonic mean of precision [positive predictive value, PPV] and sensitivity) of approximately 0.47 when asked to identify tubular adenomas from histopathological slides in the first round, using the untrained model. 23
Pathology, especially microscopic diagnosis, depends on recognizing patterns, understanding cellular architecture, and analyzing staining characteristics. This necessitates high-quality image interpretation and extensive visual training. In our recognition-gated protocol, ChatGPT 4.0 failed deterministic image recognition on all histopathology items, indicating that its image-understanding performance was not sufficient for this modality in the tested setting.
While multimodal models such as Med-PaLM M and GPT-4V aim to integrate visual encoders, their efficacy in histopathology remains insufficiently established in clinical-grade evaluations. 22 By contrast, domain-specific systems trained and validated explicitly for histopathology (e.g., Paige Prostate Detect; Ibex Prostate Detect) have demonstrated clinical-grade performance in their target tasks.24,25 These systems are not directly comparable to a general-purpose LLM; we cite them only to contextualize why purpose-built histopathology models differ from general multimodal models in image-feature extraction. In the context of medical education, this distinction is instructive: specialized vision models are purpose-built for image-feature extraction within a narrow domain, whereas general-purpose LLMs such as ChatGPT 4.0 currently lack validated reliability for microscopic image interpretation. Accordingly, pathology education should preferentially use dedicated histopathology tools (or supervised expert review) for image-dependent teaching and assessment, while limiting general LLM use to text-based adjunctive support.
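For readers less familiar with the metric, the F1-score can be sketched in a few lines of Python. This is only an illustration of the definition given above: the function name `f1_score` is hypothetical, and the counts in the example are invented to produce a value near 0.47; they are not taken from Ding et al.

```python
def f1_score(true_pos, false_pos, false_neg):
    """Harmonic mean of precision (PPV) and sensitivity (recall).

    Returns 0.0 when either component is undefined or zero, a common convention.
    """
    predicted_pos = true_pos + false_pos
    actual_pos = true_pos + false_neg
    if predicted_pos == 0 or actual_pos == 0:
        return 0.0
    precision = true_pos / predicted_pos    # PPV
    sensitivity = true_pos / actual_pos     # recall
    if precision + sensitivity == 0:
        return 0.0
    return 2 * precision * sensitivity / (precision + sensitivity)


# Invented counts for illustration only (NOT data from any cited study).
print(round(f1_score(true_pos=8, false_pos=10, false_neg=8), 2))
```

Because the harmonic mean is dominated by the smaller of its two components, a model cannot reach a high F1-score by maximizing sensitivity at the expense of precision, or the reverse.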
These results underscore that while dedicated AI systems have attained clinical-grade performance, general-purpose LLMs like ChatGPT 4.0 currently lack the validated visual reasoning capabilities for standalone use in pathology education. Until such LLMs can match the reliability of these specialized tools, their role in pathology should remain text-based or adjunctive.
4.4. Strengths and Future Implications
This study provides one of the first direct comparisons between ChatGPT 4.0 and undergraduate medical students on standardized, image-based examinations, offering empirical insight into the domain-specific performance profile of contemporary large language models (LLMs). A major strength is the structured evaluation framework, which paired deterministic image recognition with stochastic performance assessment, thereby restricting analysis to items for which the model demonstrated meaningful image interpretation and supporting a fair comparison. The use of psychometrically informed question selection and robust statistical testing further strengthens the validity of the observed performance differences.
Collectively, the findings demonstrate that performance is highly context- and modality-dependent: ChatGPT 4.0 performed strongly in text-rich, clinically oriented pediatrics questions, yet showed substantial limitations in visually intensive disciplines such as anatomy and pathology. This pattern reinforces the need for future AI systems that can reliably integrate visual and textual inputs, particularly for tasks that hinge on visuospatial relationships or microscopic pattern recognition.
From an educational implementation standpoint, these results support targeted, evidence-based integration of AI tools: leveraging their strengths for formative support in domains dominated by clinical reasoning and structured language, while avoiding unvalidated use in image-dependent assessment or instruction. Future work should evaluate next-generation multimodal models on discipline-specific image types, quantify their educational impact on learning outcomes and error propagation, and develop guardrails that promote transparency, verification, and ethical use within training environments.
4.5. Practical Guidance for Educators
These results support domain-specific integration of LLMs in undergraduate medical education as a supporting tool for image-based MCQ testing. First, in Pediatrics, LLMs may be used for formative support (e.g., explaining differential diagnoses, synthesizing vignette information, generating practice questions, or providing feedback on reasoning), provided that outputs are treated as suggestions rather than authoritative answers. Second, for visually intensive domains, particularly histology and image-feature-dependent assessment, LLMs such as ChatGPT 4.0 should not be used as standalone tools for image interpretation or grading, given the observed image-recognition failures and domain-specific limitations. Third, when LLM outputs are used in educational settings, both educators and learners should critically evaluate the model’s answers and explanations against curated reference sources rather than treating them as inherently reliable. A central concern is that the model may fail to integrate the visual and textual elements of a prompt, instead producing a seemingly rational answer based on whichever cues its training most strongly prioritizes. In image-based tasks, this dissociation increases the risk of fluent but inappropriate or even hallucinatory responses. Accordingly, assessments designed to evaluate visual competence should explicitly test image-feature extraction (e.g., labeling, localization, and description of salient findings) rather than allowing correct responses to be derived primarily from vignette cues, and educators should clearly communicate AI limitations and acceptable-use expectations to learners.
5. Study Limitations
Several limitations should be acknowledged. First, the student examinations were administered under different conditions across domains: Anatomy and Pediatrics were completed remotely at home, whereas Pathology was conducted in person under supervised classroom conditions. We attempted to mitigate this by standardizing test duration across domains (15 questions in 15 minutes; 60 seconds per question), presenting the assessments as official course examinations, and explicitly instructing students to complete them independently and honestly. Nevertheless, because the remote examinations were not proctored and did not contribute to final grades, differential use of external resources or differences in test-taking behavior cannot be excluded. Second, although all images were standardized for format, resolution, and presentation, inherent differences in visual quality (such as brightness, contrast, and clarity of anatomical labelling) could have affected ChatGPT’s recognition accuracy. Third, exclusion of questions due to failed image recognition in Part A reduced the sample size and introduced selection effects. The direction of bias is uncertain: excluding failed-recognition items can inflate conditional performance estimates by restricting analysis to more interpretable items, while also potentially penalizing the model relative to students because no equivalent recognition gate was applied to student responses. Students may therefore have received credit for correct responses without demonstrating accurate image recognition, whereas ChatGPT 4.0 was held to a stricter standard: any failure to identify the image correctly in Part A resulted in exclusion of that question from the comparative analysis.
To address interpretability, we therefore report both conditional performance (Part B among Part A–passing items) and global (end-to-end) accuracy in which Part A failures are scored as incorrect.
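The distinction between these two metrics can be illustrated with a minimal Python sketch. It assumes, hypothetically, that each item is represented as a pair of a Part A recognition flag and a Part B score between 0 and 1; the function name `accuracy_summary` and the example values are illustrative and do not reproduce the study data or scoring code.

```python
def accuracy_summary(items):
    """Return (conditional accuracy, global accuracy) for one domain.

    items: list of (recognized, part_b_score) pairs, where `recognized` is the
    Part A outcome and `part_b_score` is a score in [0, 1] (hypothetical layout).
    Conditional accuracy averages Part B over Part A-passing items only and is
    None when no item passed; global accuracy scores Part A failures as 0.
    """
    if not items:
        raise ValueError("at least one item is required")
    passed_scores = [score for recognized, score in items if recognized]
    conditional = sum(passed_scores) / len(passed_scores) if passed_scores else None
    global_acc = sum(passed_scores) / len(items)
    return conditional, global_acc


# Hypothetical domain where half of the images were recognized in Part A.
mixed = [(True, 1.0), (True, 0.5), (False, 0.0), (False, 0.0)]
print(accuracy_summary(mixed))

# Hypothetical domain where no image was recognized: no conditional estimate
# exists, while global accuracy is zero.
none_recognized = [(False, 0.0), (False, 0.0), (False, 0.0)]
print(accuracy_summary(none_recognized))
```

The second case mirrors the structure of the Pathology result: the conditional estimate is undefined rather than zero, which is why no comparative inference is drawn for that domain, while the end-to-end accuracy is reported as 0%.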
Additionally, this study evaluated only one large language model (ChatGPT 4.0) without comparison to other AI systems, thereby limiting the generalizability of findings across platforms. Finally, the model was tested only in a single configuration and time period (January 2025) using the ChatGPT 4.0 web interface; because ChatGPT 4.0 is a proprietary, continuously updated system and the interface does not provide a model build/version string, performance may differ across time or deployments, limiting reproducibility and generalizability.
6. Conclusion
This study demonstrates domain-specific variability in vision-capable ChatGPT 4.0 performance on image-based MCQs: ChatGPT outperformed students in Pediatrics, underperformed in Anatomy, and no Pathology items met inclusion criteria because all failed deterministic image recognition in Part A. Educationally, these findings support cautious, domain-specific use, leveraging LLMs as adjunctive support for image-based MCQ training in text-rich contexts while prioritizing validated multimodal systems, expert verification, and explicit image-feature assessment for visually intensive domains. Importantly, these findings reflect a recognition-gated protocol and should not be generalized to all image-based assessments without this constraint.
Footnotes
Ethical Considerations
This research received formal ethical approval from the National Bioethics Committee of Cyprus (NBCC) under decision number EEBK/EP/2023.01.10, dated 23 January 2025, for the project titled: “Chat GPT Against Medical Students: A Comparative Analysis of Image-Based Medical Examination Results.” The study was conducted as minimal-risk educational research using anonymized participant responses.
Consent to Participate
Written informed consent to participate was obtained from all student participants prior to data collection. Participation was voluntary and carried no academic consequences. Participants were informed of their right to withdraw at any time without penalty.
Consent for Publication
Not applicable. This manuscript does not contain identifiable individual-level data, images, or videos from participants.
Funding
Dr Andreas Sarantopoulos and Dr Dimitrios Ntourakis received funding from the European Health and Digital Executive Agency (HADEA) under the EU4Health Programme (EU4H) for the project Health Professionals’ and the “DigitAl team” SkillS Advancement (H-PASS) (Topic: EU4H-2022-PJ-06; Project ID: 101101139). The present study is independent of, and not directly related to, this funded project.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
Due to ethical and legal (GDPR) considerations and the conditions of ethics approval and informed consent for this educational study, the underlying anonymized student response dataset is not publicly available; it can be provided by the authors upon reasonable request. Aggregated data supporting the findings are included in the manuscript, and additional de-identified information may be made available from the corresponding author upon reasonable request, subject to institutional approvals and applicable regulations.
