A Performance-Based Rubric for Generative AI Use in Medical Students’ Research Tasks: Development and Initial Psychometric Evaluation

Abstract

Background

As generative AI becomes embedded in medical training, patient safety depends on graduates’ ability to recognize AI limitations and bias, document AI involvement transparently, and verify AI-generated information rather than accept it uncritically. We developed a performance-based rubric to assess observable generative AI (LLM) literacy behaviors within authentic coursework.

Methods

In a single-institution evaluation (Spring 2025), third-year medical students (n = 50 submissions) completed a structured research proposal and submitted the corresponding AI chat transcript and an AI-use disclosure. A four-domain rubric was developed through three pilot–revise cycles: AI Use Documentation, Prompt Generation, Verification, and Integration. Each domain was scored 0–3 (total 0-12). Three educators independently scored all submissions. Inter-rater reliability was assessed using ICC (average-measures, agreement). Construct-relevant patterns were examined via domain distributions (floor effects), performance bands (lower 25%, middle 50%, upper 25%), within-submission differences across domains (Friedman with Bonferroni-adjusted Wilcoxon tests), inter-domain associations (Spearman), and correlation with overall GPA (Spearman).

Results

Mean (SD) domain scores were: AI Use Documentation 0.67 (1.08), Prompt Generation 1.33 (0.69), Verification 0.41 (0.71), and Integration 1.64 (0.67); total score 4.06 (1.80). Floor effects were substantial for AI Use Documentation (64% scored 0) and Verification (60% scored 0). Inter-rater reliability was high (ICC: Documentation 0.99, Prompt Generation 0.84, Verification 0.93, Integration 0.83). Verification was significantly lower than Prompt Generation and Integration (Bonferroni-adjusted p < 0.008). Inter-domain correlations were weak (ρ −0.206 to 0.310). Total scores showed no significant association with GPA (Spearman ρ = 0.194, p = 0.201).

Conclusions

This rubric demonstrated strong scoring reliability and produced initial psychometric evidence consistent with measuring distinct, observable LLM-use competencies. Findings highlight prominent gaps in verification and transparent documentation, reinforcing competency guidance that emphasizes recognizing AI limitations and verifying AI output to protect patient safety. Further multi-site validation and implementation work is warranted.

Keywords

Artificial intelligence; AI literacy; medical students; performance-based assessment; medical education

Introduction

Artificial intelligence (AI) is rapidly expanding in healthcare, with applications spanning diagnostics, clinical decision support, and system-level operations.¹,² Generative AI refers to systems that produce new content (most commonly text) in response to prompts; large language models (LLMs) are a prominent form of generative AI that can support tasks relevant to medical training and future practice (eg, synthesizing information, drafting structured documents, and supporting research writing). Importantly, LLM use is already emerging in everyday clinical workflows. In a real-world implementation study, emergency physicians used an LLM assistant to draft discharge documentation, with reduced documentation time and perceived workload and decreasing concerns over time.³ Survey evidence also suggests real-world adoption across settings: UK general practitioners reported using generative AI chatbots to assist with aspects of clinical practice.⁴ In addition, a global cross-sectional survey of healthcare professionals reported widespread ChatGPT use across clinical, research, and educational activities, while also noting concerns about accuracy, privacy, and related risks.⁵

However, LLM outputs may be incomplete or misleading and can present inaccurate content with high plausibility; for example, evaluations of ChatGPT in medical question answering have documented problems such as fabricated or unverifiable citations, highlighting why verification is essential when these tools are used in health-related contexts.⁶,⁷

In parallel, medical students are increasingly using conversational AI tools to support studying and academic writing tasks, including summarizing content and drafting text.⁸ Yet this uptake often occurs outside structured curricula; even in well-resourced medical education settings, formal training in AI remains limited: large student surveys show that substantial proportions of medical students already use AI chat tools, while curricular teaching on AI/AI ethics is reported as minimal or inadequate.⁹,¹⁰

Despite growing interest in “AI literacy,” most existing assessments in medical education rely on self-report, which measures perceived competence rather than demonstrated performance. Laupichler et al¹¹ used validated self-assessment instruments to examine students’ AI literacy across domains such as technical understanding, critical appraisal, and practical application. Kimiafar similarly synthesizes evidence that preparedness and operational AI literacy are variable across healthcare professionals and students, but the literature remains dominated by perception-based measures.¹² A BEME scoping review likewise emphasizes that many studies use questionnaires and other non-behavioral approaches, offering limited insight into how learners actually engage with AI in practice.¹³

To address this limitation, the present study evaluates Generative AI (LLM) literacy using authentic student work products and submitted LLM chat transcripts from an Evidence-Based Medicine research-proposal assignment. We operationalize Generative AI (LLM) literacy as observable competencies in AI Use Documentation, Prompt Generation, Verification of AI-generated content, and Integration of AI-supported output into a final academic product. This focus differs from broader AI literacy constructs that also encompass understanding and evaluation of machine-learning/deep-learning prediction models used in clinical decision support; accordingly, the findings primarily characterize LLM-use behaviors in academic research writing rather than general AI competence across clinical AI systems.

Methods

Study Design and Setting

This was a single-institution, coursework-embedded observational study evaluating Generative AI (LLM) literacy using a performance-based rubric applied to authentic student coursework artifacts. The study was conducted within the mandatory Evidence-Based Medicine (EBM) course for third-year MD students (6-year program) at Ken Walker International University (Tbilisi, Georgia). Data were collected during Spring 2025.

Participants and Eligibility Criteria

All students enrolled in the third-year EBM course during Spring 2025 were eligible. Inclusion criteria were enrollment in the course and submission of the individual final research proposal with the corresponding AI chat transcript. Students were required to submit the AI chat transcript as a shared-chat record to document how they interacted with the LLM during completion of the assignment under real-world course conditions. Exclusion criteria were non-completion of the course, non-submission of the final project, or failure to provide the AI chat transcript. No students met the exclusion criteria because submission of both the project and the AI chat transcript was mandatory for final grading; therefore, the final sample comprised 50 students (n = 50).

Sample Size Justification

The sample represented a census of the course cohort (n = 50) and was determined by enrollment rather than an a priori power calculation. This cohort-based sample was used for initial tool development and inter-rater reliability assessment in an authentic educational setting; however, the single-site design limits generalizability and warrants replication in larger, multi-institution samples.

Assignment and Submitted Materials

The final assignment required submission of a structured research proposal including: a clearly defined hypothesis; specification of the study population; sampling strategy and sample size calculation; study design; statistical approach to test the hypothesis; and identification of potential biases with mitigation strategies and residual limitations. Students were instructed to provide justification and rationale for each decision rather than listing choices.

Each submission included 1) the research proposal, and 2) the corresponding AI chat transcript (submitted as shared chats).

Students also completed an AI use disclosure, describing how AI was used (eg, generating content, verifying content, or other support) and how it informed the final proposal.

LLM Use Policy and Curricular Context

To capture students’ typical interaction patterns, use of an LLM was required for this assignment component, and students were permitted to use any LLM tool. The primary requirement was submission of the complete AI chat transcript (shared-chat record), with no additional constraints (eg, no mandated prompting template, no prescribed verification steps). Students were not provided with a verification documentation template and were not explicitly required to document verification within the chat; therefore, verification could be credited only when students chose to demonstrate it in the transcript and/or reflected it in the final proposal. At the time of the study, the curriculum included no formal instruction in Generative AI/LLM literacy; therefore, this evaluation reflects baseline, real-world student behaviors under routine course conditions. Based on the submitted transcripts, all students used ChatGPT.

During the assignment briefing, students were instructed to conduct all AI-assisted work related to the project within a single continuous chat and to submit the complete record. This requirement was intended to document the interaction process and preserve an auditable trail, maximizing capture of observable behaviors such as iterative prompting, clarification requests, and any verification attempts that occurred within the recorded workflow.

Rubric Development and Final Domains

A rubric to assess applied Generative AI (LLM) literacy was developed through an iterative, practice-based process over three revision cycles, with all authors involved in review and refinement. In each cycle, the draft rubric was piloted on five sample submissions, and revisions were guided by rater feedback emphasizing clarity, feasibility, and the ability to distinguish performance levels. Across cycles, we refined performance anchors to avoid awarding points when no meaningful AI-use behaviors were evident, expanded the scoring range to improve discrimination, and standardized terminology to reflect observable behaviors.

The rubric underwent three pilot–revise cycles using authentic submissions; in each cycle, raters independently applied the draft rubric and provided structured feedback that guided revisions to enhance clarity, feasibility, and alignment with observable LLM-use behaviors. Early versions included broader academic-quality criteria, but piloting showed these were time-intensive and could inadvertently reward general writing ability rather than AI-use practices. The rubric was therefore streamlined to focus on four core, observable competencies in transcript–proposal pairs: (1) AI Use Documentation (clarity and traceability of LLM use in the submitted materials), (2) Prompt Generation (task alignment, specificity, and actionability of prompts for proposal-relevant outputs), (3) Verification (observable evidence-seeking and challenge of LLM outputs within the submitted chat and/or reflected in the written proposal), and (4) Integration (effective adaptation and incorporation of LLM-supported content into the proposal in a coherent, value-adding manner). To improve discrimination across performance levels, the scoring scale was expanded from 0–2 to 0–3 per domain with behaviorally anchored descriptors. Across all iterations, scoring was restricted to evidence visible in the submitted proposal and chat transcript. Behaviorally anchored descriptors for each score level are provided in Table 1.

Table 1.

Generative AI (LLM) Literacy Scoring Rubric.

Rubric Domain	0	1	2	3
AI Use Documentation (AD)	Misleading: Tool/provider not stated or wrong; transcript missing/irrelevant; disclosure conflicts with transcript/final work so AI contribution cannot be traced.	Partial: Tool may be named but key details unclear; transcript incomplete with key steps omitted; disclosure vague or partly inconsistent.	Adequate: Tool/model and purpose stated; main stages documented with minor gaps; largely consistent with transcript and final work.	Traceable: Provider/tool/model specified; substantially complete task-relevant transcript with key iterations; AI versus student contributions clearly distinguishable and consistent across materials.
Prompt Generation (P)	Vague: Off-task or generic prompts; missing context/constraints; AI must guess what is needed.	General: Relevant but broad/conversational; key parameters missing; relies on AI to infer essentials.	Task-focused: Clearly tied to assignment; includes some key elements (eg, population/design/formula) but not all; follow-ups needed to add assumptions/parameters.	Precise: Directly aligned with task requirements; parameters/constraints stated upfront; complete/actionable prompts requiring minimal AI inference.
Verification (V)	Passive: Accepts outputs as true; no questioning; no source/evidence requests.	Superficial: Minimal surface questions; no sources/reasoning requested; no meaningful challenge/correction.	Evidence-seeking: Requests sources/evidence/reasoning at least once; at least one meaningful challenge; limited follow-through but clear attempt at critical engagement.	Active verification: Repeated critical questioning; requests sources + rationale; actively verifies/challenges/corrects (including responding to bias/misinformation when relevant).
Integration (I)	Copy: Direct copy-paste or unmodified insertion; no adjustment to context; poor fit.	Light edit: Minor edits only; content feels inserted/awkward; style/context inconsistencies.	Adapt: Meaningfully rephrased/adapted; mostly coherent fit; minor inconsistencies remain.	Deep integrate: Fully rewritten/deeply adapted; reads naturally; strengthens the work and adds clear value beyond generic insertion.

Note. Each domain is scored from 0–3; total score = AD + P + V + I (range 0–12).

Raters and Scoring Procedure

All submissions were independently scored by the same three EBM faculty members involved in teaching the course. Raters were clinician-educators with experience in EBM, research-methods instruction, and rubric-based assessment of student written work, and were regular users of LLM tools for academic tasks.

To reduce bias, submissions were coded and raters were blinded to student identity. Raters scored submissions independently after de-identification and did not discuss individual submissions during scoring to minimize expectancy effects and shared influence. For analysis, domain and total scores for each submission were calculated as the mean of the three raters’ scores.
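The aggregation step described above can be sketched in a few lines of Python. This is an illustrative sketch, not the study's actual workflow (analyses were run in SPSS); the domain keys, the function name aggregate_submission, and the example ratings are hypothetical.

```python
from statistics import mean

# Hypothetical keys for the four rubric domains
DOMAINS = ("documentation", "prompt", "verification", "integration")

def aggregate_submission(rater_scores):
    """Average each domain across raters, then sum the domain means.

    rater_scores: list of dicts, one per rater, mapping domain -> integer 0..3.
    Returns (domain_means, total), where total ranges from 0 to 12.
    """
    for scores in rater_scores:
        for domain in DOMAINS:
            if not 0 <= scores[domain] <= 3:
                raise ValueError(f"{domain} score out of range: {scores[domain]}")
    domain_means = {d: mean(s[d] for s in rater_scores) for d in DOMAINS}
    # Summing the domain means equals averaging the three raters' totals
    total = sum(domain_means.values())
    return domain_means, total

# Made-up ratings for one submission (illustrative only, not study data)
example = [
    {"documentation": 0, "prompt": 1, "verification": 0, "integration": 2},
    {"documentation": 0, "prompt": 2, "verification": 0, "integration": 2},
    {"documentation": 0, "prompt": 1, "verification": 1, "integration": 1},
]
domain_means, total = aggregate_submission(example)
print(domain_means, round(total, 2))
```

Because the total is a sum of domain means, a submission can receive a non-integer total even though each rater assigns integer scores.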

Statistical Analysis

All analyses were conducted in IBM SPSS Statistics (Version 26.0). Inter-rater reliability was assessed using the intraclass correlation coefficient (ICC) with a two-way random-effects model and average-measures agreement. Differences across domains were examined using the Friedman test, followed by Wilcoxon signed-rank tests for pairwise comparisons with Bonferroni correction (significance threshold p < 0.008). To explore whether domain scores differed across overall performance levels, submissions were stratified into low, mid, and high performers based on total-score quartiles, and between-group differences in domain scores were tested using Kruskal–Wallis tests (two-tailed, α = 0.05).
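For readers who wish to see what the average-measures agreement ICC computes, a minimal sketch follows. It assumes the two-way random-effects, absolute-agreement, average-of-k-raters form, (MS_rows − MS_error) / (MS_rows + (MS_raters − MS_error) / n), applied to a complete submissions-by-raters matrix. The function name and example ratings are hypothetical, and the reported estimates were obtained in SPSS, not with this code.

```python
def icc_agreement_average(ratings):
    """Two-way random-effects ICC, absolute agreement, average of k raters.

    ratings: list of rows, one per submission; each row holds one score per rater.
    Assumes a complete matrix (every rater scored every submission).
    """
    n = len(ratings)       # number of submissions
    k = len(ratings[0])    # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares from the two-way layout without replication
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (ms_rows + (ms_cols - ms_error) / n)

# Made-up ratings: 5 submissions scored 0-3 by 3 raters (not study data)
example_ratings = [
    [0, 0, 0],
    [1, 1, 2],
    [2, 2, 2],
    [3, 2, 3],
    [1, 0, 1],
]
print(round(icc_agreement_average(example_ratings), 3))
```

The agreement form penalizes systematic differences between raters (through MS_raters) in addition to random disagreement, which is why it suits a rubric where absolute score levels matter.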

Reporting Guideline

The reporting of this reliability and agreement study conforms to the GRRAS guideline.¹⁴ A completed GRRAS checklist is provided as Supplementary Table S1.

Ethics

All participants provided written informed consent. Ethical approval was obtained from the Ken Walker International University Institutional Review Board (IRB approval number: #1-2024/002). All submissions and transcripts were de-identified through coding prior to analysis.

Results

Fifty third-year medical students participated in the Evidence-Based Medicine final project (n = 50). Participants were 32% male (n = 16) and 68% female (n = 34), with a mean age of 20.3 years (SD = 0.46). Each submission consisted of one research proposal and its corresponding AI chat transcript. All submitted transcripts were generated using ChatGPT, although any LLM was permitted.

Submissions were scored across four Generative AI (LLM) literacy domains—AI Use Documentation, Prompt Generation, Verification, and Integration—and a total score. The mean total score was 4.06 (SD = 1.80) (Table 2).

Table 2.

Descriptive Statistics of Generative AI (LLM) Literacy Domain Scores by Performance Group (n = 50 Submissions).

Generative AI (LLM) Literacy Domain	Full Sample Mean (SD)	High Performers Mean (n = 12)	Mid Performers Mean (n = 30)	Low Performers Mean (n = 8)	Kruskal–Wallis p-Value
AI Use Documentation	0.67 (1.08)	1.62	0.43	0.20	0.150
Prompt Generation	1.33 (0.69)	1.90	1.35	0.47	0.003
Verification	0.41 (0.71)	1.24	0.20	0.00	0.004
Integration	1.64 (0.67)	1.71	1.83	0.87	0.062
Total score (max = 12)	4.06 (1.80)	6.5	3.8	1.5	—

Note. Domain scores range from 0–3 (higher scores indicate stronger performance); total score range is 0–12. Subgroup columns report means to facilitate comparison across performance strata.

Domain Performance and Discrimination Across Performance Levels

Participants were stratified into three performance bands based on the distribution of total rubric scores: the lower 25%, middle 50%, and upper 25%. Because multiple students had identical total scores at the quartile cut points, these tied scores were kept within the same band to avoid arbitrary splitting, resulting in final group sizes of n = 12 (upper band), n = 30 (middle band), and n = 8 (lower band).
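One way to keep tied totals in the same band is to assign bands by score value rather than by rank position, as in the sketch below. The exact tie rule used in the study is not specified beyond keeping ties together, so the rule here (totals strictly below the first quartile cut go to the lower band, totals strictly above the third go to the upper band) is an assumption, as are the function name and the example totals.

```python
from statistics import quantiles

def assign_bands(totals):
    """Label each total score 'lower', 'middle', or 'upper'.

    Assumed rule: strictly below the first quartile cut -> 'lower',
    strictly above the third quartile cut -> 'upper', otherwise 'middle'.
    Banding depends only on the score value, so equal totals never split.
    """
    q1, _, q3 = quantiles(totals, n=4)  # three cut points: Q1, median, Q3
    bands = []
    for t in totals:
        if t < q1:
            bands.append("lower")
        elif t > q3:
            bands.append("upper")
        else:
            bands.append("middle")
    return bands

# Made-up total scores (not study data)
example_totals = [1, 2, 2, 3, 4, 4, 4, 5, 6, 7, 8, 9]
example_bands = assign_bands(example_totals)
print(list(zip(example_totals, example_bands)))
```

With value-based cut points, band sizes can depart from an exact 25/50/25 split whenever several submissions share a total near a cut point.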

Across domains, performance was highest in Integration and Prompt Generation and lowest in Verification (Table 2). When submissions were stratified into low, mid, and high performers based on total-score quartiles, Kruskal–Wallis tests showed statistically significant between-group differences for Prompt Generation (p = 0.003) and Verification (p = 0.004). Differences were not statistically significant for AI Use Documentation (p = 0.150), while Integration showed a non-significant trend (p = 0.062) (Table 2), indicating that prompting and verification behaviors most clearly distinguished higher- from lower-performing submissions.

Figure 1 illustrates the distribution of scores across the four domains, including variability and outliers.

Figure 1.

Distribution of Generative AI (LLM) literacy domain scores across submissions (n = 50). Boxes represent the interquartile range (IQR) with the median shown as a horizontal line; whiskers extend to 1.5 × IQR; open circles indicate outliers; diamonds indicate the mean. Scores range from 0–3.

Many submissions received a score of 0 in the Documentation and Verification domains. A score of 0 was assigned when the relevant behavior was not observable in the shared AI chat transcript and/or final proposal (eg, no documented verification attempt, or missing AI-use disclosure as part of Documentation).

Floor and ceiling effects by domain are summarized in Supplementary Table S2.
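Floor and ceiling effects of this kind can be expressed as the proportion of submissions at the minimum and maximum of the scale. The sketch below is illustrative only: it assumes the analyzed domain score is the three-rater mean, so a score counts as floor only when it equals 0 exactly; the function name and example scores are hypothetical.

```python
def floor_ceiling(scores, minimum=0, maximum=3):
    """Return (floor, ceiling) as proportions of scores at the scale limits."""
    n = len(scores)
    floor = sum(1 for s in scores if s == minimum) / n
    ceiling = sum(1 for s in scores if s == maximum) / n
    return floor, ceiling

# Made-up domain scores for eight submissions (not study data)
example_scores = [0, 0, 0, 0, 1, 1, 2, 3]
floor, ceiling = floor_ceiling(example_scores)
print(f"floor = {floor:.0%}, ceiling = {ceiling:.0%}")
```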

Inter-Rater Reliability Testing

Inter-rater agreement was high across domains (Table 3). ICC values ranged from 0.83 to 0.99 (all p < 0.001), supporting consistency of rubric scoring across the three raters.

Table 3.

Inter-Rater Reliability Summary (n = 50 Submissions).

Domain	ICC	95% CI (Lower–Upper)	p-Value	Interpretation
AI Use Documentation	0.99	0.99–1.00	<0.001	Excellent agreement
Prompt Generation	0.84	0.70–0.92	<0.001	Very good agreement
Verification	0.93	0.88–0.97	<0.001	Excellent agreement
Integration	0.83	0.68–0.91	<0.001	Very good agreement

Note. ICC calculated in SPSS across three raters using an average-measures agreement model.

To examine relationships among the four rubric domains, we computed Spearman rank correlations (Supplementary Table S3). Inter-domain correlations were weak overall (ρ range: −0.206 to 0.310) and did not indicate redundancy across domains. None of the pairwise correlations met the Bonferroni-adjusted significance threshold (p < 0.008).
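Spearman's ρ is the Pearson correlation computed on ranks, with tied observations given the average of the ranks they span. The sketch below illustrates that definition in plain Python. It is not the SPSS procedure used for the reported values, and the function names and example data are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the last position holding the same value as position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Pearson correlation of the average ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# Made-up scores for two domains across six submissions (not study data)
domain_a = [0, 1, 1, 2, 3, 0]
domain_b = [1, 1, 2, 2, 1, 0]
print(round(spearman_rho(domain_a, domain_b), 3))
```

Rank-based correlation suits rubric scores because many submissions share the same score and the scale is ordinal.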

Differences Across Domains

A Friedman test indicated significant differences across the four domains (χ²(3) = 32.66, Kendall's W = 0.363, p < 0.001). Post-hoc Wilcoxon signed-rank tests with Bonferroni correction (α = 0.008) demonstrated that Prompt Generation and Integration scores were higher than both AI Use Documentation and Verification (Table 4). The comparison between Prompt Generation and Integration did not remain significant after correction, and AI Use Documentation did not differ significantly from Verification (Table 4).

Table 4.

Post-hoc Comparisons Between AI Literacy Domains (Wilcoxon Signed-Rank Tests, n = 50 Submissions).

Domain Pair	Z	r Effect Size	p-Value	Significant After Bonferroni (p < 0.008)
AI Use Documentation versus Prompt Generation	−2.68	0.49	0.007	Yes
AI Use Documentation versus Verification	−1.18	0.22	0.237	No
AI Use Documentation versus Integration	−3.14	0.57	0.002	Yes
Prompt Generation versus Verification	−3.91	0.71	<0.001	Yes
Prompt Generation versus Integration	−2.12	0.39	0.034	No
Verification versus Integration	−4.37	0.80	<0.001	Yes

Note. Bonferroni-adjusted significance threshold set at p < 0.008 (0.05/6).
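The threshold in the note follows from the number of pairwise comparisons among the four domains. As a worked step, with m denoting the number of domain pairs and α the nominal significance level of 0.05 (m and α_adj are notation introduced here for illustration):

```latex
m = \binom{4}{2} = \frac{4 \times 3}{2} = 6,
\qquad
\alpha_{\text{adj}} = \frac{\alpha}{m} = \frac{0.05}{6} \approx 0.0083
```

The adjusted value is reported as 0.008 in the text and tables.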

Overall, documented Verification was the weakest competency, while Prompt Generation and Integration were consistently higher across submissions.

To assess whether Generative AI (LLM) literacy was related to students’ general academic performance, we examined the relationship between AI literacy total scores and students’ overall grade point average (GPA). Given the ordinal nature of rubric scores, we used Spearman correlation analysis, which showed no significant association between AI literacy and GPA (Spearman ρ = 0.194, n = 50, p = 0.201), indicating at most a weak relationship in this cohort.

Discussion

This study used a performance-based rubric to examine third-year medical students’ Generative AI (LLM) literacy through authentic coursework submissions paired with complete AI chat transcripts. Across 50 submissions, overall performance was low (mean total score = 4.06/12), suggesting that while many students were able to use LLMs to support drafting and organization, they often did not demonstrate the higher-order behaviors associated with responsible, critical, and reflective use.

A central finding was the consistent weakness in Verification, even when Prompt Generation and Integration were comparatively stronger. This pattern is educationally important because LLM outputs may be incomplete, biased, or confidently incorrect, making verification and critical appraisal essential in health-related contexts where inaccuracies can have meaningful consequences.⁶,¹⁵,¹⁶ At the same time, LLM adoption is increasingly reported in clinical workflows, including documentation support and clinical reasoning assistance, underscoring the need to prepare learners to verify and contextualize AI assistance rather than treat it as authoritative.¹⁷ This urgency is reflected in emerging competency guidance for medical training, including the Digital Health Competencies in Medical Education (DECODE) framework, which identifies AI-relevant competencies expected of medical graduates, with particular emphasis on understanding AI limitations and potential bias and systematically verifying AI-generated information to protect patient safety.¹⁸

The results indicate that Prompt Generation and Verification most clearly differentiated performance strata, whereas AI Use Documentation and Integration were less discriminating. This supports a practical interpretation: many students can use LLMs in a functional sense, for example, to structure content and generate draft text, but the behaviors that distinguish stronger from weaker performance are the ability to ask targeted, task-aligned questions, iteratively refine prompts, and critically challenge, confirm, or qualify AI-generated claims. In this study, Prompt Generation was operationalized as task-aligned specificity and constraints, such as explicitly requesting proposal elements including population, design justification, sampling assumptions, and bias considerations, and requesting structured outputs. Accordingly, this domain reflects task-adequate prompting rather than the full range of prompt-engineering techniques discussed in the broader generative AI literature.

A common critique is that documentation reflects academic integrity rather than LLM literacy, and we agree that these constructs overlap. However, in educational, research, and clinical workflows, documenting AI assistance also serves a traceability function, a core element of responsible AI-enabled work, because it supports supervision, feedback, accountability, and an auditable record of how AI contributed to the final product.¹⁹ In this study, AI Use Documentation was scored behaviorally, that is, according to the completeness and consistency of disclosure and transcript evidence, rather than as a moral judgment. Nevertheless, this domain cannot fully distinguish capability from compliance, and future work could strengthen construct validity by pairing documentation with additional observable indicators, such as explicit rationale for accepting, modifying, or rejecting AI suggestions.

Because Verification was scored only when it was observable within the submitted transcript and the written proposal, the rubric emphasized behaviors such as requesting sources, asking the model to justify or qualify claims, checking assumptions behind methodological or statistical recommendations, and correcting inconsistencies across turns. Importantly, this domain should be interpreted as assessing the visibility and traceability of verification within the submitted AI workflow, rather than as a direct measure of students’ total fact-checking ability. Verification scores of 0 reflected passive acceptance, for example when an LLM suggested a sampling method and the student incorporated it into the proposal without asking clarifying questions and without providing justification. Thus, a score of 0 indicated that verification was not observable in the submitted evidence, not that the student necessarily performed no independent checking at all. Conversely, higher verification performance was characterized by evidence-seeking and challenge (eg, requesting sources or rationale, asking the model to compare alternatives, or refining recommendations based on explicit assumptions). Although students were asked to keep AI-assisted work in a single continuous chat to preserve an auditable interaction trail, they were not explicitly required to document verification steps within the chat, and some verification may have occurred outside the recorded workflow through consultation of external sources such as textbooks, PubMed, or other references and thus could not be captured. Accordingly, Verification scores should be interpreted as indicators of documented and observable verification behavior within the submitted workflow, not as absolute measures of all verification activity undertaken by the student.

Integration scores suggested that many students could incorporate AI-supported content into an academic proposal coherently. However, this domain may partially reflect baseline academic writing and synthesis ability rather than LLM-specific competence. Although scoring was anchored to incorporation of AI-derived content evidenced in the transcript and final proposal rather than generic writing quality, we could not fully isolate students’ underlying writing proficiency from AI-enabled performance. Future studies should address this by incorporating pre-AI writing samples, independent baseline writing scores, or parallel non-AI writing tasks scored with a conventional academic-writing rubric. Another useful design would be to model Integration scores while statistically controlling for prior writing performance or GPA in writing-intensive coursework. Such approaches would help determine whether the Integration domain captures uniquely AI-related adaptation skills, general academic writing ability, or a combination of both.

The rubric demonstrated strong inter-rater agreement within this locally calibrated rater group, supporting consistency of scoring in the present setting. Although this was a single-site cohort (n = 50) scored by three raters, the reliability estimates reported with 95% confidence intervals provide adequate precision for this initial tool-development phase. However, these strong ICC estimates should be interpreted as evidence of scoring consistency within a locally calibrated rater group rather than as proof of broader scoring generalizability. Because all three raters were EBM faculty involved in the course and rubric refinement, they likely shared a similar instructional frame and implicit standards for what constituted strong AI-supported work. This shared context may have reduced variation in judgment and contributed to higher agreement than might be observed among raters without the same training or curricular familiarity. Accordingly, the present ICC values are best viewed as encouraging initial reliability evidence for this setting, while generalizability across institutions and rater backgrounds remains to be established. Application was also feasible in this cohort, as all submissions were scored independently by three raters within half a working day. However, implementation in larger class sizes would benefit from formal evaluation of time burden and staffing requirements. Practical approaches may include calibrated single-rater scoring with periodic reliability checks and staged review processes for a subset of submissions.

In exploring criterion relevance, AI literacy showed only a weak relationship with students’ overall GPA. This suggests that the competencies captured by our performance-based rubric may represent a distinct skill set rather than a proxy for general academic achievement as reflected in cumulative grades. This pattern is expected given that these measures capture different underlying constructs: overall GPA aggregates performance across diverse assessment formats and content domains, whereas our rubric targets process-oriented, observable behaviors during AI-supported work (eg, prompt specificity, documentation/traceability, verification actions, and integration decisions). However, relationships between “AI literacy” and academic outcomes may also vary as a function of how AI literacy is measured. Many studies rely on self-report readiness or literacy instruments; for example, Hamad et al²⁰ reported a positive association between overall GPA and self-reported AI readiness among medical students. Self-report measures capture perceived competence and may be vulnerable to confidence–competence miscalibration and related cognitive biases (eg, patterns consistent with the Dunning–Kruger effect²¹), whereas our approach relied on transcript-anchored evidence of students’ enacted behaviors. Taken together, these considerations support interpreting LLM literacy, particularly verification, as an independent educational target that warrants explicit instruction and assessment rather than assuming it will develop in parallel with overall academic performance.

This assessment evaluates LLM-specific literacy demonstrated in one academic task and does not measure general AI literacy (eg, evaluation of ML/DL predictive models, calibration, dataset shift, or algorithmic bias in clinical prediction systems). It also does not evaluate LLM use at the point of clinical care. These boundaries matter because competence with LLM prompting and verification may not transfer directly to appraisal of ML/DL systems or to high-stakes bedside use. We therefore frame this work as an assessment of Generative AI (LLM) literacy in academic research writing, rather than broad AI literacy across clinical AI modalities.

Limitations

This single-institution study analyzed a modest number of coursework-embedded submissions. At the time of the study, students had no formal curricular instruction in Generative AI/LLM literacy; therefore, performance likely reflects baseline, informal use rather than trained competence. Verification was credited only when evidence was observable in the submitted AI chat transcript and/or final proposal; verification performed outside the recorded workflow could not be systematically captured. We did not assess intra-rater reliability or test–retest stability.

Because course faculty who contributed to rubric refinement also served as raters, the reported ICC values may partly reflect local calibration and shared mental models of expected student performance. Although submissions were de-identified and scored independently, agreement may therefore be higher than would be achieved by external raters unfamiliar with the course context. The current reliability findings should thus be interpreted as setting-specific rather than fully transportable. Future validation should include independent external raters, formal rater-training comparisons, and multi-site testing to determine whether the rubric's anchors remain interpretable and reliable across different educational settings.

Conclusion

A performance-based approach can reveal meaningful gaps in students’ Generative AI (LLM) literacy that may not be apparent from self-report alone. In this cohort, Verification emerged as the most consistent weakness and, together with Prompt Generation, best differentiated higher- from lower-performing submissions. AI Use Documentation was also low overall, indicating limited traceability of LLM use despite the requirement to submit transcripts. These findings support targeted curricular approaches that teach students how to ask task-aligned questions and routinely verify model outputs as generative AI becomes more embedded in health education and clinical workflows.

Supplemental Material

Supplemental material, sj-docx-1-mde-10.1177_23821205261442106 for A Performance-Based Rubric for Generative AI use in Medical Students’ Research Tasks: Development and Initial Psychometric Evaluation by Nino Shiukashvili, Mariam Rochikashvili, Vasil Kupradze, Nana Gonjilashvili, Nino Gvajaia, Luka Kutchava, Nona Janikashvili, Nino Tevzadze, Archil Undilashvili and Eka Ekaladze in Journal of Medical Education and Curricular Development

Footnotes

Acknowledgments

The authors have no acknowledgments to report.

ORCID iD

Nino Shiukashvili

Ethics Approval Statement

Ethical approval was obtained from the Ken Walker International University Institutional Review Board (IRB approval number: #1-2024/002). All submissions and transcripts were de-identified through coding prior to analysis.

Consent to Participate

All participants provided written informed consent.

Author Contributions

Nino Shiukashvili MD, PhD - Conceptualization, Methodology, Writing - Original Draft Preparation, Writing - Review & Editing

Mariam Rochikashvili MD - Data Curation, Formal Analysis

Vasil Kupradze MD - Investigation

Nana Gonjilashvili MD - Investigation

Nino Gvajaia MD - Investigation

Luka Kutchava MD - Investigation

Nona Janikashvili MD, PhD - Conceptualization

Nino Tevzadze MD, PhD - Investigation

Archil Undilashvili MD, MPH, PhD - Investigation (study execution) and Methodology

Eka Ekaladze MD, PhD - Conceptualization and Methodology

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

De-identified data supporting the findings of this study have been deposited in the Open Science Framework (OSF) repository and are available at

AI Use Disclosure

Artificial intelligence tools (OpenAI ChatGPT, GPT-5) were used during manuscript preparation. Specifically, AI assisted with language refinement, grammar correction, and the generation of one illustrative figure. All research design, data collection, statistical analysis, and interpretation were performed solely by the authors. AI outputs were critically reviewed, edited, and verified for accuracy to ensure the integrity and originality of the work.

References

1. Khosravi, Zare, Mojtabaeian, Izadi. Artificial intelligence and decision-making in healthcare: a thematic analysis of a systematic review of reviews. Health Serv Res Manag Epidemiol. 2024;11:23333928241234863. doi:10.1177/23333928241234863

2. Sharma, Savage, Nair, Larsson, Svedberg, Nygren. Artificial intelligence applications in health care practice: scoping review. J Med Internet Res. 2022;24(10):e40238. doi:10.2196/40238

3. Lee, Song, You, Kim. Shifts in emergency physicians’ attitudes toward large language model-based documentation: a pre- and post-implementation study. Sci Rep. 2025;15(1):40643. doi:10.1038/s41598-025-24659-4

4. Blease, Locher, Gaab, Hägglund, Mandl. Generative artificial intelligence in primary care: an online survey of UK general practitioners. BMJ Health Care Inform. 2024;31(1):e101102. doi:10.1136/bmjhci-2024-101102

5. Ozkan, Tekin, Ozkan, Cabrera, Niven, Dong. Global health care professionals’ perceptions of large language model use in practice: cross-sectional survey study. JMIR Med Educ. 2025;11:e58801. doi:10.2196/58801

6. Chelli, Descamps, Lavoué, et al. Hallucination rates and reference accuracy of ChatGPT and Bard for systematic reviews: comparative analysis. J Med Internet Res. 2024;26(1):e53164. doi:10.2196/53164

7. Gravel, D’Amours-Gravel, Osmanlliu. Learning to fake it: limited responses and fabricated references provided by ChatGPT for medical questions. Mayo Clinic Proceedings: Digital Health. 2023;1(3):226-234. doi:10.1016/j.mcpdig.2023.05.004

8. Zhang, Yoon, Williams DKA, Pinkas. Exploring the usage of ChatGPT among medical students in the United States. J Med Educ Curric Dev. 2024;11:23821205241264695. doi:10.1177/23821205241264695

9. Pucchio, Rathagirishnan, Caton, et al. Exploration of exposure to artificial intelligence in undergraduate medical education: a Canadian cross-sectional mixed-methods study. BMC Med Educ. 2022;22(1):815. doi:10.1186/s12909-022-03896-5

10. Weidener, Fischer. Artificial intelligence in medicine: cross-sectional study among medical students on application, education, and ethical aspects. JMIR Med Educ. 2024;10:e51247. doi:10.2196/51247

11. Laupichler, Aster, Meyerheim, Raupach, Mergen. Medical students’ AI literacy and attitudes towards AI: a cross-sectional two-center study using pre-validated assessment instruments. BMC Med Educ. 2024;24(1):401. doi:10.1186/s12909-024-05400-7

12. Kimiafar, Sarbaz, Tabatabaei, et al. Artificial intelligence literacy among healthcare professionals and students: a systematic review. Front Health Inform. 2023;12:168. doi:10.30699/fhi.v12i0.524

13. Gordon, Daniel, Ajiboye, et al. A scoping review of artificial intelligence in medical education: BEME guide No. 84. Med Teach. 2024;46(4):446-470. doi:10.1080/0142159X.2024.2314198

14. Kottner, Audigé, Brorson, et al. Guidelines for reporting reliability and agreement studies (GRRAS) were proposed. J Clin Epidemiol. 2011;64(1):96-106. doi:10.1016/j.jclinepi.2010.03.002

15. Farquhar, Kossen, Kuhn, Gal. Detecting hallucinations in large language models using semantic entropy. Nature. 2024;630(8017):625-630. doi:10.1038/s41586-024-07421-0

16. Templin, Fort, Padmanabham, et al. Framework for bias evaluation in large language models in healthcare settings. NPJ Digit Med. 2025;8(1):414. doi:10.1038/s41746-025-01786-w

17. Schuitmaker, Drogt, Benders, Jongsma. Physicians’ required competencies in AI-assisted clinical settings: a systematic review. Br Med Bull. 2025;153(1):ldae025. doi:10.1093/bmb/ldae025

18. Car, Ong, Erlikh Fox, et al. … Digital Health Systems Collaborative. The digital health competencies in medical education framework: an international consensus statement based on a Delphi study. JAMA Netw Open. 2025;8(1):e2453131. doi:10.1001/jamanetworkopen.2024.53131

19. Flanagin, Kendall-Taylor, Bibbins-Domingo. Guidance for authors, peer reviewers, and editors on use of AI, language models, and chatbots. JAMA. 2023;330(8):702-703. doi:10.1001/jama.2023.12500

20. Hamad, Qtaishat, Mhairat, et al. Artificial intelligence readiness among Jordanian medical students: using medical artificial intelligence readiness scale for medical students (MAIRS-MS). J Med Educ Curric Dev. 2024;11:23821205241281648. doi:10.1177/23821205241281648

21. Kruger, Dunning. Unskilled and unaware of it: how difficulties in recognizing one’s own incompetence lead to inflated self-assessments. J Pers Soc Psychol. 2000;77(6):1121-1134. doi:10.1037//0022-3514.77.6.1121
