Trust, Verify, Override: Behavioral Governance for Generative Artificial Intelligence in Medical Imaging

Abstract

Generative artificial intelligence (AI) in clinical communication, including in medical imaging, presents behavior-mediated safety challenges: outcomes depend on how clinicians verify AI-generated content under time constraints. However, guidance has largely focused on predeployment validation, with less specificity about postdeployment governance in day-to-day workflows. This perspective synthesizes evidence on failure modes, automation bias, and implementation monitoring and proposes a practical framework organized around three target behaviors: trust (transparent scope limits), verify (structured cross-checks against source data), and override (documented corrections that become learning signals). Drawing on behavior change and implementation science, we translate postdeployment risks into stakeholder-specific interventions, including competency-based education, equity-stratified monitoring with prespecified triggers for fairness drift, and rollback procedures. The framework extends to patient-facing AI-generated explanations, where comprehension and autonomy must be safeguarded. This approach positions governance as a health education and behavior challenge essential for safe, equitable adoption.

Keywords

generative artificial intelligence, medical imaging, clinical documentation, implementation science, governance, automation bias

Introduction

Generative artificial intelligence (GenAI) is entering medical imaging through everyday workflows. In many organizations, it functions as a documentation assistant: drafting reports, summarizing prior studies, and producing patient-facing explanations. Unlike traditional artificial intelligence (AI) tools, GenAI introduces behavior-mediated risks, where safety often hinges on how clinicians handle fluent text under time pressure. Do they verify it? Do they escalate discordant outputs? Do they document when they override it?

Those questions point to the core challenge: GenAI is not just a model to be validated but also a behavioral intervention delivered through workflow interfaces. Governance, therefore, must be specified as behavior change techniques (BCTs) paired with implementation strategy bundles, not model evaluation alone. In practice, governance means designing, measuring, and reinforcing reliance behaviors: what clinicians accept, verify, override, and escalate when the workflow is moving rapidly. Medical imaging is a useful sentinel case because its digitized, time-critical workflows make these behaviors visible and measurable early. Because GenAI drafts action-driving text, governance must shape reliance behaviors in real workflows, including patient-facing communication.

Drawing on narrative synthesis of evidence on GenAI failure modes, automation bias, and postdeployment monitoring, this perspective addresses four recurring risk pathways, translating governance principles into measurable practice organized around trust, verify, override. The analysis integrates the Capability, Opportunity, Motivation–Behavior (COM-B) model (Michie et al., 2011) to specify target safety behaviors, the Consolidated Framework for Implementation Research (CFIR; Damschroder et al., 2009) to identify contextual determinants, and the Reach, Effectiveness, Adoption, Implementation, and Maintenance (RE-AIM) framework (Glasgow et al., 1999) to structure pragmatic evaluation.

While individual elements appear in prior work, no governance guidance for clinical GenAI has integrated BCT-specified reliance behaviors, equity-primary fairness drift triggers with prespecified thresholds, and RE-AIM-structured evaluation tied to governance action into a single auditable package. We advance a testable proposition: this package will increase appropriate reliance behaviors and reduce automation bias errors compared with governance focused on model performance alone.

Behavior Mechanisms in AI-Mediated Clinical Communication

Postdeployment safety depends largely on a small set of observable clinician behaviors: verifying AI outputs against source data, escalating discordant or out-of-scope cases, and documenting overrides as learning signals.

Generative AI can alter these behaviors through two primary mechanisms. First, fluent outputs may be perceived as more accurate than warranted, fostering uncritical acceptance (S. S. Y. Kim et al., 2025; Reber & Schwarz, 1999). Second, reduced effort may encourage faster decisions with fewer verification steps (Lyell & Coiera, 2017; Reber & Unkelbach, 2010). These tendencies can erode the safeguards that clinical judgment provides.

These dynamics reveal workflow points where behavior can be intentionally shaped. The COM-B framework offers a structure for intervention: (1) Capability, through verification and calibration training; (2) Opportunity, through embedded supports such as structured prompts and lightweight documentation; and (3) Motivation, through uncertainty cues and error feedback (Michie et al., 2011).

Four Recurring Risk Pathways

GenAI-related harms cluster around predictable interactions between fluent systems, constrained workflows, and human cognition. Four pathways recur and should inform governance design.

Hallucination and Unsupported Assertions

GenAI can produce fluent but clinically incorrect statements, including action-driving errors (wrong problems, wrong timelines, wrong follow-up, wrong certainty; Asgari et al., 2025; Butler et al., 2024; Park et al., 2025; Rao et al., 2024). Prompt engineering can reduce but not eliminate these errors (Anh-Hoang et al., 2025; Cheng et al., 2025). The central risk is detectability: fluent text feels credible, reducing the friction that normally triggers source-checking (Reber & Unkelbach, 2010).

Automation Bias

Clinicians may overweight confident algorithmic suggestions while underweighting their own judgment, a pattern consistent with automation bias (Dratsch et al., 2023; Goddard et al., 2012). GenAI drafts may omit or confabulate information (Song et al., 2025), and fluent phrasing can create a certainty illusion that discourages verification (Sun et al., 2023).

Performance Drift and Fairness Drift

Real-world performance shifts as workflows, documentation norms, staffing, and patient mix evolve. Drift may appear as new error patterns, unstable summarization, or weaker fidelity to source-of-truth data. Fairness drift, defined as widening subgroup gaps that may be masked by stable overall metrics, should be treated as a distinct governance failure mode (Davis et al., 2025).
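
The masking effect described above can be made concrete with a small worked example. The sketch below is illustrative only: the subgroup labels, error counts, and audit sizes are invented for the arithmetic and are not drawn from any study. It shows an overall error rate that stays at 5% across two review cycles while the gap between subgroups grows from 0 to 10 percentage points.

```python
def rates(counts):
    """Return (overall_rate, per_subgroup_rates) for one review cycle.

    counts maps a subgroup label to (errors, audited_reports).
    """
    total_errors = sum(errors for errors, _ in counts.values())
    total_audited = sum(audited for _, audited in counts.values())
    by_group = {g: errors / audited for g, (errors, audited) in counts.items()}
    return total_errors / total_audited, by_group


def subgroup_gap(by_group):
    """Largest minus smallest subgroup rate in one cycle."""
    return max(by_group.values()) - min(by_group.values())


# Hypothetical audit counts: (errors, audited reports) per subgroup.
cycle_1 = {"A": (45, 900), "B": (5, 100)}
cycle_2 = {"A": (36, 900), "B": (14, 100)}

overall_1, groups_1 = rates(cycle_1)
overall_2, groups_2 = rates(cycle_2)
gap_1 = subgroup_gap(groups_1)
gap_2 = subgroup_gap(groups_2)

print(f"overall: {overall_1:.3f} -> {overall_2:.3f}")
print(f"gap:     {gap_1:.3f} -> {gap_2:.3f}")
```

Because the smaller subgroup contributes little to the pooled denominator, a dashboard that reports only the overall rate would show no change here, which is why stratified review is treated as a separate monitoring target.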

Workflow Integration Failures

GenAI can amplify both the strengths and the weaknesses of the workflows it enters. Poor interfaces and ad hoc integrations that do not scale, along with weak auditability, can increase operational burden and the likelihood of unanticipated problems as AI is deployed across workflows (Tejani et al., 2024). Conversely, designs that make user corrections visible (e.g., capturing override reasons) can convert edits and rejections into learning signals for monitoring and improvement (Aaron et al., 2019).

These pathways define what postdeployment governance must control: how scope boundaries are made visible, how verification is prompted, and how overrides are captured as learning signals for monitoring and improvement.

Trust, Verify, Override: A Postdeployment Governance Heuristic

We propose trust, verify, override as a behavioral governance heuristic for AI-mediated clinical communication. The goal is observable reliance: bounded use, structured verification, and accountable overrides, designed using BCTs and implemented through strategy bundles.

For replicability, we specify each component using recognized BCTs from the BCT Taxonomy (Michie et al., 2013). Trust maps to environmental restructuring and prompts/cues; verify to action planning with a verification checklist and brief attestation; and override to problem-solving steps, reason-coded documentation, and audit-and-feedback (Michie et al., 2011) delivered within a just-culture framework that emphasizes learning over blame (Reason, 2000).

Five supporting tables operationalize each component: sentinel indicators and triggers (Table 1), stakeholder behavioral targets (Table 2), an education blueprint (Table 3), governance checklists with rollback criteria (Table 4), and RE-AIM evaluation designs (Table 5).

Table 1.

Proposed Sentinel Indicators for Behavioral Governance: Early-Warning Signals and Example Triggers Requiring Local Validation.

Domain	Sentinel indicator	Operational definition (numerator/denominator)	Data source	Cadence	Threshold concept (trigger)	Prespecified action/response
Diagnostic safety	Hallucination flag rate	Numerator: Reports in which an audit sample identifies at least one GenAI-originated statement not supported by the source-of-truth record (imaging, EHR, or patient-reported data). Denominator: Audited GenAI-assisted reports.	Monthly audit of a fixed sample (e.g., 30–50 cases) plus incident reports.	Monthly with a rolling 3-month view.	A sustained upward trend or a step-change from baseline.	Restrict scope to lower-risk tasks, add targeted verification drills, and if persistent, pause the use case pending remediation.
Workflow reliability	Override-with-reason completion	Numerator: GenAI-assisted reports with any override that includes a reason code. Denominator: GenAI-assisted reports with an override.	Reporting log fields.	Weekly during rollout, then monthly.	Low completion suggests friction or norm failure.	Simplify the field, reinforce expectations in training, and use feedback to improve prompts and UI cues.
Equity and comprehension	Equity sentinel for patient-facing comprehension support	Numerator: Patients who correctly answer a brief comprehension check (or complete recommended follow-up within a defined window) after receiving an AI-assisted explanation. Denominator: Patients who received AI-assisted explanations, stratified by language and other priority subgroups.	Portal analytics, brief survey in pilot clinics, and follow-up completion from the EHR.	Monthly.	Widening subgroup gaps over two review cycles.	Consider revising language templates, add question prompts, engage patient advisors, and evaluate whether restricting patient-facing generation is warranted.

Note. These sentinel indicators are hypothesized early-warning monitoring targets intended to support learning during rollout; they are not safety standards. Metrics, thresholds, and actions should be locally defined and prospectively validated for sensitivity/specificity and feasibility.
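
As an illustration of the numerator/denominator definitions in Table 1, the following minimal sketch computes the hallucination flag rate and override-with-reason completion from an audit sample. The record schema (`AuditedReport` and its field names) and the sample values are hypothetical; a local implementation would map them to whatever the reporting log and audit form actually capture.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AuditedReport:
    """One audited GenAI-assisted report (hypothetical schema)."""
    unsupported_statements: int          # GenAI statements not supported by the source record
    had_override: bool                   # clinician edited or rejected GenAI content
    override_reason_code: Optional[str]  # reason code, if one was recorded


def hallucination_flag_rate(reports):
    """Reports with >= 1 unsupported statement / audited reports."""
    if not reports:
        return None  # undefined without audited reports; do not report as zero
    flagged = sum(1 for r in reports if r.unsupported_statements >= 1)
    return flagged / len(reports)


def override_reason_completion(reports):
    """Overrides with a reason code / reports with any override."""
    overridden = [r for r in reports if r.had_override]
    if not overridden:
        return None  # no overrides in the sample
    with_reason = sum(1 for r in overridden if r.override_reason_code)
    return with_reason / len(overridden)


# Invented sample of four audited reports.
sample = [
    AuditedReport(0, False, None),
    AuditedReport(2, True, "hallucination"),
    AuditedReport(0, True, None),
    AuditedReport(1, True, "out_of_scope"),
]

flag_rate = hallucination_flag_rate(sample)      # 2 of 4 reports flagged
completion = override_reason_completion(sample)  # 2 of 3 overrides have a reason
```

Returning `None` for an empty denominator is a design choice in this sketch: it keeps a month with no audited reports from being read as a month with no hallucinations.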

Table 2.

Behavioral Targets, Determinants, Policy Levers, and Measures by Stakeholder Group.

Stakeholder group	Target behaviors (what to do)	Key determinants (why behavior occurs)	Policy levers/intervention components (how to change)	Example measures (what to track)
Clinicians authoring clinical text	Independent assessment before AI exposure where feasible; verify discordant output; override with rationale; escalate when uncertainty persists.	Capability: calibration skills and knowing limits; Opportunity: workflow sequencing/time and access to scope/version info; Motivation: norms and accountability reinforced by audit/feedback.	Tier 1 trust, verify, override drills; point-of-care scope/version display; verification workflow (AI-second or independent-first); documentation prompts; near-miss reporting and review cadence.	Timing of AI exposure; override rate and rationale quality; escalation compliance; near-miss volume and themes.
Referring clinicians and downstream care teams	Treat GenAI-supported findings as decision support; check discordance with the clinical picture; avoid copy-forward of unverified text; use a defined escalation pathway.	Capability: uncertainty literacy and appropriate reliance; Opportunity: EHR-linked job aids and escalation routes; Motivation: shared responsibility for follow-up and harm prevention.	Tier 2 workflow-embedded decision aids (EHR link); brief refreshers at version/scope change; escalation guidance; communication templates that surface uncertainty/scope.	Documented verification steps; copy-forward frequency; escalation/referral for discordant results.
Patients and caregivers	Understand what GenAI did and did not do; ask clarifying questions; follow through on recommended next steps; avoid over-trust of portal narrative.	Capability: health literacy and numeracy; Opportunity: plain language summaries and question prompts; Motivation: perceived relevance, trust, and self-efficacy.	Tier 3 plain language explanations; questions to ask prompts; portal-friendly uncertainty language; transparency about scope/limitations and human oversight.	Patient understanding (brief survey); portal questions/messages; follow-up completion.
Leadership, governance, and quality teams	Define scope and decision rights; act on safety/equity signals (review, restrict, pause, rollback); communicate updates; reinforce just-culture learning.	Opportunity: governance infrastructure and data access; Capability: monitoring and adjudication capacity; Motivation: accountability and quality incentives.	Lifecycle governance routines; equity-stratified monitoring with triggers/rollback; change control for updates/scope shifts; standing near-miss review cadence and communication plan.	Time-to-detection of drift; trigger activations; pauses/rollbacks; equity gaps over time.

Note. Determinants are framed using the COM-B model (capability, opportunity, motivation). CFIR and RE-AIM inform implementation and evaluation planning.

Table 3.

Tiered Education Blueprint for GenAI Safety in AI-Mediated Clinical Communication.

Audience	Learning objectives	Workflow behaviors	Delivery	Assessment	Refresh cadence/owner
Radiologists and imaging clinicians	Calibrated reliance; verification self-efficacy; risk calibration; apply scope limits; document overrides.	Independent assessment before AI when feasible; verify discordant outputs; use structured override language.	Onboarding module plus case-based drills (including AI-wrong examples); optional integration into QI activities.	May audit a small sample for fidelity, verification documentation, and override rationale quality; may use periodic reader exercises; summarize signals for governance review.	Onboarding plus periodic refresh (e.g., semiannual); imaging QI lead/medical director.
Referring clinicians and care teams	Interpret AI-supported imaging as decision support; expectations for escalation; avoid over-trust.	Use an escalation pathway when imaging output conflicts with symptoms, labs, or clinical trajectory.	EHR-linked microlearning; workflow aid linked from imaging reports and order sets.	Brief knowledge check; may monitor escalation compliance and inappropriate reliance events through near-miss reporting; review signals in governance huddles.	Onboarding plus update-driven refresh; service-line leader.
Patients and caregivers	Understand what AI did and did not do; know safeguards; support activation for questions and follow-up.	Ask verification questions; seek clarification when the narrative conflicts with experience.	Standardized disclosure policy; portal FAQ; teach-back prompts embedded in pilot clinics.	Short survey or teach-back in pilot clinics plus a brief comprehension check; incorporate patient advisory feedback.	Refresh with major changes; patient experience lead.
Community partners/patient advisors	Improve clarity, cultural fit, and trust; identify barriers to follow-up.	Co-design disclosure language and activation prompts; advise on equity monitoring priorities; review disclosure and escalation policies.	Advisory sessions; translated materials review.	Qualitative feedback; track comprehension and follow-up barriers.	Periodic (e.g., quarterly); equity/population health lead.

Table 4.

Governance Checklist for Postdeployment GenAI Use in AI-Mediated Clinical Communication.

Governance domain	Routine	Metric	Priority subgroups (local)	Trigger threshold	Rollback action	Reinstatement rule
Model documentation (model card)	Before deployment and at updates; typically reviewed by the product owner and AI governance committee.	Documentation completeness; version history; known limitations.	N/A	Missing required fields or outdated version.	Pause deployment/update until documentation is complete; communicate limitations as needed.	Documentation complete plus governance sign-off.
Local validation and staged rollout	Pilot followed by phased expansion; typically overseen by an AI governance committee.	Local performance; workflow impact; safety events.	Site, scanner/protocol, and locally prioritized patient subgroups.	Fails acceptance criteria or subgroup gap exceeds an agreed threshold.	Restrict scope; pause expansion; revalidate.	Meets criteria in two review cycles.
Monitoring and rollback (equity + drift)	Ongoing dashboards with periodic review (e.g., monthly) by governance leads/committee.	Sentinel metrics; fairness drift trends; data-shift indicators; near-miss signals.	Locally prioritized patient subgroups; site; scanner/protocol where relevant.	Gap exceeds threshold for two consecutive cycles or a widening trend is detected; drift/data-shift alert persists over two cycles.	Increase audit; switch to AI-second; disable probability; restrict or pause affected use; revalidate.	Two consecutive cycles within thresholds plus governance sign-off.
Incident learning (near-misses and discordant cases)	Ongoing reporting with periodic review (e.g., monthly) by quality/safety and AI governance leads.	Near-miss rate; discordant-case themes; escalation compliance.	Review for differential burden across settings and subgroups.	Spike in events or repeated failure mode.	Targeted training; UI/prompt changes; narrow use; communicate changes.	Event rate returns to baseline plus corrective actions verified.
Transparency and communication	At go-live, during updates, and when limitations are identified; typically coordinated by communications/patient experience leads.	User acknowledgment; patient-facing materials availability.	Ensure accessibility by locally prioritized languages and literacy needs.	Missing communication or low comprehension signals.	Issue update; revise materials; reinforce training.	Materials updated plus comprehension confirmed in pilot feedback.

Note. Thresholds and triggers are illustrative and should be adapted to local context, risk tolerance, and data availability.
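
The monitoring row of Table 4 can be read as a simple rule over a series of per-cycle subgroup gaps. The sketch below covers only the "exceeds threshold for two consecutive cycles" trigger and the "two consecutive cycles within thresholds plus governance sign-off" reinstatement rule; it does not attempt the widening-trend or data-shift conditions. Function names, the 0.05 threshold, and the gap values are assumptions for illustration, not recommended settings.

```python
def gap_trigger(gaps, threshold, consecutive=2):
    """True if the most recent `consecutive` cycles all exceed the threshold.

    gaps is a chronological list of subgroup gaps, one per review cycle.
    """
    if len(gaps) < consecutive:
        return False  # not enough history to evaluate the rule
    return all(g > threshold for g in gaps[-consecutive:])


def can_reinstate(gaps, threshold, consecutive=2, governance_signoff=False):
    """True if the latest cycles are within threshold AND sign-off is recorded."""
    if len(gaps) < consecutive:
        return False
    within = all(g <= threshold for g in gaps[-consecutive:])
    return within and governance_signoff


# Hypothetical monthly gaps in a sentinel metric between two subgroups.
monthly_gaps = [0.03, 0.04, 0.07, 0.08]
THRESHOLD = 0.05  # illustrative; to be set locally

triggered = gap_trigger(monthly_gaps, THRESHOLD)

# Two later cycles back within threshold, with and without sign-off.
after_remediation = monthly_gaps + [0.04, 0.03]
reinstated = can_reinstate(after_remediation, THRESHOLD, governance_signoff=True)
blocked = can_reinstate(after_remediation, THRESHOLD, governance_signoff=False)
```

Requiring consecutive cycles trades speed of detection for fewer false alarms from a single noisy month; a site could shorten or lengthen `consecutive` to match its risk tolerance.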

Table 5.

Pragmatic Evaluation Designs Mapped to RE-AIM Outcomes.

Primary evaluation question (testable)	Pragmatic design option	RE-AIM outcomes emphasized	Example outcomes and measures (two to three per row)	Interpretation and action use
Does safety performance remain stable postdeployment, and does it change after updates or workflow shifts?	Interrupted time series or pre/post with staged rollout (where feasible); paired with targeted audit sampling.	Effectiveness, Maintenance, Implementation	Hallucination flag rate; near-miss/discordant-case rate; trigger activations (pause/restrict/rollback).	Use for early warning and governance decisions (scope restriction, pause, rollback) when trends deviate from baseline.
Does GenAI change verification behavior or reliance patterns in ways that increase risk?	Audit of verification documentation plus workflow log review; optionally compare AI-first vs. AI-second configurations during a defined pilot period.	Implementation, Effectiveness, Adoption	Verification documentation/fidelity; override-with-reason completion; escalation compliance for discordant/uncertain outputs.	Use to identify friction, automation bias signals, and training or UI changes needed to support calibrated reliance.
Are benefits and risks distributed equitably across locally prioritized patient subgroups?	Equity-stratified monitoring embedded in routine governance review; sentinel sampling when subgroup n is small.	Reach, Effectiveness, Maintenance	Comprehension support proxy (brief check or follow-up completion); subgroup trend gaps over time; differential trigger activations by subgroup/site.	Use to detect widening gaps and guide mitigation (template revisions, workflow supports, restrict patient-facing generation as warranted).

Note. Outcomes are organized using RE-AIM (Reach, Effectiveness, Adoption, Implementation, Maintenance). Metrics, thresholds, and response actions are presented as examples and should be locally defined and, where feasible, prespecified to fit the use case, risk tolerance, and available data.

Trust (Bounded Use, Visible Constraints)

Trust is not a feeling. It is constrained authorization. Health systems should define and publish a narrow initial scope (task, population, workflow step) and make those limits operationally explicit to users, including clear versioning, limitations, and ‘not for’ conditions at the point of use (Geis et al., 2019). When clinicians cannot see boundaries, they cannot reliably enact boundaries, and reliance becomes accidental rather than governed.

Verify (Structured Cross-Check Before Sign-Off)

A brief, standardized verification routine combined with accountability cues can reduce automation bias by shifting from passive acceptance to active confirmation (Dratsch et al., 2023; Skitka et al., 2000). Before sign-off, clinicians cross-check GenAI outputs against source-of-truth data, prioritizing high-risk claims, action-driving statements, and any content that could alter triage, follow-up, or patient understanding.

Override (Accountability With Learning Signals)

When outputs are discordant, uncertain, or out of scope, clinicians should edit or reject them and record a reason code (e.g., hallucination, overconfidence, incorrect comparison, out of scope). Track override rates and reasons by setting and subgroup. If only some patients benefit from rigorous verification, equity has already failed.
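
One way to make reason-coded overrides trackable by setting and subgroup is a small structured record plus a tally. This is a sketch under stated assumptions: the record fields, the setting and subgroup labels, and the closed list of codes (taken from the examples in the paragraph above) are hypothetical and would be defined locally.

```python
from collections import Counter
from dataclasses import dataclass

# Reason codes from the examples in the text; a local list may differ.
REASON_CODES = {"hallucination", "overconfidence", "incorrect_comparison", "out_of_scope"}


@dataclass(frozen=True)
class OverrideRecord:
    """One documented override (hypothetical schema)."""
    report_id: str
    setting: str      # e.g., site or service line
    subgroup: str     # locally prioritized patient subgroup
    reason_code: str

    def __post_init__(self):
        # Reject free-text or misspelled codes so tallies stay comparable.
        if self.reason_code not in REASON_CODES:
            raise ValueError(f"unknown reason code: {self.reason_code}")


def tally_overrides(records):
    """Count overrides per (setting, subgroup, reason_code)."""
    return Counter((r.setting, r.subgroup, r.reason_code) for r in records)


# Invented records for illustration.
records = [
    OverrideRecord("r1", "ED", "Spanish", "hallucination"),
    OverrideRecord("r2", "ED", "Spanish", "hallucination"),
    OverrideRecord("r3", "ED", "English", "out_of_scope"),
    OverrideRecord("r4", "Outpatient", "English", "overconfidence"),
]

tally = tally_overrides(records)
```

Counts alone do not show whether verification is applied evenly; they become interpretable once divided by the number of GenAI-assisted reports in each setting and subgroup.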

Operationalizing the Framework

The proposed package includes auditable governance routines, competency-based education for verification and escalation, and pragmatic evaluation for emerging safety and equity risks. The following sections address each in turn.

Stakeholder-Aligned Governance

In our view, postdeployment governance succeeds when the safer action is also the easier action. That requires clarity about who is responsible for which behaviors and supports. Table 2 translates the trust, verify, override heuristic into stakeholder-specific target behaviors, likely implementation determinants (using the COM-B model), policy levers, and measures. If verification is required but time is not protected, or escalation is encouraged but reporting is burdensome, workflows tend to drift toward unsafe behavior (Lyell & Coiera, 2017). CFIR’s inner-setting constructs (implementation climate, readiness, available resources) informed the institutional determinants and policy levers in Table 2.

Education

Education bridges governance intent and frontline behavior. Generic AI literacy is insufficient (K. Kim et al., 2024) because the most consequential failure modes are behavior-specific: verifying claims against sources, communicating uncertainty (Bragazzi & Garbarino, 2024), and escalating or overriding when outputs appear inconsistent (Adams et al., 2025). Table 3 presents a tiered education blueprint linking learning objectives to concrete workflow behaviors, delivery options, and assessment methods. Training should include cases where output is correct but still requires confirmation and cases where output is subtly wrong but fluent; otherwise, concerns will be suppressed, and learning will stall.

Auditable Governance Artifacts

Governance should produce routine, auditable artifacts that answer three questions: What did the model do? What did the clinician do with it? What happened next? Table 4 outlines minimum routines and thresholds, including scope and versioning documentation, sentinel monitoring, incident learning, and explicit rollback procedures. Without prespecified triggers, monitoring becomes purely retrospective. Rollback must be operational, with named owners, defined thresholds, and a review calendar, not a theoretical safeguard. If an organization cannot pause or restrict use when safety or equity signals worsen, its governance is only nominal.

Patient-Facing Explanations

These principles extend to the most visible output of generative AI: what patients read. Patient-facing explanations shape what patients understand, worry about, and do next. Accordingly, AI-generated or AI-assisted explanations should be clearly labeled, written to invite questions rather than convey certainty, and evaluated for comprehension and unintended anxiety (World Health Organization, 2024). As a pragmatic target, explanations should use plain language for broad readability (DeWalt et al., 2011) and include a brief comprehension check such as teach-back (Talevski et al., 2020) or brief chunking with confirmatory checking (DeWalt et al., 2011).

Pragmatic Evaluation

Governance and education should be paired with feasible, decision-oriented evaluation that prioritizes learning and safety over perfect causal inference. Table 5 maps pragmatic evaluation designs to RE-AIM outcomes (Glasgow et al., 1999), with example measures and the governance decisions each design informs.

Two principles guide this approach. First, pair automated signals with targeted audit: logs can show when and how GenAI was used, but expert review is often needed to determine whether edits reflect error correction, stylistic preference, or training needs. Second, tie metrics to action: each measure should prompt a specific governance decision.

Applying the Heuristic Beyond Imaging

Imaging is a useful sentinel case, but the trust, verify, override heuristic applies wherever GenAI drafts health content, including community health worker visit summaries, discharge instructions, behavioral intervention prompts, and patient portal explanations. The source of truth changes, but the governance logic does not.

Verify means the responsible author cross-checks action-driving claims before distribution for guideline alignment, patient constraints (language, contraindications, access barriers), and clarity about next steps. A ‘verified against: ___’ annotation keeps this traceable.

Override means intentionally correcting or rejecting output that could misdirect behavior (incorrect eligibility criteria, wrong referral pathway, culturally mismatched framing, or unsafe triage advice) and recording a reason code (e.g., policy mismatch, missing patient constraint, safety concern, tone/cultural fit) so corrections become improvement signals rather than one-off edits.

Equity monitoring translates directly: track comprehension and follow-through (teach-back results, referral completion) by language and other locally prioritized groups, and narrow or pause generation when subgroup gaps widen.

Implementation Considerations and Unintended Consequences

Implementation is not primarily a technical capacity problem; it is a behavioral systems problem. Every deployment should include a minimum behavioral package: (1) clearly declared scope boundaries that reduce inappropriate trust, (2) a verification routine that is teachable and auditable, and (3) an override pathway that is psychologically safe and operationally fast. Sites with fewer resources can scale the tooling, but they cannot omit the behavioral minimum without accepting predictable safety and equity failures.

Several unintended consequences deserve attention. Verification can be resource-intensive, especially when it relies on manual steps or retrospective review, increasing operational burden and limiting scalability (Chow et al., 2025). Complex interfaces and added supervisory demands can dilute perceived value and strain workloads when clinician capacity is already tight (Brady et al., 2024). Safeguards designed as hard-stops or rigid control rules can backfire by reducing usability and receptivity (Poly et al., 2020). Equity monitoring is ethically important (Embi, 2021), but data-intensive monitoring may feel like surveillance unless it is transparently governed and clearly framed as quality improvement (Muller et al., 2025). Health systems should track workload and trust alongside safety indicators and adjust routines when burden or distrust undermines safe use.

These unintended consequences raise ethical questions that governance must confront directly. Patients receiving AI-generated explanations may not understand what that means for reliability, raising concerns about meaningful informed consent. Logging clinician verification and override behavior is necessary for learning but creates surveillance risk if not governed by transparent, jointly developed policies. And when verification burden falls on those with the least time, accountability becomes inequitable: the organization claims human oversight while the conditions for exercising it are unevenly distributed. Governance design should make these tensions explicit and subject to periodic review rather than treating them as resolved by policy language alone.

One gap this framework does not resolve is vendor accountability. Deploying health systems should require, through procurement and service-level agreements, that vendors accept structured adverse-event reports, disclose version changes before local revalidation is needed, and maintain model cards with known limitations. Override signals and fairness drift findings should flow back to vendors as improvement data, not remain siloed within deploying organizations.

Limitations and Boundary Conditions

The evidence base for postdeployment GenAI safeguards remains uneven, consisting largely of conceptual frameworks and simulation studies rather than multi-site empirical evaluations. Accordingly, several recommendations here should be read as ethically grounded best practices rather than a definitive hierarchy of interventions with known effect sizes.

Generalizability is limited at the level of implementation details and expected effects. While the core governance behaviors generalize beyond imaging, other contexts differ in task structure, feedback loops, and outcome visibility, requiring adapted thresholds and monitoring cadence. Equity monitoring adds constraints: small subgroups may yield unstable estimates, requiring pooled data and sentinel case review. Implementation also involves trade-offs: logging, auditing, training, and monitoring require infrastructure and staffing; tighter controls can add friction; and resource constraints may concentrate residual harm in settings least able to absorb it.
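
The instability of small-subgroup estimates can be shown with a confidence interval. The sketch below uses the Wilson score interval, a standard method for proportions that is our choice for illustration rather than something the framework prescribes; the counts are invented. With the same observed 80% comprehension rate, a subgroup of 10 patients gives an interval of roughly 0.49 to 0.94, while a subgroup of 500 gives roughly 0.76 to 0.83.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Same observed rate (80%), very different sample sizes (hypothetical).
small_lo, small_hi = wilson_interval(8, 10)
large_lo, large_hi = wilson_interval(400, 500)

print(f"n=10:  {small_lo:.2f} to {small_hi:.2f}")
print(f"n=500: {large_lo:.2f} to {large_hi:.2f}")
```

A gap trigger applied to the small subgroup would fire or clear largely by chance from cycle to cycle, which is the practical reason for pooling data across cycles or adding sentinel case review.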

Future Directions

Empirical research on postdeployment governance for GenAI in medical imaging is still emerging, and we present this framework as a set of testable hypotheses. Key priorities include prospective evaluation of governance routines and training programs to assess whether they reduce error propagation and improve equity, and development of sensitive sentinel indicators for each risk pathway.

Conclusion

GenAI will not be safe or equitable based on technical performance alone. Its real-world safety and equity effects will depend largely on what clinicians and patients do with generated text under time pressure and within imperfect workflows. The trust, verify, override heuristic provides a practical structure for postdeployment governance by translating broad principles into auditable routines, competency-based education, and pragmatic evaluation linked to action, including rollback.

More broadly, deploying AI is an intervention in human behavior. We encourage professional societies, accreditation bodies, and regulators to complement technical standards with expectations for postdeployment behavioral accountability, including observable verification practices, equity-based triggers, and learning-oriented incident systems. This accountability must also extend to AI developers and vendors, who share responsibility for limitation transparency, update disclosure, and structured adverse-event feedback from deploying organizations. Without these safeguards, GenAI may scale risk as readily as it scales efficiency.

Footnotes

ORCID iDs

Jeffry Glenning

Lisa Gualtieri

Funding

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
