Abstract
Objectives
To synthesize current evidence on the clinical applications of generative artificial intelligence (GenAI), particularly large language models (LLMs), in head and neck oncology, with a focus on translational readiness, clinical safety, and real-world applicability.
Methods
A scoping review was conducted using structured searches of PubMed and Scopus for studies published between January 1, 2020, and December 15, 2025. Search strategies combined controlled vocabulary and free-text terms related to generative AI and head and neck oncology. Eligible studies evaluated GenAI/LLMs in tasks including TNM staging, treatment planning, tumor board support, and patient education. Non-GenAI and non-oncologic studies were excluded. Following duplicate removal, records underwent title and abstract screening with full-text review of potentially relevant studies. Due to heterogeneity in study design, outcomes, and reporting, findings were synthesized qualitatively.
Results
Evidence remains early-stage and heterogeneous, dominated by simulation-based and small cohort studies with limited real-world validation. GenAI performs best in structured, language-based tasks such as clinical documentation, case summarization, and patient education. Moderate agreement with clinical standards is reported for TNM staging and guideline navigation in common scenarios, with reduced reliability in complex cases. In tumor board settings, GenAI supports summarization but produces variable treatment recommendations. Patient-facing outputs are generally readable but may lack accuracy or completeness. Common limitations include hallucination, omission of key clinical factors, and overgeneralization.
Conclusion
GenAI shows promise as an assistive tool in head and neck oncology but is not yet suitable for autonomous clinical decision-making. Prospective, workflow-integrated evaluation and standardized validation are needed before safe clinical adoption.
Keywords
Introduction
Generative artificial intelligence (GenAI), driven largely by advances in transformer-based large language models (LLMs), has rapidly transitioned from experimental technology to a visible presence across clinical medicine. In oncology, GenAI systems are increasingly explored for tasks ranging from clinical decision support and guideline interpretation to documentation, patient education, and research synthesis.1–4 This rapid uptake has been accompanied by growing scrutiny. Recent oncology-wide reviews describe both the transformative potential of GenAI and its intrinsic risks, including hallucinations, guideline drift, uncalibrated probability estimates, and unresolved medico-legal responsibility.5–8
Despite this expanding literature, head and neck oncology remains notably under-represented in existing syntheses. This omission is consequential. Head and neck cancer (HNC) care is distinguished by exceptional clinical complexity: deeply subsite-specific anatomy, laterality-dependent treatment decisions, granular and evolving Tumor–Node–Metastasis (TNM) staging systems, and therapeutic trade-offs that directly affect speech, swallowing, airway protection, and appearance.9–12 Decision-making is inherently multidisciplinary, relying on close coordination among surgeons, radiation and medical oncologists, radiologists, pathologists, speech-language pathologists, and supportive care teams. In this setting, even modest inaccuracies or overconfident recommendations may propagate through workflows with disproportionate clinical consequences.9–12
Crucially, GenAI systems differ fundamentally from traditional diagnostic AI. Whereas diagnostic models are trained and validated to optimize specific predictive endpoints, GenAI systems generate probabilistic language outputs based on learned statistical patterns.13,14 Their apparent “reasoning” reflects linguistic coherence rather than causal or mechanistic understanding. As a result, fluent outputs may convey a misleading sense of authority, a phenomenon increasingly described as synthetic confidence (the tendency of GenAI systems to generate fluent, authoritative-sounding outputs regardless of underlying accuracy), especially when responses are delivered in professional clinical language.14–16 In oncology, where uncertainty is common and nuance matters, this epistemic mismatch is not trivial.13–16
Beyond these general limitations, head and neck oncology poses distinctive cognitive and anatomical challenges that make it an especially rigorous test case for GenAI. Staging frequently depends on subtle descriptors (paraglottic fat invasion, pre-epiglottic space involvement, skull base foraminal extension, retropharyngeal nodal spread), where seemingly minor misinterpretations can dramatically alter treatment plans and functional outcomes. Treatment decisions regularly require balancing oncologic control against airway preservation, swallowing integrity, voice outcomes, and long-term quality of life. These complex trade-offs, central to HNC practice, are rarely represented in the textual patterns that LLMs learn from and thus expose the limits of GenAI’s ability to engage with uncertainty, nuance, and competing clinical priorities.9,11,17
At the same time, early head and neck–specific studies have begun to emerge. These include evaluations of GenAI for TNM staging from real-world clinical records, tumor board case summarization, and generation of patient education materials.11,18–20 Collectively, these studies suggest that GenAI may perform adequately, or even well, in certain constrained, language-centric tasks, while remaining unreliable for autonomous clinical decision-making.11,18–22 However, the evidence is fragmented, heterogeneous in methodology, and often interpreted without sufficient attention to the unique risks of the head and neck oncology context.
To date, no comprehensive review has synthesized this emerging literature through a head and neck–specific lens. This gap limits the field’s ability to distinguish where GenAI offers genuine value from where its limitations pose unacceptable risk. To our knowledge, this is the first review to systematically collate GenAI applications specific to head and neck oncology and interpret them through a clinical risk–stratified framework. It critically examines the peer-reviewed evidence on GenAI in head and neck oncology, emphasizing its performance in TNM staging and treatment support, its influence within multidisciplinary tumor boards, its role in patient education, and its ethical, governance, and safety implications. Rather than advocating uncritical adoption, this work seeks to articulate a principled framework for responsible integration, positioning GenAI as an assistive technology that may augment, rather than replace, expert clinical judgment in one of oncology’s most complex domains.
Methods
Search strategy.
Note. The search and screening process was reported in a PRISMA-ScR–informed manner adapted to the scoping nature of the review.
We included studies that explicitly evaluated, deployed, or analyzed GenAI/LLMs for HNC-related tasks, including TNM staging, treatment planning, tumor board support, patient counseling, and patient-facing informational content. Articles restricted to non-GenAI approaches (e.g., traditional machine learning, radiomics, convolutional neural networks) or to non-oncologic otolaryngology were excluded to maintain topic specificity.
Because the included studies varied substantially in design, clinical task, comparator, and evaluation metrics, quantitative meta-analysis was not attempted. Instead, studies were grouped thematically by clinical application (e.g., staging, tumor boards, patient education). A qualitative synthesis approach was used, emphasizing reported performance, contextual factors influencing outputs, reproducibility issues, and documented failure modes such as hallucination, omission, and overgeneralization. Formal risk-of-bias assessment tools were not applied due to the heterogeneity and early-stage nature of the literature, consistent with scoping review methodology.
Our objective was not to rank models but to identify which GenAI applications appear closest to real-world implementation, which remain exploratory, and what safety or governance gaps must be addressed before clinical use. This approach aligns with best practices for scoping reviews, which prioritize conceptual clarity and translational relevance when evidence heterogeneity precludes statistical aggregation.
Clinical applications of generative artificial intelligence in head and neck oncology
Clinical applications of generative AI in head and neck oncology.
Note. Summary of current GenAI applications across head and neck oncology, including TNM staging, tumor board support, and patient education. Approximate study scale and performance metrics (e.g., accuracy, concordance, readability indices, and agreement measures such as Cohen’s κ) are reported descriptively due to heterogeneity in study design and reporting. Evidence strength reflects consistency of findings, study design (simulation-based vs real-world), and degree of clinical validation. Readiness levels indicate assistive, not autonomous, clinical use based on reported performance, benefits, and observed failure modes.
TNM staging and clinical decision support
Several peer-reviewed, head and neck–specific studies have evaluated GenAI for TNM staging and clinical decision support. In one of the largest evaluations to date, a 2024 study assessed ChatGPT-4 on 263 head and neck squamous cell carcinoma (HNSCC) cases (oral cavity, oropharynx, hypopharynx, larynx), reporting moderate to substantial concordance with multidisciplinary tumor board decisions and NCCN guidelines, with κ values ranging from 0.48 to 0.78 for treatment recommendations. A 2025 simulation comparing ChatGPT-o1 and DeepSeek-V3 on staged HNC scenarios similarly found statistically significant accuracy across subsites (p < 0.05), with highest performance observed for common, well-represented subsites such as the larynx and early-stage disease.19,23 Collectively, these findings suggest that GenAI can approximate clinician reasoning under constrained, structured conditions and may function as a staging cross-check or educational aid, rather than an autonomous decision-maker.19,23,24
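The κ values cited above are chance-corrected agreement statistics (Cohen's κ), which is why they read lower than raw percent agreement. As an illustration only, the following minimal Python sketch computes κ for two sets of categorical treatment recommendations. The labels and the names `model_labels` and `board_labels` are hypothetical and are not drawn from any of the cited studies; the sketch assumes one model recommendation and one tumor board recommendation per case.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters need the same non-zero number of labels.")
    n = len(rater_a)

    # Observed agreement: fraction of cases where both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Expected agreement by chance, from each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label for every case.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical recommendations for ten cases (illustrative only).
model_labels = ["surgery", "surgery", "CRT", "CRT", "surgery",
                "RT", "CRT", "surgery", "RT", "CRT"]
board_labels = ["surgery", "surgery", "CRT", "surgery", "surgery",
                "RT", "CRT", "CRT", "RT", "CRT"]

kappa = cohens_kappa(model_labels, board_labels)
print(f"Cohen's kappa: {kappa:.4f}")
```

In this hypothetical example the two sources agree on 8 of 10 cases (80%), yet κ is 0.6875 because 36% agreement would be expected by chance alone. Under the commonly used Landis and Koch convention, values of 0.41 to 0.60 are described as moderate and 0.61 to 0.80 as substantial agreement.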
Complementary evidence is provided by Marchi et al., who evaluated ChatGPT responses to NCCN-style clinical scenarios in head and neck oncology. The model demonstrated strong alignment with guideline-concordant recommendations for adjuvant therapy and surveillance, particularly when prompts were explicitly structured and limited to guideline interpretation.25 These results reinforce a consistent pattern across studies: the more bounded and rules-based the task, the more reliably large language models tend to perform.
However, these same studies underscore important limitations. Performance degrades in the presence of incomplete clinical data, ambiguous imaging descriptors, or rare anatomic subsites such as the sinonasal tract or skull base. Borderline or “gray-zone” scenarios, including subtle cortical bone erosion distinguishing T3 from T4a oral cavity tumors, equivocal midline involvement in oropharyngeal primaries, or minimal prevertebral fascia contact in hypopharyngeal disease, further expose GenAI’s tendency to default to generic heuristics rather than interrogate uncertainty. In such cases, expert interpretation depends on integrating imaging, endoscopy, pathology, and functional assessment, information that is rarely fully captured in textual prompts. GenAI systems lack intrinsic awareness of missing or indeterminate data and may extrapolate rather than explicitly flag uncertainty. Consequently, erroneous outputs may appear linguistically polished yet clinically misleading, resulting in high-stakes “silent errors.”19,24,25
Multidisciplinary tumor boards
Multidisciplinary tumor boards (MDTs) represent a second major area of exploration for GenAI in head and neck oncology. Given the volume and heterogeneity of data reviewed in HNC MDTs, GenAI has been proposed as a tool for case preparation and summarization.26,27 Lechien et al. reported that ChatGPT-4 produced accurate TNM explanations in 95% of cases and appropriately identified diagnostic workup steps in most scenarios.28 These findings suggest that GenAI may reduce preparatory workload and promote more standardized case summaries in high-volume centers.
However, MDTs are not merely information-processing venues; they are deliberative social systems in which clinical reasoning evolves through negotiation, dissent, and iterative hypothesis refinement. Introducing GenAI-generated summaries therefore risks reshaping this communication ecology. Emerging evidence suggests that GenAI may inadvertently pre-frame discussions, emphasizing certain dimensions (e.g., oncologic aggressiveness or guideline alignment) while underrepresenting others (e.g., functional preservation, patient preference, or reconstructive complexity). Such framing may influence the trajectory of multidisciplinary debate even before expert discussion begins.18,28–30 Social science literature consistently demonstrates that early framing exerts a disproportionate influence on downstream group decisions, particularly among less-experienced clinicians.
Empirical studies reinforce these concerns. Schmidl et al. observed that ChatGPT occasionally proposed non–guideline-concordant recommendations and struggled to integrate competing clinical priorities in real MDT contexts.18 Similarly, comparative studies of ChatGPT-4 and ChatGPT-4o in recurrent and metastatic HNC tumor boards found that model fluency did not reliably translate into clinical soundness or decision consistency.28 Importantly, the principal risk is not overtly incorrect recommendations, but subtle distortions of reasoning pathways, including anchoring effects, reinforcement of implicit assumptions, and narrowing of the perceived decision space, that clinicians may not consciously detect. For these reasons, current evidence supports the use of GenAI in MDTs only for preparatory synthesis and documentation, with explicit human review and critical discussion prior to clinical decision-making.18,23,26,28
Patient education and communication
Patient education and communication represent the most mature and consistently supported application of GenAI in head and neck oncology. Unlike decision support, these tasks are fundamentally linguistic and therefore closely aligned with GenAI’s strengths in natural language generation, simplification, and adaptation to varying literacy levels.11,20 Lee et al. found that ChatGPT-generated explanations for HNC surgeries achieved comparable accuracy and superior readability relative to conventional patient education materials.31 Mnajjed and Patel similarly reported high performance on validated patient education metrics, including the Suitability Assessment of Materials and the Patient Education Materials Assessment Tool.32
These findings are reinforced by broader otolaryngology evidence. In a meta-analysis of otolaryngology-related tasks, Hack et al. reported that communication and education domains outperformed diagnostic and decision-support applications, achieving approximately 83% accuracy overall.22 Given that HNC treatments often involve complex trade-offs among survival, airway safety, swallowing, and voice, improved patient comprehension may meaningfully enhance shared decision-making. GenAI may also help bridge literacy, language, and access gaps.
Nevertheless, GenAI-mediated education is not risk-free. Wei et al. found that ChatGPT responses to common HNC patient questions were occasionally incomplete or less accurate than vetted online sources, particularly for nuanced clinical topics.33 Oversimplification, omission of uncertainty, and authoritative tone may inadvertently mislead patients if GenAI outputs are used without clinician review.
Synthesis across applications
The current evidence base evaluating GenAI in head and neck oncology remains early-stage and methodologically heterogeneous. Most studies are single-center, retrospective, or simulation-based evaluations with relatively small sample sizes and limited external validation. These designs introduce several potential sources of bias, including selection bias in case construction, incomplete representation of real-world clinical complexity, and reliance on curated or structured inputs that may not reflect routine clinical documentation.
A key distinction across studies is the use of simulated-case environments versus real clinical workflows. Simulation-based studies, often using standardized vignettes or fully specified clinical scenarios, tend to report higher agreement with guidelines or expert decision-making. However, these settings reduce ambiguity and omit the fragmented, incomplete, and context-dependent data that characterize real-world head and neck oncology practice. In contrast, the limited number of studies evaluating GenAI within actual clinical workflows or multidisciplinary tumor boards demonstrate more variable and less predictable performance, underscoring the gap between controlled evaluation and real-world deployment.
The literature is also susceptible to publication and reporting bias, with a predominance of proof-of-concept studies reporting favorable or promising results. Negative findings, inconsistent performance, or clinically unsafe outputs may be underreported. In addition, many studies evaluate GenAI under optimized prompting conditions, which may not reflect routine clinical use and may further overestimate real-world performance. Taken together, these factors suggest that current performance estimates should be interpreted cautiously, particularly when extrapolating beyond bounded, assistive use cases. These limitations should be considered when interpreting reported performance and when assessing the readiness of GenAI for clinical integration.
In aggregate, the available evidence across clinical applications converges on a central conclusion: GenAI is best suited to assistive, language-centric tasks in head and neck oncology that can be clearly bounded, structured, and reviewed by clinicians.35–37 When applied to staging verification, guideline navigation, case summarization, and patient education within defined constraints, GenAI can enhance efficiency, standardization, and communication.2,13–15,19,24,28,38
Several use cases therefore appear appropriate for near-term clinical deployment, including documentation assistance, literacy-adapted patient education, structured staging cross-checks against established guidelines, and preparatory tumor board case summarization with mandatory expert oversight. Medium-term targets include benchmark-driven, workflow-embedded evaluations of GenAI-assisted tumor board preparation and guideline navigation tools. By contrast, autonomous diagnosis or treatment recommendation, particularly in rare subsites or gray-zone scenarios, remains unsupported by the available evidence.
When extended beyond these assistive roles toward independent clinical judgment, GenAI’s known limitations—including hallucination, omission of functional considerations, overgeneralization, and susceptibility to framing effects—introduce clinically meaningful risk in a domain as anatomically complex and functionally consequential as head and neck oncology.9–12,14,15,19,24,39,40
From a head and neck oncology perspective, several clinically critical domains remain underexplored in the current GenAI literature. Functional outcome prediction—including speech intelligibility, swallowing function, and airway preservation—plays a central role in treatment selection but is rarely incorporated into current GenAI evaluations, which remain predominantly focused on oncologic endpoints.41–44 Similarly, reconstructive planning, including flap selection and anticipated functional rehabilitation, introduces an additional layer of complexity that is not well captured by existing language-based models.19,42,45,46 Integration of GenAI into radiotherapy planning workflows and decision-making for less common subsites, such as sinonasal and skull base malignancies, also remains limited.19,42,45,46 These gaps highlight important areas for future development, particularly for models intended to support comprehensive, multidisciplinary decision-making in head and neck oncology.19,42,45–48
Safety, governance, and the path forward
Major GenAI failure modes in head and neck oncology and recommended governance measures.
Note. Key failure modes observed across head and neck oncology studies, with representative clinical scenarios, potential consequences, and governance strategies. Framework emphasizes human-in-the-loop oversight, guideline anchoring, version control, and safeguards against automation bias, temporal drift, and equity gaps.
Across the head and neck–specific literature, several recurring failure modes are consistently observed. Hallucinations remain a central concern, particularly when GenAI systems are queried beyond narrowly constrained tasks.30,50,51 Omission errors, in which critical considerations such as airway risk, swallowing function, or quality-of-life trade-offs are absent from generated outputs, pose equal risk. Overgeneralization occurs when common disease patterns are inappropriately applied to rare subsites or atypical presentations. This reflects biases in training data and is especially hazardous in anatomically complex regions. Finally, temporal drift threatens factual reliability as staging systems and guidelines evolve, particularly when models are not explicitly anchored to version-controlled sources.20,30,40,52,53 These patterns have been reported primarily in controlled or retrospective settings, and their real-world frequency and impact remain incompletely characterized. Collectively, these failure modes may be particularly consequential in head and neck oncology, where small inaccuracies in tumor extent or reconstructive implications can lead to major deviations in care pathways.
Beyond model-intrinsic limitations, GenAI introduces important human–machine interaction risks. Automation bias, the tendency to over-trust algorithmic outputs, may be amplified by GenAI’s fluency and professional tone. In time-pressured clinical environments, clinicians may unconsciously defer to GenAI-generated summaries or recommendations, particularly when outputs appear confident or align with initial impressions. Of greater concern is subtle cognitive anchoring. Once a GenAI-generated frame or hypothesis is introduced, it can disproportionately shape subsequent reasoning, even when incorrect.39,40,49,52 These effects are well described in decision science and may be amplified in AI-assisted contexts, particularly in multidisciplinary settings where early framing strongly influences group consensus.
These safety concerns carry unresolved medico-legal implications. Most GenAI systems currently used in oncology fall outside formal medical device regulation, placing accountability primarily on clinicians and institutions.4,49 When GenAI-assisted content contributes to patient harm, responsibility attribution, among clinician, institution, and AI vendor, remains unclear, particularly when AI-generated text is incorporated into the medical record without explicit labeling. This uncertainty is compounded by the rapid and opaque update cycles of commercial LLMs, whereby identical prompts may yield different outputs over time, undermining reproducibility and legal defensibility. Emerging guidance, including recommendations from the 2025 NCCN AI Summit, suggests explicit labeling of GenAI-assisted content in clinical documentation to ensure auditability and medico-legal transparency.39,40,49
A related governance challenge is the need for rigorous AI provenance tracking. As models evolve, institutions require mechanisms to document the specific model version, prompt structure, and contextual inputs used to generate a given output. Without such provenance logs, analogous to metadata in radiologic PACS systems, it becomes difficult to retrospectively evaluate decisions, conduct morbidity-and-mortality review, or support institutional learning. This lack of traceability represents a major barrier to quality assurance in head and neck oncology.12,54
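As one possible illustration of the provenance log described above, the following Python sketch records the model version, prompt structure, and a fingerprint of the contextual inputs for a single generated output. Every name here (`ProvenanceRecord`, `make_record`, `to_log_line`, the field names, and the example model identifiers) is hypothetical and not taken from any cited system or standard. Storing SHA-256 hashes rather than raw clinical text is an assumption made to keep identifiable content out of the log; an institution might instead store pointers to the source documents.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ProvenanceRecord:
    """One immutable log entry describing how a GenAI output was produced."""
    model_name: str
    model_version: str
    prompt_template: str
    context_sha256: str   # fingerprint of the clinical inputs, not the inputs themselves
    output_sha256: str    # fingerprint of the generated text
    generated_at: str     # ISO 8601 timestamp in UTC
    reviewed_by: Optional[str] = None  # clinician who endorsed the output, if any


def sha256_text(text: str) -> str:
    """Return the hexadecimal SHA-256 digest of a UTF-8 string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def make_record(model_name, model_version, prompt_template,
                context, output, reviewed_by=None):
    """Build a provenance record for one prompt/output pair."""
    return ProvenanceRecord(
        model_name=model_name,
        model_version=model_version,
        prompt_template=prompt_template,
        context_sha256=sha256_text(context),
        output_sha256=sha256_text(output),
        generated_at=datetime.now(timezone.utc).isoformat(),
        reviewed_by=reviewed_by,
    )


def to_log_line(record: ProvenanceRecord) -> str:
    """Serialize a record as one JSON line suitable for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)


# Hypothetical usage with placeholder values.
record = make_record(
    model_name="example-llm",
    model_version="2025-01-01",
    prompt_template="Summarize the following case for tumor board: {case}",
    context="placeholder de-identified case text",
    output="placeholder generated summary",
)
print(to_log_line(record))
```

A record of this kind would let a reviewer later confirm which model version and prompt produced a given summary, and whether the stored output still matches what was entered in the chart, without requiring the log itself to hold patient data.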
Responsible integration therefore requires carefully designed and enforceable governance frameworks. Human-in-the-loop oversight should be considered a foundational safeguard, whereby GenAI outputs are reviewed, edited, and explicitly endorsed by clinicians before influencing decision-making or entering the medical record.2,39,49,52 Retrieval-augmented generation (RAG), anchored to curated and version-controlled guideline sources (e.g., NCCN), may improve factual grounding but does not eliminate error or bias. Complementary explainable-AI techniques, such as attribution methods highlighting which guideline clauses or clinical features informed a recommendation, may further enhance interpretability, although head and neck–specific validation remains limited.12 GenAI-assisted materials should be clearly identified, and patients should be informed when such systems contribute to education or communication.2,39,49,52
Finally, institutional policies must clearly define acceptable use cases, documentation standards, and auditability requirements. Professional societies in head and neck oncology are well positioned to establish field-specific guidance, reducing inter-institutional variability and reinforcing shared ethical norms. In parallel, clinician education in GenAI literacy, including recognition of model limitations, common failure modes, and appropriate skepticism, will be essential for safe adoption.2,39,49,52 As deployment expands, institutions will require dynamic monitoring systems capable of detecting performance drift, equity gaps, and unintended downstream effects on clinical outcomes.
Future directions
Looking forward, future research should prioritize prospective, workflow-embedded evaluations of GenAI in head and neck oncology. To date, nearly all studies have relied on retrospective prompts, synthetic vignettes, or simulation-based assessments, which do not capture the real pressures, uncertainties, and incomplete data that shape clinical decision-making. Cluster-randomized or stepped-wedge evaluations of GenAI-assisted tumor board preparation, patient counseling, or staging verification, measuring time savings, decision consistency, error rates, and clinician trust, represent a critical next phase of validation.
Importantly, many of the limitations observed across current GenAI applications in head and neck oncology are not merely implementation failures but reflect more fundamental constraints of large language model architectures. LLMs are trained predominantly on large-scale internet and text-based corpora and lack exposure to sufficiently granular, high-quality, domain-specific clinical datasets. As a result, their apparent “reasoning” reflects probabilistic linguistic pattern matching rather than true clinical, anatomical, or pathophysiological understanding. This epistemic limitation likely underlies the consistent degradation in performance observed in nuanced, gray-zone scenarios that require integration of imaging subtleties, functional trade-offs, and tacit subspecialty knowledge.
Beyond evaluation design, advances in multimodal AI architectures offer the potential to better address the complexity of HNC care. Models capable of integrating CT/MRI imaging, pathology descriptors, genomic data, endoscopic findings, and structured clinical text may eventually support safer, more holistic decision assistance. However, multimodal foundation models will require rigorous domain-specific training, robust guardrails, and prospective clinical trials; without these safeguards, increased model complexity may simply compound existing risks.
In this context, future progress in AI-assisted clinical decision support may depend less on further scaling of general-purpose LLMs and more on domain-specific models trained on structured clinical data and explicit domain knowledge representations. Emerging approaches such as tabular foundation models, including TabPFN, demonstrate that strong predictive performance can be achieved from comparatively small, structured datasets when appropriate inductive priors are incorporated. Hybrid systems that integrate structured clinical variables, domain priors, and language-based interfaces may therefore offer a more reliable and clinically aligned pathway toward decision support than text-only generative models.
A parallel priority is the development of head and neck–specific benchmark suites. Existing LLM benchmarks are overwhelmingly generic and fail to capture the unique anatomical, functional, and multidisciplinary considerations of HNC. Purpose-built datasets, including expertly annotated staging vignettes, imaging-grounded decision scenarios, reconstruction planning dilemmas, and validated patient FAQ sets across multiple literacy levels, would allow reproducible comparisons across models and facilitate detection of performance drift over time.
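To make the idea of benchmark-based drift detection concrete, the following Python sketch re-scores two model versions against a fixed, expert-annotated reference set and flags a drop in accuracy. It is a simplified illustration under stated assumptions: the staging labels are hypothetical, the function names `accuracy` and `flag_drift` are invented for this example, and the 0.05 tolerance is an arbitrary placeholder rather than a validated threshold. A real benchmark would need far more cases and a statistical test rather than a fixed cutoff.

```python
def accuracy(predictions, reference):
    """Fraction of predictions that exactly match the reference labels."""
    if len(predictions) != len(reference) or not reference:
        raise ValueError("Predictions and reference must be equal-length and non-empty.")
    return sum(p == r for p, r in zip(predictions, reference)) / len(reference)


def flag_drift(baseline_accuracy, current_accuracy, tolerance=0.05):
    """Return True when accuracy has fallen by more than the allowed tolerance."""
    return (baseline_accuracy - current_accuracy) > tolerance


# Hypothetical expert-annotated T categories for five benchmark vignettes.
reference = ["T2", "T3", "T4a", "T1", "T2"]

# Hypothetical outputs from an earlier and a later version of the same model.
version_1 = ["T2", "T3", "T4a", "T1", "T3"]
version_2 = ["T2", "T3", "T3", "T1", "T3"]

baseline = accuracy(version_1, reference)
current = accuracy(version_2, reference)
drifted = flag_drift(baseline, current)
print(f"baseline={baseline:.2f} current={current:.2f} drift_flagged={drifted}")
```

Because the reference set stays fixed while the model changes, any difference in score can be attributed to the model update rather than to a change in case mix, which is the property that makes a purpose-built benchmark useful for monitoring over time.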
Equity considerations must also be foregrounded. High reliability and cost-effective deployment could, in principle, expand access to expert-quality information in regions with limited HNC subspecialists. Yet unequal access to high-quality deployments, language limitations, and the risk of model biases replicating existing disparities could widen gaps in outcomes if not closely monitored. Ensuring equitable access to validated tools, multilingual capability, culturally sensitive content generation, and community-centered evaluation frameworks will be essential.
Finally, the field must avoid treating head and neck oncology as a permissive testing ground for experimental AI tools. HNC decisions can alter long-term airway function, swallowing, voice, appearance, and overall quality of life. The threshold for acceptable AI error is therefore likely to be substantially lower than in less functionally consequential domains. Accordingly, deployment strategies should be approached cautiously and supported by rigorous validation prior to widespread clinical integration.
Conclusion
In conclusion, GenAI holds genuine promise as an assistive technology in head and neck oncology, particularly for language-centric tasks such as documentation, case summarization, guideline navigation, literacy-adapted patient education, and cross-checking structured staging information. At present, its strengths lie in augmenting human expertise rather than replacing it. Autonomous diagnosis or treatment recommendation, whether in primary, recurrent, or metastatic disease, remains unsupported by evidence and carries unacceptable risk.
A principled path forward requires restraint, transparency, and governance, ensuring that GenAI augments, not substitutes, expert clinical judgment. Explicit labeling of GenAI-assisted content, human-in-the-loop oversight, version-controlled AI provenance logs, and ongoing monitoring for drift and inequity will be essential components of safe clinical integration. If developed and deployed responsibly, GenAI may streamline preparation, enhance communication, and narrow access gaps, while preserving the clinical reasoning, nuance, and multidisciplinary expertise that define high-quality HNC care.
Footnotes
Author contribution
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
