Abstract
Objectives
To explore the effectiveness of using ChatGPT 4.0 to assist human physicians in assessing dysphagia in patients undergoing radiotherapy for head and neck cancer.
Methods
This prospective study included 100 head and neck cancer (HNC) patients who visited our hospital between January 2025 and October 2025. All participants first underwent an independent dysphagia assessment conducted by a human physician (Physician A; control group). They were then evaluated by a similarly qualified physician (Physician B) with the assistance of ChatGPT 4.0 (experimental group). The comprehensive assessment of an expert group consisting of two senior head and neck surgeons, each with ten years of experience, served as the “gold standard.” The consistency of the evaluation results among the three groups was compared to validate the effectiveness of language model-assisted assessment.
Results
The Kappa consistency index between the experimental group and the expert group was 0.87, indicating “good” agreement and significantly exceeding the control group’s 0.70. Subgroup analysis across EAT-10 and MDADI score ranges showed that, among 85 patients with EAT-10 scores ≥ 3, the control group accurately identified 72 cases (accuracy 84.7%) and the experimental group 80 cases (94.1%). Among 78 patients with MDADI scores ≤ 69, the control group accurately identified 65 cases (83.3%) and the experimental group 73 cases (93.6%).
Conclusions
The assessment model combining large language models with human physicians effectively improves the accuracy and consistency of dysphagia assessment in patients undergoing radiotherapy for head and neck cancer.
Background
Head and neck cancer (HNC) is one of the most common malignancies in the world and has a high mortality rate.1 The primary treatment modalities for HNC include surgery, radiotherapy, chemotherapy, and targeted molecular therapy, with at least 80% of HNC patients receiving preoperative or postoperative radiotherapy.2,3 The treatment of HNC, particularly radiotherapy, significantly improves patient survival rates; however, the long-term sequelae, especially dysphagia, have become a core challenge impacting patients’ quality of life.4 The impact of severe dysphagia can be even greater than that of xerostomia.5 These patients primarily exhibit symptoms such as difficulty swallowing food, coughing during meals, food retention in the oropharynx, nasal regurgitation, or aspiration. These symptoms not only severely impact the patients’ nutritional intake and physiological condition but also often lead to psychological issues such as anxiety and depression, limiting their social interactions. The interplay of these factors ultimately significantly decreases the patients’ overall quality of life.
Currently, the methods used to assess dysphagia in patients with head and neck cancer fall into two categories: objective and subjective assessments. Objective methods are considered the “gold standard” for diagnosing swallowing disorders and mainly include the Videofluoroscopic Swallowing Study (VFSS) and Fiberoptic Endoscopic Evaluation of Swallowing (FEES). Although VFSS and FEES are precise, they have limitations such as high equipment requirements, complex procedures, high costs, and being invasive or involving ionizing radiation. These factors restrict their widespread application in primary medical institutions and limit their feasibility as routine screening tools.6,7 Subjective assessments primarily rely on standardized questionnaire scales, such as the Eating Assessment Tool-10 (EAT-10) and the M.D. Anderson Dysphagia Inventory (MDADI).8,9 A study of evaluation methods for dysphagia after radiochemotherapy for nasopharyngeal cancer found that, using VFSS as the gold standard, the EAT-10 demonstrated a sensitivity of 85% but was unable to distinguish between different degrees of swallowing difficulty.10 This indicates that while these methods are convenient and non-invasive, their limited diagnostic accuracy may leave a portion of patients “missed.” Therefore, developing a tool that can integrate multidimensional patient information to assist physicians in conducting precise and efficient assessments holds significant clinical importance for improving the prognosis and quality of life of HNC patients.
In recent years, large language models (LLMs), exemplified by ChatGPT, have made significant breakthroughs in the field of natural language processing and have begun to demonstrate tremendous application potential in healthcare.11 LLMs, through deep learning on vast amounts of medical literature, clinical guidelines, and case records, have developed the ability to deeply understand medical terminology and complex contexts, enabling them to efficiently analyze and integrate diverse and heterogeneous healthcare data. In terms of assisting diagnosis, LLMs can analyze patients’ medical history, clinical symptoms, laboratory test results, and imaging reports, providing physicians with comprehensive and in-depth diagnostic recommendations. In some cases, they can even predict disease progression and potential complications.12–14
The exploration of LLM applications in clinical practices related to head and neck cancer has also begun to take shape. Kayaalp et al.15 utilized ChatGPT, DeepSeek, and Grok to standardize the staging of HNC patients using information extracted from electronic medical record systems. ChatGPT demonstrated the best overall consistency rate, with a Cohen’s kappa value of 0.797 and an F1 score of 0.84, indicating relatively high accuracy. Rajendran et al.16 integrated imaging and clinical data using LLMs through visual language models, providing an optimal method for automatic localization of radiotherapy targets, which improved the accuracy of target volume delineation and facilitated broader AI-assisted automation in radiotherapy planning. Lorenzi et al.17 assessed the efficacy of ChatGPT 4.0 and Gemini Advanced in providing treatment recommendations for head and neck tumor cases. The results showed that ChatGPT 4.0 outperformed Gemini Advanced in guideline adherence and comprehensive treatment planning, highlighting its potential in multidisciplinary management of head and neck tumors.
Despite the broad application prospects of large language models in the medical field, their application value in the specific context of assessing dysphagia after radiochemotherapy for head and neck cancer remains insufficiently validated by systematic clinical research. Therefore, this study aims to use LLMs as an auxiliary tool directly applied in clinical assessment processes and to compare the effectiveness and consistency of human physician assessments conducted with the assistance of these models against traditional human physician assessments.
Methods
Sample size calculation
The difference in assessment accuracy between the two groups served as the primary endpoint. Assuming a control-group accuracy p0 of 75% and an experimental-group accuracy p1 of 90%, with α = 0.05 (two-tailed) and β = 0.2 (80% power), the sample size formula in Equation (1) yields approximately 85 cases per group. To account for potential dropouts and exclusions during the study, the sample size was expanded by 10%, resulting in a minimum required sample size of 94 cases per group.
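Equation (1) itself is not reproduced in this section. For reference, the standard two-sample proportion formula that such power calculations typically use is shown below; which exact variant the authors applied (pooled vs. unpooled variance, with or without continuity correction) is an assumption, and the choice shifts the resulting n by several cases.

```latex
n \;=\; \frac{\left( z_{1-\alpha/2}\,\sqrt{2\,\bar{p}\,\bar{q}} \;+\; z_{1-\beta}\,\sqrt{p_0 q_0 + p_1 q_1} \right)^{2}}{(p_1 - p_0)^{2}},
\qquad
\bar{p} = \frac{p_0 + p_1}{2},\quad \bar{q} = 1 - \bar{p},\quad q_i = 1 - p_i
```

Here p0 = 0.75, p1 = 0.90, z(1-α/2) = 1.96 for a two-tailed α of 0.05, and z(1-β) = 0.84 for 80% power.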
Patients
Inclusion criteria: (1) pathologically confirmed diagnosis of a head and neck malignant tumor; (2) age ≥ 18 years; (3) completed radical radiotherapy; (4) clear consciousness and ability to cooperate with the assessment examinations.
Exclusion criteria: (1) severe dysphagia existing prior to radiotherapy (e.g., inability to eat orally); (2) coexisting severe dysfunction of major organs such as the heart, liver, or kidneys; (3) cognitive impairment or psychiatric disorders that hinder cooperation during assessments; (4) severe distant metastasis with a life expectancy of less than 3 months; (5) refusal to participate or inability to complete the entire assessment.
Based on the above inclusion and exclusion criteria, this study prospectively collected data from 100 HNC patients who visited our hospital from January 2025 to October 2025.
This study was approved by the Institutional Review Board of the Second Affiliated Hospital of Fujian Medical University [2025-074], and all participants gave written informed consent before enrollment.
Research methods
Subject grouping
All 100 participants first entered the control group, where they underwent independent dysphagia assessments conducted by Physician A, an attending physician with over 5 years of experience in head and neck cancer rehabilitation. After completing the control group assessments, all participants then entered the experimental group, where assessments were conducted by Physician B, another attending physician of equal qualifications and title, with the assistance of the LLM. Throughout the study, the two physicians were blinded to each other’s assessment results to ensure the independence of the evaluations.
Control group evaluation process
During the control group assessment phase, Physician A conducted a comprehensive clinical evaluation for each participant. The assessment included: (1) Medical History Collection: Detailed inquiries about the patient’s tumor type, stage, treatment plan (radiotherapy dosage, chemotherapy regimen), treatment completion date, and the patient’s subjective experience of dysphagia symptoms (e.g., difficulty with solid or liquid food, frequency of coughing, prolonged eating time); (2) Physical Examination: Visual inspection of the oropharynx to observe for mucosal inflammation, ulcers, and saliva secretion, as well as an examination of mouth opening, tongue movement, and strength; (3) Scale Assessment: Guiding patients to complete standardized swallowing function evaluation scales, including EAT-10 and MDADI; (4) Bedside Screening: Selective administration of either the Water Swallowing Test (WST) or the Repeated Saliva Swallowing Test (RSST) as clinically warranted.
Physician A synthesized all this information to make a final judgment on the patient’s dysphagia, recording the assessment results (whether dysphagia was present and the degree of severity).
Experimental evaluation process
During the experimental group assessment phase, Physician B first received a swallowing function analysis report generated by ChatGPT 4.0 and then conducted the final clinical evaluation based on this report. The specific process was as follows: (1) Data Input: The electronic medical record text of the participants (including medical history, treatment plans, etc.) and the raw data from EAT-10 and MDADI scales completed during the control group phase were input into the ChatGPT 4.0 analysis system; (2) ChatGPT 4.0 Analysis: ChatGPT 4.0 performed a deep analysis of the input data. The model automatically extracted key information such as tumor location, radiotherapy dosage, scale scores, etc., and generated a structured analysis report by integrating its learned medical knowledge and risk prediction models. The report included: basic patient information, identification of key risk factors, preliminary risk level assessment based on scale scores, and a comprehensive evaluation conclusion from the model. (3) Physician Comprehensive Assessment: After reviewing the analysis report generated by ChatGPT 4.0, Physician B combined this information with direct patient interviews and physical examinations to evaluate, supplement, or correct ChatGPT 4.0’s recommendations, ultimately making an independent clinical judgment and recording the assessment results.
Assessment tools and methods
Eating Assessment Tool-10 (EAT-10) questionnaire
The EAT-10 is a self-assessment scale comprising 10 items designed to screen for and grade the severity of dysphagia (Figure 1). Each item is rated on a 5-point scale (0 = no problem, 1 = mild problem, 2 = moderate problem, 3 = severe problem, 4 = unable to complete), giving a total score of 0 to 40. A higher total score indicates more severe dysphagia. A total score of ≥ 3 is typically used as the cutoff for the presence of dysphagia. Figure 1. EAT-10 swallowing difficulties screening questionnaire.
M.D. Anderson Dysphagia Inventory (MDADI)
The MDADI is a swallowing-related quality-of-life scale designed specifically for head and neck cancer patients. It consists of 20 items divided into four dimensions: emotional, functional, physical, and social (Figure 2). Each item uses a 5-point rating scale, with higher scores indicating better swallowing function and less impact on quality of life. The total score ranges from 20 to 100; a score of 69 or lower indicates the presence of swallowing difficulties. Figure 2. M.D. Anderson Dysphagia Inventory (MDADI).
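As described above, both scales reduce to simple item sums with fixed cutoffs (EAT-10 ≥ 3, MDADI ≤ 69). A minimal sketch of this scoring, assuming plain summation as stated in the text (function names are illustrative, not from the study):

```python
def score_eat10(items):
    """Sum the 10 EAT-10 items (each 0-4); a total >= 3 suggests dysphagia."""
    assert len(items) == 10 and all(0 <= i <= 4 for i in items)
    total = sum(items)
    return total, total >= 3

def score_mdadi(items):
    """Sum the 20 MDADI items (each 1-5); a total <= 69 suggests dysphagia."""
    assert len(items) == 20 and all(1 <= i <= 5 for i in items)
    total = sum(items)
    return total, total <= 69

# Example: a few mild EAT-10 complaints already reach the cutoff,
# while a uniformly "neutral" MDADI (all 3s) falls below 69.
eat_total, eat_flag = score_eat10([1, 0, 0, 1, 0, 1, 0, 0, 0, 0])  # (3, True)
mdadi_total, mdadi_flag = score_mdadi([3] * 20)                    # (60, True)
```

Note that the low EAT-10 cutoff makes the scale sensitive but coarse, which matches the limitation discussed in the Background: it flags dysphagia readily but cannot grade its severity.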
Large language model-assisted analysis system
Medical record transcription and questionnaire data entry
To ensure consistency, this study used the same model version (GPT-4-turbo-2025-08-26) throughout. Before the experiment commenced, ChatGPT 4.0 was provided with the Expert Consensus Statement: Management of Dysphagia in Head and Neck Cancer Patients, published in 2023 by the American Academy of Otolaryngology-Head and Neck Surgery, as reference material.18 The prompt used was: “You are a medical auxiliary analysis tool specialized in evaluating swallowing dysfunction following radiotherapy for head and neck malignancies. Based on the patient’s clinical data, you must generate a structured report that strictly adheres to evidence-based principles and fixed assessment criteria. Core tasks: (1) Extract key clinical information relevant to swallowing function; (2) Integrate scale data to stratify risk; (3) Provide interpretable evaluation grounds and clinical reference recommendations, performing objective analysis solely from the input data without adding unsubstantiated assumptions or inferences beyond the data scope.” The physician in the experimental group directly copied and pasted the patients’ unstructured medical record text (chief complaints, history of present illness, past medical history, treatment records, EAT-10 and MDADI scale scores, etc.) into the text box, and the system backend automatically integrated and preprocessed these structured and unstructured data.
Model analysis and risk level output
Upon receiving the input data, ChatGPT 4.0 conducted a multi-step analysis. First, it used natural language processing (NLP) techniques to automatically extract key entities and relationships related to swallowing dysfunction from the medical record text, such as “tumor stage: Stage III,” “radiation dose: 70 Gy,” and “radiation-induced mucositis: Grade III.” Next, the model integrated this structured information with the EAT-10 and MDADI scale scores. Finally, drawing on its risk-prediction capability, it generated a structured analytical report that clearly presented the final assessment level, the basis for the judgment, and clinical guidance recommendations.
Dysphagia assessment criteria
To evaluate the consistency between the two assessment models, the comprehensive assessment results from the expert group (composed of two senior head and neck surgeons and one radiation oncologist, each with 10 years of experience) served as the “gold standard.”
In the control, experimental, and expert groups alike, the final assessment by human physicians was based on unified clinical standards. The diagnosis of swallowing dysfunction comprehensively considered the following aspects: (1) Subjective symptoms: patient-reported significant swallowing difficulties, such as eating with effort, a sensation of food obstruction, or coughing; (2) Scale scores: an EAT-10 total score ≥ 3, or an MDADI total score ≤ 69; (3) Bedside screening: a Water Swallowing Test (WST) grade ≥ III, or fewer than 3 swallows in 30 seconds on the Repeated Saliva Swallowing Test (RSST); (4) Clinical signs: physical examination revealing abnormalities related to swallowing function, such as tongue muscle weakness, limited mouth opening, or poor soft palate movement.
Physicians integrated the above information to classify swallowing dysfunction as absent, mild, moderate, or severe.
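The study leaves the weighting of the four criteria, and the mild/moderate/severe grading, to physician judgment. Purely as an illustration of the unified standard (all names and the any-criterion rule are hypothetical, not the study's algorithm), the positive findings could be enumerated like this:

```python
from dataclasses import dataclass

@dataclass
class Assessment:
    subjective_symptoms: bool  # effortful eating, obstruction sensation, coughing
    eat10_total: int           # 0-40
    mdadi_total: int           # 20-100
    wst_grade: int             # Water Swallowing Test grade I-V, coded 1-5
    rsst_swallows: int         # swallows in 30 s on the RSST
    abnormal_signs: bool       # tongue weakness, limited mouth opening, etc.

def criteria_met(a: Assessment) -> list:
    """Return which of the four unified criteria are positive."""
    met = []
    if a.subjective_symptoms:
        met.append("subjective symptoms")
    if a.eat10_total >= 3 or a.mdadi_total <= 69:
        met.append("scale scores")
    if a.wst_grade >= 3 or a.rsst_swallows < 3:
        met.append("bedside screening")
    if a.abnormal_signs:
        met.append("clinical signs")
    return met
```

A function like this can only flag criteria; translating the flags into a severity grade remains a clinical judgment, which is exactly the step the study assigns to the physician.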
Statistical analysis
All statistical analyses were conducted using SPSS version 26.0. Continuous data are expressed as mean ± standard deviation (x̄ ± s) and were compared between groups using paired t-tests. Categorical data are presented as frequencies and percentages and were compared using chi-square tests. The consistency of the control and experimental groups’ assessment results with those of the expert group was analyzed using the Kappa consistency test: a Kappa value > 0.75 indicates good consistency, 0.40-0.75 moderate consistency, and < 0.40 poor consistency. A p-value < 0.05 was considered statistically significant.
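The Kappa statistic above is not SPSS-specific; Cohen's kappa for two raters is simply observed agreement corrected for chance agreement. A minimal pure-Python implementation (illustrative, not the study's code) with a toy example:

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance)."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed proportion of exact agreements
    p_observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    p_chance = sum((freq_a[c] / n) * (freq_b[c] / n) for c in labels)
    return (p_observed - p_chance) / (1 - p_chance)

# Toy example: 6 of 8 severity ratings agree (0.75 observed, 0.25 by chance)
expert    = ["none", "none", "mild", "mild", "moderate", "moderate", "severe", "severe"]
physician = ["none", "mild", "mild", "mild", "moderate", "severe",   "severe", "severe"]
kappa = cohen_kappa(expert, physician)  # (0.75 - 0.25) / (1 - 0.25) ≈ 0.667
```

This chance correction is why kappa is preferred over raw percent agreement here: with only four severity categories, two raters agree fairly often even when rating independently.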
Results
Basic characteristics
Participant demographics.
Distribution of assessment results
Distribution of evaluation results between the control group and the experimental group.
Kappa consistency test analysis
The Kappa index for consistency between the control group and the expert group was 0.70 (95% CI: 0.55-0.81; p = 0.002). The Kappa index between the experimental group and the expert group was 0.87 (95% CI: 0.74-0.95; p < 0.001). The experimental group thus agreed with the expert group significantly more closely than the control group did, reaching the level of “good” consistency. This suggests that the assistance of a large language model significantly enhances the accuracy of physician assessments, bringing them closer to expert consensus.
Accuracy analysis of language model-assisted evaluation
We further analyzed the accuracy of assessments in different EAT-10 and MDADI score ranges. Among patients with an EAT-10 score of ≥3 (n=85): the control group accurately identified 72 cases, resulting in an accuracy rate of 84.7%, while the experimental group accurately identified 80 cases, with an accuracy rate of 94.1%. For patients with an MDADI score of ≤69 (n=78): the control group accurately identified 65 cases, yielding an accuracy rate of 83.3%, whereas the experimental group accurately identified 73 cases, achieving an accuracy rate of 93.6%. These data indicate that among patients where the scale scores suggest the presence of swallowing dysfunction, the assistance of a large language model further enhances the accuracy of physician identification.
Analysis of typical cases
This study presents the analysis process of a typical case (Figure 3). The patient is a 58-year-old male with oropharyngeal carcinoma (T2N1M0) who had completed radical radiotherapy (66 Gy/33 fractions) and concurrent cisplatin chemotherapy. His EAT-10 score was 25, and his MDADI score was 62. The physician in the control group relied primarily on the scale scores and assessed the patient as having “moderate” swallowing dysfunction. During analysis of the medical record, ChatGPT 4.0 captured the following key information: “The patient reports a significantly worsening sensation of obstruction when swallowing solid food over the past week,” “He frequently wakes up at night due to dry mouth and occasionally coughs after drinking,” and “Family members report that the patient’s eating speed has noticeably slowed and that he appears distressed while eating.” Combining this textual information with the scale scores, the model indicated that the patient might have “severe” swallowing dysfunction. After referring to the model’s analytical report, the physician in the experimental group re-engaged in detailed communication with the patient and family to confirm the reported symptoms and revised the assessment to “severe.” Figure 3. ChatGPT 4.0-assisted dysphagia evaluation flowchart: patient’s personal medical history input section (a), structured clinical report generation section (b), and comprehensive analysis and risk stratification section (c).
In subsequent follow-up over three months, the patient’s swallowing difficulties continued to worsen, ultimately leading to a gastrostomy to improve nutritional status, confirming the assessment of “severe” swallowing dysfunction. This case underscores the potential of large language models to deeply mine details from medical texts, assisting physicians in identifying severe symptoms that may be obscured by scale scores or initial impressions, thereby enabling more accurate assessments.
Discussion
The core finding of this study is that an evaluation model combining large language models with human physicians achieves significantly greater accuracy and consistency in assessing swallowing dysfunction in patients undergoing radiotherapy and chemotherapy for head and neck malignancies than assessment by physicians alone. This finding has important clinical significance. It confirms the potential of artificial intelligence technologies in complex clinical evaluation tasks, indicating that large language models are not a replacement for physicians but powerful auxiliary tools: by processing vast amounts of unstructured textual data, they provide physicians with decision support. The study also exposes potential shortcomings of traditional assessment, in which critical information can be missed because of cognitive overload, subjective bias, or time constraints. Furthermore, this model offers new possibilities for the early identification of and intervention in swallowing dysfunction. More accurate assessments can facilitate prompt recognition of high-risk patients and timely initiation of rehabilitation training or nutritional support, thereby preventing severe complications and improving long-term patient outcomes.
The introduction of large language models can significantly enhance the efficiency of swallowing function assessments. Traditional evaluation processes require physicians to spend substantial time reading medical records, communicating with patients, and synthesizing various pieces of information to make judgments. In contrast, the model can analyze medical text and scale data within seconds, generating structured reports that greatly reduce assessment time.11 Moreover, the evaluation standards of the model are based on predefined algorithms and rules, providing a high degree of consistency and repeatability. This helps to avoid subjective judgment biases that may arise from differences in experience and knowledge backgrounds among different physicians.19,20 This advancement contributes to the standardization of swallowing function assessments, ensuring that results are more comparable across different physicians, departments, and even hospitals. Such consistency provides a reliable foundation for clinical research and quality control, facilitating more accurate benchmarking and evaluation of treatment outcomes.
The core advantage of large language models lies in their powerful capabilities for information integration and reasoning.21 They can organically combine seemingly isolated symptom descriptions scattered throughout the medical records, such as “dry mouth,” “coughing,” and “slow eating,” with quantitative scale scores to create a comprehensive, multidimensional profile of the patient.22,23 For instance, in a typical case from this study, the model inferred that the patient might have more severe dysfunction than reflected by the scale scores by identifying details such as “nocturnal coughing” and “distress during eating,” alongside elevated EAT-10 scores. This integrated analytical capability helps physicians transcend the limitations of linear thinking, allowing them to understand the patient’s condition from a more macro and systematic perspective, thereby facilitating more comprehensive and precise clinical decision-making.
Although the collaboration of ChatGPT with human physicians in assessing swallowing dysfunction in head and neck cancer patients undergoing radiation therapy demonstrates significant advantages, it is crucial to acknowledge the potential negative impacts of biases and hallucinations associated with large language models (LLMs) on patient safety and healthcare equity. To address this issue, the first step is to enhance the diversity and representativeness of the data used in model training. Ensuring that the dataset includes information from multiple medical institutions, various patient demographics, and a range of medical record types can effectively mitigate systemic biases resulting from imbalanced data. Moreover, regularly reviewing and updating the training data to reflect the latest clinical practices and patient characteristics is essential for maintaining the model’s adaptability and accuracy in dynamic healthcare environments. Additionally, incorporating clinical expertise to review the model’s outputs serves as an effective strategy for reducing biases and hallucinations. By establishing a feedback mechanism, physicians can evaluate and adjust the suggestions provided by the model, thereby ensuring the final decision-making is accurate and reliable.
Although this study achieved positive results, it has notable limitations. The performance of large language models depends heavily on the quality and quantity of their training data, so existing biases may be replicated and amplified, and there is a risk of generating false information (“hallucinations”). The outputs should therefore be treated as clinical decision support, with ultimate decision-making authority resting with professional physicians. Additionally, this was a single-center investigation with a relatively limited sample of 100 cases, which may introduce selection bias; the generalizability of the results needs to be validated in multi-center and diverse medical settings. Furthermore, all patients underwent the control group assessment first, which may introduce order bias. Finally, since ChatGPT 4.0 was not specifically trained on dysphagia evaluation data, errors in its output may not be appropriately corrected in the final assessment report.
Conclusions
Large language models, with their powerful natural language processing capabilities, can efficiently and accurately extract key information from unstructured medical records. This provides physicians with objective, data-driven decision support, effectively addressing potential oversights and subjective biases that humans may encounter when dealing with vast amounts of information.
Footnotes
Ethical considerations
This study was approved by the Institutional Review Board of the Second Affiliated Hospital of Fujian Medical University [2025-074].
Consent for publication
The study was reviewed and approved by the senior authors’ institutional review board and approved for publication.
Authors’ contributions
D.C.: Conceptualization; L.H.: Methodology; X.Z.: Validation; K.T.: Formal analysis; Y.L.: Investigation; S.S.: Resources; D.C.: Data Curation; S.S.: Writing - Original Draft; L.H.: Writing - Review & Editing; X.Z.: Visualization; S.S.: Supervision, Project administration and Funding acquisition. All authors read and approved the final manuscript.
Funding
The authors declare that financial support was received for the research and/or publication of this article. This research was funded by the Science and Technology Bureau of Quanzhou (2021N011S) and the Fujian Province Science and Technology Innovation Joint Fund Project (2024Y9354).
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
The datasets used and analyzed in the current study are available from the corresponding author, [S.S.], upon reasonable request.
