Sage Journals: Discover world-class research

Abstract

Purpose

This study enhances the efficiency of predicting complications in lung cancer patients receiving proton therapy by utilizing large language models (LLMs) and meta-analytical techniques for literature quality assessment.

Materials and Methods

We integrated systematic reviews with LLM evaluations, sourcing studies from Web of Science, PubMed, and Scopus, managed via EndNote X20. Inclusion and exclusion criteria ensured literature relevance. Techniques included meta-analysis, heterogeneity assessment using Cochran’s Q test and I² statistics, and subgroup analyses for different complications. Quality and bias risk were assessed using the PROBAST tool and further analyzed with models such as ChatGPT-4, Llama2-13b, and Llama3-8b. Evaluation metrics included AUC, accuracy, precision, recall, F1 score, and time efficiency (WPM).

Results

The meta-analysis revealed an overall effect size of 0.78 for model predictions, with high heterogeneity observed (I² = 72.88%, P < 0.001). Subgroup analysis for radiation-induced esophagitis and pneumonitis revealed predictive effect sizes of 0.79 and 0.77, respectively, with a heterogeneity index (I²) of 0%, indicating that there were no significant differences among the models in predicting these specific complications. A literature assessment using LLMs demonstrated that ChatGPT-4 achieved the highest accuracy at 90%, significantly outperforming the Llama3 and Llama2 models, which had accuracies ranging from 44% to 62%. Additionally, LLM evaluations were conducted 3229 times faster than manual assessments were, markedly enhancing both efficiency and accuracy. The risk assessment results identified nine studies as high risk, three as low risk, and one as unknown, confirming the robustness of the ChatGPT-4 across various evaluation metrics.

Conclusion

This study demonstrated that the integration of large language models with meta-analysis techniques can significantly increase the efficiency of literature evaluations and reduce the time required for assessments, confirming that there are no significant differences among models in predicting post proton therapy complications in lung cancer patients.

Plain Language Summary

Using Advanced AI to Improve Predictions of Treatment Side Effects in Lung Cancer: This research uses cutting-edge artificial intelligence (AI) techniques, including large language models like ChatGPT-4, to better predict potential side effects in lung cancer patients undergoing proton therapy. By analyzing extensive scientific literature quickly and accurately, this approach has proven to enhance the evaluation process, making it faster and more reliable in foreseeing complications from treatments.

Keywords

lung cancer proton therapy large language model ChatGPT meta-analysis prediction model risk of bias assessment tool

Introduction

This study is dedicated to developing and evaluating advanced predictive models to increase the accuracy of predicting complications after proton therapy for patients with lung cancer. Proton therapy is a sophisticated radiation therapy technique that focuses the radiation dose precisely on the tumor target, significantly reducing damage to surrounding healthy tissues.¹ It has been clinically proven to be highly effective for treating lung cancer patients. However, despite these advantages, posttreatment complications such as radiation esophagitis and radiation pneumonitis can still severely impact the quality of life of patients.^2,3

In recent years, numerous studies have utilized genetic markers, pharmacological interventions, imaging diagnostics, and invasive techniques to improve the diagnostic accuracy of these complications.^4-7 By accurately predicting such complications, it is possible to reduce the occurrence of adverse effects and enhance treatment outcomes, effectively decreasing mortality rates. This growing body of research underscores the need for innovative approaches that combine the precision of modern medical technologies with predictive modeling to better manage and mitigate the risks associated with proton therapy.

Owing to the relatively small sample size of lung cancer patients receiving proton therapy, statistical significance is often not apparent. Therefore, the use of meta-analysis methods to quantitatively assess the overall effects of all studies can increase the credibility of the research. With the increasing volume of medical literature, as demonstrated by a search using the keywords “machine learning” and “cancer” in the Web of Science database, which yielded 51,398 articles, there is a clear upward trend in research related to predictive models. This necessitates meticulous evaluation of each study’s results and bias risks to ensure its clinical applicability.

Assessing bias risk is crucial in the meta-analysis process, as biased results directly affect the credibility and applicability of the research.⁸ At least two independent reviewers are required to minimize the influence of individual bias in bias risk assessment.⁹ However, previous studies have shown that interrater reliability in bias risk assessment ranges from 48% to 90%.^10,11 As the number of studies increases, the time required for assessment can range from several days to weeks .⁹ To ensure the quality of the literature and accurate assessment of bias risk, we employed the prediction model risk of bias (PROBAST) Assessment tool to evaluate the bias risk and applicability of the predictive model studies. This tool covers 20 signaling questions that help researchers systematically identify and quantify biases in the literature, thereby improving the accuracy and reliability of predictive models.

In recent years, natural language processing (NLP) has become a powerful tool,^12-14 with generative AI and large language models (LLMs) such as OpenAI’s ChatGPT and Meta AI’s Llama series taking center stage.^15,16 These technologies have demonstrated remarkable capabilities in processing and generating natural language, garnering widespread attention. These LLMs can now perform a variety of complex NLP tasks, and increasing research is emerging on the application of ChatGPT in the health and medical fields.¹⁷ However, these models face numerous challenges in specialized medical literature assessment, such as generalizability, accuracy in understanding specialized terminology, and objectivity and consistency in evaluating literature quality.¹⁸ While NLP technology has been widely applied in many medical fields, such as in our previous research, to improve the efficiency and accuracy of literature assessment,¹⁹ its application in predicting complications after proton therapy remains relatively limited. Current studies rely mostly on traditional statistical methods, which may not fully utilize the hidden potential information in complex datasets.

To address these challenges, we combined meta-analysis methods with LLMs to investigate the performance of different predictive models in terms of complications after proton therapy for lung cancer. We evaluated the effectiveness of various models through meta-analysis and heterogeneity analysis and conducted in-depth literature quality assessments via LLMs, comparing the consistency of bias risk assessment between humans and the ChatGPT-4 and Llama models.

Materials and methods

Research Framework

This study comprises two main components. The first component focuses on exploring and comparing predictive models for identifying complications following proton therapy in lung cancer patients. We conducted detailed analyses of various complications, such as radiation esophagitis and radiation pneumonitis, and compared different predictive models to determine the most effective models for specific complications. Subgroup analyses further help in identifying optimal model choices for each complication type. The second component uses LLMs to increase the accuracy and efficiency of literature quality assessment. We employ three models, ChatGPT-4, Llama2-13B, and Llama3-8B, to perform preliminary evaluations of literature quality. These assessments are further deepened with the use of tools such as PROBAST, aiming to refine the accuracy and reduce the time needed for literature assessments. The overall research framework is visually represented in Figure 1, which illustrates our comprehensive approach to integrating technological advancements with medical research to improve treatment planning and literature evaluation efficiency. This article was conducted in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines,²⁰ and was registered with INPLASY. The registration number for this study is INPLASY202480103.

Figure 1.

Overall research flowchart.

Data Sources and Search Strategy

We utilized PubMed’s Medical Subject Headings (MeSH) as a basis for our search keywords and extended our searches across databases, including Web of Science, PubMed, and Scopus. The search cutoff date was April 1, 2024. Our strategy, which is structured around the PICOS framework,²¹ aims to encompass a broad and relevant spectrum of research. The search terms included “lung cancer” AND “proton therapy” AND “machine learning” AND “complications” OR “radiation pneumonitis” OR “radiation esophagitis,” with the use of Boolean operators “AND” and “OR” to maximize the retrieval of pertinent literature (detailed keywords can be found in Supplemental Table S1). The collected data were managed and analyzed via EndNote X20 (Clarivate Analytics, X20, Philadelphia, USA) software.

Inclusion and Exclusion Criteria

In this study, we established strict inclusion and exclusion criteria to ensure the scientific validity and relevance of the literature. The inclusion criteria mandated that studies must detail machine learning models predicting complications of proton therapy for lung cancer, encompassing any patient undergoing such treatment. We considered all proton therapy techniques and comparative analyses with other treatment modalities, with the requirement that the studies report posttreatment complications and embrace various research designs. The exclusion criteria were meta-analyses, systematic reviews, studies lacking complete model construction or model outcome evaluation, studies with unclear methodologies, studies with incomplete data reporting, studies in languages other than English, and studies with inaccessible full texts.

Data Extraction

In this study, the data extraction phase was conducted after strictly adhering to the inclusion and exclusion criteria, with full-text evaluations carried out independently by two authors. Any disagreements between the two authors were resolved through consultation with a third author, ensuring fairness and consistency in the data extraction process.

The extracted data items included the first author’s name, which records the primary researcher responsible for the study to facilitate subsequent analysis and citation; the year of publication to identify the timeliness and relevance of the study; the country indicating the geographic location where the study was conducted; aiding in analyzing regional differences’ impact on the results; the type of complications, detailing various complications caused by lung cancer proton therapy such as radiation esophagitis and radiation pneumonitis; the radiation therapy method, noting the specific radiation therapy techniques used, including details of the proton therapy method; the sample size, extracting the number of patients involved in the study, a key factor in evaluating the statistical significance of the results; overfitting prevention methods, recording the techniques or methods used to prevent model overfitting, ensuring the generalizability of the model; feature selection methods, extracting the methods used to increase model predictive capability and reduce variables; and the type of models used, describing the types of machine learning models employed, such as support vector machines, decision trees, etc.

Through this data extraction process, the study ensures that the collected information is comprehensive and accurate. This process also helps standardize data extraction methods, increasing the transparency and reproducibility of the research.

Quality Assessment and Risk of Bias

We employed PROBAST to evaluate the quality of the studies and the risk of bias.²² The assessment covers four main domains—participants, predictors, outcomes, and analysis—which together comprise 20 signaling questions to facilitate a structured judgment of risk of bias (ROB).

Each signaling question is answered with “Yes”, “ Probably Yes”, “ Probably No”, “No” or “No Information.” If any signaling question is answered with “No” or “ Probably No,” there may be a risk of bias. If all four domains are rated as having low risk, the overall risk of bias can be considered low. If at least one domain has a high risk of bias, the overall risk of bias is assumed to be high. If at least one domain has an unclear risk of bias and the other domains have low risk, the overall risk of bias is deemed unclear.

Through this structured assessment process, we can systematically identify and quantify potential biases in the literature, further enhancing the transparency and scientific rigor of the research. This approach not only helps improve research quality but also ensures the reliability and applicability of the study results.

Statistical Analysis

This study utilized the following methods for statistical analysis, all of which were performed via Python (version 3.11.5). The effect size was measured via the area under the curve (AUC), which was used to assess the predictive ability of the models. We employed the inverse-variance-weighted average method to calculate the weights for the combined effect size,²³ assigning weights to each study’s effect size on the basis of its variance, with smaller variances receiving larger weights.

To evaluate the heterogeneity among studies, we applied degrees of freedom (DF), Cochran’s Q test, and the I² statistic. An I² value greater than 50% indicates significant heterogeneity. When significant heterogeneity was present, we used a random effects model because, regardless of the degree of statistical heterogeneity, there are differences in treatment methods and types of complications among studies.

To identify the sources of heterogeneity, we conducted subgroup analyses by evaluating different complications to investigate potential sources of heterogeneity and compare the performance of different subgroups. The overall effect size was expressed with a 95% confidence interval (CI),²⁴ and a P value of 0.05 was used as the significance level. We also plotted forest plots to display the results, and Publication bias was evaluated using a funnel plot

Large Language Model Design

In this study, we utilized transformer-based LLMs to increase the efficiency and accuracy of risk assessment.²⁵ These models include OpenAI’s ChatGPT-4 and META’s Llama3-8b and Llama2-13b, which are known for their superior capabilities in understanding and generating language, making them well suited for learning language-related features from large volumes of text and applying them to various language processing tasks.

The design phase of the LLMs is described as follows:

1. Model Selection and Application:

• The chosen models include ChatGPT-4, Llama3-8b, and Llama2-13b, which are all based on the transformer architecture and are capable of handling complex language understanding and generation tasks.

• These models were used to analyze and answer questions regarding literature quality and risk assessment, specifically addressing the 20 questions in the PROBAST risk assessment.

• To assess model performance, use AUC, accuracy, precision, recall, F1 score, and time efficiency as evaluation metrics, with accuracy being the primary evaluation metric.

2. Data Preprocessing and Model Fine-Tuning

• Initially, the 20 questions in the PROBAST risk assessment were organized (detailed in Supplemental Table S2), providing structured inputs for model evaluation.

• These questions were integrated into the prompts for the LLMs, and the models were fine-tuned according to specific assessment needs to enhance their adaptability and accuracy in evaluating professional literature.

3. Interactive Interface and Ensuring Independence:

• ChatGPT and AnythingLLM were used as user interaction interfaces, allowing researchers to interact with the models through natural language, pose questions, and receive answers.

• To ensure the independence and accuracy of the assessments, a new conversation window was restarted after each article evaluation to prevent the memory of the previous interaction from influencing subsequent assessments.

Through this design, we not only improved the efficiency of risk assessment but also ensured the reliability and accuracy of the evaluation results. Additionally, the high degree of automation in this system allows for the processing of large volumes of literature in a short time, significantly enhancing research efficiency.

Results

Literature Screening and Selection

A total of 1569 articles were identified across the search databases, including Web of Science, Scopus, and PubMed. After 791 duplicate articles were removed, 634 articles were further excluded on the basis of title and abstract screening. For the full-text evaluation, 81 articles were initially included. At this stage, articles that did not align with the research topic or lacked model prediction outcomes for complications were excluded, resulting in 11 articles being retained. Additionally, 2 new studies that met the inclusion criteria were identified from the references of these 11 articles. Ultimately, 13 articles met the inclusion criteria for the systematic review. The flowchart of the literature screening process is illustrated in Figure 2.

Figure 2.

Literature screening flowchart.

Characteristics of the included studies

Table 1 summarizes the characteristics of the 13 retrospective studies included in this research. These studies originate from different countries and cover various complications following proton therapy for lung cancer, including radiation esophagitis, radiation pneumonitis, radiation dermatitis, pericardial effusion, and lymphopenia. Detailed information for each study included the first author’s name; year of publication; radiation therapy method implemented, such as passive scoring proton therapy (PSPT) and intensity-modulated radiation therapy (IMRT); and sample size. These studies exhibit methodological diversity, incorporating not only pure proton therapy but also cases combining proton and photon therapies. Therefore, six studies were excluded because they did not meet the purity criteria for proton therapy. Ultimately, only seven studies met the criteria for pure proton therapy and were included in the final meta-analysis.

Table 1.

Studies-Included Characteristics.

No	Main Author (Year)	Country	Complications	Treatment Type	Sample Size
1	Mei Chen (2022)²⁶	China	Radiation esophagitis	PSPT	328
2	Zeming Wang (2020)²⁷	United States	Radiation esophagitis	PSPT	328
3	Joshua S. Niedzielski (2017)²⁸	United States	Radiation esophagitis	PSPT, IMRT	134
4	Serena Monti (2022)²⁹	Italy	Radiation esophagitis	PSPT, IMRT	173
5	Shannon O’Reilly (2020)³⁰	United States	Radiation pneumonitis	Proton	74
6	Albrecht Weiß (2024)³¹	Germany	Radiation pneumonitis	PSPT, PBT	96
7	Wendy B. Harris (2020) ³²	United States	Radiation pneumonitis	PBS, DS, 3D-CRT, IMRT, VMAT
8	Gilmer Valdes (2023)³³	United States	Radiation pneumonitis	PBS, PBT	965
9	José Marcio Luna (2020)³⁴	United States	Radiation pneumonitis	PBT, IMRT	202
10	Richard Castillo (2014)³⁵	United States	Radiation pneumonitis	3D-CRT, IMRT, proton	100
11	Giuseppe Palma (2020)³⁶	Italy	Radiation dermatitis	PSPT, IMRT	166
12	Joshua S. Niedzielski (2020)³⁷	United States	Pericardial effusion	PSPT, IMRT	141
13	Yaqi Li (2023)³⁸	China	Lymphopenia	IMRT, PCIRT	343

Abbreviation: PSPT, Passive-Scattering Proton Therapy; IMRT, Intensity-Modulated Radiation Therapy; PBT, Proton Beam Therapy; PBS, Pencil Beam Scanning; DS, Double Scattering; 3D-CRT, Three-Dimensional Conformal Radiation Therapy; VMAT, Volumetric Arc Therapy; PCIRT, Proton and Carbon Ion Radiation Therapy.

Table 2 details the methodological characteristics of the included studies, covering overfitting prevention methods, feature selection, and model types. Among the 13 studies, 4 used 5-fold cross-validation, 1 used leave-one-out cross-validation, and the remaining 8 did not report any validation methods. The models employed in these studies include logistic regression (LR), support vector machines (SVMs), random forests (RFs), K-nearest neighbors (K-NNs), linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), and gradient boosting (GB). Additionally, some studies have utilized least absolute shrinkage and selection operator (LASSO) and other statistical methods for feature selection and model optimization, demonstrating diversity in the choice and application of statistical methods.

Table 2.

Studies-Included Methodological Characteristics.

NO	Main Author (Year)	Over Fitting Method	Feature Selection	Model Type
1	Mei Chen (2022)	5-fold cross-validation	Stepwise feature selection & LASSO	sLKB, gLKB, MLR, SVM
2	Zeming Wang (2020)	NR	Univariable logistic regression	LKB
3	Joshua S. Niedzielski (2017)	NR	Univariate logistic regression	LR
4	Serena Monti (2022)	Leave-one-out cross-validation	Multivariable stepwise logistic regression	NTCP
5	Shannon O’Reilly (2020)	NR	Univariate logistic regression	LR, RF
6	Albrecht Weiß (2024)	NR	Univariate and Multivariate analyses	LR
7	Wendy B. Harris (2020)	NR	Univariate and Multivariate analyses	LR
8	Gilmer Valdes (2023)	10-fold cross-validation	Univariate and Multivariate analyses & LASSO	GB
9	José Marcio Luna (2020)	5-fold cross-validation	Backward sequential feature selection	LR, K-NN, SVM, RF, EN, LDA, QDA
10	Richard Castillo (2014)	NR	Stepwise backward model selection	LR
11	Giuseppe Palma (2020)	NR	Univariate and Multivariate analyses	LKB LR
12	Joshua S. Niedzielski (2020)	5-fold cross-validation	Elastic net logistic regression	LR
13	Yaqi Li (2023)	NR	Logistic regression & LASSO	LR

Abbreviation: sLKB, standard Lyman–Kutcher–Burman; gLKB, generalized Lyman–Kutcher–Burman; MLR, Multivariable Logistic Regression; LR, Logistic Regression, SVM, Support Vector Machine; LASSO, Least Absolute Shrinkage and Selection Operator; NTCP, Normal Tissue Complication Probability; RF, Random Forest; K-NN, K-Nearest Neighbors; EN, Elastic Net; LDA, Linear Discriminant Analysis; QDA, Quadratic Discriminant Analysis; NR, No Report; GB, Gradient Boost.

Performance Analysis of Models Predicting Complications after Proton Therapy for Lung Cancer

According to the results depicted in Figure 3, both fixed effects and random effects models were applied to estimate the effect sizes in the analysis of the prediction of complications after proton therapy for patients with lung cancer. The fixed-effects model (Figure 3(a)) showed an effect size of 0.77 with a 95% confidence interval (CI) ranging from 0.75--0.78, suggesting a relatively stable predictive effect when considering the consistency across proton therapy methods and types of complications.

Figure 3.

Forest plot of the model used to predict complications after proton therapy for lung cancer (a) Fixed-effect model (b) Random-effect model. Abbreviation: sLKB, standard Lyman–Kutcher–Burman; gLKB, generalized Lyman–Kutcher–Burman; LR, Logistic Regression; SVM, support vector machine; LASSO, least absolute shrinkage and selection operator; RF, random forest; K-NN, K-Nearest Neighbors; GB, gradient boost; AUC, area under the ROC Curve; ROC, receiver operating characteristic.

Nevertheless, despite the similarity in the types of treatments and complications studied, significant differences were observed in the study outcomes. The random-effects model (Figure 3(b)) demonstrated an effect size of 0.78 with considerable heterogeneity (I² = 72.88%, P < 0.001). This indicates substantial variability among the studies, likely due to differences in study design, patient demographics, treatment methodologies, or methods of complication assessment (The funnel plot is included in Supplemental Figure S1).

Subgroup analyses were conducted to delve deeper into the sources of this heterogeneity, identifying specific conditions or variables that influence the predictive performance of the models. These analyses not only assess the practical application performance of the predictive models but also underscore the significant discrepancies that exist even under similar treatment and complication scenarios, highlighting potential areas for future research.

Subgroup Analysis of Radiation Esophagitis and Radiation Pneumonitis Patients

The subgroup analysis detailed in Figure 4(a)–(c) illustrates the stability and variability of the predictive models for radiation esophagitis and radiation pneumonitis. In the analysis of radiation esophagitis (Figure 4(a)), the model predicted an effect size of 0.79, with a confidence interval ranging from 0.78--0.81 and a heterogeneity index of I² = 0%, P = 0.45. This finding indicates a high level of consistency among the studies in predicting radiation esophagitis, with no significant heterogeneity observed, suggesting that the models used can consistently reflect the risk of this complication after treatment.

Figure 4.

Forest plot of subgroup analysis (a) Analysis of radiation esophagitis (b) Analysis of radiation pneumonitis (c) Analysis of radiation pneumonitis after excluding heterogeneous studies.

In contrast, the initial analysis for radiation pneumonitis (Figure 4(b)) presented more challenges, with an effect size of 0.74, a confidence interval from 0.70 to 0.80, and significant heterogeneity (I² = 68.65%, P = 0.0125), indicating considerable differences among the studies. After the exclusion of studies with high heterogeneity, the effect size for the prediction of radiation pneumonitis was adjusted to 0.77, with a confidence interval ranging from 0.73--0.80, and the heterogeneity was significantly reduced (I² = 0%, P = 0.56). This adjustment demonstrated that addressing sources of heterogeneity significantly enhances the consistency and reliability of model predictions for radiation pneumonitis.

These analyses not only highlight the effectiveness of proton therapy predictive models for different complications but also emphasize the importance of further exploration of sources of heterogeneity to optimize predictive accuracy.

Comparison of LLMs and manual evaluations in quality and risk assessment

Figure 5(a)–(d) and Table 3 elucidate the disparities between LLMs such as ChatGPT-4, Llama2-13b, and Llama3-8b and traditional manual evaluation methods in terms of quality and risk of bias assessment. The data underscore notable differences in risk grading outcomes (high risk, low risk, unknown risk) across the various evaluation techniques. Notably, ChatGPT-4 achieved an impressive 90% accuracy rate in risk assessments, significantly outstripping the performance of Llama2-13b and Llama3-8b, which registered accuracies of 62% and 44%, respectively. This difference is also mirrored in the AUC, precision, recall, and F1 score metrics: ChatGPT-4 posted AUC is 0.88 and precision, recall, and F1 scores all at 0.86, markedly superior to those of Llama2-13b and Llama3-8b. Furthermore, while manual evaluations typically require 3423 seconds, LLM assessments are completed in approximately 1.06 seconds, as detailed in Supplemental Table S3. This substantial disparity highlights LLMs’ tremendous advantage in time efficiency, particularly demonstrated by the Words Per Minute (WPM) metric, which highlights LLMs' ability to swiftly process extensive data volumes and execute rapid assessments.

Figure 5.

Risk of bias in predictive models included in the study (a) manual assessment (b) ChatGPT-4 assessment (c) Llama2-13b assessment (d) Llama3-8b assessment. Note: High risk “-“; Low risk “ + “; Unclear or ambiguous ‘?”. Abbreviation: ChatGPT, chat generative pretrained transformer; Llama, large language model Meta AI.

Table 3.

Performance and efficiency comparison of LLMs versus manual bias assessment.

LLM risk of bias assessment
Models	ChatGPT4	Llama2-13b	Llama3-8b
AUC	0.88	0.74	0.69
Accuracy	0.90	0.62	0.44
Precision	0.86	0.65	0.69
Recall	0.86	0.73	0.64
F1 score	0.86	0.60	0.55
Average time spent (seconds)
	Manual time spent	LLM time spent	Time spent Comparison
Risk of bias assessment	3423	1.06 s	3229:1

Abbreviation: b, billion; LLM, Large Langue Model; ChatGPT, Chat Generative Pretrained Transformer; Llama, Large Language Model Meta AI; AUC, Area Under the Roc Curve.

Independent evaluations via the PROBAST tool revealed that 9 studies were classified as high risk, 3 as low risk, and 1 as unknown risk. The results presented in Figure 5 indicate that LLMs more frequently marked studies as high risk and had a higher frequency of marking studies as unknown risk. This may reflect differences in the strictness of the evaluation standards and the sensitivity of the models.

These findings underscore the feasibility of using LLMs for risk assessment, demonstrating that they can provide high-accuracy evaluations and significantly improve efficiency. LLMs have been shown to be effective tools in the research quality assessment process, offering considerable advantages in both accuracy and speed.

Discussion

Results of the Meta-Analysis

Our study investigated the efficacy of predictive models for complications following proton therapy for lung cancer through meta-analysis combined with an evaluation of LLMs. The meta-analysis revealed that these models exhibit moderate to strong predictive capabilities, as evidenced by an overall effect size of 0.78. Despite this, a significant level of heterogeneity (I² = 72.88%, P < 0.001) was noted, indicating considerable variability in the performance of these models across different studies. Subgroup analysis further suggested that no single model consistently outperformed others across all scenarios, underscoring the complexity of predicting complications in proton therapy.

Additionally, the use of LLMs, particularly ChatGPT-4, has significantly shortened the time required for risk assessment evaluations, drastically enhancing the efficiency of the literature review process. ChatGPT-4 demonstrated superior performance across all the evaluation metrics, indicating notable advancements in the application of artificial intelligence in medical research. This integration of LLMs not only streamlines the evaluation process but also highlights the potential for these technologies to improve the accuracy and speed of clinical research assessments.

When evaluating the complex dynamics of predictive models, particularly the efficacy of LLMs compared with manual evaluations, it is essential to consider a comprehensive set of evaluation metrics. Beyond just accuracy and time efficiency, our analysis now includes precision, recall, the F1 score, and the AUC. These metrics are critical for a thorough assessment of model performance in clinical environments. Precision is vital for minimizing the clinical risk associated with false positives, whereas recall is crucial for ensuring that all relevant cases are identified, preventing missed diagnoses. The F1 score, which balances precision and recall, is essential in clinical settings where the consequences of both false positives and false negatives are significant. Moreover, the AUC provides insights into the model’s ability to differentiate between outcomes across various thresholds, which is crucial for customizing clinical interventions on the basis of risk stratification. This expanded evaluation framework not only deepens our understanding of the models’ capabilities but also identifies areas for future improvement, enhancing their utility in clinical practice. By detailing these metrics, we aim to clarify how effectively these models can support clinical decision-making, guiding further enhancements in their development and application.

For the detailed analysis, we conducted subgroup analyses for radiation esophagitis and radiation pneumonitis. The results showed that the logistic regression model used in Joshua S. Niedzielski’s study²⁸ performed better in predicting radiation esophagitis than other models that relied on traditional clinical and dosimetric factors. The inclusion of new biomarkers, such as CT imaging, significantly improved the predictive accuracy. However, for radiation pneumonitis, Shannon O'Reilly’s study³⁰ revealed that even with the use of biomarker imaging, the results did not significantly differ from those of other studies. This could be due to insufficient model training and sample size. In Gilmer Valdes’s study,³³ the use of data imputation methods to handle missing values, although helpful in addressing data incompleteness, may have introduced biases that affected the accuracy of the model results.

From the PROBAST risk assessment, we found that most studies indicated high or unclear risk results, which could significantly affect the reliability of our study outcomes. The prevalence of high-risk biases suggests that the results might be unreliable, particularly since the majority of studies in our analysis did not employ methods to validate overfitting or were deficient in predictive factors. This scenario potentially leads to overly optimistic outcomes, aligning with other research findings.^39-41 In fact, most studies reported only partial indicators of their models, which could contribute to the overestimation of performance metrics and excessive reliance on model predictions. The lack of comprehensive reporting underscores the need for more rigorous evaluation methods to ensure accuracy and reliability.

Moreover, to enhance the generalizability of models, it is crucial to validate them across different contexts and populations.⁴² The ongoing development of PROBAST-AI, a more rigorous tool specifically designed for evaluating artificial intelligence and machine learning models,⁴³ represents a significant step toward addressing these issues, although it is still under development and not yet available.

Results of Large Language Model Analysis

In this study, we employed LLMs such as ChatGPT-4 to assess the quality of the literature, achieving an accuracy rate of up to 90%, which is significantly higher than the 44% to 62% accuracy rates of smaller models such as Llama2 and Llama3. This significant difference in performance is evident not only in accuracy but also in AUC, precision, recall, and F1 score: ChatGPT-4 achieves 0.86 in all three metrics and an AUC of 0.88, far outperforming Llama2-13b and Llama3-8b, which display lower metrics in these areas.^44,45 These models demonstrated unique advantages in handling large datasets and extracting complex features,⁴⁶ particularly in terms of time efficiency, being 3229 times faster than manual methods in the literature.

ChatGPT-4’s superior performance is attributed to its larger model parameters, which enable it to more effectively understand and integrate complex contextual information within professional medical literature.⁴⁷ Conversely, smaller parameter models such as Llama2-13b and Llama3-8b, even after adjustments to their PROMPTs to enhance prediction accuracy, still have limitations. Their lower capacity for learning and fewer parameters render them less suitable for complex professional assessment tasks. These findings highlight the critical importance of selecting and applying appropriate LLMs for medical literature evaluation, especially in scenarios requiring high accuracy and efficiency.

Impact of Model Heterogeneity on Predicting Complications after Proton Therapy for Lung Cancer

Through detailed meta-analysis, our study revealed significant heterogeneity in the performance of different predictive models for complications following proton therapy for lung cancer. This finding highlights substantial differences in the effectiveness and applicability of models across various clinical and research settings. This heterogeneity primarily stems from variations in study design, sample size, treatment methods, and patient populations. For example, traditional statistical models such as logistic regression tend to perform better with smaller datasets or when there is a strong correlation between variables, whereas machine learning models, particularly deep learning models, may demonstrate greater advantages in handling large datasets and identifying complex, nonlinear relationships.⁴⁸

By employing subgroup or sensitivity analysis, researchers can better understand and interpret heterogeneity, thereby enhancing the reliability and applicability of model predictions.⁴⁹ These findings improve the assessment of therapeutic outcomes and the management of complications for patients undergoing proton therapy for lung cancer. Analyzing these heterogeneity factors in depth not only reveals the limitations of various models but also provides practical guidelines for selecting more effective models in specific clinical contexts. For clinicians, understanding the performance of different models in predicting complications for particular patient groups or conditions can guide more precise treatment decisions and the design of personalized treatment plans.

Bridging the Gap: From Theoretical Models to Clinical Implementation

While our study highlights the theoretical advantages of using LLMs for predicting complications from proton therapy, we recognize the importance of discussing the practical implications of these findings for clinical practice. The integration of LLMs, such as ChatGPT-4, has shown promising results in reducing the time required for literature assessment and increasing the accuracy of predictive models. However, the transition from theoretical models to clinical application involves several crucial steps.

First, the clinical implementation of these models necessitates rigorous validation and calibration processes to ensure that they are tailored to the specific needs and contexts of clinical environments. This includes extensive testing against existing standards of care to confirm that model predictions align with real-world outcomes.

Second, to facilitate the adoption of these models in clinical settings, it is essential to develop user-friendly interfaces that allow clinicians to interact with the models efficiently. These interfaces must be designed to integrate seamlessly into existing medical workflows, ensuring that they augment rather than disrupt clinical practice.

Moreover, there is a need for ongoing training and support for medical professionals to use these models effectively. This training should focus on understanding model outputs and integrating these insights into clinical decision-making processes.

Finally, ethical considerations, particularly concerning patient data privacy and the interpretability of model decisions, must be addressed to ensure that the use of LLMs in clinical settings upholds the highest standards of patient care and transparency.

By focusing on these aspects, we can bridge the gap between the theoretical improvements offered by LLMs and their practical implementation, ultimately enhancing patient outcomes and the efficiency of medical practices.

Limitations and Challenges for Future Research

This study revealed several limitations and challenges in evaluating the performance of predictive models for complications following proton therapy for lung cancer. First, differences in sample size, study design, or predictive models may affect the generalizability of the results, leading to high heterogeneity in our meta-analysis. Second, the subgroup analysis did not reveal significant differences in model superiority, indicating that no single model consistently performs best in all the scenarios. Additionally, our intervention measures were confined to proton therapy. However, many studies do not distinguish between patients treated with proton therapy and those treated with photon therapy, resulting in research data that may not fully meet the needs of this study. Finally, the issue of bias in data imputation remains a challenge, as imputed values may not fully reflect the real situation,⁵⁰ thereby affecting the accuracy of model predictions. Furthermore, the application effectiveness of smaller parameter models in specific domains is limited by their scale and complexity.

These limitations and challenges highlight the need for further improvements and refinements in research methodologies in future work. This includes selecting more suitable models, improving data collection and processing methods, and enhancing the generalizability of models. Future research should also consider a more granular approach to distinguish between different types of treatments and carefully address potential biases in data handling to ensure the reliability and accuracy of predictive models.

LLMs excel at generating coherent and contextually relevant responses, but their performance heavily depends on the quality, diversity, and representativeness of the training data. Accuracy remains a major concern, as these models can produce outdated or entirely fictitious responses, a phenomenon known as “hallucination”.^51-53 The models may inadvertently learn and perpetuate biases and inaccuracies present in the data, leading to biased outputs and diagnostic discrepancies.⁵⁴ Additionally, these language models lack a true understanding of fundamental medical concepts, as their responses are based solely on statistical patterns, making it difficult for them to address more complex evaluation questions.⁵⁵ Relying solely on responses from LLMs in scenarios requiring deep background understanding can result in inaccurate or incomplete information.

Future research should focus on several areas of improvement. First, the sample size of the literature should be increased, and the study design should be optimized to enhance the representativeness and reliability of the results. Second, the performance of different predictive models should be further explored and compared, and integrated models that combine their strengths should be developed. Addressing the limitations of LLMs involves targeted improvements, including enhancing the quality and diversity of training data and fine-tuning the models to improve their understanding and response accuracy to specialized medical questions, thus reducing hallucinations and increasing reliability. Finally, interdisciplinary collaboration and the implementation of clinical trials should be promoted to ensure that research findings can be effectively translated into clinical applications. Through these concerted efforts, future research can more effectively address the prediction of complications following proton therapy for patients with lung cancer, thereby having a profound impact on treatment decisions and patient management.

Conclusion

This study highlights the transformative potential of integrating LLMs, such as ChatGPT-4, with meta-analytic techniques for predicting posttreatment complications in lung cancer patients receiving proton therapy. ChatGPT-4 demonstrated a high degree of accuracy, achieving an agreement rate of 90% with manual evaluations, and significantly accelerated the literature review process, which enhances the overall efficiency of the research.

The benefits to clinical practice are clear: these advanced predictive models can potentially refine treatment planning, contribute to personalized patient care, and reduce the incidence of adverse effects. However, despite these advances, this study identified key areas for further research. Given the high heterogeneity observed across studies, future efforts should aim to standardize research methodologies to reduce variability and improve outcome predictability.

Moreover, refining the ability of LLMs to accurately understand specialized medical terminologies and assess risk in an unbiased manner remains imperative. Future research should continue to focus on enhancing the training processes of these models, particularly in terms of their ability to analyze complex medical literature effectively. This approach will not only optimize the practical application of LLMs in clinical settings but also pave the way for their broader adoption in precision medicine and clinical decision support systems.

In summary, while this study confirms the feasibility and benefits of using LLMs in the medical field, it also lays the groundwork for substantial future advancements. The ongoing refinement and application of these models are essential for realizing their full potential in enhancing medical research and patient care.

Supplemental Material

Supplemental Material - Improving Prediction of Complications Post-Proton Therapy in Lung Cancer Using Large Language Models and Meta-Analysis

Supplemental Material for Improving Prediction of Complications Post-Proton Therapy in Lung Cancer Using Large Language Models and Meta-Analysis by Pei-Ju Chao, Chu-Ho Chang, Jyun-Jie Wu, Yen-Hsien Liu, Junping Shiau, Hsin-Hung Shih, Guang-Zhi Lin, Shen-Hao Lee, and Tsair-Fwu Lee in Cancer Control.

Footnotes

Acknowledgments

This study was partially supported by a grant from the National Science and Technology Council (NSTC) of the Executive Yuan, Republic of China (Grant No. 113-2221-E-992-011-MY2). A portion of these results has been submitted in abstract form to the ICMLSC 2025 Conference.

Author Contributions

Pei-Ju Chao: Guided the analysis and interpretation of the data; participated in the data collection; and contributed to the preparation and revision of the manuscript. Chu-Ho Chang: Data organization and combing; Analyzed and interpreted the data; contributed to the writing of the paper. Jyun-Jie Wu assisted in data interpretation and wrote the draft. Yen-Hsien Liu: Data organization and combing; Analyzed and interpreted the data; contributed to the writing of the paper. Junping Shiau: Guided the analysis and interpretation of the data; Hsin-Hung Shih: Guided the analysis and interpretation of the data; Guang-Zhi Lin: Played a role in data analysis and interpretation. Shen-Hao Lee: Participated in data collection and analysis and provided technical support for data management. Tsair-Fwu Lee: Conceived and designed the analysis; contributed data or analysis tools; performed the analysis; and wrote the draft and final manuscript.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Science and Technology Council (NSTC) of the Executive Yuan of the Republic of China; 113-2221-E-992-011-MY2.

Ethical Statement

ORCID iD

Tsair-Fwu Lee

Supplemental Material

Supplemental material for this article is available online.

References

Chang

Zhang

Wang

, et al. Significant reduction of normal tissue dose by proton radiotherapy compared with three-dimensional conformal or intensity-modulated radiation therapy in Stage I or Stage III non–small-cell lung cancer. Int J Radiat Oncol Biol Phys. 2006;65:1087-1096.

Verma

Simone

Mishra

. Quality of life and patient-reported outcomes following proton radiation therapy: a systematic review. J Natl Cancer Inst. 2018;110:341-353.

Liu

Dai

Zeng

Chen

. The trajectory of oral mucositis in head and neck cancer patients undergoing radiotherapy and its influencing factors. Ear Nose Throat J 2024;0(0):1-13. doi: 10.1177/01455613241228211.

Liang

, et al. Screening for primary aldosteronism on and off interfering medications. Endocrine. 2024;83:178-187.

Duan

Xie

Zhang

, et al. A novel robotic bronchoscope system for navigation and biopsy of pulmonary lesions. Cyborg Bionic Syst. 2023;4:0013.

Zhong

Wang

Liang

Wei

She

. Combined methylation of SHOX2 and RASSF1A genes in diagnosing malignant pleural effusion. Discov Med. 2023;35:845-852.

Shen

, et al. Accuracy of gallium-68 pentixafor positron emission tomography–computed tomography for subtyping diagnosis of primary aldosteronism. JAMA Netw Open. 2023;6:e2255609.

Ioannidis

. Interpretation of tests of heterogeneity and bias in meta‐analysis. J Eval Clin Pract. 2008;14:951-957.

Higgins

Green

. Cochrane handbook for systematic reviews of interventions version 5.0. 1. The Cochrane Collaboration. UK: cochrane-handbook. org; 2008.

10.

Langenhuijsen

Janse

Venema

, et al. Systematic meta-review of prediction studies demonstrates stable trends in bias and low PROBAST inter-rater agreement. J Clin Epidemiol 2023;159:159-173.

11.

Venema

Wessler

Paulus

, et al. Large-scale validation of the prediction model risk of bias assessment Tool (PROBAST) using a short form: high risk of bias models show poorer discrimination. J Clin Epidemiol. 2021;138:32-39.

12.

Chowdhary

. Natural language processing. Fundamentals of artificial intelligence 2020;2020:603-649. doi: 10.1007/978-81-322-3972-7_19.

13.

Tran

T-O

NQK

. Sa-ttca: an svm-based approach for tumor t-cell antigen classification using features extracted from biological sequencing and natural language processing. Comput Biol Med. 2024;174:108408.

14.

NQK

. Leveraging transformers-based language models in proteome bioinformatics. Proteomics. 2023;23:2300011.

15.

Achiam

Adler

Agarwal

, et al. Gpt-4 technical report. arXiv 2023;2303(08774). doi: 10.48550/arXiv.2303.08774.

16.

Touvron

Lavril

Izacard

, et al. Llama: open and efficient foundation language models. arXiv preprint arXiv 2023;2302(13971). doi: 10.48550/arXiv.2302.13971.

17.

Dave

Athaluri

Singh

. ChatGPT in medicine: an overview of its applications, advantages, limitations, future prospects, and ethical considerations. Front Artif Intell. 2023;6:1169595.

18.

Pons

Braun

Hunink

MGM

Kors

. Natural language processing in radiology: a systematic review. Radiology. 2016;279:329-343.

19.

Lee

T-F

Hsieh

Y-W

Yang

P-Y

, et al. Using meta-analysis and CNN-NLP to review and classify the medical literature for normal tissue complication probability in head and neck cancer. Radiat Oncol. 2024;19:5.

20.

Page

McKenzie

Bossuyt

, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ. 2021;372:n71.

21.

Schardt

Adams

Owens

Keitz

Fontelo

. Utilization of the PICO framework to improve searching PubMed for clinical questions. BMC Med Inf Decis Making. 2007;7:16.

22.

Wolff

Moons

Riley

, et al. PROBAST: a tool to assess the risk of bias and applicability of prediction model studies. Ann Intern Med. 2019;170:51-58.

23.

Lee

Cook

Lee

Han

. Comparison of two meta-analysis methods: inverse-variance-weighted average and weighted sum of Z-scores. Genomics Inform. 2016;14:173-180.

24.

Hazra

. Using the confidence interval confidently. J Thorac Dis. 2017;9:4125-4130.

25.

Han

Xiao

, et al. Transformer in transformer. Adv Neural Inf Process Syst. 2021;34:15908-15919.

26.

Chen

Wang

Jiang

, et al. Predictive performance of different NTCP techniques for radiation-induced esophagitis in NSCLC patients receiving proton radiotherapy. Sci Rep. 2022;12:9178.

27.

Wang

Chen

Sun

, et al. Lyman–Kutcher–Burman normal tissue complication probability modeling for radiation-induced esophagitis in non-small cell lung cancer patients receiving proton radiotherapy. Radiother Oncol. 2020;146:200-204.

28.

Niedzielski

Yang

Mohan

, et al. Differences in normal tissue response in the esophagus between proton and photon radiation therapy for non-small cell lung cancer using in vivo imaging biomarkers. Int J Radiat Oncol Biol Phys. 2017;99:1013-1020.

29.

Monti

Mohan

Liao

Palma

Cella

. Radiation-induced esophagitis in non-small-cell lung cancer patients: voxel-based analysis and NTCP modeling. Cancers. 2022;14:1833.

30.

O’Reilly

Jain

Huang

, et al. Dose to highly functional ventilation zones improves prediction of radiation pneumonitis for proton and photon lung cancer radiation therapy. Int J Radiat Oncol Biol Phys. 2020;107:79-87.

31.

Weiß

Löck

Liao

Hoffmann

Troost

EGC

. Prediction of radiation pneumonitis using the effective α/β of lungs and heart in NSCLC patients treated with proton beam therapy. Radiother Oncol. 2024;190:110013.

32.

Harris

Zou

Cheng

, et al. Higher dose volumes may be better for evaluating radiation pneumonitis in lung proton therapy patients compared with traditional photon-based dose constraints. Adv Radiat Oncol. 2020;5:943-950.

33.

Valdes

Scholey

Nano

, et al. Predicting the effect of proton beam therapy technology on pulmonary toxicities for patients with locally advanced lung cancer enrolled in the proton collaborative group prospective clinical trial. Int J Radiat Oncol Biol Phys. 2024;119:66-77.

34.

Luna

Chao

H-H

Shinohara

, et al. Machine learning highlights the deficiency of conventional dosimetric constraints for prevention of high-grade radiation esophagitis in non-small cell lung cancer treated with chemoradiation. Clin Transl Radiat Oncol. 2020;22:69-75.

35.

Castillo

Pham

Ansari

, et al. Pre-radiotherapy FDG PET predicts radiation pneumonitis in lung cancer. Radiat Oncol. 2014;9:74.

36.

Palma

Monti

Conson

, et al. NTCP models for severe radiation induced dermatitis after IMRT or proton therapy for thoracic cancer patients. Front Oncol. 2020;10:344.

37.

Niedzielski

Wei

, et al. Development and application of an elastic net logistic regression model to investigate the impact of cardiac substructure dose on radiation-induced pericardial effusion in patients with NSCLC. Acta Oncol. 2020;59:1193-1200.

38.

Fan

, et al. Proton and carbon ion radiation therapy decreased severe lymphopenia by reducing thoracic vertebra and aortic doses in non-small cell lung cancer versus intensity modulated radiation therapy. Int J Radiat Oncol Biol Phys. 2023;116:579-589.

39.

Wynants

Van Calster

Collins

, et al. Prediction models for diagnosis and prognosis of covid-19: systematic review and critical appraisal. Bmj. 2020;369:m1328.

40.

Bellou

Belbasis

Konstantinidis

Tzoulaki

Evangelou

. Prognostic models for outcome prediction in patients with chronic obstructive pulmonary disease: systematic review and critical appraisal. Bmj. 2019;367:l5358.

41.

Hueting

van Maaren

Hendriks

Koffijberg

Siesling

. The majority of 922 prediction models supporting breast cancer decision-making are at high risk of bias. J Clin Epidemiol. 2022;152:238-247.

42.

Polit

Beck

. Generalization in quantitative and qualitative research: myths and strategies. Int J Nurs Stud. 2010;47:1451-1458.

43.

Collins

Dhiman

Andaur Navarro

, et al. Protocol for development of a reporting guideline (TRIPOD-AI) and risk of bias tool (PROBAST-AI) for diagnostic and prognostic prediction model studies based on artificial intelligence. BMJ Open. 2021;11:e048008.

44.

Zhou

Yang

Loy

Liu

. Learning to prompt for vision-language models. Int J Comput Vis. 2022;130:2337-2348.

45.

Jin

Cheng

Shen

, et al. A good prompt is worth millions of parameters: low-resource prompt-based learning for vision-language models. arXiv preprint arXiv 2021;2110(08484). doi: 10.48550/arXiv.2110.08484.

46.

Raiaan

MAK

Mukta

MSH

Fatema

, et al. A review on large Language Models: architectures, applications, taxonomies, open issues and challenges. IEEE Access 2024;12:26839-26874.

47.

Michelet

Breitinger

. ChatGPT, Llama, can you write my report? An experiment on assisted digital forensics reports written using (local) large language models. Forensic Sci Int: Digit Invest. 2024;48:301683.

48.

Schulz

M.-A.

Different scaling of linear models and deep learning in UKBiobank brain images versus machine-learning datasets. Nature communications 2020;11(1):4238.

49.

Hahn

Williamson

Hutton

Garner

Flynn

. Assessing the potential for bias in meta-analysis due to selective reporting of subgroup analyses within studies. Stat Med. 2000;19:3325-3336.

50.

Donders

ART

Van Der Heijden

Stijnen

Moons

KGM

. Review: a gentle introduction to imputation of missing values. J Clin Epidemiol. 2006;59:1087-1091.

51.

, et al. Towards mitigating LLM hallucination via self reflection. Findings of the Association for Computational Linguistics: EMNLP 2023. Singapore: Association for Computational Linguistics; 2023:1827-1843.

52.

Yao

J-Y

Ning

K-P

Liu

Z-H

et al. Llm lies: Hallucinations are not bugs, but features as adversarial examples. arXiv preprint arXiv:231001469 2023. doi: 10.48550/arXiv.2310.01469.

53.

Ouyang

Jiang

, et al. Training language models to follow instructions with human feedback. Adv Neural Inf Process Syst. 2022;35:27730-27744.

54.

Liévin

Hother

Motzfeldt

Winther

. Can large language models reason about medical questions? Patterns. 2024;5:100943.

55.

Chen

Xiang

, et al. Evaluating large language models in medical applications: a survey. arXiv preprint arXiv 2024;2405(07478). doi: 10.48550/arXiv.2405.07468.

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.

0.00 MB

0.85 MB