Abstract

Objective

With the increasing global burden of cancer, there is a growing need for innovative strategies to improve oncology care. Health-related quality of life (HRQoL) is an outcome measure for assessing the overall wellbeing of patients with cancer. We used machine learning to predict HRQoL and to identify key factors that can inform patient-centered cancer care.

Methods

We conducted a cross-sectional study enrolling patients diagnosed with lung, breast, or colorectal cancer across two provinces in China. We collected data on demographics, clinical characteristics, and patient-centered features. HRQoL was assessed using the EQ-5D-5L instrument, which is widely accepted in cancer care. We trained and evaluated seven machine learning models. SHapley Additive exPlanations (SHAP) analysis was employed to assess feature importance.

Results

Data from 924 patients with cancer were available. The random forest and extreme gradient boosting models had superior predictive performance. Positive SHAP values were primarily observed in patients with early-stage cancer and those enrolled in Urban Employees Basic Medical Insurance. Negative SHAP values were mainly associated with longer duration of chronic comorbidities, colorectal cancer, and ongoing chemotherapy. Age and time since cancer diagnosis exhibited bidirectional impacts.

Conclusions

Our study demonstrates the potential of machine learning models to predict HRQoL in patients with cancer. We identified key predictors of patient HRQoL, such as duration of chronic comorbidities, early-stage cancer diagnosis, age, and health insurance coverage. Our findings could facilitate early identification of patients with lower HRQoL and promote the provision of patient-centered oncology care.

Keywords

Machine learning, quality of life, neoplasms, patient-reported outcome measures

Introduction

Health-related quality of life (HRQoL) is a multidimensional concept that encompasses physical, emotional, social, and functional wellbeing, and it is increasingly recognized as a critical outcome in oncology care. Cancer remains a leading cause of morbidity and mortality worldwide, with a significant impact on the HRQoL of patients.¹ In China, the situation is particularly concerning, with a rising incidence of cancer and an increasing burden of chronic comorbidities. This dual challenge underscores the urgent need for effective strategies to monitor and improve the wellbeing of patients with cancer.^2,3

Beyond its role as an outcome in oncology research and routine care, HRQoL serves as a key predictor of treatment adherence, symptom burden, and overall survival.⁴ Furthermore, it is considered an essential measure of overall wellbeing that directly informs healthcare planning and health service delivery.⁵ Previous studies using multivariate analysis showed that the HRQoL of patients with cancer might be influenced by several factors, including the type and stage of cancer, treatment modalities, and demographic characteristics.^6–8 However, these traditional methods have limitations. They typically assume linear relationships between variables and struggle with the complex, nonlinear nature of the factors influencing HRQoL.⁹ Furthermore, they are better at identifying broad associations across a population than at making accurate predictions for individual patients.¹⁰ As oncology care becomes more personalized, predicting HRQoL at the individual level is therefore necessary for improving patient-centered outcomes.

Recent advancements in machine learning and deep learning have transformed health outcomes research by providing digital health tools to analyze complex datasets and predict health outcomes.^11–14 Machine learning is particularly suitable for HRQoL prediction because it can handle the complex, nonlinear interactions among clinical, demographic, and patient-reported data and generate the individualized predictions that personalized care planning requires. However, the development of interpretable machine learning models capable of predicting HRQoL across multiple cancer types remains limited. This gap underscores the need for robust, patient-centered predictive models to support personalized cancer care.

To address this gap, our study aims to develop, assess, and interpret machine learning models to predict HRQoL in patients with cancer. We integrate a diverse set of patient-centered features and focus on the most common cancer types in China, including lung, breast, and colorectal cancer.¹⁵ Our study not only aligns with the mission of oncology care but also offers practical guidance on improving patient outcomes. Our findings are intended to provide an actionable, data-driven framework for the early identification of patients at risk for impaired wellbeing.

Methods

Study design and sampling

The study was designed and reported according to the STROBE statement for observational studies (Supplemental material 1).¹⁶ The cross-sectional design was chosen to meet our primary objective of identifying factors associated with HRQoL at a specific time point. This approach is methodologically supported by prior research employing similar data with machine learning to predict health-related outcomes.^4,17,18

Our study consecutively enrolled patients aged 18 years or older who had been diagnosed with one of the three most common types of cancer in China: lung, breast, or colorectal cancer. Patients were excluded if they found it difficult to understand the survey items. All participants were required to participate in public health insurance programs, as these programs have been a cornerstone of China's healthcare system over the past decades.¹⁹

The patients were recruited from five hospitals in Jiangsu province (i.e. Affiliated Hospital of Nantong University, the People's Hospital of Rugao, Jiangsu Provincial Hospital, Affiliated Dongtai Hospital of Nantong University, and Suqian People's Hospital) and one hospital in Hebei province (i.e. Shijiazhuang Traditional Chinese Medicine Hospital). Patients were mainly enrolled from the Departments of Oncology, Chemotherapy, and Radiotherapy in the sampled hospitals.

Machine learning methods are not immune to sample size requirements. For binary outcomes, a common rule of thumb requires at least 10 events per predictor variable.²⁰ We followed this events-per-variable rule: with 30 original features, at least 300 samples would be required. To ensure the representativeness of patients, the sample size was balanced within each sampling hospital.
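The events-per-variable arithmetic can be illustrated with a minimal Python sketch. The function names are hypothetical and not part of the study. The second helper shows a stricter reading of the rule, in which "events" means patients in the less frequent outcome group; it assumes the lower-HRQoL group reported in Table 1 (361 of 924 patients) is treated as the event class.

```python
import math


def min_events_for_epv(n_predictors, events_per_variable=10):
    """Minimum number of outcome events implied by an events-per-variable rule."""
    return n_predictors * events_per_variable


def min_sample_size(n_predictors, event_fraction, events_per_variable=10):
    """Total sample size at which the event class alone reaches the EPV target."""
    events = min_events_for_epv(n_predictors, events_per_variable)
    return math.ceil(events / event_fraction)


# 30 original features, as stated in the text.
required_events = min_events_for_epv(30)

# Stricter reading (assumption): the 361 lower-HRQoL patients are the "events".
required_total = min_sample_size(30, event_fraction=361 / 924)

print(required_events, required_total)
```

Under either reading, the 924 patients available for analysis exceed the computed minimum.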

Data collection

Data were collected through structured questionnaires and medical record reviews. The questionnaires mainly consisted of patient demographic information, clinical characteristics, patient-centered items, and assessment of HRQoL (Supplemental material 2). We conducted a pilot survey on 10 patients to test the feasibility and refine the methodology of our study. The pilot survey helped us to improve the clarity of the questionnaires and identify problems in the data collection process, allowing us to make necessary adjustments before the formal survey. Our formal survey was conducted from 2 March to 31 June 2023.

Questionnaires were printed for ease of reading and filling out. They were administered through one-on-one, face-to-face interviews conducted by 14 medical interns and 20 physicians. We developed survey manuals and distributed them to interviewers. We conducted either in-person or online training for interviewers to ensure consistency in data collection. Clinical characteristics were filled in by interviewers based on patients’ electronic medical records.

Patients had the right to refuse to participate in the survey at any time. To ensure data completeness, each questionnaire was reviewed by the interviewer immediately after completion. For any patient who had trouble understanding the questionnaire, the interviewer explained each item until the patient was able to understand its meaning. All patients were fully informed of the survey and signed informed consent forms. We provided each patient with a towel as a token of appreciation after completing the survey.

Outcome variables

The HRQoL in patients was assessed using the EQ-5D-5L instrument, which is a reliable and generic tool to describe health status across diverse populations and settings including cancer.^21–25 We used the official version of the EQ-5D-5L questionnaire exclusively for academic research, as approved by the EuroQol Group. The EQ-5D-5L comprises five dimensions: mobility, self-care, usual activities, pain/discomfort, and anxiety/depression. Utility scores were generated using Chinese-specific scoring algorithms based on the time tradeoff technique.²⁶ The health utility value was anchored between 0 (death) and 1 (full health), with higher scores meaning better HRQoL.

Statistical analysis

Baseline characteristics of the patients were analyzed according to their HRQoL, stratified into two subgroups: higher and lower. Descriptive statistics were computed for each subgroup. The normality of continuous variables was evaluated using the Shapiro–Wilk test. Variables that followed a normal distribution were reported as mean ± standard deviation and differences between subgroups were tested using independent samples t-tests. In contrast, variables that did not follow a normal distribution were presented as medians with interquartile ranges, and differences were analyzed using the Wilcoxon rank-sum test. Categorical variables were presented as numbers (percentages) and were compared using the Pearson's chi-square test.
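As an illustration of the categorical comparison described above, the following sketch computes a Pearson chi-square test for a 2 × 2 table using only the Python standard library. It is not the SPSS procedure used in the study. The counts are the gender-by-HRQoL counts reported in Table 1, and the function name is hypothetical.

```python
import math


def pearson_chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns (statistic, p_value) with 1 degree of freedom.
    """
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Rows: male, female. Columns: higher HRQoL, lower HRQoL (counts from Table 1).
stat, p = pearson_chi_square_2x2(315, 216, 248, 145)
print(f"chi-square = {stat:.3f}, p = {p:.3f}")
```

The resulting p-value is approximately 0.244, in line with the value reported for gender in Table 1.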

A two-tailed p-value of less than 0.05 was considered statistically significant for conventional analyses. However, in the context of machine learning, a p-value of less than 0.1 could be considered as the threshold for statistical significance.²⁷ Data analysis was performed using SPSS software (version 21.0).

Model development

Data preprocessing for the models was conducted in Python (version 3.9.19). We performed basic data cleaning, including range checks, plausibility checks, and handling of missing values. Variables with low levels of missingness (<5%) were imputed using the median for continuous variables and the mode for categorical variables, while features with extensive missingness were excluded. Multicollinearity among independent variables was assessed via variance inflation factors (Supplemental material 3).

Numerical features were normalized to ensure consistency across scales, while categorical variables were encoded using one-hot encoding. Feature selection was performed by recursive feature elimination with cross-validation, with a random forest classifier serving as the base estimator. This technique is relatively robust to moderate noise and correlations among predictors. The process iteratively removed the least important features based on model-derived importance scores. Each candidate feature subset was evaluated by five-fold stratified cross-validation, with the final subset selected to maximize the mean cross-validated area under the receiver operating characteristic curve (AUC) while maintaining model simplicity. Clinical relevance was also considered to ensure applicability in cancer care.
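The elimination loop described above can be sketched in a few lines of Python. This is a simplified illustration, not the study's implementation: in the study, importance scores came from a random forest and each subset was scored by mean five-fold stratified cross-validated AUC, whereas here both are replaced by hypothetical stand-in functions (`toy_importance`, `toy_cv_auc`) with made-up values so that the control flow is easy to follow.

```python
def recursive_feature_elimination(features, importance_fn, cv_score_fn, min_features=1):
    """Drop the least important feature one at a time and return the subset with
    the best cross-validated score, preferring the smaller subset on ties."""
    current = list(features)
    best_subset, best_score = list(current), cv_score_fn(current)
    while len(current) > min_features:
        importances = importance_fn(current)  # dict: feature -> importance
        weakest = min(current, key=lambda f: importances[f])
        current.remove(weakest)
        score = cv_score_fn(current)
        if score >= best_score:  # ties favor the simpler model
            best_subset, best_score = list(current), score
    return best_subset, best_score


# Hypothetical importance scores (illustrative values only, not study results).
TOY_IMPORTANCE = {
    "comorbidity_duration": 0.30,
    "cancer_stage": 0.25,
    "age": 0.20,
    "noise_a": 0.02,
    "noise_b": 0.01,
}


def toy_importance(subset):
    """Stand-in for model-derived importances refitted on the current subset."""
    return {f: TOY_IMPORTANCE[f] for f in subset}


def toy_cv_auc(subset):
    """Stand-in for mean cross-validated AUC: signal helps, noise features hurt."""
    signal = sum(TOY_IMPORTANCE[f] for f in subset if not f.startswith("noise"))
    penalty = 0.01 * sum(1 for f in subset if f.startswith("noise"))
    return round(0.5 + 0.4 * signal - penalty, 4)


selected, score = recursive_feature_elimination(
    list(TOY_IMPORTANCE), toy_importance, toy_cv_auc
)
print(selected, score)
```

Accepting ties in favor of the smaller subset mirrors the stated goal of maximizing AUC while maintaining model simplicity.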

We extended our analysis to encompass patient preferences, which, though often underestimated in HRQoL research, are hypothesized to impact patient satisfaction and treatment outcomes.^28,29 Therefore, our study incorporated a comprehensive set of features, including demographic characteristics (gender, age, education, occupation, marital status, monthly household income, and type of health insurance), clinical characteristics (diagnosis and stage of cancer, time since diagnosis, types and duration of comorbid chronic diseases, and current cancer therapies), and patient preferences (willingness to receive new anticancer drugs, preferences for routes of drug administration, and satisfaction with health insurance).

We employed seven machine learning models to predict HRQoL: random forest (RF), support vector machine (SVM), K-nearest neighbor (KNN), gradient boosting (GBoost), adaptive boosting (AdaBoost), extreme gradient boosting (XGB), and multilayer perceptron (MLP). These models were selected for their diverse strengths. RF integrates multiple decision trees to enhance accuracy. SVM offers strong generalizability with minimal data requirements; KNN provides fast training and robustness to outliers.³⁰ GBoost and AdaBoost combine weak learners into strong predictive models, with AdaBoost iteratively refining weights.³¹ XGB is renowned for its performance in classification and regression tasks. MLP utilizes deep learning for multilayer feature extraction and classification.³²

In our study, the dataset was divided into 70% for training, 10% for validation, and 20% for testing. For each of the seven machine learning models, we ran a grid search over a defined range of key hyperparameters (Supplemental material 4). The optimal hyperparameter values are shown in Supplemental material 5, and all other parameters were kept at the default settings in scikit-learn.
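A 70/10/20 partition can be sketched as follows. The sketch assumes a class-stratified split with a fixed random seed; the text does not state whether the split was stratified, so this is an assumption, and the resulting subset sizes are illustrative rather than the study's actual counts. The labels reproduce only the group sizes from Table 1 (563 higher and 361 lower HRQoL).

```python
import random


def stratified_split(labels, fractions=(0.7, 0.1, 0.2), seed=42):
    """Split sample indices into train/validation/test while keeping the class
    proportions roughly equal in every subset."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = int(fractions[0] * len(indices))
        n_val = int(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# 1 = higher HRQoL, 0 = lower HRQoL (group sizes from Table 1).
labels = [1] * 563 + [0] * 361
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```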

Given the lack of a universally acknowledged threshold for predicting HRQoL in patients with cancer, we reviewed prior research and found that the average health utilities ranged from 0.62 to 0.75.^33–36 We adopted the 0.70 cutoff because it lies at the lower end of the clinically meaningful range, allowing us to separate patients with clearly impaired HRQoL from others while preserving a sufficient sample size in each group (Supplemental material 6).

Since the two HRQoL groups showed moderate imbalance, we used the synthetic minority over-sampling technique (SMOTE) to reduce the effect of unequal class sizes before model training (Supplemental material 7). Our primary aim was to investigate general HRQoL determinants in patients with cancer, as key predictors are shared across cancer types. Pooling cohorts also improved statistical power.
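The core idea of synthetic minority over-sampling is to create new minority-class points by interpolating between an existing minority sample and one of its nearest minority neighbors. The following is a simplified standard-library sketch of that idea, not the library routine used in the study; the data points and the majority-class size are hypothetical.

```python
import math
import random


def smote_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points, each placed at a random position on the
    segment between a minority sample and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # 0 = at the base point, 1 = at the neighbor
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Hypothetical two-feature minority class and majority-class size.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
n_majority = 10

synthetic = smote_oversample(minority, n_new=n_majority - len(minority))
print(len(minority) + len(synthetic), "minority samples after oversampling")
```

Oversampling of this kind is usually confined to the training split so that validation and test data contain only observed patients.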

The predictive performance of the seven models was evaluated using accuracy, AUC, precision, recall, and F1-score, providing a comprehensive assessment of their ability to predict HRQoL. To evaluate whether insurance variables would introduce policy-driven bias, we conducted a sensitivity analysis by removing the insurance-type features and retraining the models.
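For reference, the threshold-based metrics named above can be computed from the four cells of a confusion matrix. The sketch below uses hypothetical labels and predictions and shows the plain binary definitions; whether the study reported binary or class-averaged versions is not stated in the text.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1-score for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical example: 4 positives and 6 negatives, with 2 misclassifications.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```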

To assess the impact of each feature on HRQoL predictions, we performed SHapley Additive exPlanations (SHAP) analysis using the best-performing model. The SHAP value quantifies the contribution of each feature to the model's prediction for an individual patient. Positive SHAP values indicate that a feature pushes the prediction toward higher HRQoL, while negative values indicate that it pushes the prediction toward lower HRQoL. To further explore nonlinear effects and potential interactions, we generated SHAP dependence plots for key predictors. We also generated a SHAP waterfall plot for a representative participant to illustrate local feature contributions. Model explanations derived from SHAP analyses were reviewed by two clinical experts to ensure clinical consistency and relevance to patient care.
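The meaning of a SHAP value can be made concrete with an exact Shapley computation on a toy model. The sketch below averages each feature's marginal contribution over all feature orderings, with "absent" features set to a baseline value. The scoring function, its coefficients, the baseline, and the patient are all hypothetical; practical SHAP implementations for tree ensembles use much faster algorithms than this brute-force enumeration.

```python
from itertools import permutations


def shapley_values(model, x, baseline):
    """Exact Shapley values for one instance, obtained by averaging marginal
    contributions over every ordering in which features are switched from
    their baseline value to their observed value."""
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        prev = model(current)
        for i in order:
            current[i] = x[i]
            new = model(current)
            phi[i] += new - prev
            prev = new
    return [v / len(orderings) for v in phi]


def toy_model(features):
    """Hypothetical HRQoL score with one interaction term (illustrative only)."""
    early_stage, comorbidity_years, on_chemo = features
    score = 0.6 + 0.2 * early_stage - 0.02 * comorbidity_years - 0.1 * on_chemo
    score -= 0.01 * comorbidity_years * on_chemo
    return score


baseline = [0, 5.0, 0]  # hypothetical reference patient
patient = [1, 10.0, 1]  # early stage, 10 years of comorbidity, on chemotherapy
phi = shapley_values(toy_model, patient, baseline)
print(phi)
```

The values sum to the difference between the patient's prediction and the baseline prediction, which is the additivity property that allows per-feature contributions to be read off a waterfall plot.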

Results

Demographic and clinical characteristics

A total of 949 patients participated in our survey, of whom 25 were excluded due to incomplete data. As a result, data from 924 patients were available for analysis. The EQ-5D-5L demonstrated good internal consistency with a Cronbach's alpha of 0.876 (Supplemental material 8). Table 1 presents the characteristics of the patients, stratified by HRQoL. Participants had an average age of 64 years, ranging from 28 to 93 years. Significant differences were observed between subgroups for key variables, such as age, monthly household income, types of public health insurance, types of cancer, cancer stage, time since cancer diagnosis, and comorbid chronic diseases.
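For readers unfamiliar with the internal consistency statistic, Cronbach's alpha compares the sum of the item variances with the variance of the total score. The sketch below applies the standard formula to hypothetical EQ-5D-5L level responses (1 to 5 on each of the five dimensions); the data are made up and do not reproduce the value reported above.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(item_scores[0])
    item_vars = [variance([resp[i] for resp in item_scores]) for i in range(k)]
    total_var = variance([sum(resp) for resp in item_scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical responses: rows are respondents; columns are mobility, self-care,
# usual activities, pain/discomfort, and anxiety/depression.
responses = [
    [1, 1, 1, 2, 1],
    [2, 2, 2, 3, 2],
    [1, 1, 2, 2, 2],
    [3, 3, 3, 4, 3],
    [4, 4, 5, 4, 4],
    [2, 1, 2, 2, 1],
]
alpha = cronbach_alpha(responses)
print(round(alpha, 3))
```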

Table 1.

Characteristics of patients with cancer.

Variables	Subgroups	Higher HRQoL (n = 563)	Lower HRQoL (n = 361)	Total (n = 924)	p-Value
Gender	Male	315 (55.95)	216 (59.83)	531 (57.47)	0.244
Gender	Female	248 (44.05)	145 (40.17)	393 (42.53)
Age	NA	64.0 (55.0–70.0)	68.0 (58.0–74.0)	64.2 (57.0–72.0)	<0.001
Education	Elementary school or illiteracy	224 (39.79)	192 (53.19)	416 (45.02)	<0.001
	Junior high school	154 (27.35)	93 (25.76)	247 (26.73)
	High school	94 (16.70)	40 (11.08)	134 (14.50)
	Junior college or above	91 (16.16)	36 (9.97)	127 (13.74)
Occupation	Urban employee	231 (41.03)	113 (31.30)	344 (37.23)	<0.001
	Farmer	175 (31.08)	153 (42.38)	328 (35.50)
	Retiree	146 (25.93)	79 (21.88)	225 (24.35)
	Unemployed or freelancers	11 (1.95)	16 (4.43)	27 (2.92)
Marital status	Single	5 (0.89)	3 (0.83)	8 (0.87)	0.007
	Married	527 (93.61)	323 (89.47)	850 (91.99)
	Divorced or separated	13 (2.31)	5 (1.39)	18 (1.95)
	Widowed	18 (3.20)	30 (8.31)	48 (5.19)
Monthly household income	≤4000 CNY	209 (37.12)	162 (44.88)	371 (40.15)	<0.001
	4001–8000 CNY	215 (38.19)	148 (41.00)	363 (39.29)
	≥8001 CNY	139 (24.69)	51 (14.13)	190 (20.56)
Types of public health insurance	UEBMI	237 (42.10)	88 (24.38)	325 (35.17)	<0.001
Types of public health insurance	URRBMI	326 (57.90)	273 (75.62)	599 (64.83)
Satisfaction with health insurance reimbursement	Not satisfied	11 (1.95)	11 (3.05)	22 (2.38)	0.021
	Neutral	161 (28.60)	75 (20.78)	236 (25.54)
	Satisfied	391 (69.45)	275 (76.18)	666 (72.08)
Types of cancer	Lung cancer	398 (70.69)	267 (73.96)	665 (71.97)	0.001
	Breast cancer	108 (19.18)	40 (11.08)	148 (16.02)
	Colorectal cancer	57 (10.12)	54 (14.96)	111 (12.01)
Newly diagnosed with cancer	Yes	179 (31.79)	84 (23.27)	263 (28.46)	0.005
Newly diagnosed with cancer	No	384 (68.21)	277 (76.73)	661 (71.54)
Cancer stage	0 or 1	146 (25.93)	40 (11.08)	186 (20.13)	<0.001
	2	163 (28.95)	104 (28.81)	267 (28.90)
	3	118 (20.96)	122 (33.80)	240 (25.97)
	4	136 (24.16)	95 (26.32)	231 (25.00)
Time since cancer diagnosis (years)	NA	0.3 (0.0–1.0)	0.5 (0.1–1.3)	0.9 (0.0–1.0)	<0.001
Comorbid chronic diseases	Yes	237 (42.10)	219 (60.66)	456 (49.35)	<0.001
Comorbid chronic diseases	No	326 (57.90)	142 (39.34)	468 (50.65)
Duration of comorbid chronic disease (years)	NA	0.0 (0.0–7.0)	5.0 (0.0–10.0)	5.2 (0.0–10.0)	<0.001
Ongoing chemotherapy	Yes	87 (15.45)	91 (25.21)	178 (19.26)	<0.001
Ongoing chemotherapy	No	476 (84.55)	270 (74.79)	746 (80.74)
Ongoing surgery	Yes	171 (30.37)	67 (18.56)	238 (25.76)	<0.001
Ongoing surgery	No	392 (69.63)	294 (81.44)	686 (74.24)
Preferences for new anticancer drugs	Low to moderate	55 (9.77)	38 (10.53)	93 (10.06)	0.073
	High	221 (39.25)	115 (31.86)	336 (36.36)
	Very high	287 (50.98)	208 (57.62)	495 (53.57)
Preferences for routes of drug administration	Intravenous injection	204 (36.23)	91 (25.21)	295 (31.93)	<0.001
	Oral administration	332 (58.97)	224 (62.05)	556 (60.17)
	Intramuscular injection	25 (4.44)	39 (10.80)	64 (6.93)
	Subcutaneous injection	2 (0.36)	7 (1.94)	9 (0.97)

*In 2023, 1000 CNY was approximately equal to 141.3 US dollars; HRQoL: health-related quality of life; NA: not applicable; UEBMI: Urban Employee Basic Medical Insurance; URRBMI: Urban and Rural Residents Basic Medical Insurance.

Model performance

Table 2 presents the evaluation results of the model performance across training, validation, and testing datasets. Accuracy measures the proportion of correct predictions made by a model. The test accuracy of the models ranged from 0.71 to 0.82, which indicated a moderate to high level of predictive performance. Notably, the RF and XGB models had the highest accuracy, demonstrating superior predictive power compared to the other models.

Table 2.

Summary of the model performance for predicting HRQoL.

Model	Train					Validation					Test
Model	ACC	AUC	F1	Precision	Recall	ACC	AUC	F1	Precision	Recall	ACC	AUC	F1	Precision	Recall
RF	0.92	0.98	0.92	0.93	0.92	0.78	0.81	0.78	0.78	0.78	0.81	0.82	0.80	0.80	0.81
SVM	0.91	0.98	0.91	0.92	0.91	0.74	0.77	0.71	0.71	0.74	0.79	0.73	0.77	0.79	0.79
KNN	0.79	0.87	0.79	0.80	0.80	0.78	0.71	0.71	0.71	0.72	0.77	0.76	0.77	0.77	0.78
MLP	0.81	0.87	0.81	0.81	0.80	0.79	0.79	0.79	0.79	0.80	0.77	0.80	0.77	0.77	0.77
GBoost	0.76	0.82	0.74	0.76	0.76	0.77	0.82	0.76	0.76	0.77	0.73	0.76	0.71	0.71	0.73
AdaBoost	0.75	0.81	0.73	0.74	0.75	0.76	0.81	0.75	0.74	0.76	0.71	0.76	0.69	0.68	0.69
XGB	0.84	0.91	0.83	0.85	0.84	0.74	0.79	0.74	0.73	0.74	0.80	0.78	0.79	0.79	0.80

ACC: accuracy; AdaBoost: adaptive boosting; AUC: area under the receiver operating characteristic curve; GBoost: gradient boosting; KNN: K-nearest neighbor; MLP: multilayer perceptron; RF: random forest; SVM: support vector machine; XGB: extreme gradient boosting.

ROC curves for each model, depicted in Figure 1(a) to (c), visually capture the tradeoff between sensitivity and specificity at various threshold settings. The AUC summarizes how well a model separates the two outcome groups. In the validation set, the AUC values of the RF and XGB models were close to 0.8, indicating reliable discriminative capability.
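The AUC has a direct probabilistic reading: it is the probability that a randomly chosen patient from the positive group receives a higher predicted score than a randomly chosen patient from the negative group. The sketch below computes it that way by pairwise comparison, using hypothetical labels and scores.

```python
def roc_auc(y_true, scores):
    """AUC as the share of positive-negative pairs in which the positive case
    is scored higher (ties count as one half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical predicted probabilities for 3 positive and 4 negative cases.
y_true = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
auc = roc_auc(y_true, scores)
print(round(auc, 3))
```

In this example, 11 of the 12 positive-negative pairs are ordered correctly, so the AUC is 11/12. A value of 0.5 corresponds to random ordering and 1.0 to perfect separation.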

Figure 1.

Performance of the prediction models. (a) ROC curves on the training set, (b) ROC curves on the validation set, (c) ROC curves on the test set, (d) comparison of precision-recall curves on the test set.

The precision-recall curves (Figure 1(d)) further illustrate the models’ balanced performance between precision and recall. The probability distributions of models are shown in Figure 2. Given their consistent high performance across multiple metrics and ability to distinguish positive and negative classes effectively, the RF and XGB models were selected for further analysis in our study. The sensitivity analysis shows that the model performance remained stable (Supplemental material 9) and our main findings are not driven by insurance-related features.

Figure 2.

Probability distributions of the prediction models.

Feature importance

Figure 3(a) displays the SHAP summary plot for the XGB model, which shows how the SHAP values of each feature are distributed across all predictions. Positive SHAP values, indicating a contribution toward higher predicted HRQoL, were observed in patients with early-stage cancer (stage 0 or 1) and those enrolled in UEBMI. The type of health insurance had a more substantial impact on HRQoL than satisfaction with insurance reimbursement. Conversely, negative SHAP values, indicating a contribution toward lower predicted HRQoL, were associated with longer duration of chronic comorbidities, colorectal cancer, and ongoing chemotherapy. Chronic respiratory diseases also emerged as detrimental to HRQoL. In addition, age, time since cancer diagnosis, and preferences for new drugs exhibited bidirectional impacts on HRQoL.

Figure 3.

SHAP analysis of the XGB model for predicting HRQoL. (a) Summary plot for SHAP values. Each point represents a single prediction. The position along the x-axis is the SHAP value, that is, the feature's contribution to the prediction. The color gradient (typically from blue to red) indicates the value of the feature. (b) Bar plot of feature importance. The length of each bar corresponds to the average impact of each feature across all predictions. HRQoL: health-related quality of life; SHAP: SHapley Additive exPlanations; XGB: extreme gradient boosting.

Figure 3(b) ranks the features by their mean absolute SHAP values across all predictions. The most important predictors of HRQoL included the duration of chronic comorbidities, early-stage cancer (stage 0 or 1), age, UEBMI enrollment, and time since cancer diagnosis.
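The ranking in a SHAP bar plot is obtained by averaging the absolute SHAP values of each feature over all predictions, so that positive and negative contributions do not cancel out. The sketch below illustrates this with a small hypothetical SHAP matrix; the numbers are not study results.

```python
def mean_abs_shap(shap_matrix, feature_names):
    """Rank features by mean absolute SHAP value (rows are predictions,
    columns are features), from most to least important."""
    n_rows = len(shap_matrix)
    importance = {
        name: sum(abs(row[j]) for row in shap_matrix) / n_rows
        for j, name in enumerate(feature_names)
    }
    return sorted(importance.items(), key=lambda item: item[1], reverse=True)


# Hypothetical SHAP values for three predictions and three features.
feature_names = ["comorbidity_duration", "early_stage", "age"]
shap_matrix = [
    [-0.30, 0.10, 0.05],
    [0.20, -0.15, -0.05],
    [-0.25, 0.20, 0.10],
]
ranking = mean_abs_shap(shap_matrix, feature_names)
for name, value in ranking:
    print(f"{name}: {value:.3f}")
```

A feature with bidirectional effects, such as age in this toy matrix, can therefore rank as important even when its signed SHAP values average close to zero.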

SHAP analysis of the RF model yielded results highly consistent with those of the XGB model (Supplemental material 10). Both models identified similar core predictors, such as duration of comorbid chronic diseases, age, early-stage cancer, time since cancer diagnosis, and type of insurance. Minor variations in feature ranking are attributable to their differing algorithmic structures (bagging vs. boosting). This consistency indicates that the key predictors and their impact on HRQoL are robust across tree-based ensembles.

Based on the SHAP dependence plots for the main predictors, age had a curvilinear effect, with HRQoL highest in middle age (Figure 4). A longer duration of chronic diseases showed a strong negative linear association. Time since cancer diagnosis had its strongest negative effect soon after diagnosis, and this effect diminished over time. Preference for new drugs had a weakly positive association. The SHAP waterfall plot for an individual patient is shown in Supplemental material 11. For this patient, a long duration of comorbid chronic diseases strongly pulled the prediction toward lower HRQoL, and not being in stage 0 or 1 also pushed the prediction toward lower HRQoL.

Figure 4.

SHAP dependence plot for key predictors in the model. (a) Age; (b) duration of comorbid chronic diseases; (c) time since cancer diagnosis; and (d) preference for new drugs. Notes: In each panel, the x-axis shows the original value of the predictor and the y-axis its SHAP value. The vertical color bar on the right indicates the values of the interaction feature used to color the points. SHAP: SHapley Additive exPlanations.

Discussion

Our study employs a comprehensive set of machine learning models to predict HRQoL in patients with cancer. This approach enables a robust evaluation of different algorithms and their predictive capabilities. Our findings demonstrated that the RF and XGB models achieved the highest predictive performance among seven machine learning models. The superior performance of RF and XGB models in our study coincides with their advantages in other health-related prediction tasks.^37,38 Given the capability to deal with complex feature interactions and resist overfitting,^31,39,40 these models are ideally suited for integration into clinical practice to dynamically predict HRQoL.

The SHAP analysis revealed the influence of various features on HRQoL.^41,42 Several key predictors were identified, such as the duration of chronic comorbidities, early-stage cancer (stage 0 or 1), age, enrollment in UEBMI, and time since cancer diagnosis. Specifically, longer duration of chronic comorbidities had a significant negative impact on HRQoL. This finding is consistent with studies that underline the heavy burden of multimorbidity in cancer care.^43,44 In our study, we also observed that chronic respiratory diseases were associated with poorer HRQoL in patients with cancer. Patients with chronic respiratory diseases often experience symptoms such as shortness of breath, coughing, and wheezing. Such symptoms can limit physical activity and contribute to feelings of fatigue and anxiety.^45–47 Therefore, integrated management strategies for chronic comorbidities are essential to improve HRQoL in patients with cancer.

Our study found that patients with early-stage cancer had significantly higher HRQoL compared to those with more advanced stages. This finding emphasizes the importance of early detection and aligns with China's national cancer control priorities. The Healthy China 2030 blueprint calls for strengthening early screening programs to shift the focus from late-stage treatment to disease prevention and early intervention. Early-stage cancers are generally less aggressive. The treatments are often less invasive and more effective, leading to better HRQoL outcomes.⁴⁸ Timely interventions not only improve survival rates but also enhance the overall wellbeing of patients.

Our findings also revealed that patients enrolled in the UEBMI reported higher HRQoL compared to those in the URRBMI. This finding highlights the system-level determinants of patient wellbeing. Variations in insurance coverage and reimbursement policies can affect patient access to cancer care and subsequently HRQoL. The disparity is likely due to the more comprehensive coverage, superior health services, and reduced financial burden provided by UEBMI.^37,49 Over the past two decades, China has made significant advancements toward universal health coverage. However, there are still challenges in ensuring equitable access to healthcare service coverage and high-quality care for patients with cancer.⁵⁰ Future policy initiatives should concentrate on addressing these disparities to ensure equal access to high-quality cancer care.

We found that age exerts a bidirectional effect on HRQoL, with some older patients maintaining or even achieving higher HRQoL, while others experience a decline. Although older patients with cancer might have better HRQoL in terms of social functioning, role performance, and financial stability, they often face problems in physical and cognitive functioning.⁵¹ Notably, older adults undergoing curative treatments could have comparable or superior HRQoL to their younger counterparts.⁵² Similar variability in our study was observed regarding duration since cancer diagnosis and patient preferences for new drugs. Care plans can be adjusted according to regular HRQoL assessments. Staying responsive to patient needs can ensure that interventions remain effective and patient-centered.

The findings of our study hold implications for oncology care, emphasizing the need for personalized care plans for patients with cancer, especially those with chronic comorbidities. Care teams are well positioned to identify patients at risk of lower HRQoL based on factors such as the duration of chronic conditions, advanced cancer stages, and age-related physical or cognitive problems.

To enhance patient outcomes, healthcare providers can promote regular cancer screenings and educate patients on the benefits of early diagnosis.⁵³ They can also coordinate care for patients with multiple chronic conditions. These approaches operationalize the integrated service delivery model central to universal health coverage, ensuring comprehensive treatment plans that manage cancer and comorbidities effectively.

Age-appropriate interventions, such as geriatric assessments and supportive care, are crucial for meeting the unique needs of older patients with cancer. Moreover, by providing clear information and emotional support, healthcare providers can help patients make informed decisions that align with their preferences, ultimately improving their overall wellbeing.⁵⁴

Our study has several notable strengths that enhance its validity and applicability. First, we reveal the potential of machine learning, a powerful digital health tool, to play a key role in personalizing oncology care. We are one of the first to apply machine learning models to predict HRQoL in patients with cancer using a comprehensive set of patient-centered features. Multiple machine learning models (RF, XGB, etc.) allow for more precise identification of patients at risk of lower HRQoL, enabling targeted interventions.

Second, our study examines a wide range of patient-centered variables. This approach provides an understanding of factors influencing HRQoL that have not been extensively explored in previous research. By identifying these variables as key predictors of HRQoL, our study generates new evidence for developing personalized care plans that address the multifaceted needs of patients with cancer. A recent study also showed that patients’ views could affect how they accept digital decision-support tools.⁵⁵ Therefore, patient-centered insights should be integrated into future model development and implementation strategies.

Third, we identified a differential impact of health insurance programs on HRQoL. This result informs policymakers of the critical role of equitable insurance reimbursement policies in mitigating the financial strain on patients with cancer, thereby enhancing their HRQoL.

Fourth, our study demonstrates the bidirectional effect of age on HRQoL. By highlighting age-related challenges in physical and cognitive functioning, our study supports the development of age-appropriate interventions and geriatric assessments, thus enhancing the quality of care for older patients with cancer.

Finally, we used the EQ-5D-5L instrument, a validated and well-known tool, to assess HRQoL. The generic EQ-5D questionnaire is the most widely used instrument in cancer research,^56–58 which supports the credibility and applicability of our findings.

Despite these strengths, our study has several limitations. First, the cross-sectional design, while common and widely accepted in health prediction research, restricts the ability to establish causality between predictors and HRQoL. Future research should use longitudinal designs to explore causal relationships over time. Second, the models were developed and validated on data from specific regions. The lack of external validation, whether on independent multicenter datasets or via a temporal split, limits the assessment of model generalizability. Third, recruiting participants from only two provinces may introduce selection bias due to differences in socioeconomic characteristics or patient populations. Future research should enroll participants from broader geographic regions to mitigate this limitation. Fourth, we used a dichotomized HRQoL outcome. Although dichotomization facilitates decision making, it represents a methodological compromise: modeling HRQoL as a continuous outcome would increase statistical power and yield more precise estimates. Finally, like other survey-based HRQoL assessments, our findings may be susceptible to self-report bias.
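As an illustration of the temporal split mentioned among the limitations, the following minimal Python sketch holds out the most recently enrolled patients for validation. It is not part of the study's analysis: the function name `temporal_split`, the field names `enrolled` and `low_hrqol`, the records, and the cutoff date are all hypothetical and serve only to show the idea.

```python
from datetime import date


def temporal_split(records, cutoff):
    """Split records by enrollment date.

    Patients enrolled before `cutoff` form the development set;
    patients enrolled on or after `cutoff` are held out for validation.
    """
    development = [r for r in records if r["enrolled"] < cutoff]
    validation = [r for r in records if r["enrolled"] >= cutoff]
    return development, validation


# Hypothetical records: field names and values are illustrative only.
records = [
    {"id": 1, "enrolled": date(2021, 3, 1), "low_hrqol": 0},
    {"id": 2, "enrolled": date(2021, 9, 15), "low_hrqol": 1},
    {"id": 3, "enrolled": date(2022, 2, 10), "low_hrqol": 0},
    {"id": 4, "enrolled": date(2022, 6, 30), "low_hrqol": 1},
]

development, validation = temporal_split(records, cutoff=date(2022, 1, 1))
print([r["id"] for r in development])
print([r["id"] for r in validation])
```

Because the held-out patients are enrolled later than every patient used for development, performance on the validation set gives a rough indication of how a model might behave on future patients from the same setting. It does not replace validation on independent multicenter data.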

Conclusions

In conclusion, our study demonstrates the potential of machine learning models, particularly RF and XGB, to predict HRQoL in patients with cancer using patient-centered features. Key factors were identified for improving HRQoL, such as chronic comorbidity management, early-stage cancer diagnosis, and equitable health insurance coverage. Our findings offer valuable suggestions for digital health and oncology care. Healthcare providers can refer to these insights to develop personalized care plans, promote early cancer detection, advocate for equitable health insurance coverage, manage chronic comorbidities, alleviate age-related functional declines, and continuously monitor HRQoL.

To translate our findings into practice, we propose developing a clinical decision support tool, such as a web-based risk calculator, to provide real-time HRQoL estimates. By integrating electronic health record data, the tool would support routine HRQoL collection and provide a clear HRQoL risk profile for each patient. Together, these efforts could improve patient outcomes and contribute to a more patient-centered healthcare system.
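The core of such a risk calculator could look roughly like the Python sketch below. It is an assumption-laden illustration, not the proposed tool: a simple logistic form stands in for a trained RF or XGB model, the coefficients are placeholders rather than estimates from this study, and the feature names, the functions `predict_low_hrqol_risk` and `risk_band`, and the band cut-offs are hypothetical.

```python
import math

# Placeholder coefficients for illustration only; NOT estimates from the study.
HYPOTHETICAL_COEFFICIENTS = {
    "intercept": -1.0,
    "early_stage": -0.8,
    "ongoing_chemotherapy": 0.6,
    "comorbidity_years": 0.1,
}


def predict_low_hrqol_risk(patient):
    """Return a probability of low HRQoL from a logistic model with placeholder weights."""
    score = HYPOTHETICAL_COEFFICIENTS["intercept"]
    for name, weight in HYPOTHETICAL_COEFFICIENTS.items():
        if name != "intercept":
            # Features missing from the patient record are treated as 0.
            score += weight * patient.get(name, 0)
    return 1.0 / (1.0 + math.exp(-score))


def risk_band(probability, low=0.2, high=0.5):
    """Map a probability to a coarse risk band; the cut-offs are arbitrary assumptions."""
    if probability < low:
        return "low"
    if probability < high:
        return "moderate"
    return "high"


# Hypothetical patient record, as it might be assembled from health record fields.
patient = {"early_stage": 1, "ongoing_chemotherapy": 0, "comorbidity_years": 2}
risk = predict_low_hrqol_risk(patient)
print(f"{risk:.2f}", risk_band(risk))
```

In a deployed tool, the placeholder function would be replaced by a validated model, and the cut-offs for each band would need to be chosen and evaluated with clinicians.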

Supplemental Material

Supplemental material, sj-pdf-1-dhj-10.1177_20552076261430462 for Predicting health-related quality of life in patients with cancer using machine learning: A step toward personalized oncology care by Jingyu Chen, Jiaran Chen, Ruiting Shen, Shuchen Ji, Guohua Wang, Xingyun Geng and Jinsong Geng in DIGITAL HEALTH

Footnotes

Acknowledgements

We acknowledge the contributions of our interviewers, who conducted one-on-one, face-to-face interviews with the patients. We are grateful to the patients for their efforts and time. When designing this study, Jinsong Geng was a research fellow at the Fellowship in Health Policy and Insurance Research, Department of Population Medicine, Harvard Medical School and Harvard Pilgrim Healthcare Institute.

Ethics considerations

Our study, including the patient consent process, has been approved by the Medical Ethics Committee at Nantong University (Ethical Approval-2021069, Ethical Approval-2021070) and conforms to the ethical guidelines of the Declaration of Helsinki.

Consent to participate

Informed, written consent was obtained from all individual participants in the study.

Consent for publication

Not applicable.

Author contributions

GJ and GX led the study design. CJR contributed to the literature search and qualitative analysis. GJ and CJY took part in the design of the questionnaire and the interpretation of the data. GJ, JS, and WG contributed to implementing the survey. CJY, SR, and GX performed the statistical analysis. CJY wrote the manuscript, and prepared tables and figures. GJ and GX provided comments on the manuscript. All authors reviewed the manuscript.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the National Natural Science Foundation of China (grant no. 72374113) and the Postgraduate Research and Practice Innovation Program of Jiangsu Province (KYCX25_3783, KYCX24-3568). The funders provided financial support for the conduct of the study. The funders had no role in the design, implementation, data collection and statistical analysis, data interpretation, or writing of the manuscript.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data availability

Original datasets will be available upon reasonable request to the corresponding author.

Guarantor

Jinsong Geng.

Supplemental material

Supplemental material for this article is available online.

References

1. Fukushima, Suzuki, Tanaka, et al. Global quality of life and mortality risk in patients with cancer: a systematic review and meta-analysis. Qual Life Res 2024; 33: 2631–2643.

2. Jiang, et al. Trends in cancer incidence and potential associated factors in China. JAMA Netw Open 2024; 7: e2440381.

3. Chen, et al. The efficacy of decision aids on enhancing early cancer screening: a meta-analysis of randomized controlled trials. Worldviews Evid Based Nurs 2025; 22: e70048.

4. Kim, Jeong, Lee, et al. Machine-learning model predicting quality of life using multifaceted lifestyles in middle-aged South Korean adults: a cross-sectional study. BMC Public Health 2024; 24: 159.

5. Zhang, Duan, et al. The effectiveness of death education on death anxiety, depression, and quality of life in patients with advanced cancer: a meta-analysis of randomised controlled trials. J Nurs Scholarsh 2025; 57: 941–956.

6. AlFayyad, Al-Tannir, Howaidi, et al. Health-related quality of life of breast and colorectal cancer patients undergoing active chemotherapy treatment: patient-reported outcomes. Qual Life Res 2022; 31: 2673–2680.

7. Chen, Liu, Fong DYT, et al. Health-related quality of life and its influencing factors in patients with breast cancer based on the scale QLICP-BR. Sci Rep 2023; 13: 15176.

8. Dixit, Gupta, Kataki, et al. Health-related quality of life and its determinants among cancer patients: evidence from 12,148 patients of Indian database. Health Qual Life Outcomes 2024; 22: 26.

9. Steyerberg. Assumptions in regression models: additivity and linearity. In: Clinical Prediction Models. Statistics for Biology and Health. Cham: Springer, 2019, pp.227–245.

10. Obermeyer, Emanuel. Predicting the future—big data, machine learning, and clinical medicine. N Engl J Med 2016; 375: 1216–1219.

11. Bitkina, Park, Kim. Application of artificial intelligence in medical technologies: a systematic review of main trends. Digit Health 2023; 9: 20552076231189331.

12. Tao, et al. Happiness prediction with domain knowledge integration and explanation consistency. IEEE Trans Comput Soc Syst 2025; 12: 2949–2962.

13. Zixuan, Yunxiang, Yuebin, et al. Cervical cancer metastasis and recurrence risk prediction based on deep convolutional neural network. Curr Bioinf 2022; 17: 164–173.

14. et al. Detecting muscle fatigue among community-dwelling senior adults with shape features of the probability density function of sEMG. J Neuroeng Rehabil 2024; 21: 196.

15. Cao, et al. Comparative analysis of cancer statistics in China and the United States in 2024. Chin Med J (Engl) 2024; 137: 3093–3100.

16. von Elm, Altman, Egger, et al. The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. Ann Intern Med 2007; 147: 573–577.

17. Wang, Cao, et al. Exploring COPD patient clusters and associations with health-related quality of life using a machine learning approach: a nationwide cross-sectional study. Engineering 2025; 50: 220–228.

18. Song, Yuan, Liu, et al. Machine learning algorithms to predict mild cognitive impairment in older adults in China: a cross-sectional study. J Affect Disord 2025; 368: 117–126.

19. Universal health insurance coverage for 1.3 billion people: what accounts for China's success? Health Policy 2015; 119: 1145–1152.

20. Riley, Ensor, Snell KIE, et al. Calculating the sample size required for developing a clinical prediction model. Br Med J 2020; 368: m441.

21. Feng, Kohlmann, Janssen, et al. Psychometric properties of the EQ-5D-5L: a systematic review of the literature. Qual Life Res 2021; 30: 647–673.

22. Yao, Yang, Zhang, et al. EQ-5D-5L population scores in mainland China: results from a nationally representative survey 2021. Value Health 2024; 27: 1573–1584.

23. Catto JWF, Downing, Mason, et al. Quality of life after bladder cancer: a cross-sectional survey of patient-reported outcomes. Eur Urol 2021; 79: 621–632.

24. Yang, et al. Perceived economic pressure and health-related quality of life of patients with colorectal cancer after surgery: evidence from China. J Family Med Prim Care 2025; 14: 2131–2137.

25. Khanal, Sapkota, Poudel, et al. Association between out-of-pocket expenditure and health-related quality of life among patients receiving cancer treatment: a cross-sectional study from Nepal. Health Qual Life Outcomes 2025; 23: 73.

26. Luo, Liu, et al. Estimating an EQ-5D-5L value set for China. Value Health 2017; 20: 662–669.

27. Chowdhury MZI, Turin. Variable selection strategies and its importance in clinical prediction modelling. Fam Med Commun Health 2020; 8: e000262.

28. Jayadevappa, Chhatre, Gallo, et al. Patient-centered preference assessment to improve satisfaction with care among patients with localized prostate cancer: a randomized controlled trial. J Clin Oncol 2019; 37: 964–973.

29. Flood, McCutcheon, Beusterien, et al. Patient preferences influencing treatment decision-making in early-stage breast cancer in Germany, Italy, and Japan. Patient Prefer Adherence 2024; 18: 1517–1530.

30. Ehsani, Drabløs. Robust distance measures for kNN classification of cancer data. Cancer Inform 2020; 19: 1176935120965542.

31. Azar A, Rikan S, Naemi, et al. Application of machine learning techniques for predicting survival in ovarian cancer. BMC Med Inform Decis Mak 2022; 22: 345.

32. Bhattacharya, Bennet, Davidson, et al. Multi-layer perceptron classification & quantification of neuronal survival in hypoxic-ischemic brain image slices using a novel gradient direction, grey level co-occurrence matrix image training. PLoS ONE 2022; 17: e0278874.

33. Huang, Yang, Liu, et al. Assessing health-related quality of life of patients with colorectal cancer using EQ-5D-5L: a cross-sectional study in Heilongjiang of China. BMJ Open 2018; 8: e022711.

34. Zhou, Guan, Wang, et al. Health-related quality of life in patients with different diseases measured with the EQ-5D-5L: a systematic review. Front Public Health 2021; 9: 675523. doi:10.3389/fpubh.2021.675523

35. Ngan, Mai, Van Minh, et al. Health-related quality of life among breast cancer patients compared to cancer survivors and age-matched women in the general population in Vietnam. Qual Life Res 2022; 31: 777–787.

36. Perwitasari, Purba, Candradewi, et al. Mapping EORTC-QLQ-C30 onto EQ-5D-5L index in Indonesian cancer patients. Asian Pac J Cancer Prev 2023; 24: 1125–1130.

37. Liu, Yin, et al. Establishment and validation of a machine learning prediction model based on big data for predicting the risk of bone metastasis in renal cell carcinoma patients. Comput Math Methods Med 2022; 2022: 5676570.

38. Zhou, Xie, Liu, et al. Interpretable machine learning model for early prediction of disseminated intravascular coagulation in critically ill children. Sci Rep 2025; 15: 11217.

39. Taye, Woubet, Hailie, et al. Application of the random forest algorithm to predict skilled birth attendance and identify determinants among reproductive-age women in 27 sub-Saharan African countries; machine learning analysis. BMC Public Health 2025; 25: 901.

40. Pahlevani, Rajabi, Taghavi, et al. Developing a decision support tool to predict delayed discharge from hospitals using machine learning. BMC Health Serv Res 2025; 25: 56.

41. Zhang, Liao, Zhang, et al. Explainable machine learning models for identifying mild cognitive impairment in older patients with chronic pain. BMC Nurs 2025; 24: 72.

42. Kim, Han, et al. Machine-learning model for predicting depression in second-hand smokers in cross-sectional data using the Korea National Health and Nutrition Examination Survey. Digit Health 2024; 10: 20552076241257046.

43. Keats, Cui, DeClercq, et al. Burden of multimorbidity and polypharmacy among cancer survivors: a population-based nested case-control study. Support Care Cancer 2021; 29: 713–723.

44. Koné, Scharf. Prevalence of multimorbidity in adults with cancer, and associated health service utilization in Ontario, Canada: a population-based retrospective cohort study. BMC Cancer 2021; 21: 406.

45. Miravitlles, Ribera. Understanding the impact of symptoms on the burden of COPD. Respir Res 2017; 18: 67.

46. Szymanska-Chabowska, Juzwiszyn, Tański, et al. The fatigue and quality of life in patients with chronic pulmonary diseases. Sci Prog 2021; 104: 368504211044034.

47. Davis, Chen. Impact of comorbidity on symptoms and quality of life among patients being treated for breast cancer. Cancer Nurs 2019; 42: 381–387.

48. Fitzgerald, Antoniou, Fruk, et al. The future of early cancer detection. Nat Med 2022; 28: 666–677.

49. Hung, et al. Disparities in end-of-life care, expenditures, and place of death by health insurance among cancer patients in China: a population-based, retrospective study. BMC Public Health 2020; 20: 1354.

50. Yip, Jian, et al. Universal health coverage in China part 1: progress and gaps. Lancet Public Health 2023; 8: e1025–e1034.

51. Quinten, Coens, Ghislain, et al. The effects of age on health-related quality of life in cancer populations: a pooled analysis of randomized controlled trials using the European Organisation for Research and Treatment of Cancer (EORTC) QLQ-C30 involving 6024 cancer patients. Eur J Cancer 2015; 51: 2808–2819.

52. Berg, Silander, Bove, et al. The effect of age on health-related quality of life for head and neck cancer patients up to 1 year after curative treatment. J Geriatr Oncol 2022; 13: 60–66.

53. Liu, Xue, et al. Effects of nurse-led interventions on early detection of cancer: a systematic review and meta-analysis. Int J Nurs Stud 2020; 110: 103684.

54. Boumendil, Yakubu, Al Wachami, et al. How nurses’ interventions promote health literacy in patients with non-communicable diseases: a systematic review. J Clin Nurs 2025; 34: 2493–2509.

55. Erul, Aktekin, Danışman, et al. Perceptions, attitudes, and concerns on artificial intelligence applications in patients with cancer. Cancer Control 2025; 32: 10732748251343245.

56. Colomer-Lahiguera, Bryant-Lukosius, Rietkoetter, et al. Patient-reported outcome instruments used in immune-checkpoint inhibitor clinical trials in oncology: a systematic review. J Patient Rep Outcomes 2020; 4: 58.

57. Rautenberg, Hodgkinson, Zerwes, et al. Meta-analysis of health state utility values measured by EuroQol 5-dimensions (EQ5D) questionnaire in Chinese women with breast cancer. BMC Cancer 2022; 22: 52.

58. Sprave, Gkika, Verma, et al. Patient reported outcomes based on EQ-5D-5L questionnaires in head and neck cancer patients: a real-world study. BMC Cancer 2022; 22: 1236.
