Abstract

Objective

Accurate prognostication is crucial for managing human immunodeficiency virus (HIV)-associated cutaneous T-cell lymphoma. In this study, we aimed to develop an improved machine learning-based prognostic model for predicting the 5-year survival rates in HIV-associated cutaneous T-cell lymphoma patients.

Methods

We derived and tested machine learning models using algorithms including Extreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LightGBM), and Random Forest. Our study involved data from a US population-based cohort of patients diagnosed with HIV-associated cutaneous T-cell lymphoma between 2004 and 2017, extracted from the Surveillance, Epidemiology, and End Results database. The primary outcome was the prediction of 5-year overall survival. Model discrimination was assessed using the area under the receiver operating characteristic curve (AUC), and calibration was assessed using Brier scores.

Results

A cohort of 381 HIV-associated cutaneous T-cell lymphoma patients was analyzed. Multivariate logistic regression identified age ≥60 years (odds ratio = 4.88), regional stage (odds ratio = 10.31), distant stage (odds ratio = 28.37), and chemotherapy (odds ratio = 4.71) as significant independent risk factors for 5-year mortality. Among seven machine learning models developed, the XGBoost model demonstrated the highest discrimination for 5-year overall survival (AUC = 0.867), followed by LightGBM (AUC = 0.835). Both models exhibited good calibration with low Brier scores (XGBoost = 0.130, LightGBM = 0.109). Support Vector Machine performed optimally in ten-fold cross-validation, logistic regression showed the lowest Brier score (0.106), and XGBoost provided the best balance of discrimination and robust performance.

Conclusion

Our novel machine learning approach produced prognostic models with superior discrimination for 5-year overall survival in HIV-associated cutaneous T-cell lymphoma patients using standard clinicopathological variables. These models offer potential for more accurate and personalized prognostics, potentially improving patient management and clinical decision-making.

Keywords

Human immunodeficiency virus-associated lymphoma; cutaneous T-cell lymphoma; prognosis prediction; machine learning; Surveillance, Epidemiology, and End Results database

Introduction

Lymphoma is a malignant neoplasm originating from the lymphatic system, categorized into Hodgkin lymphoma and non-Hodgkin lymphoma.¹ Human immunodeficiency virus (HIV)-associated cutaneous T-cell lymphoma (CTCL) is a rare but severe subtype of non-Hodgkin lymphoma, commonly observed in HIV-infected individuals.² Patients with this condition face unique challenges because HIV infection not only affects the immune system but may also accelerate lymphoma progression.³ Clinical presentations of HIV-associated CTCL vary from localized skin lesions to systemic disease, often with a poor prognosis.³ Current treatment strategies include antiretroviral therapy, chemotherapy, radiotherapy, and targeted therapies; however, treatment outcomes remain suboptimal.³,⁴ Given the disease complexity and prognostic uncertainty, accurate prediction of patient survival rates is crucial. Established prognostic factors for HIV-associated CTCL include age at diagnosis, disease stage (such as localized, regional, or distant spread), and treatment modalities such as radiotherapy and chemotherapy, which have consistently shown significant associations with patient survival outcomes.²,⁵,⁶

The Surveillance, Epidemiology, and End Results (SEER) database, a project of the National Cancer Institute, provides comprehensive cancer-related data, including incidence, mortality, and survival rates. Covering approximately 28% of the US population, it offers valuable resources for cancer research.⁷ Multiple internationally recognized tools exist for risk prediction models. Most tools categorize patients into different risk groups, derived from the D’Amico three-tier system.⁸ However, these models primarily focus on intermediate indicators such as biochemical recurrence rather than survival rates.⁸ Research indicates that effective prognostic models can be created using simple factors such as prostate-specific antigen, grade, and stage, while refined stratification systems can enhance a model’s discriminatory ability. Additionally, using continuous rather than categorical data allows for more accurate and personalized predictions, aiding clinical decision-making.⁹ For instance, the PREDICT Prostate tool and Memorial Sloan Kettering Cancer Center nomogram demonstrate high discriminatory ability in predicting survival rates and are available as web-based decision aids for patients and clinicians.⁹ However, specialized prediction models for HIV-associated CTCL remain scarce. Given the uniqueness and complexity of this disease, developing a dedicated predictive tool is crucial for improving patient management and prognosis.

However, enhanced personalized tools often rely on traditional statistical modeling methods, which typically use pre-specified variables and assume particular forms of interactions. In contrast, machine learning (ML), as a data-driven application of artificial intelligence, learns and improves from data automatically without explicit programming. Consequently, these algorithms can mine datasets to identify not only pre-established risk factors but also complex, nonlinear relationships and subtle patterns often overlooked by traditional methods. Although ML applications in healthcare (e.g. developing novel prognostic models) are rapidly growing, the use of ML in the specific and rare context of HIV-associated CTCL remains limited. Notably, to the best of our knowledge, there are no prior reports of ML-based prognostic models for HIV-associated CTCL utilizing the comprehensive SEER database, particularly involving a systematic comparison of multiple algorithms. Therefore, the core innovation of this study lies not only in addressing this gap by being among the first studies to apply such an approach in this specific patient cohort using SEER data but also in systematically developing and comparing multiple advanced ML models (including ensemble methods such as Extreme Gradient Boosting (XGBoost) and Light Gradient Boosting Machine (LightGBM), alongside other classical algorithms) to predict 5-year survival. We hypothesize that these approaches can generate superior predictive models by more effectively integrating known prognostic factors (e.g. age, stage, and treatment) and uncovering their complex interplay, thus achieving enhanced discrimination and calibration over traditional models. Our aim is to develop this improved model and rigorously compare its performance against established benchmarks. Ultimately, this research may provide crucial evidence for more accurate and personalized prognostic assessment, thereby improving patient management and individualized treatment decisions for this challenging condition.

Materials and methods

Data source

Clinical information of patients diagnosed with HIV-associated CTCL from 2004 to 2017 was extracted from the SEER database using SEER*Stat software (version 8.4.3). The SEER database, an authoritative source for cancer statistics in the US, provides cancer-related data including clinicopathological characteristics, treatment, incidence, and survival. For this study, the inclusion criteria were as follows: (a) site recode ICD-O-3/WHO 2008 classified as lymphoma; (b) Collaborative Stage (CS) site-specific factor 1 (varying by schema from 2004 to 2017) coded as 001 or 010; (c) lymphoma confirmed by microscopic examination; (d) complete tumor staging and diagnostic data. Patients were randomly divided into training and validation sets at a ratio of 7:3. The outcome variable was 5-year overall survival status.
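The random 7:3 partition described above can be sketched in a few lines. The example below is illustrative Python using only the standard library; the analyses in this study were performed in R, and neither the random seed nor any stratification scheme is reported, so the function name `train_validation_split`, the seed value, and the placeholder patient identifiers are all assumptions.

```python
import random


def train_validation_split(ids, train_fraction=0.7, seed=42):
    """Shuffle ids reproducibly and cut them into training and validation lists."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    shuffled = list(ids)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical identifiers standing in for the 381 SEER records.
patient_ids = list(range(381))
train_ids, validation_ids = train_validation_split(patient_ids)
print(len(train_ids), len(validation_ids))
```

A plain 70% cut of 381 records yields 267 and 114 patients, whereas the study reports 265 and 116, so this sketch does not reproduce the exact partition used here.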

Construction and validation of ML models

Variables identified as statistically significant through univariate and multivariate logistic regression (LR) analyses were incorporated into the ML models. Seven ML models were constructed in this study: K-Nearest Neighbors (KNN), Random Forest (RF), Support Vector Machine (SVM), LightGBM, LR, decision tree (DT), and XGBoost. Model performance was evaluated using receiver operating characteristic (ROC) curves, precision–recall (PR) curves, decision curve analysis (DCA), ten-fold cross-validation, and calibration curves. DCA was used to assess the clinical decision-making capability of the models, and calibration curves were used to evaluate the agreement between predicted and observed outcomes. Additionally, accuracy, sensitivity (recall), specificity, precision (positive predictive value), negative predictive value, and F1 score were used to assess the ML models. The significance of the ML models and validation tools in this study is shown in Table 1.
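The threshold-based metrics listed above all derive from the four cells of a confusion matrix. The following minimal Python sketch shows the relationships; note that precision and positive predictive value are the same quantity, as are recall and sensitivity. The function name `classification_metrics` and the toy labels are hypothetical, and coding death within 5 years as the positive class (1) is an assumption borrowed from the coding used in Table 3.

```python
def classification_metrics(y_true, y_pred):
    """Compute threshold-based metrics from binary labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Undefined ratios (zero denominator) are reported as NaN rather than 0.
        return num / den if den else float("nan")

    sensitivity = ratio(tp, tp + fn)  # identical to recall
    precision = ratio(tp, tp + fp)    # identical to positive predictive value
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": sensitivity,
        "specificity": ratio(tn, tn + fp),
        "precision": precision,
        "npv": ratio(tn, tn + fn),
        "f1": ratio(2 * precision * sensitivity, precision + sensitivity),
    }


# Toy labels for illustration only; these are not study data.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```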

Table 1.

Significance of ML models and ML validation tools in this study.

Machine learning model or validation tool	General purpose	Significance in this study
K-Nearest Neighbors (KNN)	Classifies new data points based on the majority class of their “k” closest neighbors in the feature space.	Explores a nonparametric approach to classify patients based on similarity of their clinicopathological features to known survival outcomes.
Random Forest (RF)	An ensemble learning method that constructs multiple decision trees and merges their outputs for improved accuracy and robustness.	Leverages the power of multiple decision trees to identify complex patterns and interactions among variables influencing survival, reducing overfitting compared to a single decision tree.
Support Vector Machine (SVM)	Finds an optimal hyperplane that best separates data points of different classes in a high-dimensional space.	Aims to find the best decision boundary to distinguish patients likely to survive 5 years from those who are not, potentially capturing nonlinear relationships through kernel functions.
LightGBM (Light Gradient Boosting Machine)	A gradient boosting framework that uses tree-based learning algorithms, known for its speed and efficiency.	Provides a highly efficient and scalable gradient boosting model to predict survival, capable of handling large datasets and achieving high accuracy by sequentially correcting the errors of previous trees.
Logistic regression (LR)	A statistical model that uses a logistic function to model the probability of a binary outcome.	Serves as a well-established statistical baseline model to predict the probability of 5-year survival based on a linear combination of predictor variables.
Decision tree (DT)	A tree-like model where each internal node represents a “test” on an attribute, and each leaf node represents a class label.	Offers an interpretable model that creates a set of rules derived from patient characteristics to predict survival, providing insight into decision-making pathways.
Extreme Gradient Boosting (XGBoost)	An optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable.	Employs an advanced and regularized gradient boosting algorithm to achieve high predictive accuracy for 5-year survival by building an ensemble of decision trees in a sequential manner.
Receiver operating characteristic (ROC) curve and AUC	Evaluates a model’s ability to discriminate two classes (e.g. survival vs. nonsurvival) across all classification thresholds. AUC quantifies this overall discrimination.	Assesses and compares the ability of each ML model to correctly distinguish patients who will survive 5 years from those who will not. A higher AUC indicates better discrimination.
Precision–recall (PR) curve and PR-AUC	Evaluates model performance, especially for imbalanced datasets, by plotting precision against recall for different thresholds. PR-AUC summarizes this trade-off.	Provides a more informative measure of model performance than ROC-AUC when dealing with potentially imbalanced survival outcomes, focusing on the model’s ability to correctly identify true survival cases.
Decision curve analysis (DCA)	Assesses the clinical usefulness of a predictive model by quantifying the net benefit at different probability thresholds.	Determines the clinical utility and net benefit of using the developed ML models for predicting 5-year survival, helping understand if a model’s predictions would lead to better clinical decisions.

HIV: human immunodeficiency virus; ML: machine learning; AUC: area under the ROC curve.
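The AUC described in Table 1 has a direct probabilistic reading: it is the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it by comparing every positive–negative pair, which is adequate for a cohort of a few hundred patients. The function name `roc_auc` and the toy scores are hypothetical and are not study data.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of positive-negative pairs ranked correctly (ties = 0.5)."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    concordant = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                concordant += 1.0
            elif pos_score == neg_score:
                concordant += 0.5
    return concordant / (len(positives) * len(negatives))


# Toy example: three of the four positive-negative pairs are ordered correctly.
toy_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_auc)
```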

Statistical analysis

Categorical variables were expressed as percentages and compared using Fisher’s exact test or chi-square test. Risk factors were identified using multivariate LR analysis. A p-value <0.05 (two-sided) was considered to indicate statistical significance. All analyses were performed using R software (version 4.2.3, http://www.r-project.org).

Results

Baseline population characteristics

This study analyzed the clinical characteristics of 381 patients with HIV-associated CTCL, including 265 patients in the training set and 116 in the validation set. Baseline demographic characteristics and clinical features of the training and validation cohorts are detailed in Table 2. The patient population was predominantly aged <60 years (55%–60%), male (58%–66%), and married (54%–56%), with a majority being Caucasian (70%–73%). Most patients (86%–89%) were diagnosed between 2011 and 2017, indicating an increased diagnostic rate in recent years. Over half of the patients were from middle- to high-income households (annual income >US$74,999). Regarding disease characteristics, skin not otherwise specified was the most common primary site (53%–55%), and approximately 80% of patients were diagnosed at a localized stage. In terms of treatment, most patients did not undergo surgery (78%–80%), radiotherapy utilization was low (9%–13%), and chemotherapy usage was approximately 21%–23%. Notably, no statistically significant differences (p > 0.05) were observed between the training and validation sets for all variables, suggesting balanced data partitioning and good representation of the overall population by the validation set.

Table 2.

Baseline characteristics of patients with HIV-associated cutaneous T-cell lymphoma in the training and validation cohorts.

Parameter	Levels	1 (N = 265)	2 (N = 116)	p
Age (years)	<60	160 (60.4%)	64 (55.2%)	0.403
	≥60	105 (39.6%)	52 (44.8%)
Sex	Male	153 (57.7%)	76 (65.5%)	0.189
	Female	112 (42.3%)	40 (34.5%)
Marital status	Married	143 (54%)	65 (56%)	0.793
	Unmarried	122 (46%)	51 (44%)
Race	Caucasian	194 (73.2%)	81 (69.8%)	0.574
	Black	41 (15.5%)	23 (19.8%)
	Other	30 (11.3%)	12 (10.3%)
Year of diagnosis	2004–2010	38 (14.3%)	13 (11.2%)	0.507
	2011–2017	227 (85.7%)	103 (88.8%)
Median household income	<US$55,000	17 (6.4%)	8 (6.9%)	0.977
	US$55,000–US$74,999	112 (42.3%)	48 (41.4%)
	>US$74,999	136 (51.3%)	60 (51.7%)
Primary site	Skin of the trunk	47 (17.7%)	21 (18.1%)	0.830
	Skin of the lower limb and hip	33 (12.5%)	16 (13.8%)
	Skin not otherwise specified	141 (53.2%)	64 (55.2%)
	Other	44 (16.6%)	15 (12.9%)
Summary stage	Localized	212 (80%)	92 (79.3%)	0.806
	Regional	42 (15.8%)	18 (15.5%)
	Distant	4 (1.5%)	1 (0.9%)
	Unknown	7 (2.6%)	5 (4.3%)
Diagnosis-to-treatment duration (years)	0–2	49 (18.5%)	19 (16.4%)	0.743
	3–27	49 (18.5%)	19 (16.4%)
	>27	94 (35.5%)	48 (41.4%)
	Unknown	73 (27.5%)	30 (25.9%)
Surgery	No	207 (78.1%)	93 (80.2%)	0.752
	Yes	58 (21.9%)	23 (19.8%)
Radiation	No/Unknown	242 (91.3%)	101 (87.1%)	0.276
	Yes	23 (8.7%)	15 (12.9%)
Chemotherapy	No/Unknown	205 (77.4%)	92 (79.3%)	0.773
	Yes	60 (22.6%)	24 (20.7%)

HIV: human immunodeficiency virus.

“1” refers to the training cohort, and “2” refers to the validation cohort.

Univariate and multivariate LR analyses

Univariate and multivariate LR analyses identified risk factors for 5-year mortality in patients with HIV-associated CTCL. The detailed results of these analyses are presented in Table 3. In multivariate analysis, age emerged as a significant prognostic factor, with patients aged ≥60 years having 4.88-fold higher odds of death within 5 years than younger patients (odds ratio (OR) = 4.88, 95% confidence interval (CI): 2.04–11.68, p < 0.001). Disease stage significantly impacted prognosis; compared with localized stage patients, regional stage patients showed 10.31-fold higher odds of death (OR = 10.31, 95% CI: 4.24–25.11, p < 0.001), while distant stage patients showed 28.37-fold higher odds (OR = 28.37, 95% CI: 1.58–507.67, p = 0.023), although the wide CI for distant stage reflects the very small number of such patients (four in the training cohort). Radiotherapy was significantly associated with higher odds of death in univariate analysis (OR = 5.61, 95% CI: 2.29–13.72, p < 0.001) but showed no statistically significant association in multivariate analysis (p = 0.267). Chemotherapy was significantly associated with higher odds of death in multivariate analysis (OR = 4.71, 95% CI: 1.97–11.28, p < 0.001). Female sex was associated with lower odds of death in univariate analysis (OR = 0.50, p = 0.049); however, this association was not significant in multivariate analysis (p = 0.318). Marital status, race, diagnosis year, household income, primary site, diagnosis-to-treatment time, and surgery did not demonstrate significant effects in univariate analysis. These findings provided the basis for variable selection in constructing ML models to predict 5-year survival in HIV-associated CTCL patients.

Table 3.

Univariate and multivariate logistic regression analyses of factors associated with 5-year survival in patients with HIV-associated cutaneous T-cell lymphoma.

Characteristics		0 (N = 220)	1 (N = 45)	OR (univariate)	OR (multivariate)
Age (years)	<60	148 (67.3%)	12 (26.7%)
	≥60	72 (32.7%)	33 (73.3%)	5.65 (2.76–11.59, p < 0.001)	4.88 (2.04–11.68, p < 0.001)
Sex	Male	121 (55%)	32 (71.1%)
	Female	99 (45%)	13 (28.9%)	0.50 (0.25–1.00, p = 0.049)	0.63 (0.26–1.55, p = 0.318)
Marital status	Married	117 (53.2%)	26 (57.8%)
	Unmarried	103 (46.8%)	19 (42.2%)	0.83 (0.43–1.59, p = 0.573)
Race	Caucasian	165 (75%)	29 (64.4%)
	Black	30 (13.6%)	11 (24.4%)	2.09 (0.94–4.62, p = 0.070)
	Other	25 (11.4%)	5 (11.1%)	1.14 (0.40–3.21, p = 0.807)
Year of diagnosis	2004–2010	31 (14.1%)	7 (15.6%)
	2011–2017	189 (85.9%)	38 (84.4%)	0.89 (0.37–2.17, p = 0.798)
Median household income	<US$55,000	13 (5.9%)	4 (8.9%)
	US$55,000–US$74,999	91 (41.4%)	21 (46.7%)	0.75 (0.22–2.53, p = 0.643)
	>US$74,999	116 (52.7%)	20 (44.4%)	0.56 (0.17–1.89, p = 0.351)
Primary site	Skin of the trunk	43 (19.5%)	4 (8.9%)
	Skin of the lower limb and hip	29 (13.2%)	4 (8.9%)	1.48 (0.34–6.41, p = 0.598)
	Skin not otherwise specified	113 (51.4%)	28 (62.2%)	2.66 (0.88–8.04, p = 0.082)
	Other	35 (15.9%)	9 (20%)	2.76 (0.78–9.74, p = 0.114)
Summary stage	Localized	195 (88.6%)	17 (37.8%)
	Regional	18 (8.2%)	24 (53.3%)	15.29 (6.96–33.59, p < 0.001)	10.31 (4.24–25.11, p < 0.001)
	Distant	1 (0.5%)	3 (6.7%)	34.41 (3.39–349.05, p = 0.003)	28.37 (1.58–507.67, p = 0.023)
	Unknown	6 (2.7%)	1 (2.2%)	1.91 (0.22–16.82, p = 0.559)	2.96 (0.30–29.34, p = 0.353)
Diagnosis-to-treatment duration (years)	0–2	41 (18.6%)	8 (17.8%)
	3–27	38 (17.3%)	11 (24.4%)	1.48 (0.54–4.08, p = 0.445)
	>27	75 (34.1%)	19 (42.2%)	1.30 (0.52–3.22, p = 0.574)
	Unknown	66 (30%)	7 (15.6%)	0.54 (0.18–1.61, p = 0.272)
Surgery	No	172 (78.2%)	35 (77.8%)
	Yes	48 (21.8%)	10 (22.2%)	1.02 (0.47–2.22, p = 0.952)
Radiation	No/Unknown	208 (94.5%)	34 (75.6%)
	Yes	12 (5.5%)	11 (24.4%)	5.61 (2.29–13.72, p < 0.001)	1.94 (0.60–6.27, p = 0.267)
Chemotherapy	No/Unknown	186 (84.5%)	19 (42.2%)
	Yes	34 (15.5%)	26 (57.8%)	7.49 (3.73–15.01, p < 0.001)	4.71 (1.97–11.28, p < 0.001)

HIV: human immunodeficiency virus; OR: odds ratio.

“0” represents patients who were alive at 5 years, and “1” represents patients who died within 5 years.
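The univariate ORs in Table 3 can be checked directly from the reported counts, because an unadjusted OR from a single binary predictor is the cross-product ratio of the corresponding 2 × 2 table. The Python sketch below applies the standard Wald interval on the log-odds scale to the age rows; the function name `odds_ratio_wald` is hypothetical, and the assumption that a Wald interval was used is ours. The multivariate ORs cannot be recovered this way, as they require the fitted model.

```python
import math


def odds_ratio_wald(exposed_events, exposed_nonevents,
                    reference_events, reference_nonevents, z=1.96):
    """Unadjusted odds ratio and Wald 95% CI from the four cells of a 2x2 table."""
    odds_ratio = (exposed_events * reference_nonevents) / (
        exposed_nonevents * reference_events)
    # The standard error of log(OR) is the root of the summed reciprocal counts.
    se_log_or = math.sqrt(1 / exposed_events + 1 / exposed_nonevents
                          + 1 / reference_events + 1 / reference_nonevents)
    log_or = math.log(odds_ratio)
    return (odds_ratio,
            math.exp(log_or - z * se_log_or),
            math.exp(log_or + z * se_log_or))


# Age rows of Table 3: >=60 years (33 died, 72 alive) vs <60 years (12 died, 148 alive).
or_age, ci_low, ci_high = odds_ratio_wald(33, 72, 12, 148)
print(f"OR = {or_age:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f})")
```

With these counts, the result agrees with the univariate estimate reported for age in Table 3 (5.65, 2.76–11.59).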

Construction and validation of ML models

Seven ML models were constructed to predict 5-year survival in patients with HIV-associated CTCL. XGBoost demonstrated the highest area under the ROC curve (AUC) value (0.867), followed by LightGBM (AUC = 0.835), LR (AUC = 0.8), RF (AUC = 0.757), SVM (AUC = 0.751), KNN (AUC = 0.737), and DT (AUC = 0.691) (Figure 1). The PR-AUC values for LightGBM, XGBoost, LR, RF, SVM, DT, and KNN were 0.527, 0.498, 0.48, 0.471, 0.415, 0.405, and 0.352, respectively (Figure 2). DCA indicated minimal differences in the net benefit across all ML models (Figure 3). Results from ten-fold cross-validation (Figure 4) revealed that SVM performed best in predicting the 5-year survival of patients with HIV-associated CTCL. A lower Brier score indicates closer agreement between predicted probabilities and observed outcomes. As illustrated in Figure 5, LR exhibited the lowest Brier score (0.106), followed by LightGBM (0.109), KNN (0.114), SVM (0.122), RF (0.129), DT (0.129), and XGBoost (0.13).

Figure 1.

ROC curves for seven machine learning models predicting the 5-year survival rate in HIV-associated cutaneous T-cell lymphoma patients. XGBoost demonstrated the highest AUC (AUC = 0.867), followed by LightGBM (AUC = 0.835), logistic regression (AUC = 0.8), Random Forest (AUC = 0.757), Support Vector Machine (AUC = 0.751), K-Nearest Neighbors (AUC = 0.737), and decision tree (AUC = 0.691). ROC: receiver operating characteristic; HIV: human immunodeficiency virus; AUC: area under the ROC curve; XGBoost: Extreme Gradient Boosting; LightGBM: Light Gradient Boosting Machine.

Figure 2.

PR curves for seven machine learning models. PR-AUC values were as follows: LightGBM (0.527), XGBoost (0.498), logistic regression (0.48), Random Forest (0.471), Support Vector Machine (0.415), decision tree (0.405), and K-Nearest Neighbors (0.352). PR: precision–recall; AUC: area under the receiver operating characteristic curve; XGBoost: Extreme Gradient Boosting; LightGBM: Light Gradient Boosting Machine.

Figure 3.

DCA comparing the net benefit across all machine learning models for predicting the 5-year survival rate in HIV-associated cutaneous T-cell lymphoma patients. DCA: decision curve analysis; HIV: human immunodeficiency virus.
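For readers unfamiliar with DCA, the net benefit at a threshold probability p_t is conventionally defined as TP/n − (FP/n) × p_t/(1 − p_t), where TP and FP are the true and false positives among n patients when everyone with predicted risk ≥ p_t is treated as high risk. A minimal Python sketch of this standard definition follows; the function names `net_benefit` and `net_benefit_treat_all` and the toy data are hypothetical and do not reproduce the curves in Figure 3.

```python
def net_benefit(y_true, probs, threshold):
    """Net benefit of classifying patients with predicted risk >= threshold as high risk."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(y_true)
    tp = sum(1 for t, p in zip(y_true, probs) if p >= threshold and t == 1)
    fp = sum(1 for t, p in zip(y_true, probs) if p >= threshold and t == 0)
    weight = threshold / (1 - threshold)  # harm of a false positive relative to a true positive
    return tp / n - (fp / n) * weight


def net_benefit_treat_all(y_true, threshold):
    """Reference strategy in which every patient is classified as high risk."""
    return net_benefit(y_true, [1.0] * len(y_true), threshold)


# Toy outcomes and predicted risks for illustration only; these are not study data.
y_true = [1, 1, 0, 0, 0]
probs = [0.9, 0.4, 0.6, 0.2, 0.1]
for pt in (0.3, 0.5):
    print(pt, net_benefit(y_true, probs, pt), net_benefit_treat_all(y_true, pt))
```

A model is clinically useful at a given threshold only if its net benefit exceeds both the treat-all reference and zero (the treat-none reference).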

Figure 4.

Ten-fold cross-validation results for seven machine learning models, with SVM demonstrating optimal performance in predicting 5-year survival rates. SVM: Support Vector Machine.
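As an illustration of how ten folds might be formed with an outcome as imbalanced as the one in this cohort, the Python sketch below deals the indices of each class across folds so that every fold contains deaths. This is a sketch under stated assumptions: the paper does not report whether folds were stratified or which cohort was resampled, the function name `stratified_kfold` is hypothetical, and the labels simply mirror the training-cohort outcome counts in Table 3 (220 alive, 45 died).

```python
import random


def stratified_kfold(labels, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs with classes spread evenly over folds."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    buckets = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal shuffled indices round-robin so each fold gets a near-equal share.
        for position, index in enumerate(indices):
            buckets[position % k].append(index)
    for held_out in range(k):
        test = sorted(buckets[held_out])
        train = sorted(i for b in range(k) if b != held_out for i in buckets[b])
        yield train, test


# Labels mirroring the training-cohort outcome counts in Table 3 (0 = alive, 1 = died).
labels = [0] * 220 + [1] * 45
folds = list(stratified_kfold(labels, k=10))
for train, test in folds:
    print(len(train), len(test), sum(labels[i] for i in test))
```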

Figure 5.

Brier scores for each machine learning model, indicating prediction accuracy. Lower scores represent higher consistency between predictions and actual outcomes. Logistic regression showed the lowest Brier score (0.106), followed by LightGBM (0.109), K-Nearest Neighbors (0.114), Support Vector Machine (0.122), Random Forest (0.129), decision tree (0.129), and XGBoost (0.13). XGBoost: Extreme Gradient Boosting; LightGBM: Light Gradient Boosting Machine.
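The Brier score is the mean squared difference between the predicted probability and the observed binary outcome, so its scale depends on the event rate: a model that predicts the same probability p for everyone, with p equal to the event rate, scores p(1 − p). The Python sketch below computes this reference value from the training-cohort counts in Table 3 (45 deaths among 265 patients). The validation-cohort event rate is not reported, so the figure is only a rough point of comparison for the scores above; the function name `brier_score` is hypothetical.

```python
def brier_score(y_true, probs):
    """Mean squared difference between predicted probabilities and binary outcomes."""
    if len(y_true) != len(probs) or not y_true:
        raise ValueError("y_true and probs must be non-empty and of equal length")
    return sum((p - t) ** 2 for t, p in zip(y_true, probs)) / len(y_true)


# Training-cohort outcome counts from Table 3: 220 alive (0) and 45 died (1).
outcomes = [0] * 220 + [1] * 45
event_rate = sum(outcomes) / len(outcomes)

# Brier score of an uninformative model that predicts the event rate for everyone.
reference_brier = brier_score(outcomes, [event_rate] * len(outcomes))
print(round(event_rate, 3), round(reference_brier, 3))
```

Under this assumption the uninformative reference is approximately 0.141, which puts the reported range of 0.106 to 0.130 in context.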

Discussion

HIV-associated CTCL is a rare form of CTCL occurring in HIV-infected individuals. This condition typically manifests as skin lesions such as patches, plaques, or tumors. Owing to the compromised immune function of patients, its clinical presentation and treatment strategies may differ from non-HIV-associated CTCLs.² Our study developed and tested an ML-based prognostic model using a large-scale dataset to predict the 5-year survival rate in patients with HIV-associated CTCL, comparing its performance with those of multiple stratified and multivariate predictive models. To the best of our knowledge, this is the first study utilizing the SEER cohort data to compare multiple models for predicting HIV-associated CTCL-specific mortality. This research introduced an innovative mortality prediction approach, employing novel ML algorithms to automatically integrate the best features of different modeling methods, providing a new perspective on prognostic assessment in this field.

ML has shown considerable potential in research on HIV-associated health issues, with broad applications ranging from prognosticating opportunistic infections to assessing complication risks. For instance, while predicting survival for patients with HIV-associated opportunistic infections, a study utilized ML models to stratify risk in patients with cryptococcosis, achieving good predictive performance (C-index: 0.78).¹⁰ Similarly, ML has been successfully employed to identify predictive factors for common comorbidities in individuals with HIV and construct risk models. A study on peripheral neuropathy demonstrated that ML-based approaches, such as RF, not only effectively predicted disease status (AUC > 0.80) but also identified key predictive variables missed by traditional statistical methods.¹¹ Regarding neurological complications, ML has been used to optimize the detection of HIV-associated neurocognitive disorders (HAND). For example, by integrating data from brief neuropsychological assessments, ML models can detect HAND with over 90% accuracy, providing a robust tool for early clinical screening.¹² Furthermore, ML models based on neuroimaging data have been used to investigate the impact of HIV infection on brain functional networks and predict cognitive status.¹³ Researchers have also utilized SVM to analyze plasma extracellular vesicle phenotype data, successfully predicting cognitive dysfunction and highlighting ML’s role in integrating complex biological data for disease state assessment.¹⁴ For HIV-associated malignancies, ML has been applied to conduct noninvasive prognostic evaluation. A study on HIV-associated lung adenocarcinoma accurately predicted the expression levels of the proliferation marker Ki-67 (test set AUC: 0.905) using radiomics features and ML models, such as SVM, demonstrating its value in oncology.¹⁵ Collectively, these studies indicate that ML provides robust analytical tools for understanding complex pathophysiological processes in the context of HIV infection, improving disease prediction and risk stratification as well as guiding personalized management strategies. However, research employing advanced ML methods for survival prediction in HIV-associated CTCL, a specific, rare, and prognostically unfavorable tumor, remains insufficient, underscoring the necessity of the present study.

Our study developed and validated several ML models for predicting the 5-year survival rate in HIV-associated CTCL patients, with ensemble methods showing notable predictive capabilities. The favorable performance of models such as XGBoost (AUC = 0.867) and LightGBM (AUC = 0.835) may stem from their capacity to capture complex nonlinear relationships and variable interactions often missed by traditional regression methods. Gradient boosting algorithms such as XGBoost are adept at handling heterogeneous clinical data and can form robust predictors by integrating multiple base learners¹⁶,¹⁷; their strength is also noted in predicting radiation dermatitis (AUC = 0.890)¹⁶ and cardiac surgery mortality (AUC = 0.9145).¹⁷ Our models incorporated established prognostic factors (age, stage, and chemotherapy) identified via LR, the prognostic value of which is well supported in lymphoma. The ML models’ data-driven variable weighting likely enhances predictive accuracy, and tree-based methods such as XGBoost can indicate relative feature importance, potentially identifying key prognostic drivers.

To the best of our knowledge, this study is an early effort to systematically compare multiple advanced ML algorithms for 5-year survival prediction in HIV-associated CTCL patients using SEER data. Yang et al.² previously developed nomograms for this cohort using SEER data, identifying key prognostic factors and reporting good model accuracy. Our work has expanded on this by applying and comparing a suite of ML algorithms. Our XGBoost model’s high AUC (0.867) suggests strong discriminatory power. Compared with other SEER-based ML studies on lymphoma, such as Wang et al.’s DeepSurv model for primary gastrointestinal lymphoma (C-index up to 0.760), our ensemble boosting methods also demonstrated good performance using standard clinicopathological variables.¹⁸ The versatility of XGBoost is further indicated by its successful application in other oncological predictions, including thyroid cancer risk stratification (AUC up to 0.886)¹⁹ and non-small cell lung cancer STAS prediction (AUC up to 0.927).²⁰ Our systematic comparison aimed to identify suitable models for HIV-associated CTCL. These findings suggest that ML approaches merit further investigation for refining prognostication and potentially guiding personalized management in this rare malignancy.

The inherent rarity of HIV-associated CTCL considerably constrains dataset sizes, a common challenge in developing robust prognostic models for such conditions.²¹ Despite these limitations, ML offers a promising avenue for prognostic prediction in rare diseases by leveraging available data to identify predictive patterns.²¹ Studies tackling other rare conditions, such as primary central nervous system lymphoma²² and diverse rare diseases using electronic health records data,²³ have demonstrated the application of advanced ML strategies to specifically address data insufficiency and still achieve meaningful predictive outcomes. This highlights a broader trend of successfully applying ML to extract valuable prognostic information even when data are scarce, a context highly relevant to our investigation of HIV-associated CTCL. Within this framework, our systematic comparison identified XGBoost as a particularly effective algorithm for this rare lymphoma subtype (AUC = 0.867). The utility of XGBoost in medical prediction, especially with potentially limited datasets, is supported by its reported success in other clinical scenarios, such as achieving higher AUCs in predicting small-for-gestational-age births²⁴ and diabetic nephropathy²⁵ compared with multiple algorithms. The ability of our XGBoost model to achieve this using only standard clinicopathological variables further underscores its practical value in rare disease research, where extensive biomarker data may be unavailable.²¹,²⁶

In summary, this study utilized clinical data from 381 HIV-associated CTCL patients in the SEER database to construct and compare seven ML models for predicting 5-year survival rates, with XGBoost demonstrating strong predictive performance using standard clinicopathological variables. A key strength lies in the comprehensive multi-algorithm evaluation. This study is, however, primarily limited by its retrospective SEER-based design and the lack of external validation. Future research could expand on this work by validating these models in independent, prospective cohorts and exploring the integration of additional biomarkers. Crucially, investigating the broader applicability of these ML methodologies to other rare diseases, where data scarcity poses significant challenges, represents an important avenue for advancement. Comparing the performance of our models in patients with lymphomas not associated with HIV could also yield valuable insights. Nevertheless, this research provides a potentially useful tool for prognostic assessment in HIV-associated CTCL patients, with implications for developing more precise, individualized treatment strategies and ultimately improving patient outcomes in this disease and potentially other rare oncological conditions.

Footnotes

Acknowledgments

Not applicable.

Author contributions

Conception and design: Zhang Yongmin and Huang Weimin; Administrative support: Chen Lingzhen; Provision of study materials or patients: Tian Manwen and Ai Jing; Collection and assembly of data: Jia Lanlan, Chen Junteng, and Gan Jinying; Data analysis and interpretation: Huang Weimin; Manuscript writing: Zhang Yongmin and Huang Weimin; and Final approval of manuscript: all authors.

Data availability statement

The datasets generated and analyzed during the current study are available from the corresponding author on reasonable request.

Declaration of generative AI and AI-assisted technologies in the writing process

The language was polished using GPT-4o to improve readability, and the content has been reviewed by all authors. All authors take full responsibility for the content of the publication.

Declaration of conflicting interests

The authors have no conflicts of interest to declare.

Funding

This project was supported by the Science and Technology Program of Guangzhou (Nos. 2023A04J0568 and 2024A03J0868) and the Health Science and Technology Project of Guangzhou (No. 20231A011047).

ORCID iD

Yongmin Zhang


Machine learning-based prognostic model for human immunodeficiency virus-associated cutaneous T-cell lymphoma: A Surveillance, Epidemiology, and End Results database analysis
