Abstract
Background:
There have been no reports on the application of the random survival forest (RSF) model to predict disease progression in HIV-associated B-cell lymphoma.
Methods:
A total of 44 patients with HIV-associated B-cell lymphoma who were referred to Nanjing Second Hospital from 2012 to 2019 were included. The RSF model was used to find predictors of survival, and the results of the RSF model were compared with those of the Cox model. The data were analyzed using R software (version 4.1.1).
Results:
One-, 2-, and 3-year survival rates were 74.5%, 57.7%, and 48.6%, respectively, and the median survival was 59.0 months. The 3 most important predictors of survival were lactate dehydrogenase (LDH), absolute monocyte count (AMC), and white blood cell (WBC) count. The median survival of high-risk patients was only 4.0 months. Areas under the curve (AUCs) of the RSF model remained above 0.90 at 1, 2, and 3 years. The RSF model displayed a lower prediction error rate (21.9%) than the Cox model (25.4%).
Conclusions:
Lactate dehydrogenase, AMC, and WBC count are the most important prognostic predictors for patients with HIV-associated B-cell lymphoma. Much larger prospective and/or multicentre studies are required to validate this RSF model.
Introduction
The prevalence of lymphoma in HIV-positive individuals is higher than in the general population.1,2 A cohort study showed that the probability of developing non-Hodgkin’s lymphoma (NHL) in people living with HIV (PLWH) remains as high as 4%, although it appears to have declined in the highly active antiretroviral therapy (HAART) era. 3 More than 95% of lymphomas are of B-cell origin, including diffuse large B-cell lymphoma (DLBCL), Burkitt lymphoma (BL), and plasmablastic lymphoma (PBL). 4 The incidence of BL has increased significantly and the incidence of DLBCL has decreased significantly in Europe and the United States.5-7 The advent of HAART has enhanced immune function and decreased the incidence of most NHL, though cases of BL have continued to grow. 8
Outcomes of HIV-associated lymphomas depend on patient-specific factors (age 9 and performance status), 10 lymphoma-specific factors (LDH 11 and stage), 12 combinations of patient-specific and lymphoma-specific factors (IPI, 13 revised IPI, and NCCN-IPI), 14 and HIV-specific factors (CD4 cell counts). 15 Highly active antiretroviral therapy has also had a significant impact on the outcome of HIV-associated lymphoma. 13 Considerable controversy remains regarding prognostic factors for HIV-associated lymphoma. Most studies are based on the multivariable Cox proportional hazard (CPH) model for survival analysis, which has been criticized for creating bias. 16 The random survival forest (RSF) model, a direct extension of the random forest (RF), is a relatively novel machine learning algorithm for survival analysis with high performance and interpretability. 17 It was proposed by Ishwaran et al 17 to automatically assess nonlinear effects and complex interactions among all variables, reducing variance and bias. Random survival forest can also be used for variable selection. It has been used in several studies and shown to outperform CPH.18-20 In addition, RSF does not have to satisfy strict assumptions like the Cox model and is therefore more applicable in clinical practice.21,22 Until now, there have been no reports on the application of the RSF model to predict disease progression in HIV-associated B-cell lymphoma. Therefore, the aim of our study was to develop and test a prognostic model based on the RSF algorithm to identify predictors of overall survival (OS) in HIV-associated B-cell lymphoma.
Material and Methods
Study population
A total of 44 patients with HIV-associated B-cell lymphoma who were referred to Nanjing Second Hospital between June 2012 and June 2019 were enrolled in this retrospective study. The diagnosis of HIV-associated lymphoma met the following criteria: (1) western blot-confirmed diagnosis of HIV infection from the Center for Disease Control and Prevention and (2) a confirmed histologic diagnosis of B-cell lymphoma. This study was approved by the Institutional Review Board of Nanjing Second Hospital, and the need for written informed consent was waived due to the retrospective nature of the research. The confidentiality of personal information was strictly maintained.
Variables
The following variables were included in the survival analysis: demographics, including age and sex; clinical variables, including Eastern Cooperative Oncology Group (ECOG) performance status, stage, IPI, B symptoms, extranodal sites, hepatitis B virus (HBV)/hepatitis C virus (HCV)/HIV history, and antiretroviral therapy history; and laboratory variables, including white blood cells (WBCs), the absolute lymphocyte count (ALC), the absolute monocyte count (AMC), hemoglobin (HB), the red cell distribution width (RDW), platelets (PLT), the platelet-lymphocyte ratio (PLR), the lymphocyte-monocyte ratio (LMR), albumin, globulin, the albumin-globulin ratio (AGR), the platelet-albumin ratio (PAR), lactate dehydrogenase (LDH), the LDH-lymphocyte ratio (LLR), the CD4 count, the CD8 count, and the CD4-CD8 ratio. This information was extracted from the retrieved patient files.
Follow-up
Follow-up was performed by reviewing outpatient and inpatient medical records as well as phone calls. Overall survival was calculated from the date of diagnosis to the time of last follow-up or death.
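The overall survival definition above (time from diagnosis to death, or to last follow-up for censored patients) is the input to the Kaplan-Meier estimator that underlies quantities such as 1-, 2-, and 3-year survival rates and median survival. The following is a minimal, stdlib-only Python sketch of that estimator, offered for illustration; the study itself was analyzed in R, and the function names and the toy cohort are assumptions rather than study code or study data.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct death time.

    times  -- months from diagnosis to death or last follow-up
    events -- 1 if the patient died, 0 if censored at last follow-up
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            surv *= 1.0 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= removed
    return curve


def survival_at(curve, t):
    """Step-function lookup: S(t) is the last estimate at or before t."""
    s = 1.0
    for time, surv in curve:
        if time > t:
            break
        s = surv
    return s


def median_survival(curve):
    """First time at which S(t) falls to 0.5 or below (None if never reached)."""
    for time, surv in curve:
        if surv <= 0.5:
            return time
    return None


# Hypothetical toy cohort (NOT study data): follow-up in months and death indicator.
km_times = [2, 4, 4, 9, 14, 20, 30, 36, 48, 60]
km_events = [1, 1, 0, 1, 0, 1, 0, 1, 0, 0]
km_curve = kaplan_meier(km_times, km_events)
one_year = survival_at(km_curve, 12)
median = median_survival(km_curve)
```

Censored patients leave the risk set without lowering the curve, which is why the date of last follow-up matters as much as the date of death.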
Statistical analysis
Random survival forest models were implemented to identify factors affecting survival using the “randomForestSRC” package, and each forest was grown with 1000 trees. An RSF grows each survival tree on a bootstrap sample of the data and, at each node, splits on one of a randomly selected subset of candidate variables. Variables were selected according to variable importance (VIMP) and minimal depth. Variable importance ranks variables by their predictive power. Minimal depth is a dimensionless order statistic that measures the predictiveness of a variable in a survival tree. The results of the RSF model were compared with those of the Cox model. All statistical analyses were performed using R software version 4.1.1. P < .05 was considered significant.
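Survival models of this kind are commonly scored with Harrell's concordance index (C), and random survival forest software typically reports its error rate as 1 − C. The following stdlib-only Python sketch shows how that quantity is computed for right-censored data. It is an illustration under stated assumptions, not the study's code: the function names and toy values are hypothetical, and tied follow-up times are simply skipped.

```python
def harrell_c(times, events, risk_scores):
    """Harrell's concordance index for right-censored data.

    A pair (i, j) is usable when the patient with the shorter follow-up
    actually died. It is concordant when that patient also has the higher
    predicted risk; ties in predicted risk count as half.
    """
    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i] == 1:
                usable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if usable == 0:
        raise ValueError("no usable pairs")
    return concordant / usable


def error_rate(times, events, risk_scores):
    """Prediction error expressed as 1 - C (0 = perfect, 0.5 = random)."""
    return 1.0 - harrell_c(times, events, risk_scores)


# Hypothetical toy data (NOT study data).
c_times = [5, 10, 15, 20, 25]
c_events = [1, 1, 0, 1, 0]
c_risk = [0.9, 0.4, 0.6, 0.3, 0.1]
c_value = harrell_c(c_times, c_events, c_risk)
c_error = error_rate(c_times, c_events, c_risk)
```

In this toy example 7 of the 8 usable pairs are ranked correctly, so C is 0.875 and the error rate is 0.125.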
Results
Patient characteristics
A total of 44 patients with HIV-associated B-cell lymphoma were included in our study, with a median age of 48 years. Most of them were male (39 cases, 88.6%). Regarding specific pathological type, DLBCL accounted for the vast majority (39 cases, 88.6%), followed by BL (3 cases, 6.8%) and other B-cell lymphomas (2 cases, 4.6%). Among all cases, 19 had a high CD4+ count (>200 cells/µl), and 29 had an elevated LDH level. At diagnosis, 35 patients (79.5%) were at Ann Arbor stage III-IV, and 7 patients (15.9%) had an IPI score of 4 to 5. Most patients (88.6%) received HAART at or after lymphoma diagnosis. Seventeen patients had a known history of HIV infection at lymphoma diagnosis. One-, 2-, and 3-year survival rates were 74.5%, 57.7%, and 48.6%, respectively, and the median survival was 59.0 months. Details of the patient characteristics are shown in Table 1.
Table 1. Baseline characteristics of 44 patients with HIV-related B-cell lymphoma.
BL: Burkitt lymphoma; DLBCL: diffuse large B-cell lymphoma; ECOG: Eastern Cooperative Oncology Group; LDH: lactate dehydrogenase; HAART: highly active antiretroviral therapy; HB: hemoglobin; HBV: hepatitis B virus; HCV: hepatitis C virus; HIV: human immunodeficiency virus; IPI: International Prognostic Index; NHL: non-Hodgkin lymphoma; PBL: plasmablastic lymphoma; PCNSL: primary central nervous system lymphoma; PLT: platelet; SD: standard deviation; TPPA: treponema pallidum particle agglutination test; WBC: white blood cells.
RSF model
Figure 1A shows a randomly chosen tree from the 1000-tree forest. From the RSF model with all variables, 11 were selected as predictive of survival according to minimal depth and VIMP ranking: LDH, AMC, WBC, PAR, LMR, ALC, LLR, globulin, PLT, the CD4 count, and the CD8 count (Figure 1B). The variable importance results were standardized, and 0.4 was set as the cutoff value for relative importance. The variables above this cutoff were sorted according to their relative importance. The 3 variables LDH, AMC, and the WBC count were the most important predictive variables (Figure 1C), and these 3 risk factors were selected to develop a simplified RSF model. The error rates for the simplified model are presented in Figure 1D, with an estimated c-statistic of 0.8964. Figure 2A shows that predicted survival decreased with increasing monocyte count and LDH level. Figure 2B indicates that the effect of the monocyte count on survival varied across the increasing LDH interval groups. The median survival of high-risk patients was only 5.0 months (Figure 3A). Areas under the curve (AUCs) of the RSF model remained above 0.90 at 1, 2, and 3 years (Figure 3B). In multivariable Cox analysis with all predictors, only IPI was found to be an independent significant predictor of survival. As shown in Figure 4, the RSF model displayed a lower prediction error rate (21.9%) than the Cox model (25.4%).
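The selection step described above (standardize the importance scores, then apply a 0.4 cutoff on relative importance) can be sketched in a few lines of stdlib-only Python. This is an illustration under assumptions: the text does not state how the scores were standardized, so division by the largest VIMP is assumed here, and the variable names and scores are hypothetical placeholders rather than the study's values.

```python
def select_by_relative_importance(vimp, cutoff=0.4):
    """Scale VIMP scores by the largest score and keep those at or above cutoff.

    Assumption: "standardized" is taken to mean division by the maximum VIMP;
    the source does not specify the exact scaling.
    """
    top = max(vimp.values())
    if top <= 0:
        raise ValueError("largest VIMP must be positive")
    relative = {name: score / top for name, score in vimp.items()}
    kept = [(name, rel) for name, rel in relative.items() if rel >= cutoff]
    # Most important variable first.
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical VIMP scores for placeholder variables (NOT the study's results).
example_vimp = {
    "marker_a": 0.050,
    "marker_b": 0.035,
    "marker_c": 0.024,
    "marker_d": 0.012,
    "marker_e": 0.004,
}
selected = select_by_relative_importance(example_vimp)
selected_names = [name for name, _ in selected]
```

With these placeholder scores, three variables have a relative importance of at least 0.4 and would be carried into a simplified model.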

Figure 1. (A) Illustration of a random tree from our 1000-tree forest. (B) Comparison of minimal depth and VIMP rankings. (C) The most important variables sorted according to their relative importance. (D) Error rates with the simplified model for the ensemble cumulative hazard function.

Figure 2. (A) Variable dependence of predicted survival at 1 and 3 years on LDH, AMC, and the WBC count. (B) Variable dependence coplot of survival at 1 year against AMC, conditional on LDH interval group membership.

Figure 3. (A) Kaplan-Meier curves for overall survival using the RSF model. (B) The receiver operating characteristic (ROC) curve shows the potential of the RSF model in predicting 1-, 2-, and 3-year overall survival.

Figure 4. Prediction error curves for the Cox model and the RSF model.
Discussion
In this retrospective study, we investigated predictors of survival for HIV-associated B-cell lymphoma. Whether HIV status impairs outcomes of DLBCL patients treated with standard immunochemotherapy in the HAART era remains controversial,23-25 which means that prognostic factors might differ between HIV-positive and HIV-negative patients. By RNA-seq, Fedoriw et al 26 identified a strong contribution of HIV status to DLBCL gene expression. To our knowledge, our study is the first to use the RSF model to identify prognostic factors in HIV-associated B-cell lymphoma patients. The RSF model can address relationships between variables over time. Using the RSF model, we found that LDH, AMC, and WBC were the most important predictors of outcome.
As the number of clinical and laboratory measurements increases, multidimensional and nonlinear characteristics limit the conditions under which Cox regression modeling can be applied. 27 The Cox model identifies only linear relationships between characteristics and survival outcomes. In addition, RSF can prevent over-fitting through random sampling. 28 In this study, a comparison of the prediction error rates of the RSF and Cox models showed that RSF had better predictive performance. We fitted and tested a prognostic model to predict OS of HIV-associated B-cell lymphoma patients with an AUC of 0.93, and this model was effective in providing a personalized survival prediction. For this work, we sampled with 1000 repetitions to train the RSF model, which is an advantageous approach when the sample size available for model development is limited. However, because the procedure is repeated many times, it is computationally expensive.
There is still considerable controversy regarding the prognostic role of these factors. Several studies have shown that elevated LDH is associated with inferior survival in patients with HIV-BL.29,30 However, it was not an independent risk factor for OS in HIV-DLBCL in one study, 28 consistent with the findings of Wu et al. 31 High LDH levels reportedly predicted worse OS in a series of 100 HIV-associated lymphoma cases, with DLBCL being the predominant histological subtype (66%). 32 The level of LDH has a significant impact on the response rate to chemotherapy in relapsed or refractory HIV-associated NHL. 33 Another finding of our study is that the monocyte count is an important predictor of survival in these patients. Indeed, the AMC may help in the identification of high-risk DLBCL patients.34-36 Large B-cell lymphoma recruits monocytes via CCL5 to support B-cell survival and proliferation. 37 Monocytes can also reduce expression of the CD20/rituximab complex on the B-cell surface. 38 Consistent with our findings, Shen et al 39 showed that the WBC count is an independent prognostic factor for DLBCL on the basis of the least absolute shrinkage and selection operator (LASSO) model and RF model. Yan et al 40 also demonstrated the prognostic value of WBCs in DLBCL.
There are some limitations in this study. First, due to the retrospective design, the quality of the findings needs to be confirmed. Second, due to the small sample size, we were unable to perform subgroup analyses according to some factors, such as pathological type and cell of origin. Third, molecular testing was not performed. Myc protein expression is significantly associated with inferior survival in HIV-DLBCL. 41 Finally, the model should be validated with an external cohort.
Conclusion
Our study showed LDH, AMC, and the WBC count to be the most important prognostic factors for HIV-associated B-cell lymphoma. Comparison with the Cox model also showed that the RSF model is an appropriate method to identify high-risk patients. These findings need to be confirmed by large-scale studies in the future.
Footnotes
Authors’ Contributions
JW and QZ conceived and designed the study. HZ and CZ performed the study, data collection, abstraction, and data entry; YL and YC were statistical advisers; HZ and FZ drafted the manuscript; JW and HZ were responsible for the overall direction of the text and discussion. JW and QZ had full access to all the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis. All authors read and approved the final manuscript.
Availability of Data and Material
The data generated and analyzed in this study have been included in this article. Further inquiries can be directed to the corresponding authors.
Declaration of conflicting interests:
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding:
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The work was partially supported by the Natural Science Foundation of Nanjing University of Chinese Medicine (grant nos. XRZ2021078 and XZR2021080), General Project of Medical Research Program of Jiangsu Province Health Commission (grant no. M2020055), Medical science and technology development project of Nanjing (YKK22130), Talent lift Fund projects of Nanjing Second Hospital (RCMS23006), Medical science and technology development key project of Nanjing (grant no. ZKX23037), Talent lift Fund key projects of Nanjing Second Hospital (grant no. RCZD202302), and Scientific research project of Jiangsu Provincial Health Commission (grant no. Z2023059). The funders had no role in the study.
Ethics Approval and Consent to Participate
This study was approved by the Institutional Review Board of Nanjing Second Hospital, and the need for written informed consent was waived due to the retrospective nature of the research (2022-LY-kt082).
