Abstract
Background:
Coexistent pulmonary tuberculosis and lung cancer (PTB-LC) is a rare type of disease with frequent under- and/or mis-diagnosis. Establishment of a reliable screening model for PTB-LC holds considerable medical and economic significance.
Objectives:
We aimed to develop an efficient and convenient tool to identify high-risk individuals for tuberculosis (TB) infection among LC patients based on commonly available parameters in clinical practice.
Design:
This study consisted of a primary retrospective patient cohort for model construction and verification, and a prospective patient cohort for prospective validation.
Methods:
Patients with active PTB-LC and LC diagnosed in Beijing Chest Hospital from 2018 to 2022 were collected and 1:1 matched according to time of admission and were classified into a training set (n = 281) and testing set (n = 121). Baseline information, clinicopathological features, imaging manifestations, and blood testing results were collected and analyzed. Five machine learning methods, including logistic regression (LR), random forest (RF), support vector machine (SVM), decision tree (DT), and neural network (NN), were employed to develop a screening model for PTB-LC.
Results:
Through multivariable analysis, gender, pleural effusion, cavitation, monocyte count (MONO), and serum adenosine deaminase (ADA) levels were identified as independent predictors of PTB-LC and included in model construction. LR, RF, SVM, DT, and NN were used to construct the screening or pre-diagnosis models. The RF model demonstrated the best performance, with an area under the curve of 0.966 in the training set, 0.817 in the testing set, and 0.805 in the prospective dataset. The accuracy, precision, recall, and F1 score of the RF model were 0.88, 0.87, 0.89, and 0.88, respectively, in the training set and 0.71, 0.75, 0.72, and 0.74, respectively, in the testing set, which were superior to those of the other methods. The prospective cohort further validated the good performance of the screening model. We also established a nomogram with gender, pleural effusion, cavitation, MONO, and serum ADA for assessing patients at high risk of TB infection. Further TB-related diagnostic tests were recommended for these high-risk patients.
Conclusion:
The RF screening model constructed with gender, pleural effusion, cavitation, MONO, and ADA may help distinguish patients at high risk of PTB-LC from those with LC alone.
Plain language summary
We used five machine learning methods to build a screening model, based on male gender, pleural effusion, cavitation, peripheral monocyte count, and serum ADA, that identifies lung cancer patients at high risk of tuberculosis infection.
Introduction
Lung cancer (LC) is the most common cancer worldwide, with approximately 2.5 million new cases reported in 2022, and is also the leading cause of cancer-related deaths, accounting for approximately 1.8 million deaths globally each year.1 Tobacco, environmental exposures such as benzene and aromatic hydrocarbons, chronic obstructive pulmonary disease, HIV infection, dietary habits, and genetic predisposition all contribute to the risk of developing LC.2–4 Pulmonary tuberculosis (PTB) is another major global health threat and has also been recognized as a significant risk factor for LC development.5,6 Coexistent PTB-LC is a rare and complex disease that is usually underestimated in clinical practice.7 Notably, our previous study reported an increasing incidence of PTB-LC in China over the past decade.8 Furthermore, PTB-LC is more aggressive, with more lymph node and distant metastases, than LC alone.8 Therefore, more awareness and attention are warranted for PTB-LC.
The accurate and timely diagnosis of PTB-LC remains a challenge for clinicians. Diagnosing PTB-LC is a complex procedure that combines patients’ clinical symptoms, imaging findings, and pathological and etiological examinations.9 Since patients with tuberculosis (TB) and patients with LC share similar clinical manifestations and radiological features, missed or delayed diagnosis is common, leading to untimely treatment and inferior outcomes for PTB-LC patients.10,11 In a single-center retrospective observational study, 41% of PTB-LC patients were misdiagnosed.12 The median diagnostic delay was more than 10 months.13 However, this does not mean that PTB-LC and LC alone are indistinguishable on imaging or clinical findings. Significant cough, pleural effusion, low body mass index, and advanced tumor stage were more commonly documented in PTB-LC than in LC alone.4,14,15 In addition, patients with PTB-LC present more lobules, burrs, bronchial obstruction, and stenosis on CT imaging than those with LC alone.12 Considering the large population of LC patients in China and the small proportion of PTB-LC among them, it is not cost-effective to conduct tuberculosis diagnostic tests on all LC patients. High-risk patients therefore need to be identified for further tuberculosis-related examinations. In the present study, we explored the differences in clinical symptoms, CT imaging, and laboratory testing between PTB-LC and LC, and we further established and validated a convenient and efficient screening tool for PTB-LC based on common clinical parameters from LC patients.
Materials and methods
Patient eligibility and data collection
The study was conducted in Beijing Chest Hospital, the authoritative institute for LC and TB in China. It consisted of one primary cohort used for model construction and verification and another independent cohort for prospective validation (Figure 1). In the primary cohort, we retrospectively enrolled patients who were diagnosed with PTB-LC between 2018 and 2022, and patients with LC were randomly 1:1 matched with PTB-LCs during the same period. Patients in the primary cohort were subsequently divided into the training set and the testing set with a ratio of 7:3. Patients who were diagnosed with PTB-LC from 2023 to 2024 in our institute were prospectively enrolled, and LC patients were randomly 10:1 matched with PTB-LCs according to the time of admission in the prospective validation cohort.
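The 7:3 division of the primary cohort can be illustrated with a minimal sketch. The function name `split_cohort`, the fixed seed, and the use of integer patient identifiers are assumptions made for illustration; the study does not state how the random split was seeded or whether it was stratified by diagnosis.

```python
import random


def split_cohort(patient_ids, train_fraction=0.7, seed=42):
    """Shuffle patient IDs reproducibly and split them into training and testing sets.

    The seed and the unstratified shuffle are illustrative assumptions.
    """
    ids = list(patient_ids)
    rng = random.Random(seed)  # local generator so the global random state is untouched
    rng.shuffle(ids)
    n_train = round(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]


# 402 hypothetical patient identifiers, matching the size of the primary cohort
train_ids, test_ids = split_cohort(range(402))
print(len(train_ids), len(test_ids))
```

With 402 patients, a 0.7 fraction yields set sizes of 281 and 121, the same sizes as the training and testing sets reported in this study.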

Figure 1. Schematic diagram of the research procedure.
Inclusion criteria were as follows: (1) patients with definite diagnosis of active PTB, confirmed by acid-fast bacillus culture, molecular detection of Mycobacterium tuberculosis using sputum, bronchoalveolar lavage fluid, biopsy, or surgical specimens; (2) patients with histologically or pathologically confirmed LC; and (3) patients with complete pathological, clinical, and radiological information. Exclusion criteria were as follows: (1) LC patients who could not be matched to enrolled PTB-LC based on the time of admission; (2) patients with undiagnosed or doubtful diagnosis of PTB and LC; (3) patients who had a previous history of PTB but recovered from anti-TB treatment; and (4) patients who were diagnosed with extra-pulmonary TB except pleural TB. Eligible patients with PTB-LC and LC-alone were enrolled as the primary study cohort.
Patients’ information including gender, age, smoking history, clinical symptoms, comorbidities, CT imaging manifestation, location, stage, pathological type of LC, complete blood count test results, blood biochemical test results, and LC-related tumor markers (Carcinoembryonic antigen (CEA), Neuron-specific enolase (NSE), Progastrin releasing peptide (pro-GRP), squamous cell carcinoma (SCC), and cytokeratin 19 fragment (Cyfra21-1)) was collected from each patient and used for further analysis.
Model construction and validation
Univariate analysis was used for the identification of PTB-LC-associated indicators, and those with p value <0.05 were included in multivariate regression for independent risk factors. The study employed five machine learning algorithms to develop the early diagnosis model for PTB-LC, including logistic regression (LR), decision tree (DT), support vector machine (SVM), random forest (RF), and neural network (NN). The models were constructed as follows: the LR model utilized the “glm” engine; the DT model used the “rpart” engine; the SVM model employed the “svm” engine; the RF model used the “randomForest” engine; and the NN model was built with the “nnet” engine. Model performance was evaluated based on accuracy, precision, recall, F1 score, and area under the curve (AUC) in both the primary cohort and the prospective validation cohort.
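The four threshold-based performance measures named above follow directly from a confusion matrix. The sketch below shows the standard definitions; the function name `classification_metrics` and the example counts are hypothetical and are not taken from the study data.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, precision, recall, and F1 score from confusion-matrix counts.

    Assumes at least one predicted positive and one true positive case,
    so that no denominator is zero.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)   # share of predicted PTB-LC cases that are true PTB-LC
    recall = tp / (tp + fn)      # share of true PTB-LC cases that are detected
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, for illustration only
metrics = classification_metrics(tp=40, fp=10, tn=50, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because the F1 score is the harmonic mean of precision and recall, it falls between the two and sits closer to the lower one.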
Statistical analysis
Categorical variables were compared using chi-square tests or Fisher’s exact test. We used mean ± standard deviation for the description of continuous variables. The comparison of normally distributed continuous data was conducted using the Independent Student’s t test, whereas the Mann–Whitney U test (Wilcoxon rank-sum test) was applied when the normality assumption was violated. Binary LR was used to analyze the risk factors between the PTB-LC and LC groups. A p-value of <0.05 was considered statistically significant. Statistical analyses were performed using SPSS v21 (IBM Inc., Armonk, NY, USA) and R (version 4.4.0, R Core Team, Statistical Computing, Vienna, Austria).
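As an illustration of the categorical comparison, the Pearson chi-square statistic for a 2 × 2 table can be computed as in the following sketch. The function name `chi_square_2x2` and the counts are hypothetical, and no continuity correction is applied; whether the reported p values used a correction is not stated in the text.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for the table [[a, b], [c, d]].

    Returns the statistic and the two-sided p value for 1 degree of freedom.
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, the chi-square survival function equals erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: rows are two patient groups, columns are feature present / absent
chi2, p_value = chi_square_2x2(30, 70, 15, 85)
print(f"chi2 = {chi2:.3f}, p = {p_value:.4f}")
```

When any expected cell count is small, Fisher's exact test is the usual alternative, as described above.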
Results
Patient characteristics
In this study, 402 eligible patients enrolled from 2018 to 2022 were included in the primary cohort, including 148 PTB-LC cases and 133 LC cases in the training set, and 53 PTB-LC cases and 68 LC cases in the testing set. In addition, 19 cases with PTB-LC and 178 cases with LC alone were included in the prospective validation cohort from 2023 to 2024.
Baseline characteristics were well balanced between patients in the training set and the testing set of the primary cohort (Table 1). Notably, patients with PTB-LC presented certain distinguishing features compared with those with LC alone. Generally, PTB-LC patients were older than those with LC alone. Males, especially those with a smoking history, accounted for a higher proportion in the PTB-LC group than in the LC-alone group (p < 0.001). In addition, patients with PTB-LC exhibited more complicated clinical manifestations, including cough, expectoration, hemoptysis, fever, fatigue, chest pain, and dyspnea, than those with LC alone (Table S1). In terms of chest imaging manifestations, consolidation, bronchiectasis, burr sign, pleural effusion, cavitation, interstitial lesions, calcification, and tree-bud sign were more commonly observed in PTB-LC. Meanwhile, PTB-LC presented with more advanced tumor T stage than LC alone (p = 0.088 in the training set and p = 0.044 in the testing set; Table 1). Adenocarcinoma (ADC) was the most common pathological type in the primary cohort; however, PTB-LC was associated with a higher frequency of SCC than LC alone (37.2% vs 14.3%, p < 0.001 in the training set, and 43.4% vs 17.6%, p = 0.006 in the testing set).
Table 1. Baseline information for the training set and testing set.
p-Value for the significance between PTB-LC and LC in the training set.
p-Value for the significance between PTB-LC and LC in the testing set.
p-Value for the significance between the training set and the testing set.
ADC, adenocarcinoma; LC, Lung cancer; PTB, pulmonary tuberculosis; SCC, squamous cell carcinoma.
Construction and verification of the screening model of PTB-LC
Through univariable analysis, 24 parameters were identified to be associated with PTB-LC and included in multivariate regression (Table S1). Finally, gender, pleural effusion, cavitation, peripheral monocyte count (MONO), and serum adenosine deaminase (ADA) level were identified as independent indicators of PTB-LC via multivariate regression, which were used for model construction (Table 2).
Table 2. Multivariate logistic regression of PTB-LC in the primary patient cohort.
ADA, adenosine deaminase; ALB, serum albumin; CI, confidence interval; CV, Continuous variable; Cyfra21-1, cytokeratin 19 fragment; HGB, hemoglobin; hs-CRP, high-sensitivity C-reactive protein; LC, lung cancer; LY, lymphocyte; MONO, monocyte; NEUT, neutrophil; OR, odds ratio; PTB, pulmonary tuberculosis; SCC, squamous cell carcinoma; TP, serum total protein.
Using these five parameters, we developed five different machine learning models. The results demonstrate that the RF model, developed using the training set, significantly outperformed the other algorithms, achieving notable metrics with accuracy of 0.88, precision of 0.87, recall of 0.89, F1 score of 0.88, and AUC of 0.966 (Table 3, Figure 2(a)). The RF model also performed well when validated with the testing set, with an accuracy of 0.71, precision of 0.75, recall of 0.72, F1 score of 0.74, and AUC of 0.817 (Table 3). In addition, we compared the decision curve analysis (DCA) curves of the five models, and the results showed that the DCA performance of the RF model was significantly better than the other four models (Figure 2(b)). For the constructed RF model, the optimal cutoff value is 0.405. At this threshold, the model maintains high sensitivity (0.946) while also ensuring relatively high specificity (0.865).
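A probability cutoff such as the one reported above is commonly chosen by maximizing Youden's J statistic (sensitivity + specificity − 1). The text does not state which criterion was used, so the sketch below is an assumption about the general approach rather than a description of the study's method; the function names and the toy scores are hypothetical.

```python
def sensitivity_specificity(scores, labels, cutoff):
    """Classify score >= cutoff as positive and return (sensitivity, specificity)."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cutoff)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < cutoff)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < cutoff)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


def best_cutoff_youden(scores, labels):
    """Scan every observed score as a candidate cutoff and keep the one with the highest J."""
    best = None
    for cutoff in sorted(set(scores)):
        sens, spec = sensitivity_specificity(scores, labels, cutoff)
        j = sens + spec - 1
        if best is None or j > best[0]:
            best = (j, cutoff, sens, spec)
    return best[1], best[2], best[3]


# Toy predicted probabilities and true labels (1 = PTB-LC, 0 = LC alone), illustration only
scores = [0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60, 0.70, 0.90]
labels = [0, 0, 0, 0, 1, 0, 1, 1, 1]
cutoff, sens, spec = best_cutoff_youden(scores, labels)
print(cutoff, sens, spec)
```

For a screening tool, a cutoff that favors sensitivity over specificity is often preferred, since missed PTB-LC cases are costlier than additional confirmatory TB tests.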
Table 3. Performance comparison of models based on five machine learning algorithms.
AUC, area under the curve; DT, decision trees; LR, logistic regression; NN, neural networks; RF, random forest; SVM, support vector machine.

Figure 2. Receiver operating characteristic (ROC) and decision curve analysis (DCA) curves for the five predictive models. (a) ROC curves for the five models. (b) DCA curves for the five models.
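Decision curve analysis compares models by their net benefit across threshold probabilities. The sketch below uses the standard net benefit definition, TP/n − (FP/n) × pt/(1 − pt), together with the "treat all" reference strategy; the function names and the toy data are hypothetical and do not reproduce the curves in this study.

```python
def net_benefit(scores, labels, threshold):
    """Net benefit of testing every patient whose predicted probability is >= threshold."""
    n = len(labels)
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability
    return tp / n - fp / n * threshold / (1 - threshold)


def net_benefit_treat_all(labels, threshold):
    """Net benefit of the reference strategy in which every patient is tested."""
    prevalence = sum(labels) / len(labels)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Toy predicted probabilities and true labels (1 = PTB-LC, 0 = LC alone), illustration only
scores = [0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60, 0.70, 0.90]
labels = [0, 0, 0, 0, 1, 0, 1, 1, 1]
nb_model = net_benefit(scores, labels, threshold=0.5)
nb_all = net_benefit_treat_all(labels, threshold=0.5)
print(f"model: {nb_model:.3f}, treat all: {nb_all:.3f}")
```

A model is considered clinically useful at a given threshold when its net benefit exceeds both the "treat all" value and zero (the "treat none" strategy).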
The AUC values for the precision-recall (PR) curves of the RF model were 0.969 for the training set and 0.817 for the testing set (Table 3, Figure 3(a) and (b)). In addition, the AUC of the prospective cohort was 0.805 (Figure 3(c)). Analysis of feature importance revealed that cavitation was the most influential factor, followed by pleural effusion and gender. MONO and ADA levels were also important, ranking fourth and fifth, respectively (Figure 3(d)). In addition, to facilitate the clinical application of the model, we constructed a nomogram incorporating gender, pleural effusion, cavitation, ADA, and MONO (Figure 3(e)).
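The ROC AUC reported for the prospective cohort has a simple interpretation: it is the probability that a randomly chosen PTB-LC patient receives a higher predicted score than a randomly chosen LC-alone patient. A minimal sketch of this pairwise (Mann–Whitney) formulation follows; the function name `roc_auc` and the toy data are hypothetical. Note that this is the area under the ROC curve, not under the PR curve.

```python
def roc_auc(scores, labels):
    """ROC AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy predicted probabilities and true labels (1 = PTB-LC, 0 = LC alone), illustration only
scores = [0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60, 0.70, 0.90]
labels = [0, 0, 0, 0, 1, 0, 1, 1, 1]
auc = roc_auc(scores, labels)
print(f"AUC = {auc:.2f}")
```

Unlike accuracy, this measure does not depend on the chosen cutoff or on the case-to-control ratio, which is relevant when comparing the 1:1 matched primary cohort with the prospective cohort.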

Figure 3. Validations of the RF model. (a) PR curve of the RF model for the training set. (b) PR curve of the RF model for the testing set. (c) ROC curve of the RF model for the prospective validation cohort. (d) Weight distribution of the RF model components. (e) Diagnostic nomogram for PTB-LC.
Discussion
The mutual relationship between PTB and LC is now well recognized.15–19 Coexistent PTB-LC is a distinct disease entity, and its accurate and timely diagnosis remains a challenge for clinicians. Both LC and PTB are serious respiratory diseases with overlapping clinical symptoms and CT imaging features, leading to delayed or missed diagnosis and inferior prognosis.10,11 Given the rarity of PTB-LC among LC patients, who constitute a very large population in China, it is not cost-effective to conduct TB-related tests for every LC patient in clinical practice. Therefore, in this study we analyzed possible indicators of PTB-LC based on commonly available clinical, pathological, radiological, and blood testing results. Gender, pleural effusion, cavitation, MONO, and ADA levels were shown to be independent risk factors for PTB-LC. Accordingly, a simplified RF model was developed for the screening of PTB-LC, and it proved to be a reliable screening or pre-diagnostic tool in internal and prospective validations. Moreover, the five variables included in the model are easily accessible and practical to use.
According to our previous study, PTB-LC, like LC, has been rapidly increasing in China over the past decade.8 Compared with the general population, patients with LC had a higher risk of developing TB infection (hazard ratio = 25.21, 95% confidence interval (CI): 21.54–29.89).8 In addition, PTB-LC predominantly occurs in older male patients in comparison to LC alone (median age, 63.61 ± 10.46 vs 61.08 ± 10.77, p < 0.001; male to female ratio, 2.82 vs 1.59, p = 0.044).8 Consistently, our study also showed that PTB-LC was associated with older age, male sex, SCC, and mediastinal lymph node invasion. In addition, pleural effusion and cavitation on CT imaging, peripheral monocyte count, and serum ADA were identified as independent risk factors for PTB-LC in the present study.
Monocytes are the predominant innate immune cells at the early stage of Mycobacterium tuberculosis (MTB) infection, acting as the host defense against intracellular pathogens.20 The monocyte count has also been reported as a negative predictor of prognosis in LC patients.21,22 In a study of 181 patients with active PTB, the monocyte count was significantly lower in cured patients than in non-cured patients; moreover, it was identified as an independent immune-related risk factor for prognosis (odds ratio = 7.881, 95% CI: 1.675–37.075, p = 0.009) with a cutoff value of 0.535 × 10^9/L.23 Monocytes contribute to the inflammatory process through their differentiation into macrophages or dendritic cells in the tissue microenvironment,24 so the peripheral blood monocyte count may help predict TB infection. The peripheral monocyte count was significantly lower in patients with LC alone than in those with coexistent PTB-LC, suggesting its potential in distinguishing PTB-LC from LC alone.
It is well established that ADA in pleural fluid (with a cutoff of 40 U/L) performs well in the detection of pleural TB, with sensitivity and specificity values above 86%.25,26 The value of serum ADA levels in diagnosing PTB has also been investigated. One study demonstrated that patients with tuberculous lymphadenitis had significantly higher serum ADA than those with persistent reactive non-tuberculous lymphadenitis.27 Moreover, serum ADA activity, along with CCL1, CXCL10, and VEGF, was reported to provide a promising tool for differentiating patients with active TB from individuals with latent TB infection.28 Salmanzadeh et al.29 reported that the mean serum ADA level in PTB patients (26.0 IU/L) was significantly higher than that in patients with pneumonia (19.5 IU/L), patients with LC (15.8 IU/L), and healthy controls (10.7 IU/L, p < 0.05). However, the sensitivity and specificity of ADA in patients with PTB were 35% and 91%, respectively.29 Our study further demonstrated serum ADA as a potential noninvasive biomarker for differentiating patients with active PTB-LC from those with LC alone. However, further studies are warranted to investigate its exact value and underlying mechanisms.
The application of machine learning algorithms offers substantial value in identification and prognosis prediction for LC patients.30–32 Yang et al.33 established an RF method integrating CT imaging-based radiomics and clinicopathological characteristics, which satisfactorily predicted the survival benefit of LC patients from immune checkpoint inhibitors. Dong et al.34 constructed an auxiliary scoring model for myelosuppression in LC patients receiving chemotherapy based on an RF algorithm, and the AUCs of the model in the training and validation sets were 0.878 and 0.885, respectively (p < 0.05). In the present study, among the five machine learning algorithms, the RF model outperformed the others in terms of performance indicators. In addition, we included a prospective patient cohort for further validation of the RF model. Taken together, the RF model built upon gender, pleural effusion, cavitation, monocyte count, and serum ADA showed satisfactory accuracy in predicting patients at high risk of PTB-LC, and we recommend further TB-related diagnostic tests for these high-risk patients.
Limitation
Our study has several limitations. First, potential selection bias was inevitable due to the retrospective nature of the primary cohort. Second, we included common pathological, clinical, and radiological information in LR and model construction but not interferon-γ release assays, a widely used laboratory test for previous or current TB infection. Third, future research integrating multi-omics data, including radiomics, is a promising route to more reliable and valuable models. Finally, external validation at another institute is warranted.
Conclusion
In conclusion, the RF screening model constructed with gender, pleural effusion, cavitation, monocyte count, and serum ADA may help distinguish patients at high risk of PTB-LC from those with LC alone. The application of this convenient screening model might facilitate early diagnosis and improve the prognosis of PTB-LC patients.
Supplemental Material
sj-docx-1-tam-10.1177_17588359251355058 – Supplemental material for Establishment and validation of a convenient and efficient screening tool for active pulmonary tuberculosis in lung cancer patients based on common parameters
Supplemental material, sj-docx-1-tam-10.1177_17588359251355058 for Establishment and validation of a convenient and efficient screening tool for active pulmonary tuberculosis in lung cancer patients based on common parameters by Fan Zhang, Fei Qi, Mengyan Sun, Peng Jiang, Minghang Zhang, Xiaomi Li, Yujie Dong, Juan Du, Liang Li and Tongmei Zhang in Therapeutic Advances in Medical Oncology
