Sage Journals: Discover world-class research

Abstract

Objective

Aiming at the problems of the long incubation period, insufficient early diagnosis, and lack of treatment methods of coal workers’ pneumoconiosis (CWP), the objective of this study is to accurately predict the CWP staging based on machine learning (ML) methods and small-sample clinical data.

Methods

The study included a comparative analysis of clinical data from 202 healthy individuals and 81 CWP patients at the General Hospital of Xuzhou Mining Group. First, various oversampling techniques were employed to address the issue of data imbalance. Subsequently, multiple ML methods were adopted for supervised learning and prediction of CWP staging. Then, an innovative feature selection method was proposed, integrating the importance and independence of clinical features to achieve high-precision predictions of CWP with a limited number of indicators.

Results

The study identified ALB, PLT, and WBC as significant predictive factors for CWP through the Random Forest importance assessment method. Furthermore, in terms of integrated feature selection, when the weight ratio of feature importance to independence was 7:3 or 6:4, all ML models showed optimal performance, with the Random Forest (RF)-Adaboost model demonstrating the best predictive accuracy for CWP, reaching an F1 score of 0.8757.

Conclusions

The integration of clinical biochemical examination data with ML models, especially the RF-Adaboost and support vector machine-particle swarm optimization models, effectively predicted the staging of CWP. The proposed integrated feature selection method, which considered both the importance and independence of features, significantly enhanced model performance, providing a valuable tool for early screening and diagnosis of CWP.

Keywords

Coal workers’ pneumoconiosis; clinical data; machine learning; feature selection

Introduction

Pneumoconiosis, the most common occupational disease worldwide, has long been a focus of attention in the field of public health.¹ Because the composition of inhaled dust varies, pneumoconiosis has many classifications. Coal workers’ pneumoconiosis (CWP) is an occupational disease caused by long-term inhalation of coal dust particles that are retained in the lungs, and it is characterized primarily by pulmonary fibrosis.² The disease has a certain latency, a high mortality rate, and causes irreversible damage to the lungs.³ To date, there is no effective treatment for CWP, so early screening and diagnosis are crucial for preventing further deterioration of the patient's condition.⁴ According to the occupational pneumoconiosis diagnostic guidelines proposed by the International Labour Organization, high-kV chest X-ray is the gold standard for the diagnosis of pneumoconiosis.⁵ Currently, the diagnosis of CWP relies predominantly on assessment by professional doctors, who classify and stage the condition based on the profusion and distribution of opacities observed on chest X-ray films. The staging of CWP helps doctors understand the patient's condition and guides them in proposing suitable treatment and prevention strategies. Generally, according to the severity observed on film reading, CWP patients are divided into stage I, stage II, and stage III.⁶ However, the diagnostic and staging process is susceptible to subjective factors such as the experience of doctors, potentially resulting in misdiagnosis.⁷ In recent years, computer-aided detection technology has been widely used to improve the accuracy of CWP diagnosis by clinical doctors,^8–11 but this auxiliary diagnosis scheme requires doctors to invest significant time in manually defining features,¹² which makes it unsuitable for complex pneumoconiosis staging tasks and hinders the early screening of CWP for frontline workers.

In addition to chest X-ray, integrating metabolomics into CWP prediction studies provides new opportunities for identifying biomarkers associated with CWP. Metabolomics, as a rapid and precise diagnostic method, holds significant potential for the early detection of CWP.^13,14 Among the biomarkers, serum levels of cytokines such as TGF-β, MCP-1,¹⁵ interleukin-8, intercellular adhesion molecule-1,¹⁶ IL-13, IL-18R, matrix metalloproteinase-9 (MMP-9), and matrix metalloproteinase inhibitor-9¹⁷ have been confirmed to be closely related to the disease progression of CWP patients. Meanwhile, serum lipid metabolites such as phosphatidylethanolamine,¹⁸ phosphatidylcholine, and lysophosphatidylcholine (lysopcs)¹⁹ also play a crucial role in guiding clinical diagnoses of pneumoconiosis. In recent years, the application of machine learning (ML) technology in the biomedical field has been increasing, demonstrating tremendous potential in improving the accuracy of disease diagnosis, the formulation of personalized treatment plans, and the development of new drugs.²⁰ The combination of ML and metabolomics enables the processing and analysis of high-throughput metabolomic data, thereby more rapidly and accurately identifying and screening biomarkers associated with specific diseases. In order to find biomarkers for CWP, Chen et al.¹⁹ combined metabolomics technology with ML methods to screen biomarkers in the serum of CWP patients. After adjusting for age, smoking, drinking, and other factors, three ML methods were used to screen out the potential biomarker of CWP as propylparaben from 68 different metabolites. Chen et al.²¹ combined lipidomics with ML methods to detect the serum of CWP patients and found that lipid metabolites represented by AHEXCER, DG, and DMPE may be good biomarkers of CWP. 
To further understand the application of clinical indicators in pneumoconiosis diagnosis, Dong et al.²² used embedded methods to select among 62 clinical indicators and predicted early CWP by combining three feature selection methods with five ML models. Their results showed that AaDO₂ and some lung indexes were critical for predicting early CWP, and that the support vector machine (SVM) algorithm was the best-performing ML model. Compared with chest X-ray, detecting changes in serum biomarker levels is more convenient. However, metabolomics-based biomarker detection is costly, which makes it poorly suited to the routine screening of CWP in mine workers.

Routine biochemical tests are widely used in clinical practice because of their low cost, easy access, and high accuracy.²³ Studies have shown that combining these tests with ML can provide important guidance for disease prediction.²⁴ For example, Cai et al.²⁵ systematically evaluated the value of routine laboratory tests, including blood routine tests, coagulation tests, and urine tests, in the diagnosis of ovarian cancer, and combined with ML to develop an AI-assisted diagnostic model based on multiple test indicators. However, in the staging prediction of CWP, ML is mostly based on chest X-ray and metabolomics, while there are few studies based on routine biochemical examination. To enable an earlier, more accurate, and cost-effective assessment of the health status of patients with CWP, this article proposed an innovative method for evaluating the stage of CWP. This approach combined routine biochemical tests (blood routine, liver function, renal function, blood glucose, blood lipid) with ML algorithms to develop more effective treatment and prevention strategies. Additionally, to comprehensively consider the multifaceted effects of clinical data and improve the accuracy of evaluation, a multi-metric integrated feature selection method was proposed. By comparing the performance of different ML algorithms, key features that had a significant impact on CWP staging were identified, and the most effective CWP staging prediction model was found to achieve early prediction of CWP.

Our main contributions are summarized as follows. First, we addressed the challenge of unbalanced data distribution inherent in the clinical dataset through dedicated data preprocessing techniques. Second, we conducted dimensionality reduction using various feature selection methods (importance-based, correlation-based, and the integrated approach we proposed) to identify the most discriminative variables for CWP staging. Third, we determined the optimal ML model configuration for prediction.

Methods

Patient source and clinical data collection

The study was approved by the Medical Ethics Committee ([2022]-101501) and selected 81 patients with pneumoconiosis hospitalized in Xuzhou Mining Group General Hospital from 28 June 2022 to 13 July 2023, as well as 202 healthy individuals who underwent physical examinations at the hospital from 12 January 2024 to 24 May 2024. The inclusion criteria were as follows: 1) age between 40 and 90 years; 2) all male; 3) the examination report included common demographic and routine biochemical indicators, such as blood lipid tests including triglycerides, total cholesterol, high-density lipoprotein cholesterol, low-density lipoprotein cholesterol, and very low-density lipoprotein cholesterol; 4) those who provided informed consent for the study. By combining the intersection of routine biochemical examination indicators from healthy individuals and clinical test indicators from pneumoconiosis patients, the most common 22 typical biochemical examination indicators were selected as sample data, which are detailed in Table 1.

Table 1.

Indicators of routine biochemical examination.

Feature name	Abbreviation
White Blood Cells	WBC
Red Blood Cells	RBC
Hemoglobin	HB
Platelets	PLT
Absolute Neutrophil Count	ANC
Absolute Lymphocyte Count	ALC
Absolute Monocyte Count	AMC
Absolute Eosinophil Count	AEC
Albumin	ALB
Globulin	GLB
Albumin/Globulin Ratio	A/G
Alanine Aminotransferase	ALT
Aspartate Aminotransferase	AST
Blood Urea Nitrogen	Urea
Creatinine	Cr
Uric Acid	UA
Glucose	GLU
Triglycerides	TG
Total Cholesterol	CHOL
High-Density Lipoprotein Cholesterol	HDL
Low-Density Lipoprotein Cholesterol	LDL
Very Low-Density Lipoprotein Cholesterol	VLDL

Data processing

In the diagnosis and prediction of diseases, there is often a problem of category imbalance between the healthy and the affected population.²⁶ This imbalance can lead to classification models misclassifying the minority class (diseased individuals) as the majority class (healthy individuals), which affects the accuracy of the model and may cause patients to miss the best treatment opportunities. Based on the clinical data we collected, the proportion of CWP patients at stages II and III was only 7.4% and 2.1%, respectively. This indicated a significant imbalance in the data distribution, with the majority of patients concentrated in stage I. To address this challenge and ensure robust model performance on minority classes, we employed specific data handling and evaluation strategies.

Given the limited size of the dataset (original distribution: Phase 0: 202, Phase I: 54, Phase II: 21, Phase III: 6, total 283 cases), rigorous data processing and model evaluation strategies were employed. Initially, because the extremely small number of Phase III cases posed a significant challenge for reliable classification, CWP Phase III data were merged into the CWP Phase II category, resulting in a revised class distribution (Phase 0: 202, Phase I: 54, Phase II: 27). Despite this merger, the dataset still presented a considerable class imbalance. Therefore, to ensure robust model evaluation on this limited and imbalanced data, and crucially, to prevent data leakage during resampling, we adopted a stratified 5-fold cross-validation approach. In each fold, the dataset was split by stratified sampling into a training set and a test set, so that both splits preserved class proportions similar to those of the original data. To mitigate the class imbalance within the training set of each fold, resampling techniques were applied, broadly categorized into undersampling and oversampling. Undersampling methods reduce the number of majority class samples, while oversampling methods increase the number of minority class samples. Common oversampling methods include the synthetic minority oversampling technique (SMOTE) and adaptive synthesis (ADASYN).²⁷ SMOTE generates new sample points by interpolating between a minority class sample and its nearest minority class neighbors, while ADASYN generates more new samples in regions of the feature space where minority class data are sparse, and fewer new samples in regions where they are dense. In this study, we employed a combined resampling strategy: first, random undersampling was applied to the majority class (Phase 0) in the training set to reduce its sample count. Subsequently, one of the oversampling techniques (either SMOTE or ADASYN) was applied to the undersampled training set to balance the representation of classes 0, 1, and 2. This entire resampling process, including both undersampling and oversampling, was strictly confined to the training data of each fold. The final reported metrics represent the average performance across the 5 folds, including specific evaluation measures (Precision, Recall, F1-score) for each individual class, in addition to macroaveraged and overall metrics.
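The fold-confined resampling described above can be illustrated with a minimal sketch. It uses only the Python standard library and synthetic two-feature data with the class counts reported in the text; the helper names (`stratified_folds`, `undersample`, `smote_like`), the undersampling target of 100, and the neighbourhood size `k=3` are assumptions introduced for illustration and are not taken from the study. The interpolation step is a simplified SMOTE-style procedure, not the implementation the authors used.

```python
import random
from collections import Counter, defaultdict

def stratified_folds(labels, k=5, seed=0):
    """Deal the indices of each class round-robin into k folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for pos, idx in enumerate(idxs):
            folds[pos % k].append(idx)
    return folds

def undersample(X, y, majority, keep, rng):
    """Randomly keep at most `keep` samples of the majority class."""
    maj = [i for i, c in enumerate(y) if c == majority]
    rest = [i for i, c in enumerate(y) if c != majority]
    chosen = rng.sample(maj, min(keep, len(maj))) + rest
    return [X[i] for i in chosen], [y[i] for i in chosen]

def smote_like(X, y, rng, k=3):
    """Grow every class to the largest class size by interpolating
    between a sample and one of its k nearest same-class neighbours."""
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for cls, n in counts.items():
        pts = [x for x, c in zip(X, y) if c == cls]
        for _ in range(target - n):
            base = rng.choice(pts)
            others = [p for p in pts if p is not base]
            others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
            neighbour = rng.choice(others[:k])
            gap = rng.random()
            X_out.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
            y_out.append(cls)
    return X_out, y_out

# Synthetic stand-in data with the class counts reported in the text.
rng = random.Random(42)
labels = [0] * 202 + [1] * 54 + [2] * 27
X = [[rng.gauss(c, 1.0), rng.gauss(-c, 1.0)] for c in labels]

fold_summaries = []
for test_idx in stratified_folds(labels, k=5):
    held_out = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    X_tr = [X[i] for i in train_idx]
    y_tr = [labels[i] for i in train_idx]
    # Resampling touches only the training part of the fold;
    # the test part keeps the original class proportions.
    X_tr, y_tr = undersample(X_tr, y_tr, majority=0, keep=100, rng=rng)
    X_tr, y_tr = smote_like(X_tr, y_tr, rng)
    train_counts = dict(Counter(y_tr))
    test_counts = dict(Counter(labels[i] for i in test_idx))
    fold_summaries.append((train_counts, test_counts))
```

Because the test indices are removed before any resampling, no synthetic point derived from a test sample can enter training, which is the leakage the text warns against.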

Machine learning method

As a major branch of artificial intelligence, ML is usually divided into two categories: supervised learning and unsupervised learning.²⁸ Since this study was based on labeled clinical data to build a pneumoconiosis staging prediction model, Random Forest (RF), SVM, K-Nearest Neighbor (KNN), and Decision Tree (DT) were selected as the basic models.

Random Forest, renowned for its robustness and accuracy, is a prevalent ensemble learning method. However, it can exhibit diminished sensitivity to underrepresented classes, potentially impairing classification performance.²⁹ To address this limitation, we integrated the Adaboost algorithm into our model training process. Adaboost enhances classification by iteratively training weak classifiers to focus on misclassified instances, thereby refining the overall classifier's accuracy.³⁰ In this study, Adaboost was applied during the model training phase of the RF algorithm; the specific optimization mechanism is shown in Figure 1.

Figure 1.

RF optimization mechanism of Adaboost algorithm. This figure illustrates the steps from Data Preparation and setting up the RF model parameters (defining base weak learners) to the iterative Adaboost procedure. Adaboost iteratively trains these weak RF learners, adaptively adjusting sample weights based on misclassification errors in each step, thereby enhancing the overall classifier's performance and yielding the final optimized classifier. RF: Random Forest.
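The adaptive reweighting step in Figure 1 can be made concrete with a short sketch of one boosting round. The text does not state which multiclass Adaboost variant was used, so the code below assumes the widely used SAMME formulation; the function name `adaboost_round` and the six-sample toy labels are hypothetical.

```python
import math

def adaboost_round(weights, y_true, y_pred, n_classes):
    """One multiclass AdaBoost (SAMME) reweighting step.

    Returns the learner weight alpha and the renormalised sample weights.
    """
    miss = [int(t != p) for t, p in zip(y_true, y_pred)]
    err = sum(w * m for w, m in zip(weights, miss)) / sum(weights)
    # SAMME needs a learner that beats random guessing among K classes.
    if not 0.0 < err < 1.0 - 1.0 / n_classes:
        raise ValueError("weighted error must lie in (0, 1 - 1/K)")
    alpha = math.log((1.0 - err) / err) + math.log(n_classes - 1)
    # Misclassified samples are up-weighted by exp(alpha); others are unchanged.
    new_w = [w * math.exp(alpha * m) for w, m in zip(weights, miss)]
    total = sum(new_w)
    return alpha, [w / total for w in new_w]

# Toy round: six samples, one stage-I sample (index 4) is misclassified.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 0, 2]
weights = [1 / 6] * 6
alpha, new_weights = adaboost_round(weights, y_true, y_pred, n_classes=3)
```

In this toy round the single misclassified sample ends up carrying two thirds of the total weight, so the next weak learner is pushed to classify it correctly.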

Support Vector Machine performs well in the case of small sample size and has good generalization ability but its performance largely depends on the choice of parameters, such as penalty parameter c and kernel function parameter g. Traditionally, the selection of the best parameters for SVM is a manual process that involves time-consuming methods like grid search, which can be quite cumbersome and inefficient.³¹ Conversely, Particle Swarm Optimization (PSO) is an evolutionary algorithm that leverages population-based cooperation and information exchange to rapidly converge to an optimal solution.³² Integrating PSO with SVM facilitates an efficient stochastic search for the optimal parameters c and g within a predefined search space, thereby enhancing SVM's parameter optimization process and making up for its limitations. The specific optimization mechanism is shown in Figure 2.

Figure 2.

SVM optimization mechanism of PSO algorithm. This figure illustrates the steps from Data Preparation and setting up the SVM model parameters (specifically c and g) to the iterative PSO procedure. PSO iteratively updates particle positions based on fitness evaluations in each step, adaptively searching the parameter space for optimal c and g values, thereby enhancing the overall model's performance and yielding the final optimized SVM model. PSO: Particle Swarm Optimization; SVM: support vector machine.
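The PSO search loop in Figure 2 can be sketched as follows. To keep the example self-contained, the fitness function is a toy quadratic standing in for the cross-validated error of an SVM as a function of (log₁₀ c, log₁₀ g); the inertia and acceleration coefficients, swarm size, bounds, and the location of the toy optimum are all assumptions for illustration, not values from the study.

```python
import random

def pso(fitness, bounds, n_particles=20, n_iter=100,
        w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimise `fitness` over a box with a basic global-best PSO."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [fitness(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(n_iter):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                # Inertia + pull toward personal best + pull toward global best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Clamp the new position to the search box.
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = fitness(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

def toy_cv_error(params):
    """Hypothetical stand-in for 1 - (cross-validated macro F1) of an SVM.
    Its minimum is placed at log10(c) = 1, log10(g) = -2 by construction."""
    log_c, log_g = params
    return (log_c - 1.0) ** 2 + (log_g + 2.0) ** 2

best_params, best_error = pso(toy_cv_error, bounds=[(-3.0, 3.0), (-3.0, 3.0)])
best_c, best_g = 10 ** best_params[0], 10 ** best_params[1]
```

In a real workflow, `toy_cv_error` would be replaced by a function that trains an SVM with the candidate c and g on the training folds and returns a validation error; searching in log space is a common choice because useful values of c and g span several orders of magnitude.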

Evaluation metrics

To ensure a comprehensive and reliable evaluation of the ML models for this multiclass classification task with imbalanced data distribution, multiple performance metrics derived from the confusion matrix were employed, because reliance solely on overall accuracy can be misleading in such scenarios. The evaluation metrics^33–36 used and their corresponding formulas are as follows:

Accuracy = \frac{TP + TN}{TP + TN + FP + FN}

(1)

Precision = \frac{TP}{TP + FP}

(2)

Recall = \frac{TP}{TP + FN}

(3)

F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}

(4)

where TP (True Positive) refers to the number of instances correctly predicted as belonging to that class; TN (True Negative) refers to the number of instances correctly predicted as not belonging to that class; FP (False Positive) refers to the number of instances incorrectly predicted as belonging to that class; and FN (False Negative) refers to the number of instances incorrectly predicted as not belonging to that class.

Although resampling techniques were applied to the training data to mitigate class imbalance during training, model performance was evaluated on the test set in each fold, which reflects the original data distribution. Therefore, this study primarily reports the Macro F1 Score. The Macro F1 Score is computed by calculating the F1 Score for each individual class (0, 1, and 2), and then taking their unweighted arithmetic average:

Macro\text{-}F1 = \frac{F1_{class 0} + F1_{class 1} + F1_{class 2}}{3}

(5)

This approach provides a more reliable metric for measuring the model's overall performance in handling classification tasks that involve class imbalance by giving equal importance to the performance on each class.
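As a worked illustration of these definitions, the sketch below derives per-class Precision, Recall, and F1, together with overall Accuracy and the Macro F1 Score, from a three-class confusion matrix. The matrix values are hypothetical and were chosen only to resemble the size of one test fold; they are not results from the study.

```python
def per_class_metrics(cm):
    """Per-class precision, recall and F1 from a square confusion matrix
    whose rows are true classes and whose columns are predicted classes."""
    n = len(cm)
    metrics = {}
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                        # true k, predicted other
        fp = sum(cm[r][k] for r in range(n)) - tp   # predicted k, true other
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[k] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics

# Hypothetical confusion matrix for one test fold (classes 0, 1, 2).
cm = [
    [40, 1, 0],
    [1, 9, 1],
    [0, 2, 3],
]

metrics = per_class_metrics(cm)
total = sum(sum(row) for row in cm)
accuracy = sum(cm[k][k] for k in range(len(cm))) / total
macro_f1 = sum(m["f1"] for m in metrics.values()) / len(metrics)
```

Here the accuracy is about 0.91 while the Macro F1 Score is about 0.81: the weak result on the smallest class (F1 of 2/3) is diluted in accuracy but counts fully in the macro average.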

Integrated feature selection method

To consider both feature importance and its relationship with other features, we introduced an innovative integrated approach to assess each feature variable based on a comprehensive evaluation of its impact on the model.

Initially, to assess feature importance, we averaged the scores obtained from multiple evaluations using the RF algorithm. Subsequently, to quantify feature independence, we calculated, for each feature, the average of the absolute values of its Spearman correlation coefficients with all other features in the dataset. Note that a higher value of this independence measure indicates a stronger average correlation with other features, that is, lower actual independence and a higher risk of introducing multicollinearity. Thereafter, we combined the two measures into a single score y for ranking and selection. The detailed formula is as follows:

y = ω_{1} x_{1} + ω_{2} x_{2}

(6)

where ω₁+ω₂ = 1 and x₁ represents the importance value while x₂ represents the independence value.

The higher the importance value, the greater the contribution to the model's prediction, positively affecting the model performance, so features corresponding to high-importance values should be retained whenever possible. Conversely, the higher the independence value, the more pronounced the multicollinearity among variables, which negatively impacts the model's prediction; thus, features corresponding to high independence values should be eliminated whenever possible.⁴⁴ Therefore, we modify the aforementioned linear model to

y = ω_{1} x_{1} + ω_{2} (- x_{2})

(7)

Then, to investigate the impact of different weightings, 9 weight combinations were assigned. For each weight combination, the combined score y was calculated for all 22 features, and the features with the lowest scores were excluded. Finally, the remaining feature subsets were used for model training and evaluation.
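A minimal sketch of this scoring step is given below. The importance and independence values are hypothetical placeholders for four of the 22 features, the nine weight ratios are assumed to run from 9:1 to 1:9 in steps of one tenth, and no rescaling of x₁ and x₂ is applied because the text does not describe one.

```python
def integrated_scores(importance, independence, w1):
    """Combined score y = w1 * x1 - w2 * x2 with w1 + w2 = 1."""
    w2 = 1.0 - w1
    return {f: w1 * importance[f] - w2 * independence[f] for f in importance}

# Hypothetical values: x1 is an RF importance, x2 is the mean absolute
# Spearman correlation of the feature with all other features.
importance = {"ALB": 0.12, "PLT": 0.10, "HB": 0.05, "RBC": 0.04}
independence = {"ALB": 0.15, "PLT": 0.10, "HB": 0.40, "RBC": 0.42}

rankings = {}
for a in range(9, 0, -1):              # ratios 9:1, 8:2, ..., 1:9
    scores = integrated_scores(importance, independence, w1=a / 10)
    # Rank features from highest to lowest combined score.
    rankings[(a, 10 - a)] = sorted(scores, key=scores.get, reverse=True)

# Features at the end of each ranking are the candidates for exclusion.
lowest_at_9_1 = rankings[(9, 1)][-1]
```

With these placeholder values, ALB ranks first when importance dominates (9:1) and PLT ranks first when independence dominates (1:9), which shows how the weight ratio can change the selected subset.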

To sum up, the overall technical route of this study is depicted in Figure 3.

Figure 3.

Technical route of machine learning-based CWP classification. This figure illustrates the sequence of steps including Data Processing (using SMOTE and ADASYN for imbalanced data), Feature Selection (evaluating importance by RF, independence by correlation and an integrated linear model combining both importance and independence), Model Selection (training various models including RF, SVM, DT, KNN, RF-Adaboost, SVM-PSO on the selected features), and Model Evaluation (using Accuracy, Precision, Recall, and F1-score metrics). ADASYN: adaptive synthesis; CWP: coal workers’ pneumoconiosis; DT: Decision Tree; KNN: K-Nearest Neighbor; RF: Random Forest; SMOTE: synthetic minority oversampling technique; SVM: support vector machine; SVM-PSO: support vector machine-particle swarm optimization.

Results

Results of oversampled balanced data on various ML models

The combined prediction results of two oversampling methods and six ML models are shown in Table 2. This table presents the average Accuracy, Macro Precision, Macro Recall, and Macro F1 Score obtained from the 5-fold cross-validation for each model on the same datasets. The results showed that model performance varied depending on the oversampling method used. For clarity and conciseness throughout this article, including Table 2 and all subsequent tables, the best average value for each performance metric is highlighted in bold, and the second best is underlined. Furthermore, the metrics will be referred to as Accuracy, Precision, Recall, and F1 score in the following text.

Table 2.

Classification results of each model under oversampled data.

Oversampling method	ML model	Accuracy	Precision	Recall	F1
SMOTE	RF	0.9364	0.9168	0.8268	0.8344
	RF-Adaboost	0.9469	0.9271	0.8412	0.8536
	SVM	0.8832	0.7912	0.8267	0.7945
	SVM-PSO	0.915	0.8553	0.8354	0.8368
	DT	0.8656	0.7788	0.8557	0.7992
	KNN	0.8727	0.7666	0.8174	0.7657
ADASYN	RF	0.9328	0.8845	0.8185	0.8238
	RF-Adaboost	0.9364	0.9205	0.8246	0.8298
	SVM	0.9151	0.8101	0.7882	0.7925
	SVM-PSO	0.9365	0.8788	0.8725	0.8534
	DT	0.805	0.7204	0.8166	0.7363
	KNN	0.8693	0.7681	0.8063	0.7559

On the SMOTE dataset, RF-Adaboost demonstrated the best performance, achieving the highest Accuracy (0.9469), Precision (0.9271), and F1 score (0.8536). Given the importance of accurately identifying early stages of CWP (Class 0 and 1), we further examined the per-class performance of the RF-Adaboost model on the SMOTE dataset. For Class 0, the Precision, Recall, and F1 scores were 0.9807, 0.9951, and 0.9878, respectively. For Class 1, these metrics were 0.8756, 0.9418, and 0.9024. Conversely, KNN exhibited the worst performance with the lowest F1 score (0.7657).

Similarly, on the ADASYN dataset, SVM-PSO performed best, achieving the highest Accuracy (0.9365), Recall (0.8725), and F1 score (0.8534). For Class 0, SVM-PSO achieved average Precision, Recall, and F1 scores of 0.9847, 0.9605, and 0.9723, respectively. For Class 1, the average scores were 0.9223, 0.9636, and 0.9396. However, DT had the worst performance with the lowest F1 score (0.7363). By comparing the two datasets, it was found that except for SVM-PSO, the other 5 models had slightly better performance indicators on the SMOTE dataset than the ADASYN dataset. In addition, SMOTE is more widely used than ADASYN in clinics,³⁷ so the subsequent study in this article will be based on the SMOTE oversampling data set.

Figure 4 further visualizes the performance of the six ML models under the two oversampling methods. The figure shows that the optimized RF-Adaboost and SVM-PSO models improved on the four evaluation indicators relative to their unoptimized counterparts (RF and SVM), which indicates that the optimization compensated for the shortcomings of the original models. In addition, all indicators of RF, RF-Adaboost, and SVM-PSO exceeded 0.8 on both oversampled datasets, indicating that these three models outperformed the other three.

Figure 4.

Performance indicators of the models under different oversampling methods. (a) Performance indicators under SMOTE; (b) performance indicators under ADASYN. ADASYN: adaptive synthesis; SMOTE: synthetic minority oversampling technique.

When employing ML for predictive analytics, the primary concern is the model's generalization capability, with the avoidance of overfitting being essential to preserving this capability. Overfitting refers to the phenomenon that the model fits the training data excessively well, leading to suboptimal performance on unseen data and thereby impairing the model's generalization performance.³⁸ Furthermore, existing research indicated that an increased number of input variables raises the risk of overfitting.³⁹ Consequently, to enhance model performance and generalization ability, our subsequent research will focus on an in-depth analysis of the input feature variables.

Feature selection based on importance analysis

Feature selection is a method to reduce the redundancy of a dataset while retaining the most valuable information for the target variable. This approach can not only improve the accuracy and generalization of the model but also reduce the risk of overfitting, and enhance the interpretability of the model.⁴⁰ This study included 22 features, and to identify those with a greater influence on the predicted target, we employed the RF importance criterion for feature selection.

Random Forest importance assessment is an embedded feature selection method. We applied RF to the oversampled, balanced data to assess the importance of each feature. The importance of the 22 feature variables before feature selection is shown in Figure 5. The results showed that the five features with the greatest influence on the stage of CWP were ALB, PLT, WBC, LDL, and ALC.

Figure 5.

Importance evaluation of clinical features.

In this study, we retained the features that ranked in the top 70% by importance. Ultimately, 16 key features associated with the staging of CWP were identified. These features were then used as inputs to the six ML models for prediction. Additionally, since the F1 score is the harmonic mean of precision and recall, it can effectively measure the overlap of data between categories.⁴¹ That is, the larger the F1 score, the better the classification performance of the model. Therefore, we used the F1 score to compare the performance before and after feature selection, with the comparative results presented in Table 3.

Table 3.

Results of feature selection based on importance assessment.

	ML model	Accuracy	Precision	Recall	F1
RF importance assessment feature selection	RF	0.9257	0.8747	0.8064	0.8172
	RF-Adaboost	0.9364	0.8772	0.8224	0.8338
	SVM	0.9081	0.8215	0.7553	0.7542
	SVM-PSO	0.9117	0.8437	0.8299	0.8251
	DT	0.866	0.7489	0.812	0.7684
	KNN	0.8692	0.7517	0.7692	0.734

From the results in Table 3, we observed that the RF-Adaboost model achieved the best overall performance among the six models under this importance-based feature selection condition. To understand its effectiveness in identifying crucial early stages, we examined its per-class metrics. For Class 0, the Precision, Recall, and F1 scores were 0.9806, 0.9902, and 0.9852, respectively. For Class 1, these metrics were 0.8509, 0.9236, and 0.8836.

Table 4.

Feature selection results based on correlation analysis.

	ML model	Accuracy	Precision	Recall	F1
The deleted dataset 1	RF	0.9259	0.8586	0.8131	0.8191
	RF-Adaboost	0.9294	0.8809	0.8191	0.8241
	SVM	0.8941	0.8038	0.7739	0.7659
	SVM-PSO	0.9044	0.8268	0.8138	0.8045
	DT	0.8731	0.7619	0.8181	0.7771
	KNN	0.8657	0.7489	0.8153	0.7639
The deleted dataset 2	RF	0.9365	0.8782	0.8414	0.8518
	RF-Adaboost	0.9435	0.9049	0.8373	0.8556
	SVM	0.9116	0.8466	0.8114	0.8192
	SVM-PSO	0.9187	0.8634	0.8337	0.8414
	DT	0.8873	0.7884	0.8336	0.8038
	KNN	0.8904	0.7911	0.8175	0.7871

Figure 6 shows the performance comparison before and after the feature selection more clearly. The results indicated a slight decrease in the performance metrics of the six models following feature selection, as assessed by the RF importance criterion when compared to the metrics obtained before feature selection.

Figure 6.

Performance comparison between no feature selection and feature selection based on importance.

Feature selection based on correlation analysis

In the preceding section, feature selection was performed based solely on feature importance; however, a comparison of model performance before and after selection unexpectedly revealed a decline rather than an improvement. The decline was most pronounced for the SVM and KNN models, whose F1 scores fell by 4.03 and 3.17 percentage points, respectively. Given that these algorithms fundamentally rely on distance-based classification, they are highly sensitive to spatial distribution and feature continuity.⁴² The observed decrease in performance suggests that feature selection based on importance may have introduced discontinuities in the feature space. This is likely due to the removal of variables that are highly interrelated with the retained features, thereby impairing the model's capacity to capture the underlying patterns and resulting in a significant deterioration of performance.

On the other hand, considering that all input variables are derived from clinical data, we must address the common issue of multicollinearity in clinical medical research datasets.⁴³ Multicollinearity arises when two or more features are highly correlated, potentially leading to overfitting in the training process of ML models and consequently impairing their predictive performance. To assess the impact of feature intercorrelation on model efficacy, we performed a correlation analysis using SPSS software on the 22 input features. The resulting correlation coefficients were visualized in a heatmap, as depicted in Figure 7.

Figure 7.

Correlation heat maps among CWP characteristic variables. This figure illustrates the Spearman correlation coefficients among the 22 CWP characteristic variables used in the study. The color scale maps positive correlations toward blue (closer to +1) and negative correlations toward red (closer to −1), with color intensity indicating the magnitude. Numerical correlation values are displayed in each cell. CWP: coal workers’ pneumoconiosis.

Figure 7 shows significant correlations between certain variables, such as the correlation between RBC and HB. To investigate the potential impact of these strong correlations on model performance, we set a threshold of 0.7 on the absolute value of the correlation coefficient to screen variable pairs. The feature pairs that met this threshold were TG and VLDL (r = 0.71), CHOL and LDL (r = 0.931), HB and RBC (r = 0.899), GLB and the A/G ratio (r = 0.852), ANC and WBC (r = 0.828), and ALT and AST (r = 0.724).
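The screening rule can be illustrated with a small pure-Python sketch that computes Spearman coefficients and flags pairs whose absolute value reaches the threshold. The study performed this analysis in SPSS; the code below is only an equivalent illustration, the six-sample feature values are hypothetical, and treating the threshold as inclusive (≥ 0.7) is an assumption.

```python
import math
from itertools import combinations

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def spearman(a, b):
    """Spearman's rho is the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(a), average_ranks(b))

# Hypothetical measurements for six subjects.
features = {
    "HB":  [130, 142, 151, 128, 160, 147],
    "RBC": [4.2, 4.6, 4.9, 4.3, 5.2, 4.7],
    "GLU": [5.1, 6.3, 4.8, 5.9, 5.0, 5.6],
}

THRESHOLD = 0.7
flagged = []
for a, b in combinations(features, 2):
    rho = spearman(features[a], features[b])
    if abs(rho) >= THRESHOLD:
        flagged.append((a, b, rho))
```

In this toy data only the HB and RBC pair is flagged, mirroring the kind of physiologically linked pair the heatmap highlights.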

To eliminate the influence of strong correlation between variables and obtain the highest accuracy with the fewest indicators, we selected one feature from each of these six pairs in every possible way, which yielded 2⁶ = 64 candidate sets of six features to eliminate. The six features in each set were then removed from the original dataset to create 64 new datasets. Each dataset was then evaluated with the ML models to assess predictive performance, with the one showing the worst performance identified as the deleted dataset 1, and the best as the deleted dataset 2. The features excluded from the deleted dataset 1 were TG, CHOL, RBC, GLB, WBC, and ALT, while the features excluded from the deleted dataset 2 were VLDL, LDL, HB, A/G, ANC, and AST. The predictive outcomes of these two datasets in conjunction with six ML models are detailed in Table 4 and visualized in Figure 8.
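The count of 64 follows from choosing one member of each of the six correlated pairs independently. A short sketch of the enumeration, using the abbreviations from Table 1 (variable names such as `removal_sets` are introduced here for illustration):

```python
from itertools import product

# The six strongly correlated pairs reported in the text.
pairs = [
    ("TG", "VLDL"),
    ("CHOL", "LDL"),
    ("HB", "RBC"),
    ("GLB", "A/G"),
    ("ANC", "WBC"),
    ("ALT", "AST"),
]

# Pick exactly one feature from every pair: 2 ** 6 = 64 removal sets.
removal_sets = [frozenset(choice) for choice in product(*pairs)]

# The two sets singled out in the text as worst and best performing.
deleted_1 = frozenset({"TG", "CHOL", "RBC", "GLB", "WBC", "ALT"})
deleted_2 = frozenset({"VLDL", "LDL", "HB", "A/G", "ANC", "AST"})

n_remaining = 22 - len(deleted_2)  # features left in each reduced dataset
```

Each of the 64 reduced datasets therefore keeps 16 of the 22 features, and the two sets named in the text are exact complements of each other within the six pairs.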

Figure 8.

Performance comparison between no feature selection and feature selection based on correlation.

Table 5.

Feature selection results based on comprehensive importance and correlation.

Weight combination	ML model	Accuracy	Precision	Recall	F1
Dataset 1 (ω₁:ω₂ = 9:1, 8:2)	RF	0.94	0.9142	0.8474	0.8474
	RF-Adaboost	0.9506	0.9009	0.8662	0.8694
	SVM	0.9224	0.8375	0.8091	0.8174
	SVM-PSO	0.9363	0.8798	0.8608	0.8665
	DT	0.8518	0.7398	0.8112	0.7624
	KNN	0.887	0.768	0.7962	0.7681
Dataset 2 (ω₁:ω₂ = 7:3, 6:4)	RF	0.9435	0.8879	0.8491	0.8569
	RF-Adaboost	0.9506	0.9142	0.8757	0.8757
	SVM	0.9224	0.8375	0.8091	0.8174
	SVM-PSO	0.9363	0.8798	0.8608	0.8665
	DT	0.8836	0.7885	0.8054	0.7887
	KNN	0.8974	0.7951	0.8396	0.8034
Dataset 3 (ω₁:ω₂ = 5:5)	RF	0.94	0.8866	0.8324	0.844
	RF-Adaboost	0.933	0.8983	0.8324	0.8464
	SVM	0.9328	0.8694	0.8351	0.8426
	SVM-PSO	0.9424	0.8881	0.8735	0.8757
	DT	0.8692	0.7551	0.7849	0.7572
	KNN	0.8692	0.7645	0.8169	0.7685
Dataset 4 (ω₁:ω₂ = 4:6, 3:7, 2:8, 1:9)	RF	0.926	0.8658	0.8191	0.8263
	RF-Adaboost	0.9365	0.8938	0.8324	0.8492
	SVM	0.9009	0.7897	0.81	0.7954
	SVM-PSO	0.9327	0.8558	0.8432	0.8445
	DT	0.8905	0.7981	0.7745	0.7752
	KNN	0.8586	0.7433	0.7915	0.7431

The results indicated that the F1 scores for deleted dataset 2 improved relative to the unprocessed dataset. Notably, the SVM model showed the most pronounced gain, a 2.47% improvement. Among the models evaluated on deleted dataset 2, RF-Adaboost achieved the highest overall performance. For the critical early stages, its Precision, Recall, and F1 scores for Class 0 were 0.9805, 0.9951, and 0.9877, respectively; for Class 1, they were 0.8509, 0.9236, and 0.8836. This finding showed that strong intervariable correlation was one of the factors contributing to the variability in model performance. Consequently, further investigation into the strong correlations between variables and their impact on model performance was of great importance for optimizing the predictive capabilities of our models.

Furthermore, we observed that the performance of deleted dataset 1 was inferior to that of the unprocessed dataset, indicating that the features removed from this dataset contributed substantially to the model's predictive capacity. Therefore, during feature selection informed by correlation analysis, candidate features whose exclusion degraded model performance should be retained, whereas features whose removal improved performance should be eliminated. This strategy aimed to keep the model's predictive accuracy at an optimal level.

Integrated feature selection method based on a comprehensive assessment of importance and correlation

Analysis of the nine weight combinations revealed that the resulting 16-feature subsets were not all unique. Specifically, the feature sets were identical within each of the following groups of weight combinations: ω₁:ω₂ = 9:1/8:2, ω₁:ω₂ = 7:3/6:4, and ω₁:ω₂ = 4:6/3:7/2:8/1:9. Consequently, after excluding the bottom six features, the nine weight combinations yielded four distinct feature subsets, labeled Dataset 1 to Dataset 4. To determine under which weight combination the models performed optimally, we combined these datasets with six ML models for predictive analysis. The outcomes of these predictions are presented in Table 5.
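The following sketch illustrates why several weight ratios can produce the same retained subset. It assumes a simple linear score, ω₁ × importance + ω₂ × independence, on scores already scaled to [0, 1]; the exact scoring and normalization used in the study may differ. The five features and their values are invented for illustration, and the function name `retained_subset` is hypothetical.

```python
from collections import defaultdict

# Hypothetical, already-normalised scores in [0, 1]; not values from the study.
importance = {"A": 0.9, "B": 0.6, "C": 0.5, "D": 0.3, "E": 0.1}
independence = {"A": 0.2, "B": 0.7, "C": 0.9, "D": 0.4, "E": 0.8}


def retained_subset(w1, w2, n_drop):
    """Rank features by the assumed weighted linear score and drop the bottom n_drop."""
    total = w1 + w2
    score = {
        name: (w1 * importance[name] + w2 * independence[name]) / total
        for name in importance
    }
    ranked = sorted(score, key=score.get, reverse=True)
    return frozenset(ranked[: len(ranked) - n_drop])


# Sweep the nine ratios 9:1 ... 1:9 and group those that keep the same features.
groups = defaultdict(list)
for w1 in range(9, 0, -1):
    w2 = 10 - w1
    groups[retained_subset(w1, w2, n_drop=2)].append(f"{w1}:{w2}")

for subset, ratios in groups.items():
    print(sorted(subset), ratios)
```

Because only the ranking matters, the retained subset changes only when the weights shift enough for two features to swap places around the cut-off. In this toy example the nine ratios collapse into two distinct subsets.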

Furthermore, to more clearly and intuitively visualize the performance changes across different datasets, a comparative visualization is shown in Figure 9.

Figure 9.

Performance of data sets with different weight ratios on each model.

The analysis of the figure above revealed that Dataset 4, with its corresponding weight configurations ω₁:ω₂ = 4:6/3:7/2:8/1:9, had the worst performance across the six models based on the F1-score (average F1 ≈ 0.8056). Conversely, Dataset 2 with corresponding weight configuration ω₁:ω₂ = 7:3/6:4 showed the best overall performance based on F1-score (average F1 ≈ 0.8348). Notably, the RF-Adaboost model achieved the highest overall performance on Dataset 2, consistent with the findings for the other feature selection methods. Furthermore, compared to using uncharacterized selection, importance-based selection, and correlation-based selection methods, the F1 score for identifying early-stage Class 1 using RF-Adaboost under this combined approach improved by 1.66%, 3.54%, and 3.55%, respectively. This suggested that the specific set of 16 features retained in Dataset 2 provided the most effective contribution to the enhancement of model performance in terms of balancing precision and recall, while the feature set in Dataset 4 was less effective.
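The average F1 values quoted above can be reproduced directly from the F1 column of Table 5. The sketch below only restates those table values; the dictionary layout and variable names are illustrative.

```python
from statistics import mean

# F1 scores copied from Table 5 (order: RF, RF-Adaboost, SVM, SVM-PSO, DT, KNN).
f1_scores = {
    "Dataset 1": [0.8474, 0.8694, 0.8174, 0.8665, 0.7624, 0.7681],
    "Dataset 2": [0.8569, 0.8757, 0.8174, 0.8665, 0.7887, 0.8034],
    "Dataset 3": [0.8440, 0.8464, 0.8426, 0.8757, 0.7572, 0.7685],
    "Dataset 4": [0.8263, 0.8492, 0.7954, 0.8445, 0.7752, 0.7431],
}

# Unweighted mean over the six models, rounded to four decimals.
average_f1 = {name: round(mean(values), 4) for name, values in f1_scores.items()}

best = max(average_f1, key=average_f1.get)
worst = min(average_f1, key=average_f1.get)
print(average_f1)
print("best:", best, "worst:", worst)
```

The unweighted mean treats all six models equally, so the ranking of datasets reflects overall behaviour rather than the best single model.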

Additionally, to evaluate the performance efficacy of the six ML models using the innovative combined feature selection method relative to standard baseline approaches (using uncharacterized, importance-based, and correlation-based selection), F1-scores were compared (detailed in Figure 10). The analysis reveals distinct advantages of the combined method, particularly when represented by its best-performing subset Dataset 2.

Figure 10.

Performance comparison of different feature selection methods on different models. The integrated feature selection weight ratio is (a) 9:1, 8:2; (b) 7:3, 6:4; (c) 5:5; and (d) 4:6, 3:7, 2:8, and 1:9.

Compared to feature selection based solely on importance, the combined method demonstrated clear performance advantages. Using the best-performing Dataset 2, the F1-scores of the six models (RF, RF-Adaboost, SVM, SVM-PSO, DT, and KNN) were improved by approximately 3.97%, 4.19%, 6.32%, 4.14%, 2.03%, and 6.94%, respectively, relative to the importance-only baseline. The robustness of the combined approach is further highlighted by Dataset 4 (the lowest-performing subset), which also consistently outperformed the importance-only baseline. For instance, Dataset 4 still yielded F1-score improvements over importance-based selection, notably by approximately 4.12% for SVM and 1.94% for SVM-PSO. Therefore, integrating feature independence with importance provides a more effective feature selection strategy than relying solely on importance rankings.

When compared against using all features, the integrated method, represented by Dataset 2, also demonstrated clear value for most models. Specifically, Dataset 2 led to improved F1-scores for five of the six models: RF, RF-Adaboost, SVM, SVM-PSO, and KNN, with performance increases of approximately 2.25%, 2.21%, 2.29%, 2.97%, and 3.77%, respectively, compared to using uncharacterized selection. Conversely, the DT model performed slightly better (by about 1.05%) with the uncharacterized selection. Thus, while not universally superior to using all features across every model, the combined feature selection method demonstrated its effectiveness by enhancing performance for the majority of the algorithms evaluated.

The comparison between the combined method (using Dataset 2) and correlation-based feature selection revealed model-specific performance differences. The combined method demonstrated superior efficacy in four of the six algorithms. Specifically, compared to the correlation-based method, the F1-scores for RF, RF-Adaboost, SVM-PSO, and KNN increased by approximately 0.51%, 2.01%, 2.51%, and 1.63%, respectively. Conversely, correlation-based selection yielded slightly better results for the SVM and DT models, with F1-scores approximately 0.18% and 1.51% higher, respectively, than the combined method. This result indicates that while correlation-based selection remains an effective baseline for specific models, the combined strategy proposed in this study, integrating both importance and independence, can provide distinct performance advantages for certain ML algorithms, particularly for the evaluated ensemble RF-Adaboost and SVM-PSO models.

In summary, the comparative analysis emphasizes the value of the proposed combined feature selection method. By integrating both feature importance and independence, the method consistently and significantly outperformed selection based solely on importance. Despite model-dependent performance against the uncharacterized and correlation-based baselines, the combined method achieved substantial performance gains for the majority of the algorithms used in this article, particularly with the feature set corresponding to Dataset 2. Notably, it demonstrated distinct advantages for RF-Adaboost and SVM-PSO models, surpassing all baseline methods in these instances. These findings highlight the combined strategy's potential to enhance predictive performance (measured by F1-score), presenting it as a robust and effective feature selection method.

Discussion

This study analyzed clinical data from CWP patients compared to healthy individuals to construct an ML model for the staging diagnosis of CWP. The findings indicated changes in certain clinical data between CWP patients and the healthy control group. Using the RF feature selection method, the three most critical features for the staging of pneumoconiosis were identified as ALB, PLT, and WBC. However, after unidimensional feature selection, model performance did not improve significantly and even declined for some models. To address this problem, we combined the importance and independence of features in a weighted linear score, thereby comprehensively considering the impact of each dimension on the staging of CWP. The results, indicated by accuracy, precision, recall, and F1 scores, showed that the model performance improved with each weight combination compared to considering only a single dimension. The best model performance was achieved when the weight ratio of importance to independence was 7:3 or 6:4.

A substantial number of studies have demonstrated that ML based on clinical data has been extensively applied to various aspects of disease diagnosis and prognosis, such as the differentiation of acute appendicitis categories⁴⁵ and the rapid diagnosis of chronic kidney disease.²¹ Traditional diagnosis of CWP relies on doctors interpreting chest X-rays.⁴⁶ However, on the one hand, chest X-ray examination is more costly than routine clinical tests, significantly increasing the medical expenses for patients. On the other hand, even the most accurate digital radiography images can suffer from organ overlap, leading to unclear visibility.⁴⁷ Therefore, this study combined clinical biochemistry examination data with ML models for the staging prediction of CWP, which not only addressed the existing problems in imaging but also reduced the medical costs for patients.

Feature selection methods are typically categorized into three types, including embedded, wrapper, and filter methods.⁴⁸ Generally, feature selection reduces the number of features, which can decrease the complexity of the model, reduce the risk of overfitting, and thereby enhance model performance.⁴⁹ In this study, we employed the RF importance for feature selection. However, we observed a decline in the predictive performance of the model, contrary to our expectations. Research suggests that this phenomenon may arise from the loss of variables with significant predictive contributions during the feature selection process, which could also account for the slight performance decrease when using only importance for feature selection in this study. Nevertheless, researchers have noted that despite a reduction in dimensionality and a decrease in computational time, the dataset post-feature selection remains applicable for classification tasks, especially when the performance decline is minimal.⁵⁰ Therefore, the feature selection based on importance in this study still held certain research significance.

Additionally, RF can also identify clinically relevant features for CWP according to the magnitude of feature importance scores. Previous studies have utilized RF importance to identify AaDO₂ as a predictive factor for CWP.²² Furthermore, some studies employed metabolomics analysis and discovered a negative correlation between direct bilirubin in the serum of pneumoconiosis patients and the severity of CWP.⁵¹ In this study, we used common clinical biochemical indicators, which are more readily available than pulmonary function and blood gas analysis, thus filling the gap in the prediction of CWP staging using common clinical data. Unlike the more costly metabolomics approach, we employed the Kruskal–Wallis H test, a statistical method, and identified significant variations in features such as ALC, ALB, and ALT across different stages of CWP. Box plots based on their medians, quartiles, and interquartile ranges are shown in Figure 11. The p-values from the Kruskal–Wallis H test were all less than the level of significance of 0.05, indicating significant differences between at least two groups. As shown in Figure 11, the levels of ALC, ALB, and ALT also decreased with the severity of CWP, suggesting that these features can serve as good predictive factors for CWP staging from a statistical perspective. However, further research is needed to explore the underlying biological mechanisms and clinical significance.

Figure 11.

Kruskal–Wallis H test results of different clinical indicators at different stages. (a) ALC; (b) ALB; and (c) ALT.
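For readers unfamiliar with the Kruskal–Wallis H test, the sketch below computes the statistic by hand for three groups. It is an illustration under stated assumptions: the values are made up, there are no tied observations (so no tie correction is applied), and exactly three groups are compared, which allows the chi-square p-value with two degrees of freedom to be written in closed form as exp(−H/2). In practice a library routine such as `scipy.stats.kruskal` would typically be used.

```python
import math


def kruskal_wallis_three_groups(group_a, group_b, group_c):
    """Return (H, p) for three groups without ties, using the chi-square approximation."""
    groups = [group_a, group_b, group_c]
    pooled = sorted(value for group in groups for value in group)
    if len(set(pooled)) != len(pooled):
        raise ValueError("this sketch does not handle tied observations")
    rank = {value: position + 1 for position, value in enumerate(pooled)}
    n = len(pooled)
    rank_term = sum(sum(rank[v] for v in group) ** 2 / len(group) for group in groups)
    h = 12 / (n * (n + 1)) * rank_term - 3 * (n + 1)
    # Chi-square survival function with 2 degrees of freedom is exp(-h / 2).
    p = math.exp(-h / 2)
    return h, p


# Made-up albumin-like values that decrease with stage; not data from the study.
stage_1 = [46.1, 45.3, 44.8, 44.2, 43.9]
stage_2 = [42.7, 42.0, 41.5, 40.9, 40.3]
stage_3 = [38.8, 38.1, 37.4, 36.9, 36.0]

h_stat, p_value = kruskal_wallis_three_groups(stage_1, stage_2, stage_3)
print(round(h_stat, 2), round(p_value, 4))
```

A p-value below 0.05 indicates only that at least two groups differ; identifying which ones requires a post hoc pairwise comparison.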

In the process of feature selection based on importance assessment, we excluded six indicators with the lowest importance values, including GLB, A/G, VLDL, TG, Cr, and AMC. However, due to the particularities of clinical data, there is often a strong correlation between two or more variables, known as multicollinearity, which is not uncommon in medical research.⁵² This phenomenon can affect the outcomes of model predictions since it becomes difficult to discern the individual impact of each input variable on the target variable. To date, a variety of methods have been identified to address multicollinearity,⁵³ such as stepwise regression and factor analysis. Yet, these traditional solutions tend to be complex, involving tedious hyperparameter tuning and substantial computational effort. In this study, we employed the simplest approach by directly utilizing Spearman's rank correlation analysis, setting a threshold for the absolute value of the correlation coefficient to guide feature selection, thereby creating a new dataset composed of variables with low or no correlation. By comparing the model performance before and after the exclusion, we observed an overall improvement in performance after the exclusion. This indicated that in this research, a simple method was also capable of resolving multicollinearity issues. The method is straightforward and time-efficient and preserves the accuracy of the model, holding certain significance for clinical research.

To further substantiate that the application of correlation analysis in resolving multicollinearity is not unique to this study, we have referenced UCI's Heart Disease public dataset for validation. This dataset utilizes 14 clinical features such as trestbps to construct ML models for predicting the presence of heart disease. A correlation analysis was performed on these 14 clinical features, and the resulting heatmap of the correlation coefficients is depicted in Figure 12.

Figure 12.

Correlation heat maps of heart disease characteristic variables.

To ensure that the retained features had only low intercorrelations, a threshold of 0.43 for the absolute value of the correlation coefficient was set for screening. The features exceeding this threshold were thalach, oldpeak, and slope. Since all three were correlated with one another above the threshold, four reduced datasets were constructed: three that each retained one of the three features and removed the other two, and one that removed all three. These reduced datasets were compared with the original Heart Disease dataset using the RF and SVM models, and the results are presented in Table 6. The reduced datasets generally outperformed the original dataset, with the SVM model showing the more pronounced enhancement. Averaged over the four reduced datasets, performance improved by 1.0384 percentage points for the RF model and 11.3214 percentage points for the SVM model relative to the original dataset. These results indicated that strong intercorrelations among features are one of the factors affecting the predictive performance of the models, and that correlation analysis can effectively address the issue of multicollinearity.

Table 6.

Performance comparison before and after strong correlation elimination.

	Original	Reserve slope	Reserve oldpeak	Reserve thalach	Eliminate all
RF	0.88158	0.893204	0.880435	0.902913	0.891304
SVM	0.76316	0.879121	0.868132	0.89011	0.868132
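The improvement figures quoted for the Heart Disease comparison are consistent with averaging the four reduced-dataset columns of Table 6 and subtracting the original score. The sketch below reproduces that arithmetic; the helper name `mean_gain_points` is hypothetical, and the interpretation as a gain in percentage points is an assumption based on the numbers matching.

```python
from statistics import mean

# Scores copied from Table 6.
table_6 = {
    "RF": {
        "Original": 0.88158,
        "Reserve slope": 0.893204,
        "Reserve oldpeak": 0.880435,
        "Reserve thalach": 0.902913,
        "Eliminate all": 0.891304,
    },
    "SVM": {
        "Original": 0.76316,
        "Reserve slope": 0.879121,
        "Reserve oldpeak": 0.868132,
        "Reserve thalach": 0.89011,
        "Eliminate all": 0.868132,
    },
}


def mean_gain_points(row):
    """Mean score of the reduced datasets minus the original, in percentage points."""
    reduced = [score for name, score in row.items() if name != "Original"]
    return (mean(reduced) - row["Original"]) * 100


gains = {model: round(mean_gain_points(row), 4) for model, row in table_6.items()}
print(gains)
```

Note that the gain is an average: for RF, the dataset that retains only oldpeak scores slightly below the original, while the other three reduced datasets score above it.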

Based on the current research findings, we have observed that when conducting feature selection based on importance and correlation, the features eliminated by the two methods differ due to their distinct focuses. Relying on a single-dimensional feature selection criterion may lead to model misjudgment and overfitting due to the loss of information. To enable the model to grasp more information and achieve more stable and accurate performance, we explored a more comprehensive feature selection strategy. A common multidimensional feature selection method involves using the intersection of features selected by various feature selection methods as the optimal subset,⁵⁴ such as combining hybrid filter and embedded feature selection methods to filter key features.⁵⁵ In this study, we innovatively proposed a weighted linear ensemble feature selection method, which adjusted the weights of feature importance and independence to predict the staging of CWP. The results demonstrated that compared to single-dimensional feature selection, this ensemble method significantly enhances the model's predictive performance under various weight configurations, thereby validating its effectiveness. This approach also provides a novel perspective for the prediction of pneumoconiosis staging and may offer a reference for the diagnosis and classification of other clinical diseases.

Although this study introduces an innovative method for CWP staging prediction, certain limitations should be acknowledged. Firstly, the findings are derived from data sourced exclusively from a single institution, which may limit the generalizability of our predictive models. Future work should utilize data collected from multiple centers to enhance the broader applicability of these models. Secondly, the current study acquired patient information within a specific timeframe and therefore lacks insight into the temporal changes of clinical indicators in CWP patients. Consequently, future research should consider temporal factors by following patients over time and tracking the changes in their relevant clinical and biochemical indicators. Such an approach could provide a more comprehensive assessment for CWP research.

Conclusion

In conclusion, the objective of this study was to achieve precise staging prediction for CWP through ML methods and a small sample of routine clinical data. Initially, we employed various oversampling techniques to balance the small-sample clinical data. Subsequently, we utilized multiple ML methods for supervised learning and prediction of CWP staging outcomes. Furthermore, we proposed an integrated feature selection approach that takes into account both feature importance and independence, thereby achieving high-accuracy CWP staging prediction with a reduced number of clinical test indicators.

Oversampling methods were utilized to enhance the small-sample clinical data, resulting in a balanced CWP dataset, and ML techniques were combined to achieve early prediction of CWP. Among these, various ML models demonstrated optimal performance on the SMOTE oversampling dataset. The RF and SVM models were optimized using the Adaboost algorithm and the PSO algorithm, respectively, to compensate for the models’ deficiencies. The optimized RF-Adaboost model demonstrated superior performance for staging prediction of CWP on the SMOTE oversampling dataset, achieving an accuracy of 0.8536, which represented a 1.23% improvement over the unoptimized RF model. The SVM-PSO model showed enhanced performance on the ADASYN oversampling dataset, with an accuracy of 0.8534, marking a significant 6.09% increase in performance compared to the original SVM model. The RF importance assessment method was employed to identify ALB, PLT, and WBC as significant predictive factors for CWP. Additionally, in an innovative approach, the Kruskal–Wallis H test, a statistical method, was utilized to detect significant changes in the features of ALC, ALB, and ALT across different stages of CWP. This approach fills the gap in identifying prognostic factors for CWP staging predictions using simple statistical methods. An innovative integrated feature selection method was proposed, which comprehensively considered the importance and independence of the clinical features utilized. In contrast to traditional prediction approaches that depend on chest radiography or metabolomics, this method enabled accurate staging prediction of CWP with fewer clinical indicators. The largest gain over importance-only feature selection, a 6.94% increase in F1 score (for the KNN model), was achieved when the weight ratio of importance to independence was set at 7:3 or 6:4.

Footnotes

ORCID iD

Jiaqi Jia

Jingying Huang

Yuming Cui

Dekun Zhang

Songquan Wang

Wenlu Hang

Ethical considerations

This study was approved by the Medical Ethics Committee ([2022]-101501).

Consent for publication

All authors are aware of and agree to publish.

Author contributions

JJ: Investigation, methodology, and writing—original draft. JH: Data curation and methodology. YC: Methodology and writing—review & editing. DZ: Methodology and conceptualization. HL: Data curation and conceptualization. SW: Investigation, methodology, and writing—review & editing. WH: Funding acquisition, project administration, and writing—review & editing. All authors reviewed and approved this manuscript.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study was supported by National Natural Science Foundation of China (No. 82405130), Natural Science Foundation of Jiangsu Province (No. BK20220236), Occupational Health Research Project Jiangsu Province (No. JSZJ20233202), Health Commission Research Project of Jiangsu Province (No. Z2023023), Research Fund for Doctoral Degree Teachers of Jiangsu Normal University of China (No. 22XXFRS011), and Jiangsu Normal University Postgraduate Research and Practice Innovation Program (No. 2024XKT0651).

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Guarantor

SW.

References

1. Spagnolo, Ryerson, Guler, et al. Occupational interstitial lung diseases. J Intern Med 2023; 294: 798–815.

2. Akira, Suganuma. Imaging diagnosis of pneumoconiosis with predominant nodular pattern: HRCT and pathologic findings. Clin Imaging 2023; 97: 28–33.

3. et al. The potential diagnostic biomarkers for the IgG subclass in coal workers’ pneumoconiosis. J Immunol Res 2023; 2023: 9233386.

4. Zhao, Xie, Wang, et al. Pulmonary rehabilitation for pneumoconiosis: protocol for a systematic review and meta-analysis. BMJ Open 2019; 9: e025891.

5. Mandrioli, Schlünssen, Adám, et al. WHO/ILO work-related burden of disease and injury: protocol for systematic reviews of occupational exposure to dusts and/or fibres and of the effect of occupational exposure to dusts and/or fibres on pneumoconiosis. Environ Int 2018; 119: 174–185.

6. Perlman, Maier. Occupational lung disease. Med Clin North Am 2019; 103: 535–548.

7. Hall, Blackley, Halldin, et al. Current review of pneumoconiosis among US coal miners. Curr Environ Health Rep 2019; 6: 137–147.

8. LeCun, Bengio, Hinton. Deep learning. Nature 2015; 521: 436–444.

9. Krizhevsky, Sutskever, Hinton. Imagenet classification with deep convolutional neural networks. Commun ACM 2017; 60: 84–90.

10. Wang, Yan, Feng, et al. Deep learning models of multi-scale lesion perception attention networks for diagnosis and staging of pneumoconiosis: a comparative study with radiologists. J Imaging Inform Med 2024; 37: 3025–3033.

11. Zhang, Rong, et al. A deep learning-based model for screening and staging pneumoconiosis. Sci Rep 2021; 11: 2201.

12. Zhu, Luo, et al. The development and evaluation of a computerized diagnosis scheme for pneumoconiosis on digital chest radiographs. Biomed Eng Online 2014; 13: 141.

13. Paglia, Astarita. Metabolomics and lipidomics using traveling-wave ion mobility mass spectrometry. Nat Protoc 2017; 12: 797–813.

14. Johnson, Ivanisevic, Siuzdak. Metabolomics: beyond biomarkers and towards mechanisms. Nat Rev Mol Cell Biol 2016; 17: 451–459.

15. Lee, Shin, Lee, et al. Serum levels of TGF-β1 and MCP-1 as biomarkers for progressive coal workers’ pneumoconiosis in retired coal workers: a three-year follow-up study. Ind Health 2014; 52: 129–136.

16. Lee, Shin, Choi. Serum levels of IL-8 and ICAM-1 as biomarkers for progressive massive fibrosis in coal workers’ pneumoconiosis. J Korean Med Sci 2015; 30: 140–144.

17. Zou, Carroll, Liang, et al. Alterations of serum biomarkers associated with lung ventilation function impairment in coal workers: a cross-sectional study. Environ Health 2011; 10: 83.

18. Wang, Peng, Ding, et al. An analysis of targeted serum lipidomics in patients with pneumoconiosis—China, 2022. China CDC Wkly 2023; 5: 849.

19. Chen, Shi, Zhang, et al. Lipidomics profiles and lipid metabolite biomarkers in serum of coal workers’ pneumoconiosis. Toxics 2022a; 10: 496.

20. Goecks, Jalili, Heiser, et al. How machine learning will transform biomedicine. Cell 2020; 181: 92–101.

21. Chen, Shi, Zhang, et al. Screening of serum biomarkers of coal workers’ pneumoconiosis by metabolomics combined with machine learning strategy. Int J Environ Res Public Health 2022b; 19: 7051.

22. Dong, Zhu, Kong, et al. Efficient clinical data analysis for prediction of coal workers’ pneumoconiosis using machine learning algorithms. Clin Respir J 2023; 17: 684–693.

23. Dupré, Malik. Inflammation and cancer: what a surgical oncologist should know. Eur J Surg Oncol 2018; 44: 566–570.

24. Gao, Cai, Fang, et al. Machine learning based early warning system enables accurate mortality risk prediction for COVID-19. Nat Commun 2020; 11: 5033.

25. Cai, Huang, Gao, et al. Artificial intelligence-based models enabling accurate diagnosis of ovarian cancer using laboratory tests in China: a multicentre, retrospective cohort study. Lancet Digit Health 2024; 6: e176–e186.

26. Luo, Guo, et al. Predicting congenital heart defects: a comparison of three data mining methods. PLoS One 2017; 12: e0177811.

27. Mrad, Lahiani, Mefteh-Wali, et al. Predicting bank inactivity: a comparative analysis of machine learning techniques for imbalanced data. Ann Oper Res 2024. DOI: https://doi.org/10.1007/s10479-024-06018-0.

28. Rahman, Zhou, et al. A comprehensive review on machine learning in healthcare industry: classification, restrictions, opportunities and challenges. Sensors 2023; 23: 4178.

29. Liu, et al. Improving random forest and rotation forest for highly imbalanced datasets. Intell Data Anal 2015; 19: 1409–1432.

30. Ghimire, Rogan, Galiano, et al. An evaluation of bagging, boosting, and random forests for land-cover classification in Cape Cod, Massachusetts, USA. GISci Remote Sens 2012; 49: 623–643.

31. Melgani, Bruzzone. Classification of hyperspectral remote sensing images with support vector machines. IEEE Trans Geosci Remote Sens 2004; 42: 1778–1790.

32. Wang, Tang, Huang, et al. A comparative study of different machine learning methods for reservoir landslide displacement prediction. Eng Geol 2022; 298: 106544.

33. Tsoumakas, Katakis. Multi-Label classification: an overview. Int J Data Warehouse Min 2007; 3: 1–13.

34. Koyejo, Natarajan, Ravikumar, et al. Consistent multilabel classification. Adv Neural Inf Process Syst 2015; 28: 3321–3329.

35. Pereira, Plastino, Zadrozny, et al. Correlation analysis of performance measures for multi-label classification. Inf Process Manag 2018; 54: 359–369.

36. Bogatinovski, Todorovski, Džeroski, et al. Comprehensive comparative study of multi-label classification methods. Expert Syst Appl 2022; 203: 117215.

37. Beinecke, Heider. Gaussian Noise up-sampling is better suited than SMOTE and ADASYN for clinical decision making. BioData Min 2021; 14: 49.

38. Ali, Hossain, Kona, et al. An ensemble classification approach for cervical cancer prediction using behavioral risk factors. Healthcare Anal 2024; 5: 100324.

39. Soomro, Mokhtar, Kurnia, et al. Integrity assessment of corroded oil and gas pipelines using machine learning: a systematic review. Eng Fail Anal 2022; 131: 105810.

40. Huang, Liu, Zhang, et al. Surface damage detection for steel wire ropes using deep learning and computer vision techniques. Measurement (Mahwah N J) 2020; 161: 107843.

41. Sancar, Tabrizi. Machine learning approach for the detection of vitamin D level: a comparative study. BMC Med Inf Decis Making 2023; 23: 219.

42. Zadeh, Alsabi, Ramirez-Vick, et al. Characterizing basal-like triple negative breast cancer using gene expression analysis: a data mining approach. Expert Syst Appl 2020; 148: 113253.

43. Xia, Wang, Yang, et al. Performance optimization of support vector machine with oppositional grasshopper optimization for acute appendicitis diagnosis. Comput Biol Med 2022; 143: 105206.

44. Yildirim. Novel statistical regularized extreme learning algorithm to address the multicollinearity in machine learning. IEEE Access 2024; 12: 102355–67.

45. Ogunleye, Wang. Enhanced XGBoost-based automatic diagnosis system for chronic kidney disease. In: 2018 IEEE 14th International Conference on Control and Automation (ICCA). IEEE 2018; 805-10.

46. Walkoff, Hobbs. Chest imaging in the diagnosis of occupational lung diseases. Clin Chest Med 2020; 41: 581–603.

47. Liu. Recent advances in feature selection and its applications. Knowl Inf Syst 2017; 53: 551–577.

48. Senbagamalar, Logeswari. Genetic clustering algorithm-based feature selection and divergent random forest for multiclass cancer classification using gene expression data. Int J Comput Intell Syst 2024; 17: 23.

49. Khan, Tarimer, Alwageed, et al. Effect of feature selection on the accuracy of music popularity classification using machine learning algorithms. Electronics (Basel) 2022; 11: 3518.

50. Peng, Deng, Huang, et al. Serum bilirubin levels and disease severity in patients with pneumoconiosis. Can Respir J 2023; 2023: 5642040.

51. Ellsworth, van Rossum PSN, Mohan, et al. Declarations of independence: how embedded multicollinearity errors affect dosimetric and other complex analyses in radiation oncology. Int J Radiat Oncol Biol Phys 2023; 117: 1054–1062.

52. Wang, Xia, Chen, et al. Prediction and optimization model of sustainable concrete properties using machine learning, deep learning and swarm intelligence: a review. J Build Eng 2023; 80: 108065.

53. Sundus, Hammo, Al-Zoubi, et al. Solving the multicollinearity problem to improve the stability of machine learning algorithms applied to a fully annotated breast cancer dataset. Inform Med Unlocked 2022; 2022: 101088.

54. Feng, Diss, Cheng, et al. Multimetric feature selection for analyzing multicategory outcomes of colorectal cancer: random forest and multinomial logistic regression models. Lab Invest 2021; 102: 236–244.

55. Tasci, Jagasia, Zhuge, et al. Gradwise: a novel application of a rank-based weighted hybrid filter and embedded feature selection method for glioma grading with clinical and molecular characteristics. Cancers (Basel) 2023; 15: 4628.

Machine learning prediction of coal workers’ pneumoconiosis classification based on few-shot clinical data
