Abstract

Background:

Preoperative discrimination between benign and malignant ovarian tumors is important. The applicability of published prediction tools may be limited across different health systems. We aim to develop a machine learning model specifically for Macao’s population to predict the borderline or malignancy risk of ovarian masses using routinely available clinical data in Macao’s health system.

Methods:

The study cohorts were derived from 2 major hospitals in Macao, including 496 patients who underwent oophorectomy or cystectomy for ovarian masses at CHCSJ between January 2014 and December 2023, along with a simulated prospective cohort of 95 patients from CHCSJ between January 2024 and November 2024, and an external validation cohort of 61 patients from KWH between January 2020 and September 2024. Patients’ clinical information, ultrasound features, and laboratory test results before initial treatment were collected. LASSO regression was used for feature selection, and classifiers were developed using various machine learning algorithms. The predictions were compared with postoperative pathological diagnoses. The predictive performance was also compared with the RMI-4.

Results:

Age, menopausal status, 5 ultrasound features, and 7 laboratory tests were identified as predictors of borderline and malignant ovarian tumors. An ensemble learning model based on a voting classifier was selected as the final model. Our model outperformed RMI-4 in the internal test set, simulated prospective cohort, and external validation cohort, achieving an area under the curve (AUC) of 0.923–0.951 (vs 0.810–0.868, P < .05). Decision curve analysis demonstrated superior clinical utility, and SHAP analysis confirmed its interpretability.

Conclusions:

We propose a machine learning model targeting Macao’s population for predicting the borderline or malignancy risk of ovarian masses. Our model is accurate, low-cost, easily accessible, and interpretable. Without requiring any workflow changes, machine learning techniques can maximize the predictive potential of routinely available clinical data in a specific health system.

Keywords

Ovarian cancer, prediction model, artificial intelligence, machine learning, risk of malignancy index

Introduction

Ovarian tumors arise from various types of cells within the ovary, including epithelial cells, sex cord-stromal cells, and germ cells. According to the World Health Organization (WHO) tumor classification, ovarian tumors are categorized as benign, borderline, or malignant based on their pathological features.¹ Determining the nature of ovarian masses prior to surgery is crucial for developing an appropriate management strategy and surgical plan. Currently, the acceptable diagnostic tests for preoperative assessment of ovarian masses include serum markers such as cancer antigen 125 (CA125), human epididymis protein 4 (HE4), as well as ultrasound and magnetic resonance imaging (MRI). CA125 has high sensitivity for advanced ovarian cancer but poor specificity in premenopausal women, whereas HE4 offers better specificity in this group, and good sensitivity for early-stage cancers.² Ultrasound is usually the first-line tool for evaluating the size, structure (cystic, solid, or complex), and blood flow of ovarian masses, while MRI provides detailed images to characterize complex or indeterminate masses. Nevertheless, due to limitations such as anatomical location, rupture, bleeding, and tumor implantation, preoperative biopsy is not routinely employed. The pathological diagnosis of ovarian masses mainly relies on intraoperative frozen section analysis and postoperative pathological examination.

Since the 1990s, several tools have been proposed for predicting the risk of ovarian malignancy, including Risk of Malignancy Index (RMI),³,⁴ Risk of Ovarian Malignancy Algorithm (ROMA),⁵ International Ovarian Tumour Analysis (IOTA) Simple Rules,⁶ Copenhagen Index (CPH-I),⁷ Ovarian-Adnexal Reporting & Data System (O-RADS),⁸ and IOTA ADNEX.⁹ These tools predict the benign or malignant nature of ovarian masses based on patients’ clinical information, serum tumor markers, and/or imaging features. In recent years, with advancements in artificial intelligence (AI), several machine learning-based models with stronger predictive capabilities have been proposed.¹⁰⁻¹² However, the performance of these prediction tools may be influenced by factors such as ethnicity, tumor type and stage, and type of healthcare facility (eg, cancer center vs general hospital). Local validation studies are essential to identify the most appropriate tool and determine the optimal cutoff values for a specific population.¹³

Macao is a special administrative region of the People’s Republic of China. The overwhelming majority (89.4%) of Macao’s population is Chinese, while the remainder consists of non-ethnic Chinese minorities, including Filipinos, Portuguese, and Vietnamese.¹⁴ More than 70% of borderline and malignant ovarian tumors in Macao are diagnosed at an early stage (International Federation of Gynecology and Obstetrics [FIGO] stage I). To our knowledge, serum HE4 testing and specific imaging protocols are not routinely performed for patients with ovarian masses in most hospitals in Macao, thereby limiting the applicability of the aforementioned prediction tools. Additionally, the ethnicity and stage distribution of ovarian tumors in Macao may differ from those of the population on which these tools were originally developed.

To address these limitations, we attempt to develop a machine learning-based model specifically for Macao’s population, using routinely available clinical data to predict the risk of borderline or malignant ovarian masses. We also compare its predictive performance to that of the Risk of Malignancy Index 4 (RMI-4), which is one of the few prediction tools that can be retrospectively evaluated in our study cohorts. Our model aims to identify ovarian masses at high risk of being borderline or malignant, which may require additional consideration in management,¹⁵,¹⁶ and to provide an alternative tool for risk stratification before adapting the workflow of a specific health system to accommodate other prediction tools.

Materials and Methods

Study cohorts

The study cohorts were derived from 2 major hospitals in Macao. The primary study cohort consisted of patients who underwent oophorectomy or cystectomy for ovarian masses at Conde de São Januário Hospital (CHCSJ cohort). The inclusion criterion was a pathologically confirmed primary benign, borderline, or malignant ovarian tumor. The exclusion criteria were as follows: (1) physiological changes such as cystic follicle or cystic corpus luteum less than 3 cm in greatest dimension,¹ (2) coexistence with other primary malignancies, (3) metastatic ovarian cancer, (4) incomplete clinicopathological information, and (5) conditions that could significantly alter laboratory test results, such as end-stage renal disease.

The CHCSJ cohort was divided into a retrospective cohort and a simulated prospective cohort, using January 1, 2024, as the cutoff point. The retrospective cohort consisted of patients treated from January 2014 to December 2023. Because benign ovarian tumors greatly outnumber borderline and malignant ovarian tumors in practice (approximately 5-7:1 at CHCSJ), we reduced class imbalance by including all borderline and malignant cases from this period but only the benign cases treated from January 2019 to December 2023. The simulated prospective cohort included all patients with benign, borderline, and malignant ovarian tumors treated between January 2024 and November 2024.

The external validation cohort was contributed by Kiang Wu Hospital (KWH cohort), using the same inclusion and exclusion criteria, and included patients treated from January 2020 to September 2024.

As borderline ovarian tumors exhibit low malignant potential and their management differs from that of benign ovarian tumors,¹⁵,¹⁶ we classified borderline and malignant ovarian tumors into the same group in this study. An overview of the study cohorts is shown in Figure 1.

Figure 1.

Flowchart of our model development.

Data collection

The following data were extracted from patients’ medical records: age, menopausal status, 7 ultrasound features, complete blood count, 8 biochemistry tests, 5 serum tumor markers, pathological diagnosis, and the FIGO stage. We selected the most recent results of ultrasound and laboratory tests performed before the initial operation or neoadjuvant chemotherapy. These features were routinely available for patients in Macao, and some have been reported to be associated with ovarian cancer.

Seven indexes were calculated using the following formulas: Platelet-to-lymphocyte ratio (PLR) = Platelet count (Plt)/Lymphocyte count (Lym); Neutrophil-to-lymphocyte ratio (NLR) = Neutrophil count (Neu)/Lym; Monocyte-to-lymphocyte ratio (MLR) = Monocyte count (Mon)/Lym; Systemic inflammatory response index (SIRI) = (Neu × Mon) / Lym; Systemic immune-inflammation index (SII) = (Neu × Plt) / Lym; AST-to-ALT ratio (AAR) = Aspartate transaminase (AST)/Alanine transaminase (ALT); AST/ALT-to-platelet ratio index (AARPRI) = (AST/ALT × 150)/Plt.
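The seven formulas above can be sketched as a small Python helper. This is only an illustration: the `BloodPanel` class, its field names, and the example values are hypothetical and do not come from the study dataset.

```python
from dataclasses import dataclass


@dataclass
class BloodPanel:
    """Raw laboratory values (field names are illustrative assumptions)."""
    plt: float  # platelet count
    lym: float  # lymphocyte count
    neu: float  # neutrophil count
    mon: float  # monocyte count
    ast: float  # aspartate transaminase
    alt: float  # alanine transaminase


def derived_indexes(p: BloodPanel) -> dict:
    """Compute the seven derived indexes exactly as defined in the text."""
    aar = p.ast / p.alt
    return {
        "PLR": p.plt / p.lym,
        "NLR": p.neu / p.lym,
        "MLR": p.mon / p.lym,
        "SIRI": p.neu * p.mon / p.lym,
        "SII": p.neu * p.plt / p.lym,
        "AAR": aar,
        "AARPRI": aar * 150 / p.plt,
    }


# Hypothetical patient values, chosen only to make the arithmetic easy to follow.
panel = BloodPanel(plt=300.0, lym=1.5, neu=4.5, mon=0.6, ast=20.0, alt=16.0)
indexes = derived_indexes(panel)
```

For this made-up panel, PLR is 300/1.5 = 200 and AAR is 20/16 = 1.25.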

The RMI-4 was calculated using the formula:⁴ RMI-4 = M × U × S × CA125. “M” represented menopausal status (menopause = 4, premenopause = 1). “U” was the count of presence of 6 ultrasound features (0 feature = 0, 1 feature = 1, more than 1 feature = 4). “S” represented tumor size (< 7 cm = 1, ⩾ 7 cm = 2).
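As a minimal sketch, the scoring rules just described can be written as a single function. The function and parameter names are hypothetical. Treating a score at or above 450 as high risk is an assumption; the text only states that a cutoff of 450 was used.

```python
def rmi4(postmenopausal: bool, n_ultrasound_features: int,
         size_cm: float, ca125: float) -> float:
    """RMI-4 = M x U x S x CA125, scored as described in the text above."""
    m = 4 if postmenopausal else 1
    if n_ultrasound_features == 0:
        u = 0
    elif n_ultrasound_features == 1:
        u = 1
    else:
        u = 4
    s = 2 if size_cm >= 7 else 1
    return m * u * s * ca125


# Hypothetical premenopausal patient: 2 ultrasound features, 8 cm mass, CA125 of 60 U/mL.
example_score = rmi4(postmenopausal=False, n_ultrasound_features=2,
                     size_cm=8.0, ca125=60.0)
# Assumption: scores at or above the cutoff of 450 are labeled high risk.
example_high_risk = example_score >= 450
```

Here the score is 1 × 4 × 2 × 60 = 480, which is above the cutoff.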

The full list of features collected for analysis is shown in Supplementary Table S1. Serum HE4 and MRI features were not included in this study because they were not routinely performed in our study cohorts. Additionally, ethnicity was collected for demographic description. All data were de-identified by removing personal information including names, dates of birth, medical record numbers, and any data that would allow someone to be identified. We confirm that all methods were performed in accordance with the 1964 Declaration of Helsinki and its later amendments.

Data preprocessing and feature selection

The CHCSJ retrospective cohort was randomly split, with stratification, into a training set (80%) and a test set (20%). The training set was used to develop machine learning models, while the test set, simulated prospective cohort, and external validation cohort were used to validate the final model. Features with more than 15% missing values were removed, and the remaining missing values were imputed using mean imputation. The data were normalized using min-max scaling. Data balancing was performed on the training set by applying naive random over-sampling to the borderline/malignant group. Least Absolute Shrinkage and Selection Operator (LASSO) regression was applied to the training set to select the most relevant features.
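The preprocessing steps described above (dropping sparse features, mean imputation, min-max scaling, and naive random over-sampling) can be sketched in plain Python as follows. This is an illustrative sketch, not the study code. The function names and toy data are hypothetical, and learning the imputation and scaling statistics from the training rows only is an assumption about the workflow. LASSO feature selection is not shown.

```python
import random


def drop_sparse_features(rows, max_missing=0.15):
    """Return indexes of columns whose missing fraction (None) is at most max_missing."""
    n = len(rows)
    return [j for j in range(len(rows[0]))
            if sum(r[j] is None for r in rows) / n <= max_missing]


def fit_impute_scale(rows):
    """Learn each column's mean, minimum, and maximum from the observed values."""
    stats = []
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not None]
        stats.append((sum(observed) / len(observed), min(observed), max(observed)))
    return stats


def apply_impute_scale(rows, stats):
    """Mean-impute missing values, then min-max scale each column."""
    scaled = []
    for r in rows:
        new_row = []
        for v, (mean, lo, hi) in zip(r, stats):
            v = mean if v is None else v
            new_row.append((v - lo) / (hi - lo) if hi > lo else 0.0)
        scaled.append(new_row)
    return scaled


def random_oversample(rows, labels, minority=1, seed=0):
    """Duplicate randomly chosen minority rows until both classes have equal size."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    n_extra = (len(labels) - len(minority_idx)) - len(minority_idx)
    extra = [rng.choice(minority_idx) for _ in range(n_extra)]
    idx = list(range(len(labels))) + extra
    return [rows[i] for i in idx], [labels[i] for i in idx]


# Toy training data: 3 features, None marks a missing value (all values hypothetical).
X_train = [[30.0, 10.0, None], [40.0, None, 1.0], [50.0, 20.0, None], [60.0, 30.0, 2.0],
           [70.0, 40.0, None], [35.0, 15.0, 3.0], [45.0, 25.0, None], [55.0, 35.0, 4.0]]
y_train = [0, 0, 0, 0, 0, 0, 1, 1]

keep = drop_sparse_features(X_train)            # third column is 50% missing, so it is dropped
X_kept = [[r[j] for j in keep] for r in X_train]
stats = fit_impute_scale(X_kept)
X_scaled = apply_impute_scale(X_kept, stats)
X_balanced, y_balanced = random_oversample(X_scaled, y_train)
```

Under the same assumption, the test set and validation cohorts would be transformed with `apply_impute_scale` using the statistics learned from the training set.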

Model development

We created classifiers using 9 classic machine learning algorithms, including logistic regression, K-nearest neighbors, support vector machine (SVC), Nu-support vector machine (Nu-SVC), decision tree, random forest, adaptive boosting (AdaBoost),¹⁷ extreme gradient boosting (XGBoost),¹⁸ and light gradient boosting machine (LGBM).¹⁹ A voting classifier was also created by combining the top-performing classifiers from different algorithms through soft voting. The selected features were used as input variables. The output of the classifiers was either “benign” or “borderline/malignant,” ie, a binary classification task. The pathological diagnosis was regarded as the ground truth label. The classifiers were first trained using 10-fold cross-validation. The hyperparameters of the classifiers were manually adjusted based on a grid search. Finally, the classifiers were refitted on the entire training set using the best-found hyperparameters. The Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis-Artificial Intelligence (TRIPOD + AI) reporting guideline was followed during the development and validation of the prediction model (see Supplementary Material for the completed checklist).²⁰ The flowchart of the model development process is shown in Figure 1.
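Soft voting, as used for the voting classifier, averages the class probabilities predicted by the base classifiers and then applies a cutoff. The sketch below shows the idea with equal weights in plain Python. The probabilities, the cutoff, and the function names are hypothetical, and the study itself used scikit-learn rather than this code.

```python
def soft_vote(probabilities_per_model):
    """Average each case's predicted probability across models (equal weights)."""
    n_models = len(probabilities_per_model)
    n_cases = len(probabilities_per_model[0])
    return [sum(model[i] for model in probabilities_per_model) / n_models
            for i in range(n_cases)]


def classify(probabilities, cutoff):
    """Label a case borderline/malignant when its probability reaches the cutoff."""
    return ["borderline/malignant" if p >= cutoff else "benign" for p in probabilities]


# Hypothetical borderline/malignant probabilities from 3 base classifiers for 4 patients.
model_probs = [
    [0.10, 0.80, 0.55, 0.40],
    [0.20, 0.90, 0.65, 0.30],
    [0.30, 0.70, 0.75, 0.20],
]
ensemble_probs = soft_vote(model_probs)
ensemble_labels = classify(ensemble_probs, cutoff=0.6)
```

The third patient illustrates the benefit of averaging: one base classifier falls below the cutoff, but the ensemble mean of 0.65 is above it.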

Due to security concerns, most hospital workstations were unable to access the Internet. Therefore, we developed a user-friendly desktop application that can be used offline. Clinicians can use our model on computers with restricted Internet access in clinics or wards.

Statistical Analysis

For patients’ characteristics, continuous variables were reported as mean ± standard deviation and compared using the Mann-Whitney U test. Categorical variables were stated as count (%) and compared using the chi-square test or Fisher exact test.

To evaluate predictive performance, the area under receiver operating characteristic curve (AUC), sensitivity, specificity, positive predictive value, negative predictive value, macro F1 score, and accuracy were used. The macro F1 score is defined as: $\text{Macro F1} = \frac{1}{C} \sum_{i=1}^{C} \text{F1}_{i}$, where $C$ is the total number of classes, and $\text{F1}_{i}$ is the F1 score of the $i$th class. AUCs were compared using DeLong’s test. Optimal cutoff values for prediction models were determined using the maximal Youden index.²¹ A P-value < .05 was considered statistically significant.
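Three of the metrics above can be illustrated with a short, dependency-free sketch: AUC computed as the probability that a positive case outscores a negative one, the cutoff that maximizes the Youden index (sensitivity + specificity − 1), and the macro F1 score for the binary case (C = 2). The function names and toy scores are hypothetical, and the sketch is not the study code.

```python
def auc(scores, labels):
    """Fraction of positive/negative pairs where the positive scores higher (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def confusion(scores, labels, cutoff):
    """Return (TP, FP, FN, TN) when scores >= cutoff are called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 0)
    return tp, fp, fn, tn


def youden_cutoff(scores, labels):
    """Scan observed scores and return the first cutoff with the maximal Youden index."""
    best_cutoff, best_j = None, -1.0
    for c in sorted(set(scores)):
        tp, fp, fn, tn = confusion(scores, labels, c)
        j = tp / (tp + fn) + tn / (tn + fp) - 1
        if j > best_j:
            best_cutoff, best_j = c, j
    return best_cutoff, best_j


def f1(tp, fp, fn):
    """F1 score of one class from its true positives, false positives, false negatives."""
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def macro_f1(tp, fp, fn, tn):
    """Average of the positive-class F1 and the negative-class F1 (roles swapped)."""
    return (f1(tp, fp, fn) + f1(tn, fn, fp)) / 2


# Toy predicted scores and ground-truth labels (1 = borderline/malignant), all hypothetical.
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
toy_labels = [0, 0, 0, 1, 0, 1, 1, 1]

toy_auc = auc(toy_scores, toy_labels)
toy_cutoff, toy_j = youden_cutoff(toy_scores, toy_labels)
toy_macro_f1 = macro_f1(*confusion(toy_scores, toy_labels, toy_cutoff))
```

In this toy data, two cutoffs tie on the Youden index and the sketch keeps the lower one. How ties were resolved in the study is not stated.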

All statistical analysis was performed using R (version 4.3.1; https://www.r-project.org) and Python (version 3.11; https://www.python.org). The R package “glmnet” was used for LASSO. The Python packages “scikit-learn” (version 1.3.2), “xgboost,” and “lightgbm” were used for machine learning algorithms.

Results

Baseline characteristics

A total of 676 patients with ovarian masses were identified at CHCSJ, of whom 85 were excluded according to our criteria. The main reason for exclusion was missing data essential for the calculation of RMI-4 (n = 76) due to urgent operation (such as torsion or rupture) or incidentally found ovarian masses during other operations (such as Cesarean sections). Other reasons for exclusion included coexistence with other malignancies (n = 6) and markedly altered laboratory test results due to end-stage renal disease or hematopoietic disorder (n = 3). Finally, a total of 591 patients were included in the entire CHCSJ cohort, in which the retrospective cohort consisted of 496 patients (351 benign tumors, 48 borderline tumors, and 97 malignant tumors), and the simulated prospective cohort consisted of 95 patients (83 benign tumors, 7 borderline tumors, and 5 malignant tumors).

The training set and test set, which were split from CHCSJ retrospective cohort, consisted of 396 and 100 patients respectively. The training set included 280 benign tumors, 39 borderline tumors, and 77 malignant tumors, whereas the test set included 71 benign tumors, 9 borderline tumors, and 20 malignant tumors. Baseline characteristics between the training and test sets showed no significant differences in age, ethnicity, or other key clinical features. This ensured that the test set was representative of the studied population, thereby reducing potential bias.

The KWH cohort included 61 patients with ovarian masses, comprising 30 benign tumors, 14 borderline tumors, and 17 malignant tumors. The demographic and clinical characteristics of the CHCSJ and KWH cohorts are summarized in Tables 1 to 3. For details of ultrasound features and laboratory test results, see Supplementary Table S2.

Table 1.

Demographic and clinical characteristics of CHCSJ and KWH cohorts.

	CHCSJ cohort (n = 591)						KWH cohort (n = 61)
	Retrospective cohort (n = 496)			Simulated prospective cohort (n = 95)			External validation
	Benign	BOT / Malignant	P-value	Benign	BOT / Malignant	P-value	Benign	BOT / Malignant	P-value
Number of patients	351	145		83	12		30	31
Ethnicity
Chinese	307 (87%)	127 (88%)	0.775	79 (95%)	12 (100%)	1.00	27 (90%)	30 (97%)	0.354
Other Asian	31 (9%)	11 (7%)		3 (4%)	0 (0%)		3 (10%)	1 (3%)
Caucasian	13 (4%)	7 (5%)		1 (1%)	0 (0%)		0 (0%)	0 (0%)
Age (years)
Mean ± SD	43.3 ± 14.0	51.8 ± 14.6	< 0.001	45.2 ± 15.9	57.3 ± 10.2	0.00535	42.3 ± 13.4	52.8 ± 12.3	0.00556
Range	12-84	13-83		15-89	35-72		23-71	33-82
Menopausal status
Menopause	86 (25%)	79 (54%)	< 0.001	25 (30%)	9 (75%)	0.00711	6 (20%)	14 (45%)	0.0364
Premenopause	265 (75%)	66 (46%)		58 (70%)	3 (25%)		24 (80%)	17 (55%)
Ultrasound features (present)
Multiloculated cysts (US1)	74 (21%)	58 (40%)	< 0.001	17 (20%)	5 (42%)	0.141	7 (23%)	25 (81%)	< 0.001
Solid areas (US2)	125 (36%)	126 (87%)	< 0.001	30 (36%)	12 (100%)	< 0.001	7 (23%)	25 (81%)	< 0.001
Bilateral lesions (US3)	51 (15%)	23 (16%)	0.705	18 (22%)	1 (8%)	0.449	4 (13%)	5 (16%)	1
Ascites (US4)	26 (7%)	59 (41%)	< 0.001	4 (5%)	3 (25%)	0.041	4 (13%)	15 (48%)	0.00312
Intra-abd. metastases (US5)	0 (0%)	13 (9%)	< 0.001	0 (0%)	1 (8%)	0.126	0 (0%)	4 (13%)	0.113
Size ⩾ 7 cm (US6)	144 (41%)	122 (84%)	< 0.001	31 (37%)	10 (83%)	0.00265	20 (67%)	28 (90%)	0.0241
Dermoid cyst (US7)	61 (17%)	2 (1%)	< 0.001	18 (22%)	0 (0%)	0.114	8 (27%)	2 (6%)	0.0426
Laboratory markers (selected)
CA125 (U/mL)	65.6 ± 178.5	640.7 ± 1333.9	< 0.001	51.6 ± 101.0	175.9 ± 363.8	0.114	98.7 ± 281.9	200.2 ± 377.5	0.0887
Urea (mmol/L)	4.2 ± 1.3	4.1 ± 2.0	0.0415	4.7 ± 1.7	4.7 ± 1.7	0.954	4.2 ± 1.1	4.6 ± 1.2	0.283
AST (U/L)	18.6 ± 8.1	20.5 ± 11.2	0.23	18.6 ± 8.8	20.2 ± 7.1	0.253	21.5 ± 0.7	22.6 ± 8.7	0.857
ALT (U/L)	17.2 ± 12.4	14.7 ± 8.9	0.0446	17.1 ± 12.6	15.7 ± 6.7	0.829	15.2 ± 10.4	17.7 ± 12.9	0.343
AAR	1.3 ± 0.5	1.6 ± 1.0	< 0.001	1.3 ± 0.6	1.4 ± 0.4	0.622	1.4 ± 0.0	1.4 ± 0.2	0.857
WBC (×10⁹/L)	7.2 ± 3.4	8.0 ± 3.7	0.0084	7.5 ± 2.5	7.5 ± 4.3	0.211	6.1 ± 2.6	7.3 ± 3.2	0.0322
Lymphocytes (×10⁹/L)	1.9 ± 0.6	1.5 ± 0.6	< 0.001	1.8 ± 0.7	1.7 ± 0.7	0.381	1.5 ± 0.4	1.6 ± 0.5	0.795
Platelets (×10⁹/L)	271.9 ± 70.6	301.8 ± 118.6	0.0191	272.6 ± 69.1	292.1 ± 96.8	0.583	283.7 ± 62.3	290.2 ± 82.6	0.751
PLR	158.3 ± 69.2	225.7 ± 141.6	< 0.001	175.1 ± 94.1	198.4 ± 101.7	0.463	198.5 ± 63.8	212.4 ± 117.1	0.869
FIGO stage
I		105 (72%)			10 (83%)			24 (77%)
II to IV		40 (28%)			2 (17%)			7 (23%)

Abbreviations: AAR, AST-to-ALT ratio; ALT, alanine transaminase; AST, aspartate transaminase; BOT, borderline ovarian tumor; CA125, cancer antigen 125; PLR, platelet-to-lymphocyte ratio; WBC, white blood cell count.

Table 2.

Train-test split of CHCSJ retrospective cohort.

	Training set (n = 396)	Test set (n = 100)	P-value
Number of patients
Benign	280 (71%)	71 (71%)	1.00
Borderline/Malignant	116 (29%)	29 (29%)
Age (years)	45.6 ± 14.5	46.5 ± 15.5	0.617
Menopausal status
Menopause	127 (32%)	38 (38%)	0.261
Premenopause	269 (68%)	62 (62%)
Ultrasound features (present)
Multiloculated cysts	102 (26%)	30 (30%)	0.391
Solid areas	198 (50%)	53 (53%)	0.592
Bilateral lesions	55 (14%)	19 (19%)	0.2
Ascites	65 (16%)	20 (20%)	0.395
Intra-abd. metastases	10 (3%)	3 (3%)	0.731
Size ⩾ 7 cm	215 (54%)	51 (51%)	0.555
Dermoid cyst	46 (12%)	17 (17%)	0.149
Laboratory tests (selected)
CA125 (U/mL)	217.2 ± 680.0	299.4 ± 1092.5	0.116
Urea (mmol/L)	4.2 ± 1.6	4.3 ± 1.3	0.161
AST (U/L)	18.6 ± 8.2	21.2 ± 12.0	0.107
ALT (U/L)	16.2 ± 11.4	17.5 ± 11.9	0.321
AAR	1.3 ± 0.5	1.5 ± 1.2	0.8
WBC (×10⁹/L)	7.4 ± 3.6	7.4 ± 3.1	0.541
Lymphocytes (×10⁹/L)	1.8 ± 0.6	1.8 ± 0.7	0.414
Platelets (×10⁹/L)	283.2 ± 89.7	270.4 ± 82.5	0.363
PLR	177.1 ± 97.2	181.6 ± 114.4	0.989
FIGO stage
I	86/116 (74%)	19/29 (66%)	0.486
II to IV	30/116 (26%)	10/29 (34%)

Table 3.

Pathological diagnoses of CHCSJ and KWH cohorts.

Pathological diagnosis	CHCSJ cohort	KWH cohort
Benign	n = 434	n = 30
Epithelial tumor
Serous tumor	67 (15.4%)	8 (26.7%)
Seromucinous tumor	6 (1.4%)	0 (0%)
Mucinous tumor	47 (10.8%)	8 (26.7%)
Endometriotic tumor	167 (38.5%)	7 (23.3%)
Brenner tumor	1 (0.2%)	0 (0%)
Sex cord-stromal tumor
Fibroma / Thecoma	21 (4.8%)	0 (0%)
Sclerosing stromal tumor	1 (0.2%)	0 (0%)
Germ cell tumor
Mature teratoma / Struma ovarii	102 (23.5%)	7 (23.3%)
Miscellaneous tumor
Follicular cyst	5 (1.2%)	0 (0%)
Corpus luteum cyst	5 (1.2%)	0 (0%)
Simple cyst	8 (1.8%)	0 (0%)
Others	4 (0.9%)	0 (0%)
Borderline / Malignant	n = 157	n = 31
Epithelial tumor (borderline)
Serous borderline tumor	14 (8.9%)	4 (12.9%)
Seromucinous borderline tumor	5 (3.2%)	0 (0%)
Mucinous borderline tumor	30 (19.1%)	10 (32.3%)
Endometrioid borderline tumor	2 (1.3%)	0 (0%)
Borderline Brenner tumor	3 (1.9%)	0 (0%)
Epithelial tumor (malignant)
Serous carcinoma	35 (22.3%)	2 (6.4%)
Mucinous carcinoma	9 (5.7%)	1 (3.2%)
Seromucinous carcinoma	2 (1.3%)	0 (0%)
Endometrioid carcinoma	18 (11.5%)	8 (25.8%)
Clear cell carcinoma	23 (14.6%)	6 (19.4%)
Carcinosarcoma	2 (1.3%)	0 (0%)
Sex cord-stromal tumor
Granulosa cell tumor	6 (3.8%)	0 (0%)
Sertoli-Leydig cell tumor	1 (0.6%)	0 (0%)
Germ cell tumor
Immature teratoma	3 (1.9%)	0 (0%)
Yolk sac tumor	2 (1.3%)	0 (0%)
Dysgerminoma	1 (0.6%)	0 (0%)
Miscellaneous tumor
Extrarenal Wilms tumor	1 (0.6%)	0 (0%)

Features selected by LASSO regression

In the data preprocessing step, 5 features (total bilirubin, CEA, AFP, CA19-9, and CA15-3) were removed due to having more than 15% missing values in the training set. The remaining 33 features were eligible for feature selection step. Finally, 14 features relevant to distinguishing between benign and borderline/malignant ovarian tumors were identified using LASSO regression. These features included age, menopausal status, 5 ultrasound features (multiloculated cysts, solid areas, intra-abdominal metastases, tumor size ⩾ 7 cm, and suggestive of dermoid cyst), and 7 laboratory tests (CA125, Urea, ALT, AAR, WBC, Lym, and PLR).

Performance of classifiers on the training set

We used 80% of the CHCSJ retrospective cohort as the training set to develop and compare different classifiers. The AUCs of the classifiers in 10-fold cross-validation on the training set were shown in Figure 2A. Among the 9 foundation classifiers, the decision tree classifier performed the worst (0.833 ± 0.057). The average AUCs of the other 8 classifiers ranged from 0.913 to 0.958. However, no single classifier could significantly outperform all the others. We selected 5 relatively top-performing classifiers from various algorithms, including logistic regression classifier, K-nearest neighbors classifier, support vector classifier, random forest classifier, and AdaBoost classifier, to create a voting classifier. The AUC of the voting classifier in 10-fold cross-validation was 0.950 ± 0.024. For details of 10-fold cross-validation results, please see Supplementary Table S3.

Figure 2.

Performance of classifiers. (A) AUCs of classifiers in 10-fold cross-validation on training set. (B) ROC curves of classifiers on all unseen data.

Evaluating the predictive performance of classifiers on the test set, simulated prospective cohort, and external validation cohort

Three datasets were used to evaluate our models that were all unseen during the training process. The first dataset was the internal test set split from the CHCSJ retrospective cohort. The second dataset was the simulated prospective cohort from CHCSJ, used to assess the models’ ability to predict “future” cases. The third dataset was the external validation cohort from KWH, which was used to evaluate the generalization ability of the models and their adaptability to Macao’s population.

The ROC curves of the classifiers on all 3 evaluation datasets are shown in Figure 2B. The 3 models with the highest AUC were the Nu-support vector machine classifier (0.94), voting classifier (0.94), and logistic regression classifier (0.93). Using the corresponding optimal cutoff values, their accuracies were 0.895, 0.898, and 0.844, respectively, and their macro F1 scores were 0.875, 0.876, and 0.825, respectively. Considering AUC, accuracy, and F1 score together, we judged that the voting classifier performed most robustly in practice. Therefore, an ensemble learning model based on the voting classifier was selected as our final model. The optimal cutoff value was 0.5999, which corresponded to the maximal Youden index of 0.7657 on the evaluation data. For details of the performance of classifiers on evaluation data, see Supplementary Table S4.

The AUCs of our final model reached 0.934 (95% confidence interval [CI]: 0.879-0.990), 0.951 (95% CI: 0.909-0.992) and 0.923 (95% CI: 0.854-0.991) on the test set, simulated prospective cohort, and external validation cohort, respectively (all P-values < .05 when compared with the corresponding RMI-4) (Figure 3A to C). Using the optimal cutoff for our model and a cutoff of 450 for RMI-4,⁴ our model demonstrated a superior ability in identifying borderline and malignant tumors across all evaluation datasets. Consequently, our model outperformed RMI-4 in all metrics, including sensitivity (0.774-0.917 vs 0.500-0.655), specificity (0.900-0.930 vs 0.867-0.916), positive predictive value (0.611-0.889 vs 0.400-0.800), negative predictive value (0.794-0.987 vs 0.634-0.925), macro F1 score (0.733-0.867 vs 0.444-0.704), and accuracy (0.836-0.920 vs 0.689-0.842) (Table 4).

Figure 3.

Predictive performance of our model and RMI-4. ROC curves for predicting risk of borderline/malignancy on test set, simulated prospective cohort, and external validation cohort are shown in (A) to (C), respectively. The precision-recall (PR) curve and decision curve analysis of our model and RMI-4 on all unseen data are shown in (D) and (E), respectively.

Table 4.

Predictive performance of our model and RMI-4 on evaluation datasets.

	AUC	Sensitivity	Specificity	PPV	NPV	F1 score	Accuracy
Test set
Our model	0.934 *	0.897	0.930	0.839	0.957	0.867	0.920
RMI-4	0.868	0.655	0.916	0.760	0.867	0.704	0.840
Simulated prospective cohort
Our model	0.951 *	0.917	0.916	0.611	0.987	0.733	0.916
RMI-4	0.853	0.500	0.892	0.400	0.925	0.444	0.842
External validation
Our model	0.923 *	0.774	0.900	0.889	0.794	0.828	0.836
RMI-4	0.810	0.516	0.867	0.800	0.634	0.628	0.689

Abbreviations: AUC, area under the curve; NPV, negative predictive value; PPV, positive predictive value.

* P-value < .05 compared with RMI-4; see Figure 3A to C.

In the decision curve analysis (Figure 3E), our model demonstrated a greater net benefit across a wide range of threshold probabilities compared with the treat-all and treat-none strategies, as well as RMI-4.
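Decision curve analysis compares strategies by their net benefit at a threshold probability p_t, conventionally computed as TP/n − (FP/n) × p_t/(1 − p_t). The sketch below applies this standard formula to toy data. The function names and values are hypothetical, and the study's own implementation is not described in the text.

```python
def net_benefit(probabilities, labels, threshold):
    """Net benefit of treating every case whose predicted probability reaches the threshold."""
    n = len(labels)
    tp = sum(1 for p, y in zip(probabilities, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probabilities, labels) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)


def net_benefit_treat_all(labels, threshold):
    """Net benefit of treating everyone; the treat-none strategy is 0 by definition."""
    prevalence = sum(labels) / len(labels)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Toy predicted probabilities and ground-truth labels (1 = borderline/malignant), hypothetical.
dca_probs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
dca_labels = [0, 0, 0, 1, 0, 1, 1, 1]

nb_model = net_benefit(dca_probs, dca_labels, threshold=0.5)
nb_all = net_benefit_treat_all(dca_labels, threshold=0.5)
```

A decision curve repeats this calculation over a range of thresholds and plots the net benefit of each strategy against the threshold.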

Discordant cases

Among the 72 borderline and malignant tumors evaluated, 11 cases (15.3%) were incorrectly classified as benign tumors (ie, false negative) by our model. They included mucinous borderline tumor (n = 7), serous borderline tumor (n = 2), immature teratoma (n = 1), and FIGO stage I endometrioid carcinoma (n = 1). Notably, all 19 FIGO stage II to IV malignant tumors were correctly predicted by our model. In contrast, 31 cases (43.1%) were misclassified by RMI-4, including 21 borderline tumors, 8 FIGO stage I malignant tumors, and 2 FIGO stage III malignant tumors. The overall false negative rates of our model and RMI-4 differed significantly (P < .01). By subgroup, the false negative rates of our model for borderline tumors, FIGO stage I malignant tumors, and FIGO stage II to IV malignant tumors were 30.0%, 8.7%, and 0%, respectively, whereas those of RMI-4 were 70.0%, 34.8%, and 10.5%, respectively.

Among the 184 benign tumors evaluated, 16 cases (8.7%) were incorrectly classified as borderline/malignant (ie, false positive) by our model and 19 cases (10.3%) by RMI-4. The false positive rates of our model and RMI-4 did not differ significantly (P > .05). The false positive cases of our model included endometriotic cyst (n = 5), mucinous cystadenoma (n = 5), fibroma or fibrothecoma (n = 4), serous cystadenoma (n = 1), and mature cystic teratoma (n = 1).
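The rates reported in this section can be reproduced from the case counts with simple arithmetic, as the sketch below shows. The error counts are taken from the text. The subgroup denominators for borderline tumors (30 = 9 + 7 + 14) and FIGO stage I malignant tumors (23 = 72 − 30 − 19) are not stated directly and are derived here from the cohort sizes, so they should be read as an assumption.

```python
def rate(errors, total):
    """Percentage rounded to one decimal place, as reported in the text."""
    return round(100 * errors / total, 1)


# Denominators: 19 is stated in the text; 30 and 23 are derived (see the lead-in).
n_positive = {"borderline": 30, "stage I": 23, "stage II-IV": 19}
fn_model = {"borderline": 9, "stage I": 2, "stage II-IV": 0}
fn_rmi4 = {"borderline": 21, "stage I": 8, "stage II-IV": 2}

overall_fn_model = rate(sum(fn_model.values()), sum(n_positive.values()))  # 11 of 72
overall_fn_rmi4 = rate(sum(fn_rmi4.values()), sum(n_positive.values()))    # 31 of 72
subgroup_fn_model = {k: rate(fn_model[k], n) for k, n in n_positive.items()}
subgroup_fn_rmi4 = {k: rate(fn_rmi4[k], n) for k, n in n_positive.items()}

# False positives among the 184 benign tumors.
fp_rate_model = rate(16, 184)
fp_rate_rmi4 = rate(19, 184)
```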

Interpretability

We used SHAP (SHapley Additive exPlanations) to explain the outputs of our machine learning model.²² Figure 4 shows the SHAP values of key features and their impacts on the model’s output. The top 5 features with the highest mean SHAP values included 4 ultrasound features and menopausal status.
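SHAP attributions are based on Shapley values, which split the difference between a prediction and a baseline prediction among the input features. The sketch below computes exact Shapley values by brute-force enumeration for a toy 3-feature risk function with a single baseline. It is meant only to convey the idea: the SHAP library uses more efficient estimators and a background dataset, and the toy model, its coefficients, and the function names here are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of each feature of x relative to a single baseline input."""
    n = len(x)

    def value(subset):
        # Features in `subset` take the patient's value; the others take the baseline value.
        return predict([x[i] if i in subset else baseline[i] for i in range(n)])

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_risk(features):
    """Hypothetical risk score with one interaction term (not the study model)."""
    solid, size_ge_7cm, postmenopausal = features
    return (0.1 + 0.4 * solid + 0.2 * size_ge_7cm + 0.1 * postmenopausal
            + 0.1 * solid * size_ge_7cm)


patient = [1, 1, 0]    # solid areas present, size >= 7 cm, premenopausal
reference = [0, 0, 0]  # baseline input with all features absent
phi = shapley_values(toy_risk, patient, reference)
```

The attributions sum to the gap between the patient's score and the baseline score, which is the additivity property that SHAP plots rely on.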

Figure 4.

Interpretability of our model. (A) Bar plot showing the mean absolute SHAP values (ranked by their impact on the model output from top to bottom on vertical axis). (B) Beeswarm plot illustrating the distribution of SHAP values for each feature, with color indicating feature value (red for high, blue for low).

Discussion

Preoperative prediction of malignancy risk in ovarian masses is a challenging task, particularly in early-stage disease. In this study, we develop an ensemble machine learning model to predict the borderline or malignancy risk of ovarian masses specific to Macao’s population (Figure 5). Compared to RMI-4, our model exhibits superior ability in identifying borderline and malignant ovarian tumors, ie, reducing false negative cases. Our model demonstrates consistent predictive performance across the internal test set, simulated prospective cohort, and external validation cohort, supporting its robustness and reproducibility. Stable AUC values across all datasets suggest that it is reliable across clinical settings, reduce concerns about overfitting, and increase confidence in its real-world applicability. Additionally, the model maintains high discrimination in the external validation cohort (AUC 0.923), and decision curve analysis indicates superior clinical utility. These findings support its generalizability in Macao’s population and its potential utility in clinical decision-making scenarios.

Figure 5.

Sample case from test set. (A) RMI predicted the tumor was low risk for malignancy, whereas our model predicted it was high risk. Pathological examination of the resected ovary confirmed clear cell carcinoma, FIGO stage I. (B) SHAP demonstrated the importance of each key feature to the final prediction.

Predictors of borderline and malignant ovarian tumors

In this study, a total of 14 features were identified as predictors of borderline and malignant ovarian tumors. Among them, 6 features were used by RMI-4. Eight additional features were added, including age, 6 laboratory tests (Urea, ALT, AAR, WBC, Lym, and PLR), and an ultrasound appearance suggestive of dermoid cyst, a feature that is used in the O-RADS score and points toward a benign lesion.⁸

Inflammatory biomarkers, such as platelet-to-lymphocyte ratio (PLR), have been reported to be associated with the diagnosis, tumor stage, prognosis, and treatment response of various malignancies, including ovarian cancer.²³ The potential mechanism is that ovarian cancer is associated with inflammation and changes of microenvironment, which may lead to the production of various cytokines and growth factors, finally resulting in lymphopenia, thrombocytosis, and increased PLR.²⁴

Liver fibrosis scores have been reported to be associated with an increased risk of gynecological cancers including ovarian cancer. The hypothesized mechanism is that non-alcoholic fatty liver disease is often associated with visceral adiposopathy, whose pathogenetic features have been proposed as tumorigenic triggers.²⁵ In this study, we evaluated 2 liver fibrosis markers AST-to-ALT ratio (AAR) and AST/ALT-to-platelet ratio index (AARPRI). The result showed that AAR was higher in borderline/malignant tumor group in Macao’s population, likely due to decreased ALT levels.

Serum urea has been reported to have an inverse association with renal cell carcinoma in females²⁶ and with overall cancer incidence.²⁷ Although their studies did not include ovarian cancer and the exact mechanisms were not fully understood, we did find serum urea was slightly lower in patients with borderline/malignant ovarian tumors in Macao’s population. This marker was selected by LASSO regression as one of the key features for prediction.

Ensemble learning and interpretability

Traditional machine learning approaches often rely on a single learning algorithm to make predictions based on a given dataset. However, these standalone models may suffer from limitations such as bias, overfitting, or high variance, which can impact their predictive performance. Ensemble learning has emerged as a powerful technique to address these challenges and improve the accuracy and stability of predictive models.²⁸ In this study, we found that the performance of the voting classifier was more robust than that of other classifiers on unseen data. This finding demonstrated that the ensemble learning model may have an advantage in this task.

Despite these advances, the opaque nature of AI decision-making—commonly referred to as “black box AI”—remains a significant barrier to clinical adoption. To address this, we used SHAP to explain the inner workings of our model and enhance interpretability. For instance, in the case shown in Figure 5B, SHAP analysis revealed that solid morphology and mass size were primary contributors to a high-risk prediction, whereas the absence of multilobulated morphology, mild elevation of CA125, and premenopausal status were associated with a low-risk prediction. This level of transparency supports clinical decision-making by clarifying the rationale behind risk stratification and fostering trust in model outputs.
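
The attribution idea behind SHAP can be sketched with exact Shapley values on a toy model. The code below is not the SHAP library and not the study's model: `toy_risk`, its coefficients, and the single all-zero baseline are hypothetical, and practical SHAP implementations use a background dataset and faster model-specific algorithms rather than enumerating every feature ordering.

```python
from itertools import permutations
from typing import Callable, List, Sequence


def shapley_values(
    f: Callable[[Sequence[float]], float],
    x: Sequence[float],
    baseline: Sequence[float],
) -> List[float]:
    """Exact Shapley values of f at x relative to a baseline input.

    For every ordering of the features, features are switched from their
    baseline value to their actual value one at a time, and the resulting
    change in f is credited to the feature that was switched.
    """
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = f(current)
        for i in order:
            current[i] = x[i]
            updated = f(current)
            phi[i] += updated - previous
            previous = updated
    return [value / len(orderings) for value in phi]


def toy_risk(features: Sequence[float]) -> float:
    """Hypothetical risk score with an interaction between two features."""
    solid, large_mass, postmenopausal = features
    return 0.1 + 0.4 * solid + 0.2 * large_mass + 0.1 * solid * large_mass + 0.1 * postmenopausal


x = [1, 1, 0]          # solid, large mass, premenopausal
baseline = [0, 0, 0]   # reference case with none of the features present
phi = shapley_values(toy_risk, x, baseline)
print([round(v, 3) for v in phi])
```

The contributions sum to the difference between the prediction for the case and the prediction for the baseline, which is the property that lets a per-patient plot decompose a risk score into feature-level pushes toward higher or lower risk.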

In this study, we found that the mean SHAP values of ultrasound features were generally higher than those of laboratory tests (Figure 4), suggesting that imaging features may play a more prominent role in predicting borderline and malignancy risk in Macao’s population. Based on this, we hypothesized that laboratory test-based prediction tools may be less accurate for Macao’s population, especially because serum HE4 was missing in most patients. However, further validation studies are needed to test this hypothesis.

Clinical implication

Unlike other recently published machine learning models, which use stronger clinical indicators such as HE4 and O-RADS score, as well as multimodal data fusion, our model uses basic laboratory tests and simple imaging features. In clinical practice, implementing new prediction tools often requires changes to the workflow, such as collecting additional clinical information, adding new laboratory test items, digitizing hardcopy photos, applying specific imaging protocols, providing additional training, and upgrading hardware. The trade-off between accuracy, cost, and accessibility is a critical consideration in the real world, particularly in public health systems. Before adapting the workflow of a specific health system to accommodate stronger prediction tools, using machine learning techniques to maximize the predictive potential of routinely available clinical data may offer a relatively low-cost, simple, and accessible solution to clinical problems.

The limitations of our study include: (1) the total number of cases in our cohorts is relatively small; (2) this is a binary classification model, and the study cohorts are insufficient for effectively training a multiclass model (ie, benign versus borderline versus early-stage malignancy versus advanced-stage malignancy); and (3) because some essential markers are missing in most patients in Macao, we are unable to conduct a retrospective head-to-head comparison between our model and other published prediction models.

In the future, we will continue optimizing our model, conducting further validation and prospective studies, and exploring its clinical applications as a triage tool for determining which ovarian masses can be operated on by general gynecologists and which require gynecologic oncologists, determining the priority of elective surgery, and guiding the necessity of intraoperative frozen section analysis.

Conclusion

In this study, we established a population-specific ensemble machine learning model for predicting the borderline or malignancy risk of ovarian masses. The model identified borderline and malignant ovarian tumors with AUC values significantly exceeding those of RMI-4, and its predictive performance was consistent across all evaluation datasets. Decision curve analysis demonstrated superior clinical utility, and SHAP analysis confirmed its interpretability. Our model may serve as a potential triage tool. Without requiring workflow changes, using machine learning techniques to maximize the predictive potential of routinely available clinical data in a specific health system may offer a relatively low-cost and accessible solution to clinical problems.

Footnotes

Acknowledgements

Not applicable.

Author Contributions

C-FC: conceptualization, methodology, investigation, formal analysis, software, visualization, writing – original draft; L-FS: investigation, validation, writing – review and editing; H-SL: conceptualization, methodology, investigation; H-KC: investigation; I-SC: conceptualization, writing – review and editing; SH: conceptualization, writing – review and editing; W-IF: supervision; K-IC: supervision, writing – review and editing; S-IW: supervision. All authors read and approved the final manuscript.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Supplemental Material

Supplemental material for this article is available online.

Data Availability Statement included at the end of the article.

ORCID iDs

Chan-Fong Chio

I-San Chan

Ethical Considerations

This study was approved by the Medical Ethic Committee at Conde de São Januário Hospital, Macao SAR, China (003/MEC/N/2025), and the Research Ethic Committee at Kiang Wu Hospital, Macao SAR, China (KWH 2024-025).

Consent to Participate

Written informed consent was waived by both ethics committees because this is a retrospective study using anonymized data collected during routine clinical care without altering patient management. We confirm that all methods were performed in accordance with the 1964 Declaration of Helsinki and its later amendments.

Data Availability Statement

The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request. The source codes are available at .

References

1. Cheung, Ellenson, Gilks, et al. Chapter 1: tumours of the ovary. In: WHO Classification of Tumours Editorial Board, ed. Female Genital Tumours. International Agency for Research on Cancer; 2020:32-167.

2. Dochez, Caillon, Vaucel, et al. Biomarkers and algorithms for diagnosis of ovarian cancer: CA125, HE4, RMI and ROMA, a review. J Ovarian Res. 2019;12:28. doi:10.1186/s13048-019-0503-7

3. Jacobs, Oram, Fairbanks, Turner, Frost, Grudzinskas JG. A risk of malignancy index incorporating CA 125, ultrasound and menopausal status for the accurate preoperative diagnosis of ovarian cancer. Br J Obstet Gynaecol. 1990;97:922-929. doi:10.1111/j.1471-0528.1990.tb02448.x

4. Yamamoto, Yamada, Oguri, Maeda, Fukaya. Comparison of four malignancy risk indices in the preoperative evaluation of patients with pelvic masses. Eur J Obstet Gynecol Reprod Biol. 2009;144:163-167. doi:10.1016/j.ejogrb.2009.02.048

5. Moore, Jabre-Raughley, Brown, et al. Comparison of a novel multiple marker assay vs the risk of malignancy index for the prediction of epithelial ovarian cancer in patients with a pelvic mass. Am J Obstet Gynecol. 2010;203:228.e1-228.e6. doi:10.1016/j.ajog.2010.03.043

6. Timmerman, Ameye, Fischerova, et al. Simple ultrasound rules to distinguish between benign and malignant adnexal masses before surgery: prospective validation by IOTA group. BMJ. 2010;341:c6839. doi:10.1136/bmj.c6839

7. Karlsen, Høgdall, Christensen, et al. A novel diagnostic index combining HE4, CA125 and age may improve triage of women with suspected ovarian cancer – an international multicenter study in women with an ovarian mass. Gynecol Oncol. 2015;138:640-646. doi:10.1016/j.ygyno.2015.06.021

8. Zhang, Dai. Systematic review and meta-analysis of O-RADS ultrasound and O-RADS MRI for risk assessment of ovarian and adnexal lesions. AJR Am J Roentgenol. 2023;221:21-33. doi:10.2214/AJR.22.28396

9. Van Calster, Van Hoorde, Valentin, et al. Evaluating the risk of ovarian cancer before surgery using the ADNEX model to differentiate between benign, borderline, early and advanced stage invasive, and secondary metastatic tumours: prospective multicentre diagnostic study. BMJ. 2014;349:g5920. doi:10.1136/bmj.g5920

10. Fan, et al. Using machine learning to predict ovarian cancer. Int J Med Inform. 2020;141:104195. doi:10.1016/j.ijmedinf.2020.104195

11. Cai, Huang, Gao, et al. Artificial intelligence-based models enabling accurate diagnosis of ovarian cancer using laboratory tests in China: a multicentre, retrospective cohort study. Lancet Digit Health. 2024;6:e176-e186. doi:10.1016/S2589-7500(23)00245-5

12. Xiang, Xiao, et al. Development and validation of an interpretable model integrating multimodal information for improving ovarian cancer diagnosis. Nat Commun. 2024;15:2681. doi:10.1038/s41467-024-46700-2

13. Winarto, Ongkowidjaja, Kusuma, et al. Modified risk of ovarian malignancy algorithm and risk of malignancy index in predicting epithelial ovarian cancer in Indonesian population: a single-centered validation study. J South Asian Feder Obs Gynae. 2022;14:283-286. doi:10.5005/jp-journals

14. Statistics census service of Macao government. Population census 2021. Accessed December 15, 2024. https://www.dsec.gov.mo/en-US/Statistic?id=103

15. National Comprehensive Cancer Network. Ovarian cancer/fallopian tube cancer/primary peritoneal cancer (version 3.2024). Accessed December 15, 2024. https://www.nccn.org/professionals/physician_gls/pdf/ovarian.pdf

16. Colombo, Sessa, Bois, et al. ESMO-ESGO consensus conference recommendations on ovarian cancer: pathology and molecular biology, early and advanced stages, borderline tumours and recurrent disease. Int J Gynecol Cancer. 2019;29:728-760. doi:10.1136/ijgc-2019-000308

17. Schapire. Explaining AdaBoost. In: Empirical Inference. Springer; 2013:37-52.

18. Chen, Guestrin. XGBoost: a scalable tree boosting system. Proc ACM SIGKDD Int. 2016;1:785-794.

19. Meng, Finley, et al. LightGBM: a highly efficient gradient boosting decision tree. Adv Neural Inf Process Syst. 2017;30:3146-3154.

20. Collins, Moons KGM, Dhiman, et al. TRIPOD+AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ. 2024;385:e078378. doi:10.1136/bmj-2023-078378

21. Youden WJ. Index for rating diagnostic tests. Cancer. 1950;3:32-35. doi:10.1002/1097-0142(1950)3

22. Lundberg, Lee SI. A unified approach to interpreting model predictions. Adv Neural Inf Process Syst. 2017;30:4765-4773. doi:10.5555/3295222.3295230

23. Topcu, Guzel, Ozer, et al. Comparison of neutrophil/lymphocyte and platelet/lymphocyte ratios for predicting malignant potential of suspicious ovarian masses in gynecology practice. Asian Pac J Cancer Prev. 2014;15:6239-6241. doi:10.7314/apjcp.2014.15.15.6239

24. Raungkaewmanee, Tangjitgamol, Manusirivithaya, et al. Platelet to lymphocyte ratio as a prognostic factor for epithelial ovarian cancer. J Gynecol Oncol. 2012;23:265-273. doi:10.3802/jgo.2012.23.4.265

25. Crudele, De Matteis, Graziano, et al. AST/ALT-to-platelet ratio (AARPRI) predicts gynaecological cancers: a 8-years follow-up study in 653 women. Sci Rep. 2023;13:17793. doi:10.1038/s41598-023-44243-y

26. Sun, et al. Causal associations between serum urea and cancer: a mendelian randomization study. Genes. 2021;12:498. doi:10.3390/genes12040498

27. Wei, et al. Serum urea concentration and risk of 16 site-specific cancers, overall cancer, and cancer mortality in individuals with metabolic syndrome: a cohort study. BMC Med. 2024;22:536. doi:10.1186/s12916-024-03758-5

28. Yang, Chen. A survey on ensemble learning under the era of deep learning. Artif Intell Rev. 2023;56:5545-5589. doi:10.1007/s10462-022


A Population-Specific Ensemble Machine Learning Model for Predicting Borderline or Malignancy Risk of Ovarian Masses in Macao: A Multicenter Retrospective Study
