Abstract
Introduction
Machine learning (ML)–based analysis of cell-free DNA (cfDNA) has emerged as a promising strategy for multi-cancer early detection (MCED). However, reported diagnostic performance varies widely across studies, and many estimates are derived from training or enriched cohorts, limiting their relevance to independent validation and real-world settings.
Methods
We conducted a systematic review and diagnostic accuracy meta-analysis of ML-based cfDNA assays for MCED. Four databases (PubMed, Embase, Web of Science, and the Cochrane Library) were searched from inception to February 2, 2025. Only independent validation or testing datasets were included; all training datasets were excluded. Pooled sensitivity, specificity, diagnostic odds ratio (DOR), and summary receiver operating characteristic (SROC) curves were estimated using a bivariate random-effects model. Subgroup analyses and meta-regression were performed to explore sources of heterogeneity.
Results
Thirteen studies comprising 23 datasets and 14,892 participants were included. Across the 11 independent validation or testing datasets, the pooled sensitivity was 0.78 (95% CI: 0.66-0.87), and the pooled specificity was 0.96 (95% CI: 0.90-0.98). The summary area under the curve (AUC) was 0.94, with a DOR of 76.6. Substantial between-study heterogeneity was observed.
Conclusion
ML-based cfDNA assays demonstrate consistently high specificity and moderate-to-high sensitivity across independent validation datasets, supporting their potential role in multi-cancer early detection. However, diagnostic performance is highly context dependent and strongly influenced by study design, population characteristics, and analytical choices. These findings highlight the need for large-scale, prospective, population-based validation before widespread clinical implementation.
Keywords
Introduction
Early cancer detection is a cornerstone of effective cancer control, offering the opportunity to initiate treatment at a potentially curable stage, thereby significantly improving patient prognosis and reducing cancer-related mortality.1 However, current population-based cancer screening strategies are highly fragmented. Most existing programs are developed for only a handful of malignancies, such as breast, cervical, colorectal, and lung cancers, and rely on distinct modalities, screening intervals, eligibility criteria, and clinical workflows.2,3 This siloed approach lacks scalability, consistency, and universal applicability across diverse populations and healthcare settings.
Moreover, traditional screening imposes substantial burdens on patient decision-making. Individuals are often required to undergo multiple, uncoordinated tests, each with different risks, benefits, and interpretations, thus creating a complex landscape that can be confusing, time-consuming, and financially challenging.4 For asymptomatic individuals, the invasiveness or ambiguity of tests may deter participation. These issues are particularly pronounced in underserved or resource-limited settings, where access to comprehensive, organ-specific screening is limited. Consequently, there is an urgent need for a universal, non-invasive, and patient-friendly early detection approach that is cost-effective, scalable, and aligned with shared decision-making principles.
Liquid biopsy based on cell-free DNA (cfDNA) offers a promising alternative. cfDNA, released into circulation through apoptosis, necrosis, and active secretion, carries tumor-specific genetic and epigenetic information.5,6 Advances in high-throughput sequencing and fragmentomic analysis have enabled detection of somatic mutations, methylation changes, copy number variations, and fragmentation patterns in cfDNA.7,8 This opens the door for multi-cancer early detection (MCED) from a single blood sample, streamlining the screening process and improving accessibility. Yet, the complexity and heterogeneity of cfDNA data challenge conventional analytic methods. Machine learning (ML) algorithms, including random forests, support vector machines (SVMs), and deep neural networks, have emerged as powerful tools capable of handling high-dimensional biological data, recognizing subtle patterns, and distinguishing cancer from non-cancer states.9–11 A conceptual illustration of the typical cfDNA-based ML workflow utilized in these assays is provided in Figure 1. By integrating multiple cfDNA features, ML models can potentially improve sensitivity and specificity while enabling broad cancer coverage and real-world clinical application.

Workflow for ML-assisted cancer diagnosis using cfDNA. The process is divided into three stages: A. Peripheral blood sample collection: Blood is drawn from a patient, and cell-free DNA (cfDNA) is extracted and subjected to next-generation sequencing (NGS). B. cfDNA feature categories: Raw NGS data is analyzed to extract epigenetic features (methylation) and fragmentomic features (fragment sizes). C. ML-assisted cancer diagnosis workflow: Machine learning algorithms analyze these features to provide a binary prediction of cancer (detected/not detected) and predict the tissue of origin, guiding further diagnostic examinations.
Although cfDNA-based machine learning assays show potential for multi-cancer early detection, reported diagnostic performance has been highly variable across studies. This variability is largely attributable to differences in study populations, cancer types, feature selection strategies, cfDNA biomarker types (such as methylation, fragmentation, or variant detection), and the ML algorithms applied. Several studies have demonstrated high sensitivity and specificity, whereas others yielded only moderate or inconsistent results, often reflecting small sample sizes, retrospective designs, or limited external validation.12,13 Additionally, the area under the receiver operating characteristic curve, a common measure of model discrimination, demonstrates wide variability, reflecting the heterogeneity in model development pipelines and cfDNA data quality.14 These inconsistencies complicate efforts to benchmark performance, translate findings into clinical workflows, and guide regulatory or reimbursement decisions.
While recent systematic reviews, such as the comprehensive Health Technology Assessment by the UK National Institute for Health and Care Research (NIHR) published in 2025,15 have evaluated the clinical implementation of commercial MCED tests, they explicitly abstained from performing quantitative meta-analyses due to high heterogeneity. Furthermore, existing clinical reviews16 typically treat the computational component as a “black box,” focusing on the final commercial product rather than quantitatively assessing how different algorithmic approaches influence diagnostic accuracy. Additionally, previous summaries often focus on single-cancer applications or specific biomarker modalities. To date, no meta-analysis has systematically aggregated and evaluated the diagnostic accuracy, heterogeneity, and methodological quality of ML-based cfDNA assays for MCED. Such a synthesis is critical to assess the current evidence base, identify sources of bias and variation, and inform the design of future prospective validation studies.
To address this gap, we systematically reviewed the existing literature and performed a quantitative meta-analysis to evaluate the diagnostic accuracy of machine learning-based cfDNA assays for multi-cancer early detection. Unlike previous reviews, we strictly excluded training datasets and synthesized results solely from independent validation cohorts to provide a realistic estimate of generalizability. We assessed pooled sensitivity, specificity, diagnostic odds ratio (DOR), and area under the summary receiver operating characteristic curve. In addition, we conducted detailed subgroup analyses and meta-regression stratified by ML algorithm type and biomarker modality to explore heterogeneity and assess methodological and biological factors influencing model performance. Our findings provide an evidence-based assessment of the translational potential of cfDNA and ML integration and offer valuable insights for the development of more patient-centered, equitable, and scalable cancer early detection strategies.
Methods
Protocol and Registration
This systematic review and meta-analysis was conducted and reported in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 guidelines.17 The review protocol was prospectively registered in the International Prospective Register of Systematic Reviews (PROSPERO). The registered protocol prespecified the study objectives, eligibility criteria, literature search strategy, outcome measures, and planned statistical analyses, thereby ensuring methodological transparency and reproducibility. Ethical review and approval were not required for this study because it is a systematic review and meta-analysis based exclusively on previously published literature and publicly available aggregate data, without direct involvement of human participants or collection of primary biological samples.
Search Strategy
A comprehensive literature search was conducted in collaboration with an evidence-based medicine specialist, using a combination of controlled vocabulary terms (MeSH and Emtree) and free-text keywords related to cell-free DNA, machine learning, multi-cancer detection, and early cancer screening. Four electronic databases (PubMed, the Cochrane Library, Embase, and Web of Science) were systematically searched from inception to February 2, 2025. No restrictions were applied regarding geographical region, study design, or article type. Only English-language publications were included. The detailed search strategies and complete search strings for all databases are provided in Supplementary Table S1. To ensure comprehensive coverage, the reference lists of all included articles and relevant reviews were manually screened to identify additional eligible studies that may have been missed in the initial search.
Eligibility Criteria
Inclusion criteria were as follows: (1) studies involving patients with confirmed cancer (at any stage) and/or healthy or non-cancer controls undergoing cfDNA-based testing for multi-cancer early detection (MCED); (2) application of ML algorithms, such as deep learning, random forest, support vector machines, logistic regression, or Bayesian models, to cfDNA data derived from sequencing, methylation profiling, fragmentomics, or other liquid biopsy techniques; (3) inclusion of a non-cancer control group for diagnostic comparison; (4) availability of diagnostic performance metrics or sufficient data to derive values for true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN); and (5) articles published in English with adequate methodological detail.
Exclusion criteria included: (1) studies with duplicate data (in which case the version with the most comprehensive dataset was retained); (2) case reports, reviews, editorials, commentaries, or conference abstracts without full data; (3) animal studies, in vitro experiments, or computational-only model development without clinical data; and (4) publications focused on non-cancer cfDNA applications such as treatment monitoring, minimal residual disease (MRD), prenatal testing, or unrelated diseases.
Study Selection
The screening process was conducted using a two-step approach. First, duplicate studies were automatically excluded using EndNote X9 (Clarivate Analytics), followed by an initial manual screening by a reviewer. Subsequently, the titles, abstracts, and full texts of all remaining records were independently assessed by two reviewers based on the predefined inclusion and exclusion criteria. Discrepancies between reviewers were resolved by discussion, and if necessary, a third independent reviewer was consulted to reach consensus. During the screening process, studies that focused on non-cancer cfDNA applications or reported complications unrelated to multi-cancer early detection were excluded at the title and abstract level. In addition, case reports, case series, editorials, commentaries, and conference abstracts without full data were excluded.
Data Extraction and Quality Assessment
Two reviewers independently extracted relevant data from each eligible study using a standardized, pre-defined data extraction form. The extracted information included: first author's name and publication year, study location, study design, sample size, cancer type and stage, diagnostic reference standard, type of control group, participant age and sex, machine learning algorithm used, type of cfDNA biomarker, and diagnostic outcomes, including the number of TP, FN, FP, and TN. For multi-cancer early detection studies that reported performance metrics per individual cancer type without providing an overall aggregate confusion matrix, we synthesized the data by treating detection as a binary outcome (cancer detected vs not detected). We calculated the aggregate TP and FN by summing these values across all included cancer types. Conversely, for the control group, we utilized the overall specificity, FP, and TN reported for the total non-cancer cohort to ensure that control subjects were not counted multiple times. To ensure the reliability of diagnostic accuracy estimates, we prioritized the extraction of performance metrics from independent validation cohorts or held-out test sets. For studies that reported sensitivity and specificity alongside sample sizes but lacked explicit confusion matrix values (TP, FP, FN, TN), we back-calculated these values to reconstruct the 2 × 2 contingency tables. Datasets with missing or incomplete data that could not be reliably reconstructed were excluded from the quantitative meta-analysis. After both reviewers completed the data extraction, results were cross-checked, and any discrepancies were resolved through discussion to reach consensus.
The methodological quality and risk of bias of the included studies were evaluated using the Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) tool.18 This tool assesses four domains (patient selection, index test, reference standard, and flow and timing) for risk of bias, and the first three domains for applicability concerns. The assessments were performed independently by two reviewers, and disagreements were resolved through consensus or with the input of a third reviewer.
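The back-calculation of contingency tables described above can be sketched as follows. This is an illustrative Python sketch with hypothetical input values, not the extraction script used in the review:

```python
def reconstruct_2x2(sensitivity, specificity, n_cancer, n_control):
    """Back-calculate TP/FP/FN/TN from reported sensitivity, specificity,
    and group sizes, rounding to the nearest whole participant."""
    tp = round(sensitivity * n_cancer)
    fn = n_cancer - tp
    tn = round(specificity * n_control)
    fp = n_control - tn
    return tp, fp, fn, tn

# Hypothetical example: sensitivity 0.78, specificity 0.96,
# 200 cancer patients and 300 non-cancer controls.
tp, fp, fn, tn = reconstruct_2x2(0.78, 0.96, 200, 300)
print(tp, fp, fn, tn)  # 156 12 44 288
```

Because rounding can shift a cell by one count when reported metrics are themselves rounded, reconstructed tables were cross-checked against any totals reported in the source publications.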
Data Synthesis and Analysis
All statistical analyses were conducted using Stata software, version 14.0 SE (StataCorp LLC, College Station, TX). Diagnostic test accuracy meta-analysis was performed using the midas command based on a bivariate mixed-effects regression model. This model was selected as the primary statistical framework because it explicitly accounts for the inherent trade-off and negative correlation between sensitivity and specificity (the threshold effect), thereby preserving the two-dimensional nature of the diagnostic data. The model operates under the assumption that the logit-transformed sensitivity and specificity follow a bivariate normal distribution across the included studies, incorporating both within-study sampling error and between-study heterogeneity. While the Hierarchical Summary Receiver Operating Characteristic (HSROC) model represents a valid alternative structure, the bivariate model was prioritized as it allows for the direct estimation of pooled performance metrics with their respective confidence intervals.
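As an illustration of the data the bivariate model operates on, the per-study logit transforms and within-study sampling variances can be derived as below. This is a hedged Python sketch; the 0.5 continuity correction is a common convention and an assumption here, not necessarily what the midas command applies internally:

```python
import math

def bivariate_inputs(tp, fp, fn, tn):
    """Per-study inputs for a bivariate meta-analysis: logit sensitivity and
    logit specificity with their within-study sampling variances.
    A 0.5 continuity correction (an assumption here) avoids zero cells."""
    tp, fp, fn, tn = (x + 0.5 for x in (tp, fp, fn, tn))
    logit_sens = math.log(tp / fn)
    logit_spec = math.log(tn / fp)
    var_sens = 1 / tp + 1 / fn
    var_spec = 1 / tn + 1 / fp
    return logit_sens, var_sens, logit_spec, var_spec

# Hypothetical study: TP=156, FP=12, FN=44, TN=288
ls, vs, lsp, vsp = bivariate_inputs(156, 12, 44, 288)
```

The bivariate model then assumes these logit pairs follow a bivariate normal distribution across studies, estimating the between-study variances and the correlation that captures the threshold effect.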
Pooled estimates included sensitivity (Sen), specificity (Spe), positive likelihood ratio (PLR), negative likelihood ratio (NLR), and the diagnostic odds ratio (DOR), each with corresponding 95% confidence intervals (CIs). The DOR ranges from 0 to infinity, with higher values indicating greater discriminatory ability of the diagnostic test.19 A summary receiver operating characteristic (SROC) curve was constructed, and the area under the curve (AUC) was calculated to evaluate overall diagnostic performance, with values closer to 1 indicating higher accuracy.20
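For a single 2 × 2 table, the metrics defined above reduce to simple ratios, with a 95% CI for the DOR obtainable on the log scale. A minimal Python sketch using hypothetical counts (tables containing zero cells would need a continuity correction first):

```python
import math

def diagnostic_metrics(tp, fp, fn, tn):
    """Sen, Spe, PLR, NLR, and DOR for one 2x2 table, with a 95% CI
    for the DOR computed on the log scale."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    plr = sens / (1 - spec)           # positive likelihood ratio
    nlr = (1 - sens) / spec           # negative likelihood ratio
    dor = plr / nlr                   # equals (TP*TN)/(FP*FN)
    se_log_dor = math.sqrt(1/tp + 1/fp + 1/fn + 1/tn)
    ci = (math.exp(math.log(dor) - 1.96 * se_log_dor),
          math.exp(math.log(dor) + 1.96 * se_log_dor))
    return sens, spec, plr, nlr, dor, ci

# Hypothetical table: TP=156, FP=12, FN=44, TN=288
sens, spec, plr, nlr, dor, ci = diagnostic_metrics(156, 12, 44, 288)
```

The pooled versions of these quantities come from the bivariate model rather than from summing tables, so they are not simple averages of the per-study values.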
Between-study heterogeneity was assessed using Cochran's Q test and the I² statistic, with I² values above 50% indicating substantial heterogeneity.
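Cochran's Q and the I² statistic can be computed from study-level effect sizes (for example, log-DORs) and their within-study variances. A minimal sketch, assuming inverse-variance weights:

```python
def cochran_q_i2(effects, variances):
    """Cochran's Q and I^2 (%) for study-level effects (e.g., log-DORs)
    with their within-study variances, using inverse-variance weights."""
    weights = [1 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return q, i2
```

Under homogeneity, Q follows a chi-squared distribution with k − 1 degrees of freedom; I² expresses the proportion of total variation attributable to between-study heterogeneity rather than chance.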
Results
Study Selection
A total of 2190 records were identified through database searches, including 859 from Embase, 606 from PubMed, 64 from The Cochrane Library, and 661 from Web of Science. After removing 710 duplicate records, 1480 articles remained for title and abstract screening. Of these, 1445 articles were excluded for not meeting the inclusion criteria based on title and abstract review. The remaining 35 articles were retrieved for full-text assessment, after which 22 studies were excluded for reasons such as insufficient data, irrelevant outcomes, or ineligible study design. No additional eligible studies were identified through manual searching. Ultimately, 13 studies were included in the final meta-analysis.7,25–36 The study selection process is illustrated in the PRISMA flow diagram (Figure 2).

PRISMA flow diagram of study selection process. A total of 2190 records were identified through database searching. After duplicate removal and screening, 13 studies were included in the final meta-analysis.
Study Characteristics
A total of 13 studies were included in this meta-analysis, comprising 1 prospective cohort study,30 2 retrospective cohort studies,25,26 and 10 case-control studies.7,27–29,31–36 Among these, 4 studies reported only one dataset,7,30,32,35 1 study provided three datasets,28 and the remaining 8 studies contributed two datasets each,25–27,29,31,33,34,36 resulting in a total of 23 datasets for analysis. The included studies were published between 2019 and 2025, involving a combined total of 14,892 participants, including 8434 cancer patients and 6458 non-cancer controls. Geographically, the studies were conducted in China, the United States, India, the Netherlands, Belgium, and Vietnam. The study by Xu J (2024)36 primarily enrolled cancer patients in stages II–IV, whereas all other studies included patients across stages I–IV. Comprehensive information on population demographics (age, sex), machine learning algorithms, cfDNA biomarker types, and diagnostic performance metrics (TP, FP, TN, FN) is summarized in Table 1.
The methodological quality was evaluated using the QUADAS-2 tool, which assesses risk of bias across four domains (Patient Selection, Index Test, Reference Standard, and Flow and Timing) based on specific signaling questions (Figures 3A and 3B). While the overall risk of bias appeared comparable across studies, domain-specific evaluations revealed variations that mirrored the heterogeneity in study characteristics, particularly in subject recruitment and control selection. Specifically, the Reference Standard and Index Test domains consistently demonstrated low risk, reflecting rigorous pathological confirmation and blinded model interpretation. In contrast, the Patient Selection domain exhibited variability (often classified as ‘Unclear’ in Figure 3A), which directly reflects the substantial differences in study populations and the prevalence of retrospective case-control designs utilizing healthy controls.
Detailed diagnostic performance of each included study, stratified by dataset (training, validation, or independent testing) and by individual cancer types, is summarized in Supplementary Table S2. Overall, cfDNA-based ML models demonstrated consistently high specificity (typically >90%) and variable accuracy across cancer types, with particularly strong performance for hepatobiliary, pancreatic, and esophageal cancers, whereas breast and prostate cancers showed lower accuracy in large-scale validation datasets.

QUADAS-2 assessment of risk of bias and applicability concerns for included studies. (A) Summary bar charts showing the proportion of studies rated as having low (green), unclear (yellow), or high (red) risk of bias and applicability concerns across the four QUADAS-2 domains: patient selection, index test, reference standard, and flow and timing. (B) Traffic-light plot illustrating domain-specific judgments for each individual study. Green circles (+) indicate low risk, yellow circles (?) indicate unclear risk, and red circles (–) indicate high risk.
Characteristics of 13 Included Studies in This Meta-Analysis.
Abbreviations used in this table: 5hmC, 5-hydroxymethylcytosine; CCS, Case-control study; PCS, Prospective cohort study; RCS, Retrospective cohort study; HC, Healthy control; SVM, Support vector machine; RFM, Random forest model; LRA, Logistic regression algorithm; MLRM, Multinomial logistic regression model; SGBM, Stochastic gradient boosting model; GLM, Generalized linear model; ML, Machine learning; NR, not reported; CRC, colorectal cancer; GC, gastric cancer; N, number of groups; M, male; F, female; TP, true positive; FN, false negative; FP, false positive; TN, true negative.
Pooled Diagnostic Accuracy of cfDNA-Based Machine Learning Models
A total of 13 studies (23 datasets) were included in the meta-analysis to evaluate the diagnostic performance of ML models applied to cfDNA for early multi-cancer detection. The Spearman correlation analysis yielded a correlation coefficient of −0.53.

Pooled sensitivity, specificity, and SROC curve of machine learning-based cfDNA assays for multi-cancer early detection. (A) Forest plots showing the individual and pooled sensitivity (left) and specificity (right) across 23 datasets from 13 studies. The pooled sensitivity was 0.779 (95% CI: 0.699-0.843) with significant heterogeneity.
Diagnostic Likelihood Ratios, Discriminatory Power, and Between-Study Heterogeneity
The pooled PLR was 20.712 (95% CI: 13.236-32.412), and the pooled NLR was 0.229 (95% CI: 0.167-0.315), as shown in Figure 5A. Both indicators demonstrated significant between-study heterogeneity.

Pooled likelihood ratios and DOR of cfDNA-based machine learning models for multi-cancer early detection. (A) Forest plots showing the pooled PLR and NLR across included datasets. The pooled PLR was 20.712 (95% CI: 13.236-32.412), and the pooled NLR was 0.229 (95% CI: 0.167-0.315), both indicating strong diagnostic performance. (B) Forest plot of the pooled DOR, estimated at 90.273 (95% CI: 55.886-145.818), reflecting the overall discriminatory ability of the test. All metrics exhibited significant heterogeneity, as indicated by high I² values.
Clinical Utility Analysis
The Fagan nomogram plot was used to evaluate the clinical utility of cfDNA-based machine learning models for multi-cancer early detection (Figure 6). Based on the included studies, the pre-test probability of cancer among the study population was 57%. Given the pooled PLR and NLR from the meta-analysis, the post-test probability was calculated to be 96% for individuals with a positive test result, and 23% for those with a negative result. These findings suggest that cfDNA evaluated by ML models provides clinically meaningful diagnostic information within high-risk or enriched cohorts. However, to estimate utility in a realistic screening scenario, we simulated a hypothetical low-prevalence setting of 1% (typical for an average-risk general population).1 Using the pooled PLR of 20.71, a positive test result would increase the post-test probability of cancer from 1% to approximately 17%. Conversely, using the pooled NLR of 0.23, a negative test result would decrease the probability from 1% to approximately 0.2%. This comparison highlights that while the test significantly elevates the probability of cancer detection (from 1% to 17%), a positive result in a general screening population is not diagnostic on its own and necessitates rigorous follow-up confirmation.
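The Fagan nomogram is a graphical application of Bayes' theorem on the odds scale, and the post-test probabilities quoted above can be reproduced directly from the pooled likelihood ratios:

```python
def post_test_probability(pre_test_prob, likelihood_ratio):
    """Bayes' theorem on the odds scale, as read off a Fagan nomogram:
    convert probability to odds, multiply by the likelihood ratio,
    and convert back to a probability."""
    pre_odds = pre_test_prob / (1 - pre_test_prob)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)

# Pooled estimates from this meta-analysis: PLR 20.712, NLR 0.229
print(round(post_test_probability(0.57, 20.712), 2))  # 0.96 (enriched cohort, positive test)
print(round(post_test_probability(0.57, 0.229), 2))   # 0.23 (enriched cohort, negative test)
print(round(post_test_probability(0.01, 20.712), 2))  # 0.17 (1% prevalence, positive test)
```

The same calculation makes the prevalence dependence explicit: identical likelihood ratios yield a 96% post-test probability in the enriched study population but only about 17% in an average-risk screening setting.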

Fagan nomogram for evaluating the clinical utility of cfDNA-based machine learning models. The pre-test probability of cancer was set at 57%, consistent with the overall prevalence across included studies. The post-test probability was 96% for a positive result and 23% for a negative result, reflecting the strong diagnostic influence of the test.
Assessment of Publication Bias
Deeks’ funnel plot asymmetry test was used to assess the risk of publication bias among the included studies. As shown in Figure 7, the scatter of studies was relatively symmetrical, and the slope of the regression line was not statistically significant (P = 0.35), suggesting no significant publication bias.

Deeks’ funnel plot for publication bias assessment. The funnel plot shows a symmetrical distribution of included studies with a non-significant slope in the linear regression test for funnel plot asymmetry (P = 0.35), suggesting no significant publication bias.
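Deeks' test regresses the log-DOR against the inverse square root of the effective sample size (ESS), weighted by ESS; asymmetry appears as a nonzero slope. A simplified Python sketch of the slope estimate (the significance test itself was run in Stata; the 0.5 continuity correction is an assumption):

```python
import math

def deeks_slope(tables):
    """Weighted least-squares slope for Deeks' funnel plot: ln(DOR)
    regressed on 1/sqrt(ESS), weighted by the effective sample size.
    A 0.5 continuity correction (an assumption) guards against zero cells."""
    xs, ys, ws = [], [], []
    for tp, fp, fn, tn in tables:
        tp, fp, fn, tn = (v + 0.5 for v in (tp, fp, fn, tn))
        n1, n2 = tp + fn, fp + tn            # diseased / non-diseased totals
        ess = 4 * n1 * n2 / (n1 + n2)        # effective sample size
        xs.append(1 / math.sqrt(ess))
        ys.append(math.log((tp * tn) / (fp * fn)))
        ws.append(ess)
    w_sum = sum(ws)
    x_bar = sum(w * x for w, x in zip(ws, xs)) / w_sum
    y_bar = sum(w * y for w, y in zip(ws, ys)) / w_sum
    num = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(ws, xs, ys))
    den = sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))
    return num / den

# Two hypothetical studies with identical DORs but different sizes:
# the funnel is symmetric, so the slope is approximately zero.
slope = deeks_slope([(10, 10, 10, 10), (100, 100, 100, 100)])
```

A slope near zero, as here, indicates that small and large studies report similar accuracy, consistent with the non-significant result (P = 0.35) shown in the funnel plot.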
Subgroup Analysis and Meta-Regression
To explore sources of heterogeneity, we conducted subgroup analyses and meta-regression based on geographic region, study design, specificity threshold, sample size, biomarker type, algorithm type, and cancer type coverage (Table 2; Figure 8).

Forest plots of subgroup analyses assessing sensitivity and specificity across study-level covariates. Subgroups were stratified by (A) geographic region (Western vs Asia), (B) study design (case-control vs cohort), (C) control group type (healthy control vs non-cancer control), (D) pre-specified specificity thresholds, (E) sample size (≥500 vs <500), (F) cfDNA biomarker type (methylation-based vs non-methylation-based), (G) machine learning algorithm type (tree-based vs linear/kernel-based models), and (H) number of cancer types included (>5 vs ≤5). Differences in diagnostic performance were evaluated using meta-regression; P-values indicate the statistical significance of subgroup differences in sensitivity and specificity.
The Analysis Results of Meta-Regression.
Abbreviations used in this table: CCS, Case-control study; CS, Cohort study; HC, Healthy control.
Geographic Region
Subgroup analysis by study location revealed significant differences. Western studies demonstrated a pooled sensitivity of 0.62 (95% CI: 0.50-0.74) and specificity of 0.98 (95% CI: 0.96-0.99), while studies conducted in Asia yielded higher sensitivity at 0.85 (95% CI: 0.80-0.91) and slightly lower specificity at 0.95 (95% CI: 0.92-0.98). The between-group differences were statistically significant for both sensitivity and specificity.
Study Design
Case-control studies (CCS) showed a sensitivity of 0.76 (95% CI: 0.67-0.85) and specificity of 0.96 (95% CI: 0.94-0.98), whereas cohort studies (CS) demonstrated a slightly higher sensitivity of 0.83 (95% CI: 0.72-0.94) and similar specificity (0.96, 95% CI: 0.93-1.00). Sensitivity and specificity differences were significant in univariate analysis.
Control Type
When grouped by control type, studies using healthy controls achieved higher sensitivity (0.83, 95% CI: 0.75-0.92) and specificity (0.97, 95% CI: 0.95-0.99) than those using non-cancer controls (sensitivity 0.74, 95% CI: 0.64-0.84; specificity 0.96, 95% CI: 0.93-0.98); however, this difference was not significant in the joint model.
Pre-Specified Specificity Thresholds
Studies that pre-specified a high specificity threshold (95-99%) had lower sensitivity at 0.72 (95% CI: 0.62-0.81) but higher specificity at 0.98 (95% CI: 0.97-0.99). In contrast, those without such thresholds reported higher sensitivity of 0.86 (95% CI: 0.78-0.94) and lower specificity of 0.90 (95% CI: 0.85-0.95). The difference in sensitivity was statistically significant.
Sample Size
Studies with ≥500 participants had a sensitivity of 0.65 (95% CI: 0.53-0.78) and specificity of 0.98 (95% CI: 0.97-0.99), while those with fewer than 500 participants had higher sensitivity of 0.84 (95% CI: 0.78-0.90) and lower specificity of 0.94 (95% CI: 0.91-0.97). Both sensitivity and specificity differences were significant.
cfDNA Biomarker Type
Studies using methylation-based cfDNA biomarkers reported a sensitivity of 0.73 (95% CI: 0.63-0.83) and specificity of 0.97 (95% CI: 0.96-0.99). In comparison, non-methylation-based studies showed higher sensitivity (0.83, 95% CI: 0.75-0.92) and slightly lower specificity (0.94, 95% CI: 0.89-0.98). The differences were significant in univariate analyses.
ML Algorithm Type
Tree-based models, such as random forests and gradient boosting, had a pooled sensitivity of 0.72 (95% CI: 0.59-0.85) and specificity of 0.96 (95% CI: 0.93-0.99). Linear or kernel-based models (eg, logistic regression, SVM) showed higher sensitivity of 0.81 (95% CI: 0.74-0.89) with comparable specificity (0.96, 95% CI: 0.94-0.99). Differences in both sensitivity and specificity were significant in univariate comparisons.
Cancer Type Breadth
Studies that assessed >5 cancer types showed a lower sensitivity of 0.73 (95% CI: 0.64-0.81) but higher specificity of 0.98 (95% CI: 0.96-0.99). In contrast, studies that included ≤5 cancer types reported higher sensitivity of 0.87 (95% CI: 0.79-0.95) and lower specificity of 0.91 (95% CI: 0.84-0.97). The difference in sensitivity was significant.
These analyses revealed that diagnostic performance varied significantly across subgroups, suggesting that heterogeneity was primarily driven by differences in population characteristics such as geographic region, study design elements including sample size and the number of cancer types assessed, and methodological choices related to the type of cfDNA biomarker used and the machine learning algorithm implemented. Notably, studies involving smaller cohorts, a narrower range of cancer types, or populations from specific regions tended to report higher sensitivity.
Sensitivity Analysis
A sensitivity analysis was performed by excluding all training datasets, leaving 11 independent validation or testing datasets for re-analysis. As shown in Supplementary Figure 1A, the pooled sensitivity was 0.782 (95% CI: 0.662-0.868), with significant heterogeneity.
The DOR was 76.56 (95% CI: 38.59-151.86), with significant heterogeneity.
Discussion
Our study focused on cfDNA rather than other circulating biomarkers such as exosomal RNA, circulating tumor cells, or protein-based markers, due to cfDNA's unique combination of biological accessibility, molecular richness, and growing clinical relevance. Unlike protein markers, which often suffer from low specificity and context dependency,37 cfDNA provides direct genomic and epigenomic signals reflective of tumor biology, including somatic mutations, methylation alterations, and fragmentation patterns.38 Compared to circulating tumor cells (CTCs), cfDNA is more consistently detectable across early and late-stage cancers and can be more feasibly integrated into high-throughput sequencing workflows.39 Moreover, cfDNA assays offer compatibility with machine learning pipelines due to their high-dimensional feature space, which enables nuanced modeling for pan-cancer detection. These properties make cfDNA particularly well-suited for scalable, minimally invasive cancer screening strategies. The notably high specificity aligns with the clinical goal of minimizing false positives in population-wide screening, while the acceptable sensitivity reflects meaningful detection capability at early disease stages.40
To contextualize the potential value of cfDNA-based ML models, it is important to compare their performance with existing single-cancer screening modalities.41 Traditional methods such as mammography for breast cancer and fecal immunochemical testing (FIT) or colonoscopy for colorectal cancer are well-established, evidence-based tools that have significantly reduced cancer mortality when used appropriately in target populations.42 However, these methods are cancer-type specific and are often underutilized due to invasiveness, accessibility, or compliance issues.43 In contrast, cfDNA-ML assays offer the possibility of simultaneous, multi-cancer detection from a single blood draw, potentially improving patient convenience and uptake.44 While mammography achieves a sensitivity of ∼77%–95% and specificity of ∼94% in screening settings,45 cfDNA-based models in our analysis demonstrate comparable or higher specificity (often >90%) and acceptable sensitivity, especially for detecting multiple cancers concurrently. Similarly, FIT for colorectal cancer has a reported sensitivity of ∼74% for early-stage disease and specificity around 95%,46 but it requires regular repeated testing and lacks pan-cancer scope. Moreover, cfDNA analysis is less invasive than colonoscopy and may be more acceptable to patients, particularly in low-resource or rural settings where endoscopy services are limited. However, these comparisons should be interpreted with caution: the reported performance metrics of cfDNA-based models often exhibit overlapping confidence intervals with those of established single-cancer screening modalities and are frequently derived from case-control or retrospective study designs rather than true screening populations.
Moreover, these approaches differ fundamentally in clinical use-case context, including target populations, screening frequency, and intended clinical objectives. As a result, the comparisons presented here are intended to provide contextual benchmarks rather than to imply direct equivalence or clinical interchangeability. Accordingly, cfDNA-based multi-cancer early detection assays may be best viewed as complements to, rather than replacements for, established single-cancer screening strategies. Nevertheless, the ability to screen simultaneously for multiple lethal cancers, including those such as pancreatic and ovarian cancer that currently lack any effective screening option, represents a potential paradigm shift in early detection strategy.
Despite these encouraging results, considerable heterogeneity was observed across the included studies.
Despite the promising diagnostic performance reported across many studies, the risk of model overfitting remains a major concern in machine learning–based cfDNA assays for multi-cancer early detection, particularly in settings characterized by high-dimensional genomic or epigenomic features and relatively limited sample sizes, which are inherently prone to inflated performance estimates.54,55 Although internal validation strategies such as cross-validation are widely applied, reliance on internal validation alone may overestimate model performance and fail to adequately account for population heterogeneity, technical variability, and differences in pre-analytical workflows encountered in real-world clinical settings.56,57 Equally important is the lack of independent multicenter external validation in the current literature. A substantial proportion of studies rely on single-center cohorts or reuse publicly available datasets, which limits the evaluation of model robustness across diverse populations, sequencing platforms, and laboratory protocols.58,59 Without rigorous external validation using geographically and clinically distinct cohorts, the generalizability of these machine learning models remains uncertain, even for leading cfDNA-based multi-cancer early detection platforms. 8 From a clinical deployment perspective, these limitations represent key barriers to translation. Addressing them will require large-scale, prospective, multicenter studies with standardized cfDNA processing pipelines, transparent model reporting, and independent external validation, as emphasized in established biomarker development and regulatory frameworks.60,61 Such efforts are essential to bridge the gap between encouraging algorithmic performance and reliable real-world clinical implementation of cfDNA-based multi-cancer early detection assays.
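The optimism of internal-only validation is easy to reproduce in a toy simulation (entirely synthetic data, not drawn from any included study): when feature selection is performed on the full dataset before cross-validation, information leaks from the test folds and accuracy is inflated even though the labels are pure noise. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 100, 2000, 20          # few samples, many features (typical of cfDNA studies)
X = rng.standard_normal((n, p))  # random "omics" features
y = rng.integers(0, 2, size=n)   # labels independent of X: true accuracy is 50%

def top_features(Xtr, ytr, k):
    """Rank features by absolute correlation with the label; keep the top k."""
    Xc = Xtr - Xtr.mean(axis=0)
    yc = ytr - ytr.mean()
    corr = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
    return np.argsort(corr)[-k:]

def centroid_accuracy(Xtr, ytr, Xte, yte):
    """Nearest-class-centroid classifier: assign each test point to the closer class mean."""
    c0, c1 = Xtr[ytr == 0].mean(axis=0), Xtr[ytr == 1].mean(axis=0)
    pred = (np.linalg.norm(Xte - c1, axis=1) < np.linalg.norm(Xte - c0, axis=1)).astype(int)
    return float((pred == yte).mean())

def cv_accuracy(leaky):
    folds = np.array_split(rng.permutation(n), 5)
    sel_full = top_features(X, y, k) if leaky else None  # LEAK: select on ALL data
    accs = []
    for te in folds:
        tr = np.setdiff1d(np.arange(n), te)
        sel = sel_full if leaky else top_features(X[tr], y[tr], k)  # proper: inside the fold
        accs.append(centroid_accuracy(X[tr][:, sel], y[tr], X[te][:, sel], y[te]))
    return float(np.mean(accs))

leaky_acc, proper_acc = cv_accuracy(True), cv_accuracy(False)
print(f"leaky CV accuracy:  {leaky_acc:.2f}")   # optimistically high
print(f"proper CV accuracy: {proper_acc:.2f}")  # near chance (0.5)
```

The leaky estimate looks diagnostic despite the labels being random, while the properly nested procedure hovers near chance; the same mechanism can inflate internally validated cfDNA models whenever feature selection, normalization, or hyperparameter tuning touches the held-out data.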
In addition to study-level heterogeneity, differences in the type of cfDNA biomarker and machine learning algorithm used also contributed to variability in diagnostic performance. Methylation-based biomarkers yielded more consistent and robust performance metrics, likely owing to their stable epigenetic signals and strong cancer-type specificity, as demonstrated in studies by Xiong et al 62 and Sharma et al. 63 In contrast, fragmentation-based features, while promising, may be more susceptible to noise and variability across sample-handling procedures and sequencing platforms. 64 Similarly, the choice of machine learning algorithm influenced diagnostic performance. Linear models such as logistic regression and linear-kernel support vector machines (SVMs) generally achieved higher sensitivity while maintaining comparable specificity relative to tree-based models such as random forests or gradient boosting. 65 Linear models often generalize better in the high-dimensional, low-sample-size settings common in biomedical applications, whereas complex tree-based models may overfit training data if not properly validated. 66 These observations underscore the importance of methodological harmonization, rigorous cross-validation, and external testing when developing and reporting ML-based diagnostic models.
A major limitation of our study is the substantial heterogeneity observed across the included studies. High heterogeneity is common in diagnostic meta-analyses owing to variations in patient spectrum, sample handling, and the threshold effect; it implies that the pooled estimates should be interpreted as an average performance benchmark rather than a precise prediction for any single clinical setting. Despite this high heterogeneity, we deemed quantitative synthesis appropriate for several reasons. First, we used a bivariate mixed-effects model, which statistically accounts for between-study variability and preserves the two-dimensional nature of diagnostic data (sensitivity and specificity), offering a more robust estimation than fixed-effect models in heterogeneous settings. 67 Furthermore, as ML-based cfDNA assays for multi-cancer early detection are rapidly advancing, a systematic synthesis is critical to assess their current overall diagnostic value. By pooling the latest evidence, we aim not only to estimate performance but also to identify the sources of variation, such as algorithm type and biomarker modality, thereby revealing methodological deficiencies.
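As a point of reference for readers, the diagnostic odds ratio (DOR) combines sensitivity and specificity into a single measure of discriminatory power:

```latex
\mathrm{DOR} \;=\; \frac{\mathrm{TP}/\mathrm{FN}}{\mathrm{FP}/\mathrm{TN}}
\;=\; \frac{\text{sensitivity}}{1-\text{sensitivity}} \times \frac{\text{specificity}}{1-\text{specificity}}
```

Note that the pooled DOR reported in this meta-analysis is estimated jointly by the bivariate model rather than by substituting the pooled sensitivity and specificity into this formula, so the two calculations need not coincide.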
Beyond heterogeneity, other limitations must be considered. A second limitation is the restricted demographic and geographic diversity of the included datasets. Geographically, over 70% of the studies were conducted in China and the United States. This restriction limits generalizability because of regional variations in cancer epidemiology. As observed in the baseline characteristics, Asian cohorts were enriched with high-shedding tumor types (gastric and hepatocellular carcinoma), yielding higher sensitivity estimates compared with Western cohorts dominated by low-shedding types (breast cancer). Consequently, the pooled diagnostic accuracy reported here may not be directly transferable to regions with different prevalent cancer profiles. Demographically, heterogeneity in baseline characteristics further constrains generalizability. As detailed in Table 1, notable age discrepancies were observed in several studies (Thien Nguyen et al, Ris et al), where control groups were significantly younger than cancer patients. Such imbalances may introduce confounding bias, as ML models could exploit age-related cfDNA alterations rather than true tumor-derived signals. Future studies must prioritize the recruitment of diverse, demographically matched global cohorts to ensure that ML models are robust across different genetic backgrounds and environmental exposures. Third, the majority of included studies used a retrospective case-control design with artificially enriched cohorts (pooled prevalence ∼57%). While this design is valuable for initial discovery, it creates a discrepancy with real-world screening settings, where cancer prevalence is typically below 1%. This prevalence gap fundamentally affects the clinical interpretation of diagnostic metrics, particularly the positive predictive value (PPV).
In a low-prevalence population, the number of false positives can necessitate substantial confirmatory testing, a challenge empirically demonstrated in prospective trials such as the DETECT-A study. 44 Furthermore, the reliance on healthy controls can introduce spectrum bias, potentially leading to an overestimation of diagnostic accuracy compared with prospective cohort studies in which benign conditions are prevalent. Therefore, the pooled estimates presented here should be viewed as upper-bound performance benchmarks. Fourth, it is notable that our systematic search identified few studies relying solely on somatic mutation profiling that met our inclusion criteria for complex machine learning architectures. This observation reflects a broader trend in the field: unlike genome-wide methylation or fragmentation profiles, which provide millions of continuous, high-dimensional features suitable for deep learning, somatic mutations are sparse, discrete events. Consequently, mutation-based MCED assays typically require combination with protein biomarkers (as in CancerSEEK 68 ) or rely on simpler statistical models rather than the standalone cfDNA-ML frameworks evaluated here. Furthermore, this de facto exclusion aligns with biological constraints: mutation-only assays are susceptible to confounding by clonal hematopoiesis of indeterminate potential (CHIP) 69 and lack the tissue-specific signatures required for accurate tissue-of-origin (TOO) localization. 8 Therefore, our analysis predominantly synthesizes epigenetic and fragmentomic modalities, ensuring a more homogeneous comparison of algorithmic performance.
Although cfDNA-based machine learning models demonstrate substantial promise for multi-cancer early detection, considerable work remains before these tools can be routinely implemented in clinical care. A major barrier to clinical translation is the lack of standardized analytical workflows and reporting practices. Across the studies included in this meta-analysis, substantial methodological heterogeneity was observed in cfDNA processing protocols, library preparation methods, sequencing platforms, feature extraction strategies, and the choice of machine learning algorithms. These variations complicate cross-study comparisons, hinder reproducibility, and limit model generalizability.
In parallel, data privacy and ethical considerations remain significant concerns, particularly when deploying cfDNA-ML models in large-scale screening programs. Because cfDNA carries a patient's unique genetic fingerprint, its use as input to predictive algorithms may understandably raise concerns among patients. To protect patient confidentiality and prevent misuse, strict compliance with legal and ethical standards, such as the GDPR in Europe and HIPAA in the United States, will be essential.70,71
Another critical challenge lies in the interpretability and clinical acceptance of AI models. Although many ML algorithms demonstrate high diagnostic accuracy, their “black box” nature often makes it difficult for clinicians to understand or explain the rationale behind predictions. This opacity may erode clinician trust, which is essential for integration into routine clinical workflows.72,73 Furthermore, despite growing interest in AI-assisted diagnostics, many healthcare professionals remain skeptical of these tools, particularly when they are perceived to undermine human expertise or increase cognitive load.74,75 Training clinicians to interpret model outputs, as well as demonstrating clear clinical utility, will be key to fostering adoption. Importantly, successful deployment will also depend on seamless workflow integration; if AI tools are not easily embedded into existing systems or are viewed as time-consuming, clinicians may resist using them regardless of their performance.76,77 Addressing these barriers will require not only technical advancements, such as the development of more interpretable models using attention mechanisms or SHAP (SHapley Additive exPlanations),78,79 but also sustained engagement with clinicians, ethicists, and patients to ensure that implementation strategies are trustworthy, transparent, and aligned with clinical practice.
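To illustrate the attribution idea behind SHAP, the sketch below computes exact Shapley values by brute-force subset enumeration for a hypothetical linear risk score (the weights and inputs are invented for illustration; production SHAP implementations use far more efficient approximations):

```python
from itertools import combinations
from math import factorial, isclose

def shapley_values(f, x, baseline):
    """Exact Shapley attributions for model f at input x, relative to a baseline.

    Each feature's value is its weighted average marginal contribution over
    all subsets of the other features (exponential cost: toy inputs only).
    """
    n = len(x)
    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for r in range(n):
            for S in combinations(others, r):
                w = factorial(r) * factorial(n - r - 1) / factorial(n)
                with_i = [x[j] if (j in S or j == i) else baseline[j] for j in range(n)]
                without_i = [x[j] if j in S else baseline[j] for j in range(n)]
                phi += w * (f(with_i) - f(without_i))
        phis.append(phi)
    return phis

# Hypothetical linear "cancer risk score" with illustrative weights.
risk = lambda v: 0.4 * v[0] + 0.3 * v[1] - 0.2 * v[2]
phis = shapley_values(risk, x=[2.0, 1.0, 3.0], baseline=[0.0, 0.0, 0.0])
print(phis)  # for a linear model, phi_i = w_i * (x_i - baseline_i) ≈ [0.8, 0.3, -0.6]
# The attributions sum to f(x) - f(baseline): SHAP's local-accuracy property.
assert isclose(sum(phis), risk([2.0, 1.0, 3.0]) - risk([0.0, 0.0, 0.0]))
```

Presenting per-feature attributions of this kind, rather than a bare risk score, is one concrete way to make a cfDNA classifier's output inspectable by clinicians.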
Beyond technical and methodological considerations, practical implementation challenges must also be addressed to ensure that cfDNA-based ML screening can be translated into routine clinical practice. One critical barrier is the issue of health insurance coverage, which could significantly impact the scalability and accessibility of such tools. The cost of cfDNA testing, particularly when combined with machine learning algorithms, may be prohibitive without appropriate reimbursement mechanisms.80,81 Policymakers must weigh the cost-effectiveness of these tools against traditional screening methods to ensure equitable access across diverse socioeconomic groups. 82 Moreover, the potential burden of frequent testing in asymptomatic individuals requires careful assessment to avoid unnecessary healthcare expenditure or inefficient resource allocation.83,84 Addressing these barriers will require coordinated efforts among researchers, clinicians, ethicists, and policymakers to ensure that cfDNA-based diagnostics are not only scientifically sound but also financially and logistically feasible for real-world implementation.
Conclusion
This systematic review and meta-analysis demonstrates that machine learning–based cfDNA assays achieve high overall diagnostic accuracy for multi-cancer early detection, with consistently high specificity and moderate-to-high sensitivity across independent validation cohorts. Although diagnostic performance varies with geographic region, sample size, and biomarker type, the available evidence supports the scientific credibility and translational promise of cfDNA–machine learning integration for noninvasive cancer detection. To ensure generalizability and responsible clinical translation, future studies should prioritize large-scale, prospective, multicenter validation using harmonized clinical and analytical protocols.
Supplemental Material
Supplemental material, sj-docx-1-tct-10.1177_15330338261425328 for Value of Machine Learning Models for Cell-Free DNA-Based Multi-Cancer Early Detection: A Systematic Review and Meta-Analysis by Qiong Li, MS, Hongde Liu, PhD, and Jinke Wang, PhD in Technology in Cancer Research & Treatment
Acknowledgements
The authors thank Mr Juncheng Yang (Lanzhou University) for his helpful discussions and methodological suggestions during the early planning stage of this systematic review and meta-analysis.
Author Contributions
Jinke Wang and Hongde Liu contributed to the conceptualization and design of the study. Jinke Wang and Qiong Li were responsible for data acquisition, database management, and statistical analysis. Qiong Li drafted the initial version of the manuscript. Jinke Wang contributed to writing and critically revising specific sections of the manuscript. All authors reviewed the manuscript for important intellectual content and approved the final version.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study was supported by the National Natural Science Foundation of China (NSFC, Grant No. 62371126).
Declaration of Conflicting Interests
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Data Availability Statement
Template data collection forms were used to extract data from included studies. All datasets used for analyses are publicly available from previously published sources.
PROSPERO Registration
The review protocol was prospectively registered in the International Prospective Register of Systematic Reviews (PROSPERO) under the registration number CRD42025645908.
References
