Abstract
Introduction
Machine learning (ML)–based analysis of cell-free DNA (cfDNA) has emerged as a promising strategy for multi-cancer early detection (MCED). However, reported diagnostic performance varies widely across studies, and many estimates are derived from training or enriched cohorts, limiting their relevance to independent validation and real-world settings.
Methods
We conducted a systematic review and diagnostic accuracy meta-analysis of ML-based cfDNA assays for MCED. Four databases (PubMed, Embase, Web of Science, and the Cochrane Library) were searched from inception to February 2, 2025. Only independent validation or testing datasets were included; all training datasets were excluded. Pooled sensitivity, specificity, diagnostic odds ratio (DOR), and summary receiver operating characteristic (SROC) curves were estimated using a bivariate random-effects model. Subgroup analyses and meta-regression were performed to explore sources of heterogeneity.
Results
Thirteen studies comprising 23 datasets and 14,892 participants were included. Across the 11 independent validation or testing datasets, the pooled sensitivity was 0.78 (95% CI: 0.66-0.87), and the pooled specificity was 0.96 (95% CI: 0.90-0.98). The summary area under the curve (AUC) was 0.94, with a DOR of 76.6. Substantial between-study heterogeneity was observed.
Conclusion
ML-based cfDNA assays demonstrate consistently high specificity and moderate-to-high sensitivity across independent validation datasets, supporting their potential role in multi-cancer early detection. However, diagnostic performance is highly context dependent and strongly influenced by study design, population characteristics, and analytical choices. These findings highlight the need for large-scale, prospective, population-based validation before widespread clinical implementation.
Keywords
Introduction
Early cancer detection is a cornerstone of effective cancer control, offering the opportunity to initiate treatment at a potentially curable stage, thereby significantly improving patient prognosis and reducing cancer-related mortality.1 However, current population-based cancer screening strategies are highly fragmented. Most existing programs are developed for only a handful of malignancies, such as breast, cervical, colorectal, and lung cancers, and rely on distinct modalities, screening intervals, eligibility criteria, and clinical workflows.2,3 This siloed approach lacks scalability, consistency, and universal applicability across diverse populations and healthcare settings.
Moreover, traditional screening imposes substantial burdens on patient decision-making. Individuals are often required to undergo multiple, uncoordinated tests, each with different risks, benefits, and interpretations, thus creating a complex landscape that can be confusing, time-consuming, and financially challenging.4 For asymptomatic individuals, the invasiveness or ambiguity of tests may deter participation. These issues are particularly pronounced in underserved or resource-limited settings, where access to comprehensive, organ-specific screening is limited. Consequently, there is an urgent need for a universal, non-invasive, and patient-friendly early detection approach that is cost-effective, scalable, and aligned with shared decision-making principles.
Liquid biopsy based on cell-free DNA (cfDNA) offers a promising alternative. cfDNA, released into circulation through apoptosis, necrosis, and active secretion, carries tumor-specific genetic and epigenetic information.5,6 Advances in high-throughput sequencing and fragmentomic analysis have enabled detection of somatic mutations, methylation changes, copy number variations, and fragmentation patterns in cfDNA.7,8 This opens the door for multi-cancer early detection (MCED) from a single blood sample, streamlining the screening process and improving accessibility. Yet, the complexity and heterogeneity of cfDNA data challenge conventional analytic methods. Machine learning (ML) algorithms, including random forests, support vector machines (SVMs), and deep neural networks, have emerged as powerful tools capable of handling high-dimensional biological data, recognizing subtle patterns, and distinguishing cancer from non-cancer states.9–11 A conceptual illustration of the typical cfDNA-based ML workflow utilized in these assays is provided in Figure 1. By integrating multiple cfDNA features, ML models can potentially improve sensitivity and specificity while enabling broad cancer coverage and real-world clinical application.

Workflow for ML-assisted cancer diagnosis using cfDNA. The process is divided into three stages: A. Peripheral blood sample collection: Blood is drawn from a patient, and cell-free DNA (cfDNA) is extracted and subjected to next-generation sequencing (NGS). B. cfDNA feature categories: Raw NGS data is analyzed to extract epigenetic features (methylation) and fragmentomic features (fragment sizes). C. ML-assisted cancer diagnosis workflow: Machine learning algorithms analyze these features to provide a binary prediction of cancer (detected/not detected) and predict the tissue of origin, guiding further diagnostic examinations.
Although cfDNA-based machine learning assays show potential for multi-cancer early detection, reported diagnostic performance has been highly variable across studies. This variability is largely attributable to differences in study populations, cancer types, feature selection strategies, cfDNA biomarker types (such as methylation, fragmentation, or variant detection), and the ML algorithms applied. Several studies have demonstrated high sensitivity and specificity, whereas others yielded only moderate or inconsistent results, often reflecting small sample sizes, retrospective designs, or limited external validation.12,13 Additionally, the area under the receiver operating characteristic curve, a common measure of model discrimination, demonstrates wide variability, reflecting the heterogeneity in model development pipelines and cfDNA data quality.14 These inconsistencies complicate efforts to benchmark performance, translate findings into clinical workflows, and guide regulatory or reimbursement decisions.
While recent systematic reviews, such as the comprehensive Health Technology Assessment by the UK National Institute for Health and Care Research (NIHR) published in 2025,15 have evaluated the clinical implementation of commercial MCED tests, they explicitly abstained from performing quantitative meta-analyses due to high heterogeneity. Furthermore, existing clinical reviews16 typically treat the computational component as a “black box,” focusing on the final commercial product rather than quantitatively assessing how different algorithmic approaches influence diagnostic accuracy. Additionally, previous summaries often focus on single-cancer applications or specific biomarker modalities. To date, no meta-analysis has systematically aggregated and evaluated the diagnostic accuracy, heterogeneity, and methodological quality of ML-based cfDNA assays for MCED. Such a synthesis is critical to assess the current evidence base, identify sources of bias and variation, and inform the design of future prospective validation studies.
To address this gap, we systematically reviewed the existing literature and performed a quantitative meta-analysis to evaluate the diagnostic accuracy of machine learning-based cfDNA assays for multi-cancer early detection. Unlike previous reviews, we strictly excluded training datasets and synthesized results solely from independent validation cohorts to provide a realistic estimate of generalizability. We assessed pooled sensitivity, specificity, diagnostic odds ratio (DOR), and area under the summary receiver operating characteristic curve. In addition, we conducted detailed subgroup analyses and meta-regression stratified by ML algorithm type and biomarker modality to explore heterogeneity and assess methodological and biological factors influencing model performance. Our findings provide an evidence-based assessment of the translational potential of cfDNA and ML integration and offer valuable insights for the development of more patient-centered, equitable, and scalable cancer early detection strategies.
Methods
Protocol and Registration
This systematic review and meta-analysis was conducted and reported in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 guidelines.17 The review protocol was prospectively registered in the International Prospective Register of Systematic Reviews (PROSPERO). The registered protocol prespecified the study objectives, eligibility criteria, literature search strategy, outcome measures, and planned statistical analyses, thereby ensuring methodological transparency and reproducibility. Ethical review and approval were not required for this study because it is a systematic review and meta-analysis based exclusively on previously published literature and publicly available aggregate data, without direct involvement of human participants or collection of primary biological samples.
Search Strategy
A comprehensive literature search was conducted in collaboration with an evidence-based medicine specialist, using a combination of controlled vocabulary terms (MeSH and Emtree) and free-text keywords related to cell-free DNA, machine learning, multi-cancer detection, and early cancer screening. Four electronic databases (PubMed, the Cochrane Library, Embase, and Web of Science) were systematically searched from inception to February 2, 2025. No restrictions were applied regarding geographical region, study design, or article type. Only English-language publications were included. The detailed search strategies and complete search strings for all databases are provided in Supplementary Table S1. To ensure comprehensive coverage, the reference lists of all included articles and relevant reviews were manually screened to identify additional eligible studies that may have been missed in the initial search.
Eligibility Criteria
Inclusion criteria were as follows: (1) studies involving patients with confirmed cancer (at any stage) and/or healthy or non-cancer controls undergoing cfDNA-based testing for multi-cancer early detection (MCED); (2) application of ML algorithms, such as deep learning, random forest, support vector machines, logistic regression, or Bayesian models, to cfDNA data derived from sequencing, methylation profiling, fragmentomics, or other liquid biopsy techniques; (3) inclusion of a non-cancer control group for diagnostic comparison; (4) availability of diagnostic performance metrics or sufficient data to derive values for true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN); and (5) articles published in English with adequate methodological detail.
Exclusion criteria included: (1) studies with duplicate data (in which case the version with the most comprehensive dataset was retained); (2) case reports, reviews, editorials, commentaries, or conference abstracts without full data; (3) animal studies, in vitro experiments, or computational-only model development without clinical data; and (4) publications focused on non-cancer cfDNA applications such as treatment monitoring, minimal residual disease (MRD), prenatal testing, or unrelated diseases.
Study Selection
The screening process was conducted using a two-step approach. First, duplicate studies were automatically excluded using EndNote X9 (Clarivate Analytics), followed by an initial manual screening by a reviewer. Subsequently, the titles, abstracts, and full texts of all remaining records were independently assessed by two reviewers based on the predefined inclusion and exclusion criteria. Discrepancies between reviewers were resolved by discussion, and if necessary, a third independent reviewer was consulted to reach consensus. During the screening process, studies that focused on non-cancer cfDNA applications or reported complications unrelated to multi-cancer early detection were excluded at the title and abstract level. In addition, case reports, case series, editorials, commentaries, and conference abstracts without full data were excluded.
Data Extraction and Quality Assessment
Two reviewers independently extracted relevant data from each eligible study using a standardized, pre-defined data extraction form. The extracted information included: first author's name and publication year, study location, study design, sample size, cancer type and stage, diagnostic reference standard, type of control group, participant age and sex, machine learning algorithm used, type of cfDNA biomarker, and diagnostic outcomes, including the number of TP, FN, FP, and TN. For multi-cancer early detection studies that reported performance metrics per individual cancer type without providing an overall aggregate confusion matrix, we synthesized the data by treating detection as a binary outcome (cancer detected vs not detected). We calculated the aggregate TP and FN by summing these values across all included cancer types. Conversely, for the control group, we utilized the overall specificity, FP, and TN reported for the total non-cancer cohort to ensure that control subjects were not counted multiple times. To ensure the reliability of diagnostic accuracy estimates, we prioritized the extraction of performance metrics from independent validation cohorts or held-out test sets. For studies that reported sensitivity and specificity alongside sample sizes but lacked explicit confusion matrix values (TP, FP, FN, TN), we back-calculated these values to reconstruct the 2 × 2 contingency tables. Datasets with missing or incomplete data that could not be reliably reconstructed were excluded from the quantitative meta-analysis. After both reviewers completed the data extraction, results were cross-checked, and any discrepancies were resolved through discussion to reach consensus.
The methodological quality and risk of bias of the included studies were evaluated using the Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) tool.18 This tool assesses four domains (patient selection, index test, reference standard, and flow and timing) for risk of bias, and the first three domains for applicability concerns. The assessments were performed independently by two reviewers, and disagreements were resolved through consensus or with the input of a third reviewer.
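The back-calculation of contingency tables described above can be sketched as follows. This is an illustrative Python sketch with hypothetical input values, not the extraction script used in the review:

```python
def reconstruct_2x2(sensitivity, specificity, n_cancer, n_control):
    """Back-calculate TP/FP/FN/TN from reported sensitivity, specificity,
    and group sizes, rounding to the nearest whole participant."""
    tp = round(sensitivity * n_cancer)
    fn = n_cancer - tp
    tn = round(specificity * n_control)
    fp = n_control - tn
    return tp, fp, fn, tn

# Hypothetical example: sensitivity 0.78, specificity 0.96,
# 200 cancer patients and 300 non-cancer controls.
tp, fp, fn, tn = reconstruct_2x2(0.78, 0.96, 200, 300)
print(tp, fp, fn, tn)  # 156 12 44 288
```

Because rounding can shift a cell by one count when reported metrics are themselves rounded, reconstructed tables were cross-checked against any totals reported in the source publications.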
Data Synthesis and Analysis
All statistical analyses were conducted using Stata software, version 14.0 SE (StataCorp LLC, College Station, TX). Diagnostic test accuracy meta-analysis was performed using the midas command based on a bivariate mixed-effects regression model. This model was selected as the primary statistical framework because it explicitly accounts for the inherent trade-off and negative correlation between sensitivity and specificity (the threshold effect), thereby preserving the two-dimensional nature of the diagnostic data. The model operates under the assumption that the logit-transformed sensitivity and specificity follow a bivariate normal distribution across the included studies, incorporating both within-study sampling error and between-study heterogeneity. While the Hierarchical Summary Receiver Operating Characteristic (HSROC) model represents a valid alternative structure, the bivariate model was prioritized as it allows for the direct estimation of pooled performance metrics with their respective confidence intervals.
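As an illustration of the data the bivariate model operates on, the per-study logit transforms and within-study sampling variances can be derived as below. This is a hedged Python sketch; the 0.5 continuity correction is a common convention and an assumption here, not necessarily what the midas command applies internally:

```python
import math

def bivariate_inputs(tp, fp, fn, tn):
    """Per-study inputs for a bivariate meta-analysis: logit sensitivity and
    logit specificity with their within-study sampling variances.
    A 0.5 continuity correction (an assumption here) avoids zero cells."""
    tp, fp, fn, tn = (x + 0.5 for x in (tp, fp, fn, tn))
    logit_sens = math.log(tp / fn)
    logit_spec = math.log(tn / fp)
    var_sens = 1 / tp + 1 / fn
    var_spec = 1 / tn + 1 / fp
    return logit_sens, var_sens, logit_spec, var_spec

# Hypothetical study: TP=156, FP=12, FN=44, TN=288
ls, vs, lsp, vsp = bivariate_inputs(156, 12, 44, 288)
```

The bivariate model then assumes these logit pairs follow a bivariate normal distribution across studies, estimating the between-study variances and the correlation that captures the threshold effect.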
Pooled estimates included sensitivity (Sen), specificity (Spe), positive likelihood ratio (PLR), negative likelihood ratio (NLR), and the diagnostic odds ratio (DOR), each with corresponding 95% confidence intervals (CIs). The DOR ranges from 0 to infinity, with higher values indicating greater discriminatory ability of the diagnostic test.19 A summary receiver operating characteristic (SROC) curve was constructed, and the area under the curve (AUC) was calculated to evaluate overall diagnostic performance, with values closer to 1 indicating higher accuracy.20
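For a single 2 × 2 table, the metrics defined above reduce to simple ratios, with a 95% CI for the DOR obtainable on the log scale. A minimal Python sketch using hypothetical counts (tables containing zero cells would need a continuity correction first):

```python
import math

def diagnostic_metrics(tp, fp, fn, tn):
    """Sen, Spe, PLR, NLR, and DOR for one 2x2 table, with a 95% CI
    for the DOR computed on the log scale."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    plr = sens / (1 - spec)           # positive likelihood ratio
    nlr = (1 - sens) / spec           # negative likelihood ratio
    dor = plr / nlr                   # equals (TP*TN)/(FP*FN)
    se_log_dor = math.sqrt(1/tp + 1/fp + 1/fn + 1/tn)
    ci = (math.exp(math.log(dor) - 1.96 * se_log_dor),
          math.exp(math.log(dor) + 1.96 * se_log_dor))
    return sens, spec, plr, nlr, dor, ci

# Hypothetical table: TP=156, FP=12, FN=44, TN=288
sens, spec, plr, nlr, dor, ci = diagnostic_metrics(156, 12, 44, 288)
```

The pooled versions of these quantities come from the bivariate model rather than from summing tables, so they are not simple averages of the per-study values.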
Between-study heterogeneity was assessed using Cochran's Q test and the I² statistic, with I² values above 50% indicating substantial heterogeneity.
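Cochran's Q and the I² statistic can be computed from study-level effect sizes (for example, log-DORs) and their within-study variances. A minimal sketch, assuming inverse-variance weights:

```python
def cochran_q_i2(effects, variances):
    """Cochran's Q and I^2 (%) for study-level effects (e.g., log-DORs)
    with their within-study variances, using inverse-variance weights."""
    weights = [1 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return q, i2
```

Under homogeneity, Q follows a chi-squared distribution with k − 1 degrees of freedom; I² expresses the proportion of total variation attributable to between-study heterogeneity rather than chance.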
Results
Study Selection
A total of 2190 records were identified through database searches, including 859 from Embase, 606 from PubMed, 64 from The Cochrane Library, and 661 from Web of Science. After removing 710 duplicate records, 1480 articles remained for title and abstract screening. Of these, 1445 articles were excluded for not meeting the inclusion criteria based on title and abstract review. The remaining 35 articles were retrieved for full-text assessment, after which 22 studies were excluded for reasons such as insufficient data, irrelevant outcomes, or ineligible study design. No additional eligible studies were identified through manual searching. Ultimately, 13 studies were included in the final meta-analysis.7,25–36 The study selection process is illustrated in the PRISMA flow diagram (Figure 2).

PRISMA flow diagram of study selection process. A total of 2190 records were identified through database searching. After duplicate removal and screening, 13 studies were included in the final meta-analysis.
Study Characteristics
A total of 13 studies were included in this meta-analysis, comprising 1 prospective cohort study,30 2 retrospective cohort studies,25,26 and 10 case-control studies.7,27–29,31–36 Among these, 4 studies reported only one dataset,7,30,32,35 1 study provided three datasets,28 and the remaining 8 studies contributed two datasets each,25–27,29,31,33,34,36 resulting in a total of 23 datasets for analysis. The included studies were published between 2019 and 2025, involving a combined total of 14,892 participants, including 8434 cancer patients and 6458 non-cancer controls. Geographically, the studies were conducted in China, the United States, India, the Netherlands, Belgium, and Vietnam. The study by Xu J (2024)36 primarily enrolled cancer patients in stages II–IV, whereas all other studies included patients across stages I–IV. Comprehensive information on population demographics (age, sex), machine learning algorithms, cfDNA biomarker types, and diagnostic performance metrics (TP, FP, TN, FN) is summarized in Table 1.
The methodological quality was evaluated using the QUADAS-2 tool, which assesses risk of bias across four domains (Patient Selection, Index Test, Reference Standard, and Flow and Timing) based on specific signaling questions (Figures 3A and 3B). While the overall risk of bias appeared comparable across studies, domain-specific evaluations revealed variations that mirrored the heterogeneity in study characteristics, particularly in subject recruitment and control selection. Specifically, the Reference Standard and Index Test domains consistently demonstrated low risk, reflecting rigorous pathological confirmation and blinded model interpretation. In contrast, the Patient Selection domain exhibited variability (often classified as ‘Unclear’ in Figure 3A), which directly reflects the substantial differences in study populations and the prevalence of retrospective case-control designs utilizing healthy controls.
Detailed diagnostic performance of each included study, stratified by dataset (training, validation, or independent testing) and by individual cancer types, is summarized in Supplementary Table S2. Overall, cfDNA-based ML models demonstrated consistently high specificity (typically >90%) and variable accuracy across cancer types, with particularly strong performance for hepatobiliary, pancreatic, and esophageal cancers, whereas breast and prostate cancers showed lower accuracy in large-scale validation datasets.

QUADAS-2 assessment of risk of bias and applicability concerns for included studies. (A) Summary bar charts showing the proportion of studies rated as having low (green), unclear (yellow), or high (red) risk of bias and applicability concerns across the four QUADAS-2 domains: patient selection, index test, reference standard, and flow and timing. (B) Traffic-light plot illustrating domain-specific judgments for each individual study. Green circles (+) indicate low risk, yellow circles (?) indicate unclear risk, and red circles (–) indicate high risk.
Characteristics of 13 Included Studies in This Meta-Analysis.
Abbreviations used in this table: 5hmC, 5-hydroxymethylcytosine; CCS, Case-control study; PCS, Prospective cohort study; RCS, Retrospective cohort study; HC, Healthy control; SVM, Support vector machine; RFM, Random forest model; LRA, Logistic regression algorithm; MLRM, Multinomial logistic regression model; SGBM, Stochastic gradient boosting model; GLM, Generalized linear model; ML, Machine learning; NR, not reported; CRC, colorectal cancer; GC, gastric cancer; N, number of groups; M, male; F, female; TP, true positive; FN, false negative; FP, false positive; TN, true negative.
Pooled Diagnostic Accuracy of cfDNA-Based Machine Learning Models
A total of 13 studies (23 datasets) were included in the meta-analysis to evaluate the diagnostic performance of ML models applied to cfDNA for early multi-cancer detection. The Spearman correlation analysis yielded a correlation coefficient of −0.53.

Pooled sensitivity, specificity, and SROC curve of machine learning-based cfDNA assays for multi-cancer early detection. (A) Forest plots showing the individual and pooled sensitivity (left) and specificity (right) across 23 datasets from 13 studies. The pooled sensitivity was 0.779 (95% CI: 0.699-0.843) with significant heterogeneity.
Diagnostic Likelihood Ratios, Discriminatory Power, and Between-Study Heterogeneity
The pooled PLR was 20.712 (95% CI: 13.236-32.412), and the pooled NLR was 0.229 (95% CI: 0.167-0.315), as shown in Figure 5A. Both indicators demonstrated significant between-study heterogeneity.

Pooled likelihood ratios and DOR of cfDNA-based machine learning models for multi-cancer early detection. (A) Forest plots showing the pooled PLR and NLR across included datasets. The pooled PLR was 20.712 (95% CI: 13.236-32.412), and the pooled NLR was 0.229 (95% CI: 0.167-0.315), both indicating strong diagnostic performance. (B) Forest plot of the pooled DOR, estimated at 90.273 (95% CI: 55.886-145.818), reflecting the overall discriminatory ability of the test. All metrics exhibited significant heterogeneity, as indicated by high I² values.
Clinical Utility Analysis
The Fagan nomogram plot was used to evaluate the clinical utility of cfDNA-based machine learning models for multi-cancer early detection (Figure 6). Based on the included studies, the pre-test probability of cancer among the study population was 57%. Given the pooled PLR and NLR from the meta-analysis, the post-test probability was calculated to be 96% for individuals with a positive test result, and 23% for those with a negative result. These findings suggest that cfDNA evaluated by ML models provides clinically meaningful diagnostic information within high-risk or enriched cohorts. However, to estimate utility in a realistic screening scenario, we simulated a hypothetical low-prevalence setting of 1% (typical for an average-risk general population).1 Using the pooled PLR of 20.71, a positive test result would increase the post-test probability of cancer from 1% to approximately 17%. Conversely, using the pooled NLR of 0.23, a negative test result would decrease the probability from 1% to approximately 0.2%. This comparison highlights that while the test significantly elevates the probability of cancer detection (from 1% to 17%), a positive result in a general screening population is not diagnostic on its own and necessitates rigorous follow-up confirmation.
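The Fagan nomogram is a graphical application of Bayes' theorem on the odds scale, and the post-test probabilities quoted above can be reproduced directly from the pooled likelihood ratios:

```python
def post_test_probability(pre_test_prob, likelihood_ratio):
    """Bayes' theorem on the odds scale, as read off a Fagan nomogram:
    convert probability to odds, multiply by the likelihood ratio,
    and convert back to a probability."""
    pre_odds = pre_test_prob / (1 - pre_test_prob)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)

# Pooled estimates from this meta-analysis: PLR 20.712, NLR 0.229
print(round(post_test_probability(0.57, 20.712), 2))  # 0.96 (enriched cohort, positive test)
print(round(post_test_probability(0.57, 0.229), 2))   # 0.23 (enriched cohort, negative test)
print(round(post_test_probability(0.01, 20.712), 2))  # 0.17 (1% prevalence, positive test)
```

The same calculation makes the prevalence dependence explicit: identical likelihood ratios yield a 96% post-test probability in the enriched study population but only about 17% in an average-risk screening setting.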

Fagan nomogram for evaluating the clinical utility of cfDNA-based machine learning models. The pre-test probability of cancer was set at 57%, consistent with the overall prevalence across included studies. The post-test probability was 96% for a positive result and 23% for a negative result, reflecting the strong diagnostic influence of the test.
Assessment of Publication Bias
Deeks’ funnel plot asymmetry test was used to assess the risk of publication bias among the included studies. As shown in Figure 7, the scatter of studies was relatively symmetrical, and the slope of the regression line was not statistically significant (P = 0.35), suggesting no significant publication bias.

Deeks’ funnel plot for publication bias assessment. The funnel plot shows a symmetrical distribution of included studies with a non-significant slope in the linear regression test for funnel plot asymmetry (P = 0.35), suggesting no significant publication bias.
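Deeks' test regresses the log-DOR against the inverse square root of the effective sample size (ESS), weighted by ESS; asymmetry appears as a nonzero slope. A simplified Python sketch of the slope estimate (the significance test itself was run in Stata; the 0.5 continuity correction is an assumption):

```python
import math

def deeks_slope(tables):
    """Weighted least-squares slope for Deeks' funnel plot: ln(DOR)
    regressed on 1/sqrt(ESS), weighted by the effective sample size.
    A 0.5 continuity correction (an assumption) guards against zero cells."""
    xs, ys, ws = [], [], []
    for tp, fp, fn, tn in tables:
        tp, fp, fn, tn = (v + 0.5 for v in (tp, fp, fn, tn))
        n1, n2 = tp + fn, fp + tn            # diseased / non-diseased totals
        ess = 4 * n1 * n2 / (n1 + n2)        # effective sample size
        xs.append(1 / math.sqrt(ess))
        ys.append(math.log((tp * tn) / (fp * fn)))
        ws.append(ess)
    w_sum = sum(ws)
    x_bar = sum(w * x for w, x in zip(ws, xs)) / w_sum
    y_bar = sum(w * y for w, y in zip(ws, ys)) / w_sum
    num = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(ws, xs, ys))
    den = sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))
    return num / den

# Two hypothetical studies with identical DORs but different sizes:
# the funnel is symmetric, so the slope is approximately zero.
slope = deeks_slope([(10, 10, 10, 10), (100, 100, 100, 100)])
```

A slope near zero, as here, indicates that small and large studies report similar accuracy, consistent with the non-significant result (P = 0.35) shown in the funnel plot.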
Subgroup Analysis and Meta-Regression
To explore sources of heterogeneity, we conducted subgroup analyses and meta-regression based on geographic region, study design, specificity threshold, sample size, biomarker type, algorithm type, and cancer type coverage (Table 2; Figure 8).

Forest plots of subgroup analyses assessing sensitivity and specificity across study-level covariates. Subgroups were stratified by (A) geographic region (Western vs Asia), (B) study design (case-control vs cohort), (C) control group type (healthy control vs non-cancer control), (D) pre-specified specificity thresholds, (E) sample size (≥500 vs <500), (F) cfDNA biomarker type (methylation-based vs non-methylation-based), (G) machine learning algorithm type (tree-based vs linear/kernel-based models), and (H) number of cancer types included (>5 vs ≤5). Differences in diagnostic performance were evaluated using meta-regression; P-values indicate the statistical significance of subgroup differences in sensitivity and specificity.
The Analysis Results of Meta-Regression.
Abbreviations used in this table: CCS, Case-control study; CS, Cohort study; HC, Healthy control.
Geographic Region
Subgroup analysis by study location revealed significant differences. Western studies demonstrated a pooled sensitivity of 0.62 (95% CI: 0.50-0.74) and specificity of 0.98 (95% CI: 0.96-0.99), while studies conducted in Asia yielded higher sensitivity at 0.85 (95% CI: 0.80-0.91) and slightly lower specificity at 0.95 (95% CI: 0.92-0.98). The between-group differences were statistically significant for both sensitivity and specificity.
Study Design
Case-control studies (CCS) showed a sensitivity of 0.76 (95% CI: 0.67-0.85) and specificity of 0.96 (95% CI: 0.94-0.98), whereas cohort studies (CS) demonstrated a slightly higher sensitivity of 0.83 (95% CI: 0.72-0.94) and similar specificity (0.96, 95% CI: 0.93-1.00). Sensitivity and specificity differences were significant in univariate analysis.
Control Type
When grouped by control type, studies using healthy controls achieved higher sensitivity (0.83, 95% CI: 0.75-0.92) and specificity (0.97, 95% CI: 0.95-0.99) than those using non-cancer controls (sensitivity 0.74, 95% CI: 0.64-0.84; specificity 0.96, 95% CI: 0.93-0.98); however, this difference was not significant in the joint model.
Pre-Specified Specificity Thresholds
Studies that pre-specified a high specificity threshold (95-99%) had lower sensitivity at 0.72 (95% CI: 0.62-0.81) but higher specificity at 0.98 (95% CI: 0.97-0.99). In contrast, those without such thresholds reported higher sensitivity of 0.86 (95% CI: 0.78-0.94) and lower specificity of 0.90 (95% CI: 0.85-0.95). The difference in sensitivity was statistically significant.
Sample Size
Studies with ≥500 participants had a sensitivity of 0.65 (95% CI: 0.53-0.78) and specificity of 0.98 (95% CI: 0.97-0.99), while those with fewer than 500 participants had higher sensitivity of 0.84 (95% CI: 0.78-0.90) and lower specificity of 0.94 (95% CI: 0.91-0.97). Both sensitivity and specificity differences were significant.
cfDNA Biomarker Type
Studies using methylation-based cfDNA biomarkers reported a sensitivity of 0.73 (95% CI: 0.63-0.83) and specificity of 0.97 (95% CI: 0.96-0.99). In comparison, non-methylation-based studies showed higher sensitivity (0.83, 95% CI: 0.75-0.92) and slightly lower specificity (0.94, 95% CI: 0.89-0.98). The differences were significant in univariate analyses.
ML Algorithm Type
Tree-based models, such as random forests and gradient boosting, had a pooled sensitivity of 0.72 (95% CI: 0.59-0.85) and specificity of 0.96 (95% CI: 0.93-0.99). Linear or kernel-based models (eg, logistic regression, SVM) showed higher sensitivity of 0.81 (95% CI: 0.74-0.89) with comparable specificity (0.96, 95% CI: 0.94-0.99). Differences in both sensitivity and specificity were significant in univariate comparisons.
Cancer Type Breadth
Studies that assessed >5 cancer types showed a lower sensitivity of 0.73 (95% CI: 0.64-0.81) but higher specificity of 0.98 (95% CI: 0.96-0.99). In contrast, studies that included ≤5 cancer types reported higher sensitivity of 0.87 (95% CI: 0.79-0.95) and lower specificity of 0.91 (95% CI: 0.84-0.97). The difference in sensitivity was significant.
These analyses revealed that diagnostic performance varied significantly across subgroups, suggesting that heterogeneity was primarily driven by differences in population characteristics such as geographic region, study design elements including sample size and the number of cancer types assessed, and methodological choices related to the type of cfDNA biomarker used and the machine learning algorithm implemented. Notably, studies involving smaller cohorts, a narrower range of cancer types, or populations from specific regions tended to report higher sensitivity.
Sensitivity Analysis
A sensitivity analysis was performed by excluding all training datasets, leaving 11 independent validation or testing datasets for re-analysis. As shown in Supplementary Figure 1A, the pooled sensitivity was 0.782 (95% CI: 0.662-0.868), with significant heterogeneity.
The DOR was 76.56 (95% CI: 38.59-151.86), with significant heterogeneity.
Discussion
Our study focused on cfDNA rather than other circulating biomarkers such as exosomal RNA, circulating tumor cells, or protein-based markers, due to cfDNA's unique combination of biological accessibility, molecular richness, and growing clinical relevance. Unlike protein markers, which often suffer from low specificity and context dependency,37 cfDNA provides direct genomic and epigenomic signals reflective of tumor biology, including somatic mutations, methylation alterations, and fragmentation patterns.38 Compared to circulating tumor cells (CTCs), cfDNA is more consistently detectable across early and late-stage cancers and can be more feasibly integrated into high-throughput sequencing workflows.39 Moreover, cfDNA assays offer compatibility with machine learning pipelines due to their high-dimensional feature space, which enables nuanced modeling for pan-cancer detection. These properties make cfDNA particularly well-suited for scalable, minimally invasive cancer screening strategies. The notably high specificity aligns with the clinical goal of minimizing false positives in population-wide screening, while the acceptable sensitivity reflects meaningful detection capability at early disease stages.40
To contextualize the potential value of cfDNA-based ML models, it is important to compare their performance with existing single-cancer screening modalities.41 Traditional methods such as mammography for breast cancer and fecal immunochemical testing (FIT) or colonoscopy for colorectal cancer are well-established, evidence-based tools that have significantly reduced cancer mortality when used appropriately in target populations.42 However, these methods are cancer-type specific and are often underutilized due to invasiveness, accessibility, or compliance issues.43 In contrast, cfDNA-ML assays offer the possibility of simultaneous, multi-cancer detection from a single blood draw, potentially improving patient convenience and uptake.44 While mammography achieves a sensitivity of ∼77%–95% and specificity of ∼94% in screening settings,45 cfDNA-based models in our analysis demonstrate comparable or higher specificity (often >90%) and acceptable sensitivity, especially for detecting multiple cancers concurrently. Similarly, FIT for colorectal cancer has a reported sensitivity of ∼74% for early-stage disease and specificity around 95%,46 but it requires regular repeated testing and lacks pan-cancer scope. Moreover, cfDNA analysis is less invasive than colonoscopy and may be more acceptable to patients, particularly in low-resource or rural settings where endoscopy services are limited. However, these comparisons should be interpreted with caution: the reported performance metrics of cfDNA-based models often exhibit overlapping confidence intervals with those of established single-cancer screening modalities and are frequently derived from case-control or retrospective study designs rather than true screening populations.
Moreover, these approaches differ fundamentally in clinical use-case context, including target populations, screening frequency, and intended clinical objectives. As a result, the comparisons presented here are intended to provide contextual benchmarks rather than to imply direct equivalence or clinical interchangeability. Accordingly, cfDNA-based multi-cancer early detection assays may be best viewed as complements to, rather than replacements for, established single-cancer screening strategies. Nevertheless, the ability to screen simultaneously for multiple lethal cancers, including those such as pancreatic and ovarian cancer that currently lack any effective screening option, represents a potential paradigm shift in early detection strategy.
Despite these encouraging results, considerable heterogeneity was observed across the included studies.
Despite the promising diagnostic performance reported across many studies, the risk of model overfitting remains a major concern in machine learning–based cfDNA assays for multi-cancer early detection, particularly in settings characterized by high-dimensional genomic or epigenomic features and relatively limited sample sizes, which are inherently prone to inflated performance estimates.54,55 Although internal validation strategies such as cross-validation are widely applied, reliance on internal validation alone may overestimate model performance and fail to adequately account for population heterogeneity, technical variability, and differences in pre-analytical workflows encountered in real-world clinical settings.56,57 Equally important is the lack of independent multicenter external validation in the current literature. A substantial proportion of studies rely on single-center cohorts or reuse publicly available datasets, which limits the evaluation of model robustness across diverse populations, sequencing platforms, and laboratory protocols.58,59 Without rigorous external validation using geographically and clinically distinct cohorts, the generalizability of these machine learning models remains uncertain, even for leading cfDNA-based multi-cancer early detection platforms. 8 From a clinical deployment perspective, these limitations represent key barriers to translation. Addressing them will require large-scale, prospective, multicenter studies with standardized cfDNA processing pipelines, transparent model reporting, and independent external validation, as emphasized in established biomarker development and regulatory frameworks.60,61 Such efforts are essential to bridge the gap between encouraging algorithmic performance and reliable real-world clinical implementation of cfDNA-based multi-cancer early detection assays.
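The optimism of internal-only validation is easy to reproduce in a toy simulation (entirely synthetic data, not drawn from any included study): when feature selection is performed on the full dataset before cross-validation, information leaks from the test folds and accuracy is inflated even though the labels are pure noise. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 100, 2000, 20          # few samples, many features (typical of cfDNA studies)
X = rng.standard_normal((n, p))  # random "omics" features
y = rng.integers(0, 2, size=n)   # labels independent of X: true accuracy is 50%

def top_features(Xtr, ytr, k):
    """Rank features by absolute correlation with the label; keep the top k."""
    Xc = Xtr - Xtr.mean(axis=0)
    yc = ytr - ytr.mean()
    corr = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
    return np.argsort(corr)[-k:]

def centroid_accuracy(Xtr, ytr, Xte, yte):
    """Nearest-class-centroid classifier: assign each test point to the closer class mean."""
    c0, c1 = Xtr[ytr == 0].mean(axis=0), Xtr[ytr == 1].mean(axis=0)
    pred = (np.linalg.norm(Xte - c1, axis=1) < np.linalg.norm(Xte - c0, axis=1)).astype(int)
    return float((pred == yte).mean())

def cv_accuracy(leaky):
    folds = np.array_split(rng.permutation(n), 5)
    sel_full = top_features(X, y, k) if leaky else None  # LEAK: select on ALL data
    accs = []
    for te in folds:
        tr = np.setdiff1d(np.arange(n), te)
        sel = sel_full if leaky else top_features(X[tr], y[tr], k)  # proper: inside the fold
        accs.append(centroid_accuracy(X[tr][:, sel], y[tr], X[te][:, sel], y[te]))
    return float(np.mean(accs))

leaky_acc, proper_acc = cv_accuracy(True), cv_accuracy(False)
print(f"leaky CV accuracy:  {leaky_acc:.2f}")   # optimistically high
print(f"proper CV accuracy: {proper_acc:.2f}")  # near chance (0.5)
```

The leaky estimate looks diagnostic despite the labels being random, while the properly nested procedure hovers near chance; the same mechanism can inflate internally validated cfDNA models whenever feature selection, normalization, or hyperparameter tuning touches the held-out data.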
In addition to study-level heterogeneity, differences in the type of cfDNA biomarker and machine learning algorithm used also contributed to variability in diagnostic performance. Methylation-based biomarkers yielded more consistent and robust performance metrics, likely owing to their stable epigenetic signals and strong cancer-type specificity, as demonstrated in studies by Xiong et al 62 and Sharma et al. 63 In contrast, fragmentation-based features, while promising, may be more susceptible to noise and variability across sample-handling procedures and sequencing platforms. 64 Similarly, the choice of machine learning algorithm influenced diagnostic performance. Linear models such as logistic regression and linear-kernel support vector machines (SVMs) generally achieved higher sensitivity while maintaining comparable specificity relative to tree-based models such as random forests or gradient boosting. 65 Linear models often generalize better in the high-dimensional, low-sample-size settings common in biomedical applications, whereas complex tree-based models may overfit training data if not properly validated. 66 These observations underscore the importance of methodological harmonization, rigorous cross-validation, and external testing when developing and reporting ML-based diagnostic models.
A major limitation of our study is the substantial heterogeneity observed across the included studies. High heterogeneity is common in diagnostic meta-analyses owing to variations in patient spectrum, sample handling, and the threshold effect; it implies that the pooled estimates should be interpreted as an average performance benchmark rather than a precise prediction for any single clinical setting. Despite this high heterogeneity, we deemed quantitative synthesis appropriate for several reasons. First, we used a bivariate mixed-effects model, which statistically accounts for between-study variability and preserves the two-dimensional nature of diagnostic data (sensitivity and specificity), offering a more robust estimation than fixed-effect models in heterogeneous settings. 67 Furthermore, as ML-based cfDNA assays for multi-cancer early detection are rapidly advancing, a systematic synthesis is critical to assess their current overall diagnostic value. By pooling the latest evidence, we aim not only to estimate performance but also to identify the sources of variation, such as algorithm type and biomarker modality, thereby revealing methodological deficiencies.
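As a point of reference for readers, the diagnostic odds ratio (DOR) combines sensitivity and specificity into a single measure of discriminatory power:

```latex
\mathrm{DOR} \;=\; \frac{\mathrm{TP}/\mathrm{FN}}{\mathrm{FP}/\mathrm{TN}}
\;=\; \frac{\text{sensitivity}}{1-\text{sensitivity}} \times \frac{\text{specificity}}{1-\text{specificity}}
```

Note that the pooled DOR reported in this meta-analysis is estimated jointly by the bivariate model rather than by substituting the pooled sensitivity and specificity into this formula, so the two calculations need not coincide.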
Beyond heterogeneity, other limitations must be considered. A second limitation is the restricted demographic and geographic diversity of the included datasets. Geographically, over 70% of the studies were conducted in China and the United States. This restriction limits generalizability because of regional variations in cancer epidemiology. As observed in the baseline characteristics, Asian cohorts were enriched with high-shedding tumor types (gastric and hepatocellular carcinoma), yielding higher sensitivity estimates compared with Western cohorts dominated by low-shedding types (breast cancer). Consequently, the pooled diagnostic accuracy reported here may not be directly transferable to regions with different prevalent cancer profiles. Demographically, heterogeneity in baseline characteristics further constrains generalizability. As detailed in Table 1, notable age discrepancies were observed in several studies (Thien Nguyen et al, Ris et al), where control groups were significantly younger than cancer patients. Such imbalances may introduce confounding bias, as ML models could exploit age-related cfDNA alterations rather than true tumor-derived signals. Future studies must prioritize the recruitment of diverse, demographically matched global cohorts to ensure that ML models are robust across different genetic backgrounds and environmental exposures. Third, the majority of included studies used a retrospective case-control design with artificially enriched cohorts (pooled prevalence ∼57%). While this design is valuable for initial discovery, it creates a discrepancy with real-world screening settings, where cancer prevalence is typically below 1%. This prevalence gap fundamentally affects the clinical interpretation of diagnostic metrics, particularly the positive predictive value (PPV).
In a low-prevalence population, the number of false positives can necessitate substantial confirmatory testing, a challenge empirically demonstrated in prospective trials such as the DETECT-A study. 44 Furthermore, the reliance on healthy controls can introduce spectrum bias, potentially leading to an overestimation of diagnostic accuracy compared with prospective cohort studies in which benign conditions are prevalent. Therefore, the pooled estimates presented here should be viewed as upper-bound performance benchmarks. Fourth, it is notable that our systematic search identified few studies relying solely on somatic mutation profiling that met our inclusion criteria for complex machine learning architectures. This observation reflects a broader trend in the field: unlike genome-wide methylation or fragmentation profiles, which provide millions of continuous, high-dimensional features suitable for deep learning, somatic mutations are sparse, discrete events. Consequently, mutation-based MCED assays typically require combination with protein biomarkers (as in CancerSEEK 68 ) or rely on simpler statistical models rather than the standalone cfDNA-ML frameworks evaluated here. Furthermore, this de facto exclusion aligns with biological constraints: mutation-only assays are susceptible to confounding by clonal hematopoiesis of indeterminate potential (CHIP) 69 and lack the tissue-specific signatures required for accurate tissue-of-origin (TOO) localization. 8 Therefore, our analysis predominantly synthesizes epigenetic and fragmentomic modalities, ensuring a more homogeneous comparison of algorithmic performance.
Although cfDNA-based machine learning models demonstrate substantial promise for multi-cancer early detection, considerable work remains before these tools can be routinely implemented in clinical care. A major barrier to clinical translation is the lack of standardized analytical workflows and reporting practices. Across the studies included in this meta-analysis, substantial methodological heterogeneity was observed in cfDNA processing protocols, library preparation methods, sequencing platforms, feature extraction strategies, and the choice of machine learning algorithms. These variations complicate cross-study comparisons, hinder reproducibility, and limit model generalizability.
In parallel, data privacy and ethical considerations remain significant concerns, particularly when deploying cfDNA-ML models in large-scale screening programs. Because cfDNA carries a patient's unique genetic fingerprint, its use as input to predictive algorithms may understandably raise concerns among patients. To protect patient confidentiality and prevent misuse, strict compliance with legal and ethical standards, such as the GDPR in Europe and HIPAA in the United States, will be essential.70,71
Another critical challenge lies in the interpretability and clinical acceptance of AI models. Although many ML algorithms demonstrate high diagnostic accuracy, their “black box” nature often makes it difficult for clinicians to understand or explain the rationale behind predictions. This opacity may erode clinician trust, which is essential for integration into routine clinical workflows.72,73 Furthermore, despite growing interest in AI-assisted diagnostics, many healthcare professionals remain skeptical of these tools, particularly when they are perceived to undermine human expertise or increase cognitive load.74,75 Training clinicians to interpret model outputs, as well as demonstrating clear clinical utility, will be key to fostering adoption. Importantly, successful deployment will also depend on seamless workflow integration; if AI tools are not easily embedded into existing systems or are viewed as time-consuming, clinicians may resist using them regardless of their performance.76,77 Addressing these barriers will require not only technical advancements, such as the development of more interpretable models using attention mechanisms or SHAP (SHapley Additive exPlanations),78,79 but also sustained engagement with clinicians, ethicists, and patients to ensure that implementation strategies are trustworthy, transparent, and aligned with clinical practice.
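To illustrate the attribution idea behind SHAP, the sketch below computes exact Shapley values by brute-force subset enumeration for a hypothetical linear risk score (the weights and inputs are invented for illustration; production SHAP implementations use far more efficient approximations):

```python
from itertools import combinations
from math import factorial, isclose

def shapley_values(f, x, baseline):
    """Exact Shapley attributions for model f at input x, relative to a baseline.

    Each feature's value is its weighted average marginal contribution over
    all subsets of the other features (exponential cost: toy inputs only).
    """
    n = len(x)
    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for r in range(n):
            for S in combinations(others, r):
                w = factorial(r) * factorial(n - r - 1) / factorial(n)
                with_i = [x[j] if (j in S or j == i) else baseline[j] for j in range(n)]
                without_i = [x[j] if j in S else baseline[j] for j in range(n)]
                phi += w * (f(with_i) - f(without_i))
        phis.append(phi)
    return phis

# Hypothetical linear "cancer risk score" with illustrative weights.
risk = lambda v: 0.4 * v[0] + 0.3 * v[1] - 0.2 * v[2]
phis = shapley_values(risk, x=[2.0, 1.0, 3.0], baseline=[0.0, 0.0, 0.0])
print(phis)  # for a linear model, phi_i = w_i * (x_i - baseline_i) ≈ [0.8, 0.3, -0.6]
# The attributions sum to f(x) - f(baseline): SHAP's local-accuracy property.
assert isclose(sum(phis), risk([2.0, 1.0, 3.0]) - risk([0.0, 0.0, 0.0]))
```

Presenting per-feature attributions of this kind, rather than a bare risk score, is one concrete way to make a cfDNA classifier's output inspectable by clinicians.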
Beyond technical and methodological considerations, practical implementation challenges must also be addressed to ensure that cfDNA-based ML screening can be translated into routine clinical practice. One critical barrier is the issue of health insurance coverage, which could significantly impact the scalability and accessibility of such tools. The cost of cfDNA testing, particularly when combined with machine learning algorithms, may be prohibitive without appropriate reimbursement mechanisms.80,81 Policymakers must weigh the cost-effectiveness of these tools against traditional screening methods to ensure equitable access across diverse socioeconomic groups. 82 Moreover, the potential burden of frequent testing in asymptomatic individuals requires careful assessment to avoid unnecessary healthcare expenditure or inefficient resource allocation.83,84 Addressing these barriers will require coordinated efforts among researchers, clinicians, ethicists, and policymakers to ensure that cfDNA-based diagnostics are not only scientifically sound but also financially and logistically feasible for real-world implementation.
Conclusion
This systematic review and meta-analysis demonstrates that machine learning–based cfDNA assays achieve high overall diagnostic accuracy for multi-cancer early detection, with consistently high specificity and moderate-to-high sensitivity across independent validation cohorts. Although diagnostic performance varies with geographic region, sample size, and biomarker type, the available evidence supports the scientific credibility and translational promise of cfDNA–machine learning integration for noninvasive cancer detection. To ensure generalizability and responsible clinical translation, future studies should prioritize large-scale, prospective, multicenter validation using harmonized clinical and analytical protocols.
Supplemental Material
Supplemental material, sj-docx-1-tct-10.1177_15330338261425328 for Value of Machine Learning Models for Cell-Free DNA-Based Multi-Cancer Early Detection: A Systematic Review and Meta-Analysis by Qiong Li, MS, Hongde Liu, PhD, and Jinke Wang, PhD in Technology in Cancer Research & Treatment
Acknowledgements
The authors thank Mr Juncheng Yang (Lanzhou University) for his helpful discussions and methodological suggestions during the early planning stage of this systematic review and meta-analysis.
Author Contributions
Jinke Wang and Hongde Liu contributed to the conceptualization and design of the study. Jinke Wang and Qiong Li were responsible for data acquisition, database management, and statistical analysis. Qiong Li drafted the initial version of the manuscript. Jinke Wang contributed to writing and critically revising specific sections of the manuscript. All authors reviewed the manuscript for important intellectual content and approved the final version.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study was supported by the National Natural Science Foundation of China (NSFC, Grant No. 62371126).
Declaration of Conflicting Interests
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Data Availability Statement
Template data collection forms were used to extract data from included studies. All datasets used for analyses are publicly available from previously published sources.
PROSPERO Registration
The review protocol was prospectively registered in the International Prospective Register of Systematic Reviews (PROSPERO) under the registration number CRD42025645908.
References
