Abstract
Objective
To explore how the definition of the target condition and post hoc exclusion of participants can limit the usefulness of diagnostic accuracy studies.
Methods
We used data from a systematic review, conducted for a NICE diagnostic assessment of risk scores to inform secondary care decisions about specialist referral for women with suspected ovarian cancer, to explore how the definition of the target condition and post hoc exclusion of participants can limit the usefulness of diagnostic accuracy studies to inform clinical practice.
Results
Fourteen of the studies evaluated the ROMA score; nine used Abbott ARCHITECT tumour marker assays and five used Roche Elecsys. The summary sensitivity estimate (Abbott ARCHITECT) was highest, 95.1% (95% CI: 92.4 to 97.1%), where analyses excluded participants with borderline tumours or malignancies other than epithelial ovarian cancer, and lowest, 75.0% (95% CI: 60.4 to 86.4%), where all participants were included. Results were similar for Roche Elecsys tumour marker assays. Although the number of patients involved was small, data from studies that reported diagnostic accuracy both for the whole study population and with post hoc exclusion of those with borderline or non-epithelial malignancies suggested that patients with borderline tumours or malignancies other than epithelial ovarian cancer account for between 50 and 85% of false-negative ROMA scores.
Conclusions
Our results illustrate the potential consequences of inappropriate population selection in diagnostic studies; women with non-epithelial ovarian cancers or non-ovarian primaries, and those with borderline tumours, may be disproportionately represented among those with false negative, ‘low risk’ ROMA scores. These observations highlight the importance of giving careful consideration to how the target condition has been defined when assessing whether the diagnostic accuracy estimates reported in clinical studies will translate into clinical utility in real-world settings.
Introduction
The failure to design diagnostic test studies that use a spectrum of participants representative of the population for the intended use in clinical practice has been extensively discussed in the diagnostic literature for many years.1–4 Using healthy volunteers as controls and/or including more individuals with advanced disease and fewer individuals with early stage disease are generally held to be associated with overestimation of test accuracy;1,3 however, the appropriateness, or otherwise, of the spectrum of participants in a study can be less obvious. A recent review article on the sources of bias in diagnostic accuracy studies describes how the patient spectrum is altered at each referral, illustrating the importance of selecting study participants who correspond with the stage in the clinical pathway (referral stage) at which the test being evaluated would be used in practice. Reporting guidelines for test accuracy studies also emphasize the importance of defining the intended use and clinical role of the test being evaluated. 5 Despite this, a recent study of 112 systematic reviews of diagnostic accuracy studies found that 46% failed to provide a clear description of the intended role of the test in the clinical pathway and that overemphasis on positive conclusions (not supported by the data) was present in 72% of abstracts and 69% of full texts. 6 Interpretation of the results of test accuracy studies for clinical practice remains inadequate; we present an example of current relevance to clinical biochemistry.
Early diagnosis of ovarian cancer is particularly problematic, with a high proportion of women (58%) still diagnosed at an advanced stage (stage III or IV) and 21% having metastases at diagnosis. 10 Ovarian cancer survival is strongly related to stage at diagnosis; 2012 data showed that the one-year and five-year survival rates for women diagnosed at stage I were 97% and 90% versus 53% and 4% for women diagnosed at stage IV. 9 Improving early diagnosis is therefore a priority and, when evaluating testing strategies, it is essential to consider possible variation in performance for the detection of different stages of ovarian cancer. Similarly, while the majority of studies about ovarian cancer diagnosis concern epithelial carcinomas, there is some evidence to indicate that the diagnostic performance of tumour markers and risk scores may vary between tumours of different tissue types; 11 possible effects of tumour tissue type on estimates of test performance should, therefore, also be considered.
Current guidance, NICE clinical guideline (CG122) Ovarian cancer: recognition and initial management, 12 recommends the calculation of a risk of malignancy index I (RMI I) score (based on serum CA125, ultrasound and menopausal status) followed by referral to a specialist gynaecological oncology multidisciplinary team (SMDT) for people with an RMI I score ≥250. We have recently completed a systematic review to assess the clinical effectiveness of using alternative risk scores (ROMA, IOTA simple ultrasound rules, the IOTA ADNEX model, Overa (MIA2G) and RMI I at thresholds other than 250) to guide referral decisions for women with an adnexal mass and suspected ovarian cancer in secondary care. The review was undertaken as part of a diagnostic appraisal to inform the development of new NICE diagnostics guidance (DG31). 13
The ROMA score uses serum HE4 and serum CA125 concentrations, along with menopausal status, to generate an individualized estimate of the risk that a person has ovarian cancer. 14 The objective of this article was to explore how the definition of the target condition and the post hoc exclusion of participants can limit the usefulness of diagnostic accuracy studies. We used studies of the ROMA score, taken from our systematic review, to provide an example of current relevance to Clinical Biochemistry practitioners and researchers. The full results of our systematic review are published elsewhere. 15
Methods
Systematic review methods followed the principles outlined in the Centre for Reviews and Dissemination guidance for undertaking reviews in healthcare, 16 and the NICE Diagnostic Assessment Programme manual. 17 This article focuses on studies assessing the accuracy of the ROMA score and explores how the spectrum of study participants may affect the usefulness of results for clinical practice.
Data sources
We searched 22 resources up to November 2016, including MEDLINE, EMBASE, clinical trials registers and conference proceedings (Radiological Society of North America, American Society of Clinical Oncology Annual Conference, Society of Gynecologic Oncology, The National Cancer Research Institute, European Society of Radiology). Furthermore, we contacted experts in the field, with the aim of identifying any unpublished studies. Search strategies were based on the specified risk scores and the target condition (ovarian cancer), and did not include any study design terms or filters 18 ; example search strategies are provided online (web appendix 1). No restrictions on language or publication status were applied to any searches.
Inclusion criteria
Diagnostic cohort studies, which assessed the accuracy of risk scores for identifying those women with suspected ovarian cancer who require referral from secondary care to specialist oncology services, were eligible for inclusion. Studies were required to use histological confirmation as the reference standard.
We included secondary care studies in women of any age with suspected ovarian cancer, who had not previously been treated for ovarian cancer and were not currently receiving chemotherapy; studies were included if the setting was unclear, but the population was described as people with suspected ovarian cancer. For studies of the ROMA score, only studies using tumour marker (CA125 and HE4) assays commercially available in the UK (Abbott ARCHITECT (Abbott Diagnostics, Abbott Park, Illinois, USA), Roche Elecsys (Roche Diagnostics, Rotkreuz, Switzerland) and Fujirebio Lumipulse G (Fujirebio Diagnostics, Göteborg, Sweden)) were included. Only studies of the ROMA score are included in this article.
Included studies were required to report sufficient data to determine the numbers of true positive, false positive, false negative and true negative test results; the primary outcomes were sensitivity and specificity and the data needed to calculate these parameters.
Studies were screened for relevance independently by two reviewers and full text articles of studies considered potentially relevant were assessed for inclusion by one reviewer and checked by a second. Disagreements, at either stage of study selection, were resolved through discussion and consensus, or by consultation with a third reviewer.
Data extraction
One reviewer extracted data using a prepiloted data extraction form and extractions were checked by a second reviewer; any disagreements were resolved through discussion and consensus, or by consultation with a third reviewer.
Quality assessment
The methodological quality of included test accuracy studies was assessed using QUADAS-2, 19 which uses four domains to assess risk of bias and three domains to assess the applicability of the study to the review question. Quality assessment was undertaken by one reviewer and checked by a second reviewer and any disagreements were resolved by consensus or discussion with a third reviewer.
Analysis
Sensitivity and specificity were calculated for each set of 2 × 2 data. All meta-analyses estimated separate pooled estimates of sensitivity and specificity, using random-effects logistic regression. 20 The bivariate/hierarchical summary receiver operating characteristic model21–23 could not be applied because the data-sets for each tumour marker assay manufacturer and target condition were too small and/or homogeneous. Heterogeneity was assessed visually using summary receiver operating characteristic plots or receiver operating characteristic (ROC) space plots. Analyses were performed in MetaDisc. 24
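The per-study calculation described above can be sketched as follows. This is a minimal illustration, not the procedure implemented in MetaDisc: the Wilson score interval is an assumption chosen because it needs only the standard library, the function names are hypothetical, and the 2 × 2 counts are invented for the example rather than taken from any included study.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion.

    Assumption: used here for illustration only; the interval method
    applied in the review itself is not stated in this section.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def accuracy_from_2x2(tp, fp, fn, tn):
    """Sensitivity and specificity, with intervals, from one 2 x 2 table."""
    return {
        "sensitivity": tp / (tp + fn),          # TP / all disease positive
        "sensitivity_ci": wilson_interval(tp, tp + fn),
        "specificity": tn / (tn + fp),          # TN / all disease negative
        "specificity_ci": wilson_interval(tn, tn + fp),
    }


# Hypothetical counts, not data from the review.
result = accuracy_from_2x2(tp=45, fp=20, fn=5, tn=80)
for name, value in result.items():
    print(name, value)
```

Which participants enter the `tp + fn` denominator is exactly the choice examined in this article: moving a group of patients out of the disease-positive column changes the sensitivity estimate even though the test results themselves are unchanged.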
Results
Overview of included studies
The searches identified 2456 references; 38 studies, reported in 48 publications, were included in the full systematic review. Figure 1 shows the flow of studies through the review process. Fourteen studies reported data on the accuracy of the ROMA score and were included in this article,25–38 of which nine used Abbott ARCHITECT tumour marker assays25–33 and five used Roche Elecsys tumour marker assays.34–38 None of the included studies used the Fujirebio Lumipulse G system.

Figure 1. Flow of studies through the review process.
All studies included women with an adnexal/ovarian mass. Four of the 14 studies reported analyses which excluded some participants based on their final histopathological diagnosis25,28,31,35 and two studies included final histopathological diagnosis in their participant selection criteria.33,36 Because final histopathological diagnosis is information which could not be known at the point in the clinical pathway where the ROMA score would be used, we consider that both of these approaches are, in effect, post hoc exclusions; no study provided a justification for these exclusions. All four of the studies that excluded participants from their analyses based on final histopathological diagnosis appeared to use the terms ‘ovarian cancer’ and ‘epithelial ovarian cancer’ interchangeably; study objectives were framed in terms of differentiating between benign and malignant ovarian masses, whereas accuracy results were reported for epithelial ovarian cancer. Two of these studies also excluded patients with borderline tumours from their analyses.31,35 In contrast, both of the studies that used histopathological diagnosis as a participant selection criterion clearly reported an objective of evaluating the ROMA score for the detection of epithelial ovarian cancer. A further four studies did not provide a clear definition of the target condition.26,27,29,30 Details of studies evaluating the ROMA score, their associated references and main target condition are provided in Table 1.
Table 1. Details of studies evaluating the ROMA score.
Table 2. QUADAS-2 results for the ROMA score.
Methodological quality of studies assessing the ROMA score
None of the studies were rated as ‘low’ risk of bias for all domains of the QUADAS-2 tool and only two were rated as having ‘low’ concerns regarding all applicability domains.29,32 A summary of the QUADAS-2 assessments for each study is provided in Table 2.
The main potential sources of bias concerned flow and timing. Six (43%) studies were rated as ‘high’ risk of bias on the flow and timing domain, because some participants were excluded from the analyses after their histopathological diagnoses had been established.25,28,31,33,35,36 This approach has been taken, by some researchers, in order to allow calculation of risk score performance data for specific target conditions, which are subsets of ovarian cancer (e.g. epithelial ovarian cancer, excluding borderline tumours), but was classified as inappropriate exclusion because final histopathological diagnosis is information which could not be known at the point in the clinical pathway where the ROMA score would be used.
These six studies were also classified as having ‘high’ concern regarding applicability, with respect to how the target condition was defined by the reference standard. This is because, in ‘real world’ clinical practice, it is likely that the appropriate target condition, for women presenting with adnexal mass who are being considered for specialist referral, would be considered to be ‘all malignant tumours’. Web appendix 2 lists final histological diagnoses (where reported) of study participants.
Accuracy of the ROMA score using Abbott ARCHITECT tumour marker assays
Nine studies used Abbott ARCHITECT tumour marker assays.25–33 Only one study included all participants in the analysis, regardless of their final histopathological diagnosis (target condition: all malignant tumours including borderline). 32 Two studies excluded women with histopathological diagnoses other than epithelial ovarian cancer, but included women with borderline tumours.28,33 Two further studies excluded participants with non-epithelial ovarian cancer, participants with non-ovarian cancers and participants with borderline tumours;25,31 the distribution of positive and negative ROMA score results, in these patients, was not reported. The remaining four studies did not report a clear definition of the target condition; the results for these studies are not reported.26,27,29,30
The sensitivity estimate for the ROMA score was highest, 95.1% (95% CI: 92.4 to 97.1%), where analyses excluded participants with borderline tumours and those with malignancies other than epithelial ovarian cancer and lowest, 75.0% (95% CI: 60.4 to 86.4%), where all participants were included in the analysis (see Table 3). Conversely, the specificity estimate for the ROMA score was highest, 87.9% (95% CI: 81.9 to 92.4%), in the study which included all participants 32 and lowest, 62.5% (95% CI: 59.7 to 65.3%), where analyses excluded participants with borderline tumours and those with malignancies other than epithelial ovarian cancer (see Table 3).
Table 3. Accuracy of the ROMA score by tumour marker manufacturer and target condition.
TP: true positive; FP: false positive; FN: false negative; TN: true negative.
One study reported test performance estimates calculated both with and without the inclusion of participants with borderline tumours. 33 Although the number of participants involved was small, these data indicated that half of all false-negative risk scores, 3/6 (50%), were accounted for by patients with borderline tumours. 33 Approximately 13% (17/128) of the participants in this study had borderline tumours, while 39% (50/128) had malignant tumours; a higher proportion of patients with borderline tumours had a negative ROMA score, 3/17 (17.6%), than was the case for patients with malignant tumours, 3/50 (6.0%). 33
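The arithmetic behind these proportions can be reproduced directly from the counts quoted above. The sketch below assumes that borderline tumours are classified as disease positive in the inclusive analysis; the two sensitivities it derives are simple consequences of the quoted counts and are not presented as figures reported by the study, and the variable and function names are illustrative.

```python
# Counts quoted in the text for this single study.
BORDERLINE_TOTAL, BORDERLINE_FN = 17, 3   # borderline tumours, negative ROMA
MALIGNANT_TOTAL, MALIGNANT_FN = 50, 3     # malignant tumours, negative ROMA


def sensitivity(disease_positive, false_negatives):
    """Proportion of disease-positive participants with a positive score."""
    return (disease_positive - false_negatives) / disease_positive


# Assumption: borderline tumours count as disease positive when included.
sens_all = sensitivity(BORDERLINE_TOTAL + MALIGNANT_TOTAL,
                       BORDERLINE_FN + MALIGNANT_FN)

# Post hoc exclusion of borderline tumours removes them from the denominator.
sens_excluding_borderline = sensitivity(MALIGNANT_TOTAL, MALIGNANT_FN)

# Share of all false negatives contributed by borderline tumours.
borderline_share_of_fn = BORDERLINE_FN / (BORDERLINE_FN + MALIGNANT_FN)

print(f"Sensitivity, all participants: {sens_all:.1%}")
print(f"Sensitivity, borderline excluded: {sens_excluding_borderline:.1%}")
print(f"Borderline share of false negatives: {borderline_share_of_fn:.0%}")
```

Under that assumption, removing 17 participants who contribute half of the false negatives raises the apparent sensitivity without any change in how the score behaves.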
Two studies, using different thresholds, assessed the variation in the performance of the ROMA score with different stages of epithelial ovarian cancer (see Table 3).25,28 In both studies, the sensitivity estimate was highest, 92.1% (95% CI: 78.6 to 98.3%) and 100% (95% CI: 89.7 to 100%), where the target condition was stage III/IV epithelial ovarian cancer and patients with stage I/II and borderline disease were excluded from the analysis, and decreased, to 82.6% (95% CI: 61.2 to 95.0%) and 75.0% (95% CI: 42.8 to 95.4%), where the target condition was stage I/II epithelial ovarian cancer and patients with stage III/IV and borderline disease were excluded from the analysis.25,28 When the target condition was borderline epithelial tumours and all patients with more advanced stage disease were excluded from the analysis, the sensitivity estimate was markedly lower, 56.3% (95% CI: 29.9 to 80.2%). 25
Accuracy of the ROMA score using Roche tumour marker assays
Five studies used Roche Elecsys tumour marker assays.34–38 Two studies included all patients, regardless of final histopathological diagnosis (target condition all malignant tumours). The summary estimates of sensitivity and specificity derived from these studies were 79.1% (95% CI: 74.2 to 83.5%) and 79.1% (95% CI: 76.3 to 81.6%), respectively.34,38 One of these studies also reported test accuracy when study participants with borderline tumours were excluded from the analysis. 34 The exclusion of participants with borderline tumours resulted in increased sensitivity, 95.5% (95% CI: 84.5 to 99.4%), and unchanged specificity, 79.3% (95% CI: 73.4 to 85.2%). 34 Data from this study indicated that patients with borderline tumours and those with non-ovarian primaries accounted for a high proportion, 12/14 (86%), of the false-negative risk scores observed. 34
A further study included all participants, but classified those found to have borderline ovarian tumours as disease negative. 37 The sensitivity estimate from this study appeared slightly higher than that from the studies where borderline tumours were classified as positive, 83.8% (95% CI: 73.4 to 91.3%), and the specificity estimate appeared slightly lower, 68.8% (95% CI: 61.6 to 75.4%), but neither difference was statistically significant (see Table 3). 37 The same study also reported test performance data, where eight (3%) patients with non-epithelial ovarian cancer and non-ovarian primaries were excluded from the analysis; this exclusion did not significantly change the results (see Table 3). Although the numbers involved were small, it should be noted that patients with malignancies other than epithelial ovarian cancer accounted for four (50%) of the false-negative results. 37 This study also assessed the variation in the performance of the ROMA score with different stages of epithelial ovarian cancer (see Table 3). 37 The sensitivity estimate was highest for both the ROMA score, 97.2% (95% CI: 95.5 to 99.9%), and the RMI I, 88.9% (95% CI: 73.9 to 96.9%), where the target condition was stage II to IV epithelial ovarian cancer and patients with stage I disease were excluded from the analysis. 37 A second study observed a similar pattern when comparing stage III/IV epithelial ovarian cancer (patients with stage I/II disease and borderline tumours excluded) with stage I/II epithelial ovarian cancer (patients with stage III/IV disease and borderline tumours excluded) 36 (see Table 3).
Discussion
All studies described in this article were diagnostic cohort studies, taken from our systematic review of ovarian cancer risk scores, which reported data on the diagnostic accuracy of the ROMA score using either Roche Elecsys or Abbott ARCHITECT tumour marker assays. Using either manufacturer’s tumour marker assays, sensitivity estimates for the ROMA were highest, where analyses excluded participants with borderline tumours and those with malignancies other than epithelial ovarian cancer and lowest, where all participants were included in the analysis, regardless of their final histopathological diagnosis. The analysis which included all participants, regardless of their final histopathological diagnosis, is more likely to reflect the performance of the score in a clinical setting since the population in which the risk score is to be used will be defined by presenting characteristics and is likely to include women with a variety of histopathological ovarian tumour types as well as some whose primary cancer is subsequently found to be non-ovarian.
Our results also indicate that the ROMA score is better at identifying women with advanced-stage ovarian tumours (stage III/IV) than early-stage tumours (stage I/II) or borderline tumours, and is better at identifying epithelial ovarian tumours than other histopathological tissue types. This is a potential limitation in the clinical setting, where there is a heterogeneous mix of tumour tissue types and stages. It is also an indication of a fundamental limitation of the diagnostic test accuracy concept as applied to cancer, which assumes a single tissue type and tumour stage that can be established by histopathological examination of an excised sample. The understanding of how cancers evolve is currently undergoing a period of intense research, which shows that cancers most likely do not evolve on a linear pathway from low grade to high grade but are a heterogeneous mix of many subclones which may evolve separately. 43 Depending on how the tumour is sampled for histology, intratumour heterogeneity impacts upon our ability to find suitable cancer biomarkers for clinical use. Diagnostic studies of the future will have to give careful consideration to the evolution of cancer and recognize that a person diagnosed with cancer may in fact have several tumours, each with a different genetic origin and therefore each with different diagnostic or prognostic consequences. Re-evaluation of the use of single tissue samples per patient, and of the assumption of tumour homogeneity, is required to update diagnostic research in the ongoing evolution of personalized medicine.
Previous systematic reviews of the ROMA score have focused on predicting ovarian cancer (no definition reported) or epithelial ovarian cancer, have combined data from studies using different manufacturers’ tumour marker assays and thresholds, and have not clearly described how study participants with borderline tumours and those with non-ovarian primaries were classified.11,44,45 The resultant summary estimates of test performance have tended to be higher than those described in this article (sensitivity 85% to 87%, specificity 82% to 86%), and the authors’ conclusions about the potential clinical utility of the ROMA score may be over-optimistic.
The definition of the target condition is a crucial consideration when assessing whether the diagnostic accuracy estimates reported in clinical studies will translate into clinical utility in real-world settings. In the current example, to define the target condition as ‘epithelial ovarian cancer’ implies that how women with other malignancies are classified by the ROMA score is not relevant. Clearly, such women form part of the spectrum of those presenting with an adnexal mass (those in whom the ROMA score is intended to be used). Furthermore, post hoc exclusion of study participants based on their final histopathological diagnosis requires information that could not be known at the point of presentation. Studies should therefore include all participants in their analyses. Consideration of the data from those studies that reported accuracy estimates for both the whole study population (target condition all malignant tumours including borderline) and for selected populations (participants found to have borderline tumours and/or those with non-epithelial ovarian cancers or non-ovarian primaries excluded) indicates that patients with borderline tumours and those with non-epithelial ovarian cancers or non-ovarian primaries may be disproportionately represented among those with false-negative ROMA scores; it should be emphasized that these observations are derived from very small numbers of patients and should be viewed as hypothesis generating. The downstream consequences (treatment and prognosis) of a false negative, low risk, classification are likely to differ between patients with different histological cancer types and between those with borderline tumours and those with higher stage malignancies, although all histological cancer types will require referral to an SMDT. 
A more complete exploration of the types of patients who are likely to be misclassified as low risk, as well as an investigation of the downstream clinical consequences for these patients is needed. The potential to detect non-epithelial ovarian cancers by combining other tests (e.g. alpha fetoprotein (AFP) and beta human chorionic gonadotropin (beta-hCG), as recommended in CG122, 12 for women under 40 with suspected ovarian cancer) with the ROMA score is unclear and may warrant exploration in future studies.
There remains a further question, regarding the real-world clinical applicability of studies evaluating the ROMA score. All participants in the identified studies underwent surgery (i.e. histological confirmation of disease status was available). In practice, risk scores may be used, in secondary care, to triage patients to surgery or surveillance/conservative management, as well as to guide decisions about where surgery should be undertaken (referral to a specialist gynaecological oncology unit). This potential mismatch between the study populations and real-world clinical practice is reflected in the relatively high estimate for the prevalence of malignancy (25.1%) derived from the ROMA studies included in our systematic review. It should be noted that a lower prevalence of malignancy may also affect risk score performance in practice. It could be argued that a more realistic estimate of the performance of the ROMA score would be obtained by including both patients undergoing surgery and those who are managed conservatively, and applying a mixed reference standard of histological confirmation or follow-up for a specified minimum period.
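One way to see why prevalence matters is to look at predictive values, which depend on prevalence even when sensitivity and specificity are held fixed. The sketch below is illustrative only: it takes the pooled Roche Elecsys estimates (79.1% and 79.1%) and the 25.1% prevalence from this article, compares them with a hypothetical lower prevalence of 10%, and assumes that sensitivity and specificity do not themselves shift with the patient spectrum, which may not hold in practice.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    false_pos = (1 - specificity) * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Pooled estimates quoted in this article (Roche Elecsys, all malignant tumours).
SENSITIVITY, SPECIFICITY = 0.791, 0.791

STUDY_PREVALENCE = 0.251        # prevalence across the included ROMA studies
HYPOTHETICAL_PREVALENCE = 0.10  # assumption: an illustrative lower value

ppv_study, npv_study = predictive_values(SENSITIVITY, SPECIFICITY, STUDY_PREVALENCE)
ppv_low, npv_low = predictive_values(SENSITIVITY, SPECIFICITY, HYPOTHETICAL_PREVALENCE)

print(f"Prevalence {STUDY_PREVALENCE:.1%}: PPV {ppv_study:.1%}, NPV {npv_study:.1%}")
print(f"Prevalence {HYPOTHETICAL_PREVALENCE:.1%}: PPV {ppv_low:.1%}, NPV {npv_low:.1%}")
```

At the lower prevalence a smaller share of high-risk scores correspond to malignancy, so the practical meaning of a positive result differs from what surgical cohorts suggest.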
Conclusions
Despite the optimistic conclusions presented by some research studies, our results indicate that the ROMA score is unlikely to provide adequate sensitivity to be of use in guiding decisions about referral from secondary care to an SMDT. There are limited data to indicate that patients with borderline tumours and those with non-epithelial ovarian cancers or non-ovarian primaries may be disproportionately represented among those with false-negative ROMA scores. Future studies should include populations that reflect the referral point at which the ROMA score is intended to be used. These observations highlight the importance of giving careful consideration to how the target condition has been defined, and whether particular groups of patients have been inappropriately excluded from the analyses, when assessing whether the diagnostic accuracy estimates reported in clinical studies will translate into clinical utility in real-world settings.
Supplemental Material
Supplemental material for Clinically inappropriate post hoc exclusion of study participants from test accuracy calculations: the ROMA score, an example from a recent NICE diagnostic assessment by Shona Lang, Nigel Armstrong, Sohan Deshpande, Bram Ramaekers, Sabine Grimm, Shelley de Kock, Jos Kleijnen and Marie Westwood in Annals of Clinical Biochemistry
Footnotes
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This report presents independent research funded by the National Institute for Health Research (NIHR). The views and opinions expressed by authors in this publication are those of the authors and do not necessarily reflect those of the NHS, the NIHR, NETSCC, the HTA programme or the Department of Health.
Ethical approval
Not applicable.
Guarantor
MW.
Contributorship
MW and SL planned and drafted this article. All authors contributed to planning and interpretation of the systematic review on which this article is based and all authors provided input to the article. SdK devised and performed the literature searches and provided information support to the project. All parties were involved in drafting and/or commenting on the report.
Supplementary material
Additional supplementary information may be found with the online version of this article.
