Abstract
Objective
To examine the breast cancer detection rate by single reading of an experienced radiologist supported by an artificial intelligence (AI) system, and compare with two-dimensional full-field digital mammography (2D-FFDM) double reading.
Materials and methods
Images (3D-tomosynthesis) of 161 biopsy-proven cancers were re-read by the AI algorithm and compared to the results of first human reader, second human reader and consensus following double reading in screening. Detection was assessed in subgroups by tumour type, breast density and grade, and at two operating points, referred to as a lower and a higher sensitivity threshold.
Results
The AI algorithm gave similar results to double-reading 2D-FFDM, and its detection rate was significantly higher than that of single-reading 2D-FFDM. At the lower sensitivity threshold, the algorithm was significantly more sensitive than reader A (97.5% vs. 89.4%, p = 0.02), non-significantly more sensitive than reader B (97.5% vs. 94.4%, p = 0.2) and non-significantly less sensitive than the consensus from double reading (97.5% vs. 99.4%, p = 0.2). At the higher sensitivity threshold, the algorithm was significantly more sensitive than reader A (99.4% vs. 89.4%, p < 0.001) and reader B (99.4% vs. 94.4%, p = 0.02) and identical to the consensus sensitivity (99.4% in both cases, p = 1.0). There were no significant differences in the detection capability of the AI system by tumour type, grading and density.
Conclusion
In this proof of principle study, we show that sensitivity using single reading with a suitable AI algorithm is non-inferior to that of standard of care using 2D mammography with double reading, when tomosynthesis is the primary screening examination.
Keywords
Introduction
Breast cancer is the most common cancer in women and the leading cause of death among women in high-income countries. 1 Organised screening programmes for the early detection of breast cancer have been implemented in many countries. 2 At present, most of these are based on screening by two-dimensional full-field digital mammography (2D-FFDM), but the use of tomosynthesis (3D-mammography plus 2D-synthesised images) is often suggested due to its higher detection rate and a potentially lower recall rate. 3
Many studies have shown that the use of tomosynthesis in population-based screening settings 4,5 improves the detection rate for breast cancer, although it is worth noting that this finding is not universal. 6 In the German biennial screening programme, tomosynthesis cannot be used as the first-line screening, but since 2015 tomosynthesis plus synthetically reconstructed 2D images have been widely used in screening assessment as recommended in the European guidelines on breast cancer screening. 7 These guidelines also issued a neutral decision on breast cancer screening with digital breast tomosynthesis (DBT) compared to 2D-FFDM, suggesting that either DBT or 2D-FFDM could be the modality of choice. 8
With the increasing lack of radiologists in the European Union and the increased workload, if DBT screening became the norm, new strategies for reading these screening examinations might become necessary. At the moment, double reading is recommended in the European Guidelines. 9 Any changes would be required to maintain the high quality standard in terms of detection capability that is presently also based on double reading the 2D-FFDM. If single-reading tomosynthesis with a validated and suitable artificial intelligence (AI) algorithm equals or outperforms the present standard of double-reading 2D-FFDM, human resources could be saved and costs reduced.
Materials and methods
In our screening unit we retrospectively identified 161 biopsy-proven cancers of women aged 50–69, which had been assessed during 2015 to 2017, by the additional use of tomosynthesis plus synthetically reconstructed 2D images. The tomosynthesis devices we used were two Senographe Essentials with Senoclaire upgrade (GE Healthcare). Patients consented to the use of their data for scientific evaluation during the screening examinations, and ethics approval for this study was granted by the ethics committee of the chamber of Westfalia at the University of Münster under 2020–026-f-S.
The original screening examination was 2D-FFDM in two views. In the German screening programme, if one or both readers of a mammogram find a suspicious lesion, the case is discussed in a conference where the final decision for recall is made by consensus of all readers of the screening unit. The recalled women are assessed by the radiologist responsible for that unit, using additional mammographic views, ultrasound, clinical examination and, if necessary, biopsy. Since 2015, tomosynthesis plus 2D-synthesised images have been used in assessment in addition to or instead of additional mammographic views. In our unit, we have mainly used two-view tomosynthesis in the assessment of soft tissue findings such as architectural distortions, masses and asymmetric densities. Calcifications have mainly been assessed by additional magnification views. Exclusion criteria were tumours not assessed by tomosynthesis in two views or incomplete documentation.
For this study, after identification of biopsy-proven cancers from 2015 to 2017 (n = 161) that were assessed by additional two-view tomosynthesis plus synthetically reconstructed 2D images, the 3D images were re-read by an AI algorithm (ProfoundAI 3D by iCAD Inc.). The AI algorithm was considered to have identified the cancer if it marked the relevant image as positive and correctly located the cancer on the image in the judgement of a senior radiologist. We calculated percentages of detection by tumour histopathology (type), grading and breast density. Furthermore, we compared the results of AI to first and second readings, and to consensus from double reading, using the McNemar test. Results were compared with AI at two operating points: a medium sensitivity point (referred to below as lower sensitivity threshold) and a high sensitivity point (higher sensitivity threshold).
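The paired comparison described above can be illustrated with an exact McNemar test. The following is a minimal sketch under stated assumptions, not the authors' implementation: the source does not say whether the exact or the chi-squared form of the test was used, and the function name `mcnemar_exact` and the arguments `b` and `c` are hypothetical. Only the discordant pairs (cancers detected by one method but missed by the other) enter the calculation.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value (hypothetical helper).

    b: cancers detected by method 1 only (e.g. AI but not the reader)
    c: cancers detected by method 2 only (e.g. the reader but not AI)
    Under the null hypothesis, each discordant pair falls on either
    side with probability 0.5, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        # No discordant pairs: the two methods agree on every case.
        return 1.0
    k = min(b, c)
    # Probability of a split at least as uneven as the observed one,
    # in the smaller tail, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Equal discordant counts mean identical sensitivities; the p-value is 1.0.
print(mcnemar_exact(1, 1))

# Purely illustrative counts, NOT taken from the study.
print(mcnemar_exact(10, 0))
```

When the two methods have identical sensitivity on the same cases, the discordant counts are equal and the test returns p = 1.0, which is consistent with the consensus comparison reported at the higher sensitivity threshold.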
A total of five readers contributed to this study. All readers had at least 10 years of reading experience within the screening programme, each reading at least 5000 examinations annually. For density reporting we used the previous American College of Radiology Breast Imaging Reporting & Data System categories 1–4 10 rather than the more recent a–d, 11 because in the German screening software these were still used during the period we evaluated.
Results
Among the 161 assessed cancers we found: 129 no special type (NST) (80.12%), 17 lobular cancers (10.56%), 5 tubular cancers (3.11%), 4 ductal carcinoma in situ (DCIS) (2.48%), 3 mucinous cancers (1.86%), 1 basal cancer (probably NST) (0.62%), 1 medullary cancer (0.62%) and 1 malignant cystosarcoma (0.62%).
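As a quick consistency check, the counts above can be summed and converted back to percentages. This is only an illustrative sketch; the dictionary name `tumour_counts` is hypothetical, and the counts are copied from the text.

```python
# Counts by histopathological type, as listed in the text.
tumour_counts = {
    "NST": 129,
    "lobular": 17,
    "tubular": 5,
    "DCIS": 4,
    "mucinous": 3,
    "basal (probably NST)": 1,
    "medullary": 1,
    "malignant cystosarcoma": 1,
}

total = sum(tumour_counts.values())

# Share of each type among all assessed cancers, to two decimal places.
percentages = {
    name: round(100 * count / total, 2)
    for name, count in tumour_counts.items()
}

print(total)
for name, pct in percentages.items():
    print(f"{name}: {pct:.2f}%")
```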
At the lower sensitivity threshold the system detected 157 cancers out of 161 (97.5%, 95% CI: 95.0–100.0) and at the higher, the detection rate was 160 out of 161 (99.4%, 95% CI: 98.2–100.0).
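The confidence intervals above can be approximated with a normal-approximation (Wald) interval for a proportion. The source does not state which interval method was used, so the sketch below is an assumption, and the function name `wald_ci` is hypothetical. Its results are close to the reported intervals but may differ in the last digit, for example through rounding or a different multiplier.

```python
from math import sqrt


def wald_ci(detected: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wald interval for a detection rate (hypothetical helper).

    The interval is p +/- z * sqrt(p * (1 - p) / n), clipped to [0, 1]
    because a proportion cannot fall outside that range.
    """
    p = detected / total
    margin = z * sqrt(p * (1 - p) / total)
    return max(0.0, p - margin), min(1.0, p + margin)


# Detection counts taken from the text: 157/161 and 160/161.
for detected in (157, 160):
    low, high = wald_ci(detected, 161)
    print(f"{detected}/161: {100 * detected / 161:.1f}% "
          f"(approx. 95% CI: {100 * low:.1f}-{100 * high:.1f})")
```

With proportions this close to 100%, the Wald interval reaches or exceeds the upper bound, which is why the sketch clips it at 1.0; an interval such as Wilson's would behave better near the boundary, but there is no indication in the source that it was used.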
Table 1 shows the detection results for the AI algorithm by tumour type, grade and density category. The detection rate of NST was 99.2% at the higher sensitivity threshold and 97.7% at the lower. The detection rate of all other types of histopathology was 100% at the higher sensitivity threshold, and except for lobular carcinoma and mucinous carcinoma was also 100% at the lower. For grade, there was a slightly lower detection rate for grade 3 than for grades 1 and 2 tumours. For density, detection was lower at low levels of density using the lower sensitivity threshold.
Table 1. AI algorithm detection results, by tumour type, grade and density category.
Finally, we performed statistical analysis for comparison of human reading results of 2D images vs. human reading of 3D images supported by the AI algorithm. The results are shown in Table 2 for the lower sensitivity threshold. The algorithm was significantly (p = 0.02) more sensitive than reader A, at 97.5% (95% CI: 95.0–100.0) vs. 89.4% (95% CI: 84.6–94.2), was non-significantly (p = 0.2) more sensitive than reader B, at 97.5% (95% CI: 95.0–100.0) vs. 94.4% (95% CI: 90.8–98.0), and was non-significantly lower (p = 0.2) than the consensus from double reading at 97.5% (95% CI: 95.0–100.0) vs. 99.4% (95% CI: 98.2–100.0).
Table 2. Comparison of AI algorithm result with human reader result, lower sensitivity threshold.
Table 3 shows the corresponding results for the higher sensitivity threshold. The AI algorithm was significantly (p < 0.001) more sensitive than reader A, at 99.4% (95% CI: 98.2–100.0) vs. 89.4% (95% CI: 84.6–94.2), and significantly (p = 0.02) more sensitive than reader B, at 99.4% (95% CI: 98.2–100.0) vs. 94.4% (95% CI: 90.8–98.0). Its sensitivity was identical to that of the consensus from double reading, at 99.4% (95% CI: 98.2–100.0) in both cases (p = 1.0).
Table 3. Comparison of AI algorithm result with human reader result, higher sensitivity threshold.
Discussion
At present, most of the population-based organised screening programmes in Europe are based on 2D-FFDM and a double reading policy. The European Commission Initiative on Breast Cancer (ECIBC) suggested the use of either 2D-FFDM or DBT as the primary exam in screening. 8 One of the drawbacks for the implementation of DBT is the increasing number of images to be read. 12 The additional reading time will be even larger if the double reading policy persists. While there is evidence that AI can reduce reading time in the context of DBT screening, 13 there remains a need for prospective research within real-time screening programmes.
This study demonstrates that DBT read by an AI algorithm is not inferior in terms of cancer detection to double-reading 2D-FFDM as it is done today. The high capacity for detection was seen in all tumour types and grades, and in dense breast tissue; however, the numbers are too small to be confident of the results in specific subgroups. Although our results suggest that the algorithm is effective in dense breasts, it should be noted that we only had eight cases in the highest density category.
There have been positive results for large-scale retrospective reading of mammograms with AI systems. A simulation on a representative large data set from the United States and UK (2D-FFDM) with the AI system used as a second reader demonstrated that, compared to the standard double-reading process used in the UK, the AI system maintained a non-inferior performance and reduced the workload which would have been incurred by the second reader by 88%. Furthermore, a reduction in false positives and false negatives could be demonstrated. 14
Of course, there are some limitations. Our study is of modest size. Although studies on larger populations showed similar distributions of invasive tumour attributes, 15 we had very few DCIS cases in our series. This retrospective study addresses only the sensitivity of the algorithm. We do not report any false-positive rates for the AI algorithm, and these will be essential as part of a further evaluation. We also do not know how many subsequent interval cancers could have been identified by the algorithm. On the other hand, there have been trials of tomosynthesis in screening, which have shown that the rate of false-positive findings will not increase. 4,5 Trials have also shown that detection of breast cancers is significantly higher with tomosynthesis. 4,5 Finally, we cannot state that every available algorithm will give the same performance, as we know that algorithms have to be adapted to the imaging device, and quality measures have to be developed.
To build on the initial positive results of this study, further studies (particularly prospective) are required, with numbers large enough for subgroup analyses. There should also be the development of test sets for algorithms as well as human readers.
Footnotes
Acknowledgements
The histopathology results were provided by Prof. Dr med. Horst Bürger, Institut für Pathologie, Husenerstraße, Paderborn, Germany. Technical support was given by Jonathan Go from iCAD Inc., Nashua, NH, USA. iCAD Inc. granted us the AI algorithm for study purposes.
Declaration of conflicting interests
AG has given several presentations on the use of ProFound AI in diagnostics at international meetings (RSNA, ECR, EUSOBI). Travel expenses were covered and an honorarium was paid.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
