Abstract
Interval cancers are a common problem in organized breast cancer screening programs, and their rate is measured for quality assurance. Artificial intelligence algorithms have been proposed to improve mammography sensitivity, which would be expected to lower the interval cancer rate and improve the quality of the screening system. Interval cancers following negative screening in 2011 and 2012 in one regional unit of the national German breast cancer screening program were classified by a group of radiologists, who compared the screening digital mammograms with the diagnostic images and categorized each case as true interval, minimal signs, false negative or occult cancer. Screening mammograms were processed using a detection algorithm based on deep learning. Of the 29 cancer cases available, artificial intelligence identified eight of the nine classified as minimal signs, all six false negatives and none of the true interval or occult cancers. Sensitivity for lesions judged to be already present in the screening mammogram was 93% (95% confidence interval 68–100) and sensitivity for any interval cancer was 48% (95% confidence interval 29–67). Using an artificial intelligence algorithm as an additional reading tool has the potential to reduce interval cancers. Whether and how this theoretical advantage can be achieved without a negative effect on recall rate is a challenge for future research.
Introduction
Interval cancers are a common problem in organized breast cancer screening programs.1 The interval cancer rate is used as a quality measure for screening programs,2 because a substantial proportion of the cancers detected in the interval between two screening rounds after a negative screen result from lesions that were missed or misinterpreted in the supposedly negative round.3–5
Artificial intelligence (AI) has been proposed as an aid in reading mammograms,6 in particular as a substitute for a second reader; double reading is recommended by most guidelines in order to reduce false negatives.7,8
The aim of this study was to find out whether adding AI into the reading process as a supportive tool could be helpful in decreasing the interval cancer rate in population-based organized screening programs.
Material and methods
The German organized population-based screening program invites women aged 50 to 69 years for full-field digital mammography (FFDM) every two years. All exams are read independently by two readers, each of whom must read at least 5000 mammograms per year. If one or both readers find a suspicious lesion, the decision on recall is reached by consensus of all readers within the screening unit. Assessment is performed only by the physician responsible for the screening unit.
For quality assurance of the screening program, the regional cancer registries are asked to identify cancers occurring after a negative screening test through record linkage with screening archives. The list of the registered interval cancer cases is sent to each screening unit for reassessment of the mammographic findings. Interval cancers are divided by screening radiologists into four categories, according to comparison of the screening mammogram with available diagnostic images: true interval cancers, minimal sign cancers, missed cancers (false negative) and occult cancers.
In this study, all of these screening mammograms that had been read as negative were processed with an algorithm, ProFound 2D (provided by iCAD Inc., Nashua, NH, USA), which uses advanced image processing, feature computation and pattern recognition technology to analyse the images for potential areas of concern. These areas are displayed to the physician as detections overlaid at the corresponding locations on the mammography images within the mammography reading software. The deep learning algorithm was developed and validated at iCAD using 8000 2D FFDM studies obtained from more than 50 sites and multiple devices. The software has CE (European Conformity) approval to assist radiologists in reading two-dimensional (2D) mammograms.
Lesions (areas of concern) are marked on the 2D mammogram and assigned a ‘lesion score’ reflecting the estimated probability that the lesion is a cancer, based on deep-learning analysis of lesions with similar findings in a large reference database. For this study, lesions with a lesion score equal to or higher than 30% were considered positive. We defined the 30% threshold before using the algorithm, in a preliminary investigation of cancer cases in our institute (288 biopsy-proven cancers, not including the cases used for this study), which showed that 96.5% of all cancers had a score higher than 30%. Cases were classified as ‘identified’ if the algorithm marked a lesion in the breast area where the interval cancer was later diagnosed, or ‘not identified’ if the algorithm did not mark any lesion or marked a lesion in a different breast area.
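The case-level rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the vendor's software: the `Lesion` record, the free-text `breast_area` labels and the function name `classify_case` are hypothetical, and only the 30% threshold and the ‘identified’/‘not identified’ rule are taken from the text.

```python
from dataclasses import dataclass

# Lesion score (%) at or above which a mark counts as positive (from the text).
THRESHOLD = 30.0


@dataclass(frozen=True)
class Lesion:
    breast_area: str  # hypothetical location label, e.g. "left upper outer"
    score: float      # lesion score in percent, as reported by the algorithm


def classify_case(lesions, cancer_area, threshold=THRESHOLD):
    """Return 'identified' if any positive lesion lies in the breast area where
    the interval cancer was later diagnosed, otherwise 'not identified'."""
    for lesion in lesions:
        if lesion.score >= threshold and lesion.breast_area == cancer_area:
            return "identified"
    # No mark at all, only sub-threshold marks, or marks in another area.
    return "not identified"


# Example: one positive mark at the cancer site, one elsewhere.
marks = [Lesion("left upper outer", 45.0), Lesion("right lower inner", 80.0)]
print(classify_case(marks, "left upper outer"))
```

Under this rule a positive mark in a different breast area does not count towards sensitivity, which is how the three marks among the true interval cancers are handled in this study.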
Interval cancers detected within 24 months (2011–2014) of a negative screening performed in 2011–2012, after the introduction of FFDM in one regional unit (Paderborn) of the national German screening program, were included. During this period, 37,367 women aged 50–69 years were screened, mostly with FFDM; the recall rate was 3.1% and the detection rate 4.4/1000 (168 cancers). The officially used software (Masc-KV-IT) reported an average reader sensitivity of 85% and a specificity of 94.9%. The cancer registry reported 51 interval cancers in the 24 months after the 2011–2012 mammograms in women who had screened negative. Complete records, including histopathology and diagnostic images of the interval cancers, were obtained from all internal archives and requested from the hospitals or private institutes where the women had been diagnosed. Exclusion criteria were incomplete documentation, no access to the diagnostic images of the interval cancer, and absence of a screening FFDM, either because it was not available or because the woman had been screened with film mammography.
Ethics approval for this study was granted by the medical chamber of Westphalia-Lippe under 2020–025-f-S. The devices used for imaging were one Senographe DS, one Senographe Essential (GE Healthcare) and one Siemens Inspiration. Sensitivity estimates are presented with their 95% confidence intervals (95% CI), computed according to the exact binomial distribution.
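For readers who wish to reproduce such intervals, an exact binomial (Clopper-Pearson) confidence interval can be sketched with the Python standard library alone. The function names below are ours and the bisection solver is one of several possible ways to invert the binomial tail probabilities; the statistical software actually used in the study is not stated.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _bisect_decreasing(f, lo=0.0, hi=1.0, iters=100):
    """Root of a function that decreases from positive to negative on [lo, hi]."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid  # root lies to the right
        else:
            hi = mid  # root lies to the left
    return (lo + hi) / 2


def exact_ci(k, n, alpha=0.05):
    """Clopper-Pearson interval for k successes out of n trials."""
    # Lower bound: p with P(X >= k | p) = alpha/2 (defined as 0 when k == 0).
    lower = 0.0 if k == 0 else _bisect_decreasing(
        lambda p: alpha / 2 - (1 - binom_cdf(k - 1, n, p)))
    # Upper bound: p with P(X <= k | p) = alpha/2 (defined as 1 when k == n).
    upper = 1.0 if k == n else _bisect_decreasing(
        lambda p: binom_cdf(k, n, p) - alpha / 2)
    return lower, upper


# Example: 14 identified cases out of 15.
low, high = exact_ci(14, 15)
print(f"{low:.0%} to {high:.0%}")  # prints "68% to 100%"
```

The exact interval is wider than a normal approximation at these sample sizes, which is why it is the usual choice for small case series.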
Results
Of the 51 interval cancers reported, 22 were excluded. One case was excluded because the woman had undergone only unilateral screening mammography following cancer surgery and the interval cancer was detected on the side that had not been screened (local recurrence); for six interval cancers the prior mammogram was on film; and for the remaining cases it was impossible to retrieve diagnostic images or full clinical records. Therefore, 29 interval cancer cases were available for the study. The histopathology distribution was 24 cancers of no special type, 4 lobular cancers and 1 case of Paget’s disease. Grading was G1, G2 and G3 in 1, 17 and 10 cases, respectively; for one case, the grade was unknown. In none of the 12 cases classified as true interval cancers by the radiologists did the algorithm identify a suspicious lesion at the site where the cancer was later diagnosed, although in three cases it identified a lesion at another site. Similarly, neither of the two occult cancers was identified. Eight of the nine cases classified as minimal signs and all six classified as false negatives were correctly identified (Table 1).
Table 1. Number of interval cancers according to classification by radiologist and AI algorithm.
a For three cases the AI identified a lesion in a breast area other than where the interval cancer occurred.
Considering the subgroup of 15 false negative and minimal sign lesions, sensitivity was 93% (95% CI 68–100). The algorithm could identify 48% (95% CI 29–67) of all interval cancers.
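The two sensitivity figures follow directly from the case counts given above, as the short tally below illustrates. The dictionary layout and the function name are ours; the counts are those reported in the Results.

```python
# (total cases, cases identified by the algorithm), as reported in the Results.
counts = {
    "true interval": (12, 0),
    "minimal signs": (9, 8),
    "false negative": (6, 6),
    "occult": (2, 0),
}


def sensitivity(categories):
    """Return (identified, total, proportion) pooled over the given categories."""
    total = sum(counts[c][0] for c in categories)
    identified = sum(counts[c][1] for c in categories)
    return identified, total, identified / total


# Lesions judged to be already present on the screening mammogram.
visible = sensitivity(["minimal signs", "false negative"])
# All interval cancers included in the study.
overall = sensitivity(counts)

print(f"{visible[0]}/{visible[1]} = {visible[2]:.0%}")  # prints "14/15 = 93%"
print(f"{overall[0]}/{overall[1]} = {overall[2]:.0%}")  # prints "14/29 = 48%"
```

The same 14 identified cases appear in both numerators; only the denominator changes, from the 15 cases with a retrospectively visible lesion to all 29 interval cancers.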
Discussion
Our findings show that interval cancers that had shown minimal signs or were false negatives could have been detected using ProFound 2D as a supporting tool for the reading radiologists. In our screening program, this AI algorithm could theoretically have brought forward the diagnosis of about 48% of the cases later diagnosed as interval cancers. This improvement, however, could be achieved only if all women with a finding above the AI threshold were recalled for assessment and if the assessment were 100% sensitive. The first condition is not reasonable for a screening program, where keeping the recall rate below a threshold is critical to hold the undesirable effects of screening at an acceptable level and to keep the assessment workload sustainable. Using AI as a supportive tool for experienced readers might nonetheless be a reasonable approach to decrease the interval cancer rate while maintaining high specificity. A study on an unselected screening population, comparing current reading procedures with AI-aided reading, could indicate how its introduction would affect the interval cancer rate and the recall rate simultaneously. A study using the same algorithm showed that AI could achieve a specificity similar to that of human readers.9 This is consistent with the relatively small number of lesions identified by the AI algorithm that did not develop into cancers in our sample (three among the true interval cancers).
A limitation of this study is that the impact of AI use on recall rate and specificity could not be assessed. The small absolute number of interval cancers is a further limitation; the study can only indicate the direction the interval cancer rate might take if AI were added to the reading process. Due to data protection regulations, it was impossible to obtain the clinical records of all women identified with an interval cancer. In addition, there is no national cancer registry in Germany and we must rely on regional ones. The cancer registry is only allowed to report the interval cancers of each unit to that specific unit, so the only way to increase the number of cases in such studies would be to add more years of screening. To date (December 2020), the interval cancer cases following screening in 2013 and 2014 have still not been made available to the screening unit. Further studies on unselected screening populations and with different algorithms are needed to assess the accuracy of AI algorithms, and randomized controlled trials are needed to confirm the impact of introducing AI algorithms in an organized quality-assured screening program.
Here, we give proof of principle that using an AI algorithm as an additional reading tool for an experienced radiologist, or even to replace a second reader while maintaining a human consensus read, has the potential to greatly reduce the interval cancer rate by pointing out lesions to the reader for further evaluation. Whether and how this theoretical advantage can be achieved without a negative effect on recall rate is a challenge for future research.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
