Introduction
Breast cancer is the most common cancer type and the second leading cause of cancer-related mortality in women according to the 2020 global cancer statistics.1 Many randomized trials and incidence-based mortality studies have shown that screening with mammography reduces breast cancer mortality.2–5 Therefore, many developed countries have implemented large-scale mammography screening programs in the last 3 decades. However, despite these successful screening programs and improved treatment options, breast cancer remains one of the major causes of cancer-related death in women around the world, and the efficiency of mammography remains controversial.6,7 The main reported disadvantages of mammography are high rates of false positives and false negatives.8 Studies have shown that up to 30% to 40% of cancers can be missed during mammography screening, and that only 10% of women recalled for further diagnostic workup are diagnosed with breast cancer.9,10 Possible explanations include dense breast tissue, improper positioning, and human interpretation error. On the other hand, important consequences of a high recall rate and false positivity in daily practice are increased patient anxiety, excessive follow-up, and invasive diagnostic procedures. These disadvantages have created a need for methods and techniques that increase the sensitivity, specificity, and correct reading rates of mammography evaluation. Other radiological methods, including ultrasound, digital breast tomosynthesis, and magnetic resonance imaging, have been introduced for screening, but mammography is still the frontline and most common modality used around the globe.
Double reading by 2 radiologists independently has been implemented in screening programs to improve cancer detection rates; it has increased cancer detection while decreasing recall rates and improving positive predictive values for cancer detection.11,12 Computer-aided detection (CAD) software was designed in 1998 to improve and assist mammography readings.13 Despite 2 decades of use, the efficacy of CAD is still under discussion in the literature. Early studies showed improved cancer detection.14,15 However, large-scale studies have shown a high false-positive rate, low specificity, and a failure to improve radiologists' performance because of the additional review required and the number of marked areas.16,17
Artificial intelligence (AI) is a rapidly growing branch of computer science that has created great excitement this century with breakthroughs in many applications and its potential to change paradigms in breast imaging.18 CAD relies on human-defined features such as density or shape and outputs a binary negative or positive result; AI algorithms, by contrast, can discover new characteristics that enable the classification of lesions unknown to, or undetectable by, the human eye. Many studies have shown that AI can increase the sensitivity of breast cancer detection and decrease false-positive evaluations, indicating great potential for improving radiologists' contributions to patient care.19–23 Machine learning (ML) and deep learning (DL) models have been used for personal breast cancer risk assessment, predicting pathologic upgrade of high-risk lesions, identifying negative screening mammograms, estimating the presence of an invasive component accompanying ductal carcinoma in situ, early prediction of response after neoadjuvant chemotherapy, and prediction of lymph node metastases in primary breast cancer.24–34
AI is an umbrella term covering many different approaches and training models, including artificial neural networks, ML, and DL. ML is based on a training model that learns to identify the characteristics and associated variables described and observed in the input data.35 DL is a learning model based on neural networks (NN) with multiple deep layers, loosely analogous to the neural tissue of the human brain.36 DL is particularly important for radiology, and especially for breast imaging, because of its ability to learn the characteristics essential to categorizing mammograms as positive or negative and its potential to find new correlations not evident to human interpretation. Last year, Kim et al developed and validated an AI algorithm using large-scale data and showed better diagnostic performance than radiologists in breast cancer detection.37
In this study, we aimed to evaluate the performance of an AI algorithm in a simulated screening setting and its effectiveness in detecting missed and interval cancers.
Materials and Methods
Population
Digital mammograms were collected from the Bahcesehir Mammographic Screening Program (BMSP), the first organized, population-based, 10-year (2009-2019) mammography screening program in Turkey. During the 10-year period, women between the ages of 40 and 69 in the region were invited to screening biennially. The mammograms taken biennially were recorded in the archive system. The study was approved by the institutional review board of the Acibadem M.A.A University School of Medicine, approval number 2020-22/23 (location: Istanbul; date: 15.10.2020). Each eligible woman signed a written informed consent form when she was enrolled in the BMSP. The reporting of this study conforms to the STROBE guidelines.38 Patient and tumor characteristics are listed in Table 1.
Patient and Tumor Characteristics.
Abbreviations: BIRADS, Breast Imaging Reporting and Data System.
Mammograms
During the 10-year screening period, a total of 22 621 screening examinations were performed. All cancers detected in the screening program during this period were included in the study without any exclusion criteria. In total, 211 mammograms were extracted from the archive of the screening program for this retrospective study: 110 were diagnosed as breast cancer (74 screen-detected, 27 interval, 9 missed) and 101 were negative mammograms. The negative mammograms were chosen from women who had no breast-related diagnosis in the 2 years following the initial mammogram and who matched the cancer patients in age and breast density. Power analysis performed with the OpenEpi program showed that the sample size was sufficient for this study. The diagnosed breast cancers were defined as follows: (1) interval cancer, a primary breast cancer found within 2 years of a negative mammographic evaluation; (2) missed cancer, a breast cancer detected within 30 days of a false-negative mammogram, either by another imaging modality or through clinical findings; (3) screen-detected cancer, a cancer detected with a routine screening mammogram. Negative mammograms were used as a control group in the AI evaluation.
Image Analysis
Digital mammography images were obtained with a full-field digital mammographic device (Selenia, Hologic) at the screening center. Two projections, mediolateral oblique (MLO) and craniocaudal (CC), were obtained for each woman. Two breast radiologists, each with more than 5 years of experience, read the mammograms at the screening center independently. In case of inconsistency between the readers, a third radiologist with more than 20 years of experience interpreted the findings for the final decision. Radiological findings were evaluated according to the 4th edition of the Breast Imaging Reporting and Data System (BIRADS) of the American College of Radiology.39 The BMSP had started before the release of the updated 5th edition of BIRADS.
Artificial Intelligence System
We used a recently developed diagnostic support software (Lunit INSIGHT MMG, Seoul, South Korea) available on a free website (https://insight.lunit.io/mmg/login).37 The AI algorithm of this software uses deep convolutional neural networks (CNNs) and highlights areas in the mammograms where the suspicion of malignancy exceeds a certain threshold.40 The system calculates an abnormality score reflecting the likelihood of malignancy of the detected lesion; the recorded score ranges from 1% to 100% likelihood of malignancy. In this study, we did not use the highlighted images but instead used the underlying prediction score of the algorithm. In the case of multiple findings with different values, the highest score is considered final. The images used in this study had never been used to train, validate, or test a previously developed AI algorithm.
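The exam-level scoring rule described above (per-lesion abnormality scores of 1%-100%, with the highest score taken as final) can be sketched as follows. This is an illustrative sketch, not the vendor's implementation; the example scores and the `flag_exam` helper are hypothetical.

```python
def exam_score(region_scores):
    """Exam-level abnormality score: the maximum over all flagged regions.

    Returning 0 when no region is flagged is an illustrative convention,
    not taken from the vendor's documentation.
    """
    return max(region_scores, default=0)

def flag_exam(region_scores, threshold=34.5):
    """Flag a mammogram as suspicious when its exam-level score meets the threshold.

    The 34.5% default mirrors the cut-off this study derives via Youden's index.
    """
    return exam_score(region_scores) >= threshold

# Two hypothetical regions flagged on one mammogram: the higher score wins.
print(exam_score([17.0, 45.0]))  # exam-level score is 45.0
print(flag_exam([17.0, 45.0]))   # True: 45.0 >= 34.5
print(flag_exam([17.0]))         # False: 17.0 < 34.5
```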
Statistical Analysis
The breast cancer detection rates of the radiologists in the screening program were compared with those of the AI system in a simulation scenario. Receiver operating characteristic (ROC) analysis was performed, and a threshold for cancer detection was calculated with Youden's index. All mammograms were relabeled based on this threshold. Three mammography assessment methods were compared in this study: (1) the 2 radiologists' assessment at the screening center, (2) the AI assessment based on the established risk score threshold, and (3) a hypothetical radiologist and AI team-up in which the AI acts as a third reader. R (R Core Team, 2020) and the pROC package (Robin X. et al, 2011) were used for statistical analysis.
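The threshold selection described above can be sketched as follows. This is an illustrative Python re-implementation rather than the study's actual R/pROC code, and the scores and labels below are hypothetical examples:

```python
def youden_threshold(scores, labels):
    """Return the cut-off maximizing Youden's J = sensitivity + specificity - 1.

    scores: per-exam risk scores; labels: 1 = cancer, 0 = negative.
    An exam is called positive when its score >= threshold.
    """
    best_j, best_t = -1.0, None
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < t)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t)
        sens = tp / (tp + fn)
        spec = tn / (tn + fp)
        j = sens + spec - 1
        if j > best_j:
            best_j, best_t = j, t
    return best_t, best_j

# Hypothetical abnormality scores for 4 cancers (label 1) and 4 negatives (label 0).
scores = [90, 70, 40, 20, 35, 15, 10, 5]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
t, j = youden_threshold(scores, labels)
print(t, j)  # 20 0.75
```

In pROC the same operating point would be obtained with `coords(roc_obj, "best", best.method = "youden")`; the sketch above just makes the maximization explicit.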
Results
In total, 211 mammograms were evaluated by the AI: 74 screen-detected cancers (67.3%), 27 interval cancers (24.5%), 9 missed cancers (8.2%), and 101 negative control mammograms. In the ROC analysis, the area under the curve (AUC) was 0.853 (95% CI = 0.801-0.905), and the cut-off value for the risk score was 34.5%, with a sensitivity of 72.7% (80/110) and a specificity of 88.3% (89/101) for AI cancer detection (Figure 1).

Figure 1. Receiver operating characteristic (ROC) analysis and the threshold calculated with Youden's index.
Risk score distributions for each cancer subgroup were as follows: 83.8% of screen-detected cancers had a risk score higher than 34.5%, while 16.2% were below it; 44.4% of interval cancers had a risk score higher than 34.5%, while 55.6% were below it; and 66.7% of missed cancers had a risk score higher than 34.5%, while 33.3% were below it (Figure 2).

Figure 2. Risk score distributions for each cancer subgroup.
Overall cancer detection rates were 67.3% (74/110) for the radiologists, 72.7% (80/110) for the AI, and 83.6% (92/110) for the radiologist and AI team-up (Figure 3). The AI detected 72.7% (80/110) of all cancers on its own, of which 62 were screen-detected, 12 were interval cancers, and 6 were missed cancers. The hypothetical AI and radiologist team-up detected 83.6% (92/110) of all cancers, of which 74 were screen-detected, 12 were interval cancers, and 6 were missed cancers. The AI evaluated 16.2% of the true-positive mammograms as negative (Figure 4). On the other hand, the AI detected an additional 44.4% (12/27) of interval and 66.7% (6/9) of missed cancers that had not been detected by the radiologists (Figure 5; Table 2).
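The overall detection rates above follow directly from the per-subgroup counts. A minimal sketch of this arithmetic, using only the counts reported in this study (the radiologists are credited, by definition, with the screen-detected cancers only, and the team-up is the union of both readers):

```python
# Cancers detected per reader and subgroup, as reported in the study.
detected = {
    "radiologists": {"screen-detected": 74, "interval": 0, "missed": 0},
    "AI": {"screen-detected": 62, "interval": 12, "missed": 6},
    "team-up": {"screen-detected": 74, "interval": 12, "missed": 6},
}
TOTAL_CANCERS = 110  # 74 screen-detected + 27 interval + 9 missed

# Overall detection rate (%) per reader, rounded to one decimal place.
rates = {
    reader: round(100 * sum(counts.values()) / TOTAL_CANCERS, 1)
    for reader, counts in detected.items()
}
print(rates)  # {'radiologists': 67.3, 'AI': 72.7, 'team-up': 83.6}
```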

Figure 3. Cancer detection rates for each group.

Figure 4. CC and MLO mammograms show a lesion presenting with architectural distortion in the retroglandular space of the upper quadrant of the right breast, which was assessed as true positive by a radiologist. However, the AI system calculated a risk score of 17% and assessed the mammogram as negative.

Figure 5. CC and MLO mammograms evaluated as negative by the radiologists; however, the AI system detected the missed cancer with a risk score of 45%.
Mammographic and Clinicopathologic Features of AI-Detected Additional Cancers.
Abbreviations: BIRADS, Breast Imaging Reporting and Data System; AI, artificial intelligence.
Discussion
In this study, we evaluated the performance of an AI algorithm in a simulated screening setting and its effectiveness in detecting missed and interval cancers. The cancer detection rate of the AI was higher than that of the radiologists but lower than that of the hypothetical radiologist and AI team-up. The AI was able to detect an additional 44.4% of interval and 66.7% of missed cancers that had not been detected by the radiologists.
The AI system used in the present study is based on CNNs, the NN type most widely used in radiologic studies. We have shown that the AI algorithm is a successful diagnostic tool for breast cancer detection, with an AUC of 0.853, which is in line with the current literature.22,29 In 2019, Rodriguez-Ruiz et al published a retrospective, multi-reader, multi-case study investigating the performance of radiologists with and without AI support.21 Their study included 240 mammograms (consisting of cancers, false-positive cases, and normal mammograms) read by 14 radiologists, and AI-supported reading yielded statistically significantly higher AUC values than unassisted reading (0.89 and 0.87, respectively). The improvement was seen with less-experienced radiologists but not with experienced radiologists, which raises questions about the real effect in clinical practice. The same group therefore published a subsequent study with the same AI algorithm, comparing the performance of the AI with that of 101 radiologists, and showed higher AUC values for the AI (0.84 vs 0.81). The AI outperformed not only the less-experienced radiologists but also 61.4% of all radiologists.22 Pacile et al published a multi-reader study evaluating the effectiveness of AI in breast cancer detection, with a design similar to the previous study: 240 mammograms (including true-positive, false-negative, true-negative, and false-positive cases) read by 14 radiologists without and with AI support, yielding AUC values of 0.769 and 0.797, respectively.41 Average sensitivity for breast cancer detection was also found to increase with AI assistance in the same study. Our study included 211 mammograms consisting of true-positive and false-negative cases together with normal mammograms from a population-based screening program. The differences between the AUC values of these studies can be explained by their different designs and AI algorithms.
An optimal dataset should contain all types of mammographic evaluations in order to resemble real-life, routine screening as closely as possible. Kim et al developed and validated an AI algorithm using 170,230 mammograms derived from 5 centers (South Korea, the USA, and the UK).37 They then designed a multicenter reader study with 320 mammograms (cancers, benign lesions, and normal mammograms) read by 14 radiologists and found a significant improvement in breast cancer detection rates. Overall AUC values for AI only, AI with radiologists, and radiologists only were 0.959, 0.881, and 0.810, respectively. Unlike the other studies, AI-only performance was better than AI-assisted radiologist performance. In a detailed analysis, they showed that the AI performed better especially in detecting early-stage cancers (T1 and node-negative cancers) and cases presenting with asymmetry or architectural distortion. Additionally, the AI was not affected by breast density as much as the radiologists were, according to the same study. These results suggest that AI may contribute positively to patient prognosis by decreasing the rate of interval breast cancers. However, these studies did not focus on interval or missed cancers and were designed as prospective reading sessions in which the readers were expected to detect a much higher number of positive mammograms than in the real-life situation, where the cancer detection rate is less than 8 in 1000. This may create a biased, artificial environment in which the reader remains more cautious. Although our study is retrospective, the reader performance was obtained in real time in a real screening program.
The interval cancer rate for biennial mammography screening is between 0.8 and 2.1 per 1000 screens, and these cancers tend to be biologically more aggressive tumors.42 Thus, reducing the interval cancer rate should improve the outcome of a screening program. This study showed a potential decrease in interval cancers of 44.4% in a screening program. A study by Lang et al showed that AI could detect 19.3% of the interval cancers in mammography screening, which is less than half of the interval cancer detection rate in our study. However, Lang et al included only the interval cancers with the highest AI score of 10 in order not to increase the recall rate.43 In our study, on the other hand, we included middle and high scores with a threshold of 34.5% and still achieved a high specificity of 88.3%. Interval cancers can be stratified as true negative or false negative depending on the presence of an evident finding on the initial mammogram. False-negative interval cancers have been reported at rates between 25% and 40% in the majority of studies.44 In other words, almost one-third of interval cancers have visible findings on the initial mammograms and are avoidable. However, detecting such subtle changes is challenging and difficult to improve without increasing recall rates. Although additional information such as prior mammograms, clinical findings, or breast cancer risk can improve outcomes, it may not be possible to evaluate this additional information in screening programs with a high volume of mammograms.45–47 AI, as a second reader, could be beneficial in triaging suspicious mammograms for a third referee reader.
Watanabe et al published a retrospective study evaluating the effect of an AI-based CAD software in detecting missed cancers on mammograms.23 They showed that only 51% of missed cancers could be detected without the assistance of AI, while this number rose to 62% with AI assistance. In our study, the AI detected 66.7% of the missed cancers, which is in line with their findings. Both studies show that more than half of missed cancers can be detected with AI support. Human error is the second main cause, after overlapping breast tissue, of nondetection of cancers at mammography.48 Both causes could be reduced by implementing AI in screening reading. Our study showed that AI can increase the detection rates of both missed and interval cancers. However, the AI detected 16.2% fewer screen-detected cancers than the radiologists, and the hypothetical radiologist and AI combination showed the highest performance in detecting screen-detected, missed, and interval cancers together. This suggests that adding AI to the reading workflow will improve the outcome of screening. A shortage of human resources, particularly in countries with limited resources, is one of the main drawbacks of screening.49 Implementing AI as a second reader in screening programs will not only help overcome human resource shortages but also improve outcomes.
Our study has several limitations. First, it is a retrospective study, and the performance of the radiologists and the AI was not compared in a prospective setting. However, the cases were selected from a population-based screening program, and all were evaluated by 2 experienced radiologists. Second, we did not evaluate the histopathological and radiological features of the detected cancers, which would have provided more detailed information about the benefit of the AI system. Third, the AI algorithm evaluates only the uploaded images and does not consider any other information such as clinical history, family history, or symptoms. Fourth, Youden's index was computed on an empirical ROC curve containing many discontinuities; strictly, ROC analysis requires appropriate curve fitting before any further consideration. The calculated threshold of 34.5% may therefore be affected by local effects of the empirical ROC curve associated with the small sample size.
Conclusion
In conclusion, AI may enhance the capacity of breast cancer screening programs by increasing cancer detection rates and decreasing false-negative evaluations such as missed and interval cancers, and it may be implemented in the screening reading workflow.
Footnotes
Acknowledgments
The first three authors contributed equally to the manuscript. This study was presented at the European Congress of Radiology 2021.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Ethics Statement
The study was approved by the institutional review board (approval number 2020-22/23).
Informed consent
Written informed consent was obtained from the patient(s) for their anonymized information to be published in this article.
