Sage Journals: Discover world-class research

Abstract

Objective

To examine the breast cancer detection rate by single reading of an experienced radiologist supported by an artificial intelligence (AI) system, and compare with two-dimensional full-field digital mammography (2D-FFDM) double reading.

Materials and methods

Images (3D-tomosynthesis) of 161 biopsy-proven cancers were re-read by the AI algorithm and compared to the results of first human reader, second human reader and consensus following double reading in screening. Detection was assessed in subgroups by tumour type, breast density and grade, and at two operating points, referred to as a lower and a higher sensitivity threshold.

Results

The AI algorithm gave similar results to double-reading 2D-FFDM, and its detection rate was significantly higher than that of single-reading 2D-FFDM. At the lower sensitivity threshold, the algorithm was significantly more sensitive than reader A (97.5% vs. 89.4%, p = 0.02), non-significantly more sensitive than reader B (97.5% vs. 94.4%, p = 0.2) and non-significantly less sensitive than the consensus from double reading (97.5% vs. 99.4%, p = 0.2). At the higher sensitivity threshold, the algorithm was significantly more sensitive than reader A (99.4% vs. 89.4%, p < 0.001) and reader B (99.4% vs. 94.4%, p = 0.02) and identical to the consensus sensitivity (99.4% in both cases, p = 1.0). There were no significant differences in the detection capability of the AI system by tumour type, grading and density.

Conclusion

In this proof of principle study, we show that sensitivity using single reading with a suitable AI algorithm is non-inferior to that of standard of care using 2D mammography with double reading, when tomosynthesis is the primary screening examination.

Keywords

Digital breast tomosynthesis, artificial intelligence, double reading, breast cancer screening, tomosynthesis, mammography

Introduction

Breast cancer is the most common cancer in women and the first cause of death among women in high income countries.¹ Organised screening programmes for the early detection of breast cancer have been implemented in many countries.² At present, most of these are based on screening by two-dimensional full-field digital mammography (2D-FFDM) but the use of tomosynthesis (3D-mammography plus 2D-synthesised images) is often suggested due to its higher detection rate and a potentially lower recall rate.³

Many studies have shown that the use of tomosynthesis in population-based screening settings⁴,⁵ improves the detection rate for breast cancer, although it is worth noting that this finding is not universal.⁶ In the German biennial screening programme, tomosynthesis cannot be used as the first-line screening, but since 2015 tomosynthesis plus synthetically reconstructed 2D images have been widely used in screening assessment as recommended in the European guidelines on breast cancer screening.⁷ These guidelines also issued a neutral decision on breast cancer screening with digital breast tomosynthesis (DBT) compared to 2D-FFDM, suggesting that either DBT or 2D-FFDM could be the modality of choice.⁸

Given the growing shortage of radiologists in the European Union, and the increased workload that would follow if DBT screening became the norm, new strategies for reading these screening examinations might become necessary. At the moment, double reading is recommended in the European Guidelines.⁹ Any change would need to maintain the present high standard of detection capability, which is itself based on double reading of 2D-FFDM. If single-reading tomosynthesis with a validated and suitable artificial intelligence (AI) algorithm equals or outperforms the present standard of double-reading 2D-FFDM, human resources could be saved and costs reduced.

Materials and methods

In our screening unit we retrospectively identified 161 biopsy-proven cancers in women aged 50–69 that had been assessed between 2015 and 2017 with the additional use of tomosynthesis plus synthetically reconstructed 2D images. The tomosynthesis devices we used were two Senographe Essentials with Senoclaire upgrade (GE Healthcare). Patients consented to the use of their data for scientific evaluation during the screening examinations, and ethics approval for this study was granted by the ethics committee of the chamber of Westfalia at the University of Münster under 2020–026-f-S.

The original screening examination was 2D-FFDM in two views. In the German screening programme, if one or both readers of a mammogram find a suspicious lesion, the case is discussed in a conference where the final decision for recall is made by consensus of all readers of the screening unit. The recalled women are assessed by the radiologist responsible for that unit, using additional mammographic views, ultrasound, clinical examination and, if necessary, biopsy. Since 2015, tomosynthesis plus 2D-synthesised images have been used in assessment in addition to or instead of additional mammographic views. In our unit, we have mainly used two-view tomosynthesis to assess soft tissue findings such as architectural distortions, masses and asymmetric densities. Calcifications have mainly been assessed by additional magnification views. Exclusion criteria were tumours not assessed by tomosynthesis in two views or incomplete documentation.

For this study, after identification of biopsy-proven cancers from 2015 to 2017 (n = 161) that were assessed by additional two-view tomosynthesis plus synthetically reconstructed 2D images, the 3D images were re-read by an AI algorithm (ProfoundAI 3D by iCAD Inc.). The AI algorithm was considered to have identified the cancer if it marked the relevant image as positive and, in the judgement of a senior radiologist, correctly located the cancer on the image. We calculated percentages of detection by tumour histopathology (type), grading and breast density. Furthermore, we compared the results of AI to first and second readings, and to the consensus from double reading, using the McNemar test. The comparisons were made with AI at two operating points: a medium sensitivity point (referred to below as lower sensitivity threshold) and a high sensitivity point (higher sensitivity threshold).

A total of five readers contributed to this study. All readers had at least 10 years of reading experience within the screening programme, reading at least 5000 examinations annually. For density reporting we used the previous American College of Radiology Breast Imaging Reporting & Data System categories 1–4¹⁰ rather than the more recent a–d,¹¹ because the German screening software still used these during the period we evaluated.

Results

Among the 161 assessed cancers we found: 129 no special type (NST) (80.12%), 17 lobular cancers (10.56%), 5 tubular cancers (3.11%), 4 ductal carcinoma in situ (DCIS) (2.48%), 3 mucinous cancers (1.86%), 1 basal cancer (probably NST) (0.62%), 1 medullary cancer (0.62%) and 1 malignant cystosarcoma (0.62%).

At the lower sensitivity threshold the system detected 157 cancers out of 161 (97.5%, 95% CI: 95.0–100.0) and at the higher, the detection rate was 160 out of 161 (99.4%, 95% CI: 98.2–100.0).
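The intervals above can be approximated from the raw counts. The following is a minimal sketch, assuming a normal-approximation (Wald) interval truncated at 100%; the text does not state which interval method was used, so the last digit may differ slightly from the reported values. The function name `wald_interval` is illustrative only.

```python
from math import sqrt


def wald_interval(detected: int, total: int, z: float = 1.96):
    """Return (sensitivity, lower, upper) as proportions, clipped to [0, 1].

    Assumption: a Wald (normal-approximation) interval; the paper does not
    name its interval method.
    """
    p = detected / total
    half_width = z * sqrt(p * (1 - p) / total)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# Counts taken from the text: 157/161 (lower threshold), 160/161 (higher threshold).
intervals = {
    "lower threshold": wald_interval(157, 161),
    "higher threshold": wald_interval(160, 161),
}

for label, (p, lo, hi) in intervals.items():
    print(f"{label}: {100 * p:.1f}% (95% CI: {100 * lo:.1f}-{100 * hi:.1f})")
```

With proportions this close to 100% and only 161 cases, a Wald interval is a rough guide; an exact or Wilson interval would be a reasonable alternative in a larger evaluation.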

Table 1 shows the detection results for the AI algorithm by tumour type, grade and density category. The detection rate of NST was 99.2% at the higher sensitivity threshold and 97.7% at the lower. The detection rate of all other types of histopathology was 100% at the higher sensitivity threshold, and except for lobular carcinoma and mucinous carcinoma was also 100% at the lower. For grade, there was a slightly lower detection rate for grade 3 than for grades 1 and 2 tumours. For density, detection was lower at low levels of density using the lower sensitivity threshold.

Table 1.

AI algorithm detection results, by tumour type, grade and density category.

Factor	Category	No. of cancers	Detected at higher sensitivity threshold, n (%)	Detected at lower sensitivity threshold, n (%)
Tumour type	NST	129	128 (99.2)	126 (97.7)
	Lobular	17	17 (100)	16 (94.1)
	Other invasive	11	11 (100)	10 (90.9)
	DCIS	4	4 (100)	4 (100)
Grade	1	42	42 (100)	41 (97.6)
	2	80	80 (100)	79 (98.8)
	3	35	34 (97.1)	33 (94.3)
	Not recorded	4	4 (100)	4 (100)
Density category	ACR 1	8	8 (100)	7 (87.5)
	ACR 2	68	68 (100)	67 (98.5)
	ACR 3	77	76 (98.7)	75 (97.4)
	ACR 4	8	8 (100)	8 (100)

Finally, we performed statistical analysis for comparison of human reading results of 2D images vs. human reading of 3D images supported by the AI algorithm. The results are shown in Table 2 for the lower sensitivity threshold. The algorithm was significantly (p = 0.02) more sensitive than reader A, at 97.5% (95% CI: 95.0–100.0) vs. 89.4% (95% CI: 84.6–94.2), was non-significantly (p = 0.2) more sensitive than reader B, at 97.5% (95% CI: 95.0–100.0) vs. 94.4% (95% CI: 90.8–98.0), and was non-significantly lower (p = 0.2) than the consensus from double reading at 97.5% (95% CI: 95.0–100.0) vs. 99.4% (95% CI: 98.2–100.0).

Table 2.

Comparison of AI algorithm result with human reader result, lower sensitivity threshold.

Human reader	Human reading result	AI result: positive	AI result: negative	Total
Reader A	Positive	140	4	144
	Negative	17	0	17
	Total	157	4	161
Reader B	Positive	149	3	152
	Negative	8	1	9
	Total	157	4	161
Consensus double reading	Positive	156	4	160
	Negative	1	0	1
	Total	157	4	161

Table 3 shows the corresponding results for the higher sensitivity threshold. The AI algorithm was significantly (p < 0.001) more sensitive than reader A, at 99.4% (95% CI: 98.2–100.0) vs. 89.4% (95% CI: 84.6–94.2), significantly (p = 0.02) more sensitive than reader B, at 99.4% (95% CI: 98.2–100.0) vs. 94.4% (95% CI: 90.8–98.0), and identical in sensitivity to the consensus from double reading, at 99.4% (95% CI: 98.2–100.0) in both cases (p = 1.0).

Table 3.

Comparison of AI algorithm result with human reader result, higher sensitivity threshold.

Human reader	Human reading result	AI result: positive	AI result: negative	Total
Reader A	Positive	143	1	144
	Negative	17	0	17
	Total	160	1	161
Reader B	Positive	152	0	152
	Negative	8	1	9
	Total	160	1	161
Consensus double reading	Positive	159	1	160
	Negative	1	0	1
	Total	160	1	161

Discussion

At present, most of the population-based organised screening programmes in Europe are based on 2D-FFDM and a double reading policy. The European Commission Initiative on Breast Cancer (ECIBC) suggested the use of either 2D-FFDM or DBT as the primary exam in screening.⁸ One of the drawbacks for the implementation of DBT is the increasing number of images to be read.¹² The additional reading time will be even larger if the double reading policy persists. While there is evidence that AI can reduce reading time in the context of DBT screening,¹³ there remains a need for prospective research within real-time screening programmes.

This study demonstrates that DBT read by an AI algorithm is not inferior in terms of cancer detection to double-reading 2D-FFDM as it is done today. The high capacity for detection was seen in all tumour types and grades, and in dense breast tissue; however, the numbers are too small to be confident of the results in specific subgroups. Although our results suggest that the algorithm is effective in dense breasts, it should be noted that we only had eight cases in the highest density category.

There have been positive results for large-scale retrospective reading of mammograms with AI systems. A simulation on a representative large data set from the United States and UK (2D-FFDM) with the AI system used as a second reader demonstrated that, compared to the standard double-reading process used in the UK, the AI system maintained a non-inferior performance and reduced the workload which would have been incurred by the second reader by 88%. Furthermore, a reduction in false positives and false negatives could be demonstrated.¹⁴

Our study has some limitations. It is of modest size. Although studies on larger populations showed similar distributions of invasive tumour attributes,¹⁵ we had very few DCIS cases in our series. This retrospective study addresses only the sensitivity of the algorithm. We do not report any false-positive rates for the AI algorithm, and these will be essential as part of a further evaluation. We also do not know how many subsequent interval cancers could have been identified by the algorithm. On the other hand, there have been trials of tomosynthesis in screening, which have shown that the rate of false positive findings will not increase.⁴,⁵ Trials have also shown that detection of breast cancers is significantly higher with tomosynthesis.⁴,⁵ Finally, we cannot state that every available algorithm will give the same performance, as we know that algorithms have to be adapted to the imaging device, and quality measures have to be developed.

To build on the initial positive results of this study, further studies (particularly prospective) are required, with numbers large enough for subgroup analyses. There should also be the development of test sets for algorithms as well as human readers.

Footnotes

Acknowledgements

The histopathology results were provided by Prof. Dr med. Horst Bürger, Institut für Pathologie, Husenerstraße, Paderborn, Germany. Technical support was given by Jonathan Go from iCAD Inc., Nashua, NH, USA. iCAD Inc. granted us the AI algorithm for study purposes.

Declaration of conflicting interests

AG has given several presentations on the use of ProFound AI in diagnostics at several international meetings (RSNA, ECR, Eusobi). Travel expenses were covered and honoraria were paid.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iDs

Axel Graewingholt

Stephen Duffy

References

1. Ferlay, Héry, Autier, et al. Global burden of breast cancer. In: Li C (eds) Breast cancer epidemiology. New York: Springer, 2009, pp. 1–19.

2. International Agency for Research on Cancer. Breast cancer screening. In: IARC handbooks of cancer prevention. Lyon: IARC, 2016.

3. Bae MS. Sustainable benefits of digital breast tomosynthesis screening. Radiology 2020. DOI: 10.1148/radiol.2020203933.

4. Bernardi, Macaskill, Pellegrini, et al. Breast cancer screening with tomosynthesis (3D mammography) with acquired or synthetic 2D mammography compared with 2D mammography alone (STORM-2): a population-based prospective study. Lancet Oncol 2016; 17: 1105–1113.

5. Pattacini, Nitrosi, Giorgi Rossi, et al., for the RETomo Working Group. Digital mammography versus digital mammography plus tomosynthesis for breast cancer screening: the Reggio Emilia tomosynthesis randomized trial. Radiology 2018; 288: 375–385.

6. Hofvind, Holen ÅS, Aase, et al. Two-view digital breast tomosynthesis versus digital mammography in a population-based breast cancer screening programme (to-Be): a randomised, controlled trial. Lancet Oncol 2019; 20.

7. https://healthcare-quality.jrc.ec.europa.eu/european-breast-cancer-guidelines/diagnosis/DBT (accessed 11 November 2020).

8. https://healthcare-quality.jrc.ec.europa.eu/european-breast-cancer-guidelines/screening-tests#recs-group-1 (accessed 11 November 2020).

9. https://healthcare-quality.jrc.ec.europa.eu/european-breast-cancer-guidelines/organisation-of-screening-programme/how-mammography-should-be-read (accessed 20 November 2020).

10. D'Orsi CJ, Bassett LW, Berg WA, et al. ACR BI-RADS Atlas: breast imaging reporting and data system. 4th ed. Reston, VA: American College of Radiology, 2003.

11. D'Orsi CJ, Sickles EA, Mendelson EA, et al. ACR BI-RADS Atlas: breast imaging reporting and data system. 5th ed. Reston, VA: American College of Radiology, 2013.

12. Houssami, Lockie, Clemson, et al. Pilot trial of digital breast tomosynthesis (3D mammography) for population-based screening in BreastScreen Victoria. Med J Aust 2019; 211: 357–362.

13. Conant, Toledano, Periaswamy, et al. Improving accuracy and efficiency with concurrent use of artificial intelligence for digital breast tomosynthesis. Radiol Artif Intell 2019; 1: e180096.

14. McKinney, Sieniek, Godbole, et al. International evaluation of an AI system for breast cancer screening. Nature 2020; 577: 89–94.

15. Nagtegaal, Allgood, Duffy, et al. Prognosis and pathology of screen-detected carcinomas: how different are they? Cancer 2011; 117: 1360–1368.

Retrospective comparison between single reading plus an artificial intelligence algorithm and two-view digital tomosynthesis with double reading in breast screening
