Abstract
Background
The impact of radiologists’ characteristics has become a major focus of recent research. However, the markers of diagnostic efficacy and confidence in dense and non-dense breasts are poorly understood.
Purpose
This study aims to assess the relationship between radiologists’ characteristics and diagnostic performance across dense and non-dense breasts.
Materials and methods
Radiologists specialising in breast imaging (n = 128) who had 0.5–40 (13±10.6) years of experience reading mammograms were recruited. Participants independently interpreted a test set containing 60 digital mammograms (40 normal and 20 abnormal) with similarly distributed breast densities. Diagnostic performance measures were analysed via Jamovi software (version 1.6.22).
Results
In dense breasts, breast-imaging fellowship completion significantly improved specificity (p = 0.004), location sensitivity (p = 0.01) and the area under the curve (AUC) of the receiver operating characteristic (p = 0.03). Only participation in BreastScreen reading significantly improved all performance metrics: specificity (p = 0.04), sensitivity (p = 0.005), location sensitivity (p < 0.001) and AUC (p < 0.001). Reading > 100 mammograms weekly significantly improved sensitivity (p = 0.03), location sensitivity (p = 0.001), and AUC (p = 0.03). In non-dense breasts, breast fellowship completion significantly improved sensitivity (p = 0.02), location sensitivity (p = 0.04) and AUC (p = 0.002). Participation in BreastScreen reading and reading > 100 mammograms weekly significantly improved only sensitivity (p = 0.002 and p = 0.003, respectively) and location sensitivity (p < 0.001 and p < 0.001, respectively).
Conclusion
Participating in screening programs, completing a breast fellowship and reading > 100 mammograms weekly are important indicators of the diagnostic performance of radiologists across dense and non-dense breasts. In dense breasts, optimal performance resulted from participation in a breast screening program.
Introduction
Globally, breast cancer is the second most common form of cancer. 1 It is also the second leading cause of cancer-related death among Australian women. 2 Currently, mammography is the only modality considered effective for breast cancer screening as it has been proven to reduce breast cancer mortality, particularly among women aged 50 to 69. 3
Many factors impact the outcome of screening; these include intrinsic limitations of technology, the experience of those interpreting the mammogram, lesion characteristics, and breast density. 4 Researchers often refer to dense breasts as ‘heterogeneously dense’ and ‘extremely dense’ and non-dense breasts as ‘fatty’ and ‘scattered areas of fibroglandular density’. In the United States, it has been estimated that 43% of the screened population has dense breasts. 5 Chinese and Korean women have a higher prevalence of dense breasts than US women, with 49.2% and 54.4%, respectively.6,7 Data on the Australian population is not available, perhaps due to the lack of breast density notification policy. However, Breast Cancer Network Australia is advocating for breast density policy changes. 8
Increased breast density increases the risk of masking and of cancer. 9 It has also been shown that dense breast composition is directly linked to risks associated with breast cancer, 10 emphasising the need to optimise early detection in women with dense breasts. There is evidence of wide variation in diagnostic efficacy between radiologists or breast image readers. 11 This inter-reader variability requires that intrinsic human factors be considered when designing strategies to improve breast cancer detection. Thus, the impact of radiologists’ characteristics has become a major focus of recent research.12-14
Several studies have examined the association between diagnostic performance and reader characteristics such as years of reading mammograms, the number of mammograms read per year, completion of a fellowship in breast imaging and participation in diagnostic workups.11,13-17 These studies demonstrated wide variation in the relationships between observers’ characteristics and performance in mammography interpretation. However, most published studies assessed the influence of readers’ characteristics with little or no consideration for the impact of breast composition. Thus, the markers of diagnostic efficacy and confidence in dense and non-dense breasts are poorly understood, and further in-depth investigation is needed. Therefore, this study aims to assess the relationship between radiologists’ characteristics and diagnostic performance across dense and non-dense breasts.
Method and materials
Image test sets
Two digital mammography (DM) test sets were developed from a screening population database. Each DM test set contained 60 cases (40 normal and 20 abnormal). The normal cases were confirmed to be normal by at least two radiologists and by a follow-up negative mammogram obtained 2–4 years later. The types and characteristics of the lesions were also established by these radiologists. The abnormal cases contained at least one biopsy-proven cancer lesion. Density classification was determined by a consensus of two consultant breast radiologists with more than 20 years of experience in reading screening mammograms. The cases exhibited a range of breast densities classified as non-dense (≤ 50% glandular tissue) and dense breasts (> 50% glandular tissue). Australia uses the Royal Australian and New Zealand College of Radiologists (RANZCR) synoptic scale, which is similar to the fourth edition BI-RADS Atlas. To make the test set relevant to countries using the RANZCR synoptic scale and BI-RADS fourth edition, breast density of the cases included was classified according to the fourth edition BI-RADS Atlas. 18 The distribution of breast densities was as follows: DM test set 1 included 40% non-dense and 60% dense cases, while DM test set 2 included 45% non-dense and 55% dense cases.
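The stated percentages can be converted into case counts per test set. The short Python sketch below does only this arithmetic, using the 60 cases per set and the density percentages reported above; the function name is illustrative, and the split of densities between normal and abnormal cases is not reported here, so it is not modelled.

```python
def density_counts(total_cases: int, non_dense_pct: float) -> tuple[int, int]:
    """Return (non_dense, dense) case counts for one test set."""
    non_dense = round(total_cases * non_dense_pct / 100)
    return non_dense, total_cases - non_dense


# Percentages and set size as reported for the two DM test sets.
set_1 = density_counts(60, 40)  # 40% non-dense, 60% dense
set_2 = density_counts(60, 45)  # 45% non-dense, 55% dense
print(set_1, set_2)
```

Under these figures, test set 1 corresponds to 24 non-dense and 36 dense cases, and test set 2 to 27 non-dense and 33 dense cases.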
Participants
Radiologists’ demographic information at the time of completing the DM test set.
Reading environment
The images were read via the Breast Reader Assessment Strategy (BREAST) platform either at a conference or at different Australian clinical sites using primary displays between 2015 and 2019. Ambient lighting in reading rooms at conferences was set at 15–20 lux to conform with the RANZCR and BreastScreen Australia Accreditation Standards19,20 as well as with ambient lighting conditions in many clinical settings. Calibrated Barco 5MP medical-grade monochrome liquid crystal display monitors with a resolution of 2048 × 2560 pixels were used.
Study design
Participants completed an electronic survey of their demographic information and work experience, including position, speciality, completion of a fellowship in breast radiology, number of years reading mammography and the number of cases read per week. Subsequently, each reader independently interpreted the images in the test sets and assigned a confidence rating to each decision. If the reader considered the image to be normal, they moved to the next case and the case was automatically rated as 1, meaning no cancer was present. If a lesion was detected, the reader marked the lesion’s location and assigned a confidence rating score from 2 to 5, which is compatible with the Tabar/RANZCR classification used in BreastScreen Australia where 2 = benign, 3 = indeterminate/equivocal, 4 = suspicious and 5 = highly suspicious.
A rating of 3, 4 or 5 signified malignancy, with higher ratings denoting higher confidence. If a rating of 3 or above was given, the reader was asked to describe the type of breast lesion detected (discrete mass, architectural distortion, spiculated mass, nonspecific density, stellate and calcification) by checking the appropriate box in a pop-up menu. These marks and ratings were then used to assess reader performance.
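The rating scheme described above can be summarised as a small lookup with a single decision threshold. The Python sketch below is illustrative only and is not the BREAST platform's code; the names `RATING_LABELS`, `POSITIVE_THRESHOLD` and `is_positive` are hypothetical.

```python
# Confidence ratings as described in the study design.
RATING_LABELS = {
    1: "normal (assigned automatically when no lesion is marked)",
    2: "benign",
    3: "indeterminate/equivocal",
    4: "suspicious",
    5: "highly suspicious",
}

# A rating of 3, 4 or 5 signified malignancy.
POSITIVE_THRESHOLD = 3


def is_positive(rating: int) -> bool:
    """Return True when a rating counts as a malignant (positive) decision."""
    if rating not in RATING_LABELS:
        raise ValueError(f"rating must be 1-5, got {rating}")
    return rating >= POSITIVE_THRESHOLD


for rating, label in RATING_LABELS.items():
    print(rating, label, "positive" if is_positive(rating) else "negative")
```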
Statistical analysis
The radiologists’ performances were calculated in terms of specificity, sensitivity, location sensitivity and AUC in dense and non-dense breasts. Location sensitivity was determined by the distance of the mouse click from the breast lesion centre. If distance was not recorded, this indicated that the radiologist marked outside the correct region or did not give any markings. Diagnostic confidence (radiologists’ level of confidence that the detected lesion was malignant) and lesion classification (their ability to correctly classify the lesion into type) in dense and non-dense images were calculated.
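As an illustration of how these metrics relate to the recorded ratings and marks, the following Python sketch computes them for one reader. It rests on several assumptions that are not stated in this paper: a case is counted as positive at a rating of 3 or above, a lesion is counted as correctly located when a click distance was recorded and the rating was 3 or above, and the AUC is estimated non-parametrically as the probability that an abnormal case receives a higher rating than a normal case (ties counted as one half). The function names and the example ratings are hypothetical, not study data.

```python
from typing import Optional, Sequence


def sensitivity(abnormal_ratings: Sequence[int], threshold: int = 3) -> float:
    """Proportion of abnormal cases rated at or above the threshold."""
    return sum(r >= threshold for r in abnormal_ratings) / len(abnormal_ratings)


def specificity(normal_ratings: Sequence[int], threshold: int = 3) -> float:
    """Proportion of normal cases rated below the threshold."""
    return sum(r < threshold for r in normal_ratings) / len(normal_ratings)


def location_sensitivity(
    distances: Sequence[Optional[float]],
    lesion_ratings: Sequence[int],
    threshold: int = 3,
) -> float:
    """Proportion of lesions with a recorded click distance and a positive rating.

    A distance of None means no distance was recorded, i.e. the reader marked
    outside the correct region or made no mark.
    """
    hits = sum(
        1
        for d, r in zip(distances, lesion_ratings)
        if d is not None and r >= threshold
    )
    return hits / len(distances)


def auc(normal_ratings: Sequence[int], abnormal_ratings: Sequence[int]) -> float:
    """Rank-based AUC estimate: P(abnormal rated higher than normal), ties = 0.5."""
    score = 0.0
    for a in abnormal_ratings:
        for n in normal_ratings:
            if a > n:
                score += 1.0
            elif a == n:
                score += 0.5
    return score / (len(abnormal_ratings) * len(normal_ratings))


# Hypothetical ratings for four normal and four abnormal cases.
normal = [1, 1, 2, 3]
abnormal = [1, 3, 4, 5]
distances = [None, 5.0, None, 2.0]  # click distance per lesion, None = not recorded

print(sensitivity(abnormal), specificity(normal))
print(location_sensitivity(distances, abnormal), auc(normal, abnormal))
```

In practice each metric would be computed separately for the dense and non-dense subsets of cases, so that every reader contributes one value per metric per density group.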
The diagnostic performance metrics were compared using an independent-samples t-test or a Mann–Whitney U test, depending on the distribution of the data. A chi-squared (χ²) test was conducted to assess the association between radiologists’ characteristics and both diagnostic confidence and lesion classification across dense and non-dense breasts. One-way ANOVA and Kruskal–Wallis tests were applied to compare three independent groups depending on data distribution. A p-value ≤ 0.05 was considered statistically significant. These analyses were conducted via Jamovi software (version 1.6.22).
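To make the chi-squared test of association concrete, the Python sketch below computes the Pearson statistic for a contingency table of reader group by confidence score (3, 4 or 5). It is a hand-written illustration, not Jamovi's implementation, and the counts are invented for the example rather than taken from the study. For a 2 × 3 table the test has two degrees of freedom, for which the p-value has the closed form exp(−χ²/2); that shortcut applies only to the two-degree-of-freedom case.

```python
import math
from typing import List, Tuple


def chi_squared_statistic(table: List[List[int]]) -> Tuple[float, int]:
    """Return (Pearson chi-squared statistic, degrees of freedom)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


def p_value_two_dof(stat: float) -> float:
    """Upper-tail p-value of a chi-squared statistic, valid for 2 degrees of freedom only."""
    return math.exp(-stat / 2)


# Invented counts: rows = reader group (e.g. with / without a characteristic),
# columns = number of cancers reported with score 3, 4 and 5.
table = [
    [30, 20, 10],
    [10, 20, 30],
]
stat, dof = chi_squared_statistic(table)
p = p_value_two_dof(stat)
print(stat, dof, p, p <= 0.05)
```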
Results
Radiologists’ performances in dense breasts
Comparison of radiologists’ performance characteristics in dense breasts.
(*) signifies median values, including 1st and 3rd quartiles, where significant values resulted from the Mann–Whitney U test. Bold values indicate statistical significance at the p-value ≤ 0.05 level.
Radiologists’ performances in non-dense breasts
Comparison of radiologists’ performance characteristics for non-dense breasts.
(*) signifies median values, including 1st and 3rd quartiles, where significant values resulted from the Mann–Whitney U test. Bold values indicate statistical significance at the p-value ≤ 0.05 level.
Association between radiologists’ characteristics and diagnostic confidence when reporting breast cancer across breast densities
Association between radiologists’ characteristics and diagnostic confidence when reporting breast cancer.
Score 3 (indeterminate/equivocal); score 4 (suspicious); score 5 (highly suspicious). Bold values indicate statistical significance at the p-value ≤ 0.05 level.
Association between radiologists’ characteristics and their ability to classify lesions into types across breast densities
Association between radiologists’ characteristics and lesion classification.
Lesion types: Stellate, discrete mass, spiculated mass, nonspecific density, calcification and architectural distortion. Bold values indicate statistical significance at the p-value ≤ 0.05 level.
Discussion
This observational study was conducted to assess radiologists’ performances and markers of good performance across different breast compositions. The evidence suggests a difference in radiologists’ performance characteristics across dense and non-dense breasts, indicating that participation in screening programmes, completing breast fellowship training and reading >100 mammograms weekly are important diagnostic performance indicators, but not the number of years reading mammography.
The literature shows wide variation in the search, perception and decision-making abilities of radiologists that are concomitant with differences in performance in the interpretation of mammographic images, suggesting that human limitations significantly impact the efficacy of screening mammography. Differences in reader ability and interaction with radiological images cannot be completely mitigated, but should be exploited to improve diagnostic efficacy.
We expected that, in a simulated clinical environment, readers might exhibit a trade-off between sensitivity and specificity in dealing with suspicious cases. 21 However, BreastScreen readers and those with breast fellowship training were not influenced by this trade-off, as demonstrated by the significantly higher specificity compared to other characteristics, particularly in dense breasts. Whilst the impact of fellowship training and volume read on performance is consistent with most published literature,11,17,22,23 it is unclear whether the significantly higher performance observed in BreastScreen readers is due to feedback from the service. It is possible that these three factors are interconnected and work together to improve performance.
The lack of association between years of reading mammograms and performance is reasonable and consistent with data from both data linkage and observer performance studies.12,13,17,24 Our findings suggest that years reading mammograms may not necessarily capture experience because factors such as mentorship, participation in diagnostic workup, feedback and interactions with images of different disease presentations may influence how radiologists build image interpretation skills.
Diagnostic confidence is also an important factor of radiologists’ performance, as it is associated with greater accuracy in detecting breast cancer and influences decision-making regarding recall for further assessment or biopsy.22,25 However, diagnostic confidence is complex, involving visual perceptions and clinical judgements that depend on other factors, such as image quality and the interpreters’ capabilities. 4 Interestingly, BreastScreen Australia and high-volume readers demonstrated significantly lower diagnostic confidence, using a score of 3 when reporting breast cancer across dense and non-dense breasts. This is consistent with the results of a previous observational study that did not consider mammographic density. 22 This could be due to a cohort of more risk-averse readers who do not want to overemphasise the significance of the lesion or who assume that mammograms alone are unreliable and require additional information from supplemental imaging and biopsy for confirmation. In the Tabar/RANZCR scoring scheme, a score of 3 indicates that the lesion requires further investigation, usually through percutaneous needle biopsy. 26 Diagnostic confidence based on the assessment categories is very subjective; the decision to recall or not is the key parameter, as any score of 3 or above will result in a recall.
These findings have a few implications for policy and practice. First, radiologists’ characteristics associated with performance in dense and non-dense breasts can be used to optimise pairing strategies in countries such as Australia, where independent double reading of mammograms is practised. This strategy may increase the chance that if a lesion is missed by one radiologist due to breast composition, it will be detected by another. Second, these findings can be used to identify radiologists who may benefit from tailored training in identifying breast cancer in breasts of different densities and inform educational interventions to improve their performance. Third, radiological lesion classification into types has been associated with specific histological findings, 27 and variation in the classification of lesions could affect further assessment and patient management. Therefore, familiarity with lesion features may eliminate some intrinsic errors associated with radiologists’ diagnostic performance and decisions, 4 and test-set data may be useful for training to improve radiologists’ ability to detect and classify malignant features on mammograms. These findings suggest the need for observational studies exploring the impact of knowledge of mammographic lesion features on breast cancer detection and confidence levels across dense and non-dense breasts.
This study is not without limitations. First, the number of dense cases was comparable to that of non-dense cases, and such weighting may not reflect real screening populations. However, the number of cases needed to be similarly distributed to avoid selection bias. This is supported by a previous study that found readers show significantly higher sensitivity rates in dense breasts when fewer dense breast cases are included in the dataset. 28 Second, these findings may not completely reflect the performance of this cohort of radiologists in actual screening practice because interval cancers were not considered. However, a previous study 29 comparing the performance of the same cohort of Australian radiologists in both clinical and test settings showed no difference in performance, suggesting that test set data can reasonably predict performance in a clinical setting. Clinical audits to assess radiologists’ screening performance require several years of follow-up to establish true interval cancers and negative mammograms, and the results of these audits are provided to the clinical practice rather than the individual radiologist. Therefore, test set data may provide opportunities to establish reader characteristics associated with performance across breasts of different compositions and feedback for individual radiologists. Third, it is possible that not all testing sites conformed to the ambient lighting requirements of the RANZCR and BreastScreen Australia Accreditation Standards, although such differences should have a negligible impact on the findings. 30 To our knowledge, no study has closely examined the characteristics of BreastScreen Australia readers associated with improved performance in different breast compositions. Therefore, our study provides baseline data to optimise diagnostic efficacy and confidence in dense breasts.
In conclusion, participating in screening program reading, completing a fellowship in breast imaging and reading more than 100 mammograms weekly are the most important indicators of diagnostic performance across dense and non-dense breasts. In dense breasts, optimal performance was demonstrated by screening program readers. These findings have practical implications for helping breast screening programs achieve better outcomes in dense and non-dense breasts.
Footnotes
Acknowledgement
The authors would like to thank the Breast Reader Assessment Strategy (BREAST) for supporting the collection of the data used in this paper, and the Australian Department of Health, and Cancer Institute New South Wales for funding the BREAST.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
