Abstract
Objectives
To analyse how reader performance varied by time during the day in a population-based breast cancer screening programme.
Methods
A total of 2,937,312 readings by 148 radiologists, from 1,468,656 screening examinations of 610,104 women in BreastScreen Norway, were included in this study. Numbers and percentages of mammographic readings, positive scores, true and false positive readings, true and false negative readings, sensitivity and specificity were presented for categories of time of day and for each day of the week. Multilevel mixed effect logistic regression models with restricted cubic splines were fitted to the data, and used to predict the odds ratio of the different performance measures.
Results
The following distribution was found for the performance measures during the study period: true positive: 12,463 (0.4%); false positive: 128,419 (4.4%); true negative: 2,794,636 (95.1%); and false negative: 1794 (0.06%). The percentage of positive readings (true positive and false positive) was highest before lunch and in the early afternoon (4.9%): false positive was highest in both periods (4.5%) and true positive was highest in the early afternoon (0.5%). The percentage of true negative was highest in the evening (95.6%), and of false negative was highest at lunchtime (0.07%). This corresponds to a gradually decreasing predicted sensitivity throughout the day. The opposite was observed for specificity.
Conclusions
Screen-reading early versus late during the day resulted in higher sensitivity, although at the cost of specificity. Despite small differences in the performance measures during the day, the results may be important in the discussion of optimal management of screening programmes.
Introduction
Mammographic screening requires high-volume reading of mammograms. To ensure the quality of the reading, the European Guidelines for Breast Cancer Screening and Diagnosis recommend double reading and an annual reading volume of at least 5000 mammograms per reader.1 The new evidence-based guidelines from the European Commission Initiative on Breast Cancer give a conditional recommendation for double versus single reading and an annual reader volume between 3500 and 11,000 mammograms.2 However, the evidence behind these recommendations is sparse. Aspects of the reading procedure, such as batch versus incidental reading, the number of mammograms per batch, and the length of the reading session, might influence the quality of the screening offered to women. Likewise, the time of day and weekday of the reading, as well as the screening volume and number of breast readers available at the centre, might influence reader performance.
Circadian rhythms are well-known characteristics of human performance, both physical and cognitive. Studies have shown that cognitive performance varies 20% to 40% during the course of the day, which seems to align with the circadian rhythm of body temperature.3,4 Several studies have suggested a reduction in efficiency from morning to afternoon.5,6
Screen-reading of mammograms is a cognitive task that requires perceptual efficiency, and one might expect variations with time of day. Studies considering time of day with regard to the reading of mammograms are limited and present varying results.7–10 In a reader study of an enriched test set of 120 mammograms, no significant circadian variations in performance were shown, and no significant differences were observed in sensitivity and specificity at different times during the day in a study of 69 breast radiologists reading 50 mammograms.7,8 A study from the UK, published in 2011, showed variation in recall rates during the day for four readers, but did not provide any information about rates of breast cancer. 9 The most recent study, published in 2017, reported lowest recall rate for cases read between 17:00 and 01:00 and lowest cancer detection rates for cases read between 01:00 and 09:00. 10
Considering the variation in what time of day radiologists read mammograms in BreastScreen Norway and the limited number of studies published, we wanted to contribute to filling this knowledge gap. We took advantage of the data collected as part of the screening programme and investigated the reader performance by time of day. Reader performance measures were defined as true and false positive, true and false negative, sensitivity and specificity. Our hypothesis was that time of day influenced the radiologists’ performance in BreastScreen Norway.
Materials and methods
Our study was approved by the data protection officer for research at Oslo University Hospital (2019/14600). BreastScreen Norway offers women aged 50–69 two-view mammographic screening every other year.11 Information from the screening and interpretation procedures is stored in a national database. Breast radiologists at 16 multidisciplinary breast centres are responsible for reading mammograms, as well as for assessment of clinically referred patients. Digital mammography was used at all centres from 2012.
The programme provides independent double reading by two breast radiologists, i.e. the first radiologist’s interpretation score is not available to the second radiologist, and no marks should be annotated on the images. Each breast is assigned a score of 1–5 by each radiologist to indicate mammographic findings (1, negative for malignancy; 2, probably benign; 3, intermediate suspicion of malignancy; 4, probably malignant; 5, high suspicion of malignancy). If either radiologist assigns a score of 2 or higher (positive score, PS), a consensus meeting determines whether to recall the woman for further assessment. If the two radiologists do not agree, a third radiologist acts as an arbitrator. Despite a PS by both radiologists, the woman could be dismissed at consensus/arbitration and referred back to screening. If a score of 3 or higher was given, the radiologists who gave this score must agree to dismiss the woman from recall. Training is recommended for all radiologists, both before starting and while continuing screen-reading.12 In this study, the radiologists’ experience in screen-reading varied from first-year faculty to those with more than 20 years of experience within the screening programme.
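The scoring rules described above can be illustrated with a minimal sketch. It assumes that each reader's scores are held as a (left breast, right breast) tuple and that the highest of the two decides the reader's result for the examination; the function names and this data layout are hypothetical and not part of the programme's software.

```python
POSITIVE_THRESHOLD = 2   # a score of 2 or higher is a positive score (PS)
AGREEMENT_THRESHOLD = 3  # a reader scoring 3 or higher must agree to a dismissal


def highest_score(breast_scores):
    """Highest of one reader's per-breast scores (1-5) for an examination."""
    return max(breast_scores)


def goes_to_consensus(reader_a, reader_b):
    """True if either reader gave a positive score on either breast."""
    top = max(highest_score(reader_a), highest_score(reader_b))
    return top >= POSITIVE_THRESHOLD


def readers_who_must_agree_to_dismiss(reader_a, reader_b):
    """Labels of the readers whose agreement is needed before dismissal."""
    readers = (("A", reader_a), ("B", reader_b))
    return [label for label, scores in readers
            if highest_score(scores) >= AGREEMENT_THRESHOLD]


# Example: reader A scores (1, 3), reader B scores (2, 1).
print(goes_to_consensus((1, 3), (2, 1)))
print(readers_who_must_agree_to_dismiss((1, 3), (2, 1)))
```

The sketch covers only the triggering of the consensus meeting; the outcome of consensus or arbitration is a clinical decision and is not modelled.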
Study sample
We received a pseudonymized data file with information about 3,004,129 readings performed during the study period, from 1 January 2012 to 31 December 2018. Due to the small number of readings at night, readings between 23:00 and 07:00 were excluded (n = 22,407). Subsequently, we excluded readings where the women were recalled due to symptoms reported at the pre-screening interview or due to technical reasons (n = 10,974), readings where there was no independent double reading (n = 3868), and readings by radiologists with fewer than 500 readings during the study period (n = 3355). As a result of these exclusions, some screening examinations were left with only one reading (n = 26,213); these were also excluded. The final study sample included 2,937,312 readings performed by 148 radiologists, corresponding to 1,468,656 screening examinations with independent double reading of mammograms from 610,104 women (Figure 1).

Figure 1. Study population, exclusions and final study sample.
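The exclusion steps of the study sample can be tallied as a simple arithmetic check. The sketch below uses only the counts reported in this section; the variable names are illustrative.

```python
# Counts as reported for the study sample.
received = 3_004_129

exclusions = {
    "read between 23:00 and 07:00": 22_407,
    "recall due to symptoms or technical reasons": 10_974,
    "no independent double reading": 3_868,
    "radiologist with fewer than 500 readings": 3_355,
    "examination left with only one reading": 26_213,
}

final_readings = received - sum(exclusions.values())

# Every remaining examination has two independent readings.
examinations = final_readings // 2

print(final_readings, examinations)
```

The result agrees with the reported 2,937,312 readings and 1,468,656 screening examinations.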
Data
When the radiologists complete the screen-reading, the time stamp is registered in the screening programme’s database. For descriptive purposes, we used this information to classify the time of day into the following categories: Morning: 07:00–08:59 (n = 351,813 readings); Before lunch: 09:00–10:59 (n = 737,098 readings); Lunch: 11:00–12:59 (n = 543,548 readings); Early afternoon: 13:00–14:59 (n = 689,994 readings); Late afternoon: 15:00–16:59 (n = 384,693 readings); Evening: 17:00–22:59 (n = 230,166 readings). The distribution of readings across times of day varied by weekday.
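As an illustration of this categorization, the following sketch maps a time stamp to one of the six categories. The boundaries and counts are taken from the text; the function name and the half-open interval convention (start inclusive, end exclusive) are assumptions made for the example.

```python
from datetime import time

# (label, start inclusive, end exclusive, number of readings) as listed in the text.
CATEGORIES = [
    ("Morning", time(7, 0), time(9, 0), 351_813),
    ("Before lunch", time(9, 0), time(11, 0), 737_098),
    ("Lunch", time(11, 0), time(13, 0), 543_548),
    ("Early afternoon", time(13, 0), time(15, 0), 689_994),
    ("Late afternoon", time(15, 0), time(17, 0), 384_693),
    ("Evening", time(17, 0), time(23, 0), 230_166),
]


def time_of_day_category(stamp):
    """Return the category label for a reading's time stamp.

    Returns None for 23:00-06:59, which was excluded from the study.
    """
    for label, start, end, _count in CATEGORIES:
        if start <= stamp < end:
            return label
    return None


total_readings = sum(row[3] for row in CATEGORIES)
print(time_of_day_category(time(10, 59)), total_readings)
```

The category counts sum to the 2,937,312 readings in the final study sample.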
We defined screen-detected breast cancer as either ductal carcinoma in situ or invasive breast cancer, after a positive screening examination resulting in a recall for further assessment. The individual reader’s score and the outcome of further assessment were used to classify the reading for each radiologist into true positive (TP), false positive (FP), true negative (TN) and false negative (FN).
For this study, we defined TP as a score of 2 or higher given by the individual radiologist on a screen-detected breast cancer, FP as a score of 2 or higher on a negative screening examination, TN as a score of 1 on a negative screening examination and FN as a score of 1 on a mammogram with a screen-detected breast cancer. The reading could be classified as FN only if the other reader gave a score of 2 or higher and a screen-detected cancer was diagnosed. For this reason, the FN rate is underestimated, and the TN rate is overestimated. Since not all interval breast cancers are truly missed cancers, we have chosen not to include these in our measures. Note also that the number and rate of FP do not correspond to the real number of recalls with negative outcome, because a number of these examinations were dismissed at consensus. The radiologists’ sensitivity was calculated as $\mathrm{TP}/(\mathrm{TP}+\mathrm{FN})$ and specificity as $\mathrm{TN}/(\mathrm{TN}+\mathrm{FP})$.
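The reading-level definitions can be sketched as follows. The classification rule follows the definitions in this section; the sensitivity and specificity functions use the conventional formulas TP/(TP + FN) and TN/(TN + FP), and the function names are illustrative. The counts are the overall figures reported in the abstract, so the printed values are crude overall estimates, not the model-based predictions.

```python
def classify_reading(reader_score, screen_detected_cancer):
    """Classify one radiologist's reading as TP, FP, TN or FN.

    reader_score: the radiologist's score (1-5) for the examination.
    screen_detected_cancer: True if the examination led to a
    screen-detected DCIS or invasive cancer.
    """
    positive = reader_score >= 2
    if positive:
        return "TP" if screen_detected_cancer else "FP"
    return "FN" if screen_detected_cancer else "TN"


def sensitivity(tp, fn):
    return tp / (tp + fn)


def specificity(tn, fp):
    return tn / (tn + fp)


# Overall counts reported in the abstract.
tp, fp, tn, fn = 12_463, 128_419, 2_794_636, 1_794

print(f"crude sensitivity: {sensitivity(tp, fn):.3f}")
print(f"crude specificity: {specificity(tn, fp):.3f}")
```

Because an FN reading requires that the other reader gave a positive score, the crude sensitivity computed this way is an upper bound for the individual readers.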
Statistical analysis
Descriptive statistics of the 148 radiologists’ performance were presented as frequencies and percentages, or median and interquartile ranges (IQR). Descriptive statistics of TP, FP, FN and TN were presented with frequencies and percentages by categories of time of day and weekdays. When modelling the association, we used a continuous measure of time of day. To adjust for the repeated measures introduced by the numerous readings performed by the individual radiologists, we fitted a multilevel mixed effects logistic regression with restricted cubic splines with five knots to the data. This allowed for a random intercept for each radiologist and non-linear associations between the performance measures and time of day. Additionally, this allowed us to adjust for weekday by adding the categorical variable as a fixed effect in the regression model. Odds ratios (OR) with corresponding 95% confidence intervals (CI) for time points during the day, with 07:00 as the reference value, were calculated by post-estimation from the adjusted model with the restricted cubic splines. Predicted probabilities of TP, FP, FN and TN from the restricted cubic spline models were presented graphically. With this model, we jointly estimated sensitivity and specificity by expressing the two measures as a set of conditional probabilities, in which sensitivity was defined as the probability of a correctly classified mammogram given a breast cancer diagnosis. Correspondingly, specificity was defined as the conditional probability of a correctly classified mammogram given a negative mammogram. All analyses were performed using Stata MP 16.0 (Stata Corp., TX, USA), and the post-estimation was performed with the user-written xblc and adjustrcspline commands.13
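To make the spline term concrete, the sketch below builds a restricted cubic spline basis in Harrell's parameterisation, scaled by the squared distance between the first and last knot, and shows how an odds ratio relative to 07:00 follows from a set of spline coefficients. It is an illustration in Python, not the Stata code used in the study. The knot locations are hypothetical, since they are not reported here, and the sketch covers only the fixed-effect part of the model, not the random intercepts or the weekday adjustment.

```python
import math


def rcs_basis(x, knots):
    """Restricted cubic spline basis for a single value x.

    Returns len(knots) - 1 terms: x itself plus len(knots) - 2 cubic
    terms that are constrained to be linear beyond the outer knots.
    """
    k = sorted(knots)
    t_first, t_pen, t_last = k[0], k[-2], k[-1]
    scale = (t_last - t_first) ** 2

    def pos3(v):
        # Truncated cubic: max(v, 0) cubed.
        return max(v, 0.0) ** 3

    terms = [x]
    for t_j in k[:-2]:
        term = (
            pos3(x - t_j)
            - pos3(x - t_pen) * (t_last - t_j) / (t_last - t_pen)
            + pos3(x - t_last) * (t_pen - t_j) / (t_last - t_pen)
        )
        terms.append(term / scale)
    return terms


def odds_ratio_vs_reference(beta, x, knots, ref=7.0):
    """Odds ratio at time x relative to the reference time.

    beta: one coefficient per basis term (log-odds scale).
    """
    bx, bref = rcs_basis(x, knots), rcs_basis(ref, knots)
    return math.exp(sum(b * (u - v) for b, u, v in zip(beta, bx, bref)))


# Hypothetical knot locations in decimal hours (five knots, as in the text).
KNOTS = [8.0, 10.0, 12.0, 14.0, 17.0]

print(len(rcs_basis(12.0, KNOTS)))  # four basis terms for five knots
```

With five knots the time-of-day effect is described by four coefficients, and the odds ratio at the reference time is 1 by construction.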
Results
A total of 148 radiologists were involved as readers during the study period, with a median of 10 years (IQR: 3–15) experience in screen-reading, a median of 25,632 (IQR: 8341–72,758) readings, and an annual median of 4081 (IQR: 2144–5926) readings.
The highest number of screen-readings was performed before lunch time (25.1%) and the lowest during the evening (7.8%; Table 1, Figure S1). True positive was highest in the early afternoon (0.5%). False positive was highest before lunch time and in early afternoon (4.5%) and lowest in the evening (4.0%). True negative was highest in the evening (95.6%) and lowest before lunch time (95.0%). False negative was lowest in the morning (0.05%) and highest during lunch time (0.07%). The descriptive association of TP and PS showed a linear trend (Figure 2).
Table 1. Number and percentage of mammographic readings, positive scores, true and false positive readings, true and false negative readings, sensitivity and specificity for the readings performed in BreastScreen Norway 2012–2018, by time of day and weekday.

Figure 2. True positive (per 1000 readings) and positive score (%), by categories of time of day, for 2,937,312 readings in BreastScreen Norway 2012–2018.
The highest number of screen-readings was performed on Tuesdays (23.5% of the readings) followed by Mondays (20.7%), while the lowest number was read on Saturdays (1.9%; Table 1). The highest percentage of TP was on Fridays (0.5%). Sundays had the lowest percentage of PS (4.6%) and FN (0.05%). Saturdays had the highest percentage of FN (0.08%) and Tuesdays had the lowest percentage of FP (4.2%).
Post-estimation from the regression model predicted lower odds of TP and FP at each hour after the reference time of 07:00 (Table 2). For TP, the odds were not significantly lower than at 07:00 in the morning and early afternoon; from 16:00 onward, the odds differed significantly from those at the reference time. The odds of FP were significantly lower, and the odds of TN significantly higher, at each hour after 07:00 compared with 07:00. No significant differences were observed for the odds of FN. The association between time of day and the predicted probability of the different performance measures showed a decreasing rate of TP, an increasing rate of TN and a stable rate of FN (Figure 3). The sensitivity was highest early in the day, while specificity gradually increased during the day (Figure 4).
Table 2. Predicted odds ratio (OR) and 95% confidence interval (CI) for true and false positives, and true and false negatives, for 2,937,312 readings in BreastScreen Norway 2012–2018.a
a: OR and 95% CI are predicted from a mixed effect logistic regression model with time of day and weekday as fixed effects and radiologist as random effect.

Figure 3. Predicted probability of true and false positive readings, and true and false negative readings, with 95% confidence intervals, for 2,937,312 readings in BreastScreen Norway 2012–2018.

Figure 4. Predicted probability of sensitivity and specificity, with 95% confidence intervals, for 2,937,312 readings in BreastScreen Norway 2012–2018.
Discussion
Our study identified the highest probability of true positive (TP) screen-readings in the morning and the lowest in the evening. For this reason, and due to the low number of false negatives (FN), the sensitivity of the screen-reading was highest in the morning and declined during the day. The opposite was observed for specificity. No substantial differences in screen-reading performance were observed by weekday.
In our study, we found a decreasing trend of PS during the day. One study from the UK showed individual variations of PS, and another study found both lower recall rates and cancer detection rates for cases read between 17:00 and 01:00 compared to 09:00–17:00.9,10 Despite these variations in performance throughout the day, there was a trend of decreasing PS and TP, which might be due to fatigue from busy working schedules. However, experienced radiologists can possibly mitigate this fatigue by adopting coping strategies.14 Colonoscopy is also a task that requires perceptual efficiency for detecting subtle signs of a disease with relatively low incidence, and is in that respect comparable to mammographic screening. Our results are in line with those from studies of the impact of time of day on the outcome of colonoscopies.15–17 In one of these studies, the polyp detection rate was 27% higher for early morning colonoscopies compared to those starting later than 08:30.16 In other words, there is reason to believe that screen-reading early in the day might be preferable to reading later in the day, but numerous factors must be considered when planning cost-effective organized screening.
Some studies have reported fatigue during the course of the week to result in a declining detection rate. 18 However, our finding of no substantial difference in reading performance by weekday is supported by other studies.19,20 We found a seemingly lower crude sensitivity on Saturdays, but consider this finding to be due to random variations at certain small study sites and the small number of FN.
In our modelling of time of day, we adjusted for the individual effect of the radiologist to present generalizable findings. Still, it is important to acknowledge that some radiologists prefer working in the morning and others in the late afternoon. Different personality types, known as chronotypes, either have their optimal cognitive performance in the morning or in the evening. 21 We assume different chronotypes to be present in this study. The effects of the different types may cancel each other out, and may have led to an underestimation of the true effect of time of day on the reading performance in our study.
The readings were observational data from daily clinical practice, including radiologists and screening examinations from a population-based programme. As such, a strength is the large number of screen-readings included. This large dataset enabled us to detect even small differences in reading performance. However, it is important to note that our findings represent the individual readers’ scores, and not the final score for the examination or the performance of the programme. A limitation of our study is the lack of review data for cancers. For examinations with at least one PS, the final decision about whether to recall the woman was made in a consensus meeting. A decreasing number of PS in our data might indicate a decrease in the number of recalls, although this cannot be known for certain. In our study, an FN score required a diagnosis of screen-detected cancer after a recall triggered by a PS by one of the readers. In other words, the FN rate is underestimated, and the TN rate is overestimated. A further limitation is that we did not include information about the organization of the screen-reading, such as batch reading versus incidental reading and how these may be distributed during a work day. Even though the sensitivity and specificity of the radiologists are closely related to the sensitivity and specificity of the programme as a whole, we cannot infer from our data that changing the time of day each radiologist reads mammograms would change the sensitivity and specificity of the programme. A mammogram read early in the day by the first reader may be read in the evening by the second reader and vice versa. Nevertheless, clinical practices where radiologists only read screening images late in the day, due to high workload, may consider changing practice.
In conclusion, our findings suggest that screen-reading early versus late during the day results in higher sensitivity for the readers, although at the cost of specificity. Despite small differences in the performance measures during the day, this may be an important aspect to consider in the discussion of optimal management of large-scale screening programmes.
Supplemental Material
Supplemental material, sj-pdf-1-msc-10.1177_0969141320953206 for Time of day and mammographic reader performance in a population-based breast cancer screening programme by Heinrich A Backmann, Marthe Larsen, Anders S Danielsen and Solveig Hofvind in Journal of Medical Screening
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Supplemental material
Supplemental material for this article is available online.
References
