Abstract

Xu and Prorok [1] point out that our test for independence [2] between the outcomes of subsequent screens is not equivalent to testing independence of X1 and X2. It is correct that we tested the hypothesis of independence between subsequent screens by testing an implication of this independence. When it is not possible or straightforward to test a hypothesis directly, it is common statistical practice to test an implication of the hypothesis instead. We compared the cumulative false positive risk expected under the assumption of independence with the observed cumulative false positive risk and tested whether these two probabilities were equal. Acceptance by a statistical test of an implication of a hypothesis is of course not the same as acceptance of the hypothesis itself, but it is a way to strengthen or weaken one's belief in the hypothesis. If independence did not hold, it would be very odd that the expected cumulative false positive risks in two independent mammography screening programmes resemble the observed cumulative false positive risks so remarkably closely. Our belief in the hypothesis of independence was further strengthened by the fact that it seems reasonable in view of the radiologists' practice of comparing new with old mammograms.
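The comparison described above can be sketched in a few lines: under independence, the cumulative false positive risk over several screening rounds is 1 − ∏(1 − p_i). The per-screen rates, the observed cumulative risk, and the sample size below are hypothetical illustration values, not the published figures.

```python
import math

def cumulative_fp_risk(per_screen_rates):
    """Cumulative false positive risk implied by independence of
    the outcomes of subsequent screens: 1 - prod(1 - p_i)."""
    survive_all = 1.0
    for p in per_screen_rates:
        survive_all *= (1.0 - p)
    return 1.0 - survive_all

# Hypothetical per-screen false positive rates for three screening rounds.
rates = [0.05, 0.05, 0.05]
expected = cumulative_fp_risk(rates)  # risk expected under independence

# Compare with a hypothetical observed cumulative risk using a
# normal-approximation z-statistic for one proportion (n = women screened).
observed, n = 0.148, 5000
se = math.sqrt(expected * (1.0 - expected) / n)
z = (observed - expected) / se
print(f"expected {expected:.4f}, observed {observed:.4f}, z = {z:.2f}")
```

A z-value near zero is consistent with the hypothesis p = p*; the test, of course, addresses only this implication of independence, as discussed above.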
Xu and Prorok provide an example with very few observations to illustrate that our method does not prove independence. With so few observations, the 95% confidence interval becomes very broad, so the hypothesis p = p* will be accepted for a very wide range of values of p, including values arising from data without independence between subsequent screens. We do not question the theoretical correctness of this example, but it is far removed from the reality of evaluating mammography screening programmes. Such an evaluation normally includes observations of subsequent screens from thousands of women, and thus yields narrow confidence intervals and strengthens the test of p = p*.
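The effect of sample size can be made concrete with the usual normal-approximation confidence interval for a proportion, whose half-width shrinks as 1/√n. The proportion and the two sample sizes below are illustrative assumptions, chosen only to contrast a toy example with an evaluation covering thousands of women.

```python
import math

def ci_half_width(p, n, z=1.96):
    """Half-width of the normal-approximation 95% CI for a proportion."""
    return z * math.sqrt(p * (1.0 - p) / n)

p_hat = 0.25  # hypothetical observed cumulative false positive risk
small = ci_half_width(p_hat, 20)      # a toy data set, as in the example
large = ci_half_width(p_hat, 20000)   # thousands of women, as in practice
print(f"n=20: +/-{small:.3f}; n=20000: +/-{large:.4f}")
```

With equal p and z, the ratio of the two half-widths is exactly √(20000/20) ≈ 31.6, which is why a broad interval in a tiny example says little about the method's behaviour on real screening data.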
For the HIP Mammography Screening Programme, Xu et al [1] calculated the cumulative false positive risk to be 24.5%. Using the same data and our method, we calculated the cumulative false positive risk to be 23.6%. Perhaps the radiologists in the HIP Programme did not consistently compare new mammograms with old mammograms, whereby independence between outcomes from subsequent screens would be lost. This could explain the small difference between the estimates, but in any case the difference is probably of no practical importance to the women to whom this estimate is provided. We therefore find that for real-life large data sets with narrow 95% confidence intervals, our method provides a valid, pragmatic test of independence.
