Abstract
It has been observed that statistical tests are infrequently applied in analysing differences in the performance of different retrieval methods. We believe this is explained on the one hand by the complexity of the subject, and on the other hand by the desire to avoid misleading conclusions. Because practical retrieval methods cannot be explained by simple models, parametric statistical tests are generally not suitable. Some non-parametric tests require a symmetry in the null hypothesis that seems inappropriate to the required task. A second class of non-parametric tests comprises the bootstrap methods. Here, the null hypothesis seems appropriate to practical testing, but the bootstrap assumption (that the sample may adequately represent the whole population) may be in question. If the bootstrap assumption is false, one may be led to erroneous conclusions (type I or type II errors). Here, by use of a mathematical model [11] which approximates the behaviour of practical retrieval systems, we show that bootstrap methods perform well in performance comparisons based on actual test sets used in practice. Type I error is appropriately predictable and the power loss of the tests, when compared with the theoretically most powerful test in the most realistic setting, may not exceed ten percentage points. We conclude that the bootstrap methods provide a practical approach to statistical testing in the field of retrieval performance analysis.
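To make the idea of a bootstrap test concrete, the following is a minimal sketch of one common variant, a paired "shift" bootstrap test on per-query score differences between two retrieval methods. It is an illustration under stated assumptions, not necessarily the procedure evaluated in the paper: the function name `paired_bootstrap_p_value`, its parameters, and the example scores are all hypothetical. The resampling step is where the bootstrap assumption enters, since the observed queries stand in for the whole population of queries.

```python
import random


def paired_bootstrap_p_value(scores_a, scores_b, n_resamples=10000, seed=0):
    """Two-sided paired bootstrap test for a difference in mean score.

    scores_a, scores_b: per-query effectiveness scores (e.g. average
    precision) of two retrieval methods on the same queries.
    Hypothetical helper for illustration only.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")

    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = sum(diffs) / n

    # Impose the null hypothesis (mean difference of zero) by centring
    # the observed differences before resampling.
    centered = [d - observed for d in diffs]

    extreme = 0
    for _ in range(n_resamples):
        # Bootstrap assumption: the sample of queries represents the
        # population, so we resample from it with replacement.
        resample = rng.choices(centered, k=n)
        if abs(sum(resample) / n) >= abs(observed):
            extreme += 1

    # Add-one correction keeps the estimated p-value strictly positive.
    return (extreme + 1) / (n_resamples + 1)


if __name__ == "__main__":
    # Made-up per-query scores, for demonstration only.
    method_a = [0.42, 0.31, 0.58, 0.25, 0.47, 0.39, 0.52, 0.33]
    method_b = [0.38, 0.29, 0.49, 0.27, 0.40, 0.35, 0.50, 0.30]
    print(paired_bootstrap_p_value(method_a, method_b))
```

A small p-value suggests that the observed mean difference would be unlikely if the two methods performed equally well on the underlying query population. With few queries, the bootstrap assumption is weakest, which is the situation the abstract flags as a possible source of type I or type II errors.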
