Abstract

Research articles frequently report several significance tests. When multiple hypothesis tests address a single issue, the P values may not be an accurate guide to the significance of a given result [1]. Whenever an investigator conducts a statistical significance test, they could make either a Type I or a Type II error (see box). The risk of making such errors is part of the hypothesis testing process, but it is generally agreed that making a Type I error is more serious than making a Type II error [2]. Normal practice dictates that the chance of making a Type I error is set before beginning the research. It is usually set at α = 0.05, the threshold against which the P value is compared to decide whether the null hypothesis is rejected. However, this 0.05 chance of making a Type I error applies to a single test. As the number of tests increases, so does the chance of making at least one Type I error.
Multiple tests are common in research [3]. For example, researchers may wish to examine treatment effects on several dependent variables. Similarly, studies sometimes report sub-group analyses after examining the main effects of a study. Both practices increase the number of hypothesis tests and the chance of making a Type I error. If the aim is to hold the overall chance of making a Type I error at 0.05, researchers need techniques for adjusting the P value when they conduct multiple tests [4].
Some methodologists argue that making corrections is necessary [5, 6], while others regard the adjustment as unnecessary because findings can be compared across separate experiments [7]. Multiple tests within a given study are unlikely to be independent, and without adjusting the P values, the chance of declaring a significant relationship between an independent and a dependent variable is greater than the 0.05 level [1]. Also, pure chance dictates that when the significance level is set to 0.05, the probability of getting a significant result is one in twenty (0.05), even if no real effect exists [6].
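To give a feel for how quickly the overall error rate can grow, the following minimal sketch computes the chance of at least one Type I error across k tests. It assumes the tests are fully independent, which is an idealised case (tests within one study are unlikely to be independent), so the figures are illustrative only. The function name `familywise_error_rate` is introduced here for illustration.

```python
def familywise_error_rate(alpha: float, k: int) -> float:
    """Chance of at least one Type I error across k tests.

    Assumes every null hypothesis is true and the k tests are independent,
    each run at significance level alpha.
    """
    # Probability that no test gives a false positive is (1 - alpha) ** k.
    return 1.0 - (1.0 - alpha) ** k


if __name__ == "__main__":
    for k in (1, 6, 10, 20):
        rate = familywise_error_rate(0.05, k)
        print(f"{k:>2} tests: chance of at least one Type I error = {rate:.2f}")
```

Under these assumptions, six tests at α = 0.05 carry roughly a one in four chance of at least one false positive, rather than one in twenty.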
How to go about it
The most often used correction is the Bonferroni correction. It is simple to apply, but is sometimes considered too conservative. It lowers the significance threshold from 0.05 to 0.05/k, where k is the number of statistical tests run [8]. Kim et al. [9] reported on functional instability of the ankle joint, presenting the results of six significance tests (Table 1). Without controlling for multiple tests, four significant differences were reported. With a Bonferroni correction, the significance threshold is divided by the number of tests conducted (0.05/6 = 0.0083). After the correction, the number of significant tests is reduced to three. In practice, the correction only affected the third P value of 0.047; it was never going to alter the original non-significant results. Please note that, however low their individual P values, the tests that remain significant after correction are each reported as significant at P < 0.05 [3].
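The Bonferroni rule can be sketched in a few lines. The P values below are hypothetical and chosen only for illustration; they are not the Kim et al. [9] data from Table 1. The function name `bonferroni_significant` is likewise an assumption of this sketch.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag each test as significant under a Bonferroni correction.

    Every P value is compared against the same lowered threshold, alpha / k,
    where k is the number of tests.
    """
    k = len(p_values)
    threshold = alpha / k
    return [p < threshold for p in p_values]


if __name__ == "__main__":
    # Hypothetical P values for six tests (not taken from any study).
    hypothetical_p = [0.001, 0.004, 0.012, 0.020, 0.024, 0.200]
    flags = bonferroni_significant(hypothetical_p)
    print(f"Threshold: {0.05 / len(hypothetical_p):.4f}")
    for p, flag in zip(hypothetical_p, flags):
        print(f"P = {p:.3f}: {'significant' if flag else 'not significant'}")
```

With these hypothetical values, five tests fall below 0.05 uncorrected, but only the two smallest fall below the corrected threshold of 0.0083.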
In spite of its simplicity, the Bonferroni correction is criticised for being too conservative [6, 11]. Other options are available, and some are only a little harder to calculate [10, 11]. The Holm [11] and Hochberg [10] procedures are also more powerful than Bonferroni, which is attributed to the fact that both are sequential [12, 13].
The two methods are similar in operation, with Holm being described as a ‘step down’ technique and Hochberg a ‘step up’ technique [8]. The Holm calculations are shown in Table 2, with the P values arranged from smallest to largest. If the smallest value is greater than 0.05/k (0.05/6 = 0.0083), stop: nothing is significant. If it is less than 0.05/k, it is significant. The process continues with the second smallest P value being compared with 0.05/(k-1) (0.05/5 = 0.01). The procedure continues, with the divisor reduced by one at each step, until a non-significant value is found; that value and all larger P values are not significant [8].
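The step-down logic described above can be sketched as follows. The P values are hypothetical (not the Kim et al. [9] data in Table 2), and the function name `holm_significant` is an assumption of this sketch. It uses a strict "less than" comparison to match the description in the text.

```python
def holm_significant(p_values, alpha=0.05):
    """Flag each test as significant under the Holm step-down procedure.

    P values are visited from smallest to largest and compared against
    alpha/k, alpha/(k-1), ..., alpha/1. The procedure stops at the first
    non-significant value; it and all larger P values stay non-significant.
    """
    k = len(p_values)
    # Indices of the tests, ordered from smallest to largest P value.
    order = sorted(range(k), key=lambda i: p_values[i])
    significant = [False] * k
    for rank, idx in enumerate(order):
        if p_values[idx] < alpha / (k - rank):
            significant[idx] = True
        else:
            break  # first failure ends the procedure
    return significant


if __name__ == "__main__":
    # Hypothetical P values for six tests (not taken from any study).
    hypothetical_p = [0.001, 0.004, 0.012, 0.020, 0.024, 0.200]
    print(holm_significant(hypothetical_p))
```

With these hypothetical values the three smallest P values pass their thresholds (0.0083, 0.01 and 0.0125), and the procedure stops at 0.020, which exceeds 0.05/3.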
The Hochberg procedure works in the opposite direction, with the P values arranged from largest to smallest (Table 3) [8]. If the largest value is lower than 0.05, all tests are significant, and the process can stop. Otherwise, the second largest value is compared against 0.05/2 (0.025); if it is lower, then it and all subsequent (smaller) P values are significant. If it is not, the process continues with the third P value compared against 0.05/3 (0.0167). If it is significant, so are each of the remaining P values. Using the Kim et al. [9] data, the fourth test was significant against a critical value of 0.0125 (0.05/4). The remaining tests are therefore also significant.
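A minimal sketch of the step-up logic follows. As before, the P values are hypothetical (not the Kim et al. [9] data in Table 3) and the function name `hochberg_significant` is introduced only for illustration. The comparison is a strict "less than", following the wording in the text.

```python
def hochberg_significant(p_values, alpha=0.05):
    """Flag each test as significant under the Hochberg step-up procedure.

    P values are visited from largest to smallest and compared against
    alpha/1, alpha/2, ..., alpha/k. At the first significant value, that
    test and every test with a smaller P value are declared significant.
    """
    k = len(p_values)
    # Indices of the tests, ordered from largest to smallest P value.
    order = sorted(range(k), key=lambda i: p_values[i], reverse=True)
    significant = [False] * k
    for step, idx in enumerate(order, start=1):
        if p_values[idx] < alpha / step:
            # This test and all remaining (smaller) P values are significant.
            for j in order[step - 1:]:
                significant[j] = True
            break
    return significant


if __name__ == "__main__":
    # Hypothetical P values for six tests (not taken from any study).
    hypothetical_p = [0.001, 0.004, 0.012, 0.020, 0.024, 0.200]
    print(hochberg_significant(hypothetical_p))
```

With these hypothetical values, the largest P value (0.200) fails against 0.05, but the second largest (0.024) is below 0.025, so it and the four smaller values are declared significant.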
All of the methods described will keep the overall probability of making a Type I error at 0.05. For each of the three corrections, the data of Kim et al. [9] show three significant test results. This is not always the case: Holm [11] and Hochberg [10] usually produce similar results to each other [8], and usually more significant results than the Bonferroni correction.
Authors and readers are encouraged to apply corrections for multiple hypothesis testing. Inflated P values are a problem [14], but controlling for them can present a clearer picture of study effects when multiple tests are presented [8]. The methods described in this paper are simple, and all of the calculations can be performed on a spreadsheet. Present your findings as clearly as possible, and examine the results of others thoroughly.
