Abstract

In their 2013 article in Human and Experimental Toxicology, Geier et al. 1 presented reanalyses of data from the “Casa Pia Study of the Health Effects of Dental Amalgam in Children,” 2,3 in particular the results described in Woods et al. 4 regarding the association of exposure to dental amalgam with glutathione-S-transferase (GST)-α and GST-π. Similarly, in their 2011 article in Biometals, 5 Geier et al. presented reanalyses of data from the Casa Pia study regarding the association of exposure to dental amalgam with urinary porphyrins, as presented in Woods et al. 6 In both articles, Geier et al. imply that, because they used a more sensitive statistical analysis, their analyses (which they claim show significant associations of urinary porphyrins and GSTs with dental amalgam exposure) contradict our published findings on porphyrins and GSTs (which indicate no associations with amalgam exposure). This letter is in response to those claims. We would like to make two major points regarding this issue.
The first issue is the major difference in statistical approach. The Casa Pia study was a randomized clinical trial, and porphyrins and GSTs were secondary measures in that trial. Even with secondary measures, the appropriate analytic approach for testing whether the exposure of one of the treatment groups to dental amalgam caused an increase in porphyrins or GSTs compared to the nonexposed (dental composite) group was a comparison of the two randomized treatment groups. By taking advantage of the randomized design and using the treatment assignment designated by the randomization, our analyses allowed inferences to be made about potential cause–effect relationships between amalgam exposure and outcomes. In contrast, analyses that examine associations between outcomes and observed urinary mercury levels or weighted amalgam exposure scores are prone to biases due to confounding. To make our analyses as robust and precise as possible, we adjusted for covariates that either explained some of the error variance (and therefore increased precision) or accounted for group differences that might have occurred by chance despite the random group assignment. Hypothesis testing of this kind, using a prespecified model and hypothesis, offers the most objective evaluation of whether the observed data provide sufficient evidence to conclude that there is an association, while at the same time protecting the overall probability of reaching false positive conclusions. This is the approach advocated in the clinical trial literature.
The approach used by Geier et al. is an exploratory method that uses the data to suggest how best to configure the model and the hypothesis so that “statistical significance” is more likely to be declared. This is essentially what happens when a clinical trial compares intervention and control and finds no overall difference, and advocates for the intervention, not satisfied with the result, then delve into the data to try to find subgroups of patients in which a “statistically significant” difference can be declared. This is a well-known pitfall in clinical trials, which Friedman et al., in their book Fundamentals of Clinical Trials 7 (p. 372), call
… posthoc analyses, sometimes referred to as “data-dredging” or “fishing”. Such analysis is determined by the data themselves. Because many comparisons are theoretically possible, tests of significance become difficult to interpret and should be challenged. Such analyses should serve primarily to generate hypotheses for evaluation in other trials.
As Friedman et al. document, there are numerous examples of such post hoc findings not being confirmed in subsequent trials. Also related to the methods used by Geier et al. is the problem of multiplicity: the probability of reaching a false positive conclusion increases substantially if many tests of significance are performed. In the description of their analytic approach, Geier et al. 5 describe trying several different approaches and models to find one that provided statistical significance. It is not clear exactly how many analyses and tests were performed, but they describe analyses in which they tried five different weightings of the size of the amalgam restorations to come up with exposure scores, and two different time lag approaches (whether current exposure was related to the current response or to the subsequent response). That combination suggests that at least 10 analyses were done. If one were to engage in such multiplicity of testing and yet still want to protect against excessive false positive findings, it is common to employ something like a Bonferroni adjustment to the tests (Friedman et al., 7 p. 377). Applying the Bonferroni adjustment in this kind of situation would involve using a significance level for each test that is obtained by dividing the overall significance level desired (p < 0.05) by the number of tests performed (10), so that a level of p < 0.005 would be needed to declare any individual test significant. None of the p values given in the two Geier articles would be declared significant using that criterion, which demonstrates how tenuous their findings should be considered.
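The arithmetic behind the Bonferroni adjustment described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the count of 10 tests is the lower bound inferred above (five weightings by two time lags), and the family-wise error calculation assumes the tests are independent, which tests run on the same data generally are not. The Bonferroni threshold itself does not depend on that assumption.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level that keeps the overall level at alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def familywise_error_rate(per_test_alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive across n_tests tests,
    ASSUMING the tests are independent (an illustrative simplification)."""
    return 1.0 - (1.0 - per_test_alpha) ** n_tests


ALPHA = 0.05   # overall significance level desired
N_TESTS = 10   # assumed lower bound: 5 weightings x 2 time lags

threshold = bonferroni_threshold(ALPHA, N_TESTS)
unadjusted = familywise_error_rate(ALPHA, N_TESTS)
adjusted = familywise_error_rate(threshold, N_TESTS)

print(f"Per-test threshold after adjustment: {threshold:.3f}")
print(f"Chance of a false positive, unadjusted: {unadjusted:.2f}")
print(f"Chance of a false positive, adjusted:   {adjusted:.3f}")
```

Under the independence assumption, ten unadjusted tests at 0.05 carry roughly a 40% chance of at least one false positive, while testing each at 0.005 keeps that chance just below 5%.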
The second point we wish to make is that the porphyrins and GSTs were included in the trial because they were thought to be renal biomarkers for exposure to mercury, and we wanted to see if they would be sensitive to the low-level mercury exposure from amalgams. Slight elevations of such biomarkers are themselves not likely to be health events but may merely indicate renal recognition and response to the presence of mercury. Although we did not observe a significant increase in these biomarkers, we went in expecting to. Had we seen increases (and therefore, even if you were to accept the Geier et al. findings of increases), slight elevations in biomarkers for mercury should not be confused with a health outcome unless there is an indication of serious and permanent kidney damage. As is explained in our design paper, 2 we initially considered using the GSTs to define a primary renal health outcome for the study, since kidneys that have failed were known to excrete very high GST levels. But doing so would have required identifying a level of either or both GSTs that indicated when serious and/or permanent kidney damage had occurred, so we could compare incidences of exceeding this threshold between the two groups. We were not able to identify such GST threshold levels at that point of time, so we included GSTs as continuous secondary biomarkers. Information has been published that indicates GST-α increases 5- to 10-fold in the urine in the cases of drug or heavy metal-induced nephrotoxicity and more than 20-fold during acute tubular necrosis. 8
In Table 4 of their 2013 article, Geier et al. gave estimated GST-α levels based on their analysis that are 5–12% higher over time in the amalgam-exposed group compared with the nonamalgam-exposed group. That very slight increase stands in stark contrast to the 5- to 20-fold increases seen in true health outcomes. As became clear in discussions within our scientific group, if we had included the GSTs as primary response variables and if we defined significant elevation of one or both GSTs (even if very small) as a trial end point, we could have ended up declaring an increase in GST as a significant harmful health outcome when all we were doing was verifying an increase in mercury level (which was certain to occur due to amalgam exposure) using a renal biomarker for mercury exposure but with no true health effect. For these reasons, we suggest a great deal of caution be used in interpreting the statistical significance claimed in the 2011 and 2013 Geier et al. articles. Even if their results are assumed to be valid, their estimated slight increases in porphyrins and GSTs, which are renal biomarkers for mercury exposure as well as other exposures, should not be interpreted as indicative of a health event rather than merely a confirmation of exposure. If one wished to design a longitudinal study specifically to detect a clinically important nephrotoxic effect of dental amalgam, one could monitor (probably on an annual basis) well-known and accepted clinical measures of kidney function, such as serum creatinine, urinary albumin, and urinary N-acetyl-β-
