Abstract
Study design, statistical analysis, interpretation of results, and conclusions are part of all research papers. Statistics are integral to each of these components and therefore must be evaluated during manuscript peer review. Research published in Toxicologic Pathology often focuses on animal studies that compare defined treatment groups in randomized controlled experiments or examine the reliability of measurements and the diagnostic accuracy of observed lesions from preexisting studies. Reviewers should distinguish scientific research goals that test for sufficiently large effect size differences (i.e., minimizing false positive rates) from common toxicologic goals of detecting any harmful effect (i.e., minimizing false negative rates). The journal publishes a wide range of study designs that require different kinds of statistical assessment. Statistical methods should therefore be described in enough detail that the experiment can be repeated by other research groups; the misuse of statistics impedes reproducibility.
Statistics are an integral component of scientific research. Research methods, including statistical methods, are often not described in sufficient detail for another laboratory group to repeat the experiment. The misuse of statistics is another concern. In recent years, there has been an increased awareness of the need for better description of statistics to improve transparency and reproducibility in scientific publications (Wade 2000; Williams 2015). A number of journals have established statistical guidelines to address this issue (Mullane et al. 2015; Hickey et al. 2015; Hills 2017; Hofner, Schmid, and Edler 2016; Hutton et al. 2017; Khan 2017; Lang and Altman 2015; Levesque 2015; Stephens and Grant 2015). In addition, the American Statistical Association (2018) has prepared the “Ethical Guidelines for Statistical Practice” to promote transparent assumptions, reproducible results, and valid interpretations for anyone who uses statistics in their profession. Analyses based on custom code or data processing that are not adequately described in a manuscript will also impede reproducibility. Reviewers should critically assess the statistical descriptions and procedures, or alert the editor if they do not feel qualified to make appropriate assessments.
P value calculations are not necessary or appropriate for every study. For example, animal pathology may involve staining and interpretation of preselected slides. Randomized controlled experiments are concerned with population-level effects, while a case report might focus on the in-depth examination of just one individual. Obviously, population-level statistics, involving p values, means, and errors, are not relevant in the latter case. But even a case report does not entitle the researcher to ignore all statistical considerations. Was the individual chosen randomly or selected from a population based on its unique characteristics? How was the sample collected? Is there any information about the population from which the individual was chosen? How has potential bias in examination been addressed and described? Does the interpretation of the results cover just one particular sample, or could it apply to a broader population? Are the conclusions justified by the data? As we will describe more fully below, these considerations cannot be separated from statistics.
Experimental Design
Reviewers should ensure that authors clearly state the objective of their research and all hypotheses or research questions that they aim to address in their work. Well-designed experiments will allow suitable statistical assessment and interpretation, while most poorly designed experiments cannot be rescued by statistical computations (Festing and Altman 2002).
Experimental design describes the treatments that will be compared, the number of experimental units to which the treatments will be applied, and the rules that allocate units to the treatments. The units in an experiment are often individual animals, but they may also be litters in a reproductive study or cages of animals when studying infection or behavior. The experimental design should be conveyed clearly in the manuscript so that the reader can readily identify any confounding between treatments and blocking effects. For instance, when comparing two treatments, if all units exposed to treatment “A” were fed a different diet than all units exposed to treatment “B,” then treatment would be confounded with diet. That is, statistical analysis of the data would likely not be able to separate the effects of treatment from the effects of diet. Replication is another important component of experimental design, as replicated values are used to estimate the variation associated with the treatment. If the same unit is measured multiple times, the resulting data should not be treated as independent observations. Special statistical methods are required to handle repeated measurements and may require additional statistical consultation.
Statistical bias occurs when measurements or statistics systematically tend to be greater or less than the true value in a study design. For example, oversampling diseased animals can produce hidden correlations between animals when these samples end up in the same treatment groups either by chance or by design. Random assignment of units to groups minimizes selection bias and forms the theoretical basis of statistical testing (Fisher 1971). Randomization reduces bias by equalizing unmeasured factors across groups. Studies that do not include a control group can also result in bias. Control groups (or conditions, such as pretreatment) are an important way to account for the effects of unmeasured factors on the outcome (Moore and McCabe 2001). Every study should plan to have a control group or condition concurrent with the experimental treatment groups.
The probability that a statistical test detects a specific difference, when it exists, between the control group and the treated group is referred to as statistical power. Sample size is an important factor in determining power. In planning each experiment, the sample size should be large enough to have a high probability to detect biologically meaningful differences between groups yet not so large as to declare very small differences as statistically significant and, thus, waste animals, time, and resources. When designing a study, it is a common practice to include enough units to achieve 80% power (i.e., a true, predefined, difference will be found 80% of the time). It is also advisable to discuss these power calculations in the Methods section.
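As a rough illustration of the sample-size reasoning above, the sketch below uses the common normal-approximation formula for a two-sided, two-sample comparison of means. The function name and the example numbers are hypothetical; in practice a t-based calculation or dedicated power software would be preferable, especially for small groups.

```python
import math
from statistics import NormalDist

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided, two-sample
    comparison of means (normal approximation):
    n = 2 * ((z_{1-alpha/2} + z_{power}) * sigma / delta) ** 2
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for two-sided alpha
    z_beta = z.inv_cdf(power)           # quantile corresponding to desired power
    return math.ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)

# Detecting a difference of one standard deviation with 80% power
print(n_per_group(delta=1.0, sigma=1.0))  # → 16
```

This reproduces the familiar rule of thumb that detecting a one-standard-deviation difference at α = .05 with 80% power requires roughly 16 units per group; halving the detectable difference roughly quadruples the required sample size.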
Data Collection and Processing
Pathologists normally use standard criteria for their diagnoses. Therefore, the study pathologist typically has access to a variety of information needed to make proper assessments and does not rely on blinded evaluations. This material includes in-life clinical observations, clinical pathology data, experimental design criteria, husbandry conditions, the characteristics of the test agent, and other pertinent material (Crissman et al. 2004). Knowledge of treatment group status may be used to separate treatment-related effects from unimportant changes found in controls. That is, the range of effects in normal specimens can be helpful to establish a baseline threshold to aid the diagnosis of lesions (Neef et al. 2012). There appears to be general agreement among toxicological pathologists that blinded evaluations during the initial evaluation of tissues may diminish the impact of pathological assessments by increasing the time for evaluation and increasing the difficulty of separating treatment-related changes from normal variation (Crissman et al. 2004; Iatropoulos 1984; Newberne and de la Iglesia 1985). Statistical theory and methods rely on blinded evaluations, so the absence of blinding has historically been a point of contention between toxicological pathologists and statisticians.
Despite the challenges of blinded studies, it is not uncommon for pathologists to receive requests for blinded review of one or more lesions. In these cases, selected samples can be reevaluated in a blinded fashion in order to increase confidence in the final diagnoses (Rousseaux and Gad 2013). To avoid systematic differences in judgment during blinded evaluations, each pathologist should evaluate all the slides. Due to the potential for conscious and unconscious bias, the authors should describe in detail how slides are read.
Quantitative data, such as tissue concentrations, may need data processing prior to statistical analyses. The limit of blank, limit of detection (LoD), and limit of quantitation (LoQ) are different values that describe the valid measurement range of analytic procedures (Armbruster and Pry 2008). The LoD is particularly important for immunoassays that qualitatively determine the presence or absence of analytes (e.g., drugs and troponin) or hormones (e.g., estrogen). Concentrations that are too small to be reliably measured are censored and are often replaced with a constant value (e.g., 0, LoD/2, or LoD/√2).
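The censoring substitution just described can be sketched as a small pre-processing step. This is only an illustration of the convention, not an endorsement; the function name and data are hypothetical, and more principled methods for censored data (e.g., maximum-likelihood approaches) exist.

```python
import math

def substitute_censored(values, lod, sub="lod/2"):
    """Replace concentrations below the limit of detection (LoD) with a
    constant -- a common, if imperfect, convention for censored values."""
    constants = {"zero": 0.0, "lod/2": lod / 2, "lod/sqrt2": lod / math.sqrt(2)}
    c = constants[sub]
    return [c if v < lod else v for v in values]

# Two of four hypothetical measurements fall below an LoD of 0.5
print(substitute_censored([0.1, 2.3, 0.4, 5.0], lod=0.5))  # → [0.25, 2.3, 0.25, 5.0]
```

Reviewers should check that the manuscript states which constant was used, since the choice can affect downstream summary statistics.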
Table 1. Useful Statistical Tests.
Note: ANOVA = analysis of variance.
Statistical Analyses
Null hypotheses and statistical tests should be determined in the planning stages of each experiment. Every statistical test has a probability of producing a significant result just by chance. It is an unsound practice (aka “data dredging,” “data snooping,” or “p hacking”) to conduct many different statistical tests until a desirable result is found and then report the p values associated with only the desired outcomes. This happens all too often and leads to falsely elevated significance (falsely small p values) when the statistical analysis does not properly account for the large number of statistical tests that were actually conducted.
The first step in data analysis is to characterize the sample or groups using descriptive statistics, including noting whether outliers were present and whether they were removed from further analysis. Descriptive statistics also help determine whether quantitative data are normally distributed so that appropriate statistical tests may be applied.
Some studies involve descriptive statistics or prediction instead of statistical testing. These two applications are not only acceptable but preferable to statistical testing in certain situations. The data can be displayed in graphs, figures, and tables. Tables are useful when exact values are more informative, and figures may be more helpful for visualizing trends. When reporting the mean, the authors should also include the precision of the mean estimate (standard deviation or standard error, which should be clearly defined in figure legends). For instance, error bars in tables and figures based on biological replicates are necessary for making biological conclusions (Altman et al. 1983; Anonymous 2013; Rousseaux and Gad 2013). Error bars based on technical replicates may have utility in describing the technical variability of a method, but they cannot be used to make biological conclusions. The number of significant figures used in reporting statistics should be carefully considered, especially for sample sizes less than 100, in order to avoid the false impression that precision is greater than it actually is (Altman et al. 1983). Percentages should generally not be quoted to more than one decimal place, but t-statistics and correlation coefficients can be quoted to two decimal places.
The shape of a distribution and the type of data (Figure 1) are key factors in choosing an appropriate statistical test or presentation of the data. The central tendency (mean, median, or mode) is an important feature of a data distribution and is determined by the shape of the distribution and the type of data. Numeric data are either discrete or continuous, while categorical data are ordinal or nominal. The mean or median is usually used for statistical testing and presentation of numeric data. The median or mode is more appropriate for presenting ordinal data, and the mode must be used for nominal data.
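To make these measures of central tendency and dispersion concrete, Python's standard statistics module can compute them directly; the severity scores below are hypothetical ordinal data.

```python
import statistics as st

# Hypothetical ordinal lesion severity scores (1 = minimal ... 4 = marked)
grades = [1, 1, 2, 2, 2, 3, 4]

print(st.mean(grades))    # average value (best reserved for numeric data)
print(st.median(grades))  # middle value of the sorted scores
print(st.mode(grades))    # most frequent value
print(st.stdev(grades))   # sample standard deviation

# Interquartile range (IQR): difference between upper and lower quartiles
q1, q2, q3 = st.quantiles(grades, n=4)
print(q3 - q1)
```

For these skewed ordinal scores, the median (2) with the IQR is the more defensible summary; the mean (≈2.14) implies a numeric scale the grades do not have.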

Figure 1. Distributions of different data types. The appropriate measure of central tendency for illustrative distributions of (A) numeric and (B) categorical data is indicated in the figure for mean (solid vertical line), median (dashed vertical line), and mode. Mean refers to the average value, median is the middle value, and mode is the value that appears the most often in a distribution. Dispersion can be measured according to standard deviation (or standard error), interquartile range (IQR), and range. Standard deviation is the average deviation of scores from the mean and is useful when describing the variability of measurements. On the other hand, the standard error is the standard deviation of a sampling distribution of the mean and is useful when describing the uncertainty around the mean. The IQR is the difference between upper and lower quartiles. Range refers to the difference between highest and lowest observed values. The central tendency of continuous data can be represented as the mean, the median, or the mode; ordinal data should be described by the median or the mode; and nominal data should only be described by the mode. With symmetric data, mean (± standard deviation or standard error) is usually preferable. However, if the data are skewed or contain influential outliers, then the median and IQR are more suitable. IQR is more informative than range for ordinal data. The dispersion measure for nominal data is the modal percentage, that is, the percentage of the sample that belongs to the modal category.
Parametric assumptions often stipulate normally distributed data, equal variances among groups, and independent samples. The Shapiro–Wilk test for normality and the Levene test for equality of variances may be used to evaluate the data for parametric assumptions. Often, a simple transformation of the data, such as taking the logarithm of each of the values, may improve normality and equality of variances. When these assumptions are reasonable, t-tests, analysis of variance (ANOVA), linear regression, and correlation may be appropriate methods to apply, depending on the hypotheses (Table 1). A very common approach to compare multiple groups is to use one-way ANOVA to determine whether groups differ, followed by multiple comparisons tests to understand specific differences. However, when the normality and equal variances assumptions are not reasonable, these methods may result in erroneous conclusions. If the normality assumption is violated, or when the distribution of the data is unknown, nonparametric tests may provide a better approach. Nonparametric methods include Mann–Whitney tests, Kruskal–Wallis ANOVA, Spearman’s rank order correlation, and resampling techniques such as permutation tests. Resampling generates a sampling distribution from the original data. An estimated p value can be calculated by comparing the test statistic calculated from the observed data to the distribution of test statistics calculated from the permuted data sets. Violations of the equal variances assumption should be carefully considered to decide the best testing strategy to use for the situation at hand (Hothorn 2014).
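A permutation test of the kind mentioned above can be sketched in a few lines of standard-library Python. The group values are hypothetical and the number of shuffles is an arbitrary choice; this is a sketch of the resampling idea, not a replacement for vetted statistical software.

```python
import random
from statistics import mean

def permutation_test(x, y, n_perm=10_000, seed=0):
    """Two-sided permutation test for a difference in group means.
    The p value is estimated as the fraction of random label shuffles
    whose |mean difference| is at least as extreme as the observed one."""
    rng = random.Random(seed)
    observed = abs(mean(x) - mean(y))
    pooled = list(x) + list(y)
    n = len(x)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign group labels at random
        if abs(mean(pooled[:n]) - mean(pooled[n:])) >= observed:
            hits += 1
    return hits / n_perm

# Hypothetical measurements from a control and a treated group
control = [4.1, 3.8, 4.5, 4.0, 3.9]
treated = [5.2, 5.6, 4.9, 5.4, 5.1]
p = permutation_test(control, treated)
print(p < 0.05)  # → True (the groups do not overlap at all)
```

Because the null distribution is built from the data themselves, no normality assumption is needed, which is why resampling is attractive when distributional assumptions are doubtful.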
Significance testing evaluates the evidence in the sample against a predetermined null hypothesis (H0) of no difference, no effect, or no association. The p value is the probability of obtaining a test statistic as extreme as, or more extreme than, the test statistic computed for the experiment assuming that H0 is true. The p values provide a measure of evidence against a null hypothesis (H0) rather than evidence for accepting the alternative hypothesis (H1) that there is a difference or effect or association. H1 is one-sided if the hypothesis claims that the direction of change in a parameter is either larger or smaller than H0 (difference in only one direction is specified). On the other hand, H1 is two-sided if the hypothesis only claims that a parameter is not equal to H0 (difference can occur in either direction). The evidence against H0 is stronger for smaller p values. Reviewers should keep in mind that while p values provide a useful tool for assessing study outcomes, they are only one source of evidence and should not be used to the exclusion of all other information. In addition, researchers and reviewers should be aware that correlation between units can produce incorrect p values that exhibit higher significance than would have resulted from a sample that accurately represented the population unless statistical methods that correctly incorporate the correlation are used.
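The one-sided versus two-sided distinction can be illustrated for a standard normal test statistic; the function below is a hypothetical sketch using Python's NormalDist, not a general-purpose testing routine.

```python
from statistics import NormalDist

def p_values(z):
    """One- and two-sided p values for a standard normal test statistic z.
    One-sided (H1: parameter greater than H0): P(Z >= z).
    Two-sided (H1: parameter not equal to H0): P(|Z| >= |z|)."""
    nd = NormalDist()
    one_sided = 1 - nd.cdf(z)
    two_sided = 2 * (1 - nd.cdf(abs(z)))
    return one_sided, two_sided

one, two = p_values(1.96)
print(round(one, 3), round(two, 3))  # → 0.025 0.05
```

The same test statistic yields half the p value under a one-sided alternative, which is why the sidedness of H1 must be specified before the data are analyzed, not after.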
The p values should be reported and their calculation should be carefully described. Consider again an experiment with two treatment groups. In this example, H0 can take two basic forms. The most common statistical testing framework utilized in the two-group case would be a t-test in which H0 states that the means of both groups are the same and H1 states that the group means are different. More complicated statistical models are required when multiple factors are involved. For instance, batch effects, litter sizes, and body weight are typical covariates that may be considered in statistical modeling of toxicological studies. When multiple measurements are taken on the same unit over time, repeated measurements analyses are needed. Also, without adequate adjustment, statistical hypothesis testing in high-dimensional experiments (e.g., gene expression microarray studies) gives rise to unacceptably high false positive rates. This large-scale multiple hypothesis testing problem should be corrected with a suitable false discovery rate or family-wise error rate control procedure. It would be wise for a reviewer to seek solid statistical consultation if these methods are unfamiliar.
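As one concrete example of multiple-testing control, the Benjamini–Hochberg step-up procedure for the false discovery rate can be sketched as follows; the p values shown are hypothetical.

```python
def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up procedure: return the indices of
    hypotheses rejected at false discovery rate q.
    Sort the m p values; reject the hypotheses with the k smallest
    p values, where k is the largest rank i with p_(i) <= i * q / m."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * q / m:
            k = rank  # step-up: keep the largest qualifying rank
    return sorted(order[:k])

# All four hypothetical tests are rejected: rank 4 passes (0.041 <= 0.05),
# so the step-up rule also rejects rank 3 even though it fails its own cutoff.
print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041]))  # → [0, 1, 2, 3]
```

The step-up behavior in the example is the point: unlike a per-test cutoff, the procedure controls the expected proportion of false discoveries across the whole family of tests.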
The probability of rejecting a true null hypothesis (α) is referred to as the significance level (or Type I error or false positive rate). While the value of α chosen for a given study is arbitrary, it is usually specified (before the experiment is conducted!) to be .05. All statistical procedures, including software packages, tests, sample sizes, and preselected significance levels should be adequately described. Citations should be provided for less common statistical tests so that other researchers can reproduce the results. The p values should be reported numerically (e.g., p = .03), and very small p values should use the reporting convention “p < .001.” Authors should not use fillers such as “not significant” or “n.s.” in the place of numbers when reporting results of statistical tests. Table 1 provides a brief description of statistical tests that may be encountered in the journal and when they are appropriate.
As a final note, there are two major schools of thought in statistics: frequentist and Bayesian (Bland and Altman 1998). The discussion presented in this section has focused on the “frequentist” framework, which is by far the most common approach used in toxicological pathology. Frequentist strategies assume that a population parameter is a fixed and unvarying quantity. Significance testing then focuses on hypotheses involving this population value based on the probability of an event occurring in an experiment. By contrast, Bayesian strategies assume that a probability distribution describes a population parameter. In this framework, a probability distribution is selected to express a prior belief about the population and combined with a likelihood (the observed data) to produce the resulting posterior distribution. For example, historical control data could potentially serve as a prior distribution for some toxicological bioassays. Nevertheless, the selection of a prior distribution will influence the results and needs to be carefully considered. In the absence of prior knowledge, a uniform prior is often assumed. Hothorn (2014) presents a useful discussion of frequentist and Bayesian approaches and software used in toxicological applications.
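A minimal sketch of the prior-to-posterior update is the conjugate beta-binomial model, which fits the historical-control example above; all numbers here are hypothetical.

```python
def beta_binomial_posterior(prior_a, prior_b, tumors, n):
    """Conjugate Bayesian update for a tumor-incidence proportion.
    Prior Beta(a, b) combined with binomial data (k tumors in n animals)
    gives posterior Beta(a + k, b + n - k)."""
    a = prior_a + tumors
    b = prior_b + (n - tumors)
    posterior_mean = a / (a + b)
    return a, b, posterior_mean

# Hypothetical: historical controls suggest roughly 10% incidence,
# encoded as a Beta(2, 18) prior; the current study sees 8 tumors in 50 animals.
a, b, m = beta_binomial_posterior(2, 18, tumors=8, n=50)
print(a, b, round(m, 3))  # → 10 60 0.143
```

The posterior mean (≈0.14) falls between the prior mean (0.10) and the observed rate (0.16), showing concretely how the choice of prior pulls the estimate and why it must be justified.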
Interpretation
Statistical analyses should appropriately address the stated research questions and hypotheses. There is value in providing an extensive review of the literature, providing speculations about possible outcomes not seen in the current study, and arguing for the relevance of future work. However, statistics presented in the manuscript should provide evidence for the interpretations made by the authors. Along these lines, it is important for reviewers to distinguish between association and causation. Identifying an association (or correlation) between two variables is extremely important to scientific research. However, it is important for reviewers to remember that an association between two variables could be affected by one or more additional, and possibly unmeasured, variables unless the effects of other potentially influential variables have been taken into account in the analysis. Analysis of covariance can be used to test for differences in a continuous dependent variable due to a categorical independent variable while controlling for the effects of another continuous variable. In addition, the technique of multiple regression can be used to separate the effects of one variable from another variable in order to increase confidence in interpretations. Nevertheless, adjusting for the effect of other known variables cannot change the nature of an association into causation. Causation means that changes in one variable directly cause changes in another variable. Hill (1965) lists the nine criteria needed to make causal inferences in a series of properly designed studies.
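To illustrate covariate adjustment by multiple regression, the sketch below fits an ordinary least squares model with a treatment indicator and a body-weight covariate via the normal equations. The data are hypothetical and constructed to be exactly linear so the coefficients are recoverable by eye; real analyses would use established statistical software.

```python
def ols(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y,
    solved with Gaussian elimination; each row of X is [1, x1, x2, ...]."""
    k, n = len(X[0]), len(X)
    # Augmented normal-equation matrix [X'X | X'y]
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(k)]
    for col in range(k):  # forward elimination with partial pivoting
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            A[r] = [a - f * p for a, p in zip(A[r], A[col])]
    b = [0.0] * k
    for i in reversed(range(k)):  # back substitution
        b[i] = (A[i][k] - sum(A[i][j] * b[j] for j in range(i + 1, k))) / A[i][i]
    return b

# Hypothetical outcome depending on treatment (0/1) and body weight (g):
# y was generated as exactly 1 + 2*treat + 0.2*weight.
weight = [20, 22, 25, 21, 24, 26]
treat = [0, 0, 0, 1, 1, 1]
y = [5.0, 5.4, 6.0, 7.2, 7.8, 8.2]
X = [[1, t, w] for t, w in zip(treat, weight)]
intercept, treat_effect, weight_effect = ols(X, y)
print(round(intercept, 2), round(treat_effect, 2), round(weight_effect, 2))  # → 1.0 2.0 0.2
```

The fitted treatment effect (2.0) is the group difference after removing the part of the outcome explained by body weight, which is exactly the kind of adjustment the paragraph above describes; it still cannot, by itself, establish causation.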
While p values provide valuable assessments, they should not be relied on too heavily in toxicologic pathology. The data are statistically significant at the α level if the p value is as small as or smaller than α. Nevertheless, statistical significance is not identical to scientific or biological importance (Altman et al. 1983). A large effect, even if real, may yield a nonsignificant result if the sample size is too small. On the other hand, a very large sample size can easily lead to a very small p value for a small effect that is not biologically relevant. Due to the risk of false positives (Type I errors), reviewers should constantly keep in mind that a statistically significant result does not necessarily indicate a biologically meaningful effect even when the statistical testing was done appropriately. A p value of .05 will be obtained by chance once in 20 experiments of equal size, on average, and an appropriately powered study (typically 80% power) will not be able to detect a real effect in 20% of identical experiments. Numerical p values can help the reader interpret the study results much better than simply stating that the result was significant or not. The p values of .04 and .06 should lead to similar interpretations rather than the radically different ones of “significant” and “not significant,” respectively (Altman et al. 1983). A similar line of reasoning argues for presenting estimates of uncertainty along with parameter estimates. For example, means should be accompanied by standard deviations, standard errors of the mean, or confidence intervals to aid in the interpretation of results.
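A mean reported with a confidence interval, as recommended above, can be computed with a simple normal-approximation sketch (a t interval is preferable for small samples); the measurements below are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_ci(data, conf=0.95):
    """Normal-approximation confidence interval for the mean:
    mean +/- z_{1-(1-conf)/2} * SEM, where SEM = s / sqrt(n)."""
    m = mean(data)
    sem = stdev(data) / sqrt(len(data))  # standard error of the mean
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return m - z * sem, m + z * sem

# Hypothetical replicate measurements from one group
lo, hi = mean_ci([4.1, 3.8, 4.5, 4.0, 3.9, 4.2, 4.4, 3.7])
print(round(lo, 2), round(hi, 2))  # → 3.88 4.27
```

Reporting the interval (3.88, 4.27) conveys both the estimate and its uncertainty in one statement, which is far more informative to the reader than a bare mean or a lone significance label.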
Conclusions
The statistical elements of a paper should be transparent and reproducible. The design and statistical analyses should be appropriate for the research questions and hypotheses that were posed. The points discussed here are by no means a complete review of statistical issues that can be encountered when reviewing a research study. Nevertheless, reviewers of papers in Toxicologic Pathology should ensure that statistical considerations in all phases of a manuscript are adequately addressed. If in doubt about the statistical procedures adopted by the authors, reviewers should report their reservations to the editor/associate editor and request to have a statistician review any aspect of the paper (whether design, statistical analysis, or interpretation).
Supplemental Material
Supplemental Material, DS1_TPX_10.1177_0192623318785097, for Statistical Guidance for Reviewers of Toxicologic Pathology by Keith R. Shockley and Grace E. Kissling in Toxicologic Pathology.
Acknowledgment
We would like to thank Dr. Michelle Cora (Cellular and Molecular Pathology Branch, NIEHS), Dr. Gregg Dinse (Social & Scientific Systems), and Dr. David Malarkey (Cellular and Molecular Pathology Branch, NIEHS) for reviewing the manuscript and providing helpful suggestions.
Author Contributions
All authors (KS, GK) contributed to conception or design; data acquisition, analysis, or interpretation; drafting the manuscript; and critically revising the manuscript. All authors gave final approval, and agreed to be accountable for all aspects of work in ensuring that questions relating to the accuracy or integrity of any part of the work are appropriately investigated and resolved.
Declaration of Conflicting Interests
The author(s) declared no potential real or perceived conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported (in part) by the Intramural Research Program of the National Institutes of Health, National Institute of Environmental Health Sciences (NIEHS).
