Abstract
Flaws in experimental statistics are a major contributor to the poor reproducibility of animal experiments. Informed decisions about whether conclusions are justified require clear reporting of experimental data and the statistical methods used to analyse them. When data are misinterpreted, manipulated or concealed to generate publications, this creates an illusion that chance observations are robust data which confirm the hypotheses presented. Attempts to reproduce and advance such observations can propagate large areas of irreproducible science. This hinders scientific progress, erodes public support for research, damages reputations and wastes resources. This review analyses and explains recommendations to improve use and reporting of statistics in animal experiments.
Introduction
Immensely laborious calculations on inferior data may increase the yield from 95 to 100 per cent. A gain of 5 per cent, of perhaps a small total. A competent overhauling of the process of collection, or of the experimental design, may often increase the yield ten or twelve-fold for the same cost in time and labour. To consult the statistician after an experiment is finished is often merely to ask him to conduct a post-mortem examination. He can perhaps say what the experiment died of.
Ronald Fisher – Presidential Address to the First Indian Statistical Congress, 1938
Reproducibility means being able to duplicate results using different personnel, materials and methods. It overlaps with repeatability, which means duplicating results using the same personnel, materials and methods. Low statistical power or poor analysis was ranked third behind selective reporting and pressure to publish as factors which contribute to irreproducible science. 1 The Australian National Health and Medical Research Council became sufficiently concerned about reproducibility in animal studies to release a document entitled ‘Best practice methodology in the use of animals for scientific purposes’. 2 It identified experimental statistics as one of the four most common categories of flaws in animal studies. Statistical problems include failure to determine statistical power,3,4 biologically relevant effect size, 5 appropriate statistical significance level, appropriate sample size, 6 failure to apply correct statistical tests for analysis of data and lack of appropriate training in statistics to design meaningful experiments.7,8
Industry has observed similar problems with reproducibility. Early-stage venture capitalists assume more than half of academic research will not be reproducible. 9 Studies by pharmaceutical companies indicate more than 75% irreproducibility. 10 Less than 1% of published cancer biomarkers enter clinical practice. Reasons include poor methods, inappropriate statistics and incomplete and selective reporting.11–13 The Reproducibility Project: Cancer Biology was established to provide evidence about the reproducibility of preclinical research in cancer biology by repeating selected experiments from high-impact papers. A recent study used seven criteria to assess the reproducibility of 158 effects in a selection of 23 papers reporting the results of preclinical research in cancer biology. 14 Original positive results were half as likely to be repeated successfully as original null results (40% vs. 80%). For original positive effects that were reported as numerical values, the median effect size for repeated experiments was 85% smaller than the median of the original effect sizes (0.43 vs. 2.96). Animal experiments with positive effects were less reproducible than non-animal experiments with positive effects, on every criterion. For example, 12% of replication effects for animal experiments were statistically significant and in the same direction as the original study, compared with 54% for non-animal experiments.
The international biotech company Amgen found only six out of 53 (11%) landmark preclinical cancer studies published in high impact factor journals over a 10-year period were reproducible. 15 Landmark was defined as something completely new, such as fresh approaches to targeting cancers or alternative clinical uses for existing therapeutics. Impact factor is a measure of the frequency with which the ‘average article’ in a journal has been cited in a particular year or period. 16 Of these 53 publications, 21 (∼40%) were in a journal with an impact factor of 20 or greater, while the remaining 32 (∼60%) were in a journal with an impact factor between 5 and 19. Alarmingly, non-reproducible articles had higher mean numbers of citations than the reproducible articles, so they were more widely followed. This may be because the science in the non-reproducible articles was compromised to make it appear more impressive.17,18 Inappropriate use of statistics was identified as one of the six common factors in the irreproducible studies. 19 Using an estimate that 50% of preclinical life sciences research is not reproducible, one study found this wastes $28 billion per annum in the USA alone. 20
Reporting of methods and statistics is measurably improved by using checklists.21–23 Various checklists are available for planning and reporting animal studies. Most checklists advocate that publications should provide information on the methods used to reduce bias such as randomisation and blinding. Other recommendations include specifying the total numbers of animals included in the statistical analyses with power and sample size calculations, 24 describing each statistical test used plus the unit of analysis and providing an explanation for why any data were excluded, 25 and considering pilot studies, statistical power and significance levels. 26 The ARRIVE guidelines v2.0 provide the most explicit statistical advice. 27 Item 7 (statistical methods) states: (a) Provide details of the statistical methods used for each analysis, including software used; and (b) describe any methods used to assess whether the data met the assumptions of the statistical approach, and what was done if the assumptions were not met. Item 10 (results) states: For each experiment conducted, including independent replications, report: (a) summary/descriptive statistics for each experimental group, with a measure of variability where applicable (e.g. mean and SD, or median and range); and (b) if applicable, the effect size with a confidence interval. Further explanation and elaboration were published for these guidelines. 28
The current review provides additional clarification on how to improve use and reporting of statistics in animal experiments, explains how to best achieve some recommendations such as reporting summary/descriptive statistics for each experimental group 27 and discourages other recommendations such as performing a power analysis after completion of the experiment to check whether the power of the experiment was sufficient to draw any conclusions. 24
Statistical significance is not a definitive threshold
The way statistics are used is often determined by the goal of the research. Academic research is searching for something ‘interesting’, which is a statistically significant result. The greater the statistical significance, the more publishable it is. By contrast, industry is seeking something ‘feasible’, which is a marketable result. Results must be robust to withstand clinical trials and commercialisation. 29
The standard measure of statistical significance is the p-value. This is the probability of obtaining a difference at least as large as the one observed purely by chance, assuming there is no difference. The lower the p-value, the greater the statistical significance.
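The definition above can be made concrete with an exact permutation test. The following is a minimal sketch rather than a recommended analysis: the function name and the body weights are hypothetical, and it assumes that group labels are exchangeable when there is no difference between groups.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation p-value for a difference in group means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    observed = abs(mean(group_a) - mean(group_b))
    total = sum(pooled)
    as_extreme = 0
    relabellings = 0
    # Enumerate every way of relabelling n_a of the pooled values as group A.
    for indices in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in indices)
        difference = abs(sum_a / n_a - (total - sum_a) / n_b)
        # Small tolerance guards against floating-point rounding.
        if difference >= observed - 1e-9:
            as_extreme += 1
        relabellings += 1
    return as_extreme / relabellings


# Hypothetical body weights (g) for two groups of four animals.
control = [20.1, 21.3, 19.8, 20.6]
treated = [22.4, 23.0, 21.9, 22.7]
p = permutation_p_value(control, treated)
print(f"p = {p:.4f}")
```

Because every treated value exceeds every control value in this invented example, only two of the 70 possible relabellings give a difference at least as large as the observed one, so p = 2/70.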
The current rigid threshold for statistical significance (p <0.05) was largely created by non-statisticians. 30 When the statistician Ronald Fisher introduced the p-value in the 1920s, he intended it as an informal assessment of whether results were worthy of further investigation and as one part of a fluid, largely non-numerical, intuitive process. In contrast, the rigorous, rules-based system of Jerzy Neyman and Egon Pearson specifically omitted p-values. It included statistical power, false positives and false negatives and was intended for repeated experiments. Modern statistical hypothesis testing is a hybrid of these competing systems. Although it aimed to produce an objective, evidence-based, decision-making system for working scientists, it was developed without a thorough understanding of either approach. Once a p-value <0.05 was defined as ‘statistically significant’, it began to transform from an indicator into a target.
Publications will often describe p-values >0.05 with inventive terms such as ‘approaching significance’, but they never refer to values slightly <0.05 as approaching insignificance. 31 Other creative descriptions of p-values >0.05 include ‘barely fails to attain statistical significance at conventional levels’, ‘not absolutely significant but very probably so’, ‘suggestive of a significant trend’ and ‘weakened significance’.32,33
Consequently, many statisticians argue against using p-value thresholds for hypothesis testing.34–39 p-values should be part of the analysis of results, with the understanding that a low p-value may have many causes. It may be that the alternative hypothesis proposed is true, or it may mean one of many other alternative hypotheses is true. Alternatively, a low p-value may be due to a large sample size, or the experiment may have been conducted on an unrepresentative sample. Violations of statistical assumptions which may produce low p-values include introducing bias through inadequate randomisation and blinding, inaccurate measurements, incorrect replication and flaws in statistical methods. 40
Other causes of low p-values are more insidious. 41 Multiplicity means multiple comparisons and includes testing many hypotheses in one experiment and testing one hypothesis many times or in multiple ways in one or more studies. It includes p-hacking, which refers to applying multiple statistical analyses and sub-analyses until uncovering a statistically significant result without clearly reporting how it was obtained. 42 It is also known as data dredging, snooping, fishing, significance chasing and double dipping. Coupled with incomplete or selective reporting, it virtually guarantees impressive but irreproducible results for publication.
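The effect of multiplicity on false positives can be illustrated with a short calculation. This sketch assumes that all null hypotheses are true and that the tests are independent, which real sub-analyses of one data set rarely are; the function name is hypothetical.

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of at least one false positive among independent tests,
    assuming every null hypothesis is true."""
    return 1 - (1 - alpha) ** n_tests


for n_tests in (1, 5, 20):
    rate = familywise_error_rate(0.05, n_tests)
    print(f"{n_tests:>2} tests at alpha = 0.05: {rate:.3f}")
```

Under these assumptions, the chance of at least one 'significant' result rises from 5% for a single test to roughly 23% for five tests and 64% for 20 tests.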
Effect size emphasises biological significance
Effect size is the direction and magnitude of quantitative findings. It indicates the biological significance of findings and is independent of sample size. This is in contrast to the p-value, where a statistically significant difference will inevitably be detected with a large enough sample size unless there is no difference at all. 43
Effect size can be expressed as an absolute or a calculated difference. The absolute effect size is the raw difference between group means. This is useful for variables with intrinsic meaning such as body weight, blood pressure, milk or wool production. Calculated indices are useful when the measurements have no intrinsic meaning, such as numbers on a subjective scale or when studies used different scales so that no direct comparison is possible. These can be divided into the d family and the r family. 5 The d family of effect sizes measures differences between groups (e.g. calculated in units of standard deviation or ratios such as odds ratio or risk ratio). The r family of effect sizes measures the strength of association between variables (e.g. correlation coefficients or the proportion of variance attributed to an effect versus the total variance). 44
The observed values of test statistics such as F and t are also effect sizes. F is the variance due to the treatment divided by the variance due to error, whereas t is the mean difference divided by its standard error. 45
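As an illustration of the quantities described above, the sketch below computes an absolute difference, a d-family index (Cohen's d, using a pooled standard deviation) and a t-value for two independent groups. The function names and data are hypothetical, and equal variances in both groups are assumed.

```python
from math import sqrt
from statistics import mean, variance


def pooled_variance(group_a, group_b):
    """Pooled sample variance of two independent groups (equal variances assumed)."""
    n_a, n_b = len(group_a), len(group_b)
    return ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)


def absolute_difference(group_a, group_b):
    """Raw difference between group means, in the units of measurement."""
    return mean(group_b) - mean(group_a)


def cohens_d(group_a, group_b):
    """Mean difference expressed in units of the pooled standard deviation."""
    return absolute_difference(group_a, group_b) / sqrt(pooled_variance(group_a, group_b))


def t_value(group_a, group_b):
    """Mean difference divided by its standard error."""
    n_a, n_b = len(group_a), len(group_b)
    standard_error = sqrt(pooled_variance(group_a, group_b) * (1 / n_a + 1 / n_b))
    return absolute_difference(group_a, group_b) / standard_error


# Hypothetical measurements for a control and a treated group.
control = [1.0, 2.0, 3.0]
treated = [3.0, 4.0, 5.0]
print(absolute_difference(control, treated), cohens_d(control, treated), t_value(control, treated))
```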
One of the main advantages of using effect size is that different effect size estimates from repeated studies can be combined to give an overall best estimate of the effect size. 46
Confidence intervals reduce misunderstanding by highlighting uncertainty
Calculating confidence intervals (CI) for effect sizes places greater emphasis on the interpretation of the biological or clinical significance of results rather than just their statistical significance.5,34
CIs show how rough or uncertain an estimate is. 47 A CI can be thought of as the set of true but unknown differences that are statistically compatible with the observed difference. 48 As the CI narrows, the estimate becomes more precise. CIs are usually calculated as 90%, 95% or 99% but can be any percentage of interest between 0% and 100%. In the case of a 95% CI, one would expect that 95 out of 100 CIs obtained from equal-sized samples drawn from the same population will contain the population parameter. Hence, the width of a CI calculated for the same estimate automatically increases from 90% to 95% to 99%.
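The widening of the interval with the confidence level can be shown with a short calculation. This is a sketch under a stated simplification: it uses a normal quantile, which is an approximation suited to large samples, whereas small samples call for the corresponding t quantile and give wider intervals. The function name and data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_ci(sample, level=0.95):
    """Approximate CI for a mean using a normal quantile (large-sample assumption)."""
    quantile = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = quantile * stdev(sample) / sqrt(len(sample))
    centre = mean(sample)
    return centre - half_width, centre + half_width


# Hypothetical measurements from one experimental group.
sample = [24.1, 25.3, 23.8, 26.0, 24.7, 25.5, 24.9, 25.1]
for level in (0.90, 0.95, 0.99):
    lower, upper = mean_ci(sample, level)
    print(f"{level:.0%} CI: {lower:.2f} to {upper:.2f} (width {upper - lower:.2f})")
```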
CIs are an example of inferential error bars, which give information about what conclusions are justified. The length of inferential error bars is proportional to how much uncertainty there is in the data. Another example of inferential error bars is standard error of the mean (SEM). SEM is an estimate of the standard deviation of mean values obtained from multiple samples of the same size from the same population. Simple CI bars for metric variables are SEM bars adjusted for sample size (n) to allow comparison between samples of uneven size. The ratio of CI to standard error is the appropriate quantile of the t distribution for that n, and changes with n. As the sample size increases, the size of SEM and CI bars decreases, but the standard deviation of the sample remains constant.49,50
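The relationship between sample size and SEM can be sketched numerically. The example below holds the sample standard deviation fixed at an arbitrary, hypothetical value purely for illustration; in real data the sample standard deviation fluctuates around the population value rather than staying exactly constant.

```python
from math import sqrt


def standard_error(sd, n):
    """Standard error of the mean for a sample of size n with standard deviation sd."""
    return sd / sqrt(n)


sd = 10.0  # hypothetical standard deviation, held fixed for illustration
sems = {n: standard_error(sd, n) for n in (4, 16, 64)}
for n, sem in sems.items():
    print(f"n = {n:>2}: SD = {sd}, SEM = {sem}")
```

Each four-fold increase in n halves the SEM, while the standard deviation describing the spread of individual values does not shrink.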
Presenting CI graphically highlights uncertainty and discourages rigid interpretations based on p-value thresholds. Note that error bars and summary statistics (e.g. mean and standard deviation) should only be shown for independently repeated experiments and never for the technical replicates used within a single experiment to conduct internal quality checks. 51 This is because inferences can only be made about the population from which independent samples were drawn unless correlation is accounted for (e.g. mixed models). Since technical replicates are not independent, they only indicate the fidelity with which replicates were created and cannot provide evidence of the reproducibility of the main results. 52 Misinterpreting data from experiments which were not independently repeated and disguising inappropriate data selection with terms such as ‘typical or representative result’ are major causes of poor reproducibility. 19
Display data in raw form
In contrast to inferential error bars, descriptive error bars show the spread of data (e.g. range and standard deviation), and the type of error bar on any graph must be specified. Descriptive error bars are commonly used in dynamite plots, which are bar charts displaying mean and standard deviation. Dynamite plots are so named because they look like a dynamite plunger. They are never appropriate because they conceal data such as the number of data points and their distribution. 53 For example, a dynamite plot with two data points may look the same as one with 50 data points provided the mean and standard deviation are the same. Similarly, a single outlier in one data set may be sufficient to make the dynamite plot look the same as another data set with two distinctive clusters of data points. In each case, the comparison is misleading.
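The first example above can be reproduced with invented numbers. In this sketch, both hypothetical data sets would produce the same bar height and the same error bar, although one contains two values and the other contains 50 values in two distinct clusters.

```python
from statistics import mean, stdev

# Hypothetical data sets constructed to share the same mean and standard deviation.
two_points = [4.0, 6.0]
fifty_points = [3.6] * 25 + [6.4] * 25

for label, data in (("two points", two_points), ("fifty points", fifty_points)):
    print(f"{label}: n = {len(data)}, mean = {mean(data):.3f}, SD = {stdev(data):.3f}")
```

A dot plot of the same data would reveal both the difference in sample size and the clustering.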
Consequently, small samples should be presented as dot plots showing all individual values with a point or line representing the mean. 54 A box plot or violin plot should be used if there are too many data points to plot individually without congestion. Box and violin plots show the median value, interquartile range (which captures the middle half of the data between the first and third quartile) and the upper and lower adjacent values. (These are dependent on the software used. One definition is the values in the data that are furthest away from the median on either side of the box but are still within a distance of 1.5 times the interquartile range from the nearest end of the box.) Outliers or outside points are beyond the adjacent values. In addition, a violin plot shows the entire distribution of the data. 55
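The quantities shown in a box plot can be computed directly. The sketch below follows the definition of adjacent values given above (1.5 times the interquartile range); as noted, plotting software may use different conventions, and quartile estimates in particular depend on the method used. The function name and data are hypothetical.

```python
from statistics import quantiles


def box_plot_summary(values, whisker=1.5):
    """Quartiles, adjacent values and outside points as drawn in a box plot."""
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4)  # default 'exclusive' quartile method
    iqr = q3 - q1
    lower_limit = q1 - whisker * iqr
    upper_limit = q3 + whisker * iqr
    inside = [x for x in data if lower_limit <= x <= upper_limit]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "lower_adjacent": inside[0],   # most extreme value still within the lower limit
        "upper_adjacent": inside[-1],  # most extreme value still within the upper limit
        "outside": [x for x in data if x < lower_limit or x > upper_limit],
    }


# Hypothetical measurements containing one extreme value.
summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(summary)
```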
Clearly distinguish between experimental and observational units
The experimental unit is the entity that is randomly and independently assigned to the treatment conditions (e.g. person, animal, litter, cage, holding pen, fish tank, culture dish or a well in a microtiter plate). 56 It is the smallest unit which can receive different treatments and should be the unit of statistical analysis. Experimental interventions must be applied to each experimental unit independently to average out treatment errors across the experimental groups. Treatments should not spill over or affect adjacent experimental units, and experimental units should not influence each other, especially on the outcomes of interest. Independence of experimental units is assumed in p-value calculations.
The observational unit is the entity on which measurements are made. It is the smallest unit that will be measured or observed in an experiment. It may also be called the measurement or sampling unit.
Conclusions only apply to the population which experimental units were drawn from. 51 The population is the set of all individuals or experimental units of interest. Hence, misidentification of units means the intended hypotheses will not be tested and results cannot be extrapolated to the population of interest.
Sample size (lowercase n) is the number of experimental units in a treatment group. 57 Capital N is the total number of experimental units that are used in the whole experiment. 45 Hence, the sample size is only the number of animals per treatment group if treatments are randomly and independently applied to individual animals. Otherwise, the experimental unit, and hence the unit of statistical analysis, should be the entity which was randomly and independently assigned to the treatment conditions such as the litter, cage, holding pen, fish tank, pot, pasture or plot.
If individual animals are counted towards sample size when they are not the experimental unit, this is an example of pseudoreplication. Pseudoreplication is the use of inferential statistics to test for treatment effects using data from experiments where either treatments are not replicated (though samples may be) or replicates are not statistically independent. 58 This confounds treatment effects with random error (e.g. biological variation and measurement error), artificially inflates sample sizes and therefore reduces p-values. It violates the principle of independence, resulting in an invalid analysis unless correlation is accounted for (e.g. mixed models) and irreproducible data. 54
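One simple way to avoid pseudoreplication is to summarise the observations within each experimental unit before analysis. The sketch below assumes a hypothetical experiment in which treatments were allocated to cages, so the cage is the experimental unit; the cage labels and values are invented. Mixed models, mentioned above, are an alternative that retains the individual observations.

```python
from statistics import mean

# Hypothetical data: each treatment was applied to whole cages of three mice.
measurements = {
    "control": {"cage1": [10.2, 9.8, 10.5], "cage2": [11.0, 10.7, 10.9]},
    "treated": {"cage3": [12.1, 11.8, 12.4], "cage4": [11.5, 11.9, 11.6]},
}


def unit_means(cages):
    """One value per experimental unit: the mean of the animals in each cage."""
    return [mean(values) for values in cages.values()]


for treatment, cages in measurements.items():
    animals = sum(len(values) for values in cages.values())
    per_cage = unit_means(cages)
    # The sample size is the number of cages, not the number of animals.
    print(f"{treatment}: animals measured = {animals}, n = {len(per_cage)}")
```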
Clearly distinguish between observational studies, exploratory experimental studies and confirmatory experimental studies
The pyramid of research design is a hierarchy which represents escalating study rigour.59,60 Observational (non-randomised) studies are on the lower levels of the pyramid. They aim to identify associations between predictors and outcomes. Examples include cohort, cross-sectional and case-control studies, which are used to investigate incidence, prevalence and aetiology. 61 Observational studies are typically less expensive and quicker than randomised studies and reduce welfare risks to study participants because no interventions are carried out by investigators (e.g. subjects are exposed to risk factors naturally). Sample size depends on the number of predictors and the assumed underlying probability distribution of the outcome. Subjects are often selected on convenience or availability. This increases the likelihood of selection bias and confounding.62,63 Confounding arises when subjects in different groups are fundamentally different with regard to the outcome of interest. This may bias estimates of association. The design and analysis of observational studies should control for confounding by matching subjects between groups on the basis of confounders, arranging and analysing subjects within groups on the basis of levels of confounders (stratification) and using multivariate techniques or generalised linear models of regression to adjust for multiple confounding factors. 64
Experimental (randomised) studies aim to establish a causal effect between treatment factors and outcomes. Subjects are randomly allocated to treatment and control groups. Investigators collecting and analysing data should be blinded to which groups are treatment and which are control. This is most important when there is any subjective element in assessing the results. Random allocation and blinding reduce bias and strengthen the evidence obtained. Experimental studies may be exploratory or confirmatory. 65
Exploratory experimental studies are designed to make new discoveries, develop methods and generate new hypotheses. 66 They tend to be small and flexible, so the sequence and specific design of experiments may be inexact. For example, sample size may be estimated based on feasibility and constraints rather than power analyses. 67 Where hypotheses are imprecise or non-existent at commencement, analysis should be limited to descriptive statistics only. The emphasis should be on high sensitivity to detect all strategies that might be useful (i.e. minimise false negatives). Consequently, many different strategies are often tested in parallel, which may necessitate small sample sizes. Under such circumstances, some experiments will produce large effects and statistically significant results due to random variation alone. 14
A danger of exploratory studies is the temptation to misrepresent hypotheses which evolve over the course of sequential experiments as a priori hypotheses and selectively support them with spurious results for publication. This increases the risk of false-positive findings entering the scientific literature.
Confirmatory experimental studies use clear, a priori hypotheses, rigid, pre-specified designs, sample sizes calculated on power and well-established methods, often over prolonged durations, to provide strong, new evidence. An example is a randomised clinical trial to determine efficacy. The emphasis should be on high specificity to eliminate false positives which progressed through exploratory studies. Hence, sample sizes should be sufficiently large to minimise the effects of random variation. Analysis of confirmatory experimental studies should include both descriptive and inferential statistics (e.g. analysis of variance (ANOVA) families) to generalise the findings from the sample to the underlying population.
All experimental studies should specify whether they are exploratory or confirmatory, including all hypotheses tested and when they were formulated.
Use powerful experimental designs
A well-designed experiment avoids bias (systematic error) and is sufficiently powerful to be able to detect effects likely to be of biological importance. Power is the probability of detecting a true effect of a given size as statistically significant at a given significance level with the specified number of subjects in each group. It indicates the signal (treatment) to noise (unexplained variability) ratio of the experiment. 45 In underpowered experiments, an effect or signal has to be larger than the true effect size to overcome the noise and detect treatment effects. 68 Common sources of unexplained variability in animal experiments include subclinical infections (which are the most common type in laboratory animals), genetic drift (which cannot be stopped) and variations in microbiome (with effects as diverse as drug responses, immune and nervous system function and development of disease models).69,70
One way to increase power is to increase sample size. Any signal will eventually become statistically significant with a large enough sample size. 71 Increasing sample size increases cost and logistical complexity and has ethical implications. Upscaling an experiment may impact consistency. For example, if a larger experiment requires more investigators over separate days, ‘investigator’ and ‘day’ become technical factors which could influence results and may need to be included as variables in the analysis. Similarly, if the increased duration accentuates the influence of circadian rhythms, time may need to be included as a covariate or blocking factor. These are all changes to the original design of the experiment that was used for the power calculation. 72
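The dependence of sample size on effect size and power can be sketched with a standard approximation for comparing two group means. The calculation below uses normal quantiles, which is an assumption made for simplicity; exact calculations based on the t distribution give slightly larger numbers for small samples. The function name and default values are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size_d, alpha=0.05, power=0.80):
    """Approximate experimental units per group for a two-sided comparison of
    two means, with the effect size in standard deviation units."""
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)
    z_power = standard_normal.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size_d) ** 2)


for d in (2.0, 1.0, 0.5):
    print(f"d = {d}: about {n_per_group(d)} experimental units per group")
```

Halving the effect size to be detected roughly quadruples the required sample size, which is why the biologically relevant effect size should be decided before the experiment.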
Simple options to increase power with fixed sample size include testing a specific hypothesis instead of a general one, using the minimum number of factor levels for continuous variables without reducing them to categorical variables based on a threshold such as the median (dichotomising or binning) and crossing factors rather than nesting them. 73
Factor arrangement refers to how factors relate to each other. When levels of one factor always co-occur with the same levels of another factor, they are completely confounded. An example is running all control animals on one day and all treated animals on another day. Hence, ‘treatment’ and ‘day’ are completely confounded.
A nested arrangement occurs when levels of one factor are grouped under the levels of another factor. If whole litters of mice are randomised to treatment groups, the factor ‘litter’ would be nested under the factor ‘treatment’. If there are large differences between litters, a nested arrangement could result in false positives. Furthermore, if litters are randomised to treatment groups rather than individual animals, litter should be the experimental unit, and hence average values per litter should be used in statistical analysis. This reduces sample size and reduces power. Hence, nested designs should be avoided but are common because they are convenient (e.g. housing littermates in the same cage).
A crossed or factorial arrangement occurs when all levels of one factor co-occur with all levels of another factor. If mouse pups from separate litters are spread across treatment groups (randomised to groups within litters), the treatment and litter factors would be crossed. In these examples, the difference in power between nested and crossed designs diminishes as variation between litters becomes smaller and/or variation within litters becomes greater. 74
A third approach to increase power is to reduce noise via improved experimental design.75,76
Completely randomised design
In a completely randomised design, experimental units are randomly assigned to treatments and controls. It is suitable when the experimental units are independent, interchangeable and homogeneous and there are sufficient experimental units available for the required power. This is the simplest experimental design to calculate power for, conduct the experiment and analyse the results. This design does not control for either differences or relatedness in experimental units, nor variation occurring during the course of the experiment. 77
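Allocation in a completely randomised design can be sketched as a single shuffle of all experimental units. The function name, animal identifiers and seed below are hypothetical; equal group sizes are assumed, and the seed is fixed only so that the allocation can be documented and reproduced.

```python
import random


def completely_randomised(unit_ids, treatments, seed=None):
    """Randomly assign experimental units to treatments with equal group sizes."""
    if len(unit_ids) % len(treatments) != 0:
        raise ValueError("number of units must be a multiple of the number of treatments")
    rng = random.Random(seed)
    shuffled = list(unit_ids)
    rng.shuffle(shuffled)
    per_group = len(shuffled) // len(treatments)
    return {
        treatment: sorted(shuffled[i * per_group:(i + 1) * per_group])
        for i, treatment in enumerate(treatments)
    }


# Hypothetical animal identifiers.
units = [f"A{i:02d}" for i in range(1, 13)]
allocation = completely_randomised(units, ["control", "treated"], seed=2024)
print(allocation)
```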
In factorial experiments, more than one type of independent variable or factor is varied at a time to determine which factors are influential. 78 Factors can be any variable which the investigator can control. This includes animal-related characteristics (e.g. sex, strain, age, diet and health status), environmental variables (e.g. cage and group size, bedding material and environmental complexity) and protocol-specific variables (e.g. reagents, dosages, operators, routes of administration, timing of observations and methods of measurement).
The simplest factorial is a 2 × 2 design. For example, male and female animals in each of two treatment groups (treatment and control). Hence, there are four groups of animals. All of the animals contribute to estimates of the effect of treatment and sex. If the response of the two sexes to treatment is different, it is called an interaction between sex and treatment. This multiple use of data is a key benefit of using factorial designs to maximise information on limited numbers of animals. 79 If the same experiment had four levels of treatment, it would be a 4 × 2 factorial experiment. When one group of animals is included for every combination of each factor, it is termed a full factorial design. However, as the number of factors increases, the total number of combinations becomes very large. In these cases, a fractional factorial design may be used with a subset of these combinations to maximise information on the main effects and interactions.
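The groups in a full factorial design are simply every combination of factor levels, which can be enumerated as follows. The function name, factor names and levels are hypothetical.

```python
from itertools import product


def full_factorial(factors):
    """List every combination of factor levels (one group per combination)."""
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]


two_by_two = full_factorial({"treatment": ["control", "treated"], "sex": ["male", "female"]})
four_by_two = full_factorial({"dose": [0, 1, 2, 3], "sex": ["male", "female"]})

for group in two_by_two:
    print(group)
print(len(two_by_two), len(four_by_two))
```

The number of groups is the product of the numbers of levels, which shows how quickly a full factorial grows as factors are added.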
Factorial experimental designs are more effective and efficient than varying one factor at a time, hence saving time, money and animals. The main extra cost is the increase in the complexity of the experiment, which could lead to mistakes, and the increased complexity of the statistical analysis (e.g. switching from one-way ANOVA to two-way or multi-way ANOVA). 80
Randomised block design
Factorial experimental designs can be used to show which controllable factors are important, but they cannot be easily used to investigate the many uncontrollable factors that can affect the results. 81 The effect of uncontrollable factors, or nuisance variables, can be explored with a randomised block design (RBD), even if they are unknown. Nuisance variables add heterogeneity to experimental units and probably affect the results but are of no interest to the investigator. 77 Examples include light, vibration, temperature, humidity, cage location, noise, time of day and changes in skill of investigators. Controlling for nuisance variables can reduce animal numbers by two to ten times while retaining similar power and significance. 82
An RBD splits the experiment into multiple mini experiments where a complete replicate of the basic experiment is conducted within each block. Any restriction in randomisation creates a block. Experimental units are randomly assigned to treatments within blocks, which ensures a balance of treatments across the variability within the blocks. Depending on the context, blocks can be considered as fixed or random effects. 83 Blocks are recombined in the final statistical analysis, and nuisance variables are removed as a block effect. 84
Treatment groups must be randomly intermingled within blocks and assessed blind in random order to avoid bias. Blinding will occur if subjects are only identified by their identification number once the treatments are given. Randomisation to treatment group (RTTG) is when treatment groups are pre-ordered within blocks and not randomly intermingled. RTTG is an invalid design because any environmental effect that differs between treatment groups may be mistaken for the effects of the treatment, leading to bias and irreproducible results. Hence, randomisation should be applied to allocation of experimental units, as well as the order of treatments and assessments. 77
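Randomisation within blocks can be sketched as a separate shuffle of the treatments for each block, followed by a random assessment order. The example assumes a complete block design in which each block holds exactly one experimental unit per treatment; the function names, block labels and animal identifiers are hypothetical.

```python
import random


def randomised_block(blocks, treatments, seed=None):
    """Within each block, assign every treatment to exactly one unit at random."""
    rng = random.Random(seed)
    allocation = {}
    for block, units in blocks.items():
        if len(units) != len(treatments):
            raise ValueError(f"block {block!r} needs one unit per treatment")
        order = list(treatments)
        rng.shuffle(order)  # independent randomisation for each block
        allocation[block] = dict(zip(units, order))
    return allocation


def assessment_order(allocation, seed=None):
    """Random order for assessing the units within each block."""
    rng = random.Random(seed)
    return {block: rng.sample(list(units), len(units)) for block, units in allocation.items()}


# Hypothetical blocks, e.g. litters or days, each with three animals.
blocks = {"block1": ["A01", "A02", "A03"], "block2": ["A04", "A05", "A06"]}
treatments = ["control", "low dose", "high dose"]
allocation = randomised_block(blocks, treatments, seed=7)
order = assessment_order(allocation, seed=11)
print(allocation)
print(order)
```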
A correctly implemented RBD controls for both variation among experimental subjects and variation caused during the course of the experiment by the research environment and in the assessment of results. This design reduces variability and confounding within treatments and improves the estimate of treatment effects, thereby increasing statistical power.
For a complete RBD, each block contains every treatment group. An incomplete RBD may be used when the blocks are not large enough to accommodate all treatments.
Factorial designs can be used within blocks, and blocks can account for natural groupings of experimental units. For example, separate blocks may contain siblings or litters of animals, animals from different suppliers, animals of different mean age or weight, animals housed in different cages or in different locations, animals fed a different batch of diet or animals used at a different time of the year. If these natural groups are not accounted for, results may be confounded by conditions which existed before the experiment commenced.
Split-plot experimental design
An example of a factorial design in a blocked experiment is a split-plot experiment. 85 These have two levels of experimental units, where the blocks themselves serve as experimental units for a subset of the factors. The blocks are referred to as whole plots, while the experimental units within blocks are called split plots, split units or subplots. For example, if control mice are co-housed with treatment mice in single-strain cages, strain (cage) is the whole plot, and treatment (mouse) is the split plot. Cage is the biggest factor in variability of rodent experiments because everything important to rodents clusters on cage (e.g. temperature, humidity, smell, light and vibration). By placing control and treatment animals of the same strain in the same cage, the cage becomes a block for these nuisance variables. Another example is groups of caged mice fed different diets per cage, while individual mice are injected with vitamins. In this case, diet (cage) is the whole plot and vitamins (mouse) are the split plot. Both whole plots and split plots need to be independently randomised.
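The two independent randomisation steps in the diet and vitamin example can be sketched as follows. The code is illustrative only: `randomise_split_plot`, the cage and animal identifiers and the factor levels are hypothetical names, and balanced allocation at both levels is assumed.

```python
import random

def randomise_split_plot(cages, whole_plot_levels, split_plot_levels, seed=None):
    """Two-stage randomisation for a split-plot design.

    cages: dict mapping cage ID -> list of animal IDs housed in that cage.
    whole_plot_levels are randomised to cages; split_plot_levels are
    randomised to animals within each cage, independently of the first stage.
    """
    rng = random.Random(seed)
    cage_ids = list(cages)
    if len(cage_ids) % len(whole_plot_levels) != 0:
        raise ValueError("cages must divide evenly among whole-plot levels")
    # Stage 1: randomise the whole-plot factor (e.g. diet) across cages
    cage_labels = list(whole_plot_levels) * (len(cage_ids) // len(whole_plot_levels))
    rng.shuffle(cage_labels)
    whole_plot = dict(zip(cage_ids, cage_labels))
    # Stage 2: randomise the split-plot factor (e.g. vitamin) within every cage
    plan = {}
    for cage, animals in cages.items():
        if len(animals) % len(split_plot_levels) != 0:
            raise ValueError(f"cage {cage!r} cannot hold balanced split plots")
        labels = list(split_plot_levels) * (len(animals) // len(split_plot_levels))
        rng.shuffle(labels)
        for animal, label in zip(animals, labels):
            plan[animal] = {
                "cage": cage,
                "whole_plot": whole_plot[cage],
                "split_plot": label,
            }
    return plan

# Hypothetical example: four cages of two mice; diet per cage, vitamin per mouse
cages = {f"cage{c}": [f"c{c}m{m}" for m in (1, 2)] for c in (1, 2, 3, 4)}
plan = randomise_split_plot(cages, ["diet_A", "diet_B"], ["vehicle", "vitamin"], seed=7)
for animal, row in plan.items():
    print(animal, row["cage"], row["whole_plot"], row["split_plot"])
```

Every animal in a cage shares the same whole-plot level, so the cage, not the mouse, is the experimental unit for diet, whereas the mouse is the experimental unit for the vitamin injection.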
In a split-plot experiment, factors which are more difficult to administer such as strain, sex or age should be assigned to whole plots, while factors which are easier to administer such as applied treatments should be assigned to split plots. Split plots have more experimental units and hence larger sample size and greater statistical power.
The order of application of treatment and control should be randomised and applied consistently to allow for effects such as stress and waiting.
One important thing to consider is that experimental units should not influence each other, especially on the outcomes of interest. 56 Co-housed animals may influence each other on many relevant variables, from behaviour to microbiomes (and hence anything influenced by an animal’s microbiome). This effect needs to be balanced against the benefits of blocking nuisance variables when co-housing control and treatment animals.
Blocking is an effective means of exploiting the benefits of both standardisation and heterogenisation.86–88 Within blocks of subjects, the experimental conditions can be standardised as rigorously as possible (e.g. use of same genotype, same age and same experimental context), so that any differences in response to the experimental treatments will most likely be attributable to the treatment. However, the blocks themselves can be heterogeneous and vary in one or several aspects. Large block effects are common, highlighting the importance of using concurrent controls and randomisation. 84 Differences between blocks indicate the external validity of results, or how reproducible results are under different conditions. In an RBD with replication over time and/or in different laboratories or by different staff, the extent to which the results are repeatable under different conditions will give a good indication of the robustness of the experiment. High external validity shows results are generalisable because they are not overwhelmed by noise and context-specific biological variation.89,90
As with factorial designs, the cost is that blocking adds complexity to the design and analysis of the experiment. Blocking is also less tolerant of missing observations than a completely randomised design and should not be done with small experiments due to loss of power (e.g. fewer than 16 experimental units).
Split-plot designs and some RBDs are commonly analysed with mixed models. Mixed models contain both fixed and random effects, which can explicitly account for correlation between repeated measurements on subjects (longitudinal data) and describe how the response of subjects changes over time. 91 Fixed effects are factors assumed to have the same effect across subjects (e.g. the effect of an intervention, although this assumption may not hold in some circumstances, such as when subjects differ genetically), while random effects are factors likely to vary substantially between subjects (e.g. baseline physiological functions). This gives flexibility to determine the effects of multiple factors. Common statistical methods such as linear regression models and repeated-measures ANOVA are less suited to repeated-measures data because they rely on invalid assumptions, such as independence of measurements and all effects being fixed. Another challenge which longitudinal data present for repeated-measures ANOVA arises when subjects have different numbers of available measurements (unbalanced data pattern) due to missing values. If a subject drops out of a longitudinal trial due to treatment-related adverse effects, the missing data are solely dependent on the observed outcomes and classified as missing at random (MAR). Conversely, if the missing data have nothing to do with the treatment effect and its outcome (e.g. the subject relocates), they are independent of both observed and unobserved measurements and classified as missing completely at random (MCAR). 92 Mixed models assume missing data are MAR and can use all available observations and subjects in the analysis. Repeated-measures ANOVA relies on the less likely and more problematic assumption that missing data are MCAR.
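As an illustration of the distinction between fixed and random effects, one of the simplest mixed models is a random-intercept model for repeated measurements. The notation below is introduced here for illustration and is not taken from the cited sources: y_ij is the j-th measurement on subject i, x_ij indicates the treatment, t_ij is the time of measurement, the beta terms are fixed effects shared by all subjects, and b_i is a subject-specific random effect.

```latex
y_{ij} = \beta_0 + \beta_1 x_{ij} + \beta_2 t_{ij} + b_i + \varepsilon_{ij},
\qquad b_i \sim \mathcal{N}(0, \sigma_b^2),
\quad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)

\operatorname{Corr}(y_{ij}, y_{ik}) = \frac{\sigma_b^2}{\sigma_b^2 + \sigma^2}
\quad (j \neq k)
```

Because every measurement on subject i shares the same b_i, measurements within a subject are correlated, which is the dependence that ordinary linear regression ignores. The model is written per observation, so a subject with fewer measurements simply contributes fewer rows rather than being excluded.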
Do not recalculate power to interpret non-significant results
Power is a pretrial calculation using a given effect size and should play no role once data are collected. 48 A statistically non-significant result guarantees the power was inadequate for detecting that effect size. Nevertheless, retrospective power calculations are often inappropriately used to interpret the validity of statistically non-significant results and are recommended by some guidelines. 24 For example, observed (or post hoc) power is the statistical power of the test calculated after the experiment using the observed effect size and p-value. Detectable effect size uses the observed variability to compute the minimum effect size which could be observed by the power calculated before the trial.93,94
Retrospective power calculations interpret higher power in an experiment with a non-significant result as stronger evidence for a true negative. However, power calculated retrospectively is uniquely related to the p-value obtained in that experiment. Observed power increases as the p-value decreases. Yet, a lower p-value gives stronger support for a false negative. This creates a contradiction, with higher retrospective power indicating a true negative and the lower calculated p-value indicating a false negative. This is called the power approach paradox. 95
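The one-to-one relationship between observed power and the p-value can be checked numerically. The sketch below assumes a two-sided z-test, chosen here only because it needs nothing beyond the Python standard library; the function name `observed_power` is illustrative.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def observed_power(p_value, alpha=0.05):
    """Observed (post hoc) power of a two-sided z-test, given only its p-value."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    # Absolute z statistic implied by the two-sided p-value
    z_obs = _STD_NORMAL.inv_cdf(1 - p_value / 2)
    # Probability of rejecting if the true standardised effect equalled z_obs
    return (1 - _STD_NORMAL.cdf(z_crit - z_obs)) + _STD_NORMAL.cdf(-z_crit - z_obs)

for p in (0.05, 0.10, 0.20, 0.50, 0.90):
    print(f"p = {p:.2f} -> observed power = {observed_power(p):.3f}")
```

No input other than the p-value and the significance level is needed, so observed power adds no information beyond the p-value itself. Under these assumptions, a result exactly at the threshold (p = 0.05) has an observed power of about 0.5, and any larger p-value gives a lower observed power.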
A statistically non-significant result should not be referred to as negative because it does not mean the population effect is definitely zero. It just means an effect was not found, so a population effect of zero cannot be ruled out. A more meaningful way to interpret non-significant results is to provide confidence intervals, observed effect sizes, exact p-values and test statistics. 45
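A minimal sketch of reporting an effect size with a confidence interval is given below. The measurements are hypothetical values invented for illustration, and the percentile bootstrap is used only because it requires nothing beyond the Python standard library; with samples this small it is a rough approximation, and an interval from the planned statistical model would normally be reported instead.

```python
import random
from statistics import mean, stdev

def cohens_d(a, b):
    """Standardised mean difference (b minus a) using the pooled SD."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(b) - mean(a)) / pooled_var ** 0.5

def bootstrap_ci(a, b, n_boot=5000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for mean(b) - mean(a)."""
    rng = random.Random(seed)
    # Resample each group with replacement and recompute the difference
    diffs = sorted(
        mean(rng.choices(b, k=len(b))) - mean(rng.choices(a, k=len(a)))
        for _ in range(n_boot)
    )
    lower = diffs[round((1 - level) / 2 * n_boot)]
    upper = diffs[round((1 + level) / 2 * n_boot) - 1]
    return lower, upper

# Hypothetical body weights (g) for six control and six treated animals
control = [21.3, 19.8, 22.1, 20.5, 21.0, 19.4]
treated = [21.0, 20.2, 22.5, 19.9, 21.8, 20.6]

difference = mean(treated) - mean(control)
ci_low, ci_high = bootstrap_ci(control, treated)
print(f"mean difference = {difference:.2f} g")
print(f"Cohen's d = {cohens_d(control, treated):.2f}")
print(f"95% bootstrap CI = ({ci_low:.2f}, {ci_high:.2f}) g")
```

Reported this way, the interval shows the range of effect sizes that remain compatible with the data, which is more informative than stating only that the result was not significant.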
Conclusions
The statistician George Box wrote, ‘In applying mathematics to subjects such as physics or statistics we make tentative assumptions about the real world which we know are false but which we believe may be useful nonetheless. The physicist knows that particles have mass and yet certain results, approximating what really happens, may be derived from the assumption that they do not. Equally, the statistician knows, for example, that in nature there never was a normal distribution, there never was a straight line, yet with normal and linear assumptions, known to be false, he can often derive results which match, to a useful approximation, those found in the real world’. 96
A statistical model is a mathematical simplification of data variability which contains many assumptions about how data were collected and analysed, and how the results were selected. 40
Misinterpretation and misrepresentation of data through flawed experimental statistics are a major cause of poor reproducibility.41,97–100 The following simple steps which summarise the preceding discussion would improve the reliability of published results:
Investigators should preferentially use powerful experimental designs such as RBDs and factorials to save time, money and animals. Procedures used for randomisation and blinding should be formalised and reported with results. All experimental studies should specify whether they are exploratory or confirmatory, including all hypotheses tested and when they were formulated. Data should be provided in raw form such as dot, box or violin plots. Dynamite plots are never appropriate because they conceal data. Experimental units should be the unit of statistical analysis, and they should be explicitly identified for each study and treatment group. Misidentification of the experimental unit leads to pseudoreplication, overestimation of the sample size and false-positive results. Conclusions should not be based only on whether a p-value passes a specific threshold. Unless p < 0.001, exact p-values should be provided to three decimal places. If p > 0.1, two decimal places is sufficient. Non-statistically significant results should not be referred to as negative, and statistical power calculated pre-trial should not be recalculated post-trial to interpret non-significant results. All results should be stated with effect size, confidence intervals and test statistics. Effect size helps readers understand the magnitude of differences found, whereas statistical significance examines whether the findings are likely to be due to chance. 43
Most importantly, investigators should consult a statistician at both the design and interpretation stages of experiments. No experiment should ever be started without a clear idea of how the resulting data will be analysed. 57 Improving the use and reporting of experimental statistics will better inform decisions made on the basis of published results, reduce wastage caused by irreproducible science and improve scientific progress.
Footnotes
Acknowledgements
The author thanks Corinne Alberthsen, Russ Bradford and Alison Small for helpful discussions while writing this manuscript.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
