Recommendations to improve use and reporting of statistics in animal experiments

Abstract

Flaws in experimental statistics are a major contributor to the poor reproducibility of animal experiments. Informed decisions about whether conclusions are justified require clear reporting of experimental data and the statistical methods used to analyse them. When data are misinterpreted, manipulated or concealed to generate publications, this creates an illusion that chance observations are robust data which confirm the hypotheses presented. Attempts to reproduce and advance such observations can propagate large areas of irreproducible science. This hinders scientific progress, erodes public support for research, damages reputations and wastes resources. This review analyses and explains recommendations to improve use and reporting of statistics in animal experiments.

Keywords

Block, confidence intervals, effect size, experimental unit, factorial, power, p-value, significance

Introduction

Immensely laborious calculations on inferior data may increase the yield from 95 to 100 per cent. A gain of 5 per cent, of perhaps a small total. A competent overhauling of the process of collection, or of the experimental design, may often increase the yield ten or twelve-fold for the same cost in time and labour. To consult the statistician after an experiment is finished is often merely to ask him to conduct a post-mortem examination. He can perhaps say what the experiment died of.

Ronald Fisher – Presidential Address to the First Indian Statistical Congress, 1938

Reproducibility means being able to duplicate results using different personnel, materials and methods. It overlaps with repeatability, which means duplicating results using the same personnel, materials and methods. Low statistical power or poor analysis was ranked third behind selective reporting and pressure to publish as factors which contribute to irreproducible science.¹ The Australian National Health and Medical Research Council became sufficiently concerned about reproducibility in animal studies to release a document entitled ‘Best practice methodology in the use of animals for scientific purposes’.² It identified experimental statistics as one of the four most common categories of flaws in animal studies. Statistical problems include failure to determine statistical power,^3,4 biologically relevant effect size,⁵ appropriate statistical significance level, appropriate sample size,⁶ failure to apply correct statistical tests for analysis of data and lack of appropriate training in statistics to design meaningful experiments.^7,8

Industry has observed similar problems with reproducibility. Early-stage venture capitalists assume more than half of academic research will not be reproducible.⁹ Studies by pharmaceutical companies indicate more than 75% irreproducibility.¹⁰ Less than 1% of published cancer biomarkers enter clinical practice. Reasons include poor methods, inappropriate statistics and incomplete and selective reporting.^11–13 The Reproducibility Project: Cancer Biology was established to provide evidence about the reproducibility of preclinical research in cancer biology by repeating selected experiments from high-impact papers. A recent study used seven criteria to assess the reproducibility of 158 effects in a selection of 23 papers reporting the results of preclinical research in cancer biology.¹⁴ Original positive results were half as likely to be repeated successfully as original null results (40% vs. 80%). For original positive effects that were reported as numerical values, the median effect size for repeated experiments was 85% smaller than the median of the original effect sizes (0.43 vs. 2.96). Animal experiments with positive effects were less reproducible than non-animal experiments with positive effects, on every criterion. For example, 12% of replication effects for animal experiments were statistically significant and in the same direction as the original study, compared with 54% for non-animal experiments.

The international biotech company Amgen found only six out of 53 (11%) landmark preclinical cancer studies published in high impact factor journals over a 10-year period were reproducible.¹⁵ Landmark was defined as something completely new, such as fresh approaches to targeting cancers or alternative clinical uses for existing therapeutics. Impact factor is a measure of the frequency with which the ‘average article’ in a journal has been cited in a particular year or period.¹⁶ Of these 53 publications, 21 (∼40%) were in a journal with an impact factor of 20 or greater, while the remaining 32 (∼60%) were in a journal with an impact factor between 5 and 19. Alarmingly, non-reproducible articles had higher mean numbers of citations than the reproducible articles, so they were more widely followed. This may be because the science in the non-reproducible articles was compromised to make it appear more impressive.^17,18 Inappropriate use of statistics was identified as one of the six common factors in the irreproducible studies.¹⁹ Using an estimate that 50% of preclinical life sciences research is not reproducible, one study found this wastes $28 billion per annum in the USA alone.²⁰

Reporting of methods and statistics is measurably improved by using checklists.^21–23 Various checklists are available for planning and reporting animal studies. Most checklists advocate that publications should provide information on the methods used to reduce bias such as randomisation and blinding. Other recommendations include specifying the total numbers of animals included in the statistical analyses with power and sample size calculations,²⁴ describing each statistical test used plus the unit of analysis and providing an explanation for why any data were excluded,²⁵ and considering pilot studies, statistical power and significance levels.²⁶ The ARRIVE guidelines v2.0 provide the most explicit statistical advice.²⁷ Item 7 (statistical methods) states: (a) Provide details of the statistical methods used for each analysis, including software used; and (b) describe any methods used to assess whether the data met the assumptions of the statistical approach, and what was done if the assumptions were not met. Item 10 (results) states: For each experiment conducted, including independent replications, report: (a) summary/descriptive statistics for each experimental group, with a measure of variability where applicable (e.g. mean and SD, or median and range); and (b) if applicable, the effect size with a confidence interval. Further explanation and elaboration were published for these guidelines.²⁸

The current review provides additional clarification on how to improve use and reporting of statistics in animal experiments, explains how to best achieve some recommendations such as reporting summary/descriptive statistics for each experimental group²⁷ and discourages other recommendations such as performing a power analysis after completion of the experiment to check whether the power of the experiment was sufficient to draw any conclusions.²⁴

Statistical significance is not a definitive threshold

The way statistics are used is often determined by the goal of the research. Academic research is searching for something ‘interesting’, which is a statistically significant result. The greater the statistical significance, the more publishable it is. By contrast, industry is seeking something ‘feasible’, which is a marketable result. Results must be robust to withstand clinical trials and commercialisation.²⁹

The standard measure of statistical significance is the p-value. This is the probability of obtaining, purely by chance, a difference at least as large as the one observed, assuming there is no true difference. The lower the p-value, the greater the statistical significance.
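The definition above can be made concrete with a permutation test, which estimates the p-value by repeatedly shuffling group labels. This is only a minimal sketch: the function name, the measurements and the choice of a permutation test are illustrative assumptions, not a method prescribed by this review.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Estimate a two-sided p-value for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_large = 0
    for _ in range(n_permutations):
        # Under "no difference", group labels are interchangeable.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_large += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least_as_large + 1) / (n_permutations + 1)

# Hypothetical measurements, invented for illustration only.
control = [20.1, 21.4, 19.8, 22.0, 20.7, 21.1]
treated = [22.9, 23.5, 21.8, 24.1, 22.4, 23.0]
p = permutation_p_value(control, treated)
print(f"estimated p-value: {p:.4f}")
```

The returned value is the share of label shufflings that produce a difference at least as large as the observed one, which is the probability the definition describes.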

The current rigid threshold for statistical significance (p <0.05) was largely created by non-statisticians.³⁰ When the statistician Ronald Fisher introduced the p-value in the 1920s, he intended it as an informal assessment of whether results were worthy of further investigation and as one part of a fluid, largely non-numerical, intuitive process. In contrast, the rigorous, rules-based system of Jerzy Neyman and Egon Pearson specifically omitted p-values. It included statistical power, false positives and false negatives and was intended for repeated experiments. Modern statistical hypothesis testing is a hybrid of these competing systems. Although it aimed to produce an objective, evidence-based, decision-making system for working scientists, it was developed without a thorough understanding of either approach. Once a p-value <0.05 was defined as ‘statistically significant’, it began to transform from an indicator into a target.

Publications will often describe p-values >0.05 with inventive terms such as ‘approaching significance’, but they never refer to values slightly <0.05 as approaching insignificance.³¹ Other creative descriptions of p-values >0.05 include ‘barely fails to attain statistical significance at conventional levels’, ‘not absolutely significant but very probably so’, ‘suggestive of a significant trend’ and ‘weakened significance’.^32,33

Consequently, many statisticians argue against using p-value thresholds for hypothesis testing.^34–39 p-values should be part of the analysis of results, with the understanding that a low p-value may have many causes. It may be that the alternative hypothesis proposed is true, or it may mean one of many other alternative hypotheses is true. Alternatively, a low p-value may be due to a large sample size, or the experiment may have been conducted on an unrepresentative sample. Violations of statistical assumptions which may produce low p-values include introducing bias through inadequate randomisation and blinding, inaccurate measurements, incorrect replication and flaws in statistical methods.⁴⁰

Other causes of low p-values are more insidious.⁴¹ Multiplicity means multiple comparisons and includes testing many hypotheses in one experiment and testing one hypothesis many times or in multiple ways in one or more studies. It includes p-hacking, which refers to applying multiple statistical analyses and sub-analyses until uncovering a statistically significant result without clearly reporting how it was obtained.⁴² It is also known as data dredging, snooping, fishing, significance chasing and double dipping. Coupled with incomplete or selective reporting, it virtually guarantees impressive but irreproducible results for publication.
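A short calculation shows why multiplicity is so effective at producing spurious significance. The sketch below assumes the tests are independent and that every null hypothesis is true; the function name is hypothetical.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests
    when no real effect exists (an assumption of this sketch)."""
    return 1 - (1 - alpha) ** n_tests

for n_tests in (1, 5, 20):
    print(n_tests, round(familywise_error_rate(n_tests), 3))
# 1 0.05
# 5 0.226
# 20 0.642
```

Under these assumptions, running 20 unplanned analyses at p <0.05 gives roughly a two-in-three chance of at least one "significant" result from noise alone.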

Effect size emphasises biological significance

Effect size is the direction and magnitude of quantitative findings. It indicates the biological significance of findings and is independent of sample size. This is in contrast to the p-value, where a statistically significant difference will inevitably be detected with a large enough sample size unless there is no difference at all.⁴³

Effect size can be expressed as an absolute or a calculated difference. The absolute effect size is the raw difference between group means. This is useful for variables with intrinsic meaning such as body weight, blood pressure, milk or wool production. Calculated indices are useful when the measurements have no intrinsic meaning, such as numbers on a subjective scale or when studies used different scales so that no direct comparison is possible. These can be divided into the d family and the r family.⁵ The d family of effect sizes measures differences between groups (e.g. calculated in units of standard deviation or ratios such as odds ratio or risk ratio). The r family of effect sizes measures the strength of association between variables (e.g. correlation coefficients or the proportion of variance attributed to an effect versus the total variance).⁴⁴
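As an illustration of the two ways of expressing a difference between groups, the sketch below computes the raw difference in means and one common d-family index, Cohen's d with a pooled standard deviation. The function names and the data are assumptions made for the example.

```python
from math import sqrt
from statistics import mean, variance

def raw_difference(treated, control):
    """Absolute effect size: difference between group means, in original units."""
    return mean(treated) - mean(control)

def cohens_d(treated, control):
    """Standardised effect size: mean difference in units of pooled SD."""
    n_t, n_c = len(treated), len(control)
    pooled_variance = (
        (n_t - 1) * variance(treated) + (n_c - 1) * variance(control)
    ) / (n_t + n_c - 2)
    return raw_difference(treated, control) / sqrt(pooled_variance)

# Hypothetical scores on an arbitrary scale.
control_scores = [1, 2, 3, 4, 5]
treated_scores = [3, 4, 5, 6, 7]
print(raw_difference(treated_scores, control_scores))        # difference in scale units
print(round(cohens_d(treated_scores, control_scores), 2))    # difference in SD units
```

The raw difference is directly interpretable only when the scale has intrinsic meaning, whereas the standardised value can be compared across studies that used different scales.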

The observed values of test statistics such as F and t are also effect sizes. F is the variance due to the treatment divided by the variance due to error, whereas t is the mean difference divided by its standard error.⁴⁵

One of the main advantages of using effect size is that different effect size estimates from repeated studies can be combined to give an overall best estimate of the effect size.⁴⁶
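One simple way such estimates are commonly combined is inverse-variance weighting, in which more precise studies receive more weight. The review does not specify a pooling method, so the fixed-effect approach, the function name and the numbers below are assumptions for illustration only.

```python
from math import sqrt

def pool_fixed_effect(estimates, standard_errors):
    """Inverse-variance weighted mean of effect sizes and its standard error."""
    weights = [1 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    pooled_se = sqrt(1 / total_weight)
    return pooled, pooled_se

# Hypothetical effect sizes from three repeated studies.
effects = [0.8, 0.5, 0.3]
ses = [0.4, 0.2, 0.1]
pooled_effect, pooled_se = pool_fixed_effect(effects, ses)
print(round(pooled_effect, 3), round(pooled_se, 3))
```

The pooled standard error is smaller than that of any single study, which is why combining repeated studies gives a better overall estimate than any one of them.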

Confidence intervals reduce misunderstanding by highlighting uncertainty

Calculating confidence intervals (CI) for effect sizes places greater emphasis on the interpretation of the biological or clinical significance of results rather than just their statistical significance.^5,34

CIs show how rough or uncertain an estimate is.⁴⁷ A CI can be thought of as the set of true but unknown differences that are statistically compatible with the observed difference.⁴⁸ As the CI narrows, the estimate becomes more precise. CIs are usually calculated as 90%, 95% or 99% but can be any percentage of interest between 0% and 100%. In the case of a 95% CI, one would expect that 95 out of 100 CIs obtained from equal-sized samples drawn from the same population will contain the population parameter. Hence, for the same estimate, the CI automatically widens as the confidence level rises from 90% to 95% to 99%.

CIs are an example of inferential error bars, which give information about what conclusions are justified. The length of inferential error bars is proportional to how much uncertainty there is in the data. Another example of inferential error bars is standard error of the mean (SEM). SEM is an estimate of the standard deviation of mean values obtained from multiple samples of the same size from the same population. Simple CI bars for metric variables are SEM bars adjusted for sample size (n) to allow comparison between samples of uneven size. The ratio of CI to standard error is the appropriate quantile of the t distribution for that n, and changes with n. As the sample size increases, the size of SEM and CI bars decreases, but the standard deviation of the sample remains constant.^49,50
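The relationship between standard deviation, SEM and CI can be sketched in a few lines. The critical value below is the tabulated two-sided 95% quantile of the t distribution for 9 degrees of freedom, so it applies only to n = 10; the function names and data are assumptions for the example.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% t quantile for 9 degrees of freedom (n = 10), from standard tables.
T_CRIT_95_DF9 = 2.262

def sem(values):
    """Standard error of the mean: sample SD divided by sqrt(n)."""
    return stdev(values) / sqrt(len(values))

def confidence_interval(values, t_crit):
    """CI for the mean: mean +/- t quantile times SEM.
    The caller must supply the t quantile matching len(values) - 1."""
    half_width = t_crit * sem(values)
    centre = mean(values)
    return centre - half_width, centre + half_width

# Hypothetical measurements from 10 independent experimental units.
values = [4.8, 5.1, 5.3, 4.9, 5.0, 5.2, 4.7, 5.4, 5.0, 4.6]
lower, upper = confidence_interval(values, T_CRIT_95_DF9)
print(f"SD = {stdev(values):.3f}, SEM = {sem(values):.3f}")
print(f"95% CI = ({lower:.3f}, {upper:.3f})")
```

Because the SEM divides the standard deviation by the square root of n, the bars shrink as n grows even when the spread of the data does not change.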

Presenting CI graphically highlights uncertainty and discourages rigid interpretations based on p-value thresholds. Note that error bars and summary statistics (e.g. mean and standard deviation) should only be shown for independently repeated experiments and never for the technical replicates used within a single experiment to conduct internal quality checks.⁵¹ This is because inferences can only be made about the population from which independent samples were drawn unless correlation is accounted for (e.g. mixed models). Since technical replicates are not independent, they only indicate the fidelity with which replicates were created and cannot provide evidence of the reproducibility of the main results.⁵² Misinterpreting data from experiments which were not independently repeated and disguising inappropriate data selection with terms such as ‘typical or representative result’ are major causes of poor reproducibility.¹⁹

Display data in raw form

In contrast to inferential error bars, descriptive error bars show the spread of data (e.g. range and standard deviation), and the type of error bar on any graph must be specified. Descriptive error bars are commonly used in dynamite plots, which are bar charts displaying mean and standard deviation. Dynamite plots are so named because they look like a dynamite plunger. They are never appropriate because they conceal data such as the number of data points and their distribution.⁵³ For example, a dynamite plot with two data points may look the same as one with 50 data points provided the mean and standard deviation are the same. Similarly, a single outlier in one data set may be sufficient to make the dynamite plot look the same as another data set with two distinctive clusters of data points. In each case, the comparison is misleading.

Consequently, small samples should be presented as dot plots showing all individual values with a point or line representing the mean.⁵⁴ A box plot or violin plot should be used if there are too many data points to plot individually without congestion. Box and violin plots show the median value, interquartile range (which captures the middle half of the data between the first and third quartile) and the upper and lower adjacent values. (These are dependent on the software used. One definition is the values in the data that are furthest away from the median on either side of the box but are still within a distance of 1.5 times the interquartile range from the nearest end of the box.) Outliers or outside points are beyond the adjacent values. In addition, a violin plot shows the entire distribution of the data.⁵⁵
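The quantities shown in a box plot can be computed directly, which makes the 1.5 times interquartile range definition of adjacent values explicit. This sketch uses one of several quartile conventions (the "inclusive" method of Python's statistics module), so plotting software may give slightly different values; the function name and data are hypothetical.

```python
from statistics import median, quantiles

def box_plot_summary(values):
    """Median, quartiles, adjacent values and outside points for a box plot."""
    data = sorted(values)
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    return {
        "median": median(data),
        "q1": q1,
        "q3": q3,
        # Adjacent values: most extreme data points still within the fences.
        "lower_adjacent": inside[0],
        "upper_adjacent": inside[-1],
        # Outside points: data beyond the adjacent values.
        "outside": [v for v in data if v < lower_fence or v > upper_fence],
    }

summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(summary)
```

In this made-up data set the value 30 lies beyond the upper fence, so it would be drawn as an individual outside point rather than stretching the whisker.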

Clearly distinguish between experimental and observational units

The experimental unit is the entity that is randomly and independently assigned to the treatment conditions (e.g. person, animal, litter, cage, holding pen, fish tank, culture dish or a well in a microtiter plate).⁵⁶ It is the smallest unit which can receive different treatments and should be the unit of statistical analysis. Experimental interventions must be applied to each experimental unit independently to average out treatment errors across the experimental groups. Treatments should not spill over or affect adjacent experimental units, and experimental units should not influence each other, especially on the outcomes of interest. Independence of experimental units is assumed in p-value calculations.

The observational unit is the entity on which measurements are made. It is the smallest unit that will be measured or observed in an experiment. It may also be called the measurement or sampling unit.

Conclusions only apply to the population from which the experimental units were drawn.⁵¹ The population is the set of all individuals or experimental units of interest. Hence, misidentification of units means the intended hypotheses will not be tested and results cannot be extrapolated to the population of interest.

Sample size (lowercase n) is the number of experimental units in a treatment group.⁵⁷ Capital N is the total number of experimental units that are used in the whole experiment.⁴⁵ Hence, the sample size is only the number of animals per treatment group if treatments are randomly and independently applied to individual animals. Otherwise, the experimental unit, and hence the unit of statistical analysis, should be the entity which was randomly and independently assigned to the treatment conditions such as the litter, cage, holding pen, fish tank, pot, pasture or plot.

If individual animals are counted towards sample size when they are not the experimental unit, this is an example of pseudoreplication. Pseudoreplication is the use of inferential statistics to test for treatment effects using data from experiments where either treatments are not replicated (though samples may be) or replicates are not statistically independent.⁵⁸ This confounds treatment effects with random error (e.g. biological variation and measurement error), artificially inflates sample sizes and therefore reduces p-values. It violates the principle of independence, resulting in an invalid analysis and irreproducible data unless correlation is accounted for (e.g. with mixed models).⁵⁴
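A simple way to avoid counting animals as replicates when the cage is the experimental unit is to reduce the data to one value per cage before analysis. The sketch below assumes treatments were assigned per cage; the record layout, names and numbers are invented for illustration, and mixed models are the alternative mentioned above.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (treatment, cage_id, measurement on one animal).
records = [
    ("control", "C1", 10.2), ("control", "C1", 10.8), ("control", "C1", 9.9),
    ("control", "C2", 11.5), ("control", "C2", 11.1), ("control", "C2", 11.9),
    ("treated", "T1", 12.4), ("treated", "T1", 12.9), ("treated", "T1", 12.1),
    ("treated", "T2", 13.6), ("treated", "T2", 13.0), ("treated", "T2", 13.3),
]

def experimental_unit_means(records):
    """Average animals within each cage, then list cage means per treatment."""
    by_cage = defaultdict(list)
    for treatment, cage, value in records:
        by_cage[(treatment, cage)].append(value)
    per_treatment = defaultdict(list)
    for (treatment, _cage), values in by_cage.items():
        per_treatment[treatment].append(mean(values))
    return dict(per_treatment)

unit_values = experimental_unit_means(records)
n_per_group = {treatment: len(v) for treatment, v in unit_values.items()}
print(n_per_group)  # sample size counted in cages, not animals
```

Here 12 animals yield a sample size of two per group, because only the cages were independently assigned to treatment.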

Clearly distinguish between observational studies, exploratory experimental studies and confirmatory experimental studies

The pyramid of research design is a hierarchy which represents escalating study rigour.^59,60 Observational (non-randomised) studies are on the lower levels of the pyramid. They aim to identify associations between predictors and outcomes. Examples include cohort, cross-sectional and case-control studies, which are used to investigate incidence, prevalence and aetiology.⁶¹ Observational studies are typically less expensive and quicker than randomised studies and reduce welfare risks to study participants because no interventions are carried out by investigators (e.g. subjects are exposed to risk factors naturally). Sample size depends on the number of predictors and the assumed underlying probability distribution of the outcome. Subjects are often selected on convenience or availability. This increases the likelihood of selection bias and confounding.^62,63 Confounding arises when subjects in different groups are fundamentally different with regard to the outcome of interest. This may bias estimates of association. The design and analysis of observational studies should control for confounding by matching subjects between groups on the basis of confounders, arranging and analysing subjects within groups on the basis of levels of confounders (stratification) and using multivariate techniques or generalised linear models of regression to adjust for multiple confounding factors.⁶⁴

Experimental (randomised) studies aim to establish a causal effect between treatment factors and outcomes. Subjects are randomly allocated to treatment and control groups. Investigators collecting and analysing data should be blinded to which groups are treatment and which are control. This is most important when there is any subjective element in assessing the results. Random allocation and blinding reduce bias and strengthen the evidence obtained. Experimental studies may be exploratory or confirmatory.⁶⁵

Exploratory experimental studies are designed to make new discoveries, develop methods and generate new hypotheses.⁶⁶ They tend to be small and flexible, so the sequence and specific design of experiments may be inexact. For example, sample size may be estimated based on feasibility and constraints rather than power analyses.⁶⁷ Where hypotheses are imprecise or non-existent at commencement, analysis should be limited to descriptive statistics only. The emphasis should be on high sensitivity to detect all strategies that might be useful (i.e. minimise false negatives). Consequently, many different strategies are often tested in parallel, which may necessitate small sample sizes. Under such circumstances, some experiments will produce large effects and statistically significant results due to random variation alone.¹⁴

A danger of exploratory studies is the temptation to misrepresent hypotheses which evolve over the course of sequential experiments as a priori hypotheses and selectively support them with spurious results for publication. This increases the risk of false-positive findings entering the scientific literature.

Confirmatory experimental studies use clear, a priori hypotheses, rigid, pre-specified designs, sample sizes calculated on power and well-established methods, often over prolonged durations, to provide strong, new evidence. An example is a randomised clinical trial to determine efficacy. The emphasis should be on high specificity to eliminate false positives which progressed through exploratory studies. Hence, sample sizes should be sufficiently large to minimise the effects of random variation. Analysis of confirmatory experimental studies should include both descriptive and inferential statistics (e.g. analysis of variance (ANOVA) families) to generalise the findings from the sample to the underlying population.

All experimental studies should specify whether they are exploratory or confirmatory, including all hypotheses tested and when they were formulated.

Use powerful experimental designs

A well-designed experiment avoids bias (systematic error) and is sufficiently powerful to be able to detect effects likely to be of biological importance. Power is the probability of finding a statistically significant difference at a given significance level with the specified number of subjects in each group, when a true effect of the assumed size exists. It indicates the signal (treatment) to noise (unexplained variability) ratio of the experiment.⁴⁵ In underpowered experiments, the observed effect has to be larger than the true effect size to overcome the noise and reach statistical significance, so significant results tend to overestimate treatment effects.⁶⁸ Common sources of unexplained variability in animal experiments include subclinical infections (which are the most common type in laboratory animals), genetic drift (which cannot be stopped) and variations in microbiome (with effects as diverse as drug responses, immune and nervous system function and development of disease models).^69,70

One way to increase power is to increase sample size. Any signal will eventually become statistically significant with a large enough sample size.⁷¹ Increasing sample size increases cost and logistical complexity and has ethical implications. Upscaling an experiment may impact consistency. For example, if a larger experiment requires more investigators over separate days, ‘investigator’ and ‘day’ become technical factors which could influence results and may need to be included as variables in the analysis. Similarly, if the increased duration accentuates the influence of circadian rhythms, time may need to be included as a covariate or blocking factor. These are all changes to the original design of the experiment that was used for the power calculation.⁷²
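The link between sample size, effect size and power can be sketched with the usual normal approximation for comparing two group means. This is a rough planning aid under stated assumptions (two equal groups, a two-sided test, a standardised effect size d, normal rather than t quantiles, so it slightly underestimates n for small samples); the function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def sample_size_per_group(effect_size_d, alpha=0.05, power=0.80):
    """Approximate n per group: 2 * (z_(1-alpha/2) + z_power)^2 / d^2."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 / effect_size_d ** 2)

def approximate_power(effect_size_d, n_per_group, alpha=0.05):
    """Approximate power of the same comparison for a given n per group."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return _STD_NORMAL.cdf(effect_size_d * sqrt(n_per_group / 2) - z_alpha)

for d in (0.5, 1.0, 2.0):
    print(d, sample_size_per_group(d))
```

Halving the effect size of interest roughly quadruples the required sample size, which is why the biologically relevant effect size has to be decided before the experiment rather than after it.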

Simple options to increase power with fixed sample size include testing a specific hypothesis instead of a general one, using the minimum number of factor levels for continuous variables without reducing them to categorical variables based on a threshold such as the median (dichotomising or binning) and crossing factors rather than nesting them.⁷³

Factor arrangement refers to how factors relate to each other. When levels of one factor always co-occur with the same levels of another factor, they are completely confounded. An example is running all control animals on one day and all treated animals on another day. Here, ‘treatment’ and ‘day’ are completely confounded, so their effects cannot be separated.

A nested arrangement occurs when levels of one factor are grouped under the levels of another factor. If whole litters of mice are randomised to treatment groups, the factor ‘litter’ would be nested under the factor ‘treatment’. If there are large differences between litters, a nested arrangement could result in false positives. Furthermore, if litters are randomised to treatment groups rather than individual animals, litter should be the experimental unit, and hence average values per litter should be used in statistical analysis. This reduces sample size and reduces power. Hence, nested designs should be avoided but are common because they are convenient (e.g. housing littermates in the same cage).

A crossed or factorial arrangement occurs when all levels of one factor co-occur with all levels of another factor. If mouse pups from separate litters are spread across treatment groups (randomised to groups within litters), the treatment and litter factors would be crossed. In these examples, the difference in power between nested and crossed designs diminishes as variation between litters becomes smaller and/or variation within litters becomes greater.⁷⁴

A third approach to increase power is to reduce noise via improved experimental design.^75,76

Completely randomised design

In a completely randomised design, experimental units are randomly assigned to treatments and controls. It is suitable when the experimental units are independent, interchangeable and homogeneous and there are sufficient experimental units available for the required power. This is the simplest experimental design in terms of calculating power, conducting the experiment and analysing the results. This design does not control for differences or relatedness among experimental units, or for variation occurring during the course of the experiment.⁷⁷
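Random assignment in a completely randomised design amounts to shuffling the experimental units and dealing them out in equal groups. The following is a minimal sketch; the function name, unit identifiers and use of a fixed seed for a reproducible allocation record are assumptions.

```python
import random

def completely_randomised(unit_ids, treatments, seed=None):
    """Randomly split experimental units into equal-sized treatment groups."""
    if len(unit_ids) % len(treatments) != 0:
        raise ValueError("units must divide evenly among treatments")
    rng = random.Random(seed)
    shuffled = list(unit_ids)
    rng.shuffle(shuffled)
    per_group = len(shuffled) // len(treatments)
    return {
        treatment: sorted(shuffled[i * per_group:(i + 1) * per_group])
        for i, treatment in enumerate(treatments)
    }

# Hypothetical animal identifiers.
animals = [f"A{i:02d}" for i in range(1, 13)]
allocation = completely_randomised(animals, ["control", "treated"], seed=42)
print(allocation)
```

No block or covariate enters the allocation, which reflects both the simplicity of this design and its lack of control over variation among units.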

In factorial experiments, more than one type of independent variable or factor is varied at a time to determine which factors are influential.⁷⁸ Factors can be any variable which the investigator can control. This includes animal-related characteristics (e.g. sex, strain, age, diet and health status), environmental variables (e.g. cage and group size, bedding material and environmental complexity) and protocol-specific variables (e.g. reagents, dosages, operators, routes of administration, timing of observations and methods of measurement).

The simplest factorial is a 2 × 2 design. An example is male and female animals in each of two treatment groups (treatment and control), giving four groups of animals. All of the animals contribute to estimates of the effect of treatment and sex. If the response of the two sexes to treatment is different, it is called an interaction between sex and treatment. This multiple use of data is a key benefit of using factorial designs to maximise information on limited numbers of animals.⁷⁹ If the same experiment had four levels of treatment, it would be a 4 × 2 factorial experiment. When one group of animals is included for every combination of each factor, it is termed a full factorial design. However, as the number of factors increases, the total number of combinations becomes very large. In these cases, a fractional factorial design may be used with a subset of these combinations to maximise information on the main effects and interactions.
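The groups of a full factorial design are simply all combinations of the factor levels, and in a 2 × 2 design the interaction is the difference between the treatment effects in the two sexes. The sketch below illustrates both; the function names and cell means are hypothetical.

```python
from itertools import product

def full_factorial(factors):
    """List every combination of factor levels (one group per combination)."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]

groups_2x2 = full_factorial({
    "sex": ["female", "male"],
    "treatment": ["control", "treated"],
})
groups_4x2 = full_factorial({
    "treatment": ["control", "low", "medium", "high"],
    "sex": ["female", "male"],
})
print(len(groups_2x2), len(groups_4x2))

def interaction_2x2(cell_means):
    """Treatment effect in males minus treatment effect in females."""
    effect_male = cell_means[("male", "treated")] - cell_means[("male", "control")]
    effect_female = cell_means[("female", "treated")] - cell_means[("female", "control")]
    return effect_male - effect_female

# Hypothetical group means, invented for illustration.
cell_means = {
    ("female", "control"): 10.0, ("female", "treated"): 12.0,
    ("male", "control"): 10.5, ("male", "treated"): 15.5,
}
print(interaction_2x2(cell_means))
```

A non-zero interaction contrast in real data would still need a formal test (e.g. two-way ANOVA) before concluding that the sexes respond differently.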

Factorial experimental designs are more effective and efficient than varying one factor at a time, hence saving time, money and animals. The main extra cost is the increase in the complexity of the experiment, which could lead to mistakes, and the increased complexity of the statistical analysis (e.g. switching from one-way ANOVA to two-way or multi-way ANOVA).⁸⁰

Randomised block design

Factorial experimental designs can be used to show which controllable factors are important, but they cannot be easily used to investigate the many uncontrollable factors that can affect the results.⁸¹ The effect of uncontrollable factors, or nuisance variables, can be explored with a randomised block design (RBD), even if they are unknown. Nuisance variables add heterogeneity to experimental units and probably affect the results but are of no interest to the investigator.⁷⁷ Examples include light, vibration, temperature, humidity, cage location, noise, time of day and changes in skill of investigators. Controlling for nuisance variables can reduce animal numbers by two to ten times while retaining similar power and significance.⁸²

An RBD splits the experiment into multiple mini experiments where a complete replicate of the basic experiment is conducted within each block. Any restriction in randomisation creates a block. Experimental units are randomly assigned to treatments within blocks, which ensures a balance of treatments across the variability within the blocks. Depending on the context, blocks can be considered as fixed or random effects.⁸³ Blocks are recombined in the final statistical analysis, and nuisance variables are removed as a block effect.⁸⁴

Treatment groups must be randomly intermingled within blocks and assessed blind in random order to avoid bias. Blinding will occur if subjects are only identified by their identification number once the treatments are given. Randomisation to treatment group (RTTG) is when treatment groups are pre-ordered within blocks and not randomly intermingled. RTTG is an invalid design because any environmental effect that differs between treatment groups may be mistaken for the effects of the treatment, leading to bias and irreproducible results. Hence, randomisation should be applied to allocation of experimental units, as well as the order of treatments and assessments.⁷⁷
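Randomisation within blocks, together with a random assessment order, can be sketched as follows. The sketch assumes a complete block design with exactly one experimental unit per treatment in each block; the function name, block labels and unit identifiers are hypothetical.

```python
import random

def randomised_block_allocation(blocks, treatments, seed=None):
    """Assign each treatment once per block, in an independently shuffled order."""
    rng = random.Random(seed)
    allocation = {}
    for block, units in blocks.items():
        if len(units) != len(treatments):
            raise ValueError(f"block {block!r} needs one unit per treatment")
        order = list(treatments)
        rng.shuffle(order)  # separate shuffle for every block
        for unit, treatment in zip(units, order):
            allocation[unit] = {"block": block, "treatment": treatment}
    # Assessment order is randomised separately within each block.
    assessment_order = {}
    for block, units in blocks.items():
        shuffled_units = list(units)
        rng.shuffle(shuffled_units)
        assessment_order[block] = shuffled_units
    return allocation, assessment_order

# Hypothetical blocks: one complete replicate of the experiment per day.
blocks = {
    "day1": ["m01", "m02", "m03"],
    "day2": ["m04", "m05", "m06"],
    "day3": ["m07", "m08", "m09"],
}
allocation, assessment_order = randomised_block_allocation(
    blocks, ["control", "low_dose", "high_dose"], seed=7
)
print(allocation)
print(assessment_order)
```

Because treatments are shuffled independently inside every block, no treatment is systematically tied to a position, day or cage location.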

A correctly randomised block design controls for both variation among experimental subjects and variation caused during the course of the experiment by the research environment and in the assessment of results. This design reduces variability and confounding within treatments and improves the estimate of treatment effects, thereby increasing statistical power.

For a complete RBD, each block contains every treatment group. An incomplete RBD may be used when the blocks are not large enough to accommodate all treatments.

Factorial designs can be used within blocks, and blocks can account for natural groupings of experimental units. For example, separate blocks may contain siblings or litters of animals, animals from different suppliers, animals of different mean age or weight, animals housed in different cages or in different locations, animals fed a different batch of diet or animals used at a different time of the year. If these natural groups are not accounted for, results may be confounded by conditions which existed before the experiment commenced.

Split-plot experimental design

An example of a factorial design in a blocked experiment is a split-plot experiment.⁸⁵ These have two levels of experimental units, where the blocks themselves serve as experimental units for a subset of the factors. The blocks are referred to as whole plots, while the experimental units within blocks are called split plots, split units or subplots. For example, if control mice are co-housed with treatment mice in single-strain cages, strain (cage) is the whole plot, and treatment (mouse) is the split plot. Cage is the biggest factor in variability of rodent experiments because everything important to rodents clusters on cage (e.g. temperature, humidity, smell, light and vibration). By placing control and treatment animals of the same strain in the same cage, the cage becomes a block for these nuisance variables. Another example is groups of caged mice fed different diets per cage, while individual mice are injected with vitamins. In this case, diet (cage) is the whole plot and vitamins (mouse) are the split plot. Both whole plots and split plots need to be independently randomised.

In a split-plot experiment, factors which are more difficult to administer such as strain, sex or age should be assigned to whole plots, while factors which are easier to administer such as applied treatments should be assigned to split plots. Split plots have more experimental units and hence larger sample size and greater statistical power.
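The independent randomisation of whole plots and split plots can be sketched as a two-stage procedure, using the diet and vitamin example from the preceding text. The function `randomise_split_plot`, the cage and mouse identifiers and the treatment labels are hypothetical and only illustrate the structure.

```python
import random


def randomise_split_plot(cages, whole_plot_treatments, split_plot_treatments, seed=None):
    """Two-stage randomisation: whole-plot treatments to cages, then
    split-plot treatments to the mice within each cage.

    cages: dict mapping a cage ID to the list of mouse IDs housed in it.
    Returns a dict: mouse ID -> (cage, whole-plot treatment, split-plot treatment).
    """
    rng = random.Random(seed)
    cage_ids = list(cages)
    if len(cage_ids) % len(whole_plot_treatments) != 0:
        raise ValueError("number of cages must be a multiple of the whole-plot treatments")
    # Stage 1: randomise the whole-plot treatment (e.g. diet) across cages.
    whole = list(whole_plot_treatments) * (len(cage_ids) // len(whole_plot_treatments))
    rng.shuffle(whole)
    plan = {}
    for cage, whole_treatment in zip(cage_ids, whole):
        mice = cages[cage]
        if len(mice) % len(split_plot_treatments) != 0:
            raise ValueError(f"cage {cage!r} cannot be balanced over split-plot treatments")
        # Stage 2: separately randomise the split-plot treatment within each cage.
        split = list(split_plot_treatments) * (len(mice) // len(split_plot_treatments))
        rng.shuffle(split)
        for mouse, split_treatment in zip(mice, split):
            plan[mouse] = (cage, whole_treatment, split_treatment)
    return plan


# Hypothetical example: four cages of two mice, two diets, two injections.
cages = {
    "cage_A": ["a1", "a2"],
    "cage_B": ["b1", "b2"],
    "cage_C": ["c1", "c2"],
    "cage_D": ["d1", "d2"],
}
plan = randomise_split_plot(cages, ["diet_1", "diet_2"], ["vitamin", "vehicle"], seed=7)
for mouse, (cage, diet, injection) in plan.items():
    print(mouse, cage, diet, injection)
```

All mice in a cage share the same diet, so the cage is the experimental unit for diet, while the mouse is the experimental unit for the injection.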

The order of application of treatment and control should be randomised and applied consistently to allow for effects such as stress and waiting.

Experimental units should not influence each other, especially on the outcomes of interest.⁵⁶ Co-housed animals may influence each other on many relevant variables, from behaviour to microbiomes (and hence anything influenced by an animal’s microbiome). This risk needs to be weighed against the benefits of blocking nuisance variables when co-housing control and treatment animals.

Blocking is an effective means of exploiting the benefits of both standardisation and heterogenisation.^86–88 Within blocks of subjects, the experimental conditions can be standardised as rigorously as possible (e.g. use of same genotype, same age and same experimental context), so that any differences in response to the experimental treatments will most likely be attributable to the treatment. However, the blocks themselves can be heterogeneous and vary in one or several aspects. Large block effects are common, highlighting the importance of using concurrent controls and randomisation.⁸⁴ Differences between blocks indicate the external validity of results, or how reproducible results are under different conditions. In a RBD with replication over time and/or in different laboratories or by different staff, the extent to which the results are repeatable under different conditions will give a good indication of the robustness of the experiment. High external validity shows results are generalisable because they are not overwhelmed by noise and context-specific biological variation.^89,90

As with factorial designs, the cost is that blocking adds complexity to the design and analysis of the experiment. Blocking is also less tolerant of missing observations than a completely randomised design, and it should not be used in small experiments (e.g. fewer than 16 experimental units) because of the loss of power.

Split-plot designs and some RBDs are commonly analysed with mixed models. Mixed models contain both fixed and random effects, so they can explicitly account for correlation between repeated measurements on subjects (longitudinal data) and describe how the response of subjects changes over time.⁹¹ Fixed effects are factors assumed to have the same effect across subjects (e.g. the effect of an intervention, although this assumption may not hold in some circumstances, such as when subjects differ genetically), while random effects are factors likely to vary substantially between subjects (e.g. baseline physiological functions). This gives flexibility to determine the effects of multiple factors. Common statistical methods such as linear regression models and repeated-measures ANOVA are less suited to repeated-measures data because they rely on assumptions which do not hold, such as independence of measurements and all effects being fixed. Longitudinal data present a further challenge for repeated-measures ANOVA when subjects have different numbers of available measurements (an unbalanced data pattern) due to missing values. If a subject drops out of a longitudinal trial due to treatment-related adverse effects, the missing data are solely dependent on the observed outcomes and classified as missing at random (MAR). Conversely, if the missing data have nothing to do with the treatment effect and its outcome (e.g. the subject relocates), they are independent of both observed and unobserved measurements and classified as missing completely at random (MCAR).⁹² Mixed models assume missing data are MAR and can use all available observations and subjects in the analysis. Repeated-measures ANOVA relies on the less likely and more problematic assumption that missing data are MCAR.
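One common minimal form of such a model, given here only as an illustration, is the random-intercept model. The notation is an assumption and does not come from the original article: y_{ij} is the response of subject i at measurement occasion j, x_i is a treatment indicator, t_{ij} is the time of measurement, the β terms are fixed effects and b_i is a subject-specific random intercept assumed independent of the residual error.

```latex
% Random-intercept mixed model for repeated measurements (illustrative notation)
y_{ij} = \beta_0 + \beta_1 x_i + \beta_2 t_{ij} + b_i + \varepsilon_{ij},
\qquad b_i \sim \mathcal{N}(0, \sigma_b^2),
\qquad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)

% Correlation between two measurements on the same subject (j \neq k)
\operatorname{Corr}(y_{ij}, y_{ik}) = \frac{\sigma_b^2}{\sigma_b^2 + \sigma^2}
```

Under these assumptions, the shared term b_i is what makes repeated measurements on the same subject correlated. The larger the between-subject variance relative to the residual variance, the stronger that correlation.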

Do not recalculate power to interpret non-significant results

Power is a pretrial calculation using a given effect size and should play no role once data are collected.⁴⁸ A statistically non-significant result does not, by itself, show that the power was inadequate for detecting that effect size. Nevertheless, retrospective power calculations are often inappropriately used to interpret the validity of statistically non-significant results and are recommended by some guidelines.²⁴ For example, observed (or post hoc) power is the statistical power of the test calculated after the experiment using the observed effect size and p-value. Detectable effect size uses the observed variability to compute the minimum effect size which could be detected with the power calculated before the trial.^93,94

Retrospective power calculations interpret higher power in an experiment with a non-significant result as stronger evidence for a true negative. However, power calculated retrospectively is uniquely determined by the p-value obtained in that experiment: observed power increases as the p-value decreases. Yet a lower p-value is stronger evidence against the null hypothesis, and therefore gives stronger support for the result being a false negative. This creates a contradiction, with higher retrospective power indicating a true negative and the lower calculated p-value indicating a false negative. This is called the power approach paradox.⁹⁵
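The one-to-one relationship between observed power and the p-value can be made concrete with a short sketch. It assumes a two-sided z-test with known variance, a simplification chosen so that only the Python standard library is needed, and the function name `observed_power` is illustrative. Observed power is obtained by treating the observed test statistic as if it were the true standardised effect.

```python
from statistics import NormalDist


def observed_power(p_value, alpha=0.05):
    """Observed (post hoc) power of a two-sided z-test, computed from the p-value alone.

    Assumes a z-test with known variance; the observed z statistic is
    plugged in as if it were the true standardised effect.
    """
    if not 0 < p_value < 1:
        raise ValueError("p_value must lie strictly between 0 and 1")
    standard_normal = NormalDist()
    z_observed = standard_normal.inv_cdf(1 - p_value / 2)  # |z| implied by the p-value
    z_critical = standard_normal.inv_cdf(1 - alpha / 2)    # rejection threshold
    # Probability that a new z statistic centred on z_observed falls in either rejection tail.
    return (standard_normal.cdf(z_observed - z_critical)
            + standard_normal.cdf(-z_observed - z_critical))


for p in (0.05, 0.10, 0.30, 0.60, 0.90):
    print(f"p = {p:.2f} -> observed power = {observed_power(p):.3f}")
```

Under this assumption, a result sitting exactly on the threshold (p = alpha = 0.05) has an observed power of roughly one half, and larger p-values give lower observed power. Observed power therefore restates the p-value and adds no independent information about a non-significant result.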

A statistically non-significant result should not be referred to as negative because it does not mean the population effect is definitely zero. It means only that an effect was not detected, so a population effect of zero cannot be ruled out. A more meaningful way to interpret non-significant results is to provide confidence intervals, observed effect sizes, exact p-values and test statistics.⁴⁵

Conclusions

The statistician George Box wrote, ‘In applying mathematics to subjects such as physics or statistics we make tentative assumptions about the real world which we know are false but which we believe may be useful nonetheless. The physicist knows that particles have mass and yet certain results, approximating what really happens, may be derived from the assumption that they do not. Equally, the statistician knows, for example, that in nature there never was a normal distribution, there never was a straight line, yet with normal and linear assumptions, known to be false, he can often derive results which match, to a useful approximation, those found in the real world’.⁹⁶

A statistical model is a mathematical simplification of data variability which contains many assumptions about how data were collected and analysed, and how the results were selected.⁴⁰ Misinterpretation and misrepresentation of data through flawed experimental statistics are a major cause of poor reproducibility.^41,97–100 The following simple steps which summarise the preceding discussion would improve the reliability of published results:

Investigators should preferentially use powerful experimental designs such as RBDs and factorials to save time, money and animals. Procedures used for randomisation and blinding should be formalised and reported with results.

All experimental studies should specify whether they are exploratory or confirmatory, including all hypotheses tested and when they were formulated.

Data should be provided in raw form such as dot, box or violin plots. Dynamite plots are never appropriate because they conceal data.

Experimental units should be the unit of statistical analysis, and they should be explicitly identified for each study and treatment group. Misidentification of the experimental unit leads to pseudoreplication, overestimation of the sample size and false-positive results.

Conclusions should not be based only on whether a p-value passes a specific threshold. Unless p < 0.001, exact p-values should be provided to three decimal places. If p > 0.1, two decimal places are sufficient.

Statistically non-significant results should not be referred to as negative, and statistical power calculated pre-trial should not be recalculated post-trial to interpret non-significant results.

All results should be stated with effect size, confidence intervals and test statistics. Effect size helps readers understand the magnitude of differences found, whereas statistical significance examines whether the findings are likely to be due to chance.⁴³

Most importantly, investigators should consult a statistician at both the design and interpretation stages of experiments. No experiment should ever be started without a clear idea of how the resulting data will be analysed.⁵⁷ Improving the use and reporting of experimental statistics will better inform decisions made on the basis of published results, reduce wastage caused by irreproducible science and improve scientific progress.
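The p-value reporting rule in the recommendations above can be sketched as a small helper. The function name `format_p` and the exact output strings are illustrative assumptions; only the thresholds and the number of decimal places come from the recommendation.

```python
def format_p(p_value):
    """Format a p-value following the reporting rule in the recommendations.

    Below 0.001 the value is reported as an inequality, above 0.1 to two
    decimal places, and otherwise as an exact value to three decimal places.
    """
    if not 0 <= p_value <= 1:
        raise ValueError("p_value must lie between 0 and 1")
    if p_value < 0.001:
        return "p < 0.001"
    if p_value > 0.1:
        return f"p = {p_value:.2f}"
    return f"p = {p_value:.3f}"


for value in (0.0004, 0.0432, 0.27):
    print(format_p(value))
```

Reporting the exact value in this way keeps the result from being reduced to whether it passed a threshold.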

Footnotes

Acknowledgements

The author thanks Corinne Alberthsen, Russ Bradford and Alison Small for helpful discussions while writing this manuscript.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

References

1. Baker. 1,500 scientists lift the lid on reproducibility. Nature 2016; 533: 452–454.

2. National Health and Medical Research Council. Best practice methodology in the use of animals for scientific purposes, https://www.nhmrc.gov.au/about-us/publications/best-practice-methodology-use-animals-scientific-purposes (2017, accessed 30 June 2022).

3. Cressey. Poorly designed animal experiments in the spotlight. Nature 2015. https://doi.org/10.1038/nature.2015.18559

4. Kilkenny, Parsons, Kadyszewski, et al. Survey of the quality of experimental design, statistical analysis and reporting of research using animals. PLoS One 2009; 4: e7824.

5. Hawkins, Gallacher, Gammell. Statistical power, effect size and animal welfare: recommendations for good practice. Anim Welf 2013; 22: 339–344.

6. Furuya, Wijesundara, Neeman, et al. Use and misuse of statistical significance in survival analyses. mBio 2014; 5: e00904–00914.

7. Jasińska-Stroschein, Orszulak-Michalak. Reporting experimental studies on animals – the problems with translating of outcomes to clinical benefits. Methodological and statistical considerations: the example of pulmonary hypertension. Eur J Pharmacol 2021; 897: 173952.

8. Hooijmans, Ritskes-Hoitinga. Progress in using systematic reviews of animal studies to improve translational research. PLoS Med 2013; 10: e1001482.

9. Osherovich. Hedging against academic risk. SciBX 2011; 4: 416.

10. Begley, Ioannidis JP. Reproducibility in science: improving the standard for basic and preclinical research. Circ Res 2015; 116: 116–126.

11. Bunnage ME. Getting pharmaceutical R&D back on target. Nat Chem Biol 2011; 7: 335–339.

12. Ioannidis JPA, Bossuyt PMM. Waste, leaks, and failures in the biomarker pipeline. Clin Chem 2017; 63: 963–972.

13. Kern SE. Why your new cancer biomarker may never work: recurrent patterns and remarkable diversity in biomarker failures. Cancer Res 2012; 72: 6097–6101.

14. Errington, Mathur, Soderberg, et al. Investigating the replicability of preclinical cancer biology. eLife 2021; 10: e71601.

15. Begley, Ellis LM. Raise standards for preclinical cancer research. Nature 2012; 483: 531.

16. Garfield. The clarivate analytics impact factor, https://clarivate.com/webofsciencegroup/essays/impact-factor/ (2022, accessed 21 June 2022).

17. Campos-Varela, Ruano-Ravina. Misconduct as the main cause for retraction. A descriptive study of retracted publications and their authors. Gac Sanit 2019; 33: 356–360.

18. Fang, Casadevall. Retracted science and the retraction index. Infect Immun 2011; 79: 3855–3859.

19. Begley CG. Six red flags for suspect work. Nature 2013; 497: 433–434.

20. Freedman, Cockburn, Simcoe TS. The economics of reproducibility in preclinical research. PLoS Biol 2015; 13: e1002165.

21. NPQIP. Did a change in Nature journals’ editorial policy for life sciences research improve reporting? BMJ Open Sci 2019; 3: e000035.

22. Han, Olonisakin, Pribis, et al. A checklist is associated with increased quality of reporting preclinical biomedical research: a systematic review. PLoS One 2017; 12: e0183591.

23. Plint, Moher, Morrison, et al. Does the CONSORT checklist improve the quality of reports of randomised controlled trials? A systematic review. Med J Aust 2006; 185: 263–267.

24. Hooijmans, Leenaars, Ritskes-Hoitinga. A gold standard publication checklist to improve the quality of animal studies, to fully integrate the Three Rs, and to make systematic reviews more feasible. Altern Lab Anim 2010; 38: 167–182.

25. Osborne, Avey, Anestidou, et al. Improving animal research reporting standards: HARRP, the first step of a unified approach by ICLAS to improve animal research reporting standards worldwide. EMBO Rep 2018; 19: e46069.

26. Smith, Clutton, Lilley, et al. PREPARE: guidelines for planning animal research and testing. Lab Anim 2018; 52: 135–141.

27. Percie du Sert, Hurst, Ahluwalia, et al. The ARRIVE guidelines 2.0: updated guidelines for reporting animal research. PLoS Biol 2020; 18: e3000410.

28. Percie du Sert, Ahluwalia, Alam, et al. Reporting animal research: explanation and elaboration for the ARRIVE guidelines 2.0. PLoS Biol 2020; 18: e3000411.

29. Prinz, Schlange, Asadullah. Believe it or not: how much can we rely on published data on potential drug targets? Nat Rev Drug Discov 2011; 10: 712.

30. Nuzzo. Scientific method: statistical errors. Nature 2014; 506: 150–152.

31. Harvey LA. Nearly significant if only…. Spinal Cord 2018; 56: 1017.

32. Hankins. Still not significant, https://mchankins.wordpress.com/2013/04/21/still-not-significant-2/ (2013, accessed 1 June 2022).

33. Kjaerulff. p-Values – abused not abandoned, https://chemaust.raci.org.au/article/may-2018/p-values-%E2%80%93-abused-not-abandoned.html (2018, accessed 1 June 2022).

34. Fidler, Loftus GR. Why figures with error bars should replace p values: some conceptual arguments and empirical demonstrations. Z Psychol 2009; 217: 27–37.

35. Gardner, Altman DG. Confidence intervals rather than p values: estimation rather than hypothesis testing. Br Med J (Clin Res Ed) 1986; 292: 746–750.

36. Carver. The case against statistical significance testing. Harv Educ Rev 1978; 48: 378–399.

37. Amrhein, Greenland, McShane. Scientists rise up against statistical significance. Nature 2019; 567: 305–307.

38. McShane, Gal, Gelman, et al. Abandon statistical significance. Am Stat 2019; 73: 235–245.

39. Wasserstein, Schirm, Lazar NA. Moving to a world beyond ‘p < 0.05’. Am Stat 2019; 73: 1–19.

40. Greenland, Senn, Rothman, et al. Statistical tests, p values, confidence intervals, and power: a guide to misinterpretations. Eur J Epidemiol 2016; 31: 337–350.

41. Goodman, Fanelli, Ioannidis JPA. What does research reproducibility mean? Sci Transl Med 2016; 8: 341ps312–341ps312.

42. Simonsohn, Nelson, Simmons JP. p-Curve: a key to the file-drawer. J Exp Psychol Gen 2014; 143: 534–547.

43. Sullivan, Feinn. Using effect size – or why the p value is not enough. J Grad Med Educ 2012; 4: 279–282.

44. McGough, Faraone SV. Estimating the size of treatment effects: moving beyond p values. Psychiatry (Edgmont) 2009; 6: 21–29.

45. Gaskill, Garner JP. Power to the people: power, negative results and sample size. J Am Assoc Lab Anim 2020; 59: 9–16.

46. Coe. It’s the effect size, stupid: what effect size is and why it is important. Annual Conference of the British Educational Research Association, University of Exeter, UK, 2002.

47. Tan SB. The correct interpretation of confidence intervals. Proc Singapore Healthc 2010; 19: 276–278.

48. Goodman, Berlin JA. The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results. Ann Intern Med 1994; 121: 200–206.

49. Cumming, Fidler, Vaux DL. Error bars in experimental biology. J Cell Biol 2007; 177: 7–11.

50. Krzywinski, Altman. Error bars. Nat Methods 2013; 10: 921–922.

51. Vaux, Fidler, Cumming. Replicates and repeats – what is the difference and is it significant? A brief discussion of statistics and experimental design. EMBO Rep 2012; 13: 291–296.

52. Vaux DL. Research methods: know when your numbers are significant. Nature 2012; 492: 180–181.

53. Drummond, Vowler SL. Show the data, don’t conceal them. Br J Pharmacol 2011; 163: 208–210.

54. Gosselin RD. Guidelines on statistics for researchers using laboratory animals: the essentials. Lab Anim 2019; 53: 28–42.

55. Schriger, Cooper RJ. Achieving graphical excellence: suggestions and methods for creating high-quality visual displays of experimental data. Ann Emerg Med 2001; 37: 75–87.

56. Lazic, Clarke-Williams, Munafò MR. What exactly is ‘N’ in cell culture and animal experiments? PLoS Biol 2018; 16: e2005282.

57. Festing, Altman DG. Guidelines for the design and statistical analysis of experiments using laboratory animals. ILAR J 2002; 43: 244–258.

58. Hurlbert SH. Pseudoreplication and the design of ecological field experiments. Ecol Monogr 1984; 54: 187–211.

59. Concato. Observational versus experimental studies: what’s the evidence for a hierarchy? NeuroRx 2004; 1: 341–347.

60. Lawson. Design and analysis of experiments with R. Boca Raton, FL: Chapman & Hall/CRC Press, 2014, p.620.

61. Mann CJ. Observational research methods. Research design II: cohort, cross sectional, and case-control studies. Emerg Med J 2003; 20: 54–60.

62. Copas HG. Inference for non-random samples. J R Stat Soc Series B Stat Methodol 1997; 59: 55–95.

63. Van Belle. Statistical rules of thumb. Hoboken, New Jersey: John Wiley, 2008, p.300.

64. Morshed, Tornetta, Bhandari. Analysis of observational studies: a guide to understanding statistical methods. J Bone Joint Surg 2009; 91: 50–60.

65. Schwab, Held. Different worlds: confirmatory versus exploratory research. Significance 2020; 17: 8–9.

66. Kimmelman, Mogil, Dirnagl. Distinguishing between exploratory and confirmatory preclinical research will improve translation. PLoS Biol 2014; 12: e1001863.

67. Reynolds PS. When power calculations won’t do: Fermi approximation of animal numbers. Lab animal 2019; 48: 249–253.

68. Button, Ioannidis JPA, Mokrysz, et al. Power failure: why small sample size undermines the reliability of neuroscience. Nat Rev Neurosci 2013; 14: 365–376.

69. Festing MF. Refinement and reduction through the control of variation. Altern Lab Anim 2004; 32: 259–263.

70. Gärtner. A third component causing random variability beside environment and genotype. A reason for the limited success of a 30 year long effort to standardize laboratory animals? Lab Anim 1990; 24: 71–77.

71. Vetter, Mascha EJ. Bias, Confounding, and Interaction: lions and tigers, and bears, oh my! Anesth Analg 2017; 125: 1042–1048.

72. Lazic. Experimental design for laboratory biologists: maximising information and improving reproducibility. Cambridge, Cambridgeshire: Cambridge University Press, 2016, p.422.

73. Lazic SE. Four simple ways to increase power without increasing the sample size. Lab Anim 2018; 52: 621–629.

74. Lazic, Essioux. Improving basic and translational science by accounting for litter-to-litter variation in animal models. BMC Neurosci 2013; 14: 37.

75. Festing MFW. Inbred strains should replace outbred stocks in toxicology, safety testing, and drug development. Toxicol Pathol 2010; 38: 681–690.

76. Festing MFW, Altman DG. Guidelines for the design and statistical analysis of experiments using laboratory animals. ILAR J 2002; 43: 244–258.

77. Festing MFW. The ‘completely randomised’ and the ‘randomised block’ are the only experimental designs suitable for widespread use in pre-clinical research. Sci Rep 2020; 10: 17577.

78. Shaw, Festing, Peers, et al. Use of factorial designs to optimize animal experiments and reduce animal use. ILAR J 2002; 43: 223–232.

79. Shaw. Reduction in laboratory animal use by factorial design. Altern Lab Anim 2004; 32: 49–51.

80. Festing MFW. Design and statistical methods in studies using animal models of development. ILAR J 2006; 47: 5–14.

81. Kaps, Lamberson. Biostatistics for animal science. Wallingford, Oxfordshire: CABI, 2017, p.562.

82. Krzywinski, Altman. Analysis of variance and blocking. Nat Methods 2014; 11: 699–700.

83. Dixon. Should blocks be fixed or random? 28th Annual Conference on Applied Statistics in Agriculture, 2016.

84. Festing MF. Randomized block experimental designs can increase the power and reproducibility of laboratory animal experiments. ILAR J 2014; 55: 472–476.

85. Jones, Nachtsheim CJ. Split-plot designs: what, why, and how. J Qual Technol 2009; 41: 340–361.

86. Bailoo, Reichlin, Würbel. Refinement of experimental design and conduct in laboratory animal research. ILAR J 2014; 55: 383–391.

87. Richter, Garner, Auer, et al. Systematic variation improves reproducibility of animal experiments. Nat Methods 2010; 7: 167–168.

88. Richter, Garner, Würbel. Environmental standardization: cure or cause of poor reproducibility in animal experiments? Nat Methods 2009; 6: 257–261.

89. Voelkl, Altman, Forsman, et al. Reproducibility of animal research in light of biological variation. Nat Rev Neurosci 2020; 21: 384–393.

90. Voelkl, Vogt, Sena, et al. Reproducibility of preclinical animal research improves with heterogeneity of study samples. PLoS Biol 2018; 16: e2003693.

91. Detry. Analyzing repeated measurements using mixed models. JAMA 2016; 315: 407–408.

92. Mazumdar, Memtsoudis SG. Beyond repeated-measures analysis of variance: advanced statistical methods for the analysis of longitudinal data in anesthesia research. Reg Anesth Pain Med 2012; 37: 99–105.

93. Levine, Ensom MH. Post hoc power analysis: an idea whose time has passed? Pharmacotherapy 2001; 21: 405–409.

94. O’Keefe DJ. Post hoc power, observed power, a priori power, retrospective power, prospective power, achieved power: sorting out appropriate uses of statistical power analyses. Commun Methods Meas 2007; 1: 291–299.

95. Hoenig, Heisey DM. The abuse of power: the pervasive fallacy of power calculations for data analysis. Am Stat 2001; 55: 19–24.

96. Box GEP. Science and statistics. J Am Stat Assoc 1976; 71: 791–799.

97. Ioannidis JPA. Why most published research findings are false. PLoS Med 2005; 2: e124.

98. Halsey, Curran-Everett, Vowler, et al. The fickle p value generates irreproducible results. Nat Methods 2015; 12: 179–185.

99. Ware, Munafò MR. Significance chasing in research practice: causes, consequences and possible solutions. Addict 2015; 110: 4–8.

100. Szucs, Ioannidis JPA. When null hypothesis significance testing is unsuitable for research: a reassessment. Front Hum Neurosci 2017; 11: 390.