Abstract
Animal research often involves experiments in which the effect of several factors on a particular outcome is of scientific interest. Many researchers approach such experiments by varying just one factor at a time. As a consequence, they design and analyze the experiments based on a pairwise comparison between two groups. However, this approach uses unreasonably large numbers of animals and leads to severe limitations in terms of the research questions that can be answered. Factorial designs and analyses offer a more efficient way to perform and assess experiments with multiple factors of interest. We will illustrate the basic principles behind these designs, discussing a simple example with only two factors before suggesting how to design and analyze more complex experiments involving larger numbers of factors based on multiway analysis of variance.
Introduction
Mathematical theory and scientific practice have long shown the importance of statistical design for rigorous research. Nevertheless, biomedical research involving animals is fraught with bad design practices. 1 One of these practices involves the so-called “one-factor-at-a-time” (OFAT) approach, in which researchers conduct repeated experiments while only varying one experimental factor at a time. This is highly inefficient as it uses more resources and provides less accurate effect size estimates than statistical designs in which multiple factors are varied at the same time in a systematic fashion. In addition, testing only one factor at a time precludes the identification of interaction effects between different factors. 2
In our role as animal ethics committee members, we are often confronted with inadequate planning of animal studies that are based on an OFAT design. Even in cases in which researchers plan according to a factorial design, their conceptual approach to the analysis is often constricted by an OFAT mindset. Our aim is to identify this issue, give suggestions on how to overcome it, and thereby provide researchers with design strategies that help to reduce animal numbers and increase the precision of effect size estimates.
A first example involving two factors
Assume that we are interested in the effect of a new drug on the body weight of C57BL/6 mice on a high-fat diet. To make our findings more robust, we want to test the drug on mice of two different age levels. Consequently, we compare the effect of injections of drug D with the effect of injections of a saline solution (control) in a first experimental run with 6-week-old mice before comparing the same two treatments in a second experimental run with 12-week-old mice. Each experimental run lasts for a total of 6 weeks, after which the change in body weight is compared between the two groups. Different animals are used in each run because the mice are killed at the end of each experiment to measure additional outcomes of interest that serve as secondary endpoints. The example is entirely fictitious but loosely inspired by data from real experiments. 3
Table 1. In an OFAT design, each run considers only one factor at a time.
The OFAT design only allows us to compare the treatment effect of the drug with its control from the same run, that is, within a single age group.
Factorial design
For better inference, it would be beneficial to have all treatment groups simultaneously present in one run. This can be achieved using a factorial design as described in Table 2.
Table 2. In a factorial design, multiple factors are assessed simultaneously.
Here, τD and τA denote the effect of drug D and of an age of 12 weeks, respectively, each measured as the difference in weight compared with the group of 6-week-old mice receiving the control treatment. The parameter τDA represents deviations from additivity and is therefore a measure of the amount of interaction between the two factors drug and age. If there is no interaction between drug treatment and age, then the effect of the drug treatment is the same for both age groups. Mathematically speaking, the two effects “age” and “drug treatment” would behave additively, that is, the treatment average for mice receiving drug D at the age of 12 weeks would be the average of the 6-week-old control group plus τD plus τA, with τDA equal to zero.
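The parametrization just described can be written out for all four groups. In the sketch below, the symbol μ for the average of the 6-week-old control group is our own assumption; only τD, τA and τDA are named in the text.

```latex
\begin{aligned}
\text{6 weeks, control:}  \quad & \mu \\
\text{6 weeks, drug D:}   \quad & \mu + \tau_D \\
\text{12 weeks, control:} \quad & \mu + \tau_A \\
\text{12 weeks, drug D:}  \quad & \mu + \tau_D + \tau_A + \tau_{DA}
\end{aligned}
```

Without interaction, τDA = 0 and the last line reduces to μ + τD + τA, which is the additive case.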
In contrast to the OFAT design, the factorial design has only five animals in each group, but due to the better arrangement of the animals, the same number of observations per drug intervention group is achieved. If there is no interaction, then a single run with a factorial design with five animals per group (20 in total) thus has the same statistical power for testing the treatment effect of drug D as two separate runs of the OFAT design using double the number of animals (see Appendix A.2 for details). Furthermore, the factorial design allows us to estimate interaction effects between the drug and the age levels, which is impossible with an OFAT design. In short, a factorial design allows us to address twice the questions with half the animal numbers.
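The precision argument can be checked with a few lines of arithmetic. The following Python sketch is ours, not part of the original analysis. It assumes independent observations with a common standard deviation (the value 1.0 is hypothetical), no interaction, and 10 animals per group in each OFAT run, which is what "double the number of animals" implies. It compares standard errors only and ignores the small difference in residual degrees of freedom.

```python
from math import sqrt


def se_difference(sigma, n_a, n_b):
    """Standard error of the difference between two independent group means."""
    return sigma * sqrt(1.0 / n_a + 1.0 / n_b)


def se_factorial_main_effect(sigma, n_per_cell):
    """Standard error of the drug main effect in a balanced 2x2 design.

    The main effect is the average of the two age-specific
    drug-minus-control differences, each based on n_per_cell animals
    per group.
    """
    se_within_age = se_difference(sigma, n_per_cell, n_per_cell)
    # Averaging two independent differences halves the variance.
    return se_within_age / sqrt(2.0)


SIGMA = 1.0  # hypothetical common standard deviation

# One OFAT run: 10 drug vs 10 control animals of a single age.
se_ofat_run = se_difference(SIGMA, 10, 10)
animals_ofat = 2 * (10 + 10)  # two runs, one per age level

# Factorial design: 5 animals in each of the 4 drug-by-age cells.
se_factorial = se_factorial_main_effect(SIGMA, 5)
animals_factorial = 4 * 5
```

Under these assumptions both standard errors equal σ·√0.2, so the factorial design estimates the drug effect as precisely as a single OFAT run while covering both ages with 20 instead of 40 animals.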
The right design needs the right analysis
We next discuss the detrimental effects of analyzing the data from a factorial design in an OFAT fashion. In every experiment, it is important to specify the primary hypothesis to be tested, in particular, which comparisons are most important. In the original OFAT design from Table 1, the main interest lay in the treatment effect of the drug, which was assessed at two different time points. In the factorial design from Table 2, this can be rephrased as two separate questions.
(1) Is there a difference between the drug treatment group and the drug control group regardless of age? (2) Is the effect of the drug the same for both age groups (absence of interaction)?
To answer the second question, one must test for an interaction effect. As mentioned, an OFAT analysis, for example using pairwise t-tests, is incapable of doing so without invoking strong and often implausible assumptions. In contrast, the two-way analysis of variance (ANOVA) approach can be used to test whether there is an interaction effect. If this is the case, one would have to conclude that the treatment effect of the drug differs between 6- and 12-week-old mice. Only in that case should one perform separate tests for the two age groups (i.e. a so-called “sub-group analysis” 5 ). However, in this case, too, the factorial analysis holds an advantage, as t-tests based on factorial designs can use all four groups to estimate the standard errors, thereby increasing power when compared to an OFAT analysis (see Appendix A.2).
If there is no interaction, the factorial analysis based on the two-by-two ANOVA F-test for the treatment effect compares the 10 animals of the drug treatment group with the 10 animals of the drug control group, thereby implicitly controlling for the additive effect of the age factor. This is a more powerful approach than performing two separate OFAT t-tests, where only five animals per treatment group would be compared with each other at each time point and which would require a multiple comparison adjustment for the two independent tests, thereby further decreasing the power of the OFAT analysis.
In practice, when using a factorial analysis to test for the treatment effect, one should include the interaction term even if the test for interaction was not significant. 6 This is recommended, because a non-significant test result is no guarantee that an interaction effect is indeed absent (as is the case in our example in Appendix A.3).
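To make the two-way ANOVA with interaction concrete, here is a minimal Python sketch for a balanced two-by-two design. It is an illustration under our own assumptions, not the authors' code: the function name, the dictionary layout and the toy numbers are all invented for this example, and the numbers are not experimental data. It returns sums of squares and F statistics only; p-values would require an F distribution from a statistics package.

```python
from statistics import mean


def two_way_anova_2x2(cells):
    """Balanced two-way ANOVA with interaction for a 2x2 design.

    cells maps (drug_level, age_level) -> list of outcomes. Every cell
    must hold the same number of observations (at least two).
    """
    drugs = sorted({d for d, _ in cells})
    ages = sorted({a for _, a in cells})
    if len(drugs) != 2 or len(ages) != 2 or len(cells) != 4:
        raise ValueError("expected a complete 2x2 design")
    n = len(next(iter(cells.values())))
    if n < 2 or any(len(v) != n for v in cells.values()):
        raise ValueError("expected a balanced design with n >= 2 per cell")

    cell_mean = {k: mean(v) for k, v in cells.items()}
    # Averaging cell means is valid because the design is balanced.
    grand = mean(cell_mean.values())
    drug_mean = {d: mean(cell_mean[(d, a)] for a in ages) for d in drugs}
    age_mean = {a: mean(cell_mean[(d, a)] for d in drugs) for a in ages}

    ss = {
        "drug": 2 * n * sum((drug_mean[d] - grand) ** 2 for d in drugs),
        "age": 2 * n * sum((age_mean[a] - grand) ** 2 for a in ages),
        "drug:age": n * sum(
            (cell_mean[(d, a)] - drug_mean[d] - age_mean[a] + grand) ** 2
            for d in drugs
            for a in ages
        ),
        "residual": sum(
            (y - cell_mean[k]) ** 2 for k, ys in cells.items() for y in ys
        ),
    }
    # The residual variance is pooled over all four groups.
    df_residual = 4 * (n - 1)
    ms_residual = ss["residual"] / df_residual
    # Each effect in a 2x2 design has one degree of freedom.
    f_stat = {k: ss[k] / ms_residual for k in ("drug", "age", "drug:age")}
    return {"ss": ss, "df_residual": df_residual, "F": f_stat}


# Toy numbers, invented purely to exercise the function (not real data).
toy = {
    ("control", 6): [1, 3],
    ("drug", 6): [3, 5],
    ("control", 12): [4, 6],
    ("drug", 12): [6, 8],
}
toy_result = two_way_anova_2x2(toy)
```

The toy data are exactly additive, so the interaction sum of squares is zero, while the F statistics for drug and age are 4 and 9 on 1 and 4 degrees of freedom. With five animals per cell the pooled residual would have 16 degrees of freedom, compared with 8 for a t-test that uses only the two groups of one age.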
More than two factors
The example with two factors was very simple and mainly served to describe the fundamental principles of the factorial design and its analysis. In practice, experiments are often planned with more than two factors. For example, we might be interested in (A) whether the effect depends on the age of mice, (B) whether the drug is applied subcutaneously (s.c.) or by intragastric gavage (i.g.), (C) at which time point after drug intake measurements are taken, (D) the effects of several different diets, etc. When one plans such an experiment as a series of pairwise comparisons, this leads to exaggerated animal numbers as indicated by Table 3. For the sake of simplicity, we considered only one diet in the example.
Table 3. Number of animals requested for experiments including many factors.
i.g., intragastric gavage; s.c., subcutaneously.
For a statistician, this is an obvious candidate for a factorial design, but biomedical researchers often tend to approach it with an OFAT mindset: experiments for certain factors are performed independently and the sample size is justified based on a two-sample t-test to compare two groups just like in our first example. Instead, data from such an experimental design should be analyzed using a multiway ANOVA. This will generally lead to a substantially reduced number of animals, due to the gain in power following from the same principles which are described in Appendix A.3 for the two-way ANOVA.
To make best use of the multiway ANOVA approach, one must be clear about the primary research question and certain assumptions must be made. The more factors considered at the same time, the higher the order of interactions that might occur. Two-way interactions are fairly easy to interpret (see Figure 1), but higher-order interactions are often more difficult to visualize. Therefore, a pragmatic approach might be to ignore higher-order interactions in the analysis, if they are not of primary scientific interest. This could then lead to the implementation of so-called “incomplete factorial designs”, 7 where certain combinations of factors are deliberately omitted.

Figure 1. Interaction plots display the expected mean for each combination of factors: (a) without interaction; (b) with positive interaction.
While this seemingly reduces the amount of information gained by the experiment, a proper incomplete factorial design can actually help to make optimal use of the animals. Incomplete factorial designs are well understood theoretically and are widely used, particularly in industrial applications (see Montgomery 8 for an extensive exposition of incomplete factorial designs).
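One standard way to omit factor combinations systematically is a regular half-fraction of a two-level design. The Python sketch below is an illustration of that construction only and is not necessarily the design the authors have in mind. The factor names and the coding of each factor as −1/+1 are hypothetical.

```python
from itertools import product


def full_factorial(factors):
    """All combinations of coded levels (-1, +1) for the named factors."""
    return [
        dict(zip(factors, levels))
        for levels in product((-1, 1), repeat=len(factors))
    ]


def half_fraction(factors):
    """Keep the runs whose coded levels multiply to +1.

    For three factors A, B, C this is the half-fraction with defining
    relation I = ABC.
    """
    runs = []
    for run in full_factorial(factors):
        sign = 1
        for name in factors:
            sign *= run[name]
        if sign == 1:
            runs.append(run)
    return runs


# Hypothetical two-level coding, e.g. age (6 vs 12 weeks),
# route (s.c. vs i.g.) and time point (early vs late).
FACTORS = ["age", "route", "time_point"]
design = half_fraction(FACTORS)
```

This keeps four of the eight combinations, with every level of every factor appearing equally often. The price is that each main effect is aliased with a two-way interaction, so such a fraction is only sensible when those interactions can be assumed negligible.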
The exact planning of such a study would also have to take into account logistic restrictions such as the number of animals which can be handled at the same time. Consequently, the study might be conducted by performing several replications of experiments with the same incomplete factor combinations. This would have additional advantages in terms of assessing the replicability of research results.9,10
Discussion
In this short report, we have highlighted the main advantages of factorial designs and analyses compared with the OFAT approach. In particular, pairwise comparisons of groups using two-sample t-tests should not be the primary guide for planning and analyzing experiments in preclinical research. ANOVA approaches based on factorial designs have key advantages over experiments in which only one variable is assessed in each run. They have been successfully employed and studied intensively for more than a century in a range of different fields of science (see, e.g., pages 4–53 in Dean et al. 11 ). As such, it is high time that they become the standard rather than the exception in preclinical animal trials.
Of course, depending on the specific experimental setting there are other important design and analysis methods. To detect changes from baseline, an analysis of covariance (ANCOVA) is, in general, the preferred method of analysis.12,13 Similarly, randomized block designs allow for more precise effect size estimates when there is large variability between groups of experimental units.14,15 As such, the design and analysis of the example outlined above could be further improved by including the baseline weight as a covariate (leading to a multiway ANCOVA analysis) or by blocking for the experimental run (leading to a randomized block design).
If it is possible to measure the outcome variable without causing harm to the animal, then we might improve the design by collecting data from the same animals at multiple time points, to provide more insights on potential time trends or increase the precision of the estimates. However, discussing the analysis of designs with repeated measures producing longitudinal data is beyond the scope of this article.4,16,17
Sometimes we want to run several experiments concerned with the same research question. When planning such experiments, we need to decide if a correction for multiple testing should be done on the results of all experiments to control the family-wise error rate. 18 We might also consider a sequential design in which experiments are planned in a certain temporal order so that we can use the results of the first experiments to decide whether to conduct the subsequent ones. 19 Furthermore, there is the topic of preplanned replications of experiments, which can be considered as a special case of a block design but comes with its own set of statistical intricacies. 9
Finally, one should emphasize that statistical power may not always be the only way to determine the sample size. Animal trials often have a fairly exploratory character and estimation of quantities (for example, group means) might be more important than the testing of hypotheses. Sample size calculations can then be based on the desired precision of the estimates, for example expressed in the form of confidence intervals.20,21 In this setting, too, the multiway ANOVA has statistical advantages over an OFAT approach, as it will typically yield narrower confidence intervals. Hence, using factorial designs combined with ANOVA methods for analysis should be regarded as the bare minimum standard in experimental research involving animals.
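As a rough illustration of precision-based planning, the following Python sketch computes the smallest group size for which the confidence interval of a single group mean has at most a given half-width. It is a simplification under our own assumptions: the standard deviation is treated as known and a normal approximation is used, so a t-based calculation would give slightly larger numbers for small groups. The function name and the example values are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_for_ci_half_width(sigma, half_width, confidence=0.95):
    """Smallest n such that z * sigma / sqrt(n) <= half_width."""
    if sigma <= 0 or half_width <= 0:
        raise ValueError("sigma and half_width must be positive")
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return ceil((z * sigma / half_width) ** 2)


# Hypothetical example: SD of 2 g, desired half-width of 1 g.
n_example = n_for_ci_half_width(sigma=2.0, half_width=1.0)
```

With these example values the function returns 16 animals for the group. Halving the desired half-width roughly quadruples the required number.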
Footnotes
Data availability
No data were used or produced for this publication. All computations can be reproduced from the information in the text and appendix.
Ethics review
No animal or human trials were conducted for this publication; hence, no ethical board approval was required.
Conflict of Interest
The authors declare that they have no conflicts of interest with regard to the content of this article.
Funding
The authors did not receive financial support for the research, authorship, or publication of this article.
