Abstract

Animal research often involves experiments in which the effect of several factors on a particular outcome is of scientific interest. Many researchers approach such experiments by varying just one factor at a time. As a consequence, they design and analyze the experiments based on a pairwise comparison between two groups. However, this approach uses unreasonably large numbers of animals and leads to severe limitations in terms of the research questions that can be answered. Factorial designs and analyses offer a more efficient way to perform and assess experiments with multiple factors of interest. We will illustrate the basic principles behind these designs, discussing a simple example with only two factors before suggesting how to design and analyze more complex experiments involving larger numbers of factors based on multiway analysis of variance.

Keywords

ethics and welfare, experimental design, policy, reduction, sample size, statistics, techniques

Introduction

Mathematical theory and scientific practice have long shown the importance of statistical design for rigorous research. Nevertheless, biomedical research involving animals is fraught with bad design practices.¹ One of these practices involves the so-called “one-factor-at-a-time” (OFAT) approach, in which researchers conduct repeated experiments while only varying one experimental factor at a time. This is highly inefficient as it uses more resources and provides less-accurate effect size estimates than statistical designs in which multiple factors are varied at the same time in a systematic fashion. In addition, testing only one factor at a time precludes the identification of interaction effects between different factors.²

In our role as animal ethics committee members, we are often confronted with inadequate planning of animal studies that are based on an OFAT design. Even in cases in which researchers plan according to a factorial design, their conceptual approach to the analysis is often constricted by an OFAT mindset. Our aim is to identify this issue, give suggestions on how to overcome it, and thereby provide researchers with design strategies that help to reduce animal numbers and increase the precision of effect size estimates.

A first example involving two factors

Assume that we are interested in the effect of a new drug on the body weight of C57BL/6 mice on a high-fat diet. To make our findings more robust, we want to test the drug on mice of two different age levels. Consequently, we compare the effect of injections of drug D with the effect of injections of a saline solution (control) in a first experimental run with 6-week-old mice before comparing the same two treatments in a second experimental run with 12-week-old mice. Each experimental run lasts for a total of 6 weeks, after which the change in body weight is compared between the two groups. Different animals are used in each run because the mice are killed at the end of each experiment to measure additional outcomes of interest that serve as secondary endpoints. The example is entirely fictitious but loosely inspired by data from real experiments.³

Table 1 provides a summary of the characteristics of the OFAT design. For each experimental run, 20 animals are used. The body weight of the $k$th mouse in run $i \in \{1, 2\}$ is modeled for the control group C and the drug group D as follows:

$$\text{Control group: } Y_{i k}^{C} = μ_{i}^{C} + ϵ_{i k}^{C}, \qquad \text{Drug group: } Y_{i k}^{D} = μ_{i}^{D} + ϵ_{i k}^{D}$$

Table 1

In an OFAT design, each run considers only one factor at a time.

Run	Drug	Age	Sample size	Treatment average	Random mouse effect
1	Control	6 weeks	10	$μ_{1}^{C}$	$ϵ_{1 k}^{C}$
1	Drug D	6 weeks	10	$μ_{1}^{D} = μ_{1}^{C} + τ_{1}^{D}$	$ϵ_{1 k}^{D}$
2	Control	12 weeks	10	$μ_{2}^{C}$	$ϵ_{2 k}^{C}$
2	Drug D	12 weeks	10	$μ_{2}^{D} = μ_{2}^{C} + τ_{2}^{D}$	$ϵ_{2 k}^{D}$

Here, $μ_{i}^{C}$ and $μ_{i}^{D}$ denote the average body weight of mice receiving the control treatment and the drug D, respectively, in run $i$. In each group and run, the individual body weight of each mouse is modeled as a random deviation from this average body weight. Hence, treatment averages ($μ_{i}^{D}, μ_{i}^{C}$) are considered fixed whereas the individual mouse effects ($ϵ_{i k}^{C}, ϵ_{i k}^{D}$) are modeled as random variables. For the sake of simplicity, we make the classical model assumption of independently and identically normally distributed error terms ($ϵ_{i k}^{C}$ and $ϵ_{i k}^{D}$) and additionally assume that there are no missing values. It should be noted that the error terms could be distributed in any number of ways, the correct choice depending on the specific application in question, and that missing values can complicate the computations of the error terms (see, e.g., Hinkelmann and Kempthorne,⁴ chapter 9.5).

The OFAT design only allows us to compare the treatment effect of the drug with its control from the same run, that is, $μ_{i}^{D} - μ_{i}^{C}$ for $i \in \{1, 2\}$. For a comparison of the drug effect across runs, one might be tempted to compare $μ_{1}^{D} - μ_{1}^{C}$ with $μ_{2}^{D} - μ_{2}^{C}$. However, this reasoning rests on the assumption that the drug effect is not influenced by any unknown factor that might change between the runs. This assumption is not necessarily fulfilled, nor can it be tested. As an additional disadvantage, the OFAT design does not allow us to disentangle the effect of age from any other factor that might have changed between the first and the second run.
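The confounding of age and run can be made explicit with a small worked equation. As a purely illustrative assumption, let $a$ denote the true effect of age on control mice and let $ρ$ denote an unknown run effect (for example, a change in housing conditions or experimenter); neither symbol is part of the model above. Then the difference between the two control averages is

$$μ_{2}^{C} - μ_{1}^{C} = a + ρ$$

Only the sum $a + ρ$ can be estimated from the two runs. Because age and run always change together in the OFAT design, no analysis of these data can separate the two terms.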

Factorial design

For better inference, it would be beneficial to have all treatment groups simultaneously present in one run. This can be achieved using a factorial design as described in Table 2.

Table 2

In a factorial design, multiple factors are assessed simultaneously.

Run	Drug	Age	Sample size	Treatment average	Mouse effect
1	Control C	6 weeks	5	$μ^{C, 6}$	$ϵ_{k}^{C, 6}$
1	Drug D	6 weeks	5	$μ^{D, 6} = μ^{C, 6} + τ^{D}$	$ϵ_{k}^{D, 6}$
1	Control C	12 weeks	5	$μ^{C, 12} = μ^{C, 6} + τ^{A}$	$ϵ_{k}^{C, 12}$
1	Drug D	12 weeks	5	$μ^{D, 12} = μ^{C, 6} + τ^{D} + τ^{A} + τ^{D A}$	$ϵ_{k}^{D, 12}$

Here, $τ^{D}$ and $τ^{A}$ denote the effect of drug D and age of 12 weeks, respectively, measured as the difference in weight compared with the group of 6-week-old mice receiving the control treatment. The parameter $τ^{D A}$ represents deviations from additivity and is therefore a measure of the amount of interaction between the two factors drug and age. If there is no interaction between drug treatment and age, then the effect of the drug treatment is the same for both age groups. Mathematically speaking, the two effects “age” and “drug treatment” would behave additively, that is, the treatment average for mice receiving drug D at the age of 12 weeks would be $μ^{C, 6} + τ^{D} + τ^{A}$ (see Appendix A.1 for a detailed explanation of interaction effects).
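Substituting the treatment averages from Table 2 shows that the interaction parameter is a difference of differences, that is, the drug effect in 12-week-old mice minus the drug effect in 6-week-old mice:

$$\begin{aligned} μ^{D, 6} - μ^{C, 6} &= τ^{D} \\ μ^{D, 12} - μ^{C, 12} &= τ^{D} + τ^{D A} \\ (μ^{D, 12} - μ^{C, 12}) - (μ^{D, 6} - μ^{C, 6}) &= τ^{D A} \end{aligned}$$

If $τ^{D A} = 0$, both age groups share the same drug effect $τ^{D}$.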

In contrast to the OFAT design, the factorial design has only five animals in each group, but due to the better arrangement of the animals, the same number of observations per drug intervention group is achieved. If there is no interaction, then a single run with a factorial design with five animals per group (20 in total) thus has the same statistical power for testing the treatment effect of the drug D as two separate runs of the OFAT design using double the number of animals (see Appendix A.2 for details). Furthermore, the factorial design allows us to estimate interaction effects between the drug and the age levels, something which would be impossible to do with an OFAT design. In short, a factorial design allows us to address twice the questions with half the animal numbers.
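The arithmetic behind this comparison can be sketched in a few lines. The snippet below is an illustration rather than a reproduction of Appendix A.2: it assumes no interaction, independent errors with a common variance (set to 1 as an arbitrary unit), and a hypothetical helper `var_diff` that returns the variance of a difference between two group means.

```python
from fractions import Fraction


def var_diff(n_drug, n_control, sigma2=Fraction(1)):
    """Variance of the difference between two independent group means."""
    return sigma2 / n_drug + sigma2 / n_control


# OFAT: each run compares 10 drug mice with 10 control mice (20 animals per run).
ofat_per_run = var_diff(10, 10)

# Factorial: 5 mice per cell, but the drug main effect pools both age groups,
# so 10 drug mice are compared with 10 control mice (20 animals in total).
factorial_main = var_diff(5 + 5, 5 + 5)

# OFAT-style analysis of factorial data: 5 vs 5 within a single age group.
within_age = var_diff(5, 5)

print(ofat_per_run, factorial_main, within_age)
```

Under these assumptions, the pooled factorial comparison is as precise as one full OFAT run, whereas splitting the factorial data by age doubles the variance of the estimated drug effect.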

The right design needs the right analysis

We next discuss the detrimental effects of analyzing the data from a factorial design in an OFAT fashion. In every experiment, it is important to specify the primary hypothesis to be tested, in particular, which comparisons are most important. In the original OFAT design from Table 1, the main interest lay in the treatment effect of the drug, which was assessed at two different time points. In the factorial design from Table 2, this can be rephrased as two separate questions.

Is there a difference between the drug treatment group and the drug control group regardless of age?

Is the effect of the drug the same for both age groups (absence of interaction)?

To answer the second question, one must test for an interaction effect. As mentioned, an OFAT analysis, for example using pairwise t-tests, is incapable of doing so without invoking strong and often implausible assumptions. In contrast, the two-way analysis of variance (ANOVA) approach can be used to test whether there is an interaction effect. If this is the case, one would have to conclude that the treatment effect of the drug differs between 6- and 12-week-old mice. Only in that case should one perform separate tests for the two age groups (i.e. a so-called “sub-group analysis”⁵). However, in this case, too, the factorial analysis holds an advantage, as t-tests based on factorial designs can use all four groups to estimate the standard errors, thereby increasing power when compared to an OFAT analysis (see Appendix A.2).

If there is no interaction, the factorial analysis based on the two-by-two ANOVA F-test for the treatment effect compares the 10 animals of the drug treatment group with the 10 animals of the drug control group, thereby implicitly controlling for the additive effect of the age factor. This is a more powerful approach than performing two separate OFAT t-tests, where only five animals per treatment group would be compared with each other at each time point and which would require a multiple comparison adjustment for the two independent tests, thereby further decreasing the power of the OFAT analysis.

In practice, when using a factorial analysis to test for the treatment effect, one should include the interaction term even if the test for interaction was not significant.⁶ This is recommended, because a non-significant test result is no guarantee that an interaction effect is indeed absent (as is the case in our example in Appendix A.3).
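To make the two-way ANOVA concrete, the following sketch computes the sums of squares and F statistics for a balanced two-by-two design in plain Python. It is a minimal illustration under stated assumptions: the function name `two_way_anova`, the simulated data, and all parameter values (`mu_c6`, `tau_d`, `tau_a`, `tau_da`, `sigma`) are hypothetical and not taken from the appendix, and p-values are omitted because they would require the F distribution from a statistics package.

```python
import random
from statistics import mean


def two_way_anova(cells):
    """Balanced two-way ANOVA with interaction.

    `cells` maps (drug, age) -> list of outcomes of equal length.
    Returns sums of squares, degrees of freedom, and F statistics.
    """
    drugs = sorted({d for d, _ in cells})
    ages = sorted({a for _, a in cells})
    n = len(cells[(drugs[0], ages[0])])
    if any(len(v) != n for v in cells.values()):
        raise ValueError("this sketch only handles balanced designs")

    grand = mean(y for v in cells.values() for y in v)
    cell_m = {k: mean(v) for k, v in cells.items()}
    drug_m = {d: mean(cell_m[(d, a)] for a in ages) for d in drugs}
    age_m = {a: mean(cell_m[(d, a)] for d in drugs) for a in ages}

    ss = {
        "drug": n * len(ages) * sum((drug_m[d] - grand) ** 2 for d in drugs),
        "age": n * len(drugs) * sum((age_m[a] - grand) ** 2 for a in ages),
        "drug:age": n * sum(
            (cell_m[(d, a)] - drug_m[d] - age_m[a] + grand) ** 2
            for d in drugs for a in ages
        ),
        # The error term pools the within-group variation of all four groups.
        "error": sum((y - cell_m[k]) ** 2 for k, v in cells.items() for y in v),
    }
    df = {
        "drug": len(drugs) - 1,
        "age": len(ages) - 1,
        "drug:age": (len(drugs) - 1) * (len(ages) - 1),
        "error": len(drugs) * len(ages) * (n - 1),
    }
    ms_error = ss["error"] / df["error"]
    f_stat = {k: (ss[k] / df[k]) / ms_error for k in ("drug", "age", "drug:age")}
    return ss, df, f_stat


# Simulated weight changes; every number here is a hypothetical choice.
random.seed(0)
mu_c6, tau_d, tau_a, tau_da, sigma = 8.0, -3.0, 1.0, 0.0, 1.5
true_means = {
    ("C", 6): mu_c6,
    ("D", 6): mu_c6 + tau_d,
    ("C", 12): mu_c6 + tau_a,
    ("D", 12): mu_c6 + tau_d + tau_a + tau_da,
}
cells = {k: [random.gauss(m, sigma) for _ in range(5)] for k, m in true_means.items()}

ss, df, f_stat = two_way_anova(cells)
print(df["error"], {k: round(v, 2) for k, v in f_stat.items()})
```

The error term has 16 degrees of freedom because it uses all four groups, compared with 8 for a two-sample t-test restricted to the five drug and five control mice of a single age group.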

More than two factors

The example with two factors was very simple and mainly served to describe the fundamental principles of the factorial design and its analysis. In practice, experiments are often planned with more than two factors. For example, we might be interested in (A) whether the effect depends on the age of mice, (B) whether the drug is applied subcutaneously (s.c.) or by intragastric gavage (i.g.), (C) at which time point after drug intake measurements are taken, (D) the effects of several different diets, etc. When one plans such an experiment as a series of pairwise comparisons, this leads to exaggerated animal numbers as indicated by Table 3. For the sake of simplicity, we considered only one diet in the example.

Table 3

Number of animals requested for experiments including many factors.

Age	Application	Treatment	Day 3	Day 6	Day 10	Day 21	Total
6 weeks	s.c.	Control	10	10	10	10	40
6 weeks	s.c.	Drug 1	10	10	10	10	40
6 weeks	i.g.	Control	10	10	10	10	40
6 weeks	i.g.	Drug 2	10	10	10	10	40
12 weeks	s.c.	Control	10	10	10	10	40
12 weeks	s.c.	Drug 1	10	10	10	10	40
12 weeks	i.g.	Control	10	10	10	10	40
12 weeks	i.g.	Drug 2	10	10	10	10	40
							320

i.g., intragastric gavage; s.c., subcutaneously.

For a statistician, this is an obvious candidate for a factorial design, but biomedical researchers often tend to approach it with an OFAT mindset: experiments for certain factors are performed independently and the sample size is justified based on a two-sample t-test to compare two groups just like in our first example. Instead, data from such an experimental design should be analyzed using a multiway ANOVA. This will generally lead to a substantially reduced number of animals, due to the gain in power following from the same principles which are described in Appendix A.3 for the two-way ANOVA.
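The bookkeeping behind Table 3 can be sketched as follows. This is an illustration only: it treats “Drug 1” and “Drug 2” as a single “drug” level, and the alternative group size of 3 is a hypothetical value chosen to show the arithmetic, not a recommendation or the result of a power calculation.

```python
from itertools import product

# Factors and levels as listed in Table 3 (drug labels merged, see lead-in).
factors = {
    "age": ["6 weeks", "12 weeks"],
    "application": ["s.c.", "i.g."],
    "treatment": ["control", "drug"],
    "day": [3, 6, 10, 21],
}
cells = list(product(*factors.values()))


def totals(n_per_cell):
    """Total animals and animals per treatment arm for a given cell size."""
    total = n_per_cell * len(cells)
    per_treatment_arm = total // len(factors["treatment"])
    return total, per_treatment_arm


print(len(cells), totals(10), totals(3))
```

With pairwise planning, every one of the 32 cells is sized for its own 10-versus-10 comparison. In a multiway ANOVA, the main effect of treatment instead compares all drug animals with all control animals, so even a much smaller cell size leaves a sizeable number of animals per treatment arm, provided that interactions are either modeled or plausibly negligible.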

To make best use of the multiway ANOVA approach, one must be clear about the primary research question and certain assumptions must be made. The more factors considered at the same time, the higher the order of interactions that might occur. Two-way interactions are fairly easy to interpret (see Figure 1), but higher-order interactions are often more difficult to visualize. Therefore, a pragmatic approach might be to ignore higher-order interactions in the analysis, if they are not of primary scientific interest. This could then lead to the implementation of so-called “incomplete factorial designs”,⁷ where certain combinations of factors are deliberately omitted.

Fig. 1

Interaction plots display the expected mean for each combination of factors: (a) without interaction; (b) with positive interaction.

While this seemingly reduces the amount of information gained by the experiment, a proper incomplete factorial design can actually help to make optimal use of the animals which are used. Incomplete factorial designs are well understood theoretically and are widely used, particularly in industrial applications (see Montgomery⁸ for an extensive exposition of incomplete factorial designs).
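One common way of omitting factor combinations is the half-fraction of a two-level design. The sketch below constructs such a fraction for three factors using the defining relation I = ABC. It is only one possible instance of an incomplete design: the choice of factors, the coding of levels as -1 and +1, and the omission of the four-level time-point factor are simplifying assumptions made for illustration.

```python
from itertools import product

# Three two-level factors from the example, coded as -1 / +1 (assumed coding).
factor_names = ("age", "application", "treatment")

# Full factorial: all 2^3 = 8 combinations.
full = list(product((-1, 1), repeat=len(factor_names)))

# Half fraction with defining relation I = ABC:
# keep the combinations whose coded levels multiply to +1.
half = [run for run in full if run[0] * run[1] * run[2] == 1]

for run in half:
    print(dict(zip(factor_names, run)))
```

The fraction uses four of the eight combinations and keeps every factor balanced. The price is that each main effect is aliased with the two-way interaction of the other two factors, so this particular fraction is only appropriate if those interactions can be assumed to be negligible.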

The exact planning of such a study would also have to take into account logistic restrictions such as the number of animals which can be handled at the same time. Consequently, the study might be conducted by performing several replications of experiments with the same incomplete factor combinations. This would have additional advantages in terms of assessing the replicability of research results.⁹,¹⁰

Discussion

In this short report, we have highlighted the main advantages of factorial designs and analyses compared with the OFAT approach. In particular, pairwise comparisons of groups using two-sample t-tests should not be the primary guide for planning and analyzing experiments in preclinical research. ANOVA approaches based on factorial designs have key advantages over experiments in which only one variable is assessed in each run. They have been successfully employed and studied intensively for more than a century in a range of different fields of science (see, e.g., pages 4–53 in Dean et al.¹¹). As such, it is high time that they become the standard rather than the exception in preclinical animal trials.

Of course, depending on the specific experimental setting there are other important design and analysis methods. To detect changes from baseline, an analysis of covariance (ANCOVA) is, in general, the preferred method of analysis.¹²,¹³ Similarly, randomized block designs allow for more precise effect size estimates when there is large variability between groups of experimental units.¹⁴,¹⁵ As such, the design and analysis of the example outlined above could be further improved by including the baseline weight as a covariate (leading to a multiway ANCOVA analysis) or by blocking for the experimental run (leading to a randomized block design).

If it is possible to measure the outcome variable without causing harm to the animal, then we might improve the design by collecting data from the same animals at multiple time points, to provide more insights on potential time trends or increase the precision of the estimates. However, discussing the analysis of designs with repeated measures producing longitudinal data is beyond the scope of this article.⁴,¹⁶,¹⁷

Sometimes we want to run several experiments concerned with the same research question. When planning such experiments, we need to decide if a correction for multiple testing should be done on the results of all experiments to control the family-wise error rate.¹⁸ We might also consider a sequential design in which experiments are planned in a certain temporal order so that we can use the results of the first experiments to decide whether to conduct the subsequent ones.¹⁹ Furthermore, there is the topic of preplanned replications of experiments, which can be considered as a special case of a block design but comes with its own set of statistical intricacies.⁹
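As one example of a correction that controls the family-wise error rate, the sketch below implements Holm's step-down procedure. The choice of Holm's method is ours for illustration and is not prescribed by the text; the function name `holm_reject` and the three p-values are hypothetical.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure; returns one reject decision per p-value."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The smallest p-value is compared with alpha/m, the next with
        # alpha/(m-1), and so on; stop at the first non-rejection.
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break
    return reject


def bonferroni_reject(p_values, alpha=0.05):
    """Bonferroni correction, shown for comparison."""
    return [p <= alpha / len(p_values) for p in p_values]


p_values = [0.012, 0.020, 0.040]  # hypothetical results of three experiments
print(holm_reject(p_values), bonferroni_reject(p_values))
```

Both procedures control the family-wise error rate at the chosen level, but Holm's procedure never rejects fewer hypotheses than the Bonferroni correction.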

Finally, one should emphasize that statistical power may not always be the only way to determine the sample size. Animal trials often have a fairly exploratory character and estimation of quantities (for example, group means) might be more important than the testing of hypotheses. Sample size calculations can then be based on the desired precision of the estimates, for example expressed in the form of confidence intervals.²⁰,²¹ In this setting, too, the multiway ANOVA has statistical advantages over an OFAT approach, as it will typically yield narrower confidence intervals. Hence, using factorial designs combined with ANOVA methods for analysis should be regarded as the bare minimum standard in experimental research involving animals.
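A precision-based sample size calculation can be sketched as follows. The snippet is a simplified illustration: it uses a normal approximation with a known standard deviation instead of the t distribution (which would give slightly larger group sizes), and the function names as well as the values for `target` and `sigma` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def half_width(n_per_group, sigma, level=0.95):
    """Approximate half-width of a confidence interval for a difference
    between two group means with n_per_group animals in each group."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return z * sigma * sqrt(2 / n_per_group)


def smallest_n(target, sigma, level=0.95):
    """Smallest group size whose half-width does not exceed `target`."""
    n = 2
    while half_width(n, sigma, level) > target:
        n += 1
    return n


# Hypothetical inputs: standard deviation 1.5 g, desired half-width 1.0 g.
print(smallest_n(target=1.0, sigma=1.5))
```

In a factorial design, `n_per_group` refers to the number of animals per level of the factor of interest, pooled over the levels of the other factors, which is why the same precision can be reached with fewer animals per cell.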

Footnotes

Data availability

No data were used or produced for this publication. All computations can be reproduced from the information in the text and appendix.

Ethics review

No animal or human trials were conducted for this publication; hence, no ethical board approval was required.

Conflict of Interest

The authors declare that they have no conflicts of interest with regard to the content of this article.

Funding

The authors did not receive financial support for the research, authorship, or publication of this article.

ORCID iDs

Servan Luciano Grüninger

Florian Frommlet

References

1. Rowe. Recommendations to improve use and reporting of statistics in animal experiments. Lab Anim. 2023; 57(3): 224–235. doi:10.1177/00236772221140669

2. Czitrom. One-factor-at-a-time versus designed experiments. Am Stat. 1999; 53(2): 126–131. doi:10.1080/00031305.1999.10474445

3. Coskun, Urva, Roell, Loghin, Moyers, et al. LY3437943, a novel triple glucagon, GIP, and GLP-1 receptor agonist for glycemic control and weight loss: from discovery to clinical proof of concept. Cell Metab. 2022; 34(9): 1234–1247.e9. doi:10.1016/j.cmet.2022.07.013

4. Hinkelmann, Kempthorne (eds). Design and Analysis of Experiments. 2nd ed. Wiley Series in Probability and Statistics. Hoboken, NJ: Wiley-Interscience; 2008.

5. Sun, Ioannidis JPA, Agoritsas, Alba, Guyatt. How to use a subgroup analysis: users’ guide to the medical literature. JAMA. 2014; 311(4): 405. doi:10.1001/jama.2013.285063

6. Muralidharan, Romero, Wüthrich. Factorial designs, model selection, and (incorrect) inference in randomized experiments. Rev Econ Stat. 2023:1–44. doi:10.1162/rest_a_01317

7. Byar, Herzberg, Tan WY. Incomplete factorial designs for randomized clinical trials. Statist Med. 1993; 12(17): 1629–1641. doi:10.1002/sim.4780121708

8. Montgomery DC. Design and Analysis of Experiments. Hoboken, NJ: John Wiley & Sons; 2017.

9. Frommlet, Heinze. Experimental replications in animal trials. Lab Anim. 2021; 55(1): 65–75. doi:10.1177/0023677220907617

10. von Kortzfleisch, Karp, Palme, Kaiser, Sachser, Richter SH. Improving reproducibility in animal research by splitting the study population into several ‘mini-experiments’. Sci Rep. 2020; 10(1): 16579. doi:10.1038/s41598-020-73503-4

11. Dean, Morris, Stufken, Bingham. Handbook of Design and Analysis of Experiments, vol. 7. Boca Raton, FL: CRC Press; 2015.

12. Karp, Segonds-Pichon, Gerdin AKB, Ramírez-Solis, White JK. The fallacy of ratio correction to address confounding factors. Lab Anim. 2012; 46(3): 245–252. doi:10.1258/la.2012.012003

13. Clifton. The correlation between baseline score and post-intervention score, and its implications for statistical analysis. Rev Econ Stat. 2019; 20(43): 1–6.

14. Festing MFW. Randomized block experimental designs can increase the power and reproducibility of laboratory animal experiments. ILAR J. 2014; 55(3): 472–476. doi:10.1093/ilar/ilu045

15. Bailey. Design of Comparative Experiments. Cambridge Series on Statistical and Probabilistic Mathematics. Cambridge; New York: Cambridge University Press; 2008.

16. Kristensen, Hansen. Statistical analyses of repeated measures in physiological research: a tutorial. Adv Physiol Educ. 2004; 28(1): 2–14. doi:10.1152/advan.00042.2003

17. Duricki, Soleman, Moon LDF. Analysis of longitudinal data from animals with missing values using SPSS. Nat Protoc. 2016; 11(6): 1112–1129. doi:10.1038/nprot.2016.048

18. Goeman, Solari. Multiple testing for exploratory research. Statist Sci. 2011; 26(4): 584–597. doi:10.1214/11-STS356

19. Neumann, Grittner, Piper, Rex, Florez-Vargas, Karystianis, et al. Increasing efficiency of preclinical research by group sequential designs. PLoS Biol. 2017; 15(3): e2001307. doi:10.1371/journal.pbio.2001307

20. Greenland. On sample-size and power calculations for studies using confidence intervals. Am J Epidemiol. 1988; 128(1): 231–237. doi:10.1093/oxfordjournals.aje.a114945

21. Greenland, Senn, Rothman, Carlin, Poole, Goodman, et al. Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. Eur J Epidemiol. 2016; 31(4): 337–350. doi:10.1007/s10654-016-0149-3

Half the price, twice the gain: How to simultaneously decrease animal numbers and increase precision with good experimental design