Abstract
A phenotyping study records physiologic or morphologic changes in an experimental animal resulting from an intervention. In mice, this intervention is most frequently genetic, but it may be any type of experimental manipulation. Accurate representation of the human condition under study is essential if the model is to yield useful conclusions. In this review, general approaches to the design of phenotyping studies are considered. These approaches take into account major sources of reduced model validity, such as unexpected phenotypic variation in mice, evolutionary divergence between mice and humans, unanticipated sources of variation, and common design errors. As poor design is the most common reason why studies fail to yield enduring results, emphasis is placed on reduction of bias, sampling, controlled study design, and appropriate statistical analysis.
Keywords
John Steinbeck could never have foreseen how often the title of his book
What Constitutes a Phenotyping Study?
In its broadest sense, a phenotyping study records clinical, morphologic, physiologic, or cellular changes in mice resulting from an intervention. Today, this intervention is most frequently genetic but may be any type of experimental manipulation including dietary, pharmacologic, infectious, physical, or surgical. 36 Most commonly, phenotyping is only one aspect of a relatively narrowly focused hypothesis-driven study; however, it may take center stage in enormous in vivo phenotyping projects that lack a specific hypothesis. 80 In this section, approaches to major categories of hypothesis-driven and large-scale phenotyping are discussed.
Hypothesis-Driven Phenotyping
Our capacity to generate genetically defined rodents was revolutionized first by the random introduction of new genes into the mouse genome, 7,57 and then by the ability to inactivate specific genes in murine embryonic stem cells. 72 Most commonly, GEM studies entail characterizing the phenotype induced by genetic alteration of a known gene that is often associated with a human disorder. In these studies, the only experimental variable is the introduced genetic alteration, although many unintended variables such as strain and concurrent disease may be present. 6,82 Reports typically include methods of generating the mice, clinical and clinicopathologic data, salient histopathologic lesions, and molecular data illustrating a potential mechanism of disease. 77 In strongly mechanistic studies, clinical or pathologic abnormalities may be minimally reported (or reported in supplementary data) in favor of molecular or biochemical data. 77 In general, four to ten mice per sex, genotype, and age group are used. 6,41,64,82 In some reports, animal numbers may be as small as two to four animals per experiment, 77 whereas in others, variation inherent in the technology (called measurement error) or biological outcome requires more animals. 43 In practice, the required sample size depends on how variable the phenotype is; this is discussed in detail below.
Large-Scale Phenotyping
Several large-scale, random mutagenesis efforts continue to generate mice in which phenotype must be assessed to infer gene function. 25 Phenotypic assessment of large numbers of mice must integrate the results of predetermined phenotyping screens and unified databases with comparable datasets and analytic methods. 80 The challenges posed by this “hypothesis-independent” approach lie more in large-scale logistics and implementation than in the principles of study design inherent to hypothesis-driven research. In Europe, several research organizations spanning the nations of the European Commission have developed programs and protocols for phenotyping genetically altered mice (http://empress.har.mrc.ac.uk/viewempress). The International Knockout Mouse Consortium, which includes the NIH knockout mouse program (KOMP) (http://www.komp.org/), will soon attempt to phenotype numerous new lines of GEM using these European or other protocols (http://www.knockoutmouse.org/). Recently, Genentech completed large-scale phenotyping on over 400 knockout lines for secreted and transmembrane proteins using some of the suggested protocols mentioned above. 70
How Good a Model Is the Mouse, Really?
We often hear that mice are not small, furry humans. Nevertheless, they are the most commonly used animal to model diseases that are often uniquely human. Their use is valid if the parameters of the hypothesis are clearly defined. For example, a mouse model cannot replicate the cognitive and emotional spectrum of Alzheimer’s disease in humans. However, a mouse can be used to explore the biochemical effects of amyloid precursor protein in nervous tissue. 15 Even so, the ultimate contribution of animal research to human health has been questioned. 24,38,58 Much of this criticism is directed at problems of study design, 5,24,54 which will be addressed below. However, some less familiar sources of reduced model validity arise from physiologic and genetic differences between mice and humans, and these sources are described below. 46
Evolutionary Divergence May Create Unexpected Phenotypes
The use of animal models is based on the evidence that the physiology of divergent organisms is driven by homologous mechanisms. For example, loss of the transcription factor Pax 6 (PAX6) results in similar ocular defects in mice, humans, and Drosophila. 31
Ectopic expression of the mouse
Functional differences in human and mouse genes can be difficult or impossible to predict and usually emerge during the course of the experiment. 3
For example, the
Background Genetic Heterogeneity in Humans Exceeds That in Mice
The majority of prevalent human conditions such as diabetes; obesity; and immunologic, cardiovascular, and aging-associated diseases are complex phenotypes that result from interaction of multiple genetic and environmental factors. 10,50,75 To study these conditions in mice, the predominant methodologic bias is toward simplification of a complex system into component subsystems that isolate a single intervention against a background of controlled variables. 51 Although this approach may be used to establish causation of individual variables, conclusions reached may not apply when complex interactions are considered in the whole system. With a view to better modeling of the genomic diversity underlying complex traits, investigators at Oak Ridge National Laboratory have developed a new genetic reference population of mice (the Collaborative Cross) derived from eight inbred strains. 34 However, for most studies involving GEM, it is recommended that the mutation be back-crossed onto a single strain for 10 generations (congenic mice). 61 The phenotype of animals that have been incompletely back-crossed to a single strain may be quite variable, and in some cases, the phenotype may be lost on one background and strengthened on another. 61 A phenotype that is consistent regardless of background strain is likely to prove most useful over time.
Disease-Causing Mutations in Humans Encompass Broad Allelic Spectra
The majority of highly prevalent human diseases result from an allelic spectrum within a single gene, or multiple independent alleles that predispose the individual to disease. 48 This “geneticist’s nightmare” cannot be accurately modeled by the complete elimination of gene function seen in knockout mice. The goal of KOMP and other high-throughput knockout projects is to observe the effect of loss of function of every gene in the mouse genome. Although these effects may be embryonic lethal and often do not reflect the corresponding human condition, this approach provides an essential platform upon which more subtle defects such as ENU point mutagenesis 53 may be assessed.
Gene–Environment Interaction
The interaction between genotype and factors such as diet, lifestyle, and/or dwelling situation results in a more diverse phenotype in humans than can be accurately modeled in mice. Further, in many human studies, measures of quality of life, mental health, and physical function may be as important as a primary disease-specific outcome, and these measures cannot be directly assessed in animal models. 68 Genetic background may profoundly affect the result of an environmental intervention, as is commonly seen in the varying effect that mouse strain has on an intervention such as diet. 12,23,30 Standardization of test conditions is generally regarded as a good practice, as it minimizes the effects of potentially confounding variables on the primary research question. 76 Recently, Richter et al raised an interesting point regarding behavioral studies, which are notorious for the degree to which the local environment influences outcome. 11,60,76 They suggest that rigid standardization is ultimately impossible to replicate across laboratories, that imposing it may yield results specific to a particular laboratory, and that phenotypes that persist across some degree of heterogenization may be more reproducible. 60 Nevertheless, for most purposes, eliminating confounding variables from a study design is recommended.
Gene–Gene Interaction
Interaction between non-allelic genes, or epistasis, is a well-established phenomenon that underlies the variation of phenotype induced by a single genetic alteration in differing mouse strains. 32,44,55,65 More recently, the capacity to assess genome-wide gene expression has fostered our ability to assess functional interactions at the transcriptional level. Using a variety of computational methods, 42 expression data may be aggregated into functional gene networks. Although in its infancy, a systems approach to understanding disease is rapidly evolving in parallel with whole-genome mutagenesis. 66,79
General Principles of Good Study Design
Guidelines for experimental design, analysis, and reporting are available in the literature, 19–21,37,67,78 and of course through consultation with a statistician. A checklist of design and statistical parameters to consider when performing phenotyping experiments is given in Table 1 and described in more detail below.
Table 1. Checklist of Design Parameters to Consider When Performing Phenotyping Studies
Hypothesis
A hypothesis is a statement of a statistically testable outcome. It sets the framework for the experimental design and thus forms the backbone of the experiment. The hypothesis is often framed as two sided, with a null and alternate hypothesis (eg, the null hypothesis states that there is no difference between the experimental genotype and wild type for levels of interleukin-6). Alternately, it may be presented as one sided (eg, the experimental genotype has a higher level of interleukin-6 than the wild-type control group).
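The distinction between a two-sided and a one-sided hypothesis can be illustrated with a short permutation test. The following is a minimal sketch rather than a recommended analysis: the function name `permutation_p_value`, the interleukin-6 values, and the choice of a permutation test are all assumptions made for illustration and do not come from any study cited here.

```python
import random
from statistics import mean


def permutation_p_value(control, mutant, alternative="two-sided",
                        n_perm=10000, seed=0):
    """Permutation test on the difference in group means (mutant - control).

    alternative="two-sided": H1 is that the group means differ.
    alternative="greater":   H1 is that the mutant mean is higher.
    """
    if alternative not in ("two-sided", "greater"):
        raise ValueError("alternative must be 'two-sided' or 'greater'")
    rng = random.Random(seed)
    observed = mean(mutant) - mean(control)
    pooled = list(control) + list(mutant)
    n_control = len(control)
    extreme = 0
    for _ in range(n_perm):
        # Under the null hypothesis the genotype labels are exchangeable.
        rng.shuffle(pooled)
        diff = mean(pooled[n_control:]) - mean(pooled[:n_control])
        if alternative == "two-sided":
            extreme += abs(diff) >= abs(observed)
        else:
            extreme += diff >= observed
    # Add one to numerator and denominator so the p value is never zero.
    return (extreme + 1) / (n_perm + 1)


# Hypothetical interleukin-6 concentrations (pg/mL), invented for illustration.
wild_type = [10.1, 12.3, 9.8, 11.0, 10.5, 11.7]
mutant = [13.2, 14.8, 12.9, 15.1, 13.7, 14.0]

p_two_sided = permutation_p_value(wild_type, mutant, "two-sided")
p_one_sided = permutation_p_value(wild_type, mutant, "greater")
print(f"two-sided p = {p_two_sided:.4f}, one-sided p = {p_one_sided:.4f}")
```

Because the one-sided test counts only differences in the stated direction, its p value cannot exceed the two-sided value when the observed difference lies in that direction. For this reason the direction should be fixed before the data are examined.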
Description of Variables and Experiments
Next, the ability to replicate a study rests heavily on good reporting of the animal variables used. This information should ideally be presented in the methods, but it may be inadequately described or buried in the results section, figure legends, or tables. 36
For each experiment, animal variables include number of animals used, as well as their age (or weight), sex, background strain, and source, if purchased. A general statement about management practices such as type of housing, light–dark cycle, diet, and microbial status is usually sufficient. However, if the study is likely to be particularly influenced by variability in these parameters (eg, the presence of
Controls
Controls are selected relative to the intervention. Each time an additional variable (eg, diet or age in addition to genotype) is introduced, the number of animals increases. In many GEM studies, the genetic intervention is the only variable, and in this case, equal numbers of male and female age-matched littermates of each genotype (wild-type, homozygous mutant, and heterozygous) are used. In genetic studies, control littermates are typically available. In the event that they are not, controls of a similar background strain (and raised under similar conditions) may be used. However, greater variance within the control group may result in a corresponding increase in sample size to detect a difference between control and experimental groups. Similar numbers of control and experimental animals should be used to account for variation in normal background pathology such as tumor incidence. 28,29,71 An alternative method is to refer to previously published wild-type data (historical controls). This method is very likely to result in bias as diet, environmental conditions including the number of mice per cage, breeding methods, strain subline differences, and other factors can play an important role in the incidence of lesions.
Reducing Bias
Systematic bias may arise in several ways, 47,73 and not uncommonly, methods to reduce systematic bias are not reported in published papers. 36,67 Random allocation of animals to treatment arms reduces selection bias. Randomization schemes range from simple assignment using a table of random numbers, through stratified randomization that provides balance on a characteristic such as sex or age, to adaptive randomization methods. Observational bias is introduced when the observer is aware of the genotype or intervention status of the subject. Ideally, the observer (pathologist or investigator) should be unaware throughout the study of the genotype of each animal and of the arm to which it is assigned. Thus, we suggest that whenever within-genotype random allocation is feasible, it should be performed in a blinded manner, such as by a technician who will not perform the outcome measures.
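Stratified randomization with blinded identification codes can be sketched in a few lines. The function names `stratified_allocation` and `blinded_codes`, the animal identifiers, and the arm labels below are hypothetical. The sketch assumes that a person who does not score outcomes runs the allocation and keeps the key.

```python
import random


def stratified_allocation(animals, arms, stratum_key, seed=None):
    """Randomly assign animals to arms, balancing arms within each stratum."""
    rng = random.Random(seed)
    strata = {}
    for animal in animals:
        strata.setdefault(animal[stratum_key], []).append(animal)
    allocation = {}
    for stratum in sorted(strata):
        members = strata[stratum][:]
        rng.shuffle(members)  # random order within the stratum
        for position, animal in enumerate(members):
            # Dealing arms out in rotation keeps arm sizes within one animal.
            allocation[animal["id"]] = arms[position % len(arms)]
    return allocation


def blinded_codes(animal_ids, seed=None):
    """Map each animal to an uninformative code for use by blinded observers."""
    rng = random.Random(seed)
    codes = [f"S{n:03d}" for n in range(1, len(animal_ids) + 1)]
    rng.shuffle(codes)
    return dict(zip(animal_ids, codes))


# Hypothetical cohort: eight littermates of one genotype, four of each sex.
mice = [{"id": f"M{i}", "sex": "F" if i % 2 else "M"} for i in range(1, 9)]
allocation = stratified_allocation(mice, ["diet_A", "diet_B"], "sex", seed=42)
codes = blinded_codes([m["id"] for m in mice], seed=7)

for m in mice:
    print(codes[m["id"]], m["sex"], allocation[m["id"]])
```

In practice the printed key would be held by the allocating technician, and only the codes would accompany slides or samples to the observer.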
Sample Size Calculation to Demonstrate Clinically Meaningful Differences
The actual number of mice needed depends on variability of the outcome in both control and experimental groups. Sample size calculations typically require several pieces of information, such as the difference between groups to be detected (or frequency of the outcome for each treatment arm), the variation in the outcome for each group (these may be different), how many follow-up time points will be included, and expected survival of each treatment arm. Many of these inputs come from pilot experiments or from the literature. If one has not found significance based on a priori sample size calculations, then the initial inputs can be revisited. Was there more variation than expected? Is the difference observed a clinically meaningful difference? Adding mice to an experiment that is not demonstrating clinically meaningful differences is unlikely to achieve a robust outcome. Finally, the reality of intrastrain variation should not be overlooked. 52 Wild-type mice of the same strain may be used as controls for experiments, 16 and if these mice are not littermates, intrastrain variation resulting from genetic drift may influence outcome.
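For a continuous outcome compared between two groups, the inputs described above can be combined using the standard normal-approximation formula n = (z_{1-α/2} + z_{1-β})² (σ₁² + σ₂²) / δ², where δ is the difference to be detected and σ₁ and σ₂ are the group standard deviations. The sketch below is an approximation under those assumptions. The function name `mice_per_group` and the example values are hypothetical, and a calculation based on the t distribution, as a statistician would usually prefer for small groups, would give slightly larger numbers.

```python
import math
from statistics import NormalDist


def mice_per_group(delta, sd_control, sd_experimental,
                   alpha=0.05, power=0.80):
    """Approximate animals per group for a two-sided, two-group comparison
    of means (normal approximation; group variances may differ)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for two-sided alpha
    z_beta = z.inv_cdf(power)           # quantile corresponding to the power
    variance_sum = sd_control ** 2 + sd_experimental ** 2
    n = (z_alpha + z_beta) ** 2 * variance_sum / delta ** 2
    return math.ceil(n)


# Hypothetical inputs from a pilot study: detect a difference of 2 units
# when the standard deviation in each group is 1.5 units.
n_pilot = mice_per_group(delta=2.0, sd_control=1.5, sd_experimental=1.5)

# Doubling the standard deviation roughly quadruples the requirement.
n_noisier = mice_per_group(delta=2.0, sd_control=3.0, sd_experimental=3.0)
print(n_pilot, n_noisier)
```

The second call illustrates why more variation than expected is the first input to revisit when an experiment fails to reach significance.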
Factorial Designs
Factorial experimental designs allow combinations of two or more design factors to be evaluated in one experiment. 1 These designs may be 10 times as efficient as a series of two-armed (treatment and control) experiments, and they will likely reduce the number of animals used, allow for estimation of factor interactions (eg, genotype and diet), and increase the strength of the scientific findings. 21
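A 2 × 2 layout of genotype and diet shows how one factorial experiment yields both main effects and their interaction. The body weights below are invented for illustration, and the sketch reports only point estimates. A formal analysis would use a two-way analysis of variance or an equivalent model.

```python
from itertools import product
from statistics import mean

genotypes = ["wild_type", "knockout"]
diets = ["chow", "high_fat"]

# Every combination of factor levels forms one cell of the design.
cells = list(product(genotypes, diets))

# Hypothetical body weights (g), three mice per cell.
weights = {
    ("wild_type", "chow"): [24.0, 25.0, 26.0],
    ("wild_type", "high_fat"): [30.0, 31.0, 32.0],
    ("knockout", "chow"): [23.0, 24.0, 25.0],
    ("knockout", "high_fat"): [35.0, 36.0, 37.0],
}
cell_mean = {cell: mean(weights[cell]) for cell in cells}

# Simple effect of diet within each genotype.
diet_effect_wt = cell_mean[("wild_type", "high_fat")] - cell_mean[("wild_type", "chow")]
diet_effect_ko = cell_mean[("knockout", "high_fat")] - cell_mean[("knockout", "chow")]

# Main effects average over the levels of the other factor.
diet_main = mean([diet_effect_wt, diet_effect_ko])
genotype_main = (
    mean([cell_mean[("knockout", d)] for d in diets])
    - mean([cell_mean[("wild_type", d)] for d in diets])
)

# Interaction: does the diet effect depend on genotype?
interaction = diet_effect_ko - diet_effect_wt

print(f"diet main effect: {diet_main}")
print(f"genotype main effect: {genotype_main}")
print(f"genotype x diet interaction: {interaction}")
```

Two separate two-armed experiments (diet in wild-type mice only, genotype on chow only) would use a similar number of animals yet could not estimate the interaction term at all.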
Appropriate Analytic Techniques
The appropriate analytic method should be selected to test the hypothesis, and the data should meet the assumptions of that test. Small sample sizes may not meet the assumptions of methods that assume a normal distribution (parametric methods). These are called means models and include the
Control for Correlated Outcomes Within a Subject
When several outcomes are measured at the same time (eg, metabolic compounds) or a single outcome is measured over time (eg, DHEA through a day), then the correlation among the measures within a subject needs to be accounted for, because common analytic methods assume all the observations are independent. When there are correlated measures, the variance will be biased, which typically inflates the type I error. 74 Working with a statistician will ensure proper methods for correlated outcomes.
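The cost of ignoring within-subject correlation can be approximated with the design effect, 1 + (m − 1)ρ, where m is the number of measurements per animal and ρ is the correlation between measurements on the same animal. The sketch below assumes equal numbers of measurements per animal and a single common correlation. The function names and example values are hypothetical. It is intended to show the size of the problem, not to replace a mixed model or other method chosen with a statistician.

```python
def design_effect(measures_per_subject, within_subject_correlation):
    """Variance inflation from treating correlated measures as independent
    (assumes equal cluster sizes and one common correlation)."""
    return 1 + (measures_per_subject - 1) * within_subject_correlation


def effective_sample_size(n_subjects, measures_per_subject,
                          within_subject_correlation):
    """Number of independent observations that carry the same information."""
    total = n_subjects * measures_per_subject
    return total / design_effect(measures_per_subject,
                                 within_subject_correlation)


# Hypothetical study: 10 mice, hormone measured at 6 time points in one day,
# with a within-mouse correlation of 0.6.
deff = design_effect(6, 0.6)
n_eff = effective_sample_size(10, 6, 0.6)
print(f"design effect = {deff:.1f}; 60 observations behave like {n_eff:.0f}")
```

When the correlation is zero the effective sample size equals the total number of observations. When it is one, the repeated measures add nothing beyond the number of animals.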
Intent-to-Treat Analysis
Intent-to-treat analysis denotes analyzing the data according to the randomized groups, such as drug or diet, regardless of whether the treatment was adhered to by each subject. Bias can be induced by dropping those that cannot adhere because of intolerance of the treatment or even death, as well as by removing outliers. Mistakes include replacing mice that become too ill on a treatment or die and reporting only the results of mice that survived the experimental period. Exceptions to intent-to-treat analysis occur when there are protocol or equipment failures, such as accidental administration of an incorrect dose of an infectious agent or drug, miscalibration of equipment, outbreak of disease in the colony, or loss of animal identification. In situations such as these, the animal should be removed from the analytic dataset.
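The difference between an intent-to-treat dataset and a survivors-only dataset can be made concrete with a small tabulation. The record fields (`assigned_arm`, `completed`, `responded`, `protocol_failure`) and the data below are hypothetical and serve only to show which animals stay in the denominator.

```python
def analytic_dataset(records):
    """Intent-to-treat dataset: every randomized animal is kept unless a
    protocol or equipment failure was documented for it."""
    return [r for r in records if not r["protocol_failure"]]


def response_rate_by_arm(records):
    """Proportion of responders in each arm, grouped by ASSIGNED arm."""
    totals, responders = {}, {}
    for r in records:
        arm = r["assigned_arm"]
        totals[arm] = totals.get(arm, 0) + 1
        responders[arm] = responders.get(arm, 0) + (1 if r["responded"] else 0)
    return {arm: responders[arm] / totals[arm] for arm in totals}


def record(arm, completed, responded, protocol_failure=False):
    return {"assigned_arm": arm, "completed": completed,
            "responded": responded, "protocol_failure": protocol_failure}


# Hypothetical trial: two drug-treated mice died early and one received
# the wrong dose by accident (a protocol failure).
records = [
    record("drug", True, True), record("drug", True, True),
    record("drug", False, False), record("drug", False, False),
    record("drug", True, True, protocol_failure=True),
    record("control", True, True), record("control", True, False),
    record("control", True, False), record("control", True, False),
]

itt_rates = response_rate_by_arm(analytic_dataset(records))

# Biased alternative: analyze only the animals that survived to the end.
survivors = [r for r in analytic_dataset(records) if r["completed"]]
survivor_rates = response_rate_by_arm(survivors)

print("intent-to-treat:", itt_rates)
print("survivors only: ", survivor_rates)
```

In this invented example, dropping the two early deaths doubles the apparent response rate in the drug arm, which is the bias the intent-to-treat rule is meant to prevent.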
Interpretation and Presentation of the Results
Descriptive statistics such as sample size, distribution of data, mean (or median), and measures of variability are essential. The unit of measurement for all analyses should be clearly stated, and calculated
External Validity
External validity means that the effect estimates from an initial study have been replicated in a separate cohort. In a 2006 Editorial in
Analyzing Causes of Death and Survival
An important aspect of evaluating aging in GEM studies is survival and cause of death analysis (CODA). 39,81 To better characterize the GEM, to understand the biology of the induced condition, and to compare it with the human condition, it is of great importance to determine the cause of death (COD) of the GEM and compare it with the COD in wild-type mice. For aging studies, this analysis assumes great importance. Why should new lines of aging mice have shorter or longer life spans than those of wild-type mice? What are the mechanisms for prolonging life? Can they include inhibition or prevention of tumor development, decreased degenerative aging diseases of major organs, or other causes? Few publications take on these important issues. 4,17,40,81 Ladiges et al 40 suggest methods for evaluating mice in aging studies and validating findings in one study, yet no method for evaluating or comparing causes of death is noted. We suggest that for all aging mouse studies, CODA should be performed that includes methods of evaluating survival, clinical and anatomic pathology workups to assess COD, and inclusion of potential mechanistic indicators of aging (eg, insulin-like growth factor 1 levels). 39,81 Statistical methods should be used in CODA to compare wild-type and GEM lines. Only then will meaningful publications on effects of aging by gene modification occur.
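One common way to evaluate survival is the Kaplan-Meier product-limit estimator, which can be paired with a simple tabulation of causes of death. The sketch below is a minimal illustration and not the method of any cited study. The function name `kaplan_meier`, the ages at death, and the diagnoses are hypothetical. A formal comparison of wild-type and GEM lines would add a test such as the log-rank test.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times:  age at death or at last observation
    events: 1 if the animal died at that time, 0 if censored
            (eg, still alive at study end or removed for a protocol failure)
    Returns a list of (time, survival probability) at each death time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all animals that share the same time point.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # deaths and censored animals leave the risk set
    return curve


# Hypothetical GEM cohort: age in weeks, event flag, and cause of death.
ages = [90, 100, 100, 110, 120]
died = [1, 1, 0, 1, 0]
causes = ["lymphoma", "lymphoma", None, "nephropathy", None]

curve = kaplan_meier(ages, died)
cod_counts = Counter(c for c in causes if c is not None)

for age, s in curve:
    print(f"week {age}: estimated survival {s:.2f}")
print(dict(cod_counts))
```

Censored animals contribute to the number at risk until they leave the study, so they inform the estimate without being counted as deaths.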
Including Relevant Expertise
Expertise in pathology and data interpretation, especially mouse pathology, is imperative for the conduct and publication of studies involving mice and mouse pathology. 8,33 Examples of erroneous pathology 22 or lack of statistical evaluation of the data 62 appear in leading high-impact journals. Since the publication of these initial articles on two new lines of GEM, no subsequent articles have been published on research with these GEM, despite their initial publication in leading journals. The reviewers of manuscripts involving GEM are often experts in molecular biology or genetics, not pathology. This deficiency appears to be a major cause of publication of erroneous pathology diagnoses, especially in molecular biology and genetics journals. Pathologists familiar with mouse pathology should be reviewers of manuscripts that include mouse pathology.
Consequences of Poorly Designed Animal Studies
The study design and statistical analyses determine whether the hypothesis advanced in the introduction was adequately tested. Many GEM publications report studies with deficiencies in experimental design. 5,24,54,67,8 In addition, data interpretation and conclusions can be based on limited evidence. 8,33 Unfortunately, many published studies provide poor documentation of their findings, especially those involving pathology. These deficiencies can negate the conclusions and lead to errata, retractions, and, at the least, poorly reported publications. 8,33 On the other hand, the lack of optimal design may not detract from the conclusions and significance of the studies if obvious positive results are found. For example, consider a study in which knockout mice develop a high incidence of lymphoma at 6 months of age and in which no wild-type mice were used as controls. Lymphomas in 6-month-old mice are rare in most mouse lines and hopefully in the mice of the background used. More common, however, is a report that shows a few tumors of various types in 10 knockout mice at 12 months of age, with both sexes combined, while 10 wild-type controls have few or no tumors, and no statistical evaluations are used. These types of studies have been published in high-impact and other journals. 8,33 Since these studies are already published, one can only use their publications as references even if the design and interpretation are not necessarily accurate or would require further studies to prove their hypotheses. Publication of a study does not necessarily mean that what it concludes is true or even that the study was well done. 49
Conclusions
Designing an animal experiment so that its conclusions can be trusted is often more complicated than one would expect. Failure to conclusively address a hypothesis may stem from a number of design errors, as well as the inevitable unexpected findings that accompany much research. Adequate study design requires considerable care, and is best done in consultation with various members of the team prior to initiating the project. A proposed study would benefit from statistical or pathologic review, as would completed articles submitted to journals. Finally, well-designed, humane studies on animals that implement the three Rs are consistent with good science.
Footnotes
Acknowledgements
The work for this report was funded in part by grants from the National Institute on Aging R21EY018719 and P30AG21342 at the Yale Claude D. Pepper Older Americans Independence Center.
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
The authors received no financial support for the research, authorship, and/or publication of this article.
