Abstract
This is an introductory paper to a series of papers intended to provide the basis for understanding the contribution of endocrine axis disruption or dysfunction to the pathogenesis of morphological findings and to aid in the interpretation of study outcomes. This is the first in this series of guidance papers prepared by the Working Group and outlines general concepts of study design and assay conduct and validation for hormone studies in general.
Introduction and Purpose
In the course of standard toxicological testing, findings are sometimes observed whose etiology is related, or suspected to be related, to an underlying hormonal change. In most standard nonclinical toxicology studies for pharmaceuticals, hormone assays/data are not routine endpoints. For hormonal endpoints to be meaningful, appropriate sample sizes, analyses, and data interpretation are required. Experimental bias can be introduced if the research plan or process is not appropriately designed and conducted or is insufficiently powered to detect effects if they are present. However, when available, serum hormone data can greatly enhance the assessment of human risk because an understanding of the underlying mode of action of toxicity can improve the accuracy of risk evaluation, facilitate selection of clinical markers to monitor for potential toxicities, and provide a valuable metric to extrapolate from animals to humans.
Under the direction of the Society of Toxicologic Pathology (STP), the Hormonal Assessment Best Practices Working Group was charged with outlining relevant principles of endocrinology to be considered when investigating test articles that modulate or disrupt the reproductive or thyroid hormone axes. The goal was to provide the basis for understanding the contribution of endocrine axis disruption or dysfunction to the pathogenesis of morphological findings and to aid in the interpretation of study outcomes. The STP placed specific emphasis on examination of the use of hormonal assays in nonclinical toxicology studies. This is the first in this series of guidance papers prepared by the Working Group and outlines general concepts of study design and assay conduct and validation for hormone studies in general. The series of articles (Table 1) that follows this article will present basic biological principles of evaluating adult male and female reproductive hormones and thyroid hormones and consideration for their assessment in the context of nonclinical toxicology studies in adult animals. It is recognized that these hormones play a major role in the developing fetus, neonate, and infant, and although they are critical for risk assessment for women of childbearing potential and pediatric populations, hormonal assessment in these settings will not be within the scope of these articles. These articles will focus on hormones most commonly affected in general toxicity studies, often with associated morphological changes in hormonally responsive organs. Contextual information about histopathologic changes that may accompany hormonal changes is provided to assist hypothesis generation about mode of action for the observed toxicity. Finally, suggestions for optimal designs for evaluation of hormonal endpoints are discussed.
Table 1. Manuscripts in the series.
In this article, general concepts of the endocrine system are outlined, and how appreciation of these principles can inform the design of either stand-alone studies or standard toxicity studies modified to incorporate hormone assessments is described. In addition, guidance on the interpretation and analyses of hormonal data, consideration for validating assays, and the advantages and disadvantages of current commonly used procedures to measure and interpret hormonal data are provided.
General Concepts
The activity of the anterior pituitary gland is controlled by both stimulatory and inhibitory hormones from the parvocellular neurons in the hypothalamus (Patel 2001). Hormones released from the pituitary gland regulate the secretion of hormones from target organs, which function by feedback mechanisms to control the release of stimulatory hormones from the pituitary and hypothalamus. Therefore, when considering which hormones to analyze, it is important to understand the biological regulation of the hormonal axes and, when practical, to measure all relevant hormones that are functionally related. Evaluation of hormones often provides important clues to a possible mode of action in situations involving compound-mediated toxicity. For example, if increased luteinizing hormone (LH) concentrations are not accompanied by elevated testosterone (T) levels in appropriately designed animal studies, a potential mode of toxicity (impaired steroidogenesis) could be concluded. A reduction in circulating levels of thyroxine (T4) and free T4 in the absence of an increase in thyroid-stimulating hormone (TSH) may suggest increased liver iodothyronine catabolism with simultaneous alteration of the hypothalamic-pituitary set point for negative feedback of free T4 on TSH secretion (Curran and DeGroot 1991). As such, an understanding of fundamental endocrinology is essential to select the correct hormones to measure in different situations and to guide subsequent investigations. The articles that follow this general concepts and considerations article will discuss in greater depth the morphological and functional changes in individual organ systems that would trigger hormone analyses.
As in any scientific investigation, a hypothesis of the mechanism of an observed toxicity should be formulated before a hormonal testing protocol is developed. First, consider and review the molecule and its related class, pharmacology data on the target, and all available information from toxicology studies that may help develop a hypothesis as to the potential mode of action. It should be remembered that toxicity can be due to an off-target activity, and this should be considered when formulating a hypothesis for the mechanisms of toxicity. Therefore, it would be extremely useful to evaluate any available data from receptor/enzyme screening assays that may have been conducted during the compound selection phases. This is especially important in toxicology studies in which doses used are suprapharmacologic and selectivity based on pharmacologic doses/concentrations may not be applicable. Drug disposition data can be another source of useful information. Similar molecules with differing physicochemical properties (e.g., pKa) can accumulate in different tissue compartments (Benjamin et al. 2010). Conducting appropriate physiologically based pharmacokinetic modeling to estimate organ level concentrations of drugs (Gerlowski and Jain 1983) and use of this information to select targets from pan-receptor screening will help to formulate accurate hypotheses.
The hypothesis will guide which hormones should be considered for inclusion in investigative studies. The hormones and experimental plan will help determine whether a stand-alone investigative toxicity study is needed or whether quantitative assessment of serum hormones could be incorporated into a standard toxicity study. In most cases, stand-alone studies are ideal for quantitative hormonal investigations as they allow customized study designs, use of sufficient numbers of animals for adequate statistical power, and incorporation of appropriate measures to avoid confounding technical problems. Despite the fact that stand-alone studies will take additional time and resources, they are generally the best route to gain meaningful data. However, there are at least two scenarios in which including hormone measurements in otherwise standard toxicology studies becomes useful (Table 2). In these situations, appropriate precautions should be taken to obtain meaningful information and understand the limitations of these types of measurements.
Table 2. Examples in which hormone measurements have been included in repeat-dose toxicity studies.
TSH = thyroid-stimulating hormone; T3 = triiodothyronine; T4 = thyroxine.
In general, incorporation of female reproductive hormones (LH, follicle-stimulating hormone [FSH], prolactin, estradiol [E2], and progesterone [P4]) as endpoints in standard toxicity studies, such as 28-day repeat-dose toxicity studies, should be avoided. The serum concentrations of these hormones fluctuate with reproductive cycle, and the interpretation of data becomes complicated without knowledge of and correlation with the stage of the reproductive cycle. Even with knowledge of the stage of the cycle, conventional study designs using 10 to 15 animals per dose, divided by the four stages of the cycle, will yield, on average, 3 to 4 animals in each stage of the cycle, which is too few to permit conclusions to be made. This was illustrated in a report by Biegel et al. (1998). For evaluating these hormones, specialized studies are recommended and will be a topic in a subsequent article in this series. In addition, caution should be exercised when measuring male reproductive hormones in routine toxicity studies since the pulsatility and variability of these hormones are too great to provide meaningful values from the standard group sizes used in conventional general toxicity studies.
In addition, hormonal measurements in short-term toxicity studies, in which changes in hormone concentrations can be sudden and transient, present a different set of challenges than measuring serum hormone concentrations in chronic studies. For example, in chronic studies, adaptive changes have probably occurred, resulting in altered hormone levels that typically do not change immediately when the drug is withdrawn. As a result, the timing of sample collection after a chronic exposure may not be as critical as in a short-term study. In short-term toxicity studies, necropsy and blood collection typically occur 24 h after the last dose. Acute hormonal changes would likely be missed in these studies if the drug had a short half-life and had been largely excreted by 24 h after dosing. In addition, short-term studies will usually show a more pronounced, hormonally specific effect, which may help in identifying the mode of action. Longer term exposures, in which normal adaptive changes have occurred, may require a larger number of animals to identify the hormonal effect since the magnitude of the change could decrease over time as the body compensates to restore normal homeostasis.
The articles that follow provide details on optimal sampling and study designs for hormonal measurements. For hormones that are secreted in a pulsatile manner, it is advisable to collect samples that span a substantial part of the day to capture several pulses during the maximal blood concentration (Tmax) period. However, blood volumes in rodents can be a limitation for collecting serial samples. This can be overcome, at least partially, by increasing the number of animals per group. In nonrodent toxicity studies, because of the larger size of the animals, it is more practical to obtain serial blood samples. Furthermore, when analyzing serum hormone concentrations in nonrodent species, a good practice is to obtain baseline measurements and use baseline measurements from the same animal to compare for potential test article–related effects. The number of baseline measurements would depend on the hormone being measured as well as procedures to acclimate the animals to the study conditions.
When including hormonal endpoints in a standard toxicity study, the following points should be considered to minimize variability of the hormone data. Some of these considerations are unique to certain hormones, with some hormones more profoundly affected than others.
Pulsatility
Hormone release may be pulsatile (e.g., LH, FSH, and T) or nonpulsatile (e.g., triiodothyronine [T3] and T4), and pulsatile release may vary in pattern over time. In addition, some species may show pulsatility in certain hormones, while other species may not. Pulsatility is defined here as periodic peaks of serum hormone concentrations that can be fitted to a statistical model. The pulsatility of hormone concentration is a function of its secretion kinetics and clearance. Therefore, when measuring pulsatile hormones, sampling at a single time point may produce misleading information. To overcome this problem, prospective power analysis should be conducted to determine the number of animals needed to provide a given probability that a change of a certain magnitude will be detected. Depending on the frequency of the hormone pulses and the blood volume of the animal species being investigated, serial sampling is an alternative solution.
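As an illustration of the prospective power analysis mentioned above, the following minimal sketch estimates the number of animals per group for a two-group comparison. It assumes a two-sided test, a normal approximation, and that variability is expressed as a coefficient of variation; the function name, parameters, and default values are illustrative assumptions and are not taken from this article. The normal approximation slightly underestimates group sizes compared with a t-based calculation, so the result is best read as a lower bound.

```python
import math
from statistics import NormalDist


def animals_per_group(cv_percent, change_percent, alpha=0.05, power=0.80):
    """Approximate animals per group for a two-sided, two-group comparison.

    cv_percent: between-animal coefficient of variation of the hormone (%).
    change_percent: smallest change from the control mean worth detecting (%).
    Both are expressed relative to the control mean, so their ratio is the
    standardized effect size. Normal approximation; hypothetical helper.
    """
    if cv_percent <= 0 or change_percent <= 0:
        raise ValueError("cv_percent and change_percent must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for desired power
    effect_size = change_percent / cv_percent
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)


# Illustrative values only: a highly variable (pulsatile) hormone needs far
# more animals than a stable one to detect the same 50% change.
for cv in (25, 50, 100):
    print(cv, animals_per_group(cv_percent=cv, change_percent=50))
```

Doubling the coefficient of variation roughly quadruples the required group size, which is why single-time-point sampling of pulsatile hormones in standard group sizes is often underpowered.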
Reproductive Cycle
In female rats and Beagle dogs, with estrous cycles of 4 to 5 days or 6 months, respectively, hormonal concentrations can vary by more than 10-fold depending on the hormone and stage of the estrous cycle. Therefore, measuring serum hormones without knowledge of estrous cycle may not provide useful data or at a minimum may require additional considerations for study design. As mentioned earlier, specifically designed investigations are best suited to measure serum hormones in female animals, especially rodents. Hormone assessments in female animals may be included when the reproductive cycles can be determined.
Age of Animal and Maturation Status
Hormone concentrations change dramatically with the onset of puberty, and it is important to take this into account when conducting studies. This is particularly important in large animals that reach puberty at different ages (e.g., dogs and monkeys), where the animals in a study may be of a similar age but at different stages of sexual maturation. Reproductive senescence is also important to consider. Female Sprague-Dawley rats have evidence of abnormal cyclicity by the end of 13-week toxicity studies, and this will be reflected by alterations in hormonal values. Conducting hormone measurements at the end of a 2-year rat or mouse study also has challenges because of the combined effects of reproductive senescence and normal background pathology, which may involve endocrine neoplasia. Older female Old World primates such as macaques may also undergo natural ovarian senescence, leading to loss of endogenous E2 and P4 (Nichols et al. 2005). For thyroid hormones, one of the primary concerns is the impact of altered hormonal status on fetal and neonatal brain development. As such, age of assessment, recognition of fluctuating normative ranges during pregnancy and in the early neonatal period, issues of limited sample volumes, and detection limits of standard assays must be considered.
Circadian Rhythms
Hormones such as prolactin have a circadian rhythm in pregnant rats (Smith, Freeman, and Neill 1975), while hormones such as T3 and T4 may not have a similarly pronounced diurnal pattern. Ideally, serum for hormone analysis should be sampled at the same time of day between test groups and throughout the experiment, within a narrow time window and with stratification of blood collection across treatment groups. Depending on the number of animals in a toxicity study, this may produce technical challenges that need to be considered and addressed during study design.
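One simple way to stratify blood collection across treatment groups is to interleave animals from each group so that no group is sampled entirely early or late in the collection window. The sketch below is only an illustration under that assumption; the group names and animal identifiers are hypothetical.

```python
from itertools import zip_longest


def interleaved_collection_order(groups):
    """Return a round-robin blood collection order across treatment groups.

    groups: dict mapping group name -> list of animal IDs.
    Animals are taken one per group in turn, so time of day is spread
    evenly across groups rather than being confounded with treatment.
    """
    order = []
    for round_of_animals in zip_longest(*groups.values()):
        # Groups of unequal size leave None placeholders; skip them.
        order.extend(a for a in round_of_animals if a is not None)
    return order


# Hypothetical study with three groups of two animals each.
study_groups = {
    "control": ["C1", "C2"],
    "low": ["L1", "L2"],
    "high": ["H1", "H2"],
}
print(interleaved_collection_order(study_groups))
```

Rotating which group starts each round, or randomizing within rounds, are reasonable refinements of the same idea.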
Stress and Use of Anesthetics
Consideration of stress is very important when deciding which method of blood collection to use. Some of the hormones discussed are influenced by stress (for specific information, please refer to the articles in this series). When stress is of concern, care should be taken to select the most appropriate method of euthanasia and/or anesthetics for blood collection for the specific hormones of interest. Some anesthetics, in addition to inducing mild stress, can interfere with hormone concentrations in a more direct manner (Nazian 1988; Tohei et al. 1997). If the test article–related hormonal change is profound, the effects of stress or anesthetics may be manageable because the magnitude of the change will exceed the error introduced by the anesthetic. It is good practice to incorporate experimental procedures that minimize stress, such as not transporting animals for at least an hour prior to blood collection, conducting necropsy in a room separate from housing, or using short-duration anesthesia (i.e., less than 1 min) or decapitation without prior anesthesia.
Interpretation of Data
Analyses of hormonal profiles for test article–related effects are not straightforward. Descriptions and characterizations of pulsatile release of hormones can be difficult to analyze objectively, although statistical methods are available to evaluate pulsatile hormone release profiles (Mock, Norton, and Frankel 1978). However, these types of analyses typically require a sampling frequency that may not be practical or even possible in routine nonclinical toxicology studies. Simple area under the curve calculations from a hormone profile containing as few as six time points can be applied to integrate the pulsatile hormone data to assess test article–related effects. An example of this methodology is presented for cynomolgus monkey T concentrations in Figure 1.

Figure 1. Individual hormone profiles could be integrated into manageable data by calculating the area under the curve (AUC). Then, baseline AUC values from individual animals could be compared with the AUC values obtained during treatment to detect test article–related changes. For example, individual monkey testosterone profiles from day 1 (A) and day 7 (B) have been integrated into one graph (C) for analyses.
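The area under the curve approach can be sketched with the trapezoidal rule, comparing each animal's treatment AUC with its own baseline AUC. The time points and concentrations below are invented for illustration and are not the data shown in Figure 1; the function name is likewise hypothetical.

```python
def auc_trapezoid(times_h, concentrations):
    """Area under a concentration-time profile by the trapezoidal rule.

    times_h: sampling times in hours, strictly increasing.
    concentrations: hormone concentration at each sampling time.
    """
    if len(times_h) != len(concentrations) or len(times_h) < 2:
        raise ValueError("need at least two matched time/concentration points")
    total = 0.0
    for i in range(1, len(times_h)):
        dt = times_h[i] - times_h[i - 1]
        if dt <= 0:
            raise ValueError("sampling times must be strictly increasing")
        total += dt * (concentrations[i] + concentrations[i - 1]) / 2
    return total


# Hypothetical six-point testosterone profiles (ng/mL) from one animal.
times = [0, 2, 4, 6, 8, 10]
baseline_profile = [4, 12, 6, 15, 5, 9]   # pre-treatment day
treated_profile = [3, 5, 4, 6, 3, 4]      # treatment day

baseline_auc = auc_trapezoid(times, baseline_profile)
treated_auc = auc_trapezoid(times, treated_profile)
percent_change = 100 * (treated_auc - baseline_auc) / baseline_auc
print(baseline_auc, treated_auc, round(percent_change, 1))
```

Reducing each pulsatile profile to a single AUC value per animal per day gives a quantity that can be analyzed with ordinary paired statistics.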
Investigators frequently ask whether an observed change in a hormone is of sufficient magnitude and/or duration to lead to an adverse effect on morphology or function. A change in hormone concentration by itself, without any functional or morphological effects on the animal, is challenging to interpret and may not be adverse. For example, a slight change in plasma testosterone concentrations without any alteration in androgen-dependent organ weights or testicular pathology may be interpreted as a statistically significant but not biologically meaningful effect. Alternatively, this pattern may be evident following short-term exposure but be accompanied by organ weight changes or histopathology with continued exposure. Therefore, when interpreting hormone data, it is important to do so in the context of the magnitude and duration of the evoked change and the relationship of serum hormones to end-organ function or morphology. When possible, morphological or functional data should be collected for all hormones to understand and appreciate the dynamics between hormone concentrations and end-organ function and activity. These data would also help interpret changes in hormone concentrations as adverse or incidental.
Principles of Assay Validation
Validation of an assay is important to determine its performance criteria. Validation should include measurements of sensitivity, selectivity, and reproducibility. In addition, validation will determine whether the assay works as advertised in the investigator’s laboratory. There are accepted principles (Stockham and Scott 2008) involved in validating clinical pathology assays, and these generally apply to hormone assays as well. The inclusion of at least the following five key elements when validating an assay is recommended:
Limits of sensitivity: The smallest amount of hormone that can be detected with reasonable confidence.
Precision: The reproducibility of the assay. This helps calculate inter- and intra-assay variability.
Linearity (or parallelism): The ability to accurately measure serially diluted hormone samples.
Spike recovery: This is related to linearity testing and evaluates the ability to accurately detect spiked amounts of hormones.
Specificity: The ability to detect the hormone of interest and not related hormones, for example, LH and not FSH or TSH.
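Two of these elements, precision and spike recovery, reduce to simple calculations. The sketch below shows one common way to express them, as a percent coefficient of variation and a percent recovery; the function names and example values are assumptions for illustration, and acceptance limits are left to the individual laboratory.

```python
from statistics import mean, stdev


def cv_percent(values):
    """Percent coefficient of variation (sample SD / mean * 100)."""
    return 100 * stdev(values) / mean(values)


def spike_recovery_percent(spiked_result, unspiked_result, amount_added):
    """Percent of a known spiked amount that the assay recovers."""
    return 100 * (spiked_result - unspiked_result) / amount_added


# Hypothetical control sample measured in triplicate in three separate runs.
runs = [
    [9.8, 10.1, 10.4],
    [10.9, 11.2, 10.6],
    [9.5, 9.9, 9.7],
]

# Intra-assay variability: CV of replicates within each run, then averaged.
intra_assay_cv = mean(cv_percent(run) for run in runs)

# Inter-assay variability: CV of the run means across runs.
inter_assay_cv = cv_percent([mean(run) for run in runs])

# Hypothetical spike: 10 units added to a sample that measured 5.0 unspiked.
recovery = spike_recovery_percent(14.5, 5.0, 10.0)

print(round(intra_assay_cv, 1), round(inter_assay_cv, 1), round(recovery, 1))
```

Running the same pooled control in every assay makes the inter-assay figure available as a by-product of routine work.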
In addition to the five common technical endpoints, it is also useful to consider other applications and scenarios under which the assay could be used. Additional validation steps may be necessary, some of which include the following.
Anticoagulants
Most hormone assay vendors provide information on the most appropriate anticoagulant to use. However, when validating a new assay in an individual laboratory, it is useful to include other anticoagulants or serum to determine the compatibility in the assay. These data become useful when measuring hormones in residual serum samples from a study conducted for other reasons.
Interference of Test Substance with Assay
With assays involving antibodies, there is the potential that the test substance found in the serum/plasma can interfere with the binding of the hormone to the antibody, thereby causing false-positive effects. The use of spiked samples containing increasing concentrations of the test substance in serum/plasma samples can help determine whether interference is a concern. In cases in which interference is a problem, alternative methods such as liquid chromatography–mass spectrometry (LC/MS) may be considered.
Species Cross-Reactivity
Although assays for steroid or thyroid hormones designed for use in one species will often successfully measure those hormones in other species, assays for protein or peptide hormones are typically specific to one or several species. However, even though an assay kit can measure the same hormone in different species, assays are usually optimized for a single species. In some cases, the serum or plasma matrix can vary between species and lead to different performance characteristics of assays. Therefore, validation of plasma or serum samples of the test species is necessary, and when conducting these validation studies, including samples from other commonly used test species would be useful for later. For example, a hormone assay kit for rats could be tested with dog and monkey serum and plasma samples. When validating assays across species, the reference range for one species may not be appropriate for another, and a relevant reference range must be established for the species of interest.
Batch-to-batch Differences between Kits
Although commercial vendors of hormone assay kits have stringent quality controls, batch-to-batch variability may occur. Therefore, caution should be exercised when depending solely on the assay control samples that come with the kit. It is good practice to have an independent internal control sample included in each assay to monitor for batch-to-batch variability. Batch-to-batch consistency between kits can be measured using the reference controls from multiple kits or a pooled sample of serum from many animals that is aliquoted and tested across many experiments. The latter practice facilitates comparison to historical control data. As part of the validation, it is important to test multiple batches of a kit before declaring an assay from a vendor validated. Absence of any significant batch-to-batch variability is especially important if historical ranges will be referenced for data interpretation. It is also important to note the specific antibody used in a particular enzyme immunoassay (EIA) and radioimmunoassay (RIA) hormone assay kit, as this may change over time and potentially affect results (e.g., a more specific antibody may lead to lower measured values).
Sample Stability
Sample stability is a common concern when developing hormonal assays. The method of sample collection and storage can affect the measurement of hormone concentrations. It is recommended that sample stability be measured at different temperatures and lengths of storage. It is also important to consider that stability of specific hormones may vary. The effect of multiple freeze-thaw cycles should be assessed. There may be variability in sample stability due to variations in the function of individual freezers and locations in a freezer. For this reason, the use of frost-free freezers, which have periodic heat cycles to prevent ice buildup, should be avoided.
Establishment of a Normal Range
Normal ranges for each hormone should be established for correct data interpretation in each species. Each laboratory should produce its own normal ranges. When establishing normal or calibration ranges, it is important to understand the various biological factors (e.g., circadian rhythms, reproductive cycles, seasonality) that affect hormone concentrations. For example, it is not useful to combine normal ovarian hormone concentrations from different stages of the rat estrous cycle because the normal range at each stage of the cycle is different from that at the other stages. Normal ranges and confidence intervals will facilitate data interpretation and the design of studies with adequate statistical power. The articles in this series will provide guidelines on the recommended number of animals for measurement of various hormones. However, group numbers can vary significantly among laboratories due to sampling methods, the assays used, and the magnitude of change requisite for the experimental question at hand. Ideally, each laboratory should establish limits of acceptable hormone concentration variability for its own work.
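The point about not pooling estrous cycle stages can be sketched as a per-stage calculation. The example below assumes a nonparametric interval spanning the 2.5th to 97.5th percentiles, which is a common convention but is not prescribed by this article; the stage labels and data layout are hypothetical.

```python
from collections import defaultdict
from statistics import quantiles


def reference_intervals_by_stage(records):
    """Nonparametric central 95% interval of hormone values for each stage.

    records: iterable of (stage, concentration) pairs from control animals.
    Returns a dict mapping stage -> (2.5th percentile, 97.5th percentile).
    Each stage needs at least two values, and in practice many more for
    the percentiles to be stable.
    """
    by_stage = defaultdict(list)
    for stage, value in records:
        by_stage[stage].append(value)

    intervals = {}
    for stage, values in by_stage.items():
        # n=40 yields cut points every 2.5%; the first and last are the
        # 2.5th and 97.5th percentiles. "inclusive" avoids extrapolating
        # beyond the observed minimum and maximum.
        cuts = quantiles(values, n=40, method="inclusive")
        intervals[stage] = (cuts[0], cuts[-1])
    return intervals


# Synthetic values only, chosen to show that stages yield different ranges.
example = [("proestrus", float(v)) for v in range(41)]
example += [("diestrus", v / 10) for v in range(41)]
print(reference_intervals_by_stage(example))
```

Pooling the two synthetic stages into one range would produce an interval too wide to flag a meaningful change in either stage.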
Free versus Bound Hormone Concentration
Some hormones, especially sex hormones such as estrogen and testosterone and thyroid hormones (T3 and T4), are bound to high-affinity serum-binding proteins (e.g., thyroxine-binding globulin, transthyretin, sex hormone–binding globulin, corticoid-binding globulin) as well as lower affinity, less-specific proteins such as serum albumin. According to the free hormone/drug hypothesis, at steady state, bioactivity is determined by the free or unbound fraction of the hormone/drug. Therefore, in some situations, measuring free hormones is important. When validating assays, it is important to understand whether the assays are measuring total or free hormones. The type of serum protein and the affinity of the binding differ between species (i.e., there are multiple types of testosterone- and thyroxine-binding proteins in different species; Corvol and Bardin 1973; Larsson, Pettersson, and Carlström 1985). The interspecies differences in the types of serum proteins and affinities complicate both the use of free hormone assays across species and the translation of endocrine free fractions between species. Free hormone assays that are independent of binding strength, such as equilibrium dialysis assays, are the gold standard, but assays that calculate the free fraction based on binding strength in one species may not be as easily transferable to another species. When free hormone assessments are necessary, careful validation of the assay is essential, and use of the purified binding protein from the test species is preferred. Free hormone concentration indices may also be calculated from total hormone and binding protein concentrations, although commercial binding protein assays are often technically challenging and species specific.
Ability to Measure Biologically Meaningful Changes in Hormone Concentrations
A common approach when validating a hormone assay is to conduct linearity tests to evaluate whether an assay can detect specific changes in hormone concentrations within the calibration range. For example, a known amount of a standard is spiked into a matrix and analyzed with the kit that is being validated to observe whether the kit can measure the spiked hormone. However, there are some inherent pitfalls with this method because the antibody or binding protein is specifically meant to recognize the standards, which may not be the same as recognizing the native hormone in an alternative matrix. Another pitfall is that small peptides have a tendency to adhere to glass and other vessels if not properly handled, which confounds the measurement of dilution curves. Finally, it is challenging to simulate biologically relevant hormonal changes artificially in dilution curves. One way to overcome these challenges is to obtain samples known to differ in hormone concentrations to assess assay performance. In some cases, small studies using known positive control agents can be conducted to collect such samples, and these can also provide samples that can be used as subsequent internal assay controls. Whenever possible, it is recommended to use biologically relevant samples to validate hormone assay kits. Examples of practices that provide confidence in assay results using positive control procedures include the following:
To validate a human gastrin kit for analyzing cynomolgus monkey serum gastrin concentrations, serum samples were collected from overnight-fasted monkeys before and after feeding a high-protein and fruit snack. This permitted an assessment of the ability of the human-based assay kit to detect expected diet-induced changes in gastrin concentrations.
To evaluate the sensitivity of estrogen and progesterone assay kits in rodents, serum samples were collected at various times in the estrous cycle. The ability of the kits to detect differences in hormone concentrations across the cycle was documented.
To demonstrate the sensitivity of thyroid assay kits in rodents, a serum dilution curve was run, and parallel changes between the standard curve and the serum dilution curve were confirmed. Calibrators (especially for T4) at or below the lowest level of thyroid hormone expected in the study must be included to document the limits of quantification of the assay. In RIAs for T3 and T4, the lowest calibrator provided in the kit was halved to extend the limits of detection for T4 down to the 5-ng/mL range.
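A serum dilution check of the kind described in the thyroid example can be approximated as follows: if the sample dilutes in parallel with the standard curve, each measured value multiplied by its dilution factor should return roughly the same concentration. This is a simplified sketch; the 20% tolerance, function names, and example readings are assumptions, and a formal parallelism test would compare fitted curve slopes.

```python
def dilution_corrected(measured, dilution_factors):
    """Multiply each measured concentration by its dilution factor."""
    if len(measured) != len(dilution_factors):
        raise ValueError("one dilution factor is needed per measurement")
    return [m * d for m, d in zip(measured, dilution_factors)]


def dilutes_in_parallel(measured, dilution_factors, tolerance_percent=20.0):
    """True if every corrected value is within tolerance of the first one.

    The first entry is taken as the reference (usually the least diluted
    sample). The tolerance is an illustrative default, not a standard.
    """
    corrected = dilution_corrected(measured, dilution_factors)
    reference = corrected[0]
    return all(
        abs(value - reference) / reference * 100 <= tolerance_percent
        for value in corrected
    )


# Hypothetical T4 readings (ng/mL) for neat, 1:2, 1:4, and 1:8 serum.
factors = [1, 2, 4, 8]
good_sample = [40.0, 21.0, 9.8, 5.1]
poor_sample = [40.0, 28.0, 18.0, 11.0]  # reads too high as it is diluted

print(dilutes_in_parallel(good_sample, factors))
print(dilutes_in_parallel(poor_sample, factors))
```

A sample that fails this check suggests a matrix effect or that the antibody does not recognize the native hormone in the same way as the kit standard.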
Hormone Assays
Immunoassays are the most common method currently used to quantify hormones, although LC/MS methods for small-molecule hormones are increasingly being used in human clinical practice (Moal et al. 2007). Immunoassays include the RIA, in which the ligand is labeled with a radionuclide, and the EIA, most commonly represented by the enzyme-linked immunosorbent assay (ELISA).
Current RIAs are typically highly sensitive and specific, and they take advantage of the specificity of the antigen-antibody reaction and the sensitivity that is inherent to the measurement of radioactive compounds (Kricka 2001). The main disadvantages of RIAs are the short shelf life of the assay kits due to radionuclide decay, the generation of radioactive waste that is expensive to discard, and the additional safety procedures required for handling radioactive material.
EIAs overcome most of the disadvantages of RIAs and are more readily automated. However, one of the main disadvantages with sandwich immunoassays such as the ELISA is the potential for interference. The ability to automate immunoassays has enabled new technology to be developed based on the EIA concept, such as multiplex assay technology. This is a suspension array technology that uses a flow cytometer to analyze multiple hormones from a single serum sample. Because multiplex assay technology measures the concentrations of multiple hormones from a small volume of serum, many of the principles discussed in this article related to hormone endpoints in toxicity studies could be realized. For example, measuring multiple biologically related hormones and serial profiling to capture hormone pulses were not possible without large volumes of serum or plasma, which necessitated large studies. This is no longer a technical hurdle with multiplex assay technology. Selecting which assay technology to use will depend on the level of sensitivity needed, access to the equipment, and appropriately trained staff. A practical comparison of RIA, ELISA, and multiplex for female reproductive hormones LH, FSH, and prolactin is presented in Table 3.
Table 3. Comparison of RIA, ELISA, and multiplex technology in terms of volume of sample needed, sensitivity, relative cost for reagents, and time to analyze rat LH, FSH, and prolactin.
RIA = radioimmunoassay; ELISA = enzyme-linked immunosorbent assay; LH = luteinizing hormone; FSH = follicle-stimulating hormone.
Conclusions
The articles in this series are intended to provide guidance on approaches for evaluating blood samples for circulating hormone concentrations that avoid many of the issues associated with hormone measurements in routine nonclinical toxicology studies. This information is particularly useful because samples for hormone measurement readily obtained in preclinical studies often serve as biomarkers of toxicity and can guide further studies. Since preclinical toxicity studies are the foundation for approaches used in clinical studies, it is important that any test article–related hormone effect be accurate, validated, and appropriately interpreted.
Footnotes
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The recommendations in this article are endorsed and supported by the STP.
