Abstract
The study of developmental neurotoxicity (DNT) continues to be an important component of safety evaluation of candidate therapeutic agents and of industrial and environmental chemicals. Developmental neurotoxicity is considered to be an adverse change in the central and/or peripheral nervous system during development of an organism and has been primarily evaluated by studying functional outcomes, such as changes in behavior, neuropathology, neurochemistry, and/or neurophysiology. Neurobehavioral evaluations are a component of a wide range of toxicology studies in laboratory animal models, whereas neurochemistry and neurophysiology are less commonly employed. Although the primary focus of this article is on neurobehavioral evaluation in pre- and postnatal development and juvenile toxicology studies used in pharmaceutical development, concepts may also apply to adult nonclinical safety studies and Environmental Protection Agency/chemical assessments. This article summarizes the proceedings of a symposium held during the 2015 American College of Toxicology annual meeting and includes a discussion of the current status of DNT testing as well as potential issues and recommendations. Topics include the regulatory context for DNT testing; study design and interpretation; behavioral test selection, including a comparison of core learning and memory systems; age of testing; repeated testing of the same animals; use of alternative animal models; impact of findings; and extrapolation of animal results to humans. Integration of the regulatory experience and scientific concepts presented during this symposium, as well as from subsequent discussion and input, provides a synopsis of the current state of DNT testing in safety assessment, as well as a potential roadmap for future advancement.
Introduction
Neurotoxicity has been defined as an adverse change in the structure or function of the central and/or peripheral nervous system following exposure to a chemical, physical, or biological agent. 1 Per the Environmental Protection Agency’s Guidelines for Neurotoxicity Risk Assessment, 2 structural neurotoxic effects are defined as neuroanatomical changes occurring at any level of nervous system organization, whereas functional changes are defined as neurochemical, neurophysiological, or behavioral effects. There is an extensive body of literature encompassing the various aspects of neurotoxicology, among them the evaluation of adverse effects on behavior, which includes assessment of somatic/autonomic, sensory, motor, and cognitive function.
Developmental neurotoxicology is the study of adverse effects on the nervous system resulting from exposure to a toxicant during development. The nervous system continues to develop postnatally in humans and in most of the commonly used laboratory species; this prolonged period of development contributes to a potential increase in susceptibility to neurotoxic insult. Evaluation of the functional consequences of developmental neurotoxicity (DNT) employs many of the principles and methods used in neurotoxicity assessment, integrated with those of developmental toxicology. Over the last approximately 60 years, these principles and methods, particularly with regard to testing of behavior, have been refined and validated. They are now routinely employed in postnatal nonclinical safety assessment, as well as in evaluation of industrial and environmental chemicals, in studies that expose the conceptus, neonate, and/or juvenile animal to potentially toxic insult. In addition, input from global regulatory agencies has provided fine-tuning with regard to the implication of adverse results from behavior tests conducted in these populations.
The following sections provide a summary of the symposium “Current Topics in Postnatal Behavioral Testing,” presented at the 2015 American College of Toxicology Annual Meeting. The rationale for conducting neurobehavioral evaluations is discussed, with a focus on the pre- and postnatal development (PPND) and juvenile toxicity studies conducted to support the safe conduct of clinical trials and registration of candidate therapeutic agents. Neurobehavioral tests employed to evaluate domains encompassing reflex ontogeny, sensorimotor function, locomotor activity, arousal/reactivity, and learning and memory, as recommended by global regulatory agencies, are described. These include well-established, validated tests, as well as those that are less common but worthy of consideration. The animal model typically used in PPND and juvenile toxicity studies is the rat, but this species may not be appropriate in all circumstances. Potential reasons for consideration of alternative species and pros and cons of various animal models are discussed. Data interpretation, potential confounders, and regulatory acceptance are important components of neurobehavioral testing, and focused discussions of core learning and memory systems and the functional observational battery (FOB) follow. Case studies are also provided to illustrate the types of tests employed and their outcomes. Information contained herein provides a brief review of the current state of the art for evaluation of DNT and also includes identification of issues and recommendations for the future evaluation of neurotoxicity in vulnerable young populations.
Neurobehavioral Evaluations as a Component of Developmental Toxicity Testing—Regulatory Perspective (Ikram Elayan/Edward Fisher)
Evaluation of the effects of a drug on the central nervous system (CNS) is an essential component of the nonclinical safety assessment conducted during drug development, and neurobehavioral testing is an important tool for accomplishing that objective. In a regulatory setting, data collected from nonclinical neurobehavioral studies are critical for delineating possible drug effects that are sometimes difficult or impossible to detect in clinical trials where meaningful behavioral outcomes are not well defined or validated and where the number of exposures and duration of treatment and follow-up are limited. Drug safety assessment based on postapproval human data also raises a host of methodological issues. The clinical detection or confirmation of DNT resulting from in utero or early postnatal drug exposure is particularly difficult in this regard; for example, efforts to study the behavioral effects of developmental exposure to antiepileptic drugs in humans have been hampered by the difficulty of controlling for confounding factors, uncertainty about the accuracy of clinical information, and insufficient outcome criteria. 3 Besides the limitations of clinical trial and postmarketing data, factors such as the variability in the natural development of the brain, the different stages of development that can be affected by treatment, and individual differences in response make a clinical follow-up paradigm difficult to implement. Until well-defined specific clinical outcomes and/or other validated modalities (eg, imaging techniques or neurochemical biomarkers) are adopted for clinical monitoring and appropriately designed epidemiological studies are conducted, animal studies might be the only way to understand the effects of a drug on the developing nervous system. The results of nonclinical neurobehavioral studies can be important in limiting exposure levels or excluding certain populations in clinical trials. 
In addition, data from these studies can be described in consent documents and product labeling so that a more informed risk–benefit decision can be made.
Prior to first use in humans, nonclinical studies performed to characterize the pharmacology of the drug may entail evaluation of behavioral effects, and behavioral end points are the primary means of assessing the potential for acute neurotoxicity in CNS safety pharmacology studies. In general toxicity studies, clinical observations can reflect acute and long-term effects of the drug on the CNS. These evaluations are generally conducted in adult animals; however, evaluation of possible drug effects on the nervous system by means of neurobehavioral testing is also a key part of the nonclinical developmental toxicity assessment. In the PPND study, which investigates effects of the drug on the developing embryo/fetus in utero and during the early postnatal period until weaning, neurobehavioral assessments of the offspring are conducted postweaning. 4 Neurobehavioral testing is also generally included in juvenile animal studies, in which young animals are directly dosed with the drug from the early postnatal period through sexual maturity or early adulthood. 5 In this section, we will discuss the neurobehavioral evaluations needed in nonclinical studies and some tests that could be utilized for such assessments. Considerations and issues surrounding neurobehavioral evaluations conducted in a drug regulatory context, including those related to study design and the interpretation and impact of findings, as well as case studies, will also be discussed.
Testing guidelines that call for neurobehavioral assessments (eg, International Conference on Harmonisation [ICH] Guidelines on Detection of Toxicity to Reproduction for Medicinal Products [S5(R2)] 6 and Safety Pharmacology Studies for Pharmaceuticals [S7A] 7 and Food and Drug Administration (FDA) Guidance on Nonclinical Safety Evaluation of Pediatric Drug Products 8 ) generically recommend evaluations of activity, sensory functions, and learning and memory but do not specify test procedures. This flexibility has, for example, allowed for the use of a variety of learning and memory tests ranging in complexity from simple 2-choice discrimination tasks to complex labyrinthine water mazes. Given the need for sensitive and reliable as well as practical means of detecting the full spectrum of potential neurotoxic effects of drug exposure during development, the selection of neurobehavioral tests should be scientifically justified. However, the specific criteria for deciding which tests are most appropriate for a particular study are not well defined. The need for additional clarification of this question was recognized in the recent concept paper for a proposed revision of the ICH S5 guideline (Final Concept Paper S5[R3] 9 ).
Several behavioral domains are considered important for evaluating possible nervous system effects, including sensory and motor function, arousal and reactivity, cognitive function (learning and memory, attention), and social behavior. The Irwin test 10 is one of the early test batteries used to evaluate drug effects on the nervous system. This test, which was initially developed for use in mice, requires trained and experienced personnel to obtain consistent and reliable results. Although the full and extensive test battery is rarely used, a modified version, usually referred to as the “Modified Irwin,” is frequently used as a first-tier assay to address the need for a nonclinical CNS safety pharmacology evaluation (as per ICH S7A 7 ) during early drug development. The FOB test, which utilizes some aspects of the Irwin test, is also commonly used for this purpose. This test can be conducted with either mice or rats, and minimal outcomes that are usually assessed include activity and locomotion, bizarre behavior, convulsions, autonomic function, grip strength, and body temperature. Some versions of this test have been adapted for dogs and nonhuman primates. 11,12 Although these test batteries are often used as part of the safety pharmacology evaluation for new pharmaceuticals and can provide some information about the general condition of the animals under treatment, additional behavioral testing is generally needed to characterize observational results in the FOB.
Motor function can be assessed as both spontaneous and induced movements. Tests to evaluate motor activity include the rotarod balance performance test, automated systems that utilize infrared beam breaks, mazes, and other more specialized schedule-controlled operant tests. The automated open field locomotor activity monitor is a valuable system to utilize in screening or to follow up an observed locomotor effect, as the activity of the animal can be quantified by means of computerized programs that analyze specific components, such as horizontal and vertical activity, as well as repetitive beam interruptions. This is the test system generally recommended for inclusion in the neurobehavioral assessments conducted for PPND and juvenile animal studies.
The startle response is a reflex reaction that involves sensorimotor (gating) function and is mediated and modulated via well-characterized neural circuits that are comparable across species. 13 A stimulus (acoustic, such as a burst of noise, or tactile, such as an air puff) can be used to trigger a startle response that traverses a pathway from the sensory nerves through the giant pontine reticular nucleus in the brainstem to motor neurons in the spinal cord. This reflex reaction can be modulated by input from higher brain centers; for example, the response tends to be greater in the presence of threat, fear, and pain and can be decreased by anxiolytics. Startle habituation is manifested as a decrease in the magnitude or frequency of the response following repeated presentation of the stimulus and is considered a simple form of learning. The startle response can be measured in automated test systems that employ a “stabilimeter chamber” or an accelerometer piezoelectric force transducer, which is connected to a detector that measures the whole-body flinch of an animal induced by the reflex reaction. Parameters that are evaluated in this test include the first response amplitude, maximum response amplitude, time to maximum response, and average response. Protocols for measuring response habituation and modification by sensory stimuli (prepulse inhibition [PPI]) with these systems have been described. 14 This test can be utilized to assess the integrity of specific aspects of sensorimotor function and the underlying CNS substrate. It has proven to be a sensitive means of detecting drug-induced neurotoxicity that is also practical for inclusion in neurobehavioral test batteries for regulatory use.
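As an illustration of how some of the parameters listed above might be derived from recorded data, the following minimal Python sketch summarizes a single startle trial and then computes habituation across trial blocks and percent PPI. The sample values, the 1-ms sampling interval, the block size, and all function names are hypothetical. The percent PPI expression is a commonly used convention and is not taken from the protocols cited here, and first response amplitude is omitted because its operational definition is not given in this article.

```python
from statistics import mean


def summarize_trial(samples, sample_ms=1.0):
    """Summarize one startle trial from a list of force readings (arbitrary units)."""
    peak = max(samples)
    return {
        "max_amplitude": peak,
        # Time of the first sample at which the peak occurs (assumed sampling interval).
        "time_to_max_ms": samples.index(peak) * sample_ms,
        "average_response": mean(samples),
    }


def habituation_by_block(peaks, block_size):
    """Mean peak amplitude for each consecutive block of trials."""
    return [mean(peaks[i:i + block_size]) for i in range(0, len(peaks), block_size)]


def percent_ppi(pulse_alone, prepulse_plus_pulse):
    """Percent PPI as 100 * (1 - mean prepulse+pulse response / mean pulse-alone response)."""
    return 100.0 * (1.0 - mean(prepulse_plus_pulse) / mean(pulse_alone))


# Hypothetical data, for illustration only.
summary = summarize_trial([2, 5, 40, 90, 60, 20, 5])
blocks = habituation_by_block([100, 90, 80, 60, 50, 40], block_size=2)
ppi = percent_ppi(pulse_alone=[100, 80], prepulse_plus_pulse=[50, 40])
print(summary, blocks, ppi)
```

A declining sequence of block means would be consistent with habituation, and a positive PPI value indicates that the prepulse reduced the startle response.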
Evaluation of cognitive function requires reliable and sensitive tests to detect potential deficits that are usually hard to recognize in animals. A variety of tests have been used over the years ranging from very simple habituation or discrimination tests to more complex operant or conditioning tests. These tests vary in their complexity and sensitivity and require the animals to perform some learning task, which then can be reintroduced for the evaluation of effects on memory. Ideally, effects on learning and memory should be separated from effects that are not involved in the cognitive or associative process (ie, effects on sensory, motor, or motivational factors).
Passive avoidance and maze tasks have been the most commonly used learning and memory tests for regulatory developmental toxicity studies of pharmaceuticals. Although passive avoidance has some desirable features and has been shown capable of detecting neurotoxic effects, it is considered less sensitive and more susceptible to nonmnemonic variables, such as activity level, than some of the spatial learning tests. Mazes vary in task complexity from simple T- or M-mazes to multiple T-mazes, such as the Biel and Cincinnati water mazes (CWMs), or other tests of spatial learning and memory, such as the Barnes maze, the radial arm maze (RAM), and the Morris water maze (MWM). 15 All mazes include 3 aspects that are essential for learning and memory: acquisition, consolidation, and retrieval; but the cognitive processes or components, such as learning strategy and memory duration, and the neural substrates involved vary among tasks. Some of these mazes are dry and some are water mazes that require the animals to swim; some are associated with an appetitive component, such as a food reward, whereas others are not; and each has its own advantages and disadvantages, as will be discussed in more detail in the following sections. The complex multiple-T (Cincinnati) water maze and the MWM have been recommended as well suited for regulatory learning and memory assessments in rodents. 15
When conducting neurobehavioral studies, some design and methodological considerations are crucial in order to obtain valid and meaningful results. It is important that the number of animals used in the study be adequate for detection of reasonably small treatment effects, preferably at least 15/sex/group. It is important to distinguish pharmacological from long-term toxic effects by testing during the treatment period and after an appropriate recovery period. In order to avoid preexposure to the tests, separate sets of animals should be used for testing during treatment and at the end of the recovery period. The battery of functional assessments performed should be capable of detecting relatively subtle impairments. And while it is understood that no single test can evaluate all aspects of learning and memory, tests should be chosen to maximize the detection of an adverse effect across a range of important cognitive domains.
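To make the relation between group size and detectable effect more concrete, the sketch below approximates statistical power for a two-sided comparison of 2 groups using a normal approximation. The standardized effect size, the alpha level, and the function name are illustrative assumptions and are not drawn from any guideline; a formal power analysis for a given end point would be based on the observed variability of that test.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(effect_size_d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison (normal approximation).

    effect_size_d is the assumed difference between group means in SD units.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    noncentrality = effect_size_d * sqrt(n_per_group / 2)
    # The contribution of the opposite tail is negligible and is ignored here.
    return NormalDist().cdf(noncentrality - z_crit)


# Hypothetical scenario: a 1-SD effect with 15 or 30 animals per group.
power_15 = approx_power(1.0, 15)
power_30 = approx_power(1.0, 30)
print(round(power_15, 2), round(power_30, 2))
```

Because the normal approximation slightly overstates power relative to a t test at small group sizes, the values are best read as rough upper guides.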
The level of confidence that lack of adverse behavioral results in animals exposed to a drug will be predictive of safety for humans will obviously depend on the adequacy of the evaluation. However, beyond the general advice that the functional testing assesses certain behavioral domains, there is little regulatory guidance in this area and regulators are often reluctant to recommend specific test methods. Thus, there is a clear need to establish performance criteria for neurobehavioral tests with respect to such aspects as their reliability, sensitivity, and human predictive value. Determination of the relevance of adverse behavioral findings to humans will involve consideration of factors such as drug dose/exposure, magnitude of the effect, reversibility, and relation to other effects. 16 Depending on the test employed, extrapolation of animal findings to humans may not be straightforward given species differences in behavioral repertoire and uncertainties about the behavioral functions being assessed and the neural systems subserving the behavioral response. 17 But neurobehavioral changes observed using a validated test system at a clinically relevant dose in an appropriate animal species would generally be assumed, like other toxic effects, to indicate a risk to humans and lead to appropriate risk/benefit considerations during drug development and risk communication in product labeling.
The following case studies illustrate how the effect of a treatment on nervous system function can be followed up, which tests were used, and how the data obtained from these studies were applied.
Case Study 1
Lacosamide (LCM) is a recently approved anticonvulsant drug that interacts pharmacologically with voltage-gated sodium channels and the collapsin response mediator protein 2 (CRMP-2). The CRMP proteins are highly expressed in the CNS during early development, and CRMP-2 has been reported to play a role in neuronal differentiation, polarization, and axonal outgrowth. In pharmacology studies, LCM was shown to inhibit the CRMP-2–mediated effects of neurotrophins on axonal outgrowth of primary hippocampal cells at pharmacologically relevant concentrations. 18 When LCM was assessed for its acute effects on the CNS in rodents, depressant effects were observed as manifested by sedation, reduced spontaneous locomotor activity, and impairment of motor coordination. Clinical observations in the chronic nonclinical toxicity studies included neurological signs such as ataxia, reduced motility, tremor, and convulsions at high doses. In a juvenile rat toxicity study in which the drug was administered for 6 weeks beginning on postnatal day (PND) 7, brain weights (absolute and relative) were decreased at the end of the dosing period and neurobehavioral testing conducted beginning 1 week after cessation of dosing indicated drug treatment effects on open field (decreased latencies to first movement and number of sectors entered) and MWM (increased escape latencies during both the learning and memory phases) performance. Because the early postnatal period in rats is generally thought to correspond to late pregnancy in humans in terms of brain development, the juvenile rat study results were included in both the pregnancy and the pediatric use sections of labeling (Vimpat product labeling; http://www.accessdata.fda.gov/drugsatfda_docs/label/2015/022253s030,022254s022,022255s016lbl.pdf).
Case Study 2
Duloxetine is a serotonin and norepinephrine reuptake inhibitor indicated for the treatment of major depressive disorder and generalized anxiety disorder in children and adolescents ages 7 to 17 years. Juvenile rats were treated with duloxetine HCl from PND 21 to 90. Different neurobehavioral evaluations were performed in this study including motor activity using a figure-8 photobeam activity system, auditory startle habituation, and learning and memory using the CWM (2 paths were utilized; Path A—forward path and Path B—backward path). In the learning and memory test, there was a significant treatment-related increase in the number of errors navigating the Path B configuration of the maze in both males and females treated with the high dose during the treatment period, indicative of sequential learning deficits. The lack of an effect on the performance of animals in Path A, while an effect was observed in Path B, suggests that a more complex paradigm in this maze (Path B) allowed for observing some neurobehavioral deficit that would have been missed if the animals were tested only on the forward path (Path A). However, these learning deficits were reversible following cessation of drug treatment. There was no effect on the time to complete the task in either Path A or B during treatment. The findings of this study were described in the pediatric section of the labeling (Cymbalta; http://www.accessdata.fda.gov/drugsatfda_docs/label/2015/021427s046lbl.pdf).
Assessment of Learning, Memory, Startle, and Functional Observations in Regulatory Studies (Charles Vorhees)
Learning and Memory Systems
Mammals have 3 primary learning and memory systems that encode the information needed to navigate through the environment and that are essential for survival. 19 In humans, these are explicit/declarative memory (memory for people, places, things, and events, which are long-term reference memories), implicit/procedural/egocentric memory (for skills and proximal navigation, which are also long-term memories), and working/short-term memory (memories held until new information is consolidated into long-term memory or forgotten). Explicit/declarative memory (which includes allocentric/spatial memory) is encoded in the hippocampus and entorhinal cortex; implicit/procedural memory (which includes egocentric memory) is encoded in the striatum and related structures; and working memory is primarily encoded in the prefrontal cortex. Testing these memory systems in rodents provides information on the homologous memory functions in humans and is valuable for identifying potential human hazards. Memory for place/location is assessed using tests of allocentric/spatial navigation; procedural/implicit memory is assessed using tests of egocentric navigation; and working memory is assessed using tests of trial-dependent memory. Examples of tests assessing spatial learning and memory are the MWM, Barnes maze, passive avoidance, and the unbaited arms of the radial arm maze (RAM). Examples of tests assessing egocentric learning and memory are the CWM and Biel water maze, the cued version of the MWM, and the Star Maze. Examples of tests assessing working memory are the baited arms of the RAM, spontaneous alternation, the 5-choice serial reaction time test, 20 appetitive T-mazes, and others.
With few exceptions (novel object recognition and spontaneous alternation), learning tasks require some type of motivation imposed on the animal by the experimenter. The 3 most common motivators are positive reinforcement (typically for food), punishment (such as shock), and negative reinforcement (to avoid or escape from an aversive stimulus such as water escape or shock avoidance). All learning tasks cause some level of stress. Some think that stress is always adverse, but in fact the Yerkes-Dodson law demonstrates that when the level of stress is analyzed against performance, this function exhibits an inverted U shape. Performance is poor at high levels of stress and also at low levels of stress if the task is demanding enough. Optimal performance occurs at intermediate levels of stress, but the key is having stress at the right level in relation to the difficulty of the task. Typically, this is determined empirically. In most learning tasks, if the stress level is appropriate, animals learn well. If animals learn well, regardless of whether the motivation is appetitive, swimming to escape water, or shock avoidance, then the stress is appropriately aligned with the task. Hence, when good learning is observed, it is safe to assume that the stress is not excessive. Prima facie evidence that stress is not properly balanced occurs when learning curves are too shallow or too steep. Getting this balance right is the first step in making sure a test procedure is worth using.
Water Mazes
Compared with the other motivators mentioned earlier, water mazes have several advantages. The first is that water is an equal motivator across animals; that is, water is equally wet and equally cold to all animals (within reasonable limits). However, if a treatment affects fur density or hypothalamic thermoregulation, then water may not be equally motivating for all animals. Motivation to escape can be ascertained by testing animals in a straight water-filled swimming channel where there is essentially no learning. One can infer motivational equality if escape times are comparable across groups. If groups differ in swimming time in a straight channel, then one needs to consider the implications this has for interpreting differences in maze learning. Slow swimming in a group can have several causes. It could be because the treatment affects motor coordination such that the act of swimming is impaired; the experimental group might swim slower because they are less motivated or less energetic due to sickness or lethargy; or the experimental group might become easily fatigued if they cannot regulate core body temperature and therefore lose heat in water. But these are exceptions that are rarely encountered. In most cases, water is uniform in motivating escape over a wide range of body weights. 21 This cannot be said of appetitive tasks. Appetitive tasks cannot assure motivational equality if the independent variable induces changes in growth, maintenance of body weight, palatability of the reward, or similar factors. This is especially a concern in developing animals where appetite or palatability can be significantly affected by treatments that impair growth. This, in turn, can affect motivation and hence rate of learning. By contrast, rodents swim at similar speeds even if they differ in body weight and appetite. Moreover, even when a treatment changes swim speed, small changes usually present no problem. When a swim speed difference is seen in conjunction with a learning difference, the swim speed difference rarely accounts for the learning difference. Nonetheless, if concern over speed differences does arise, it can be addressed by using analysis of covariance with swim speed as the covariate in the analysis of learning. In this way, performance and learning factors can be disentangled. The second advantage of swimming tasks is that 100% of rats and close to 100% of mice perform these tests. This avoids the problem of selection bias. Selection bias is more common in appetitive tests, where some animals will not perform the task or perform it poorly, causing animals to be dropped from the data set. Dropout is a serious source of uncertainty in interpreting data, since it is impossible to know what effect the dropouts would have had on the results had they learned. In operant procedures, even when all animals learn the contingencies, there can be wide variation in levels of performance, making analysis of the data problematic. Water mazes circumvent dropout problems, since all animals can swim. This makes water mazes well suited for regulatory studies where selection bias presents problems for submitters and regulators. In most water mazes, 100% of control rats and a high percentage of mice master the task.
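The covariance adjustment mentioned above can be sketched in a few lines. The following minimal Python example computes covariate-adjusted group means for escape latency using a pooled within-group slope on swim speed, which is the core calculation of an analysis of covariance; it does not perform the accompanying significance test. The data, units, group labels, and function name are hypothetical, and the sketch assumes that the relation between speed and latency has the same slope in every group.

```python
from statistics import mean


def ancova_adjusted_means(groups):
    """Covariate-adjusted group means using a pooled within-group slope.

    groups maps a group name to a list of (swim_speed, latency) pairs.
    Returns the common slope and a dict of adjusted mean latencies.
    """
    sxy = sxx = 0.0
    group_means = {}
    for name, rows in groups.items():
        x_bar = mean(x for x, _ in rows)
        y_bar = mean(y for _, y in rows)
        group_means[name] = (x_bar, y_bar)
        # Accumulate within-group sums of cross-products and squares.
        sxy += sum((x - x_bar) * (y - y_bar) for x, y in rows)
        sxx += sum((x - x_bar) ** 2 for x, _ in rows)
    slope = sxy / sxx  # common regression slope of latency on swim speed
    grand_x = mean(x for rows in groups.values() for x, _ in rows)
    adjusted = {
        name: y_bar - slope * (x_bar - grand_x)
        for name, (x_bar, y_bar) in group_means.items()
    }
    return slope, adjusted


# Hypothetical data: (swim speed in cm/s, escape latency in s).
data = {
    "control": [(20, 30), (22, 28), (24, 26)],
    "treated": [(16, 44), (18, 42), (20, 40)],
}
slope, adjusted = ancova_adjusted_means(data)
print(slope, adjusted)
```

In this fabricated example the raw difference in mean latency between groups is 14 s, and a difference of 10 s remains after adjustment for swim speed, which is the pattern expected when a speed difference does not account for a learning difference.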
Although it is important that all animals are able to swim and finish as noted earlier, it is a further advantage if they demonstrate improvement across trials and reach proficient levels of performance by the end of the test. Shallow learning curves can prevent one from detecting differences because of ceiling effects and raise the question of whether a task is suitable if it is so difficult that animals can hardly improve. Conversely, steep learning curves also present problems. When a task is too easy, it can be insensitive because even impaired animals can learn it. Both problems are seen in the literature with the MWM. There are many mouse studies where the learning curves are nearly flat. And there are MWM studies in rats where the maze is so small and the platform so large that rats reach optimal performance on the first day of testing. It is best to demonstrate moderate learning curves in controls, typically over several days of testing. Ideally, the animals should find the task challenging at the beginning but reach asymptotic performance by the end. Every strain of rat tested in the MWM and the other water mazes described below shows proficient learning, as do many mouse strains. However, there are mouse strains that show poor MWM learning and may not be good experimental subjects for tests of some types of learning and memory. 22-24
Morris Water Maze
Of the 3 types of learning and memory, the one for which the underlying neural network, cellular physiology, and molecular mechanisms are best known is spatial or allocentric learning and memory. The test best validated for assessing spatial ability is the MWM (Figure 1). Because of its efficiency, ease of implementation, internal controls, and delineation of the processes of learning and memory (acquisition, consolidation, and retrieval) within the same test, the MWM is arguably the best learning test for regulatory and many basic research studies. The MWM has different versions or procedural variations. As a test of spatial ability, it is conducted with a submerged, hidden platform as the point of escape (goal) and is performed using a set of pseudorandom start positions located around the periphery of the tank that change on each trial. Changing the start locations is critical to prevent an animal from learning a fixed route to the platform; this establishes that the animal is using spatial cues to find the platform. As noted earlier, there is also a version of the MWM that assesses egocentric navigation: the cued version. For this procedure, either the escape platform is raised above the water level or a flag or similar object that extends above the water is mounted on the platform so that animals can see where the platform is located. To prevent (or at least minimize) an animal’s use of distal cues, curtains are closed around the maze so that there are few room cues visible from inside the tank. One additional component of the cued version is that both the start positions around the periphery of the tank and the position of the platform within the tank are varied on every trial. This makes a spatial strategy difficult for the animal to use because the relational cues are different on every trial. Although the cued version of the MWM has been used as an egocentric test, in our experience, it is not very sensitive for this purpose. 
However, this version has secondary value. Cued trials, given either before or after hidden trials, provide valuable evidence that the animals are not visually impaired, can use proximal cues, and can learn the task readily under less demanding conditions. Evidence that the cued version is not very sensitive as a stand-alone test of egocentric navigation is exemplified by experiments where we treated rats with methamphetamine, fenfluramine, or 3,4-methylenedioxymethamphetamine during neonatal development and tested them as adults in the hidden and cued MWM and in the CWM 25-30 (see reviews by Skelton et al 31 and Jablonski et al 32 ). We found enduring learning and memory deficits on hidden platform learning in the MWM and CWM but no effects on cued MWM learning. When using the MWM, there are several features that are important for it to work properly, and a number of these are detailed in Figure 1.

Figure 1. Picture of the Morris water maze (MWM). Tank size: for rats, tanks should be 183 cm in diameter or larger, and for mice 122 to 150 cm, but no larger. Large tanks for rats of 213 and 244 cm are even better because the difficulty of the task, and hence its sensitivity, depends on the search area in relation to the size of the escape platform, but rats can still learn in these larger tanks quite well. As the search area increases relative to platform size, the task becomes progressively more difficult and we find that larger tanks can reveal effects that smaller tanks cannot. Data demonstrating this effect are reported elsewhere. 15 Another feature is platform size and location. For an adult rat, a sufficient platform size is 10 cm in diameter. Larger platforms are not helpful and can even reduce task difficulty by changing the ratio of search to goal area. We use a 10-cm platform during initial learning (acquisition), then reduce it to 7 cm for further testing when the platform is moved to the opposite quadrant (reversal). We reduce it still further to 5 cm when we move the platform for a third time to an adjacent quadrant (shift). The platform should be submerged 1 to 1.5 cm below the surface and be made of a material that matches the background color of the tank or is transparent such that the animal cannot see it from water level. The platform should be placed about halfway between the side wall and the center of the tank. A tank that is about 50 cm deep is sufficient for rats and should be filled to a depth of ∼20 cm. This water depth prevents rats from touching the bottom, which is important because if they can touch bottom, they will sometimes push off in an attempt to jump toward the rim of the tank and escape. Leaving about 30 cm of freeboard above the water level is also important as it discourages animals from trying to leap from the platform to the rim. Jumping behavior distracts from searching and slows learning. 
Water should be at a laboratory’s typical ambient air temperature. When the tank is drained for cleaning, it should be refilled with clean water the night before so that it equilibrates to room temperature by the next day. A daily record of water temperature should be maintained. Water temperatures for rats or mice of 19°C to 23°C work well, with 20°C to 21°C being the best in our experience. Warming water is not necessary and slows learning. On the other hand, water temperatures below 19°C reduce core body temperature and interfere with learning. The older notion of coloring water is not necessary so long as the color of the tank and platform are designed so that the platform is not visible to the animal. If one is not certain about this, it can be tested. To do this, test a group of animals with the platform below the surface and change the platform and start positions randomly on every trial. If the animal can see the platform, it will learn to swim directly to it, whereas if it cannot, it will search on every trial. There will be some improvement in the first few trials because rats figure out that the platform is not near the edge or center and will become proficient at searching in the intermediate zone, but performance will plateau and show no further improvement.
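The relationship between search area and platform size noted in the Figure 1 legend can be illustrated with simple arithmetic. The sketch below is only an illustration and rests on simplifying assumptions that are ours, not the original authors': the whole water surface is treated as the search area, both tank and platform are treated as circles, and the function name is hypothetical. The tank and platform diameters are those mentioned in the text (the 210-cm tank is the one described for Figure 2).

```python
import math


def search_to_goal_ratio(tank_diameter_cm: float, platform_diameter_cm: float) -> float:
    """Ratio of tank surface area to platform area, both treated as circles.

    Assumption for illustration: the entire water surface counts as search area.
    """
    tank_area = math.pi * (tank_diameter_cm / 2) ** 2
    platform_area = math.pi * (platform_diameter_cm / 2) ** 2
    return tank_area / platform_area


if __name__ == "__main__":
    # Tank and platform diameters (cm) taken from the text.
    for tank, platform in [(183, 10), (244, 10), (210, 10), (210, 7), (210, 5)]:
        ratio = search_to_goal_ratio(tank, platform)
        print(f"{tank}-cm tank, {platform}-cm platform: about {ratio:.0f}:1")
```

Because the ratio scales with the square of the diameters, halving the platform diameter from 10 to 5 cm in the same tank quadruples the search-to-goal ratio, which is consistent with the observation that smaller platforms and larger tanks make the task harder.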
If more than 1 test phase is used (acquisition, reversal, shift, etc), one should use each start position each day to balance start locations. A typical procedure is 4 trials per day from 4 different start locations. In rats, trials may be given back-to-back. Since each trial has a time limit of 60 to 120 seconds, rats do not become fatigued even if they go to the time limit on every trial (which is rare); however, this is not the case for mice. Because of their smaller body mass, mice lose body heat rapidly in water; therefore, mice need at least 10 minutes between trials to avoid fatigue. Limiting mice to 60 or 90 seconds per trial is advisable. In mice, one also has to cope with floaters in some strains. Avoid pushing or prodding mice that float. We use the following procedure. If a mouse reaches the time limit on trial 1 by floating most of the time, wait 10 minutes and give it trial 2. If it again goes to the limit floating (if it goes to the limit but is actively searching, this procedure does not apply), remove it and repeat the procedure the next day. In our experience, this eliminates most floaters. But if a mouse fails to swim and search after 2 days (4 trials), then it should be removed from the test. If there are other tests after the MWM, these mice should proceed with their cohort to the next test despite this small experience difference as this difference is unlikely to carry over to the next test.
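One way to produce a balanced, pseudorandom start schedule of the kind described above (4 trials per day, each start position used once per day) is sketched below. This is a minimal illustration and not the authors' procedure: the position labels, the function name, and the rule that consecutive days should not share the same order are assumptions made for the example.

```python
import random

# Hypothetical labels for 4 start positions around the tank perimeter.
START_POSITIONS = ["N", "E", "S", "W"]


def daily_start_orders(n_days: int, seed: int = 0) -> list[list[str]]:
    """Return one start order per day, each using every start position once.

    Assumption for illustration: the order is reshuffled so that no two
    consecutive days use the same sequence.
    """
    rng = random.Random(seed)  # fixed seed so the schedule can be documented
    schedule: list[list[str]] = []
    previous = None
    for _ in range(n_days):
        order = START_POSITIONS[:]
        rng.shuffle(order)
        while order == previous:
            rng.shuffle(order)
        schedule.append(order)
        previous = order
    return schedule


if __name__ == "__main__":
    for day, order in enumerate(daily_start_orders(6), start=1):
        print(f"Day {day}: {' -> '.join(order)}")
```

Generating the schedule before the study and recording the seed keeps start locations balanced within each day and makes the sequence reproducible across cohorts.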
There will inevitably be animals that reach the time limit. There are 2 schools of thought on how such trials should be handled. One idea is to help them (assisted escape). For this procedure, the experimenter guides the animal to the goal after it reaches the time limit. The concept is that the animal will gain from this experience and perform better on the next trial having been shown where the goal is. There are several issues with this approach. First, what is it that people mean by “the experimenter guided the animal to the platform?” No one describes how they do this, which creates uncertainty and makes replicating a study difficult. When we have done it, we used a pole with a distinctive object fastened to the end to enhance visibility. This is held a short distance from the animal’s nose until they notice it and begin to follow; the experimenter then moves the tip of the pole gradually toward the goal at a rate they estimate the animal can keep up with without grabbing the pole. This method is fraught with problems. First, each experimenter leads animals differently, thereby introducing variability. Second, some animals follow an object well, whereas others do not. Third, some animals lunge for the pole and grasp it. Shaking them off is disruptive causing some animals to then swim away from the pole. However, if the pole is too far from the animal’s nose, they will not attend to it. When this happens, the animal swims away and has to be reengaged, but if this happens too many times, the animal will lose interest and then cannot be guided. Overall, experimenter interventions are never uniform. If these are the limitations of assisted escape, what are the advantages? There are data suggesting that animals acquire some tasks slightly faster using assisted escape, but the effects are small and can turn out to be counterproductive if one helps more animals in the experimental than in the control group.
The alternative is unassisted escape. In the MWM, unassisted escape can mean 1 of 2 things: (1) lifting the animal out and placing it in its cage or (2) putting it on the platform first and then putting it in its cage. Most experiments using unassisted escape place the animal on the platform first (for the time of the intertrial interval) and then give it the next trial or place it in a cage if there is going to be a rest period between trials. The concept behind this is to allow the animal to look around the room at distal cues to reinforce its memory for where the goal is within the environment. The advantages of unassisted escape are that there is little experimenter judgment and less interaction between the animal and the experimenter. Because there is less interaction, unassisted escape reduces the risk of experimenter effects. The only disadvantage is slightly slower learning. The question is what is gained or lost by one method versus the other. Given that the objective is to determine whether the treatment causes differences, anything that minimizes differences (assisted escape) runs counter to the experimental objective. Therefore, for both theoretical and practical reasons, we use unassisted escape.
Aside from the 3 learning phases described so far for the MWM in which animals learn to find a hidden platform (acquisition, reversal, and shift), there is another aspect to the test: the probe trial. A probe trial is a test of memory and is given to see whether the animal remembers where the platform was by removing the platform and recording where the animal searches. If the animal searches where the platform had been and searches in the correct quadrant more than in other areas, the animal has demonstrated memory. Probe trials can be given at different intervals during or after platform trials. The most common is to give a single probe trial 24 hours after the last platform trial of the last day. If platform trials are given for 6 days, then the probe trial is given on day 7. If one tests acquisition, reversal, and shift, then each phase consists of 7 days, 6 days of platform trials, and a probe trial on the seventh day. Probe trials generally range from 30 to 60 seconds in length. There are data that 120 seconds is too long because it results in extinction. If a probe trial is given each day before platform trials, it can be used to track the emergence of reference memory. Although this technique may be useful in some circumstances, it is not common or necessary in most cases. More commonly, a probe trial is given at the end only. However, the spacing between the last platform trial and the probe trial is important. If given shortly after the last training trial, it creates interpretational issues. This is because (1) the purpose of the probe is to assess reference memory, but if the trial is given right after the last platform trial, it may be measuring memory for the last platform trial, that is, short-term memory and (2) a short interval does not allow for consolidation. For these reasons, it is generally given 24 hours after the last platform trial.
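The probe trial measure described above (searching in the correct quadrant more than in other areas) can be summarized as percent time per quadrant. The following is a minimal sketch under stated assumptions: tracking data are (x, y) positions sampled at a constant rate with the origin at the tank center, quadrants are named by compass direction, and the function names are hypothetical rather than taken from any tracking software.

```python
from collections import Counter

QUADRANTS = ("NE", "NW", "SW", "SE")


def quadrant(x: float, y: float) -> str:
    """Map a position (origin at tank center) to a compass quadrant."""
    if x >= 0 and y >= 0:
        return "NE"
    if x < 0 and y >= 0:
        return "NW"
    if x < 0 and y < 0:
        return "SW"
    return "SE"


def quadrant_time_percent(samples: list[tuple[float, float]]) -> dict[str, float]:
    """Percent of equally spaced tracking samples falling in each quadrant."""
    if not samples:
        raise ValueError("no tracking samples supplied")
    counts = Counter(quadrant(x, y) for x, y in samples)
    return {q: 100.0 * counts.get(q, 0) / len(samples) for q in QUADRANTS}


if __name__ == "__main__":
    # Made-up samples for illustration only; platform had been in NE.
    samples = [(10, 10)] * 6 + [(-10, 10)] * 2 + [(-10, -10)] + [(10, -10)]
    percent = quadrant_time_percent(samples)
    print(percent)
    print("Above 25% chance in target quadrant:", percent["NE"] > 25.0)
```

With 4 equal quadrants, an animal searching at random would spend about 25% of the trial in each, so time in the target quadrant is usually compared against that chance level.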
A frequent question about the MWM is whether it can be used repeatedly to test whether an effect is progressive, whether recovery from an effect is occurring, or whether an effect is irreversible after a recovery period. Although we have not often used the MWM this way, there is no reason to believe it would not work. In support of this, we did an experiment where we tested rats repeatedly by moving the platform to a series of different locations. In this experiment, we tested rats in the maze in acquisition, reversal, shift to the left, reversal of left shift, shift to the right, reversal of right shift, and so on, after developmental exposure to methamphetamine. 33 We found that although the deficit caused by the drug was largest for acquisition, it remained through all the different phases. It was larger for the phases where the platform moved to an opposite quadrant and smaller where the platform was moved to an adjacent quadrant. The effect was even more pronounced after all these phases when we moved it a final time and reduced the platform’s size from 10 to 5 cm. These data are shown in Figure 2. The data suggest that if a deficit persists using back-to-back phases, it would also remain for spaced phases. 34 Therefore, we think this is an appropriate way to monitor changes in learning and memory over time. For specific details, see Vorhees and Williams. 35

Figure 2. Morris water maze. Mean ± standard error of the mean (SEM) latency (s) per day to find a hidden platform during multiple phases of testing. Each rat received 4 trials per day with one start from each of the 4 different start positions around the perimeter of the 210-cm-diameter tank. Sprague Dawley rats were used. Rats were treated with 0, 5, 10, or 15 mg/kg of (+)-methamphetamine by subcutaneous injection 4 times per day at 2-hour intervals from P10-20 and tested as adults. For the first 5 phases, the platform was 10 cm in diameter, and for the sixth phase, the platform was 5 cm in diameter. For all phases, the platform was submerged 1.5 cm below the water. Platform positions during each phase were acquisition (Acq), NE quadrant; reversal (Rev-1), SW quadrant; first shift (Shift-1), SE quadrant; second reversal (Rev-2), NW quadrant; second shift (Shift-2), NE quadrant; third reversal (Rev-3), SW quadrant (5 cm platform). Note that all groups took longer on changes to the opposite compared with adjacent quadrants. Note also how reducing the platform size for the final phase increased group differences. Effects of methamphetamine were significant, but asterisks have been removed to make the effects of the different phases clearer. Adapted from Williams et al. 33
Cincinnati Water Maze
Egocentric navigation depends on striatal pathways and is distinct from spatial learning. Separately assessing egocentric learning requires different tests. One test that can be used to assess egocentric navigation is the CWM. This test can be conducted under different conditions. When used in the presence of standard laboratory ambient visible light, it assesses egocentric and spatial navigation simultaneously since the animal can use proximal cues inside the maze, internal cues of limb movement, and distal room cues it can see above the maze. However, if the test is conducted in complete darkness (under infrared light), the animal must rely solely on internal cues of limb movements, speed, and turns, thereby making it a test of egocentric learning. The Cincinnati water maze is shown schematically in Figure 3 along with details about the apparatus.

Figure 3. Schematic of the Cincinnati water maze (CWM). As with the MWM, the walls of the maze are typically 50 cm in height and the water is filled to a depth of 20 cm. Whether tested in the light or dark, the CWM requires a pretest in a straight swimming channel. Although such a channel can vary in length, ours is 244 cm long, 15 cm wide, and 50 cm high with a hidden platform at one end. The day before maze trials, rats are given 4 timed trials to swim from one end to the other. What happens if straight channel trials are not given before maze trials? There will be many failures, and some animals will give up and stop searching for the goal, with the result that the data are virtually unusable. This is because the CWM is so difficult that rats need the reinforcement of learning that escape is possible by finding the hidden platform in the straight channel. This leads animals to persistently search when in the maze and eventually leads to finding the goal. Even with premaze straight channel trials, there will be a few rats that will not find the goal and will stop searching and swim in one area using as little effort as possible to keep their nose above water. However, without straight channel trials, many rats will stop searching. Straight channel trials provide another contribution: once the rat figures out that it is a straight line from start to finish, it swims the channel rapidly. By timing these trials, one gets an index of swim speed from which one can infer swimming competence and motivation to escape. If swim latencies are equal across groups, then one can conclude that differences in maze performance are not attributable to secondary performance factors. In the CWM, we give 2 trials per day. If tested in the light, control animals become proficient in 5 days. When tested in the dark, 15 to 18 days are required. Because the CWM is much more difficult than the MWM, the time limit per trial is longer: 5 minutes. 
If the goal is found in under 5 minutes on trial-1, trial-2 is given immediately. If the rat reaches the time limit on trial-1, it is given at least a 5-minute rest before trial-2. Unassisted escape is used in the CWM just as for the MWM. Water temperature is also the same as for the MWM. The value of water mazes, both the MWM and CWM in regulatory studies, is discussed in greater detail elsewhere. 15 You will note that in discussing the CWM, only rats are mentioned. This is because we find that mice, in a mouse-scaled CWM, cannot find the goal. The CWM appears to be too complicated for mice. Why this occurs is not clear. It could be a cognitive limitation, it could be that mice have a lower frustration tolerance than rats, it could be fatigue, or it could be a species-specific response characteristic that is ill-suited to this maze. But, whatever the reason, few mice are able to find the exit even under visible light conditions and even if the complexity of the maze is reduced by blocking some cul-de-sacs. On each trial, latency and errors are recorded. Although correlated, the 2 measures are not identical, and periodically, we find effects on one and not the other. Such discrepancies are rare but when they occur should be examined carefully. One source of discrepancy occurs when rats in one group, more than in another, do not find the goal. This can be caused by different types of behavior. One is where the rat searches continuously for 5 minutes but does not find the escape. This is usually because they perseverate on some channels to the exclusion of others. The other situation is where the rat searches at first and then stops and remains in one area without further searching. Rats that fall into the latter category cause a problem because the number of errors committed by these rats underestimate what their errors would have been had they continued searching. 
When we identify rats that stop searching, we use a corrected error score for them on those trials where they remained largely stationary. We do this by identifying the animal that makes the most errors in the experiment and using this number as the error score for animals that reach the time limit but stop searching. In 30 years of using the CWM, we have found this to be the best solution to this problem. That this is appropriate is supported by examining the range of errors among animals: those that stop searching make very few errors, whereas those that keep searching make large numbers of errors (up to 100 or more). This can be further verified by examining the number of trial failures: treatment groups with many trial failures should have more errors; if they do not and this is verified by observation, then adjustment is warranted. We also compare errors with latency to see that they are in general agreement; when they are not, nonsearching animals are always the cause.
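The error correction described above can be sketched in a few lines. This is an illustration only, with assumptions labeled: the record structure and names are hypothetical, "most errors in the experiment" is interpreted here as the largest single-trial error count in the data set, and whether an animal stopped searching is taken as an observer-supplied flag.

```python
from dataclasses import dataclass

TIME_LIMIT_S = 300.0  # 5-minute CWM trial limit from the text


@dataclass
class Trial:
    animal: str
    errors: int
    latency_s: float
    stopped_searching: bool  # observer judgment (assumed to be recorded)


def corrected_errors(trials: list[Trial]) -> list[int]:
    """Replace errors with the data set maximum on trials where the animal
    reached the time limit and had stopped searching; leave others unchanged."""
    max_errors = max(t.errors for t in trials)
    return [
        max_errors
        if t.latency_s >= TIME_LIMIT_S and t.stopped_searching
        else t.errors
        for t in trials
    ]


if __name__ == "__main__":
    # Made-up values for illustration only.
    trials = [
        Trial("A", 12, 95.0, False),
        Trial("B", 104, 300.0, False),  # searched for the full 5 minutes
        Trial("C", 8, 300.0, True),     # stopped searching, few errors
        Trial("D", 30, 210.0, False),
    ]
    print(corrected_errors(trials))
```

In this example only animal C is adjusted; animal B also reached the limit but kept searching, so its recorded errors are retained.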
A recurrent issue in analyzing data from learning and other tests with a time limit is that time limits truncate the range of possible values causing censorship. Such censorship alters the shape of the underlying distribution upon which inferential statistics are based. There are methods for dealing with this, although they are not often used in toxicology. When censorship is minor, it can usually be ignored and the data analyzed by factorial analysis of variance. For example, when the CWM is run under lighted conditions, the number of trials on which animals reach the 5-minute limit is small and standard statistical analyses are sufficient. However, when run in the dark, the number of 5-minute trials is high on the first 3 to 4 days. An approach that can be used without resorting to specialized methods is to use the inflection point to divide the data into segments. The first 6 to 8 trials when all animals reach the 5-minute limit are excluded, and data from the point of change are analyzed. This focuses the analysis on the days when rats are improving, which is the part of the learning curve of interest.
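The segmentation approach described above can be sketched as follows. The rule used here to locate the point of change is an assumption for illustration (the first trial on which fewer than a chosen fraction of animals reach the limit); the text does not specify how the inflection point is identified, and in practice it may be chosen by inspecting the learning curve. Function and parameter names are hypothetical.

```python
TIME_LIMIT_S = 300.0  # 5-minute CWM trial limit from the text


def first_uncensored_trial(latencies: dict[str, list[float]],
                           censored_fraction: float = 1.0) -> int:
    """Index of the first trial on which the fraction of animals at the time
    limit falls below `censored_fraction` (default: not every animal censored)."""
    n_animals = len(latencies)
    n_trials = len(next(iter(latencies.values())))
    for trial in range(n_trials):
        at_limit = sum(1 for lat in latencies.values() if lat[trial] >= TIME_LIMIT_S)
        if at_limit / n_animals < censored_fraction:
            return trial
    return n_trials  # every trial was fully censored


def trim_leading_censored(latencies: dict[str, list[float]],
                          censored_fraction: float = 1.0) -> dict[str, list[float]]:
    """Drop the leading block of censored trials for all animals alike."""
    start = first_uncensored_trial(latencies, censored_fraction)
    return {animal: lat[start:] for animal, lat in latencies.items()}


if __name__ == "__main__":
    # Made-up latencies (s) for illustration only; one list entry per trial.
    latencies = {
        "rat1": [300, 300, 300, 250, 180],
        "rat2": [300, 300, 300, 300, 220],
        "rat3": [300, 300, 290, 270, 150],
    }
    print(first_uncensored_trial(latencies))
    print(trim_leading_censored(latencies))
```

The same cut point is applied to every animal and group so that the remaining trials can still be analyzed with a standard factorial design.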
Passive Avoidance
Another widely used test in neurotoxicity is passive (or inhibitory) avoidance. This test is often conducted in a 2-chamber apparatus with one side lighted and the other dark with a sliding door in between. A grid floor is used through which scrambled footshock may be delivered. A variation is a step-down passive avoidance where the animal sits on a perch in the middle of the test chamber. Once the animal steps off, it gets a footshock and is placed back on the perch for up to 3 minutes to see how long it remains there. In step-through passive avoidance, the animal is placed on the lighted side. After a specified interval, the door is opened and the animal’s preference for the dark usually causes it to crossover. Once it crosses, it receives a footshock. After a delay, it is put back on the lighted side, the door opened, and timed for how long it takes for it to cross over again (up to 3 minutes). A variation is to give animals repeated trials until they remain on the lighted side for 3 minutes. There are those who believe the trials to criterion method is significantly better than the 1-trial method, but no matter the method, the test has noteworthy limitations: (1) variability is large. When the door is opened, some animals dart to the dark side, whereas others take a long time to cross, and some are so fearful they never cross within the 3-minute limit. This creates large error variance that makes the test insensitive; (2) the test is, by definition, a do-nothing assay; the premise of the test is to measure how long the animals sit and do not cross over rather than an active response as in all other learning tests. 
An animal may remain stationary for many reasons, some of which have nothing to do with associative learning; this makes the test prone to confounding by factors the experimenter cannot observe; (3) in the 1-trial method, there is no learning curve; even in the trials to criterion method, the learning curve is steep; this makes it difficult to assess rate of learning, which is what should be evaluated; (4) the test cannot differentiate a change in shock or pain threshold from a change in learning; hence, the test requires secondary procedures to ensure that changes in latency are attributable to learning and not something else; but (5) perhaps the most telling reason that passive avoidance is not recommended is that when we recently evaluated the effects of preweaning methamphetamine on later learning on 4 different tests, all showed deficits except for passive avoidance. Rats tested in the MWM (in multiple phases), CWM, and radial water maze (RWM) all showed impaired learning, but in passive avoidance, animals showed no effect (not even a trend). If a well-established developmental neurotoxin adversely affects all these types of learning but not passive avoidance, how good a test can it be? The reason it remains in use is that direct comparisons between it and other methods are rarely done; our new data are one of the few cases where a comparison was made with a known developmental neurotoxin. Doubts have been raised about passive avoidance for decades, but there has been resistance to switching to tests such as the MWM and CWM; this may be inertia, but recently, regulatory agencies have become more attuned to the type of learning tests submitters use and have, in many cases, asked for water mazes that assess spatial and egocentric learning and memory.
Acoustic Startle Response
Methods for the acoustic startle response (ASR) are well known, and the usefulness of the test has been demonstrated many times, including in an EPA review of 69 DNT studies on pesticides. 36 In the review by Raffaele et al, 36 the tests were compared to determine which ones provided the point of departure (POD) for risk assessment. The ASR showed good utility in this regard. In addition, ASR is quantitative and the response is homologous across species from rodents to humans, thereby providing excellent translational relevance. Furthermore, the use of ASR in combination with response modification paradigms, such as PPI, is well documented in a number of neuropsychiatric conditions, providing predictive validity for detecting human brain disorders.
A seldom used variant of ASR is the tactile startle response (TSR). The TSR has received little attention in neurotoxicology. The reason for this is not entirely clear but may stem from the fact that the apparatus is a little more complicated than that for ASR. But a potential advantage of TSR is that the response amplitude is 2 to 3 times larger than ASR. Why rodents are more affected by an air puff than a loud sound is unclear, but the response is so much more robust that it is a promising alternative to ASR, provided it can be demonstrated to be at least as sensitive. The TSR is certainly easier to measure than ASR. Whether its magnitude translates to increased sensitivity is unknown. We are currently exploring its potential for detecting the effects of pesticides. The TSR is also interesting because it is similar to startle in humans. Human startle testing also uses an air-puff stimulus, but it is delivered just to the corner of the eye, and the response measured is an eye blink rather than a whole-body response as in rats; nevertheless, the 2 are very similar. A further advantage of startle is that it can be tested over and over because the rate of habituation is modest compared with most behavioral tests. Startle is also mediated by a well-described neural circuit that provides a source of information if one wants to find the nexus of an observed effect. 37 By analyzing the pathway, one could theoretically find the site at which a change occurred.
Functional Observational Battery
The FOB was adopted by the EPA in the health effects guidelines decades ago when neurotoxicological methods were still being developed. Some methods were further along than others. For example, methods for ASR and motor activity (locomotor activity, open-field activity, and spontaneous locomotor activity) were already well established. Although learning and memory methods were available, there was not a broad consensus on which ones the EPA should recommend, yet the importance of such assessment was recognized. Therefore, the EPA wrote the guideline for this assessment in general terms. For adult rats, schedule-controlled operant behavior was codified. However, developmental assessments written for the guideline were so general that virtually any tests could be used. In addition, the agency wanted generic methods to assess basic neurological functions and they had one under development internally that they called the FOB. The FOB was based on the work of Irwin. 10 He called his method the comprehensive observational assessment (COA). The COA was designed for mice and was intended to be a rapid screen for easily observed physical and neurobehavioral changes for new pharmaceutical compounds. In laboratories of the EPA, the COA changed to become the FOB. This is the only method the EPA put in the guideline that it developed itself. The ASR, open field, learning and memory, and operant were all taken from the literature. Concerns about the FOB were expressed by many scientists outside the agency from the outset, but some also supported it. Over many years, the EPA continued to work on the method. The first published report on it appeared in 1988. 38 In subsequent years, the method was gradually expanded. Described in 1988 in terms similar to Irwin’s COA as being a screening instrument, over time it became more time-consuming. Today, the FOB is neither quick nor simple, but it is unlikely anyone would mind this if it was successful for its intended purpose. 
The EPA’s own review of 69 DNT studies submitted to the agency over a period of 20 years 36 found that the FOB provided the POD for risk assessment for insecticides only 4% of the time, that is, in 3 studies of 69. Were this not enough evidence to raise concerns, there are other issues with the method. First, it is subjective, relying on observer ratings. Second, in terms of validity, the FOB has been used in a number of studies of insecticides and known neurotoxins. It is clear that at high doses, the FOB shows effects, but there is little evidence that it is sensitive to anything less than overt toxicity. Third, it lacks construct validity. Procedures within the battery are rudimentary observations, many with no relationship to any CNS pathway, such as coat color, coat texture, falling foot splay, urination, defecation, and redness around eyes. Some are CNS/peripheral nervous system related but only in a crude way such as stance, walking, pupil reflex, tail pinch, clicker-elicited ear twitch, whisker response, tremors, seizures, and so on. Rating scales are at best ordinal but subjectively ordinal, and the resulting data are not readily analyzed statistically. Data tend to be shown in large tables marked with 1 or more symbols to reflect the average rated symptom as none, mild, moderate, or severe, but the definitions of each are not clear. Fourth, there is no unifying mechanistic underpinning to the measures, and they are not analogous to a human neurological examination to even a minimal degree. Grip strength and core body temperature may be exceptions. These 2 measures are quantitative and one could make a case for retaining these and only these. Urination and defecation measures have been around for decades but are unreliable. 
These are used as indices of autonomic function, but the problem is that because of differences in species, strain, supplier, vivaria, housing, enrichment, diet, handling, background noise, bedding, and cleanliness standards, no 2 laboratories provide the same conditions, making these measurements essentially meaningless. The bottom line is that if 69 regulatory DNT studies with standardized FOBs show that it does not provide the POD for risk assessment with chemicals designed to be neurotoxic, then it probably does not belong in the guidelines. Any replacement should be quantitative, based on established CNS constructs of clear importance to an organism’s survival, and homologous to the same CNS function in humans. Memory, learning, attention, impulsivity, and executive function are worth considering. Alternatively, one could do another 69 DNT studies using the FOB and then have 138 studies showing it is not very useful. It is up to the scientific community to form a consensus that it should be removed from the guideline, otherwise there will be another 69 studies using this method. A final point about the FOB is this: outside of regulatory studies, the FOB is not used. If those who specialize in neurotoxicology and neuroscience research decline to use it that is important information about its scientific value since most researchers use ASR/PPI, OF, MWM, CWM, RAM/RWM, operant, and other behavioral methods regularly.
In sum, in deciding on methods for evaluating developmental, juvenile, or adult neurotoxicity, the choices should be based on the merits of the test as demonstrated in the literature as well as construct/theoretical validity, practicality, and interpretational/predictive value. On this basis, some tests have proven merit. For learning and memory, the MWM and CWM are recommended for assessing allocentric and egocentric navigation. For working memory, there are no clear leading choices, but RAM or its swimming equivalent, RWM, are valid tests. For sensorimotor gating, ASR/PPI stands out, with TSR having future potential if more research shows that it has advantages. For movement, automated motor activity tests are standard and well validated, especially when tested for 1 hour or more. By contrast, the FOB and passive avoidance have serious deficiencies and are not recommended.
Additional Neurobehavioral Evaluations and Considerations in Regulatory Studies (LaRonda Morford)
A number of regulatory guidelines require neurobehavioral assessments as end points. Regulatory studies that incorporate neurotoxicity assessments include, but are not limited to, safety pharmacology studies, 7 PPND studies, 6 and juvenile toxicity studies 8,39,40 for pharmaceuticals and neurotoxicity studies, 41 DNT, 42 and extended 1-generation reproductive toxicity studies 43 for agro/chemicals. Neurotoxicity end points in these studies include biochemical, behavioral, and morphological evaluations. As discussed earlier, the primary behavioral assays included are clinical observations or observational batteries of some type, motor activity, startle response, and learning and memory.
Although neurobehavioral data are often a major component of regulatory-driven safety assessments, behavioral evaluations may also play important roles in efficacy studies and issue resolution studies. Often the same assay may be applied to address either safety or efficacy. The difference may be in how the data are interpreted. One example is seizure threshold studies. In safety assessment, this test is used to identify adverse effects of the compound on convulsive liability, whereas the same test can be applied to identify compounds that might be effective in treating epilepsy. Thus, the same test can have a safety or an efficacy application depending on how the data are interpreted.
Neurobehavioral evaluations may be used for screening purposes as a Tier 1 test (hazard identification) or to further characterize effects on the CNS as a Tier 2 test (hazard characterization). Tier 1 or screening tests typically consist of simple or quick tests of behavior that may be used to identify whether a chemical acts on the nervous system, whereas Tier 2 testing involves more complex tests that provide a more complete description of the effects. Innate or reflex behaviors, such as locomotor activity and sensory function, provide a broad assessment of neurological function and may be evaluated at both levels of assessments. Learned or conditioned behaviors require training of the test subject, are focused on specific aspects of behavior, and are generally more time and resource intensive, and therefore, are often included in Tier 2 testing.
Inclusion of clinical observations to assess the general health and condition of the animals is common in all types of studies. However, studies assessing neurofunction often include additional clinical observations or even observational batteries. The latter are used for screening purposes to detect any potential CNS effect and may be useful when little or no information is available for a chemical. Generally, laboratories use their own battery and procedures such that, in practice, one size does not fit all and there is no single method used across all studies or laboratories. Importantly, this results in the need for careful review and understanding of the nomenclature used. For example, cageside observations, detailed clinical observations, neurologic examinations, expanded clinical observations, modified Irwin observational battery, modified FOB, and FOB are all methods that may be used by a laboratory to assess neurofunction. However, the details of the procedures may or may not be similar across laboratories even when using the same “method.” The description of behavior may also vary depending on the laboratory and method used. For example, many laboratories simply report the “presence” or “absence” of a behavior, whereas others may rank the severity of the behavior. Often, laboratories use both with only specific behaviors given severity scores. Severity scores of subjective evaluations allow a semiquantitative aspect to the data and may provide more specific information on distribution across treatment groups. Severity scores are often considered to be a more sensitive approach than simply listing “normal” or “abnormal”; however, because they are subjective, they must be clearly defined and standardized within the laboratory.
In addition to the differences across laboratories, neurobehavioral observations and observational batteries require minimization of differences within laboratories. Training of staff is critical and should include some measure of inter-rater reliability and periodic refresher training. Observer bias, however unintentional, should be avoided, and, when possible, staff should be blinded to treatment. Training and procedures should be specific to the species and age of the animal. Assessments must be tailored to the age and species because the normal response varies, and the observer must understand what a normal response looks like. However, it is recognized that training staff to identify all possible responses at all ages may be challenging.
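One way to put a number on the inter-rater reliability mentioned above is Cohen's kappa, which corrects the raw agreement between 2 observers for agreement expected by chance. The sketch below is illustrative only: the choice of kappa, the function name `cohens_kappa`, and the presence/absence scores are assumptions, not a method prescribed by the guidelines or by the symposium.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who scored the same animals."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Proportion of animals on which the two raters agree
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's category frequencies
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical presence/absence scores for one sign in 10 animals
observer_1 = ["absent", "absent", "present", "absent", "present",
              "absent", "absent", "present", "absent", "absent"]
observer_2 = ["absent", "absent", "present", "absent", "absent",
              "absent", "absent", "present", "absent", "present"]
kappa = cohens_kappa(observer_1, observer_2)
```

In this made-up example the observers agree on 8 of 10 animals, yet kappa is only about 0.52 because much of that agreement is expected by chance when most animals are scored "absent."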
Functional observational batteries and Irwin screens are designed to detect major overt behavioral, physiological, and neurological signs. These observational batteries cast a wide net with the goal of detecting any potential CNS effects and are a Tier 1 or screening approach designed to identify potential hazards. Although general aspects and experimental tests that should be included in these batteries are described in the regulations, each laboratory usually has its own version. Therefore, clearly defined protocols are critical, especially when comparing data across studies or class of compounds. Staff should be unaware of the animal’s treatment and should understand normal behavior and factors that can alter behaviors. For example, handling can affect an animal’s response. Thus, staff conducting the testing must be comfortable in handling animals. The staff must also be certified with tests using positive controls. An advantage of observational batteries is that the same animal may be repeatedly assessed to determine onset, progression, duration, and reversibility of a neurotoxic injury.
Evaluations of motor activity can be stand-alone tests or conducted in the same animals evaluated in behavioral batteries, such as the FOB. Like observational batteries, motor activity may be considered in most cases a Tier 1 or screening approach designed to identify potential hazards. Motor activity evaluates spontaneous activity, and many systems are commercially available, including photocell-based, field-sensing, mechanical, and electronic or video-tracking systems. The size and shape of the testing chambers range from polycarbonate cages similar to home cages to open fields, circular alleys, and figure-8 forms. Evaluations may include overall levels of activity during a test session; habituation (decreases in activity levels during a session); spatial distribution or location within the testing chamber; and ontogeny of activity as locomotion matures during the first few weeks postnatally in the rat. “Crawling” behavior peaks around PND 7 and disappears around PND 15, whereas adult-like locomotion appears around PND 15 to 16 in the rat. 44 It is important to understand the ontogeny for developmental studies. For example, activity levels are typically lowest, with little to no habituation, in a PND 13 rat, a time before the eyes open and when motor skills are quite limited. A peak period of hyperactivity occurs around PND 15 to 20, and then there is a slight decrease in activity levels followed by a gradual increase to adult activity levels around PND 50 to 60. 44 It is important to note that developmental delays can occur, which may shift this normal pattern of activity. Developmental delays could be due to maternal nutrition (or offspring undernutrition), a delay in the ontogeny of motor systems, or a combination of both. If one is not aware of this normal response and developmental maturation, one may not be able to interpret the activity data accurately, especially if only one time point is evaluated.
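As a rough illustration of how total activity and within-session habituation might be summarized from an automated system, the sketch below takes activity counts in consecutive time bins and compares the last third of the session with the first third. The bin structure, the late-to-early ratio, and the name `summarize_activity` are assumptions made for illustration; laboratories and vendors use a variety of summary measures.

```python
def summarize_activity(bin_counts):
    """Summarize one session of binned activity counts.

    Returns total activity plus a simple habituation measure: mean counts
    in the last third of the session divided by mean counts in the first
    third (values well below 1 suggest habituation).
    """
    if not bin_counts:
        raise ValueError("bin_counts must not be empty")
    k = max(1, len(bin_counts) // 3)  # number of bins in each third
    early = sum(bin_counts[:k]) / k
    late = sum(bin_counts[-k:]) / k
    ratio = late / early if early > 0 else float("nan")
    return {"total": sum(bin_counts), "early_mean": early,
            "late_mean": late, "late_to_early": ratio}


# Hypothetical 1-hour session recorded as twelve 5-minute bins
session = [120, 100, 90, 70, 60, 50, 40, 35, 30, 25, 20, 20]
summary = summarize_activity(session)
```

A ratio near 1, as might be seen in a PND 13 rat, would indicate little or no habituation, which is why the age at testing must be kept in mind when reading such a summary.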
Auditory startle response is the most common test of sensorimotor function included in developmental safety assessment testing. Typically, the force of the motor response following a suprathreshold auditory stimulus is measured in an automated system; however, air puffs or lights may also be used as the stimuli. Auditory startle response is functional in rats by PND 12. 44 Latency and magnitude are the dependent variables, and trials can range from a few to many. The response typically habituates with repeated stimulus exposure. Auditory startle response can be significantly impacted by a number of factors, and care is needed to ensure equipment is set up properly. For example, vibrational transfer from one chamber to another should be avoided. This can be achieved by placing the testing chambers on marble/granite tables or similar structures. Testing chambers should not be placed next to one another on stainless steel tables as the vibrations from one chamber can often be picked up by an adjacent chamber. Animal holders can also have a significant impact on startle “responses.” Rodents should not be able to hang onto holders while testing; instead all 4 paws should be on the floor of the holder so that an accurate force can be detected. Furthermore, animal holders should be designed such that exploration and excessive movement are avoided.
In addition to ASR, PPI is sometimes included in developmental safety testing, but this is generally based on a specific concern or trigger, such as mechanism of action. Prepulse inhibition is a test of sensorimotor gating where the startle response decreases after preexposure to a weaker stimulus. As in ASR, the stimulus can be acoustic, tactile, or visual. Prepulse inhibition is often used in primary pharmacological or efficacy testing as it is considered an animal model of schizophrenia. Additionally, by varying the intensity of the prestimulus tone, the auditory threshold can be established, which may be useful in detecting midfrequency hearing loss.
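Prepulse inhibition is commonly expressed as the percentage reduction in startle magnitude on prepulse-plus-pulse trials relative to pulse-alone trials. The sketch below assumes that convention; the function name `percent_ppi` and the example magnitudes are hypothetical, and individual laboratories may compute the measure differently (eg, per prepulse intensity or per trial block).

```python
from statistics import mean


def percent_ppi(pulse_alone, prepulse_plus_pulse):
    """Percent prepulse inhibition from startle magnitudes of one animal.

    %PPI = 100 * (1 - mean(prepulse + pulse) / mean(pulse alone))
    """
    mean_pulse = mean(pulse_alone)
    mean_prepulse = mean(prepulse_plus_pulse)
    if mean_pulse <= 0:
        raise ValueError("pulse-alone startle magnitude must be positive")
    return 100.0 * (1.0 - mean_prepulse / mean_pulse)


# Hypothetical startle magnitudes (arbitrary force units)
ppi = percent_ppi(pulse_alone=[400, 420, 380],
                  prepulse_plus_pulse=[200, 220, 180])
```

Here the prepulse halves the mean startle magnitude, giving 50% PPI; a negative value would indicate that the prepulse increased rather than decreased the response.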
Additional neurobehavioral tests that can be conducted with pharmaceuticals and/or agro/chemicals include tests of neuromotor functions, tests of emotionality such as anxiety or depression, and tests of pain. Neuromotor function may be measured by motor activity, but other aspects can also be evaluated, such as coordination, equilibrium, and strength. Grip strength is often included in observational batteries and is sensitive to CNS depression, spinal or peripheral pathology, neuromuscular junction dysfunction, and nonspecific factors. Grip strength evaluations may include forelimb and/or hindlimb and generally consist of 3 trials with either the maximum force or the average of the trials reported. Rotarod is an automated test that evaluates the ability to maintain balance on a rod that is rotating at either a fixed or a gradually increasing speed. Although the test appears simplistic, it can be confounded by rodents that are unable to perform or refuse to perform the task under control conditions. Evaluations may include the time to fall off or latency when rotational speed is constant or the latency and rotation speed to fall when rotation speed is increased over time.
Anxiolytic and anxiogenic behavior is commonly evaluated in an elevated plus maze or an elevated zero maze. These tests leverage rodents’ natural conflict between exploration of a novel environment and the avoidance of open bright spaces. Anxiolytic compounds result in rodents spending more time in open areas, whereas anxiogenic compounds result in rodents spending less time in open areas. In addition to the elevated plus or zero maze, open-field motor activity may also be used as a preliminary indicator of anxiety that can then indicate the need for further Tier 2 testing. In this assessment, time spent in the center versus time spent on the perimeter is measured with more anxious rodents spending more time on the perimeter avoiding the brightly lit center area.
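The center-versus-perimeter measure described for the open field can be sketched from video-tracking output as the fraction of position samples that fall inside a central zone. Everything specific here is an assumption for illustration: a square arena, positions sampled at a fixed rate, a center zone spanning half the arena side, and the name `center_time_fraction`.

```python
def center_time_fraction(positions, arena_side, center_fraction=0.5):
    """Fraction of equally spaced position samples inside the center zone.

    positions: iterable of (x, y) coordinates, origin at one arena corner.
    arena_side: side length of the square arena (same units as positions).
    center_fraction: center-zone side as a fraction of the arena side.
    """
    positions = list(positions)
    if not positions:
        raise ValueError("positions must not be empty")
    margin = arena_side * (1.0 - center_fraction) / 2.0
    lo, hi = margin, arena_side - margin
    in_center = sum(lo <= x <= hi and lo <= y <= hi for x, y in positions)
    return in_center / len(positions)


# Hypothetical samples from a 40-cm arena (center zone: 10 to 30 cm)
samples = [(5, 5), (20, 20), (35, 10), (15, 25), (2, 38)]
fraction = center_time_fraction(samples, arena_side=40)
```

Because samples are equally spaced in time, the fraction of samples approximates the fraction of session time spent in the center; a lower value would be consistent with more time spent on the perimeter.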
Depression-like behavior is commonly evaluated in the forced swim test or Porsolt swim test. In this test, a rodent’s state of despair is reflected by the amount of time it spends immobile. Rodents are forced to swim in conditions in which there is no escape; eventually, they make only the movements necessary to hold their heads above the water. This is often referred to as “learned helplessness,” since they have learned that escape is impossible. Testing generally occurs over 2 days. On the first day, the rodent is placed in a cylinder filled with water and forced to swim for 15 minutes. On the second day, the rodent is placed back into the water for 5 minutes and the time spent immobile is measured. Antidepressants decrease the time spent immobile. The tail suspension test in the mouse is another test often used to evaluate the efficacy of compounds for antidepressant activity. It is similar to the forced swim test except that immobility is induced by suspending the mouse by its tail. The mouse will initially try to escape by moving vigorously and then, after a few minutes, become immobile.
There are a number of general considerations to be aware of when conducting neurobehavioral testing. These include the importance of health checks, the role of stress in behavioral response, and the importance of counterbalancing in testing. Behavioral responses can be significantly impacted by the general condition of the animals. Therefore, it is important to conduct health checks to ensure that the animals tested in behavioral evaluations are not compromised. Stress can also affect behavioral responses. Stress can be produced by routine husbandry, such as changing of cages and the noise associated with this task; by environmental conditions, such as temperature or humidity excursions; by the transport of the animals to the testing rooms; and by the conditions in the testing room, such as testing in a dark room during the light cycle. Therefore, care should be taken to ensure that husbandry activities occur after behavioral testing, that any excursions in environmental conditions are documented appropriately and communicated to the individual interpreting the behavioral data, and that animals are allowed sufficient time to acclimate to the testing room. For example, mice generally exhibit increased activity levels after being moved to clean cages. Behavioral testing immediately after cage changing may therefore yield results that are very different from results in mice whose cages were not changed immediately prior to testing. Care should also be taken to ensure appropriate counterbalancing of the treatment groups across testing days, if testing occurs over a range of days; across time of day, especially if testing occurs over several hours and/or periods of the day; and across equipment location within the room, such as when activity chambers are placed on top, middle, and bottom shelves.
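A simple way to counterbalance treatment groups across testing days and testing slots (time of day or chamber position) is to rotate the group order each day, as in a Latin square. The sketch below is one possible scheme under the assumption that there are as many slots as groups; the function name `rotated_schedule` and the group labels are hypothetical.

```python
def rotated_schedule(groups, n_days):
    """Rotate the order of treatment groups by one position each day.

    Returns a list of daily orders; index within a day is the testing slot
    (eg, time of day or shelf position). Over len(groups) consecutive days,
    every group occupies every slot exactly once.
    """
    if not groups:
        raise ValueError("groups must not be empty")
    schedule = []
    for day in range(n_days):
        shift = day % len(groups)
        schedule.append(groups[shift:] + groups[:shift])
    return schedule


# Hypothetical 4-group study tested over 4 days
groups = ["control", "low", "mid", "high"]
schedule = rotated_schedule(groups, n_days=4)
# schedule[0] == ["control", "low", "mid", "high"]
# schedule[1] == ["low", "mid", "high", "control"]
```

Rotation balances slot position but not the sequence in which groups follow one another; where carryover between consecutively tested animals is a concern, a randomized or fully balanced design may be preferable.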
In order to ensure appropriate neurobehavioral testing, a laboratory should demonstrate proficiency and sensitivity to detect change. For example, positive control studies should be conducted to demonstrate the ability to detect both increases and decreases in measured parameters. A dose response should also be included to demonstrate differences in detection sensitivity. The positive control study should be conducted under the same conditions as the regulatory studies will be conducted including utilizing the same equipment settings and parameters to be evaluated. The testing order and time of testing should mimic the intended regulatory studies as well. The age of testing, gender, and number of animals should also be the same as the intended regulatory studies since all of these factors can influence the response and therefore the ability to detect expected differences in response. Finally, the data from the positive control study should be analyzed in the same way that it will be analyzed in the regulatory studies. It is important to note that proficiency demonstrated in adult animals does not necessarily translate to proficiency in younger animals.
Designing positive control studies that appropriately include all of these factors and mimic the intended developmental toxicity regulatory studies poses additional challenges. Specifically, one needs to consider the best way to test animals of the appropriate age. If each behavioral test is “validated” individually in age-relevant animals, then the handling experiences of the animals may be different, which can impact the behavioral response. For example, in developmental toxicity regulatory studies, such as the PPND study, pups are generally handled daily from birth. However, if animals of a specific age are ordered for testing of individual behavioral measures, the animals will likely not have been handled daily, and therefore, the behavioral response may be different. In addition, one positive control compound may not affect all end points. Therefore, designing the positive control study to mimic the DNT study may allow use of animals with similar handling experiences and provide the testing laboratory additional experience in conducting DNT studies, but it may require more than 1 positive control compound to be tested.
In addition to being good laboratory practice, EPA and Organisation for Economic Co-operation and Development (OECD) guidelines require positive control data from the testing laboratory with the expectation that these data will be submitted with the regulatory study report. 45 Guidance on the selection and use of positive control agents, as well as how to interpret and report positive control data, is provided by Crofton et al. 45
In addition to the initial positive control study conducted to demonstrate proficiency and ability to detect appropriate changes, there may be circumstances when further positive control studies need to be considered. These include significant changes in equipment, software, procedures, and/or testing paradigms. For example, a study may be needed when changing the number and/or placement of photobeams in automated motor activity assessments, as photobeams placed far apart or vertically high off the floor of the chamber may not be activated to the same degree, or at all, by smaller animals. Other situations that may warrant additional studies include changes in the length of the test session or in animal housing conditions (such as group housing versus individual housing), as well as changes in staffing, strain, species, or supplier.
Control data are also extremely important in data interpretation. Concurrent control data should be reviewed to determine the extent of variability in the response which, if high, may indicate potential issues with procedures, handling of the animals, or other factors. The concurrent control data should also be reviewed for expected age- or gender-related changes. Monitoring historical control data is invaluable in detecting “drift” in the control baselines over time. A general drift in a response over months or years may represent “genetic drift” in the study organism, whereas an abrupt drift in response between sessions or over days may represent equipment problems.
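Monitoring for drift can be as simple as comparing each new concurrent control mean against the distribution of historical control means. The sketch below flags a concurrent value that lies more than a chosen number of standard deviations from the historical mean. The 2-SD threshold, the function name `flag_control_drift`, and the example values are illustrative assumptions, not criteria taken from the guidelines.

```python
from statistics import mean, stdev


def flag_control_drift(historical_means, concurrent_mean, k=2.0):
    """Compare a concurrent control mean with historical control means.

    Returns (flagged, z), where z is the distance of the concurrent mean
    from the historical mean in historical SD units, and flagged is True
    when |z| exceeds k (k = 2 is an arbitrary illustrative threshold).
    """
    if len(historical_means) < 2:
        raise ValueError("need at least 2 historical studies")
    center = mean(historical_means)
    spread = stdev(historical_means)
    if spread == 0:
        raise ValueError("historical means show no variability")
    z = (concurrent_mean - center) / spread
    return abs(z) > k, z


# Hypothetical control means for one end point from 5 earlier studies
historical = [100, 104, 98, 102, 96]
flagged_high, z_high = flag_control_drift(historical, 110)
flagged_ok, z_ok = flag_control_drift(historical, 101)
```

A flag of this kind only prompts review; deciding whether a shift reflects gradual change in the animals or an abrupt equipment problem still requires looking at how the values changed over time.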
When interpreting neurobehavioral data and determining biological versus statistical significance, one needs to carefully review the response and consider gender differences, “U-shaped” responses, pharmacokinetics/toxicokinetics, single versus multiple end points, and variability. If only one gender is affected, this does not automatically mean that the effect is not biologically significant, as responses may be different between males and females or one gender may be differentially susceptible. Therefore, it is important to understand normal responses in each gender. If only the middose group is affected, this does not automatically mean that the effect is not real or of importance, since there may be a paradoxical dose response or “U-shaped response.” For example, high doses of amphetamine cause stereotyped behaviors characterized by intense repetitive localized movements, but little forward locomotion, whereas lower doses cause increases in locomotion and less stereotyped behaviors. Data from multiple end points evaluating similar behavioral responses should be interpreted together as they may show a pattern of response. For example, comparisons of effects on multiple FOB end points and across functional domains aid in data interpretation and may indicate a deficit in a specific functional domain. Data interpretation should never occur in isolation. An integrative assessment of all data should occur, including an assessment of systemic toxicity, as effects on body weight may impact behavioral response. Deficits in behavioral tests may simply be an artifact of testing a sick animal unable to perform the procedures of the task adequately. Locomotion and limb movements are required for the performance of most behavioral paradigms; therefore, if motor function is impacted, other behaviors may be secondarily affected.
Developmental studies introduce further complications to data interpretation. For example, one may be tempted to conclude that there are “sensitivity differences” between developing animals and adults when the apparent difference may reflect the equipment and/or testing methods, or the response may simply be different in developing animals. Therefore, understanding the normal behavioral response across development is critical not only in interpreting the data but also in ensuring that the test methods and equipment are appropriate for the specific age, strain, and species.
Finally, correlative changes in behavioral and pathological outcomes should not always be expected. There may be behavioral changes without pathological change. Potential explanations may include effects on nerve cell communication, changes in receptors and/or neurotransmitter release, or biochemical changes; changes that are often not reflected in routine histopathology. Furthermore, the timing of the assessments is also critical. Behavioral changes may occur in advance of measurable structural changes in the system. There may also be structural changes or lesions but no apparent behavioral effects as the nervous system shows functional plasticity, residual capacity, and compensatory mechanisms that may occur in the face of permanent pathological change.
In summary, neurobehavioral assessments are included in a number of regulatory studies with pharmaceuticals and agro/chemicals. These assessments can provide meaningful hazard identification. However, the design of neurobehavioral assessments needs careful consideration. For example, the inclusion of a battery of neurobehavioral tests allows evaluation of a number of functional domains but only if the laboratory is proficient in evaluating and interpreting those specific end points in the context of that specific study.
Developmental Neurotoxicity Testing in Nontraditional Animal Models (Judith Henck)
Global regulatory guidelines governing safety assessment for candidate therapeutic agents indicate that the nonclinical species selected for PPND and/or juvenile toxicity studies should be a relevant model, appropriate for evaluating toxicity end points in the intended human population. 6,8,39 Testing in one appropriate species is generally considered to be sufficient. 6,8,39,40 The species of choice has typically been the rat, for which there is a great deal of historical information from well-validated developmental and behavioral tests. However, choice of species for evaluation of a specific pharmaceutical agent needs to be driven by relevance to humans with regard to pharmacology and/or toxicology, and the rat may not be suitable in all cases. This is particularly true for biopharmaceuticals (biologics) because of limited species specificity or immunogenicity, which often leads to the use of nonhuman primates (NHPs) for developmental toxicity testing. The minipig is gaining in popularity as an alternative species, and other species have also been employed in developmental toxicity testing, including the dog, mouse, rabbit, hamster, guinea pig, and ferret. Experience with these alternative species continues to increase, and for many of these species, historical information is available for general developmental toxicity parameters. However, validated methods for assessment of DNT and historical information are often not available to evaluate all functional domains recommended by regulatory agencies. Lack of validated, rigorous behavioral methodology for alternative species often leads to the use of observational assessment as the primary means for evaluation of behavioral function. Options available for DNT testing of these species follow, including advantages and limitations.
Per regulatory expectations, well-established/validated methods should be used to monitor key CNS functions. Alternative species can be evaluated for desired functional parameters based on the availability of acceptable methods. Selection of the most appropriate species for studies that include DNT testing involves such factors as the potential for translation of results to humans, species-specific ontogeny and functional characteristics of the nervous system, and characteristics of each of the functional domains recommended for DNT testing by regulatory authorities. Excellent reviews have been published on the comparison of structural and functional nervous system development in humans to commonly used laboratory animal species, primarily rats, dogs, and NHPs. 44,46 Sufficient similarities exist to indicate that various laboratory species could be acceptable models for human DNT. Examples for which concordance exists between the response of humans and laboratory animals to developmental neurotoxicants include methylmercury, lead, ethanol, toluene, polychlorinated biphenyls, methamphetamine, organophosphates, phenytoin, valproic acid, and cocaine. 47,48 Laboratory investigation of these developmental neurotoxicants has primarily been conducted in rats, with additional work in mice and NHPs. However, translational information for additional species is not as readily available.
Pharmacology is an important factor to be considered in species selection. Appropriate in vitro binding affinity and/or functional activity at the therapeutic target is a key consideration. If this is not the case for the rat, it may still be an appropriate species for evaluation of off-target toxicity, but an alternative species should be considered for on-target toxicity. Pharmacokinetics must also be considered in species selection to ensure acceptable in vivo exposure/activity. For biologics, immunogenicity-mediated effects on exposure should not be substantial or, if they occur, should be able to be mitigated by increasing the dose or frequency of administration and/or changing the route of administration. The metabolic profile of the selected species should include major human metabolites of the candidate therapeutic agent.
Species-specific ontogeny of nervous system structure and function is an important consideration in species selection. For DNT testing, it is crucial to know the developmental time lines for acquisition of the functional domains to be evaluated. As an example, walking behavior as an indicator of locomotor activity occurs in humans at approximately PND 396 but occurs in rats, dogs, and rhesus monkeys at approximately PND 12-16, 20-28, and 49, respectively. 44 Although this activity tends to be common across mammalian species, there are some sensory, motivational, cognitive, motor, and social phenomena that are unique to particular species. As an example, differences in the cognitive capacity of rodents compared to primates disappear when cognition is assessed using the predominant sensory modality of each: olfaction in rodents and vision in primates. 49 This consideration may necessitate tailoring assessment techniques for the selected species, or in the absence of validated methodology, an awareness of the differences when interpreting test results. An additional consideration is anticipated toxicity, based on drug class or nonclinical and/or clinical data with the specific therapeutic candidate, and the ability of the desired species to demonstrate it. The Ministry of Health, Labour and Welfare guideline on juvenile animal testing for pediatric drugs 40 states that it is preferable to select animal species and strains for which there are adequate existing nonclinical data. Although consistent comparison across studies is important for general toxicity parameters, it may not be particularly informative for DNT parameters, as a rigorous assessment of behavior is typically not conducted prior to the PPND and juvenile toxicity studies. This is an important consideration in data interpretation regarding sensitivity of developmentally exposed animals relative to adults. Often the information required to make such a comparison is not available. 
Performance on behavioral tests may not only differ with respect to species but also among strains of a given species. An example is found in the work of Plappert et al, 50 who compared prepulse facilitation and PPI of the acoustic startle response in 3 inbred mouse strains and 1 hybrid strain and found significant differences in response. This example and others underscore the importance of consistent use of specific strains in DNT testing of alternative species.
Behavioral test methods for rats representing all the recommended functional domains have been utilized over the course of many years in pharmacology and toxicology research and certainly qualify as being well established and validated. This is also the case for specific domains in alternative species, but not all domains have been equally validated within or among species. Table 1 presents a categorization of the availability of well-established/validated tests for each of the recommended functional domains in rats, as well as a number of alternative species. The current existence of tests is categorized as available, equivocal, or not available. The following sections will provide more in-depth information for each species, including descriptions of tests that currently exist and are used in other experimental settings, but require adaptation or additional validation to be acceptable for regulatory submissions.
Table 1. Availability of Well-Established/Validated Methods for Behavior Testing of Traditional and Nontraditional Laboratory Animal Species Based on a 2015 Review of the Scientific Literature and Regulatory Submissions.
a A = Available; well-established/validated methodology exists for this species or could be readily adapted.
b E = Equivocal; limited information on methodology for this species exists in the scientific literature and/or regulatory submissions; minimal experience.
c N = Not Available; methodology does not exist for this species or would require substantial adaptation.
Rats
Sprague Dawley and Wistar are the most commonly used rat strains in DNT testing for nonclinical safety assessment. Substantial historical information has been acquired from PPND and juvenile toxicity studies and, in many cases, serves as a template for testing of alternative species. Behavioral tests commonly used in rats are described in preceding sections.
Dogs
The beagle dog is the primary breed for which data are available from PPND and juvenile toxicity studies and is considered particularly advantageous in circumstances in which the developing skeletal system is considered a potential target. However, if it is used as the sole species for these studies, behavior testing may present challenges. Reflex ontogeny has been well characterized, and testing can potentially begin at birth. A detailed description of canine stages of development from birth through adolescence has been provided by Robinson and coworkers, 51 and these can be evaluated observationally. A FOB containing behavioral elements comparable to a comprehensive standard veterinary clinical examination has been described by Gad and Gad. 52 There are currently no validated tests of learning and memory for dogs, although some tests are being evaluated such as olfactory habituation, conditioning and avoidance of cold air, and visual discrimination. 51
Mice
As mentioned previously, the Irwin battery, a series of observational tests designed to evaluate CNS side effects of drugs, was developed in mice over 50 years ago and is still in use by the pharmaceutical industry, primarily in safety pharmacology assessment. 53 This test battery was a precursor of the FOB, which shares many similarities and is often used to evaluate DNT in mice. The primary mouse strains used for PPND and juvenile toxicity testing to date are CD-1 and B6C3F1. In recent years, the use of genetically altered mouse strains has increased in nonclinical safety assessment, particularly for biologics; these include surrogates and knock-out or knock-in models. It should be noted, however, that use of these strains may involve technical limitations as well as limited historical data and regulatory experience. As with the more commonly used rat, well-characterized and validated behavior tests are available in mice to evaluate all of the functional domains recommended by regulatory agencies.
Nonhuman Primates
The NHP species most commonly evaluated for DNT is the cynomolgus macaque. Reflex ontogeny, sensorimotor function, and locomotor activity have primarily been assessed using an FOB. Golub et al 54,55 described a comprehensive neurobehavioral test battery for infant monkeys that was employed as early as PND 1. The periodic assessment of mother–infant interaction has also been employed as a measure of development of social behavior. 56,57 The availability of well-established/validated methods for testing reactivity/arousal in NHPs is considered equivocal. Although this domain can be assessed using the infant test battery and FOB, validated, rigorous tests are not currently available. Learning and memory tests are also considered equivocal because they are not well validated in the context of safety assessment. Per the ICH S6(R1) guidance on safety evaluation of biotechnology products, 58 neurobehavioral assessment can be limited to clinical behavioral observations; the guidance indicates that because instrumental learning requires a training period, which would extend the postnatal duration of a PPND study to at least 9 months, it is not recommended. However, evaluation of learning and memory may be required for small molecules and for large molecules to address specific concerns. Therefore, commonly used learning and memory tests developed in pharmacology and environmental and industrial toxicology settings are being explored for evaluation of DNT in the pharmaceutical development setting. A test that appears to be gaining favor is the 2-object discrimination and reversal task using a Wisconsin General Testing Apparatus (ODR-WGTA). 59,60-62 From a timeline perspective, NHPs in general can be successfully trained on this task at ≥6 months of age, and approximately 6 weeks are required to achieve successful performance. 57 Initial performance of this task provides an assessment of learning; however, evaluation of memory will require an additional extended time. 62 Cappon and coworkers 59 have raised concerns regarding the impact of small sample size on the ability to detect meaningful treatment-related differences in the ODR-WGTA test.
Minipigs
The primary breed of minipig used in biomedical research is the Göttingen. Pigs are in general considered a good species for behavioral research in that brain development, morphology, and vascular anatomy are similar to those of humans. 63 However, certain functions mature more rapidly than those of humans. Piglets are capable of standing shortly after birth, and this advanced state of maturity relative to humans needs to be considered in interpretation of tests in neonates that require neuromuscular function. At present, most functional domains are equivocal with regard to well-established/validated methods, and more data are needed. Functional observational batteries have been developed that provide information on most domains, and an open field assessment has been used to provide more detailed information on locomotor activity and other related behaviors. 64 The majority of open field parameters can be assessed observationally; however, ambulation is also amenable to video tracking. 65 Laferriére and coworkers 66 reported testing of minipigs in an open field as early as 2 days of age; however, this would require the testing laboratory to order sows with piglets, since they should not be transported prior to weaning. Pigs are considered to be social animals, and consideration should be given to open field assessment of pairs or small groups; an additional advantage to group testing is evaluation of social behavior. 66 Sensorimotor function (acoustic startle response) has been evaluated in pigs using PPI of the eyeblink reflex. 67 An excellent review of learning and memory assessment in pigs has been provided by Gieling et al 68 and includes descriptions of spatial tasks (mazes), object recognition tests, and classical and operant conditioning tasks. Of these, free-choice spatial tests, as described by Gieling et al, 69 appear to be especially promising.
Rabbits
The most commonly used strain of rabbit in safety assessment is the New Zealand white. Reflex ontogeny and ontogeny of locomotor activity 70 appear to be relatively well characterized in rabbits. Testing should not begin prior to PND 5, as disturbing the litter prior to this time has been associated with excess mortality of the kits. 71 Open field testing, generally initiated at the time of sexual maturation, can evaluate total locomotor activity, as well as latency to enter the arena, standing on hind legs, defecation, scent marking, grooming, and attempts to jump out of the arena. 72 Sensorimotor function and reactivity can be included in an FOB but are not well documented. The primary test of learning and memory used for rabbits in biomedical research is eyeblink conditioning (EBC), which pairs an auditory or visual conditioned stimulus (CS) with an unconditioned stimulus (typically a puff of air) to elicit a conditioned response (eye blink or extension of nictitating membrane). Although there is a fairly large body of knowledge regarding use of this test in multiple species in pharmacological research and the reflex pathway is well characterized, it is not yet considered sufficiently validated for safety assessment. Sparks and Schreurs 73 have developed a complex paradigm that includes an assessment of hearing ability by varying intensity of the auditory CS.
Hamsters
Reflex ontogeny has been well established for the hamster and is similar to that of rats and mice. 74 The most commonly used strain is the golden Syrian. The acoustic startle response, including habituation, is also well characterized, 75 although research to determine whether auditory prepulses are inhibitory or facilitatory in hamsters is ongoing. Locomotor activity, as well as a number of other parameters, has been assessed for hamsters using an open field. This can also be a component of an FOB, including evaluation of stages of locomotor ontogeny. Tests for reactivity and learning and memory are considered equivocal for the hamster. Reactivity may be an FOB component, but little information has been reported, while learning and memory tests have been reported for disease state models but not for safety assessment.
Guinea Pigs
The guinea pig is a precocial species and offspring are quite mature at birth. 76 Because of this advanced development, traditional reflex ontogeny cannot be assessed. Sensorimotor function has been evaluated via automated acoustic startle systems and may include habituation and/or PPI. 77 Acoustic startle may also be assessed as part of an FOB, which can be used to evaluate locomotor activity and reactivity as well. Open field has been used to measure locomotor activity. At present, there are no validated tests of learning and memory for guinea pigs, although work is ongoing to apply tests such as passive avoidance and EBC.
Ferrets
The European ferret is the strain most commonly used in biomedical research, primarily to study hearing deficits (a certain proportion of ferrets are deaf due to a genetic defect), visual impairment, and antiemetic properties of pharmaceutical agents; ferrets are also used as disease models for influenza drugs and vaccines. Testing for ontogeny of reflexes and locomotor activity has been reported 78 and appears to be relatively straightforward. Although ferrets have been used for auditory and visual research, evaluation of sensorimotor function has not been reported in a safety assessment setting and is therefore considered equivocal in terms of available tests. This is also the case for evaluation of locomotor activity once ontogeny has been established. Test methods have not been reported for reactivity or learning and memory in evaluation of DNT.
For the future state of DNT testing, alternative species as well as alternative methods will be important considerations. Screening for DNT may include in vitro assays as well as tests in nonmammalian species. The differentiation of mouse embryonic stem cells (mESCs) into various neuronal cell types has been established as an in vitro screen for DNT, known as the DNT-EST. 79 Neural differentiation of mESCs has additionally been used to investigate the role of microRNA (miRNA) expression in DNT. 80
Studies with the nematode
In contrast, DNT data from alternative mammalian species are more likely to be accepted by regulators, particularly as new information is provided to enrich historical databases. A review by Baldrick 86 of pediatric investigation plans (PIPs) submitted from 2008 to 2013 revealed that while the majority of juvenile toxicity studies were conducted in rats, approximately 10% were conducted in dogs, 3% in NHPs, and 2% in pigs, with 1 study conducted in rabbits. It is unknown how many of the PIPs for which alternative species were proposed included assessment of DNT. However, it is likely that this number will increase in the future, and it is therefore important to know what will constitute regulatory acceptance of behavioral tests that may be somewhat different in nature from those commonly used in rats and whether testing of all currently recommended functional domains will be expected. Investigators involved in the conduct of PPND and juvenile toxicity studies are encouraged to publish DNT test validation efforts, historical data, and results from evaluation of candidate therapeutic agents in alternative species to better inform safety assessment efforts with the most appropriate animal model.
Summary
Neurotoxicity testing is an important component of postnatal evaluation in the nonclinical safety assessment setting. Current behavioral tests and their evaluation rely on an extensive body of scientific literature that integrates the principles and methods of neurotoxicology with those of developmental toxicology, neuroanatomy, neurobiology, and experimental psychology that have been developed and refined over the course of many decades. These tests have been applied to both adult and postnatal nonclinical studies as well as studies to evaluate environmental and industrial chemicals, creating a wide range of experience to contribute to decision-making on optimal tests and procedures as well as data interpretation. The primary emphasis of the symposium on which this paper is based was nonclinical postnatal behavioral testing as a component of PPND and juvenile toxicity studies. Regulatory guidance has defined specific functional domains to be evaluated: sensory and motor function, arousal and reactivity, cognitive function (learning and memory, attention), and social behavior. Specific behavior tests have not been recommended to evaluate these domains, but regulatory expectation is that the tests be relevant to humans, well established, and validated. Extensive experience in the field of DNT testing has provided sufficient information to allow for recommendations of tests that satisfy these criteria, resulting in increased consistency in this aspect of regulatory submissions.
Evaluation of the history and current state of DNT testing allows for development of a potential roadmap for future advancement. Continued mining of the scientific literature as well as regulatory case studies will help to optimize methodology and interpretation. To this end, regulators and nonclinical safety assessment experts are encouraged to publish findings on specific therapeutic candidates as well as historical information to add to the existing knowledge base. Additional important considerations include refinement of validation criteria for behavioral tests to ensure that they adequately evaluate recommended functional domains, enhanced understanding of translation of behavioral results in laboratory animals to humans, and reevaluation of existing methodology, such as functional observational batteries and water mazes, to ensure that implementation is consistent and that the tests are providing the level of rigor required to adequately evaluate effects on the developing nervous system. Moving forward, it will be important for regulators and nonclinical safety assessment experts to embrace new ideas on alternative methods and species used to evaluate DNT, while remembering the rich history that has contributed to this science to date.
Footnotes
Authors’ Note
This article reflects the views of the authors and should not be construed to represent FDA’s views or policies.
Acknowledgments
Dr. Arippa Ravindran reviewed the Cymbalta juvenile animal study, and the authors are thankful for his contribution.
Author Contributions
J. Henck contributed to conception and design and contributed to acquisition, analysis, and interpretation. I. Elayan contributed to conception and design and contributed to acquisition, analysis, and interpretation. J. E. Fisher contributed to design and contributed to acquisition, analysis, and interpretation. C. Vorhees contributed to conception and design and contributed to acquisition, analysis, and interpretation. L. Morford contributed to conception and design and contributed to acquisition, analysis, and interpretation. All authors drafted the manuscript, critically revised the manuscript, gave final approval, and agree to be accountable for all aspects of work ensuring integrity and accuracy.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
