Abstract
Evidence-based severity assessment in laboratory animals is, apart from the ethical responsibility, imperative to generate reproducible, standardized and valid data. However, the path towards a valid study design determining the degree of pain, distress and suffering experienced by the animal is lined with pitfalls and obstacles as we will elucidate in this review. Furthermore, we will ponder on the genesis of a holistic concept relying on multifactorial composite scales. These have to combine robust and reliable parameters to measure the multidimensional aspects that define the severity of animal experiments, generating a basis for the substantiation of the refinement principle.
In an effort to express and confer meaning to the ethical concerns associated with the potential pain, suffering and distress animals might experience during scientific procedures, the replacement, reduction, refinement (3R) concept was established in 1959. 1 This concept is firmly anchored ethically and legally in animal-based research with numerous valuable refinement strategies ranging from, for example, the appropriate administration of analgesia to the pre-definition and application of humane endpoints.2–4 The key element of refinement is, however, the prompt recognition and exact assessment of the overall status of the animal. 5 In continuation, a prospective and, to some extent, retrospective severity assessment of all experimental procedures undertaken on laboratory animals has been mandatory in the European Union since 2010. 6 According to Articles 15, 38, 39 and 54, experimental procedures have to be classified and allocated to severity categories as further specified in Annex VIII. However, due to the multitude of animal and disease models and the low availability of evidence-based, validated methods, this classification is not a trivial task and compounds the problems created by an obligatory ethical and legal framework in the context of great scientific uncertainty. Considering the multitude of facets and dimensions that techniques utilized to determine pain and distress have to cover, questions arise and pitfalls emerge. An issue of concern is the implementation of severity assessment procedures in the experimental study design without risking the validity of the study. There are sufficient examples of assessment parameters influencing the outcome of the respective experiment, thereby potentially introducing bias and eventually contributing to the reproducibility crisis, which is in and of itself ethically problematic.7,8 Furthermore, scientists struggle with the design and structure of severity assessment strategies. 
Is there an ideal study design to determine the degree of pain and distress? What aspects of the animal's burden have to be recognized? Which techniques are appropriate in which model and what techniques will influence each other, risking the generation of additive effects? Unfortunately, few non-invasive methods that will not influence the outcome of study parameters are available. Of course, minimal interference is imperative in routine severity assessment; however, it is less necessary when establishing new techniques and models or in the context of the validation of a putative refinement measure. Furthermore, it is questionable whether the utilization of single parameters is sufficient in this context or whether a multifactorial approach alone will adequately reflect the burden of the animal. Here, however, bias may even be a greater risk factor, for example due to the effect of chronological order or the reciprocal influences of assessment procedures.
In this review we will address these questions by briefly outlining the terms that determine the severity of animal experiments, while focusing on laboratory mice. Using selected examples, we will outline benefits and drawbacks of severity assessment techniques and highlight how techniques might interfere with the purpose of an animal experiment. Furthermore, we will consider the structure of a valid study design to determine pain and distress in animals. Finally, by contrasting a single-parameter with a multifactorial approach, we will comment on the necessity of a holistic concept to measure all dimensions of the animal's burden.
A beast of burden: Multidimensional severity assessment
According to Annex VIII of the European Union directive 2010/63, the ‘severity of a procedure shall be determined by the degree of pain, suffering, distress or lasting harm expected to be experienced by an individual animal during the course of the procedure’.6
In addition, cumulative suffering and prevention from expressing natural behaviour shall be taken into account. Therefore, the statutory provisions demanding a graded severity assessment encompass the terms ‘pain’, ‘suffering’, ‘distress’ and ‘lasting harm’. Here, definitions have evolved, culminating in a network of connecting and coalescing aspects that determine the burden of the animal (as outlined in Figure 1). We will, however, elucidate these definitions only briefly here, as extensive reviews are available on this topic.9–11 Pain has been defined as ‘an unpleasant sensory and emotional experience associated with potential or actual tissue damage’ by the International Association for the Study of Pain with a corresponding definition for animals.12,13 Recently, this definition has been extended to include sensory, emotional, cognitive and social components.14
Because pain cannot be assessed directly in animals, but is rather deduced from pain-associated behaviours, indirect methods that attempt to quantify pain-associated behaviour or nociception have been generated.5,9,15,16 It is, however, generally accepted that species-specific multifactorial composite pain scales should be used rather than single behavioural tests.17
Regarding the concepts of ‘suffering’, ‘distress’ and ‘lasting harm’, precise definitions as well as objective measures are somewhat lacking. Moberg and colleagues have, however, provided working definitions and a conceptual framework on the relationship of stress and distress in animals.9,18 Here, consensus has been reached that distress arises when the biological costs of a preceding stress, defined as the adaptive response to a threat to the animal's homeostasis, negatively affect biological functions essential to the well-being of the animal.9,18,19 In line with this interpretation, alterations in biologic function may serve as indicators of distress. Pursuing this approach, suffering may arise when the biological adaptation processes fail to return the organism to homeostasis and the negative state affecting biological functions persists, leaving the animal in a state where it cannot cope with or tolerate the inflicted level of distress or pain, respectively, any longer.11
In a broader context, the term ‘suffering’ is often associated with other negative or adverse states such as pain, anxiety or fear, linking these concepts together without sufficiently demarcating them. This may also apply to the term ‘lasting harm’, as this may refer to and encompass continuing physical damage but may also evolve from persistent adverse states including fear, depression or anxiety. Contemplating the concept of anxiety independently, the measurement of this complex emotion becomes challenging due to the variety of anxiety facets, the diversity in underlying hypotheses, and the diverse behavioural tests that are directed at the involved psychophysiological processes but have also been shown to lack validity.20–24 In addition, the concept that the suffering inflicted on the animal by scientific procedures can only be assessed in its entirety when aspects of the affective internal/emotional state are incorporated into the assessment has evolved recently. Cognitive bias or judgement bias tests may reflect the internal affective or emotional state of animals, an attempt at including the animal's perspective in severity assessment strategies.25–29 Affective experiences including emotions are, however, subjective states as well and cannot be measured directly, but have to be interpreted from physiological and behavioural indices, rendering the measurement of the affective state a major and complex challenge.30
Figure 1. Multidimensional severity assessment. Connecting and coalescing terms determine the burden of the animal and have to be gauged relying on multiple parameters.
As we have seen, providing definitions for the process of severity assessment is not straightforward as an accurate distinction of terms is challenging and terms are often used as equivalents for each other as well as in fluent conjunction. In addition, recognizing pain and distress in non-communicating beings relies heavily on the abilities and experience of the responsible observer as demonstrated recently.31 Furthermore, it is a balancing act to deduce the necessary information from indices inferred by observation without relying on assumptions regarding the animal's feelings.32
In summary, it has to be considered that
Techniques to assess severity: Balancing burden, bias and benefits
Burden, bias, benefits and critical control points of individual severity assessment techniques.
Of course, major tools of severity assessment in laboratory animals are clinical score sheets, which have been developed and refined for multiple animal and disease models.2,36,37 Benefits include the successful and adequate application in experimental set ups, easy implementation and, depending on the score sheet design, only a limited necessity to handle the animal. 38 There are, however, increasing concerns regarding sensitivity, validation, standardization and observer dependency.34,39–41 A major challenge in this context remains ascertaining whether clinical scoring sufficiently reflects mild to moderate states of pain and distress. If clinical scoring is applied, emphasis should be put on explicitly trained personnel and the utilization of validated, standardized and specifically tailored score sheets as has been extensively discussed and reviewed.31,33,41,42 Particular emphasis is often placed on the monitoring of the body weight of animals during experimental procedures, on the assumption that a negative energy balance will reflect the severity the animal experiences.2,43 Indeed, shifts in energy utilization or energetic inefficiency may be the result of many stressors and although there are multiple examples on the sufficient application as a severity assessment parameter in experimental studies, insufficient sensitivity, validation and correlation with other parameters have been criticized.44–46 The often-applied strict division into predefined categories has to our knowledge never been validated and should therefore be critically evaluated before being used in severity assessment strategies as it may not necessarily reflect model-specific dynamics or the multiple physiological processes involved in the change of body weight. 
Furthermore, these pre-set divisions vary among guidelines as well as available clinical score sheets, which assign different percentages of body weight loss to a varying number of corresponding severity categories or score points.2,11,40,42 In a critical appraisal of the suitability of body weight monitoring for severity assessment in different models, a minority of mice reached the predefined humane endpoint for body weight loss, but did not show any clinical abnormalities, underlining the strong need for validation (Talbot et al., this issue).47 Physiological parameters may also be collected by invasive approaches, for example by telemetric recording after transmitter implantation. A major benefit is observer independence, which is contrasted by a high degree of invasiveness and by the monitoring of non-specific parameters such as heart rate, core body temperature or locomotor activity, which may be influenced by many variables and have to be interpreted in the overall context, underlining again the need for standardized conditions. 48 However, telemetric recording may prove useful in detecting signs of pain and distress that otherwise would not be noticed by direct observation as, for example, described when assessing post-laparotomy pain in laboratory mice. 49
Furthermore, a multitude of behavioural severity assessment parameters is available. Grimace scales have been established in several species to assess spontaneously emitted pain behaviour and validity studies have been performed, for example, to ensure that handling techniques do not confound the assessment.50–52 Benefits, including a high accuracy and inter-rater reliability with potential for standardization and automatization minimizing observer-dependency, as well as limitations, have been extensively reviewed.51,53,54 Furthermore, other emotional states have been assessed recently by the analysis of facial expressions.54–56 In addition, other home cage-based behaviours such as burrowing and nest building can serve as indicators of wellbeing, more specifically as parameters in models of psychiatric disorders and to assess pain, distress and disease progression.57,58 However, environmental factors may have an impact as it has been shown that, for example, nest behaviour was dependent on cage size, an easily overlooked husbandry detail. 59 Therefore, again, standardization is crucial as assessment protocols may vary between laboratories, resulting in the lack or changed manifestation of these behaviours (Schwabe et al., this issue; Jirkof et al., this issue).60,61
Another motivational- or emotional-driven behaviour, namely voluntary wheel running, has been utilized to assess the severity of experimental procedures.62 Benefits include an observer-independent, automatized assessment of severity that has been proposed to serve as an indicator of disturbed wellbeing (Mallien et al., this issue).63 However, several potential effects on experimental set ups have been described, including increased neurogenesis, anatomical and physical changes and the prevention of learned helplessness/behavioural depression.64–66 Meanwhile, wheel running was shown to be modulated by social interactions and when wheel running was subjected to refinement changes by group housing in a study of this special issue, running behaviour of mice changed as well (Weegh et al., this issue).67,68 There is also a wide panel of experimental environment behaviours that may be used to assess anxiety- and depression-like behaviour as extensively reviewed elsewhere. 22 It has, however, been criticized recently that the validity regarding, for example, the plus-maze or the open-field test is insufficient.20,21
With regard to biochemical parameters, measurement of glucocorticoids has been extensively applied for severity assessment with a focus on the analysis of stress. However, it needs to be kept in mind that hormone secretion is subject to many variables and measurement will not distinguish between distress and eustress; therefore, the combination with behavioural or physiological parameters is recommended.69,70 Particular attention has to be paid to the sampling method and procedure as sampling itself can be stressful, thereby affecting study outcome. 62 However, benefits of corticosterone measurement are observer independence and, if for example faecal corticosterone metabolites are measured, non-invasiveness. 71
Finally, non-invasive imaging techniques are increasingly available for utilization in severity assessment strategies, allowing direct visualization and monitoring of disease progression in real time, 72 although a major drawback is the necessity for anaesthesia with a potential confounding effect on study outcome and animal welfare. 73 Therefore, and due to the high equipment requirements, imaging techniques have been used mostly in combinations to set up and enhance composite scales for severity assessment. However, imaging biomarker candidates for behavioural impairment in rodent epilepsy models have been described recently, expanding the field of available imaging severity assessment techniques, which we will not address here separately. 74
It all comes down to study design
As extensively discussed above, although severity assessment techniques are being developed or re-evaluated, interferences with study outcome are abundant. Furthermore, it still has to be determined whether and how techniques will influence each other, whether even the chronological order of assessment will introduce bias in severity assessment as well as study outcome and how these puzzle pieces can be put together to create a valid study design. Here, the adherence to the basic principles of good scientific practice as laid out in respective guidelines is essential. However, although it is generally accepted that flaws in the experimental design of animal experiments will lead to bias and ultimately to a lack of translatability, it took the wake-up call of the reproducibility crisis to hammer out universal rules not only on the reporting, but also on the design of animal studies.7,8,75,76 Sufficient literature is therefore available on this topic with emphasis on the sources and control of variability, on practical aspects confronting scientists, on the statistical fundamentals and on the recently published PREPARE guidelines providing recommendations on the preparation of animal studies in the form of a checklist.77–81 Furthermore, the utilization of a web-based experimental design assistant may prove helpful to detect confounding variables in laboratory animal studies. 82 However, although the respective knowledge is freely available, it has only recently been demonstrated that there has been little improvement regarding reporting standards and that, for example, the anaesthetic and analgesic regimens used in animal research proposals needed optimization, implying there is still room for improvement in the quality of laboratory animal studies.83,84
Apart from adhering to the appropriate guidelines, an additional criterion, namely minimal interference with the experimental study, is essential. There are numerous examples that the welfare status of the animal is a major contributor in gaining valid experimental data, but even differential housing conditions or handling techniques were found to influence study outcomes and impact on the comprehension of biological processes.61,85–88 Interestingly, however, in the first systematic approach at determining the influence of environmental enrichment the mean values of several physiological parameters were affected, but no consistent effect on variability was detected, demonstrating that housing conditions can be improved without compromising data. 89 Besides, scientists need to consider that the effects of pain, stress or distress may compromise experimental results and data validity.90–92 Here, severity assessment itself may be confounded, for example as stress due to early separation from the dam has been shown to influence nociception in rats and housing conditions influenced anxiety-related behaviour in mice.93,94 This holds true even more when considering the recent discoveries regarding empathy for pain or distress in rodents and its potential impact on severity assessment.95,96
Therefore, to reduce interferences to a minimum, there is a strong need for concise strategies and for the establishment of, as well as adherence to, standard operating procedures (SOPs). Precisely defined process steps, ideally made digitally available, will then allow for standardization on every level of the study design, leading to a high internal validity as impacts on the experimental process are minimized. It has, however, been critically noted that for behavioural testing emphasis should be put on environmental heterogenization by relying on multi-laboratory studies as well as on systematic and controlled within-laboratory heterogenization. 97 Here, it may, however, be challenging to improve the precision of estimates, especially with regard to inter-laboratory collaborations. To achieve reproducibility, monitoring of 95% confidence intervals between laboratory outcomes may be helpful in monitoring differences in point estimates and distribution skewness (e.g. by the application of bootstrapping methods). The necessity of implementing and harmonizing SOPs in study designs for severity assessment is, however, emphasized by findings described in this special issue. Here, a non-defined detail of the SOP for the assessment of burrowing behaviour in mice led to discrepancies in burrowing performance (Jirkof et al., this issue).61 In another study, scoring of nesting behaviour as predefined in the respective SOP had to be optimized to enhance inter-rater reliability (Schwabe et al., this issue).60 Furthermore, effects of predictability and adaptation to procedures as well as effects of the chronology of procedures have to be taken into account. 
Repeated predictable stress will cause resilience against colitis-induced behavioural changes in mice and behavioural changes associated with learned helplessness do not occur if the stressor is controllable.98,99 It may also be assumed that the chronology of assessment techniques may result in the generation of additive or reciprocal effects.
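The bootstrap-based monitoring of 95% confidence intervals between laboratory outcomes mentioned earlier in this section can be illustrated with a minimal sketch. It assumes a simple percentile bootstrap of the mean; the function name `bootstrap_ci`, the laboratory labels and all burrowing values are hypothetical and serve for illustration only.

```python
import random
import statistics


def bootstrap_ci(values, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so that the sketch is reproducible
    n = len(values)
    # Resample with replacement and collect the mean of every resample
    means = sorted(
        statistics.mean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower = means[int(n_resamples * alpha / 2)]
    upper = means[int(n_resamples * (1 - alpha / 2)) - 1]
    return lower, upper


# Hypothetical burrowing performance (g of substrate removed) in two laboratories
lab_outcomes = {
    "lab_A": [152, 148, 160, 139, 171, 155, 149, 163],
    "lab_B": [121, 98, 140, 133, 89, 127, 111, 135],
}

for lab, values in lab_outcomes.items():
    low, high = bootstrap_ci(values)
    print(f"{lab}: mean = {statistics.mean(values):.1f} g, "
          f"95% CI = [{low:.1f}, {high:.1f}] g")
```

Under these assumptions, intervals that do not overlap between laboratories, or an interval that lies clearly asymmetrically around its point estimate, would point to a between-laboratory difference or a skewed distribution that deserves attention before data are pooled.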
In summary, the implementation of severity assessment into an experimental study design seems conceivable only in close connection with the establishment of strictly standardized strategies and superordinate systems relying on objective and constantly applicable measures that leave enough room to be tailored to the multitude of animal models and procedures. 33 Furthermore, these comprehensive systems will rely on the standardized recording and analysis of physiological, behavioural and biochemical as well as imaging techniques.33,34
Do we have to have them all?
Early on, it was questioned why there is no simple way to measure animal welfare; single parameters may not accurately depict the burden of the animal and the application of multiple parameters might yield conflicting results. 100 However, the lessons learned from efforts to optimize pain assessment in animals indicate the usefulness of species-specific multidimensional composite pain scales, which became a standard procedure for the assessment of the sensory-discriminatory and the affective-emotional dimensions of pain in veterinary clinics. 17 Along this line, it was recently suggested that a combination of approaches analysing behaviour and appearance of animals should be used for laboratory rodents, although a reference assessment scheme is not yet available. 101 In addition, the necessity to select the appropriate indicators as well as the appropriate number of indicators has been postulated (for a detailed guide, see the report of the BVAAWF/FRAME/RSPCA/UFAW Joint Working Group on Refinement).34,102 In this context it may also be conceivable to set up combinations of indices in accordance with the Five Domains Model, resulting in a conceptual framework by evaluating nutrition, environment, health, behaviour and mental state of an animal. 103
Recently, a panel of parameters was successfully utilized to determine the severity of single and repeated isoflurane anaesthesia procedures.104 In another study a combination of endocrinological, physical and behavioural parameters was used yielding no signs of compromised welfare after either single or repeated open-field testing.105 However, choosing from a list of available parameters leaves room for subjectivity. In this context, validity studies may prove beneficial as demonstrated recently in a study concerned with grimace scales utilizing a statistical approach to identify a classifier to estimate the pain status in animals.49 Furthermore, when analysing multiple parameters, the application of comprehensive statistical and bioinformatic procedures can provide valuable additional information. For instance, principal component analysis (PCA) of complex datasets can visualize distance and relatedness between animal groups taking multiple variables into account.106 Thus, regarding evidence-based severity assessment, PCA may aid in finding the most appropriate parameters as demonstrated recently in a study assessing severity in a rat model of repeated seizures.106
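How PCA condenses several severity parameters into a few components can be sketched as follows. This is a minimal illustration using NumPy, not the procedure of the cited study: the function name `pca`, the choice of parameters and all values are hypothetical, and each parameter is z-scored first on the assumption that differing units should not dominate the result.

```python
import numpy as np


def pca(data, n_components=2):
    """PCA of an (animals x parameters) array via SVD of the z-scored data."""
    x = np.asarray(data, dtype=float)
    # Standardize every parameter so that differing units do not dominate
    z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
    # Rows of vt are the principal axes, s holds the singular values
    u, s, vt = np.linalg.svd(z, full_matrices=False)
    explained = s**2 / np.sum(s**2)                   # variance share per component
    scores = u[:, :n_components] * s[:n_components]   # coordinates of each animal
    loadings = vt[:n_components].T                    # weight of each parameter
    return scores, loadings, explained[:n_components]


# Hypothetical data: rows are animals, columns are body weight change (%),
# burrowing (g), wheel running (km/night), faecal corticosterone metabolites (ng/g)
data = [
    [0.5, 155, 6.1, 40],
    [1.2, 160, 5.8, 35],
    [-0.3, 149, 6.4, 42],
    [0.8, 152, 5.9, 38],
    [-6.5, 95, 2.1, 88],
    [-8.1, 80, 1.7, 95],
    [-5.2, 110, 2.9, 76],
    [-7.4, 90, 2.4, 91],
]
groups = ["control"] * 4 + ["model"] * 4

scores, loadings, explained = pca(data)
for group, (pc1, pc2) in zip(groups, scores):
    print(f"{group:8s} PC1 = {pc1:6.2f}  PC2 = {pc2:6.2f}")
print("explained variance ratio:", np.round(explained, 2))
```

In such a sketch, the distance between groups along the first component visualizes their relatedness, while the loadings indicate which parameters contribute most to that separation and might therefore be candidates for further validation.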
For this, however, certain criteria such as linear correlation of data have to be met. In addition, the implicit assumption of distribution makes it hard to find independent features in non-Gaussian data, and in animal sciences these appear rather abundantly, namely in score values. Therefore, PCA cannot be regarded as a generalized model for finding optimal parameters. For this, heuristic methods based on mutual information and entropy as well as manifold techniques for non-linear cases might be better suited. Another approach may focus on the identification of clusters. The distinction of clusters in underlying large datasets as obtained by measuring multiple severity assessment parameters is achievable by cluster analyses, utilized in a recent study relying on the automated assessment of voluntary wheel running, an emotional- or motivational-driven behaviour. Here, unsupervised
However, to put these notions into practice, a step-by-step approach may be necessary that will first have to define which parameters are sensitive enough to measure distress with low inter-rater variability. Then, cut-offs indicating a specific distress level will have to be defined relying on statistical and bioinformatic procedures. These will have to be validated in independent datasets as well as in different laboratories. The general applicability will then have to be assessed by applying these validated parameters and cut-offs in other animal models. In a next step, suitable parameters will have to be combined into a composite scale. Finally, the validity and broad applicability of this scale will have to be confirmed in independent datasets and distinct animal models as well.
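The cut-off and validation steps outlined above can be illustrated with a minimal sketch. It assumes that Youden's index (sensitivity + specificity - 1) is used to derive the cut-off, which is only one of several possible statistical procedures and is not prescribed by the studies discussed here. The function names, the score values and the group labels are hypothetical.

```python
def sensitivity_specificity(scores, distressed, cutoff):
    """Classify score >= cutoff as distressed; return (sensitivity, specificity)."""
    true_pos = sum(s >= cutoff for s, d in zip(scores, distressed) if d)
    true_neg = sum(s < cutoff for s, d in zip(scores, distressed) if not d)
    n_pos = sum(distressed)
    n_neg = len(distressed) - n_pos
    return true_pos / n_pos, true_neg / n_neg


def youden_cutoff(scores, distressed):
    """Return (cut-off, J) maximizing Youden's J over all observed score values."""
    best_cutoff, best_j = None, -1.0
    for cutoff in sorted(set(scores)):
        sens, spec = sensitivity_specificity(scores, distressed, cutoff)
        j = sens + spec - 1
        if j > best_j:  # the lowest cut-off wins if several share the maximum
            best_cutoff, best_j = cutoff, j
    return best_cutoff, best_j


# Hypothetical scores of one parameter (higher = more affected) in a training set
train_scores = [0.10, 0.20, 0.15, 0.30, 0.25, 0.50,   # control animals
                0.35, 0.60, 0.70, 0.55, 0.80, 0.45]   # animals in the disease model
train_labels = [False] * 6 + [True] * 6

cutoff, j = youden_cutoff(train_scores, train_labels)

# Hypothetical independent dataset used to validate the cut-off
valid_scores = [0.20, 0.30, 0.40, 0.10, 0.50, 0.65, 0.30, 0.75]
valid_labels = [False] * 4 + [True] * 4
sens, spec = sensitivity_specificity(valid_scores, valid_labels, cutoff)

print(f"cut-off = {cutoff}, Youden's J = {j:.2f}")
print(f"validation: sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

A drop in sensitivity or specificity between the training and the validation dataset, as in this constructed example, would indicate that the cut-off needs re-evaluation before it is transferred to other laboratories or animal models.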
An ethical and legal challenge
In essence, this approach would go a long way towards ameliorating current methods for satisfying ethical and legal requirements in severity assessment. It has become clear that, despite its long history in animal welfare regulation, the area of severity assessment in laboratory animals is running the risk of succumbing to what can only be described as an ‘ought-implies-can’ crisis: the lack of a common language and shortage of agreed definitions create a baseline of ambiguity about what severity means and how it may be measured. At the same time, the ethical and legal framework requires that certainty be established before the justifiability of an experiment is proven. This creates the very unusual constellation that a socially desirable activity that by necessity takes place in a space of epistemic uncertainty (research) must only be undertaken when ethically justifiable, which ought to be proven to a standard that the epistemic uncertainty itself prevents: severity
Conclusion
Unravelling the actual severity of experimental procedures has become mandatory for ethical reasons as well as to ensure the generation of reproducible, standardized and valid data. Considering the variety of influencing factors that in addition to pain can contribute to an experimental animal's burden, it is highly probable that the analysis of the overall or cumulative severity requires multidimensional composite scales with a combination of robust parameters.
However, the path to establishing such an ideal study design is strewn with obstacles due to the multitude of severity aspects, the potential for interference with the experimental study, the lack of standardization and the need for a statistically based, multifactorial and multi-centred parameter approach. Therefore, an evidence-based severity assessment needs to account for the potential introduction of bias by implementing highly standardized SOPs, strategies and superordinate systems. Furthermore, constantly applicable, robust parameters should aim at the immediate identification of the actual, real-time severity experienced by the animal. This requires the development of further simple and non-invasive assessment techniques as well as the intra-, inter- and cross-validation of techniques. Here, cross-correlation analyses may guide the future selection of reliable and robust parameters to be combined in a holistic concept that integrates all dimensions of the animal's burden, ultimately contributing to the realization of the refinement principle.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This review was supported by the Deutsche Forschungsgemeinschaft (DFG research group FOR 2591 ‘Severity assessment in animal-based research’, grant number: BL 953/10-1; PO 681/9-1; ZE 712/1-1; VO 450/15-1).
