
Abstract

Evidence-based severity assessment in laboratory animals is, apart from the ethical responsibility, imperative to generate reproducible, standardized and valid data. However, the path towards a valid study design determining the degree of pain, distress and suffering experienced by the animal is lined with pitfalls and obstacles, as we will elucidate in this review. Furthermore, we will consider the genesis of a holistic concept relying on multifactorial composite scales. These have to combine robust and reliable parameters to measure the multidimensional aspects that define the severity of animal experiments, generating a basis for the substantiation of the refinement principle.

Keywords

ethics, welfare, refinement, severity assessment

In an effort to express and confer meaning to the ethical concerns associated with the potential pain, suffering and distress animals might experience during scientific procedures, the replacement, reduction, refinement (3R) concept was established in 1959.¹ This concept is firmly anchored ethically and legally in animal-based research with numerous valuable refinement strategies ranging from, for example, the appropriate administration of analgesia to the pre-definition and application of humane endpoints.^2–4 The key element of refinement is, however, the prompt recognition and exact assessment of the overall status of the animal.⁵ Following on from this, a prospective and, to some extent, retrospective severity assessment of all experimental procedures undertaken on laboratory animals has been mandatory in the European Union since 2010.⁶ According to Articles 15, 38, 39 and 54 of Directive 2010/63/EU, experimental procedures have to be classified and allocated to severity categories as further specified in Annex VIII. However, due to the multitude of animal and disease models and the low availability of evidence-based, validated methods, this classification is not a trivial task and compounds the problems created by an obligatory ethical and legal framework in the context of great scientific uncertainty. Considering the multitude of facets and dimensions that techniques utilized to determine pain and distress have to cover, questions arise and pitfalls emerge. An issue of concern is the implementation of severity assessment procedures in the experimental study design without risking the validity of the study. There are ample examples of assessment parameters influencing the outcome of the respective experiment, thereby potentially introducing bias and eventually contributing to the reproducibility crisis, which is in and of itself ethically problematic.^7,8 Furthermore, scientists struggle with the design and structure of severity assessment strategies.
Is there an ideal study design to determine the degree of pain and distress? What aspects of the animal's burden have to be recognized? Which techniques are appropriate in which model and what techniques will influence each other, risking the generation of additive effects? Unfortunately, few non-invasive methods that will not influence the outcome of study parameters are available. Of course, minimal interference is imperative in routine severity assessment; however, it is less necessary when establishing new techniques and models or in the context of the validation of a putative refinement measure. Furthermore, it is questionable whether the utilization of single parameters is sufficient in this context or whether a multifactorial approach alone will adequately reflect the burden of the animal. Here, however, bias may even be a greater risk factor, for example due to the effect of chronological order or the reciprocal influences of assessment procedures.

In this review we will address these questions by briefly outlining the terms that determine the severity of animal experiments, while focusing on laboratory mice. We will extract, by way of example, benefits and drawbacks of severity assessment techniques and highlight how techniques might interfere with the purpose of an animal experiment. Furthermore, we will consider the structure of a valid study design to determine pain and distress in animals. Finally, by contrasting a single parameter versus a multifactorial approach, we will comment on the necessity of a holistic concept to measure all dimensions of the animal's burden.

A beast of burden: Multidimensional severity assessment

According to Annex VIII of European Union Directive 2010/63/EU, the ‘severity of a procedure shall be determined by the degree of pain, suffering, distress or lasting harm expected to be experienced by an individual animal during the course of the procedure’.⁶ In addition, cumulative suffering and prevention from expressing natural behaviour shall be taken into account. Therefore, the statutory provisions demanding a graded severity assessment encompass the terms ‘pain’, ‘suffering’, ‘distress’ and ‘lasting harm’. Here, definitions have evolved, culminating in a network of connecting and coalescing aspects that determine the burden of the animal (as outlined in Figure 1). We will, however, elucidate these definitions just briefly here, as extensive reviews are available on this topic.^9–11 Pain has been defined as ‘an unpleasant sensory and emotional experience associated with potential or actual tissue damage’ by the International Association for the Study of Pain with a corresponding definition for animals.^12,13 Recently, this definition has been extended to include sensory, emotional, cognitive and social components.¹⁴ Because pain cannot be assessed directly in animals, but is rather deduced from pain-associated behaviours, indirect methods that attempt to quantify pain-associated behaviour or nociception have been generated.^5,9,15,16 It is, however, generally accepted that species-specific multifactorial composite pain scales are imperative rather than relying on single behavioural tests.¹⁷ Regarding the concepts of ‘suffering’, ‘distress’ and ‘lasting harm’, precise definitions as well as objective measures are somewhat lacking.
Moberg and colleagues have, however, provided working definitions and a conceptual framework on the relationship of stress and distress in animals.^9,18 Here, consensus has been reached that distress arises when the biological costs of a preceding stress, defined as the adaptive response to a threat to the animal's homeostasis, negatively affect biological functions essential to the well-being of the animal.^9,18,19 In line with this interpretation, alterations in biological function may serve as indicators of distress. Pursuing this approach, suffering may arise when the biological adaptation processes fail to return the organism to homeostasis and the negative state affecting biological functions persists, leaving the animal in a state where it can no longer cope with or tolerate the inflicted level of distress or pain.¹¹ In a broader context, the term ‘suffering’ is often associated with other negative or adverse states such as pain, anxiety or fear, linking these concepts together without sufficiently demarcating them. This may also apply to the term ‘lasting harm’, as this may refer to and encompass continuing physical damage but may also evolve from persistent adverse states including fear, depression or anxiety. Considering the concept of anxiety independently, the measurement of this complex emotion becomes challenging due to the variety of anxiety facets, the diversity in underlying hypotheses, and the diverse behavioural tests that are directed at the involved psychophysiological processes but have also been shown to lack validity.^20–24 In addition, the concept has evolved recently that the suffering inflicted on the animal by scientific procedures can be assessed in its entirety only when aspects of the affective internal/emotional state are incorporated into the assessment.
Cognitive bias or judgement bias tests may reflect the internal affective or emotional state of animals, an attempt at including the animal's perspective into severity assessment strategies.^25–29 Affective experiences including emotions are, however, subjective states as well and cannot be measured directly, but have to be interpreted from physiological and behavioural indices, rendering the measurement of the affective state a major and complex challenge.³⁰
Figure 1. Multidimensional severity assessment. Connecting and coalescing terms determine the burden of the animal and have to be gauged relying on multiple parameters.

As we have seen, providing definitions for the process of severity assessment is not straightforward: an accurate distinction of terms is challenging, and terms are often used interchangeably or in loose combination. In addition, recognizing pain and distress in non-communicating beings relies heavily on the abilities and experience of the responsible observer as demonstrated recently.³¹ Furthermore, it is a balancing act to deduce the necessary information from indices inferred by observation without relying on assumptions regarding the animal's feelings.³² In summary, all aspects determining the burden of the animal have to be integrated into a multidimensional severity assessment relying on physiological, behavioural and biochemical techniques (see Figure 1).

Techniques to assess severity: Balancing burden, bias and benefits

Guidelines on the recognition of pain, distress and discomfort in experimental animals were introduced as early as 1985, taking into account changes in appearance, body weight, behaviour and clinical signs.² These guidelines have been extensively reviewed, introducing further assessment techniques, approaches and recommendations. Early on, attempts at the generation of indices relying on the numerical rating of severity components were made and extended to assess severity or welfare relying on systematic protocols and elaborate score sheets while also focusing on appropriate recording and reporting systems.^4,10,11,33,34 It has, however, been critically noted that the putative objectivity may merely be superficial, as most parameters have not been validated per se but are rather assumed to be related to pain and distress without the necessary statistical weighting or correlation, which renders these parameters subjectively chosen and highly observer dependent.^11,32,35 It is therefore imperative that severity assessment parameters are evaluated and validated as to whether they can serve as objective tools to detect pain and distress. In this context, we aim at providing examples of currently utilized parameters for severity assessment, describing burden and benefits while focusing on possible interferences with the experimental study design. However, as new approaches develop constantly, this list is not intended to be exhaustive, but to draw attention to obstacles and pitfalls that will arise during severity assessment, thereby raising awareness of the introduction of bias in experimental studies. In line with the components compiled by Hawkins and colleagues, severity assessment parameters were categorized into physiological, biochemical and behavioural techniques (see Table 1).³⁴
Table 1. Burden, bias, benefits and critical control points of individual severity assessment techniques.

Physiological

Clinical scoring
What burden is measured? Aimed at recognition of pain, distress and disease progression; score sheets generally encompass physiological and simple behavioural parameters (mainly focused on hyper- and hypoactivity)
What bias may be introduced? Observer dependency; inter-rater variability; dependency on subjective criteria (each criterion should be validated); dependency on robustness of the respective score sheet design; dependency on sensitivity (score sheets require validation that mild to moderate states of pain, distress and disease will be recognized as well)
What are the benefits? Easy to perform; not time consuming; non-invasive (depending on score sheet design, animals need not necessarily be handled)
What are the critical control points? Education and training programmes (to ensure observer experience, reliability and validity); score sheet design (validated, standardized and tailored to species and model); standardization
References: 2,31,33,34,36–42

Body weight
What burden is measured? Loss or gain generally monitored on the assumption that the course of body weight can directly correlate with disease progression or degree of pain, distress or suffering
What bias may be introduced? Influenced by many variables, which need not necessarily be model or severity related; direct correlation with severity is controversially discussed, rather model dependent; sensitivity
What are the benefits? Observer independent; easy to perform; not time consuming; non-invasive (although animals have to be handled)
What are the critical control points? Validation (does change of body weight reflect severity in the specific model?); correlation with other parameters?; standardization (may reduce bias, e.g. handling, transport)
References: 2,43–47

Telemetry
What burden is measured? Aimed at recognition of pain or distress by monitoring non-specific physiologic parameters such as heart rate or core body temperature that have to be put into context
What bias may be introduced? Influenced by many variables, which need not necessarily be model or severity related; due to high degree of invasiveness potential confounding effects on study outcome
What are the benefits? Observer independent; automatized assessment possible; signs of pain and distress may be detected that would not be monitored by direct observation; reduction of inter-animal variability
What are the critical control points? Validation and correlation with other parameters; standardization; automatization
References: 48,49

Behavioural

Mouse grimace scale
What burden is measured? Pain (spontaneously emitted pain behaviour measured according to facial expressions); evidence on the reflection of other emotional states
What bias may be introduced? Low-grade observer-dependency (also dependent on level of automatization); may require handling, short-term separation or single housing; dependent on reliable protocol and experimental set up
What are the benefits? High inter-rater reliability; easy to perform, but requires necessary equipment; expenditure of time dependent on stage of automatization; non-invasive (although animals may have to be handled)
What are the critical control points? Standardization; automatization
References: 50–56

Nest building
What burden is measured? Pain, distress, disease progression
What bias may be introduced? Observer dependency; potential inter-rater variability (dependent on optimization of protocol and training); dependent on reliable protocol and detailed scoring system; influenceable by environmental variables
What are the benefits? Sufficient inter-rater reliability (dependent on optimization of protocol and training); easy to perform; not time consuming; non-invasive
What are the critical control points? Education and training programmes (to ensure observer experience, reliability and validity); protocol optimization (tailored to species and model); standardization
References: 58–60

Burrowing
What burden is measured? Pain, distress, disease progression
What bias may be introduced? Dependent on reliable protocol
What are the benefits? Observer independent; easy to perform; not time consuming; non-invasive
What are the critical control points? Standardization
References: 57,58,61

Voluntary wheel running
What burden is measured? Pain and other severity aspects (recent evidence on the reflection of the motivational, emotional and cognitive state of animals)
What bias may be introduced? Potential effects on experimental results have been described; influenceable by environmental variables; may require single housing
What are the benefits? Observer independent; easily implemented, but requires necessary equipment; non-invasive; may be applicable in group housing
What are the critical control points? Automatization; standardization
References: 62–68

Biochemical

Corticosterone
What burden is measured? Stress, although assays do not distinguish between distress and eustress
What bias may be introduced? Influenceable by many variables, which need not necessarily be model or severity related; successful application dependent on assay selection, sampling procedure and physiological as well as analytical validation; sampling procedure may impact on study outcome depending on degree of invasiveness (e.g. blood sampling)
What are the benefits? Observer independent; easy to perform after establishment and validation; measurement of faecal corticosterone metabolites non-invasive
What are the critical control points? Establishment and validation of protocol and assay; standardization
References: 69–71

Clinical score sheets, which have been developed and refined for multiple animal and disease models, are major tools of severity assessment in laboratory animals.^2,36,37 Benefits include the successful and adequate application in experimental set ups, easy implementation and, depending on the score sheet design, only a limited necessity to handle the animal.³⁸ There are, however, increasing concerns regarding sensitivity, validation, standardization and observer dependency.^34,39–41 A major challenge in this context remains ascertaining whether clinical scoring sufficiently reflects mild to moderate states of pain and distress. If clinical scoring is applied, emphasis should be put on explicitly trained personnel and the utilization of validated, standardized and specifically tailored score sheets, as has been extensively discussed and reviewed.^31,33,41,42 Particular emphasis is often placed on the monitoring of the body weight of animals during experimental procedures, on the assumption that a negative energy balance will reflect the severity the animal experiences.^2,43 Indeed, shifts in energy utilization or energetic inefficiency may be the result of many stressors, and although there are multiple examples of the adequate application of body weight as a severity assessment parameter in experimental studies, insufficient sensitivity, validation and correlation with other parameters have been criticized.^44–46 The often-applied strict division of body weight change into predefined categories has, to our knowledge, never been validated and should therefore be critically evaluated before being used in severity assessment strategies, as it may not necessarily reflect model-specific dynamics or the multiple physiological processes involved in the change of body weight.
Furthermore, these pre-set divisions vary among guidelines as well as available clinical score sheets, which assign different percentages of body weight loss to a varying number of corresponding severity categories or score points.^2,11,40,42 In a critical appraisal of the suitability of body weight monitoring for severity assessment in different models, a minority of mice reached the predefined humane endpoint for body weight loss, but did not show any clinical abnormalities, underlining the strong need for validation (Talbot et al., this issue).⁴⁷ Physiological parameters may also be collected by invasive approaches, for example by telemetric recording after transmitter implantation. A major benefit is the observer independency, which is contrasted by a high degree of invasiveness and the monitoring of non-specific parameters such as heart rate, core body temperature or locomotor activity that may be influenced by many variables and have to be interpreted in the overall context, underlining again the need for standardized conditions.⁴⁸ However, telemetric recording may prove useful in detecting signs of pain and distress that otherwise would not be noticed by direct observation as, for example, described when assessing post-laparotomy pain in laboratory mice.⁴⁹

Furthermore, a multitude of behavioural severity assessment parameters is available. Grimace scales have been established in several species to assess spontaneously emitted pain behaviour and validity studies have been performed, for example, to ensure that handling techniques do not confound the assessment.^50–52 Benefits, including a high accuracy and inter-rater reliability with potential for standardization and automatization minimizing observer-dependency, as well as limitations, have been extensively reviewed.^51,53,54 Furthermore, other emotional states have been assessed recently by the analysis of facial expressions.^54–56 In addition, other home cage-based behaviours such as burrowing and nest building can serve as indicators of wellbeing, more specifically as parameters in models of psychiatric disorders and to assess pain, distress and disease progression.^57,58 However, environmental factors may have an impact, as it has been shown that, for example, nest-building behaviour was dependent on cage size, an easily overlooked husbandry detail.⁵⁹ Therefore, again, standardization is crucial, as assessment protocols may vary between laboratories, resulting in the lack or changed manifestation of these behaviours (Schwabe et al., this issue; Jirkof et al., this issue).^60,61 Another motivationally or emotionally driven behaviour, namely voluntary wheel running, has been utilized to assess the severity of experimental procedures.⁶² Benefits include an observer-independent, automatized assessment of severity that has been proposed to serve as an indicator of disturbed wellbeing (Mallien et al., this issue).⁶³ However, several potential effects on experimental set ups have been described, including increased neurogenesis, anatomical and physical changes and the prevention of learned helplessness/behavioural depression.^64–66 Moreover, wheel running was shown to be modulated by social interactions, and when wheel running was subjected to refinement changes by group housing in a study of this special issue, running behaviour of mice changed as well (Weegh et al., this issue).^67,68 There is also a wide panel of behaviours assessed in experimental environments that may be used to assess anxiety- and depression-like behaviour as extensively reviewed elsewhere.²² It has, however, been criticized recently that the validity of, for example, the plus-maze or the open-field test is insufficient.^20,21
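
Inter-rater reliability, raised above for grimace scales and nest building, is commonly quantified with a chance-corrected agreement statistic. The following is a minimal sketch of Cohen's kappa for two observers scoring the same animals. The statistic is not prescribed by the studies cited here, and the function name `cohens_kappa` as well as the nest scores are hypothetical illustrations.

```python
from collections import Counter


def cohens_kappa(rater_1, rater_2):
    """Cohen's kappa for two raters assigning categorical scores to the same items."""
    if len(rater_1) != len(rater_2) or not rater_1:
        raise ValueError("both raters must score the same, non-empty set of animals")
    n = len(rater_1)
    # Proportion of animals on which both observers gave the same score.
    observed = sum(a == b for a, b in zip(rater_1, rater_2)) / n
    # Agreement expected by chance, from each rater's marginal score frequencies.
    counts_1, counts_2 = Counter(rater_1), Counter(rater_2)
    expected = sum(counts_1[c] * counts_2[c] for c in counts_1) / (n * n)
    if expected == 1.0:  # both raters used one identical category throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical nest scores (0-3) given by two observers to the same ten cages.
scores_observer_1 = [0, 1, 2, 2, 1, 0, 3, 2, 1, 0]
scores_observer_2 = [0, 1, 2, 1, 1, 0, 3, 2, 2, 0]

kappa = cohens_kappa(scores_observer_1, scores_observer_2)
print(f"Cohen's kappa: {kappa:.2f}")
```

Plain kappa treats every disagreement as equally serious; for ordinal score sheets a weighted kappa or an intraclass correlation coefficient may be the more appropriate choice.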

With regard to biochemical parameters, measurement of glucocorticoids has been extensively applied for severity assessment with a focus on the analysis of stress. However, it needs to be kept in mind that hormone secretion is subject to many variables and that measurement will not distinguish between distress and eustress; therefore, the combination with behavioural or physiological parameters is recommended.^69,70 The sampling method and procedure are of crucial importance, as sampling itself can be stressful, thereby affecting study outcome.⁶² However, benefits of corticosterone measurement are observer independency and, if for example faecal corticosterone metabolites are measured, non-invasiveness.⁷¹

Finally, non-invasive imaging techniques are increasingly available for utilization in severity assessment strategies, allowing direct visualization and monitoring of disease progression in real time,⁷² although a major drawback is the necessity for anaesthesia with a potential confounding effect on study outcome and animal welfare.⁷³ Therefore, and due to the high equipment requirements, imaging techniques have been used mostly in combinations to set up and enhance composite scales for severity assessment. However, imaging biomarker candidates for behavioural impairment in rodent epilepsy models have been described recently, expanding the field of available imaging severity assessment techniques, which we will not address here separately.⁷⁴

It all comes down to study design

As extensively discussed above, although severity assessment techniques are being developed or re-evaluated, interferences with study outcome are abundant. Furthermore, it still has to be determined whether and how techniques will influence each other, whether even the chronological order of assessment will introduce bias in severity assessment as well as study outcome, and how these puzzle pieces can be put together to create a valid study design. Here, adherence to the basic principles of good scientific practice as laid out in the respective guidelines is essential. However, although it is generally accepted that flaws in the experimental design of animal experiments will lead to bias and ultimately to a lack of translatability, it took the wake-up call of the reproducibility crisis to establish universal rules not only on the reporting, but also on the design of animal studies.^7,8,75,76 Sufficient literature is therefore available on this topic with emphasis on the sources and control of variability, on practical aspects confronting scientists, on the statistical fundamentals and on the recently published PREPARE guidelines providing recommendations on the preparation of animal studies in the form of a checklist.^77–81 Furthermore, the utilization of a web-based experimental design assistant may prove helpful to detect confounding variables in laboratory animal studies.⁸² However, although the respective knowledge is freely available, it has only recently been demonstrated that there has been little improvement regarding reporting standards and that, for example, the anaesthetic and analgesic regimens used in animal research proposals needed optimization, implying there is still room for improvement in the quality of laboratory animal studies.^83,84

Apart from adhering to the appropriate guidelines, an additional criterion, namely minimal interference with the experimental study, is essential. There are numerous examples showing that the welfare status of the animal is a major contributor to obtaining valid experimental data, and even differential housing conditions or handling techniques were found to influence study outcomes and impact on the comprehension of biological processes.^61,85–88 Interestingly, however, in the first systematic attempt to determine the influence of environmental enrichment, the mean values of several physiological parameters were affected, but no consistent effect on variability was detected, demonstrating that housing conditions can be improved without compromising data.⁸⁹ Besides, scientists need to consider that the effects of pain, stress or distress may compromise experimental results and data validity.^90–92 Here, severity assessment itself may be confounded: for example, stress due to early separation from the dam has been shown to influence nociception in rats, and housing conditions influenced anxiety-related behaviour in mice.^93,94 This holds true even more when considering the recent discoveries regarding empathy for pain or distress in rodents and its potential impact on severity assessment.^95,96

Therefore, to reduce interferences to a minimum, there is a strong need for concise strategies and for the establishment of, as well as adherence to, standard operating procedures (SOPs). Precisely defined process steps, ideally made digitally available, will then allow for standardization on every level of the study design, leading to a high internal validity as impacts on the experimental process are minimized. It has, however, been critically noted that for behavioural testing emphasis should be put on environmental heterogenization by relying on multi-laboratory studies as well as on systematic and controlled within-laboratory heterogenization.⁹⁷ Here, it may, however, be challenging to improve the precision of estimates, especially with regard to inter-laboratory collaborations. To achieve reproducibility, comparing 95% confidence intervals of laboratory outcomes (e.g. obtained by the application of bootstrapping methods) may help to track differences in point estimates and distribution skewness. The necessity of implementing and harmonizing SOPs in study designs for severity assessment is, however, emphasized by findings described in this special issue. Here, a non-defined detail of the SOP for the assessment of burrowing behaviour in mice led to discrepancies in burrowing performance (Jirkof et al., this issue).⁶¹ In another study, scoring of nesting behaviour as predefined in the respective SOP had to be optimized to enhance inter-rater reliability (Schwabe et al., this issue).⁶⁰ Furthermore, effects of predictability and adaptation to procedures as well as effects of the chronology of procedures have to be taken into account.
Repeated predictable stress can cause resilience against colitis-induced behavioural changes in mice, and behavioural changes associated with learned helplessness do not occur if the stressor is controllable.^98,99 It may also be assumed that the chronological order of assessment techniques can generate additive or reciprocal effects.
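
The bootstrapping idea mentioned above can be sketched as follows: a percentile bootstrap 95% confidence interval for the difference in mean outcome between two laboratories. This is only an illustration under stated assumptions. The function name `bootstrap_ci_mean_diff`, the choice of the percentile method and the burrowing values are hypothetical and are not taken from the studies cited here.

```python
import random
import statistics


def bootstrap_ci_mean_diff(lab_a, lab_b, n_boot=5000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for mean(lab_a) - mean(lab_b)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    diffs = []
    for _ in range(n_boot):
        # Resample each laboratory separately, with replacement.
        resample_a = rng.choices(lab_a, k=len(lab_a))
        resample_b = rng.choices(lab_b, k=len(lab_b))
        diffs.append(statistics.fmean(resample_a) - statistics.fmean(resample_b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical burrowing results (g of substrate removed per animal) from two laboratories.
lab_a = [142, 155, 130, 160, 148, 151, 137, 158]
lab_b = [120, 135, 128, 110, 140, 125, 118, 132]

point_estimate = statistics.fmean(lab_a) - statistics.fmean(lab_b)
low, high = bootstrap_ci_mean_diff(lab_a, lab_b)
print(f"difference in means: {point_estimate:.1f} g, 95% CI [{low:.1f}, {high:.1f}]")
```

An interval that excludes zero would point to a systematic difference between the laboratories, for instance one caused by an undefined SOP detail. With small groups the percentile interval is only approximate.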

In summary, the implementation of severity assessment into an experimental study design seems conceivable only in close connection with the establishment of strictly standardized strategies and superordinate systems relying on objective and consistently applicable measures that leave enough room to be tailored to the multitude of animal models and procedures.³³ Furthermore, these comprehensive systems will rely on the standardized recording and analysis of physiological, behavioural and biochemical as well as imaging techniques.^33,34

Do we have to have them all?

Early on, it was questioned why there is no simple way to measure animal welfare; single parameters may not accurately depict the burden of the animal and the application of multiple parameters might yield conflicting results.¹⁰⁰ However, the lessons learned from efforts to optimize pain assessment in animals indicate the usefulness of species-specific multidimensional composite pain scales, which became a standard procedure for the assessment of the sensory-discriminatory and the affective-emotional dimensions of pain in veterinary clinics.¹⁷ Along this line, it was recently suggested that a combination of approaches analysing behaviour and appearance of animals should be used for laboratory rodents, although a reference assessment scheme is not yet available.¹⁰¹ In addition, the necessity to select the appropriate indicators as well as the appropriate number of indicators has been postulated (for a detailed guide, see the report of the BVAAWF/FRAME/RSPCA/UFAW Joint Working Group on Refinement).^34,102 In this context it may also be conceivable to set up combinations of indices in accordance with the Five Domains Model, resulting in a conceptual framework by evaluating nutrition, environment, health, behaviour and mental state of an animal.¹⁰³

Recently, a panel of parameters was successfully utilized to determine the severity of single and repeated isoflurane anaesthesia procedures.¹⁰⁴ In another study, a combination of endocrinological, physical and behavioural parameters was used, yielding no signs of compromised welfare after either single or repeated open-field testing.¹⁰⁵ However, choosing from a list of available parameters leaves room for subjectivity. In this context, validity studies may prove beneficial, as demonstrated recently in a study concerned with grimace scales utilizing a statistical approach to identify a classifier to estimate the pain status in animals.⁴⁹ Furthermore, when analysing multiple parameters, the application of comprehensive statistical and bioinformatic procedures can provide valuable additional information. For instance, principal component analysis (PCA) of complex datasets can visualize distance and relatedness between animal groups taking multiple variables into account.¹⁰⁶ Thus, regarding evidence-based severity assessment, PCA may aid in finding the most appropriate parameters, as demonstrated recently in a study assessing severity in a rat model of repeated seizures.¹⁰⁶ For this, however, certain criteria such as linear correlation of data have to be met. In addition, the implicit assumption about the data distribution makes it hard to find independent features in non-Gaussian data, which are rather abundant in animal science, notably in the form of score values. Therefore, PCA cannot be regarded as a generalized model for finding optimal parameters. For this purpose, heuristic methods based on mutual information and entropy as well as manifold techniques for non-linear cases might be better suited. Another approach may focus on the identification of clusters.
Clusters in large underlying datasets, as obtained by measuring multiple severity assessment parameters, can be distinguished by cluster analysis. This was done in a recent study relying on the automated assessment of voluntary wheel running, an emotionally or motivationally driven behaviour. Here, unsupervised k-means cluster analysis of voluntary wheel running and body weight data enabled the discrimination of distinct levels of severity, allowing an unbiased individual severity grading in laboratory mice.⁶² Finally, the comprehensive analysis of multiple behavioural, biochemical and physiological parameters can yield knowledge about the informative value of a selected parameter and of its relative alteration in an experimental paradigm. Here, correlation patterns may be detected between simple, easy-to-apply behavioural assays or biochemical parameters and more complex behavioural paradigms. The application of comprehensive statistical and bioinformatic procedures can therefore provide a basis for selecting appropriate or candidate parameters with high significance and validity for evidence-based severity assessment.
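The cluster-based grading described above can be illustrated with a minimal sketch. It is not the pipeline of the cited study: the animal data are invented for illustration, the number of clusters (three) is an assumption, and the deterministic farthest-first initialisation is a simplification chosen here so that the example always gives the same result. Both parameters are standardized first so that neither dominates the distance measure.

```python
import math

def zscore_columns(rows):
    """Scale every column to mean 0 and SD 1 so no parameter dominates the distance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]

def farthest_first(points, k):
    """Deterministic start: take the first point, then repeatedly add the
    point that is farthest from all centroids chosen so far (an assumption
    made for this sketch, not part of standard k-means)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids

def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns one cluster label per point plus the centroids."""
    centroids = farthest_first(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster runs empty
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centroids

# Hypothetical animals: (wheel running in % of baseline, body weight change in %).
animals = [
    (100, 0), (95, -1), (105, 1), (98, 0),
    (60, -5), (55, -6), (65, -4), (58, -5),
    (15, -15), (10, -17), (20, -14), (12, -16),
]
labels, centroids = kmeans(zscore_columns(animals), k=3)

# Order clusters by mean wheel running: grade 0 is the least affected cluster.
order = sorted(range(3), key=lambda j: -centroids[j][0])
rank = {j: r for r, j in enumerate(order)}
grades = [rank[lab] for lab in labels]
```

The resulting `grades` are only ordinal cluster ranks. Whether a given cluster corresponds to a particular severity category would still have to be established by validation against independent data.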

However, to put these notions into practice, a step-by-step approach may be necessary. First, it will have to be defined which parameters are sensitive enough to measure distress with low inter-rater variability. Then, cut-offs indicating a specific distress level will have to be defined relying on statistical and bioinformatic procedures. These will have to be validated in independent datasets as well as in different laboratories. The general applicability will then have to be assessed by applying these validated parameters and cut-offs in other animal models. In a next step, suitable parameters will have to be combined into a composite scale. Finally, the validity and broad applicability of this scale will likewise have to be confirmed in independent datasets and distinct animal models.
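The first two steps of this approach can be sketched under stated assumptions. The example below uses Cohen's kappa as one possible measure of inter-rater agreement and Youden's J statistic as one possible criterion for a cut-off; neither is prescribed by the text above. The scores, the body weight values and the split into a reference group and an affected group are all hypothetical.

```python
from collections import Counter

def cohens_kappa(scores_a, scores_b):
    """Chance-corrected agreement between two raters scoring the same animals.
    Undefined (division by zero) if the expected agreement equals 1."""
    n = len(scores_a)
    observed = sum(a == b for a, b in zip(scores_a, scores_b)) / n
    freq_a, freq_b = Counter(scores_a), Counter(scores_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

def youden_cutoff(reference, affected):
    """Threshold t that maximizes sensitivity + specificity - 1
    for the decision rule 'value >= t means affected'."""
    best_t, best_j = None, -1.0
    for t in sorted(set(reference) | set(affected)):
        sensitivity = sum(v >= t for v in affected) / len(affected)
        specificity = sum(v < t for v in reference) / len(reference)
        j = sensitivity + specificity - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j

# Hypothetical clinical scores (0-2) given by two observers to ten animals.
rater_a = [0, 0, 1, 1, 2, 2, 0, 1, 2, 1]
rater_b = [0, 0, 1, 2, 2, 2, 0, 1, 2, 0]
kappa = cohens_kappa(rater_a, rater_b)

# Hypothetical body weight loss (%) in a reference group and an affected group.
reference_loss = [0, 1, 2, 1, 3]
affected_loss = [4, 6, 8, 5, 2]
cutoff, youden_j = youden_cutoff(reference_loss, affected_loss)
```

A cut-off derived in this way depends entirely on how the reference and affected groups were defined, which is why the subsequent validation in independent datasets and laboratories remains essential.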

An ethical and legal challenge

In essence, this approach would go a long way towards ameliorating current methods for satisfying ethical and legal requirements in severity assessment. It has become clear that, despite its long history in animal welfare regulation, the area of severity assessment in laboratory animals is running the risk of succumbing to what can only be described as an ‘ought-implies-can’ crisis: the lack of a common language and shortage of agreed definitions create a baseline of ambiguity about what severity means and how it may be measured. At the same time, the ethical and legal framework requires that certainty be established before the justifiability of an experiment is proven. This creates the very unusual constellation that a socially desirable activity that by necessity takes place in a space of epistemic uncertainty (research) must only be undertaken when ethically justifiable, which ought to be proven to a standard that the epistemic uncertainty itself prevents: severity ought to be proven with certainty, but cannot. The scientific community's natural response to this challenge is to identify quantitative, and reproducible, approaches to what regulators will mostly have seen as an inherently qualitative task: pain, suffering and distress are emotional responses that we wish to avoid in a fellow creature first, and only by subsequent deduction are spikes in an organism's hormone levels, or morbidly accelerated breathing emblematic of these feelings. The juxtapositional challenge of fulfilling the compelling ethical (and legal) expectations in a fashion that is compatible with good scientific practice (which is itself a paramount ethical concern) can only convincingly be overcome by establishing strong common standards, sharing data as broadly as possible, improving the training of personnel and continuous development of better techniques.

Conclusion

Unravelling the actual severity of experimental procedures has become mandatory for ethical reasons as well as to ensure the generation of reproducible, standardized and valid data. Considering the variety of influencing factors that in addition to pain can contribute to an experimental animal's burden, it is highly probable that the analysis of the overall or cumulative severity requires multidimensional composite scales with a combination of robust parameters.

However, the path to establishing such an ideal study design is strewn with obstacles due to the multitude of severity aspects, the potential for interference with the experimental study, the lack of standardization and the need for a statistically based, multifactorial and multi-centred parameter approach. Therefore, an evidence-based severity assessment needs to account for the potential introduction of bias by implementing highly standardized SOPs, strategies and superordinate systems. Furthermore, continuously applicable, robust parameters should aim at the immediate identification of the actual, real-time severity experienced by the animal. This requires the development of further simple and non-invasive assessment techniques as well as the intra-, inter- and cross-validation of techniques. Here, cross-correlation analyses may guide the future selection of reliable and robust parameters to be combined in a holistic concept that integrates all dimensions of the animal's burden, ultimately contributing to the realization of the refinement principle.

Parameter	What burden is measured?	What bias may be introduced?	What are the benefits?	What are the critical control points?	References
Physiological
Clinical scoring	- Aimed at recognition of pain, distress and disease progression - Score sheets generally encompass physiological and simple behavioural parameters (mainly focused on hyper- and hypoactivity)	- Observer dependency - Inter-rater variability - Dependency on subjective criteria (each criterion should be validated) - Dependency on robustness of the respective score sheet design - Dependency on sensitivity (score sheets require validation that mild to moderate states of pain, distress and disease will be recognized as well)	- Easy to perform - Not time consuming - Non-invasive (depending on score sheet design, animals need not necessarily be handled)	- Education and training programmes (to ensure observer experience, reliability and validity) - Score sheet design (validated, standardized and tailored to species and model) - Standardization	2,31,33,34,36–42
Body weight	- Loss or gain generally monitored on the assumption that the course of body weight can directly correlate with disease progression or degree of pain, distress or suffering	- Influenced by many variables, which need not necessarily be model or severity related - Direct correlation with severity is controversially discussed, rather model dependent - Sensitivity	- Observer independent - Easy to perform - Not time consuming - Non-invasive (although animals have to be handled)	- Validation (does change of body weight reflect severity in the specific model?) - Correlation with other parameters? - Standardization (may reduce bias, e.g. handling, transport)	2,43–47
Telemetry	- Aimed at recognition of pain or distress by monitoring non-specific physiologic parameters such as heart rate or core body temperature that have to be put into context	- Influenced by many variables, which need not necessarily be model or severity related - Due to high degree of invasiveness potential confounding effects on study outcome	- Observer independent - Automatized assessment possible - Signs of pain and distress may be detected that would not be monitored by direct observation - Reduction of inter-animal variability	- Validation and correlation with other parameters - Standardization - Automatization	48,49
Behavioural
Mouse grimace scale	- Pain (spontaneously emitted pain behaviour measured according to facial expressions) - Evidence on the reflection of other emotional states	- Low-grade observer-dependency (also dependent on level of automatization) - May require handling, short-term separation or single housing - Dependent on reliable protocol and experimental set up	- High inter-rater reliability - Easy to perform, but requires necessary equipment - Expenditure of time dependent on stage of automatization - Non-invasive (although animals may have to be handled)	- Standardization - Automatization	50–56
Nest building	- Pain, distress, disease progression	- Observer dependency - Potential inter-rater variability (dependent on optimization of protocol and training) - Dependent on reliable protocol and detailed scoring system - Influenceable by environmental variables	- Sufficient inter-rater reliability (dependent on optimization of protocol and training) - Easy to perform - Not time consuming - Non-invasive	- Education and training programmes (to ensure observer experience, reliability and validity) - Protocol optimization (tailored to species and model) - Standardization	58–60
Burrowing	- Pain, distress, disease progression	- Dependent on reliable protocol	- Observer independent - Easy to perform - Not time consuming - Non-invasive	- Standardization	57,58,61
Voluntary wheel running	- Pain and other severity aspects (recent evidence on the reflection of the motivational, emotional and cognitive state of animals)	- Potential effects on experimental results have been described - Influenceable by environmental variables - May require single housing	- Observer independent - Easily implemented, but requires necessary equipment - Non-invasive - May be applicable in group housing	- Automatization - Standardization	62–68
Biochemical
Corticosterone	- Stress, although assays do not distinguish between distress and eustress	- Influenceable by many variables, which need not necessarily be model or severity related - Successful application dependent on assay selection, sampling procedure and physiological as well as analytical validation - Sampling procedure may impact on study outcome depending on degree of invasiveness (e.g. blood sampling)	- Observer independent - Easy to perform after establishment and validation - Measurement of faecal corticosterone metabolites non-invasive	- Establishment and validation of protocol and assay - Standardization	69–71

Footnotes

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This review was supported by the Deutsche Forschungsgemeinschaft (DFG research group FOR 2591 ‘Severity assessment in animal-based research’, grant number: BL 953/10-1; PO 681/9-1; ZE 712/1-1; VO 450/15-1).

References

1. Russell WMS, Burch. The principles of humane experimental technique, Wheathampstead (UK): Universities Federation for Animal Welfare, 1959.

2. Morton, Griffiths. Guidelines on the recognition of pain, distress and discomfort in experimental animals and an hypothesis for assessment. Vet Rec 1985; 116: 431–436.

3. Tannenbaum, Bennett. Russell and Burch's 3Rs then and now: The need for clarity in definition and purpose. J Am Assoc Lab Anim Sci 2015; 54: 120–132.

4. Morton. A systematic approach for establishing humane endpoints. ILAR J 2000; 41: 80–86.

5. Flecknell. Refinement of animal use: Assessment and alleviation of pain and distress. Lab Anim 1994; 28: 222–231.

6. Directive 2010/63/EU of the European Parliament and of the Council of 22 September 2010 on the protection of animals used for scientific purposes. O J Eur Union 2010; L276: 233279.

7. Begley, Ioannidis. Reproducibility in science: Improving the standard for basic and preclinical research. Circ Res 2015; 116: 116–126.

8. Kilkenny, Parsons, Kadyszewski, et al. Survey of the quality of experimental design, statistical analysis and reporting of research using animals. PLoS One 2009; 4: e7824.

9. Carstens, Moberg. Recognizing pain and distress in laboratory animals. ILAR J 2000; 41: 62–71.

10. Wallace, Sanford, Smith, et al. The assessment and control of the severity of scientific procedures on laboratory animals. Lab Anim 1990; 24: 97–130.

11. Pain and distress in laboratory rodents and lagomorphs. Report of the Federation of European Laboratory Animal Science Associations (FELASA) Working Group on Pain and Distress accepted by the FELASA Board of Management November 1992. Lab Anim 1994; 28: 97–112.

12. Part III: Pain Terms, A Current List with Definitions and Notes on Usage. In: Merskey, Bogduk (eds). Classification of Chronic Pain, 2nd ed. Seattle: IASP Press, 1994, pp. 209–214.

13. Zimmerman. Physiological mechanisms of pain and its treatment. Klinische Anaesthesiol Intensivether 1986; 32: 1–19.

14. Williams, Craig. Updating the definition of pain. Pain 2016; 157: 2420–2423.

15. Mogil. Animal models of pain: Progress and challenges. Nat Rev Neurosci 2009; 10: 283–294.

16. Deuis, Dvorakova, Vetter. Methods used to evaluate pain behaviors in rodents. Front Mol Neurosci 2017; 10: 284.

17. Reid, Nolan, Scott. Measuring pain in dogs and cats using structured behavioural observation. Vet J 2018; 236: 72–79.

18. Moberg. When does stress become distress? Lab Anim 1999; 28: 22–26.

19. Committee on Recognition and Alleviation of Distress in Laboratory Animals, National Research Council. Recognition and Alleviation of Distress in Laboratory Animals, Washington (DC), 2008.

20. Ennaceur. Tests of unconditioned anxiety: Pitfalls and disappointments. Physiol Behav 2014; 135: 55–71.

21. Ennaceur, Chazot. Preclinical animal anxiety research: Flaws and prejudices. Pharmacol Res Perspect 2016; 4: e00223.

22. Harro. Animals, anxiety, and anxiety disorders: How to measure anxiety in rodents and why. Behav Brain Res 2018; 352: 81–93.

23. Hoffman. New dimensions in the use of rodent behavioral tests for novel drug discovery and development. Expert Opin Drug Discov 2016; 11: 343–353.

24. Rodgers. Animal models of ‘anxiety’: Where next? Behav Pharmacol 1997; 8: 477–496; discussion 497–504.

25. Boleij, van't Klooster, Lavrijsen, et al. A test to identify judgement bias in mice. Behav Brain Res 2012; 233: 45–54.

26. Guldimann, Vogeli, Wolf, et al. Frontal brain deactivation during a non-verbal cognitive judgement bias test in sheep. Brain Cogn 2015; 93: 35–41.

27. Habedank, Kahnau, Diederich, et al. Severity assessment from an animal’s point of view. Berl Münch Tierärztl Wochenschr 2018; 131: 304–320.

28. Kloke, Schreiber, Bodden, et al. Hope for the best or prepare for the worst? Towards a spatial cognitive bias test for mice. PLoS One 2014; 9: e105431.

29. Harding, Paul, Mendl. Animal behaviour: Cognitive bias and affective state. Nature 2004; 427: 312.

30. Flecknell, Leach, Bateson. Affective state and quality of life in mice. Pain 2011; 152: 963–964.

31. Herrmann, Flecknell. Severity classification of surgical procedures and application of health monitoring strategies in animal research proposals: A retrospective review. Altern Lab Anim 2018; 46: 273–289.

32. Flecknell. Replacement, reduction and refinement. ALTEX 2002; 19: 73–78.

33. Smith, Anderson, Degryse, et al. Classification and reporting of severity experienced by animals used in scientific procedures: FELASA/ECLAM/ESLAV Working Group report. Lab Anim 2018; 52: 5–57.

34. Hawkins, Morton, Burman, et al. A guide to defining and implementing protocols for the welfare assessment of laboratory animals: Eleventh report of the BVAAWF/FRAME/RSPCA/UFAW Joint Working Group on Refinement. Lab Anim 2011; 45: 1–13.

35. Hawkins. Recognizing and assessing pain, suffering and distress in laboratory animals: A survey of current practice in the UK with recommendations. Lab Anim 2002; 36: 378–395.

36. van Griensven, Dahlweid, Giannoudis, et al. Dehydroepiandrosterone (DHEA) modulates the activity and the expression of lymphocyte subpopulations induced by cecal ligation and puncture. Shock 2002; 18: 445–449.

37. Jirkof, Tourvieille, Cinelli, et al. Buprenorphine for pain relief in mice: Repeated injections vs sustained-release depot formulation. Lab Anim 2015; 49: 177–187.

38. Lloyd, Wolfensohn. Practical use of distress scoring systems in the application of humane endpoints. In: Hendriksen CFMM, D. B. (ed). International Conference on humane endpoints, Zeist: The Royal Society of Medicine Press, 1999, pp. 48–53.

39. Keubler, Tolba, Bleich, et al. Severity assessment in laboratory animals: A short overview on potentially applicable parameters. Berl Münch Tierärztl Wochenschr 2018; 131: 299–303.

40. Kanzler, Rix, Czigany, et al. Recommendation for severity assessment following liver resection and liver transplantation in rats: Part I. Lab Anim 2016; 50: 459–467.

41. Palle, Ferreira, Methner, et al. The more the merrier? Scoring, statistics and animal welfare in experimental autoimmune encephalomyelitis. Lab Anim 2016; 50: 427–432.

42. Fentener van Vlissingen, Borrens, Girod, et al. The reporting of clinical signs in laboratory animals: FELASA Working Group Report. Lab Anim 2015; 49: 267–283.

43. Ullman-Cullere, Foltz. Body condition scoring: A rapid and accurate method for assessing health status in mice. Lab Anim Sci 1999; 49: 319–323.

44. Elsasser, Kahl, Rumsey, et al. Modulation of growth performance in disease: Reactive nitrogen compounds and their impact on cell proteins. Domest Anim Endocrinol 2000; 19: 75–84.

45. Laugero, Moberg. Energetic response to repeated restraint stress in rapidly growing mice. Am J Physiol Endocrinol Metab 2000; 279: E33–43.

46. Harris, Zhou, Youngblood, et al. Effect of repeated stress on body weight and body composition of rats fed low- and high-fat diets. Am J Physiol 1998; 275: R1928–1938.

47. Talbot SR, Biernot S, Bleich A, et al. Defining body weight reduction as a humane endpoint: a critical appraisal. Lab Anim 2020; 54: 99–110.

48. Cesarovic, Jirkof, Rettich, et al. Implantation of radiotelemetry transmitters yielding data on ECG, heart rate, core body temperature and activity in free-moving laboratory mice. J Vis Exp 2011; 57: 3260.

49. Arras, Rettich, Cinelli, et al. Assessment of post-laparotomy pain in laboratory mice by telemetric recording of heart rate and heart rate variability. BMC Vet Res 2007; 3: 16.

50. Dalla Costa, Pascuzzo, Leach, et al. Can grimace scales estimate the pain status in horses and mice? A statistical approach to identify a classifier. PLoS One 2018; 13: e0200339.

51. Langford, Bailey, Chanda, et al. Coding of facial expressions of pain in the laboratory mouse. Nat Methods 2010; 7: 447–449.

52. Miller, Leach. The effect of handling method on the mouse grimace scale in two strains of laboratory mice. Lab Anim 2016; 50: 305–307.

53. Häger, Biernot, Buettner, et al. The Sheep Grimace Scale as an indicator of post-operative distress and pain in laboratory sheep. PLoS One 2017; 12: e0175839.

54. Descovich, Wathan, Leach, et al. Facial expression: An under-utilised tool for the assessment of welfare in mammals. ALTEX 2017; 34: 409–429.

55. Camerlink, Coulange, Farish, et al. Facial expression as a potential measure of both intent and emotion. Sci Rep 2018; 8: 17602.

56. Finlayson, Lampe, Hintze, et al. Facial Indicators of Positive Emotions in Rats. PLoS One 2016; 11: e0166446.

57. Deacon. Burrowing in rodents: a sensitive method for detecting behavioral dysfunction. Nat Protoc 2006; 1: 118–121.

58. Jirkof. Burrowing and nest building behavior as indicators of well-being in mice. J Neurosci Methods 2014; 234: 139–146.

59. Gaskill, Pritchett-Corning. The effect of cage space on behavior and reproduction in Crl:CD1(Icr) and C57BL/6NCrl Laboratory Mice. PLoS One 2015; 10: e0127875.

60. Schwabe K, Boldt L, Bleich A, et al. Nest-building performance in rats: impact of vendor, experience, and sex. Lab Anim 2020; 54: 17–25.

61. Jirkof P, Abdelrahman A, Bleich A, et al. A safe bet? Inter-laboratory variability in behavior-based severity assessment. Lab Anim 2020; 54: 73–82.

62. Häger, Keubler, Talbot, et al. Running in the wheel: Defining individual severity levels in mice. PLoS Biol 2018; 16: e2006159.

63. Mallien AS, Häger C, Palme R, et al. Systematic analysis of severity in a widely-used cognitive depression model for mice. Lab Anim 2020; 54: 40–49.

64. Greenwood, Foley, Day, et al. Freewheel running prevents learned helplessness/behavioral depression: Role of dorsal raphe serotonergic neurons. J Neurosci 2003; 23: 2889–2898.

65. Richter, Gass, Fuss. Resting is rusting: A critical view on rodent wheel-running behavior. Neuroscientist 2014; 20: 313–325.

66. Fuss, Ben Abdallah, Vogt, et al. Voluntary exercise induces anxiety-like behavior in adult C57BL/6J mice correlating with hippocampal neurogenesis. Hippocampus 2010; 20: 364–376.

67. Dewan, Garland Jr, Hiramatsu, et al. I smell a mouse: Indirect genetic effects on voluntary wheel-running distance, duration and speed. Behav Genet 2019; 49: 49–59.

68. Weegh N, Füner J, Jahnke O, et al. Wheel running behaviour in group-housed female mice indicates disturbed wellbeing due to DSS colitis. Lab Anim 2020; 54: 63–72.

69. Mormede, Andanson, Auperin, et al. Exploration of the hypothalamic-pituitary-adrenal function as a tool to evaluate animal welfare. Physiol Behav 2007; 92: 317–339.

70. Ralph, Tilbrook. INVITED REVIEW: The usefulness of measuring glucocorticoids for assessing animal welfare. J Anim Sci 2016; 94: 457–470.

71. Palme. Non-invasive measurement of glucocorticoids: Advances and problems. Physiol Behav 2019; 199: 229–243.

72. Michael, Keubler, Smoczek, et al. Quantitative phenotyping of inflammatory bowel disease in the IL-10-deficient mouse by use of noninvasive magnetic resonance imaging. Inflamm Bowel Dis 2013; 19: 185–193.

73. Tremoleda, Kerton, Gsell. Anaesthesia and physiological monitoring during in vivo imaging of laboratory rodents: Considerations on experimental outcomes and animal welfare. EJNMMI Res 2012; 2: 44.

74. van Dijk, Di Liberto, Brendel, et al. Imaging biomarkers of behavioral impairments: A pilot micro-positron emission tomographic study in a rat electrical post-status epilepticus model. Epilepsia 2018; 59: 2194–2205.

75. Howells, Sena, Macleod. Bringing rigour to translational medicine. Nat Rev Neurol 2014; 10: 37–43.

76. Smith, Clutton, Lilley, et al. Improving animal research: PREPARE before you ARRIVE. BMJ 2018; 360: k760.

77. Smith, Clutton, Lilley, et al. PREPARE: guidelines for planning animal research and testing. Lab Anim 2018; 52: 135–141.

78. Colman. Impact of the genetics and source of preclinical safety animal models on study design, results, and interpretation. Toxicol Pathol 2017; 45: 94–106.

79. Festing, Altman. Guidelines for the design and statistical analysis of experiments using laboratory animals. ILAR J 2002; 43: 244–258.

80. Johnson, Besselsen. Practical aspects of experimental design in animal research. ILAR J 2002; 43: 202–206.

81. Howard. Control of variability. ILAR J 2002; 43: 194–201.

82. du Sert, Bamsey, Bate, et al. The experimental design assistant. Nat Methods 2017; 14: 1024–1025.

83. Herrmann, Flecknell. Retrospective review of anesthetic and analgesic regimens used in animal research proposals. ALTEX 2019; 36: 65–80.

84. Baker, Lidster, Sottomayor, et al. Two years later: journals are not yet enforcing the ARRIVE guidelines on reporting standards for pre-clinical animal studies. PLoS Biol 2014; 12: e1001756.

85. Wurbel. Ideal homes? Housing effects on rodent brain and behaviour. Trends Neurosci 2001; 24: 207–211.

86. Poole. Happy animals make good science. Lab Anim 1997; 31: 116–124.

87. Clarkson, Dwyer, Flecknell, et al. Handling method alters the hedonic value of reward in laboratory mice. Sci Rep 2018; 8: 2448.

88. Garrido, De Blas, Ronzoni, et al. Differential effects of environmental enrichment and isolation housing on the hormonal and neurochemical responses to stress in the prefrontal cortex of the adult rat: Relationship to working and emotional memories. J Neural Transm (Vienna) 2013; 120: 829–843.

89. Andre, Gau, Scheideler, et al. Laboratory mouse housing conditions can be improved using common environmental enrichment without compromising data. PLoS Biol 2018; 16: e2005019.

90. Everds, Snyder, Bailey, et al. Interpreting stress responses during routine toxicity studies: A review of the biology, impact, and assessment. Toxicol Pathol 2013; 41: 560–614.

91. Garner. Stereotypies and other abnormal repetitive behaviors: Potential impact on validity, reliability, and replicability of scientific outcomes. ILAR J 2005; 46: 106–117.

92. Jirkof. Side effects of pain and analgesia in animal experimentation. Lab Anim (NY) 2017; 46: 123–128.

93. Dickinson, Leach, Flecknell. Influence of early neonatal experience on nociceptive responses and analgesic effects in rats. Lab Anim 2009; 43: 11–16.

94. Burman, Buccarello, Redaelli, et al. The effect of two different individually ventilated cage systems on anxiety-related behaviour and welfare in two strains of laboratory mouse. Physiol Behav 2014; 124: 92–99.

95. Chen. Empathy for distress in humans and rodents. Neurosci Bull 2018; 34: 216–236.

96. Mogil. Social modulation of and by pain in humans and rodents. Pain 2015; 156(Suppl 1): S35–S41.

97. Richter, Garner, Wurbel. Environmental standardization: Cure or cause of poor reproducibility in animal experiments? Nat Methods 2009; 6: 257–261.

98. Hassan, Jain, Reichmann, et al. Repeated predictable stress causes resilience against colitis-induced behavioral changes in mice. Front Behav Neurosci 2014; 8: 386.

99. Maier, Watkins. Stressor controllability and learned helplessness: The roles of the dorsal raphe nucleus, serotonin, and corticotropin-releasing factor. Neurosci Biobehav Rev 2005; 29: 829–841.

100. Mason, Mendl. Why is there no simple way of measuring animal welfare? Anim Welfare 1993; 2: 301–319.

101. Flecknell. Rodent analgesia: Assessment and therapeutics. Vet J 2018; 232: 70–77.

102. Broom. Assessing welfare and suffering. Behav Processes 1991; 25: 117–123.

103. Mellor. Operational details of the five domains model and its key applications to the assessment and management of animal welfare. Animals (Basel) 2017, pp. 7.

104. Hohlbaum, Bert, Dietze, et al. Severity classification of repeated isoflurane anesthesia in C57BL/6JRj mice-Assessing the degree of distress. PLoS One 2017; 12: e0179588.

105. Bodden, Siestrup, Palme, et al. Evidence-based severity assessment: Impact of repeated versus single open-field testing on welfare in C57BL/6J mice. Behav Brain Res 2018; 336: 261–268.

106. Moller, Wolf, van Dijk, et al. Toward evidence-based severity assessment in rat models with repeated seizures: I. Electrical kindling. Epilepsia 2018; 59: 765–777.

Where are we heading? Challenges in evidence-based severity assessment
