Abstract
The Mouse Grimace Scale (MGS) is an established method for estimating pain in mice during animal studies. Recently, an improved and standardized MGS set-up and an algorithm for automated and blinded output of images for MGS evaluation were introduced. The present study evaluated the application of this standardized set-up and the robustness of the associated algorithm at four facilities in different locations and as part of varied experimental projects. Experiments using the MGS performed at four facilities (F1–F4) were included in the study; 200 pictures per facility (100 pictures each rated as positive and negative by the algorithm) were evaluated by three raters for image quality and reliability of the algorithm. In three of the four facilities, sufficient image quality and consistency were demonstrated. The intraclass correlation coefficient, calculated to assess agreement among raters at the three facilities (F1–F3), showed excellent correlation. The specificity and sensitivity of the results obtained by different raters and the algorithm were analysed using Fisher's exact test (p < 0.05). The analysis indicated a sensitivity of 77% and a specificity of 64%. The results of our study showed that the algorithm performed robustly at facilities in different locations, provided that our MGS set-up was strictly applied.
Based on the developments in experimental animal research and implementation of the European Union Directive (2010/63 EU) 1 on the protection of animals used for scientific purposes, ensuring the highest possible level of animal well-being has become a major priority in animal studies. Article 15 of this directive mandates a severity assessment of each procedure in an animal study. 1 Based on this requirement, methods of evaluating any changes in animal well-being and estimating potential suffering are necessary.
Because most rodent species used in animal research are flight and prey animals,2,3 these animals avoid overtly exhibiting signs or vocalizing their pain. 4 Due to this lack of self-indication of pain severity by the animals, 4 objective criteria must be implemented to assess pain severity.
Researchers have employed various methods for this purpose, including clinical examinations and scoring sheets, 5 specific stress parameter evaluations and behavioural tests.6–8
The Mouse Grimace Scale (MGS), a noninvasive method of visually recognizing pain on the basis of facial expressions of mice, 9 has become an established method for identifying acute pain in mice and has been repeatedly used in animal experiments.3,10,11 In the original publication of Langford et al., 9 the MGS pictures for analysis were cropped from pre-recorded videos, selected and then scored manually.
In a recent study, our group improved the MGS set-up by video recording up to four animals simultaneously. Additionally, a tool for automated image selection for blinded MGS analysis was introduced (Ernst et al.).12 Automation and standardization of technical processes are necessary to minimize subjective influences and avoid selection and performance biases. The aim of the present study was to investigate the application of the modified and improved MGS set-up and the robustness of the automated process in a multi-laboratory analysis. The application and conformity of the set-up and image selection tool were also assessed.
Materials and methods
Study design
To confirm the applicability of the improved MGS set-up and automated image selection tool (Ernst et al., under review), animal research studies at four experimental facilities (F1–F4) at different locations employed this approach. Most of the studies were part of assessments that evaluated the severity of procedures using animals for scientific purposes. The recently introduced modified MGS set-up was analogously implemented in all the studies.
At F1, the MGS set-up was implemented in a refinement study on the possible benefits of infiltration with local anaesthetics (lidocaine–bupivacaine) in combination with systemic paracetamol administration via drinking water compared with systemic analgesia only after surgery. Male and female C57Bl/6J mice were subjected to a minor laparotomy and treated with a combination of local and systemic analgesia, local or systemic analgesia alone, or anaesthesia and systemic analgesia only. Among other behavioural tests, the MGS was used to assess changes in animal well-being and potential pain severity. Baseline measurements were taken at the same time points during the day as postoperative measurements: 1, 6 and 24 h after surgery.
At F2, a project on pain severity assessment after inducing liver fibrosis was conducted. Fibrosis was induced using CCl4 dissolved in germ oil. Male C57Bl/6N mice were intraperitoneally injected (50 µl) either with 0.6 ml/kg CCl4 in mixed germ oil or germ oil only (control) three times per week for 4 weeks. MGS scoring was performed 1 h before and after injections. Baseline measurements were taken before starting the induction of liver fibrosis.
At F3, a project on the effects of intraperitoneal transmitter implantation or a corresponding SHAM operation on different clinical and behavioural parameters in female C57Bl/6J mice was conducted. To differentiate between the effects of the surgical procedure itself and the transmitter, SHAM-operated mice were monitored as a control group. MGS scoring was performed at 30 and 180 mins after surgery on the same day and on days 1, 2, 3, 5 and 7.
At F4, male and female mice, wild-types for A1783V mutation and with or without the presence of Cre were used for animal experiments. 13 All mice underwent surgery during which a telemetry device (HD-X02, DSI, St Paul, USA) was subcutaneously implanted and an electrode was implanted into the hippocampus. The electrode leads were fixed with three screws in the skull and covered with paladur (Heraeus®, Hanau, Germany). Baseline measurements were taken 1 day prior to surgery after a habituation phase of 10 mins. Video recordings were taken 1 h prior to surgery and at 1, 3, 6, 25 and 49 h following gain of consciousness after surgery.
All the studies were conducted in accordance with the legal requirements, and anaesthesia and analgesia were provided where appropriate. Humane endpoint protocols were applied in all the studies. Additional details concerning procedures or surgeries can be found in the supplementary material.
Ethical statement
MGS evaluations were a preliminary part of the experiments. According to the 3R principles, 14 no additional animals were used to perform this study. All the studies were conducted in accordance with EU Directive (2010/63 EU) and the legal provisions of the German Animal Welfare Act (TierSchG). 15
The Cantonal Veterinary Office, Zurich, Switzerland, approved the animal housing and experimental procedures for the project at F1 under licence number 097/2017. For the project at F2, the licence was granted by the Governmental Animal Care and Use Committee (Landesamt für Natur, Umwelt und Verbraucherschutz, LANUV AZ: 84-02.04.2014.A417, North Rhine Westphalia, Germany). All experiments performed at F3 were approved by the Niedersächsisches Landesamt für Landwirtschaft und Lebensmittelsicherheit (LAVES) under licence number 15/1905. For the project at F4, all investigations were approved by the Government of Upper Bavaria (licence number 55.2-1-54-2532-168-2016).
Animals
Two facilities (F1 and F3) used C57Bl/6J mice, one facility (F2) used C57Bl/6N mice, and one facility (F4) used transgenic mice with a C57Bl/6 background selected according to the researchers' interest in their main study. The choice of sex, age and strain was independently made and was a nonexclusive part of the present study. Only a black coat colour and the use of adult animals were specified as relevant to the present study. Additional information concerning housing and husbandry conditions according to the ARRIVE guidelines 16 can be found in the supplementary material.
MGS set-up
With the modified and improved MGS video recording set-up, four animals could be simultaneously filmed under standardized conditions. To maximize the quality of MGS pictures, four equally sized MGS boxes (9 cm × 5 cm × 5 cm), which were placed in an observation rack located within a light tent, were illuminated from the side, bottom and front. Additional air holes were drilled into the front as well as into the lid of the boxes to reduce fogging. At F1, the recording time was set at approximately 5 mins, and at F2, F3 and F4, the recording time was set at 10 mins.
For automated analysis, box positions in the videos were manually defined and 300 images from each box were automatically extracted. Subsequently, the algorithm analysed the extracted box images using a fully convolutional architecture 17 to detect the position and size of the animals' eyes. Eye areas in the images were automatically measured, and all the images in which the largest visible eye had an area of at least 100 pixels were considered suitable for MGS scoring (image size: approximately 500 × 500 pixels). Among these images, 10 images per animal were randomly selected by the algorithm for further manual scoring. For this purpose, we developed a tool that displays the images of all trials and animals in a randomized and blinded manner to minimize bias.
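The selection step described above can be sketched as follows. This is a minimal illustration and not the published tool: the `BoxImage` record, the function names and the seeded random generator are assumptions introduced here, and the eye detection itself (the fully convolutional network) is taken as given, with its output represented only as a list of eye areas in pixels.

```python
import random
from dataclasses import dataclass, field

MIN_EYE_AREA_PX = 100    # threshold stated in the text
IMAGES_PER_ANIMAL = 10   # images drawn per animal, as stated in the text


@dataclass
class BoxImage:
    """One extracted box image (hypothetical record layout)."""
    animal_id: str
    frame_index: int
    eye_areas_px: list = field(default_factory=list)  # empty if no eye was detected


def is_suitable(image):
    """An image qualifies when its largest visible eye covers at least 100 pixels."""
    return bool(image.eye_areas_px) and max(image.eye_areas_px) >= MIN_EYE_AREA_PX


def select_for_scoring(images, rng):
    """Randomly pick up to IMAGES_PER_ANIMAL suitable images for each animal."""
    by_animal = {}
    for image in images:
        if is_suitable(image):
            by_animal.setdefault(image.animal_id, []).append(image)
    return {
        animal_id: rng.sample(candidates, min(IMAGES_PER_ANIMAL, len(candidates)))
        for animal_id, candidates in by_animal.items()
    }


def blinded_order(selection, rng):
    """Pool all selected images and shuffle them so raters see a random order."""
    pooled = [img for imgs in selection.values() for img in imgs]
    rng.shuffle(pooled)
    return pooled
```

Passing an explicit `random.Random` instance keeps the draw reproducible for auditing while the presentation order stays blinded to the rater.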
Evaluation process
At each facility, 100 images each rated as positive or negative by the algorithm were selected for image quality evaluation; therefore, a total of 200 images per facility were evaluated. Images rated as positive were suitable for MGS scoring, whereas those rated as negative could not be used. Evaluation of image quality was performed by three raters at three facilities. All the raters had comparable experience in the performance and assessment of MGS images. The criteria for fulfilling individual evaluation points were discussed with all the participants in advance.
For manual selection, a maximum score of 54 points could be achieved by fulfilling all positive criteria. Six evaluation criteria (mouse in profile, eyes recognizable, ears recognizable, nose recognizable, mouse in steady position and general image quality) were assessed, each with a maximum of nine points. The quality gradations within each criterion were as follows: 1–3 = poor, 4–6 = moderate, and 7–9 = excellent. An image that failed to fulfil a criterion was given a score of −1 for that criterion. As a cut-off, images with a total score of ≤30 (approximately 56% of the maximum score) or with a score of −1 for any evaluation criterion were rated as negative. In this study, the focus of the algorithm was mainly on the detection of the eye.
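The rating rule can be expressed compactly in code. The sketch below assumes that each rater records one integer per criterion (1 to 9, or −1 for a criterion that is not fulfilled); the function and constant names are illustrative and do not come from the study.

```python
CRITERIA = (
    "mouse in profile",
    "eyes recognizable",
    "ears recognizable",
    "nose recognizable",
    "mouse in steady position",
    "general image quality",
)
MAX_PER_CRITERION = 9
MAX_TOTAL = MAX_PER_CRITERION * len(CRITERIA)  # 6 criteria x 9 points = 54
CUTOFF = 30                                    # totals <= 30 are rated negative
REJECTED = -1                                  # marks a criterion that is not fulfilled


def rate_image(scores):
    """Return 'positive' or 'negative' for one image scored by one rater."""
    missing = set(CRITERIA) - set(scores)
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    for name in CRITERIA:
        value = scores[name]
        if value != REJECTED and not 1 <= value <= MAX_PER_CRITERION:
            raise ValueError(f"invalid score {value!r} for {name!r}")
    # A single rejected criterion makes the whole image negative.
    if any(scores[name] == REJECTED for name in CRITERIA):
        return "negative"
    total = sum(scores[name] for name in CRITERIA)
    return "negative" if total <= CUTOFF else "positive"
```

Under this rule an image scored 5 on every criterion totals exactly 30 and is therefore rated negative, whereas a total of 31 is rated positive.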
Statistical analysis
GraphPad Prism (version 7; La Jolla, California, USA; www.graphpad.com) and R software (version 3.4.1) 18 were used for data analysis. The intraclass correlation coefficient (ICC), which is an estimate of inter-rater reliability, was calculated using the ICC function from the interrater reliability (irr) library 18 using a two-way ANOVA model to assess agreement. Fisher's exact test was used to analyse specificity and sensitivity. To identify reasons for false positive assignments by the algorithm, data were analysed using one-way ANOVA and Tukey's multiple comparison test. Differences were considered statistically significant at p < 0.05.
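For readers who want to see what these statistics compute, the following Python sketch reimplements them from first principles. It is not the authors' analysis code, which used R and GraphPad Prism. Two assumptions are made here: the ICC is the single-rater, absolute-agreement form of the two-way model (the text does not state whether single or average units were used), and the rater decision is treated as the reference standard in the 2 × 2 table.

```python
from math import comb


def icc_agreement_single(ratings):
    """ICC(A,1) for an n-images x k-raters table (two-way model, absolute agreement)."""
    n, k = len(ratings), len(ratings[0])
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a complete table with >= 2 images and >= 2 raters")
    grand = sum(map(sum, ratings)) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ms_rows = ss_rows / (n - 1)   # between-image mean square
    ms_cols = ss_cols / (k - 1)   # between-rater mean square
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)


def sensitivity_specificity(tp, fp, fn, tn):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2 x 2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x counts in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    low, high = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum all tables that are at most as likely as the observed one.
    tail = sum(p for p in map(prob, range(low, high + 1)) if p <= p_obs * (1 + 1e-7))
    return min(1.0, tail)
```

With the table laid out as `[[tp, fp], [fn, tn]]` (rows: algorithm positive or negative; columns: rater positive or negative), the same four counts feed both `sensitivity_specificity` and `fisher_exact_two_sided`. The ICC function is undefined for a table in which every rating is identical, because all mean squares are then zero.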
Results
The number of scorable images at the facilities and their distribution are presented in Figure 1. For each facility, 100 positive and negative images each were selected by the algorithm and evaluated by raters for image quality. Examples of such images from each facility are shown in Figure 1. Regarding facility and location: at F1, an average of 97 images (standard deviation (SD) = 17.08, n = 3, n corresponds to the number of raters (one rater per facility)) were suitable for MGS evaluation; at F2, an average of 86.25 images (SD = 9.52, n = 3) were suitable; and at F3, an average of 96 images (SD = 27.67, n = 3) were suitable. Because of deviations in performance from the initial set-up protocol in terms of colour and illumination, videos from F4 could not be included for image-quality evaluation.
Figure 1. Examples of images from all facilities: F1, top left; F2, top right; F3, bottom left; and F4, bottom right. The number of scorable images at each facility is presented. The data within each facility show a bimodal distribution. The cut-off value is indicated by a dashed line. The data distribution for F4 cannot be displayed because differences in colour and brightness and the presence of a head implant meant that the algorithm could not generate adequate images without further adjustment.
Results of the analysis of intraclass correlation coefficient (ICC) values and their 95% confidence intervals as determined using a two-way ANOVA. r0 is a specification of the null hypothesis (H0: r = r0). H1: r > r0 denotes that a one-sided F-test was performed.
Specificity and sensitivity of the algorithm.
Among the 600 rated images, an average of 88.67 (SD = 30.16) images assessed by the algorithm were evaluated as false positives compared with those assessed by the raters. A total of 44 similar images were rated as false positives by the raters compared with the algorithm. For images evaluated as false positives by the algorithm depending on the different evaluation criteria, the number of rejected images is presented in Figure 2. The results showed significant differences according to the ‘nose recognizable’ and ‘mouse in steady position’ criteria.
Figure 2. The number of rejected images per criterion with median and 95% confidence interval calculated using a one-way ANOVA: F(6,14) = 18.67, *Tukey's multiple comparisons test adjusted p value < 0.05.
Discussion
The present study evaluated the applicability and robustness of the modified MGS set-up and the algorithm for image selection. Figure 1 shows that the standardized set-up is applicable and reproducible and that a sufficient number of images were deemed suitable for analysis at three facilities in different locations. Reproducibility is an important attribute in the performance of animal experiments. 20 Vasilevsky et al. have demonstrated that the applicability of scientific methods depends on the ability to reproduce other studies and build on previous work, and they noted that a lack of methodological detail considerably reduces this reproducibility. 21 This conclusion is supported by the findings of our study, in that deviations from the standardized protocol (e.g. colour and illumination level) at one facility made it impossible for the algorithm to evaluate images without being adapted to this particular set-up.
Miller and Leach described deviations in MGS values between different mouse strains and sexes. 22 To investigate the selection criteria for MGS images, we deliberately included animals of different strains with a C57Bl/6 background and of both sexes. In the selection of the pictures, we found no evidence of sex-specific selection criteria for MGS. The application of the algorithm yields, in principle, a positive result, with a sensitivity of 77% and a specificity of 64%. The reduction in false positives supports the use of this algorithm. With regard to specificity, the ‘nose recognizable’ and ‘mouse in steady position’ criteria are decisive for false positive results (Figure 2). Recognition of the ‘nose recognizable’ criterion appears to be less reliable. This has also been reported by other studies and may not be algorithm-dependent alone. 23
Conclusion
The present study demonstrated the applicability of the improved MGS set-up and the functionality of the associated algorithm at three facilities in different locations. Limitations in the specificity of the algorithm, especially those due to the lack of detection of moving animals, are currently being addressed by improving the algorithm, and this should result in a reduction in the number of images rated as false positives in future studies (Kopaczka et al., 24 submitted to the 2019 Annual International Conference of EMBC).
Supplemental Material
Supplemental material, LAN881664 Supplemetal Material for Semi-automated generation of pictures for the Mouse Grimace Scale: A multi-laboratory analysis (Part 2) by Lisa Ernst, Marcin Kopaczka, Mareike Schulz, Steven R Talbot, Birgitta Struve, Christine Häger, André Bleich, Mattea Durst, Paulin Jirkof, Margarete Arras, Roelof Maarten van Dijk, Nina Miljanovic, Heidrun Potschka, Dorit Merhof and Rene H Tolba in Laboratory Animals
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The study was funded in part by German Research Foundation (Deutsche Forschungsgemeinschaft - DFG) FOR 2591 Consortium and ME3737/18-1. Reference numbers are listed in the supplementary material.
