Abstract
An important factor for successful translational stroke research is study quality. Low-quality studies are at risk of biased results and effect overestimation, as has been intensely discussed for small animal stroke research. However, little is known about the methodological rigor and quality of large animal stroke models, which are becoming more frequently used in the field. Based on a systematic search of two databases, this review surveys and analyzes the methodological quality of large animal stroke research. Quality analysis was based on the Stroke Therapy Academic Industry Roundtable and the Animals in Research: Reporting In Vivo Experiments guidelines. Our analysis revealed that large animal models are utilized with similar shortcomings as small animal models. Moreover, the translational benefits of large animal models may be limited due to lacking implementation of important quality criteria such as randomization, allocation concealment, and blinded assessment of outcome. On the other hand, an increase in study quality over time and a positive correlation between study quality and journal impact factor were identified. Based on the obtained findings, we derive recommendations for optimal study planning, conduct, and data analysis/reporting when using large animal stroke models to fully benefit from the translational advantages offered by these models.
Introduction
Acute ischemic stroke management and care have profoundly improved with the introduction of intravenous thrombolysis and, recently, mechanical thrombectomy for large vessel occlusions. 1 However, far from all patients can benefit from this therapeutic progress due to numerous contraindications, restricted availability, and narrow therapeutic time windows of these approaches. This causes a tremendous need for novel treatment options, but the translation of preclinical findings into clinically applicable and efficient therapies has so far been mostly ineffective and prone to failure. 2
Critical assessment of rodent studies revealed that one important reason for the translational failure is the lack of methodological quality in these preclinical studies, causing a higher risk of poor internal validity, overestimation of effect sizes, and biased conclusions, thus affecting the rationale and design of subsequent clinical trials.3–5
Large animal models are becoming more frequently used in preclinical stroke research since they are believed to provide a number of significant advantages in the translational process.6,7 On the other hand, large animal stroke models are both more laborious and more expensive to utilize than rodent models. Budgetary limitations often restrict sample sizes in large animal experiments, which limits statistical power. 8 Hence, to take full advantage of the translational value of large animal stroke models, it is essential to conduct large animal experiments with the highest methodological rigor and to predefine precise endpoints that can be assessed with sufficient statistical power.
Little is known about the methodological rigor and quality of large animal stroke experiments. We performed a systematic review and quality assessment of studies using large animal stroke models. Our quality analysis was based on the Stroke Therapy Academic Industry Roundtable (STAIR)9,10 and Animals in Research: Reporting In Vivo Experiments (ARRIVE) guidelines. 11 Based on the obtained results, we also provide suggestions for methodological improvements in large animal stroke research.
Material and methods
Study selection
The literature search was performed by the first author (LK). LK was supported by EM, a professional librarian with extensive experience in systematic literature research, who helped design the search strategy. The two last authors (SM and JB) were consulted by LK in case of any doubts or questions when extracting information from the literature. Intra-assessor reproducibility was not assessed.
Search strategy
We conducted a systematic search for preclinical large animal experiments in stroke using the databases Medline via Ovid from Wolters Kluwer and Science Citation Index Expanded via Web of Science from Clarivate Analytics.
The initial search was conducted on 26 September 2017, and an update was performed on 9 August 2019. Database entries between 1 January 1990 and 8 August 2019 were covered.
Search terms were “large animal” (including any relevant species, e.g. dogs, cats, pigs, rabbits, non-human primates, sheep, goats, etc.) and “ischemic stroke” (involving for instance “brain ischemia” OR “ischemic neuronal injury” OR “thromboembolic stroke” OR “cerebrovascular disorders”). In the search strategies, we combined the aspects

Overview of quantitative search results and frequency of large animal experiments in stroke research since 1990. (a) Flow diagram of publication identification.
Inclusion and exclusion criteria
We included preclinical large animal studies conducted and published between 1990 and 2019 that report investigations of therapeutic and/or diagnostic procedures for ischemic stroke. The studies needed to compare at least two groups, i.e. one in which a new procedure (therapeutic or diagnostic) is tested by comparing it to a second group being subjected to a standard or reference procedure (control group). Only studies in English were included.
We excluded studies focusing on diseases other than ischemic stroke, studies using small animal (e.g. rodent) models, clinical trials, in vitro studies, reviews, and meta-analyses. Purely descriptive studies only reporting a method or procedure, or non-controlled experiments (e.g. case series), were also excluded.
Data extraction
Basic study characteristics and impact factor
First, study meta-data were extracted. These included information on species, type of intervention, year of publication and region of origin (North America, Europe, Asia and Oceania), aim of evaluation (e.g. safety, feasibility), the stroke model used, study duration and information on investigation of dose–response relationship (if applicable), compliance with animal welfare regulations, subject health condition prior to enrolment, animal housing conditions, and additional veterinary care.
Second, we documented the impact factor (IF) of the journal in which the study results were published, measured in the year of publication. IFs were identified via the annual Thomson Reuters Journal Impact Factor report. Where the IF could not be retrieved for the required year, we contacted the respective journal and asked them to provide the IF for the particular year(s).
Group sizes
We further extracted the number of subjects in experimental groups for each species. Group sizes were obtained for control and the diagnostic or therapeutic procedure group(s).
Analysis
Assessment of reporting quality
We designed a scale applicable to both diagnostic and therapeutic procedures to assess study quality (Table 1). The quality score includes central STAIR and ARRIVE criteria, supplemented by additional quality items. The score comprised four categories of six items each. Category 1 addressed reporting of study subject details and welfare, category 2 covered reporting of study design details, category 3 addressed internal study validity, and category 4 assessed the quality of outcome analysis and reporting. Each study was assigned a score from 0 (lowest quality) to 24 (highest quality), with each category contributing 0 (lowest quality) to 6 (highest quality) points.
Quality score items.
aAnalysis modalities were considered appropriate when sufficient to assess the respective research question or endpoint (see Supplementary Table 3 for details).
bConclusion was considered justified when supported by correctly analyzed results.
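The four-category, 24-point scoring arithmetic described above can be sketched as a simple tally. The item names below are illustrative placeholders loosely based on the Results section, not the exact wording of Table 1:

```python
# Illustrative tally of the 0-24 quality score: four categories of six
# binary items each. Item names are placeholders loosely based on the
# Results section, not the actual Table 1 items.
CATEGORIES = {
    "subject_details_and_welfare": [
        "species", "welfare_approval", "sex_and_age",
        "health_status", "medication", "comorbidities"],
    "study_design": [
        "hypothesis", "primary_endpoint", "sample_size_calculation",
        "inclusion_exclusion_criteria", "dose_response", "rationale"],
    "internal_validity": [
        "randomization", "allocation_concealment", "blinded_assessment",
        "physiological_monitoring", "infarct_verification",
        "appropriate_modalities"],
    "outcome_analysis": [
        "adequate_data_presentation", "individual_data_points", "dropouts",
        "statistics", "limitations", "justified_conclusion"],
}

def score_study(reported_items):
    """Return (per-category scores, total score) for a set of reported items."""
    per_category = {
        category: sum(item in reported_items for item in items)
        for category, items in CATEGORIES.items()
    }
    return per_category, sum(per_category.values())
```

A study reporting all 24 items would score the maximum of 24; each category is capped at 6 by construction, matching the scoring scheme above.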
Additional aspects influencing study quality
We further investigated whether study quality improved after the implementation of the STAIR guidelines in 1999, and their update in 2009.9,10 We also analyzed differences in quality with respect to species, region of study origin, and type of investigation (i.e. assessment of neuroprotectives, thrombolytics, cell therapies, diagnostics, and others). Furthermore, we evaluated possible associations between the quality score and IF.
Group sizes
Where a study reported more than one procedure group, they were all counted individually (maximum number was
Statistics
All statistical analyses were performed using GraphPad PRISM 5 Software. Statistical significance was determined as
Results
Data set and year of publication
Initial and update searches identified a total of 10,282 manuscripts, which were reduced to 8093 after elimination of duplicates (Figure 1a; a list of all included studies can be found in the supplementary material). A total of 208 studies were included in the final analysis after screening abstracts and full texts according to the preset inclusion and exclusion criteria (Figure 1a). Basic study characteristics are shown in Table 2.
Basic characteristics of included animal experimental studies.
aThese included hypothermia (
be.g. feeding, light/dark cycle, single or grouped housing.
Analysis of publication output per year revealed that the number of large animal experiments published from 1990 to 2014 generally decreased from
Study quality
The overall median quality score was 11 (range: 3–22; IQR: 4 (9–13)) out of 24. The median quality score in the first category (reporting of study subject details and welfare) was 2 out of 6 (range: 1–5; IQR: 1 (1–2)). The second category (study planning quality) also reached a median quality score of 2 (range: 1–6; IQR: 1 (2–3)). The third category (study conductance quality) had a median score of 3 (range: 0–6; IQR: 2 (2–4)). Category 4 (result reporting and analysis quality) had a median quality score of 4 (range: 0–6; IQR: 1 (2–4)). A significantly lower number of quality criteria were fulfilled in category 1 in comparison to the others (
Study subject details and welfare (category 1)
All studies reported the species used, but only 146 studies (70.2%) reported approval by the responsible animal welfare authorities. Both sex and age were reported by 31 studies (15.0%). Sex alone was reported by 153 (73.6%), while age was never reported alone. The pre-study subject health status was reported by only 12 studies (5.8%). Medication details, including the use of companion medication (e.g. analgesics, antibiotics), were reported in only 20 studies (9.6%). Comorbidities were not reported by any study.
Study planning (category 2)
Working hypotheses were reported in 207 (99.5%) studies. However, primary study endpoints were explicitly defined in only 10 studies (4.8%); 135 (64.6%) studies reported that the study rationale was based on earlier small animal (
Study conductance (category 3)
Randomization was reported in 116 studies (55.8%), and allocation concealment was reported in 59 cases (28.4%). One hundred four studies (50.0%) reported blinded outcome assessment. Measurement of physiological parameters was reported in 165 cases (79.3%). The most frequently monitored parameters included mean arterial pressure (systemic), temperature, blood gases, blood pH, and exhalation gases. One hundred eighty-six studies reported appropriate outcome analysis modalities (89.4%; information on inappropriate analysis modalities is provided in Supplementary Table 3). These included survival rate (
Result reporting and analysis (category 4)
One hundred sixty-eight studies (80.8%) adequately reported relevant data and findings in the form of detailed tables or graphs. However, data were almost exclusively reported as means or medians. Individual data points were only provided by 16 studies (7.7%). Dropouts and excluded subjects were reported in 105 studies (50.5%). Application of appropriate statistical tests was reported in 192 studies (92.3%). Sixteen studies reported statistical analysis incompletely, for example lacking information on the statistical tests applied, including post hoc tests; 91 studies (43.8%) described potential sources of error and bias in the experiment, while 115 (55.3%) reported limitations such as small sample size or that randomization was impossible to perform. A conclusion fully justified by study findings was given by most, but not all, reports (
Additional influences on study quality
Study quality versus origin, species, and type of intervention
Total median quality score was highest in studies from North America (median: 12; IQR: 10–14), statistically different from studies conducted in Asia and Oceania (median: 10; IQR: 8.75–12) or Europe (median: 10; IQR: 8–11.75;

Influence of study origin and STAIR criteria publication on study quality. (a) Total quality score, (b) category 1: reporting of study subject and animal welfare, (c) category 2: study planning quality (North America vs. Europe
Study quality in the post-STAIR era
Methodological quality significantly improved after introduction of the STAIR guidelines in 1999 (1990–1999 pre-STAIR median: 10, IQR: 8–12; post-STAIR median: 12, IQR: 9–15;
Improvements were particularly evident in categories 1 and 4. In category 1, quality scores were lower in pre-STAIR studies (1990–1999; median: 1; IQR: 1–2) as compared to studies published after the first publication of STAIR guidelines and prior to the 2009 update (2000–2009; median: 2; IQR: 1.25–2) and to studies published after the 2009 update (2010–2019; median: 2; IQR: 2–3;
Study quality versus IF
The IF was available for 172 studies (82.7%). For the remaining studies, we could not retrieve the IF, or no IF had yet been assigned to the particular journal in the year of publication (

Association between total quality score and impact factor. Scatterplot shows the correlation between quality score and IF (
Group sizes
Average group sizes across species are given in Table 3. Analysis of group sizes revealed that total (combined control and procedure) group size was largest in rabbits as compared to pigs (
Median experimental group sizes across large animal species.
C: control group; P: procedure group(s); T: total (combined) groups.
Notes: Ranges (min.–max.) are given in brackets.

Group sizes across species. (a) Total group sizes were largest in rabbits as compared to pigs (
Discussion
Systematic bias may cause over- or underestimation of study results. 3 Quality items such as randomization, allocation concealment, and blinded assessment improve internal validity, 14 but are often neglected in small animal studies.3,5,15
Large animal models are believed to offer significant benefits for translational stroke research. There is higher anatomical similarity to the human brain 16 and to the human cerebrovascular system.6,7,17 Another benefit is the option to use these models in experiments closely mimicking a human clinical situation, applying the same medical techniques and equipment for diagnostic and therapeutic interventions that would be used in human patients.7,18 Moreover, physiological characteristics of large animal models, including heart and respiratory frequency, blood pressure, as well as pharmacodynamic and pharmacokinetic profiles, are similar to those of humans.19,20 However, alongside these advantages, large animal studies require much greater effort and resources. It is therefore important that quality in large animal studies is as high as possible to efficiently utilize the advantages large animal models offer for translational research.
Overall, we found that methodological quality in large animal stroke studies was mediocre. Although quality generally improved significantly over the last decades, potentially due to the 1999 publication and 2009 update of the STAIR criteria, our analysis revealed some important shortcomings. Improvements are needed in reporting study subject details and welfare (quality score category 1). Aspects such as sex and age, pre-study health conditions, and medications should be reported routinely for optimal study transparency, reproducibility, and transferability of study results. 9 The lack of comorbid large animal models is not surprising. Comorbidities are difficult to simulate in outbred large animal models as, mirroring the human situation, they arise from age, distress, malnutrition, and other factors, and can take significant time to develop in large animals. Research on models exhibiting comorbidities may remain a domain of small animal research. Nevertheless, any spontaneously occurring comorbidity diagnosed in large animals used for research should be reported.
Working hypotheses were reported in almost all studies (99.5%), but often without any obvious influence on study design. For instance, only 4.8% of the studies defined and reported primary endpoints, while analysis of the expected effect size and a priori sample size calculation were performed in only a few cases (13.0%). This may severely limit the translational benefits of large animal models since study results may be hard to interpret due to potentially poor statistical power. Given the significant resources required to perform large animal studies, considering these aspects is essential. On the other hand, determination of effect size can be challenging when previous research data are lacking or not entirely applicable. In these cases, we recommend performing large animal pilot studies that may help to assess basic characteristics of the respective model, such as variability of infarct size and its impact on the envisioned primary endpoint.
While half of the studies reported inclusion and exclusion criteria (50.0%), almost none (1.0%) applied them a priori. Defining inclusion and exclusion criteria during or after a study is believed to be a major source of bias, particularly when the study is conducted in a non-blinded fashion. Hence, such bias unfortunately cannot be excluded for most studies we analyzed.
Important quality aspects such as randomization (55.8%), allocation concealment (28.4%), and blinded assessment of outcome (50.0%) were more frequently reported in large animal studies than in small animal stroke experiments (randomization: 33.3%; blinded assessment of outcome: 44.4%, 15 allocation concealment: 25.9%; randomization, allocation concealment, and blinded assessment of outcome combined: 24.1%). 21 Nevertheless, the number of studies not reporting these is still remarkably high, in particular since blinding and randomization should be minimum standard quality assurance procedures in confirmative stroke research, 22 to which almost all large animal studies aim to contribute.
Imaging techniques such as magnetic resonance imaging, computed tomography, and angiography (43.5%) as well as physiological monitoring (80.4%) were utilized relatively frequently. This is a positive aspect, since large animals are particularly suitable for clinical imaging techniques, while thorough physiological monitoring generates meaningful information that may warrant subject in- or exclusion. However, verification of infarct induction (reported in only 48.1%) as well as of infarct size should be conducted thoroughly and routinely to avoid increasing inter-subject/-study/-group variability, which further reduces the statistical power of an experiment. Parameters such as cerebral blood flow reduction for verification of infarct induction were documented by only 7.2% of studies. This is surprising, since these parameters are relatively easy to determine in large animals, while clinical imaging techniques may be used to confirm the induced lesion directly. 20
Large animals are suitable for long-term studies including functional endpoint assessment. However, we found that only a relatively low percentage (6.7%) of studies were conducted for more than one month, the minimum follow-up period recommended by the STAIR guidelines for functional endpoints. Next to costs, this may be due to the selection of other primary endpoints, such as safety or efficacy of recanalization methods, which can be assessed more rapidly. However, experimenters who wish to assess behavioral endpoints should take into consideration that functional consequences of stroke in large animals can be more heterogeneous than in rodent models, and may develop over longer time spans. 23
We recognized significant improvements in methodological quality since the publication of the first STAIR guidelines in 1999, and in particular after the STAIR guideline update in 2009. Similar improvements were reported for small animal stroke studies from 2010 to 2013. 24 These findings indicate the positive impact of specific good research practice guidelines, which should be advanced continuously, as evidenced by the recent 2019 STAIR guideline updates. 25 In contrast to previous findings in small animal studies, 24 we also identified a positive association (
Group sizes were significantly larger in rabbits as compared to other species. This is not surprising, as rabbits are the smallest and cheapest of all large animal species, which allows for larger group sizes. Importantly, group sizes in primates were generally not different from those of other species. This does not mean that group sizes were sufficient for each research question, but it shows that the costs related to primate experiments did not prevent group sizes comparable to those of the other large animal species, rabbits excepted.
Our study has a number of limitations. We applied a predefined search strategy and protocol developed together with an expert in literature meta-analyses (EM) and experts in stroke research (JB, SM). However, the search strategy and protocol were not registered ex ante. Data extraction was not done in duplicate, but senior experts were consulted in all doubtful cases. Intra-assessor reproducibility was not assessed. Moreover, we did not discriminate between studies focusing on therapeutic and diagnostic procedures. Large animal models provide a number of benefits over rodent models for diagnostic studies due to the larger brain size, in particular when clinical imaging is used. 30 However, those studies are often exploratory in nature. Since quality demands are different (and somewhat lower) than in confirmative studies, those imaging-related studies would score nominally worse but can still contribute invaluably to their respective field. 31 Finally, we did not include a number of insightful imaging studies because they did not conduct a formal inter-group comparison.32–35
Conclusions and recommendations
Although large animal models offer a number of clear advantages for translational stroke research, we found that they have similar shortcomings to small animal models, limiting this benefit. Therefore, we derived a number of recommendations that address these limitations but are, at the same time, relatively easy to implement.
Study planning and preparation
Large animal stroke studies are mostly confirmative studies. Therefore, study planning should be based on the high quality standards applied to randomized controlled clinical trials (RCTs) whenever possible. Key elements of RCT planning and design, such as a priori sample size calculation and endpoint definition, should be adopted. 22 We encourage involving statisticians early in the planning process to optimize study design. 26 Study planning can also be supported by specific software tools. For instance, the National Centre for the Replacement, Refinement and Reduction of Animals in Research provides a free online tool called the Experimental Design Assistant (https://eda.nc3rs.org.uk), which was built to guide researchers through study planning. 27 Since optimal sample sizes may not be achievable for all endpoints, it is important to clearly define the most appropriate primary study endpoint and to power the study for it. Collaboration between research teams, in the form of peer quality checks and validation of study design, can greatly increase the objectivity and validity of a study. 14 Inter-group collaboration and transfer of experience can also help in handling very complex models and/or experimental setups, reducing the inter-subject variability that negatively affects statistical power. Confirmative studies may be preregistered to maximize transparency. 36
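The a priori sample size calculation recommended above can be illustrated with the standard normal-approximation formula for a two-sample comparison. This is a minimal sketch, not a substitute for a proper power analysis; an exact t-based calculation (e.g. with dedicated power-analysis software) yields slightly larger group sizes:

```python
import math
from statistics import NormalDist

def n_per_group(effect_size_d, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sample comparison,
    via the standard normal approximation:
    n = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size_d) ** 2)
```

For a large standardized effect (d = 1.0) at alpha = 0.05 and 80% power, this yields 16 animals per group; halving the effect size to d = 0.5 roughly quadruples the requirement, illustrating why realistic effect size estimates matter so much under tight budget constraints.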
Effect size estimation and pilot trials
Collecting valid information from previous research is essential for reliable effect size estimation. If such data are not available, pilot studies may be helpful for at least roughly estimating the variability of stroke impact and outcome in the model. When previous experience with a particular model is limited, variability is more likely to be higher and the effect size more likely to be lower in such pilot trials. This will contribute to more conservative study planning, since sample sizes calculated based on that information will be larger. An important side effect of pilot trials is experimenter training, which limits experimenter-caused endpoint variability (see below) in the main experiment. In addition, meta-analyses can help to collect relevant information on effect size or regarding a specific research question from related fields. 28
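When pilot data are available, a standardized effect size such as Cohen's d (difference in means over the pooled standard deviation) can feed the sample size calculation. A minimal sketch, with hypothetical pilot measurements:

```python
from math import sqrt
from statistics import mean, stdev

def cohens_d(group_a, group_b):
    """Cohen's d from two pilot groups: mean difference divided by the
    pooled (sample) standard deviation."""
    na, nb = len(group_a), len(group_b)
    sa, sb = stdev(group_a), stdev(group_b)
    pooled_sd = sqrt(((na - 1) * sa**2 + (nb - 1) * sb**2) / (na + nb - 2))
    return (mean(group_a) - mean(group_b)) / pooled_sd
```

High pilot variability inflates the pooled SD and so deflates d, which, as noted above, pushes the subsequent sample size calculation toward more conservative (larger) group sizes.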
Reducing the effect of sample size limitations and endpoint variability
Financial and logistical restrictions often impact sample and group sizes in large animal experiments. This is an understandable limitation that is difficult to overcome. Selection of a proper and relevant primary endpoint that can be adequately powered with respect to the addressed research question is therefore important to minimize the risk of low statistical power. Of note, some endpoints often used in studies assessing therapeutic interventions, including infarct size and functional deficits, exhibit higher variability in large animal models than in rodents. This makes comparison of absolute data more difficult. 23 Relative analysis of repeatedly assessed endpoints, i.e. in comparison to the individual initial infarct size and/or functional deficit, can efficiently compensate for such variability. Repeated assessments also allow calculating the area under the curve for particular endpoints. This may provide a benefit in statistical power to identify whether a real outcome benefit is present over time. However, this comes at the cost of temporal resolution: it cannot be concluded exactly when this benefit became evident. There is also preliminary evidence for fast and slow stroke progressors in large animals, indicating different collateral status and somewhat resembling the human situation, but further contributing to inter-subject variability. It is recommended to consider this when planning an acute stroke study. 29
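The relative analysis and area-under-the-curve approach described above can be sketched as follows; the time points, values, and baseline normalization are illustrative choices, not a prescription:

```python
# Illustrative relative-to-baseline endpoint analysis with trapezoidal AUC.
def relative_auc(times, values):
    """AUC (trapezoidal rule) of a repeatedly assessed endpoint expressed
    relative to its baseline (first) measurement, compensating for
    inter-subject variability in absolute values."""
    baseline = values[0]
    rel = [v / baseline for v in values]  # normalize to the initial value
    auc = 0.0
    for i in range(1, len(times)):
        auc += (rel[i] + rel[i - 1]) / 2.0 * (times[i] - times[i - 1])
    return auc
```

A subject whose infarct volume shrinks relative to baseline yields a smaller AUC than one whose volume stays constant, so group-wise AUC comparison captures a benefit over time. As noted above, this single summary number discards when the benefit emerged.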
In experiments of highly similar design, controls may be pooled. Of note, this counteracts randomization and therefore requires extremely thorough validation of comparability of control subjects from different experiments/sources. If comparability is thoroughly proven, this may help to increase statistical power, but the limitations of this approach and potentially resulting bias need to be discussed transparently and in detail when publishing results.
The possibility to repeatedly collect a broad spectrum of physiological data should be utilized where possible, as deviation from normal parameter ranges may explain variability and warrant post-hoc exclusion of subjects in single cases.
Study duration and documentation
We recommend considering long-term experiments whenever meaningful, possible, and compatible with animal welfare requirements. Even though long-term experiments involve greater effort, the amount of data collected for individual subjects may be much higher, providing a better overall picture of the assessed intervention. Documentation should be as transparent as possible; transparency is neither challenging nor laborious, but contributes significantly to scientific rigor, reproducibility, and unbiased interpretation of study results. Methodological limitations, including quality criteria omitted for good reason, should be clearly stated, as this allows better interpretation of positive, neutral, and negative study results.
Supplemental Material
Supplemental material, sj-pdf-1-jcb-10.1177_0271678X20931062 for Quality and validity of large animal experiments in stroke: A systematic review by Leona Kringe, Emily S Sena, Edith Motschall, Zsanett Bahor, Qianying Wang, Andrea M Herrmann, Christoph Mülling, Stephan Meckel and Johannes Boltze in Journal of Cerebral Blood Flow & Metabolism
Footnotes
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: ESS is supported by the Stroke Association (SA L-SNC 18\1003).
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Supplemental material
Supplemental material for this article is available online.
References
