Principles for Valid Histopathologic Scoring in Research

Abstract

Histopathologic scoring is a tool by which semiquantitative data can be obtained from tissues. Initially, a thorough understanding of the experimental design, study objectives, and methods is required for the pathologist to appropriately examine tissues and develop lesion scoring approaches. Many principles go into the development of a scoring system such as tissue examination, lesion identification, scoring definitions, and consistency in interpretation. Masking (aka “blinding”) of the pathologist to experimental groups is often necessary to constrain bias, and multiple mechanisms are available. Development of a tissue scoring system requires appreciation of the attributes and limitations of the data (eg, nominal, ordinal, interval, and ratio data) to be evaluated. Incidence, ordinal, and rank methods of tissue scoring are demonstrated along with key principles for statistical analyses and reporting. Validation of a scoring system occurs through 2 principal measures: (1) validation of repeatability and (2) validation of tissue pathobiology. Understanding key principles of tissue scoring can help in the development and/or optimization of scoring systems so as to consistently yield meaningful and valid scoring data.

Keywords

grading, histopathology, lesions, ordinal, semiquantitative, scoring, validation

Through the course of investigation, research laboratories often submit tissues to histopathology cores for tissue processing and examination by a pathologist.^11,27,48 Pathologists provide morphologic assessment of these tissues, including examination for group-specific differences. Many times, there is a need for more rigorous evaluation of the tissue either to prove a group difference or to substantiate the observations of the initial examination.

Scoring (aka “grading”) is a tool that can be used to derive data from biologic systems (eg, tissues) for analysis and group comparisons. Scoring can be applied at different levels of tissue examination, including antemortem imaging techniques,^6,35,55 postmortem macroscopic examination,^18,36,69 and histopathologic examination.^17,39,46,70 Crissman and colleagues¹² suggested that a scoring system should exhibit 3 fundamental characteristics: (1) it should be definable, (2) it should be reproducible, and (3) it should produce meaningful results. This article reviews key principles for the development of scoring systems so that the pathologist has the best opportunity to meet these 3 characteristics. Importantly, these fundamental principles of scoring tissues are applicable to most organs, tissues, and model systems.

Methods

This article describes key principles for the development of semiquantitative scoring systems via histologic examination. Even so, these same concepts can be useful for the development of semiquantitative scoring systems in other research contexts such as commercial immunohistochemistry kits, serologic assays, or applications of specialized software packages. Notably, it is beyond the scope of this article to address principles for quantitative techniques and applications.

All experimental data in the figures and tables of this article were created to demonstrate important principles associated with scoring. Experimental data were constructed to replicate situations that are commonly encountered by comparative pathologists in academia, and emphasis was selectively placed on histopathology-based examples. Importantly, these examples of scoring methods were simplified in scope and complexity for ease of understanding the basic concepts. Representative analyses were made for each example scoring method, but these should not be taken as exclusive statistical options. All statistical analyses and graphs were made using Prism software (GraphPad Software, La Jolla, CA).

Perspective

Sound methodology in histopathologic scoring is important to detect biologic differences in treatment groups. Importantly, it does not compensate for poor experimental design or improperly sampled tissues that occur “upstream.” Many papers have been submitted to journals (but not necessarily published) in which the sampling and histopathologic scoring approaches were robust in nature, but the experimental designs were markedly flawed. In these cases, even when statistically significant data could be generated by the authors, they were without context and lacked validity for proper interpretation. A simple proverb states “junk in, junk out.” Experimental background should be sought out for projects where tissues are submitted for pathologist examination. Proper perspective begins early, and many objectives need to be considered. As described below, developing a sound experimental design, understanding the purpose of the study, and considering how best to sample the appropriate tissues are all important features of perspective.

Experimental Design

Experimental planning and design are necessary for the development of a sound scientific study, and understanding these methods is essential for context of proper data interpretation.^3,4,12,58,76 Species, strain, sex, age, appropriate controls, method/type of genetic manipulation, microbial status of colony, tissue handling, and treatments (type, dose, route, duration, etc) all play a role in the evaluation and eventual interpretation of the data. Ancillary data such as clinical chemistries, imaging, and/or clinical behavior can further give relevant insights for effective tissue evaluation. For example, if hepatocellular-specific enzymes were elevated in a treatment group, then targeted sampling and examination of the liver would be valuable.

Study Objectives

Understanding the study objectives is useful in effective tissue examination and development of a meaningful scoring system. For example, a murine study of Pseudomonas aeruginosa infection may demonstrate group differences in the extent of neutrophilic lung inflammation based on routine examination.⁴⁰ A scoring system may be readily applied to corroborate this observation, which would be sufficient for many studies. However, if the study’s objective was to determine if neutrophil transmigration into the lungs was defective, then a scoring system that focuses on neutrophil transmigration might be developed, if possible, to more meaningfully demonstrate this mechanistic change.

Tissue Sampling

Sampling of tissues can greatly influence the diagnostic or treatment-related results of a study.^5,7,37 For example, in some strains of mice, islet numbers can vary widely between pancreatic lobes,³² and therefore consistent tissue collection should be performed for optimal islet assessment. In academia, tissues are sometimes collected by the collaborator lab and stained slides submitted to the pathologist for examination. Awareness of the collection method as well as the level of consistency in sampling and sectioning helps to ensure that unintentional bias is prevented.⁸

Principles for Scoring

To determine an appropriate histologic scoring system for any tissue, key principles should be considered. Although this list is not exhaustive, these considerations will help to develop a useful scoring method.

Masking

An important goal for any experimental study is to constrain biases that can skew the final data and conclusions.⁶⁰ Bias can be introduced into any stage of the experimental project.^49,63 “Masking” (aka blinding) of the pathologist to experimental groups/treatments is a means of preventing bias from entering into the examination and scoring of tissues. Lack of masking can lead to unintentional observational bias that can often exaggerate treatment effects.^15,57 Different levels of masking for the pathologist can be implemented (Table 1), but consideration of the study goals as well as the limitations of the masking method need to be discussed before examination.

Table 1.

Common Methods of Masking Tissues for Histopathologic Examination.

Method	Description	Comments
Comprehensive	Individual samples are labeled without reference to treatment group (eg, 1, 2, 3, 4, 5, etc) and minimal background information (perspective) is given.	Pro: Bias is comprehensively constrained.
Comprehensive		Con: Pathologist labor may be increased in examination, while sensitivity to subtle study-specific lesions may decrease.¹²
Grouped	Samples are coded by groups (eg, A1, A2, …, A10; B1, B2, …, B10); relevant background material, including study design and objectives, is disclosed to pathologist.	Pro: Pathologist is masked to group treatments but is aware of tissue grouping and background information.
Grouped		Con: Overt group differences can functionally unmask the pathologist and if performing ordinal scoring may warrant comprehensive masking.
Postexamination masking	Full disclosure of experimental design and objectives with unmasked initial evaluation; masking and randomization of samples are done prior to scoring.	Pro: Offers full disclosure to the pathologist for examination and scoring development.
Postexamination masking		Con: Pathologists may recall group assignments of samples with small n/group, which makes masking ineffective.
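
The comprehensive approach in Table 1 can be illustrated with a short Python sketch. The function name, the sample labels, and the fixed seed are hypothetical and serve only to show the idea: samples are shuffled, given codes that carry no group information, and the key is held by someone other than the scoring pathologist.

```python
import random

def mask_samples(sample_ids, seed=None):
    """Return (codes, key): neutral codes for the pathologist and a key for unmasking."""
    rng = random.Random(seed)
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    # Codes are plain integers (1, 2, 3, ...) with no reference to treatment group.
    key = {code: original for code, original in enumerate(shuffled, start=1)}
    return sorted(key), key

# Hypothetical study: groups A and B with 5 animals each.
samples = [f"{group}{n}" for group in "AB" for n in range(1, 6)]
codes, key = mask_samples(samples, seed=42)

# The pathologist receives only `codes`; `key` is kept by a third party until scoring ends.
print(codes)
```

A fixed seed makes the assignment reproducible for record keeping; in practice the key would be stored away from the scoring records.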

Examination

A thorough examination of all tissues/slides provides a context for scoring tissue lesions. For example, a lesion common to all groups could be indicative of a “background” lesion, and scoring of this lesion parameter could be of little meaning to the study. But sometimes in the context of a research study, subtle differences in the frequency or severity of the “background” lesion may be indicative of a mechanistic change related to treatment and can be further assessed.⁵⁹ A review of the study objectives and the relevant literature may predict differences in a specific lesion parameter, which could then be examined and scored to provide context for the current model.

Lesion Parameters

What types of lesions can be studied by a scoring system? If lesions are identifiable in tissues, then these can often be incorporated into a scoring system (Table 2). Some lesions may be detectable in any tissue (eg, cellular inflammation), whereas other lesion parameters may be specific for the organ/tissue (eg, cholestasis in liver) being scored. Although it is not feasible to concisely review all lesion parameters for all tissues, numerous approaches to scoring for specific organs or models can often be found in a targeted literature search.

Table 2.

Examples of Tissues and Techniques in Which Histopathologic Scoring Has Been Reported.

Tissue or Technique	Lesion Parameters
Pancreas¹⁶	Cystic degeneration
	Fat necrosis
	Fibrosis
	Lymphoid inflammation
	Neutrophilic inflammation
Liver^39,72	Cell injury
	Nuclear and cytoplasmic features
	Inflammation
	Fibrosis
	Steatosis
Respiratory^23,33,44,47,51,55,64	Bronchitis/bronchiolitis
	Edema
	Epithelial thickening
	Epithelial degeneration/necrosis
	Fibrosis
	Interstitial pneumonia
	Lymphoid inflammation
	Metaplasia
	Neutrophilic inflammation
Spleen⁴⁶	Bacteria
	Necrosis
	Neutrophil influx
	Thrombosis
Orthopedic^9,25,53	Cartilage calcification
	Cartilage degeneration
	Fibrosis
	Osteoarthritis
	Osteophytes
	Subchondral bone damage
	Synovial hyperplasia
	Synovial inflammation
	Vascularity
Digestive tract^10,17,26	Enterocolitis
	Epithelial erosion
	Gut lumen contents
	Gastric neutrophils
	Gastritis
	Gastric metaplasia
	Hemorrhage
	Vascular congestion
	Villous fusion
Brain^29,42	Hypoxic injury
	Infarction
Immunohistochemistry^23,38	Staining distribution
In situ hybridization²⁸	Staining distribution

Scoring Definitions

Scoring systems often segregate samples into defined categories. It is useful to have clear language both characterizing and setting boundaries for each category.^59,68 Exclusive use of vague terms, such as mild, moderate, or severe, in ordinal scoring can reduce interobserver repeatability and may even compromise intraobserver repeatability over time. Whenever possible, specific terminology including the use of the percent of tissue affected can enhance the repeatability as well as sensitivity of the system.

Interpretation Consistency

“Diagnostic drift” is a situation in which the assignment of scores may vary slightly in consistency through the scoring process. This can happen in situations where there are a large number of samples, multiple pathologists examine subsets of tissues, slides are examined over a long period, or category characteristics/boundaries are poorly defined.¹³ In research settings, it is most useful to have one pathologist score the slides in a reasonable period of time, if applicable, to provide for additional consistency.^12,13 Of course, this approach is not always possible, and review (by the same or a secondary pathologist) at the conclusion of the study may be warranted especially for more arduous studies.

Examples of Scoring Approaches

Types of Data Measures

Many years ago, Stevens⁶⁶ wrote an article describing 4 key types of measurement scales used in research: nominal, ordinal, interval, and ratio (Table 3). Generally speaking, nominal and ordinal scales produce qualitative data, whereas interval and ratio scales produce quantitative data. Qualitative data approximate or characterize something, whereas quantitative data measure it. For instance, biologic data that are acquired from morphometry have a ratio scale with a true zero point and produce quantitative data; relevant examples include length (eg, acinus diameter) or area (eg, acinus area). In contrast, nominal and ordinal scales, which are commonly used in scoring systems, produce qualitative data, and thus any scoring is considered “semiquantitative” in nature. Understanding the types of data as well as their constraints helps in their analysis.

Table 3.

Types and Examples of Data Measurements in Research.

Types	Definition	Example(s)
Nominal	Samples assigned to a category without reference to severity gradations.	“Binary”—presence or absence of a lesion (+/–)
Nominal		“Categorical”—lesions assigned to a nonordered category (carcinoma, sarcoma)
Ordinal	Samples assigned to a category showing an ordered progression in severity	0—normal
		1—mild
		2—moderate
		3—severe
Interval	Samples quantified on a scale between 2 extremes and with an arbitrary zero value. Samples can be compared based on differences in value but not using multiplication or division.	Celsius scale of 0 to 100° based on freezing and boiling points of water.
Ratio	Samples quantified on a scale with a true zero value. Samples can be compared through differences or multiplication/division.	Most morphometry data (eg, length, area, etc) produce quantitative values.

Adapted from Stevens.⁶⁶

There are multiple approaches to scoring tissues, and common scoring methods for pathologists are highlighted below. For simplicity, these methods have been assigned to 3 general groups. Readers seeking additional information may find other resources useful.^13,31,59,75

Incidence Method

This approach records the case incidence of a lesion (ie, those affected) in an experimental cohort.^31,65 Similar types of scoring methods include binomial scoring (presence or absence of lesion) and percent affected. Lesions are defined by categories (ie, nominal data) and recorded in a contingency table. For example, the trachea can be examined for the presence or absence of inflammation in submucosal glands (Table 4). These nominal data can be reported as a contingency table (Table 4) or shown as a graph for publication (Fig. 1).

Table 4.

Scoring of Trachea Submucosal Glands for the Presence of Cellular Inflammation.^a

Group	Normal	Inflammation	% Inflammation
A	13	2	13.3
B	6	9	60.0

^aSections of trachea with submucosal glands from each animal in group A (n = 15) and group B (n = 15) were examined and designated as within normal limits or with cellular inflammation.

Figure 1.

Example of the incidence method. Example graph for reporting incidence data from Table 4. *P = .02, Fisher exact test.
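
As a rough check of the analysis reported for Table 4, the following Python sketch computes a two-sided Fisher exact P value from the hypergeometric distribution. The function name is hypothetical, and the two-sided rule used here (summing all tables no more probable than the observed one) is an assumption about the convention; statistical packages may differ in detail.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell with margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum every possible table that is no more probable than the observed table.
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_observed * (1 + 1e-9))

# Table 4: group A (13 normal, 2 inflamed), group B (6 normal, 9 inflamed).
p = fisher_exact_two_sided(13, 2, 6, 9)
print(f"P = {p:.2f}")
```

For the Table 4 counts, this rounds to the P = .02 shown in Figure 1.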

Ordinal Method

The ordinal method is commonly used by many pathologists for lesion scoring, and important principles for the method are discussed below.

This method assigns data into defined categorical groups that are arranged in an “ordered” progression in lesion severity.⁶⁶ For example, a scoring system can be based on the estimated percentage of the tracheal wall that is affected by a lesion; in this case, a score (0–4) may be assigned (Table 5). The most common approach to ordinal scoring is to assign a summary score for each animal based on the tissue examination. An example of this can be seen in Table 6, where tracheal inflammation and hyperplasia are scored.

Table 5.

Example of Ordinal Scores Based on Distribution of Tracheal Lesions.

Score	Trachea (% Wall Affected)
0	No change
1	≤25
2	26–50
3	51–75
4	76–100
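
Explicit category boundaries such as those in Table 5 can be written down as a small function, which makes the handling of boundary values unambiguous. This is only an illustrative sketch: the function name is hypothetical, and assigning boundary values (eg, exactly 25%) to the lower category is an assumption.

```python
def tracheal_score(percent_affected):
    """Map the estimated % of tracheal wall affected to an ordinal score (0-4)."""
    if not 0 <= percent_affected <= 100:
        raise ValueError("percent_affected must be between 0 and 100")
    if percent_affected == 0:
        return 0  # no change
    # Upper bounds for scores 1-3; boundary values fall in the lower category (assumption).
    for score, upper in enumerate((25, 50, 75), start=1):
        if percent_affected <= upper:
            return score
    return 4

# Hypothetical estimates for 5 animals.
estimates = [0, 10, 25, 40, 90]
print([tracheal_score(p) for p in estimates])
```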

Table 6.

Trachea Inflammation and Hyperplasia Scores From Treatment Groups A and B.^a

	Group A		Group B
Animal	Inflammation	Hyperplasia	Inflammation	Hyperplasia
1	1	1	3	2
2	0	0	2	1
3	1	0	3	2
4	1	1	2	1
5	0	0	1	1
6	2	1	1	1
7	1	0	2	2
8	1	1	2	1
9	1	0	2	1
10	1	0	1	0
Median	1	0	2^b	1^c

^aScoring was performed for each parameter based on Table 5.

^bGroup A vs B, P = .006, Mann-Whitney test.

^cGroup A vs B, P = .011, Mann-Whitney test.
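
The medians in Table 6 and a Mann-Whitney comparison can be sketched in Python using only the standard library. The P value below relies on a tie-corrected normal approximation with a continuity correction, which is an assumption; it is not necessarily the calculation performed by the software used for this article, so small differences from the footnoted values are possible.

```python
from collections import Counter
from math import erfc, sqrt
from statistics import median

def mann_whitney(x, y):
    """Return (U for y, two-sided P) using a tie-corrected normal approximation."""
    # U counts the (x, y) pairs in which the y score is higher; ties count one half.
    u = sum((b > a) + 0.5 * (b == a) for a in x for b in y)
    n1, n2 = len(x), len(y)
    n = n1 + n2
    ties = sum(t**3 - t for t in Counter(list(x) + list(y)).values())
    variance = n1 * n2 / 12 * ((n + 1) - ties / (n * (n - 1)))
    # Continuity correction of 0.5 applied to the distance from the expected U.
    z = max(abs(u - n1 * n2 / 2) - 0.5, 0) / sqrt(variance)
    return u, erfc(z / sqrt(2))

# Scores transcribed from Table 6.
infl_a = [1, 0, 1, 1, 0, 2, 1, 1, 1, 1]
infl_b = [3, 2, 3, 2, 1, 1, 2, 2, 2, 1]
hyp_a = [1, 0, 0, 1, 0, 1, 0, 1, 0, 0]
hyp_b = [2, 1, 2, 1, 1, 1, 2, 1, 1, 0]

u_infl, p_infl = mann_whitney(infl_a, infl_b)
u_hyp, p_hyp = mann_whitney(hyp_a, hyp_b)
print("Inflammation medians:", median(infl_a), median(infl_b), "P =", round(p_infl, 3))
print("Hyperplasia medians:", median(hyp_a), median(hyp_b), "P =", round(p_hyp, 3))
```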

Another method found in the literature is to examine several fields of tissue (eg, 10 random 400× fields) for each animal, score each field, and assign a mean (ie, average) score for the whole tissue of that animal. The problem with this approach is that the mean represents a measure of central tendency that is only appropriate for interval and ratio data. For ordinal data, the median is the most appropriate measure for central tendency. This statistical axiom is not without some controversy, and it is not within the scope of this article to resolve it.

Scoring approaches vary among pathologists. Many times, a tissue will have multiple lesions that can be scored. Depending on their approach to these situations, pathologists have been described as either “lumpers” or “splitters.”⁷⁵ Lumpers use multiple parameters or anatomic sites to define each ordinal level. For example, multiple separate renal lesions associated with acute tubular injury are grouped together to give a single scoring system (Table 7). On the other hand, splitters separate each parameter or anatomic site for scoring purposes. As opposed to the lumpers, splitters assign each specific renal lesion its own appropriate scoring system (Table 8). Lumper methods can be more efficient for the pathologist, saving labor and time when groups have overt differences; however, splitter methods are more sensitive to parameter-specific or sequential changes that may occur in a model and are also more repeatable.¹³

Table 7.

Example of a Scoring System That Combines Lesion Parameters to Define Each Category.

Score	Kidney Scoring for Acute Tubular Injury
1	Isolated tubular ectasia, rare sloughed cells in tubular lumens, inflammation absent to minimal
2	Multifocal tubular ectasia, patchy sloughed cells in tubular lumens, rare to multifocal interstitial inflammation
3	Coalescing to diffuse tubular ectasia, diffuse sloughed and necrotic cells obstructing tubular lumens, multifocal to diffuse inflammation

Table 8.

Example of a Scoring System That Takes Parameters From Table 7 and Separates Each Into Its Own Scoring System.

Score	Ectasia	Necrosis	Inflammation
1	Rare (≤5%)	Rare (≤5%)	Rare (≤5%)
2	Multifocal (6%–40%)	Multifocal (6%–40%)	Multifocal (6%–40%)
3	Coalescing (41%–80%)	Coalescing (41%–80%)	Coalescing (41%–80%)
4	Diffuse (>80%)	Diffuse (>80%)	Diffuse (>80%)

When modifying or developing a new ordinal scoring system, it is useful to evaluate the variance of lesion severity in all samples so as to “fit” the scoring system into the range of lesions. For example, if an infection model is studied at day 2 postinoculation, the range of lesions may be entirely different from those previously studied at day 6 postinoculation. If this adjustment is not done, then the scoring system may be so skewed as to be ineffective for assessment of group differences at the different time point.

The number of score categories within the ordinal method has potential implications for the study, and this ranges from as few as 3 to as many as 10 or more per system.^29,33,42,59 A small number of score categories (eg, 3) can reduce the sensitivity of the scoring system so that more animals (or more severe group differences) are required to detect a real biologic difference between groups. Alternatively, a large number of ordinal scores may cause difficulty in score assignment as there is often less obvious distinction between categories. This means that a scoring system with a large number of categories is prone to have reduced repeatability. It has been suggested that ∼4 to 5 score levels may be an optimal range to maximize detection and repeatability.^59,68

Ordinal scores are most commonly derived from direct evaluation of tissues with assignment of scores by the observer; however, transformation of quantitative data to ordinal scores has been described and is another source of ordinal scores.^19,40 Transformation of data can be a useful tool to constrain sample variance that is often found in animal-based research.

Rank Method

The rank (“ordering”) method is not commonly used by pathologists, but it is simple in application.^31,65 This method is remarkably similar to what pathologists do (subconsciously) in their routine tissue evaluations. Samples from the treatment groups are combined and then ranked from most severe to least severe (or vice versa), and the rank number for each sample is used for analysis (Fig. 2). Although the ranked method is conceptually straightforward in application, it may be more labor intensive with larger sample numbers.⁷⁵

Figure 2.

Example of the rank method. Samples (circles) from group A (white circles) and group B (black circles) are combined (top row) for examination. The samples are then ranked in order of lesion severity (represented by circle diameter, bottom row). The rank numbers for group A (1, 2, 3, 4, 7, 8) and group B (5, 6, 9, 10, 11, 12) are then analyzed. P = .03, Mann-Whitney test.
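
With untied ranks and small groups, as in Figure 2, the Mann-Whitney P value can be obtained exactly by enumerating every way the ranks could have been divided between the groups. The Python sketch below does this; the function name is hypothetical, and the approach assumes no tied ranks and group sizes small enough to enumerate.

```python
from itertools import combinations

def exact_rank_test(ranks_a, ranks_b):
    """Exact two-sided P for the rank sum of group A (assumes no tied ranks)."""
    all_ranks = list(ranks_a) + list(ranks_b)
    k = len(ranks_a)
    expected = sum(all_ranks) * k / len(all_ranks)
    observed_dev = abs(sum(ranks_a) - expected)
    # Rank sums for every possible assignment of k samples to group A.
    sums = [sum(c) for c in combinations(all_ranks, k)]
    # Count assignments at least as far from the expected rank sum as the one observed.
    extreme = sum(abs(s - expected) >= observed_dev for s in sums)
    return extreme / len(sums)

# Rank numbers from Figure 2.
group_a = [1, 2, 3, 4, 7, 8]
group_b = [5, 6, 9, 10, 11, 12]
p = exact_rank_test(group_a, group_b)
print(f"P = {p:.2f}")
```

For these ranks, 24 of the 924 possible assignments are at least as extreme as the observed one, which rounds to the P = .03 in the figure legend.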

Statistics

Key components of statistical analysis are important in any research project. A biostatistician should collaborate with researchers for routine planning of experimental design through analyses of their data.²¹ Access to a user-friendly statistical software package can also be useful for routine analyses and synthesis of graphs for publication. The use of scored data may be discipline dependent as scoring and its analyses are common for pathologists at academic and medical institutions, but recent International Harmonization of Nomenclature and Diagnostic criteria (INHAND) recommendations suggest that toxicologic pathologists should rely on their morphologic interpretation preferentially over statistical inference of scoring.⁴⁵

Choosing the appropriate statistical test is an important component for every experimental data set. Statistical tests have “assumptions” on which they function, and if an assumption is not applicable for the data being examined, then the validity (and interpretation) of the statistical approach may be in question. For example, ordinal data do not meet the assumption of a normal (Gaussian) distribution. Parametric analyses (eg, Student’s t-test) should not be used to analyze ordinal data, but rather nonparametric analyses (eg, Mann-Whitney test) should be considered.^60,62 Misuse of statistical analysis in research is recognized,^60,67 and accordingly, it is not uncommon for ordinal scoring data to be analyzed by inappropriate parametric tests (eg, Student’s t-tests). Increasingly, these types of inappropriate statistical analyses are being identified at submission of peer-reviewed papers, causing mandatory statistical revision or manuscript rejection. For a broader perspective on statistical analysis of data, the reader is encouraged to examine these resources.^20–22,30,31,41,60–62

Validation

Scoring methods should be designed to be a reproducible and meaningful analysis of data (ie, a valid scoring system). But how does one know that a scoring method is valid? Validation mechanisms have been used in many tissue-specific scoring systems.^36,50,71,74 Validation can be summarized as 2 basic approaches: that of validating observer repeatability and that of validating tissue pathobiology.

Validation in Repeatability

Recent reports have highlighted the importance of repeatability in research.^1,52 For instance, Begley and Ellis¹ attempted to repeat the work of 53 major “landmark” papers but were successful in only 11% (6 of 53) of the cases. Similarly, recognition of the need to accurately reproduce experimental methods has caused some journals to expand their word limits for Materials and Methods sections.⁵⁴ Repeatability in pathology methods (including scoring) is a relevant and important consideration in experimental design as well as reporting of data.

One approach to validate scoring systems has been to assess their repeatability through evaluation of intra- and interobserver correlation.^14,24,43,73 This evaluation is often reported as a κ value, which is calculated from observer agreements (Table 9); κ is 1 for perfect agreement, 0 for agreement no better than chance, and negative when agreement is worse than chance. Validation using this method only assesses the repeatability of the method and should not be confused with validation of tissue pathobiology as described below.

Table 9.

Interobserver Agreement (Observers A and B) for Classification of Hepatocellular Carcinoma (HCC) and Hepatocellular Adenoma (HCA) From Liver Tumor Samples (n = 100).^a

	HCC—B	HCA—B
HCC—A	39	10
HCA—A	6	45

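^aObserved agreement was calculated as (HCC + HCA agreements)/total assessments = (39 + 45)/100 = 0.84. Agreement expected by chance was calculated from the marginal totals as (49 × 45 + 51 × 55)/100² = 0.501. Cohen's κ = (0.84 − 0.501)/(1 − 0.501) = 0.68, which indicates substantial agreement between observers A and B in classifying these liver tumors.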
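
The raw proportion of agreement in Table 9 (0.84) and Cohen's κ are related but distinct quantities, because κ discounts the agreement that would be expected by chance from the marginal totals. The Python sketch below computes both from an agreement table; the function name is hypothetical.

```python
def agreement_and_kappa(table):
    """Return (observed agreement, Cohen's kappa) for a square agreement table.

    table[i][j] is the number of samples that observer A placed in category i
    and observer B placed in category j.
    """
    total = sum(sum(row) for row in table)
    k = len(table)
    observed = sum(table[i][i] for i in range(k)) / total
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Chance agreement: product of the two observers' marginal proportions per category.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total**2
    return observed, (observed - expected) / (1 - expected)

# Table 9: rows are observer A (HCC, HCA); columns are observer B (HCC, HCA).
observed, kappa = agreement_and_kappa([[39, 10], [6, 45]])
print(f"observed agreement = {observed:.2f}, kappa = {kappa:.2f}")
```

For these counts, observed agreement is 0.84 and the chance-corrected κ is approximately 0.68.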

Validation of Tissue Pathobiology

Another approach to validate a scoring system is to analyze the relationship between the scores and relevant parameters of disease severity (ie, pathobiology).^2,10,34,56 This relationship is defined through correlation (eg, Spearman correlation for nonparametric data), which produces a value from –1.0 to 1.0. For example, comparison of tissue scores to relevant pathobiology data (eg, clinical score, body weight, complete blood counts, etc) would ideally demonstrate a strong positive correlation (Fig. 3). Its interpretation is similar to that of the κ—the closer to zero, the lower the correlation. If it is a negative value, then the scoring system has a negative correlation to pathobiology, which would seem unsuitable (if not even “backwards”) for many situations. If the scoring system does not have a strong, positive relationship to disease pathobiology, there may be reason to question its value in the respective model.

Figure 3.

Validation of pathobiology. Tissue scores (x-axis) are graphed out in comparison to relevant pathobiology parameters (y-axis) to see if there is a relationship (ie, correlation) (r = 0.80, P = .001, Spearman correlation). Since the r value is positive and close to 1, this would indicate a strong correlation of the scoring method with tissue pathobiology.
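
A Spearman correlation is a Pearson correlation computed on ranks, with tied values sharing their average rank. The Python sketch below shows the calculation; the function names and the data (tissue scores and a clinical measure for 8 animals) are hypothetical and are not the data behind Figure 3.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical tissue scores and a hypothetical clinical severity measure.
tissue_scores = [0, 1, 1, 2, 2, 3, 3, 4]
clinical_measure = [1.0, 2.5, 2.0, 3.5, 4.0, 5.5, 5.0, 7.0]
rho = spearman(tissue_scores, clinical_measure)
print(f"Spearman r = {rho:.2f}")
```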

Each validation method is mutually exclusive in its scope. For example, when evaluating interobserver correlation, a high κ value gives confidence in the scoring method’s repeatability. That said, it does not give any credibility to the scoring method’s representation of tissue pathobiology, and the contrary is true as well.

Conclusions

Scoring tissue lesions can be a useful tool for evaluating research tissues and corroborating morphologic findings. Following key principles can guide the pathologist to develop a useful and valid scoring system that is both repeatable and meaningful for the project.

Footnotes

Acknowledgements

We thank the Department of Pathology (University of Iowa) for generous support.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

We acknowledge generous financial support from the NIH (HL091842, HL051670, DK054759, DK091211) and US Veteran's Administration (Center for the Prevention and Treatment of Visual Loss).

References

Begley

Ellis

. Raise standards for preclinical cancer research. Nature. 2012;483:531–533.

Bleich

Mahler

Most

. Refined histopathologic scoring system improves power to detect colitis QTL in mice. Mamm Genome. 2004;15:865–871.

Brayton

Justice

Montgomery

. Evaluating mutant mice: anatomic pathology. Vet Pathol. 2001;38:1–19.

Brayton

Treuting

Ward

. Pathobiology of aging mice and GEM: background strains and experimental design. Vet Pathol. 2012;49:85–105.

Brisson

Matsui

Rieder

. Translational research in pediatrics: tissue sampling and biobanking. Pediatrics. 2012;129:1–10.

Brown

James

. Correlation of MRI findings to histology of acetaminophen toxicity in the mouse. Magn Reson Imaging. 2012;30:283–289.

Bucci

. Basic techniques. In: Haschek

Rousseaux

Wallig

, eds. Handbook of Toxicologic Pathology. 2nd ed. Vol. 1. San Diego, CA: Academic Press; 2002:171–185.

Burkhardt

Pandher

Solter

. Recommendations for the evaluation of pathology data in nonclinical safety biomarker qualification studies. Toxicol Pathol. 2011;39:1129–1137.

Cake

Smith

Young

. Synovial pathology in an ovine model of osteoarthritis: effect of intraarticular hyaluronan (Hyalgan). Clin Exp Rheumatol. 2008;26:561–567.

10.

Cheng

Dhall

Zhao

. Murine model of Hirschsprung-associated enterocolitis, I: phenotypic characterization with development of a histopathologic grading system. J Pediatr Surg. 2010;45:475–482.

11.

Crawford

Tykocinski

. Pathology as the enabler of human research. Lab Invest. 2005;85:1058–1064.

12.

Crissman

Goodman

Hildebrandt

. Best practices guideline: toxicologic histopathology. Toxicol Pathol. 2004;32:126–131.

13.

Cross

. Grading and scoring in histopathology. Histopathology. 1998;33:99–106.

14.

Cross

. Kappa statistics as indicators of quality assurance in histopathology and cytopathology. J Clin Pathol. 1996;49:597–599.

15.

Day

Altman

. Statistics notes: blinding in clinical trials and other studies. Br Med J. 2000;321:504.

16.

De Cock

Forman

Farver

. Prevalence and histopathologic characteristics of pancreatitis in cats. Vet Pathol. 2007;44:39–49.

17.

Eaton

Danon

Krakowka

. A reproducible scoring system for quantification of histologic lesions of inflammatory disease in mouse gastric epithelium. Comp Med. 2007;57:57–65.

18.

Ettreiki

Gadonna-Widehem

Mangin

. Juvenile ferric iron prevents microbiota dysbiosis and colitis in adult rodents. World J Gastroenterol. 2012;18:2619–2629.

19.

Ferguson

Hook

Garcia

. A simple post hoc transformation that improves the metric properties of the BBB scale for rats with moderate to severe spinal cord injury. J Neurotrauma. 2004;21:1601–1613.

20.

Festing

MFW

. Design and statistical methods in studies using animal models of development. ILAR J. 2006;47:5–14.

21.

Festing

MFW

Altman

. Guidelines for the design and statistical analysis of experiments using laboratory animals. ILAR J. 2002;43:244–258.

22.

Gad

. Statistics and Experimental Design for Toxicologists and Pharmacologists. Boca Raton, FL: CRC Press; 2006.

23.

Gauger

Vincent

Loving

. Kinetics of lung lesion development and pro-inflammatory cytokine response in pigs with vaccine-associated enhanced respiratory disease induced by challenge with pandemic (2009) A/H1N1 influenza virus. Vet Pathol. 2012;49:900–912.

24.

Germolec

Nyska

Kashon

. Extended histopathology in immunotoxicity testing: interlaboratory validation studies. Toxicol Sci. 2004;78:107–115.

25. Gerwin, Bendele, Glasson. The OARSI histopathology initiative: recommendations for histological assessments of osteoarthritis in the rat. Osteoarthritis Cartilage. 2010;18(suppl 3):S24–S34.

26. Gholamiandehkordi, Timbermont, Lanckriet. Quantification of gut lesions in a subclinical necrotic enteritis model. Avian Pathol. 2007;36:375–382.

27. Gibson-Corley, Hochstedler, Sturm. Successful integration of the histology core laboratory in translational research. J Histotechnol. 2012;35:17–21.

28. Goddard, Smith, Hoyland. Localisation and semiquantitative assessment of hepatic procollagen mRNA in primary biliary cirrhosis. Gut. 1998;43:433–440.

29. Grafe, Woodworth, Noppens. Long-term histological outcome after post-hypoxic treatment with 100% or 40% oxygen in a model of perinatal hypoxic-ischemic brain injury. Int J Dev Neurosci. 2008;26:119–124.

30. Holland. The comparative power of the discriminant methods used in toxicological pathology. Toxicol Pathol. 2005;33:490–494.

31. Holland. Analysis of unbiased histopathology data from rodent toxicity studies (or, are these groups different enough to ascribe it to treatment?). Toxicol Pathol. 2011;39:569–575.

32. Hörnblad, Cheddad, Ahlgren. An improved protocol for optical projection tomography imaging reveals lobular heterogeneities in pancreatic islet and β-cell mass distribution. Islets. 2011;3:204–208.

33. Hübner, Gitter, El Mokhtari. Standardized quantification of pulmonary fibrosis in histological samples. Biotechniques. 2008;44:507–511, 514–517.

34. Isobe, Adachi, Hayashi. Spontaneous glomerular and tubulointerstitial lesions in common marmosets (Callithrix jacchus). Vet Pathol. 2012;49:839–845.

35. Jacobs, Prokop, Oen. Semiquantitative assessment of cardiovascular disease markers in multislice computed tomography of the chest: interobserver and intraobserver agreements. J Comput Assist Tomogr. 2010;34:279–284.

36. Jassal, Nedeltchev, Osborne. A modified scoring system to describe gross pathology in the rabbit model of tuberculosis. BMC Microbiol. 2011;11:49.

37. Kayser, Schultz, Goldmann. Theory of sampling and its application in tissue based diagnosis. Diagn Pathol. 2009;4:6.

38. Kitching, Katerelos, Mudge. Interleukin-10 inhibits experimental mesangial proliferative glomerulonephritis. Clin Exp Immunol. 2002;128:36–43.

39. Kleiner, Brunt, Van Natta M, et al, for the Nonalcoholic Steatohepatitis Clinical Research Network. Design and validation of a histological scoring system for nonalcoholic fatty liver disease. Hepatology. 2005;41:1313–1321.

40. Klesney-Tait, Keck. Transepithelial migration of neutrophils into the lung requires TREM-1. J Clin Invest. 2013;123:138–149.

41. Kohlmann, Moock. How to analyze your data. In: Stengel, Bhandari, Hanson, eds. Handbooks: Statistics and Data Management. Davos, Switzerland: AO Publishing; 2009:93–110.

42. Lafemina, Sheldon, Ferriero. Acute hypoxia-ischemia results in hydrogen peroxide accumulation in neonatal but not adult mouse brain. Pediatr Res. 2006;59:680–683.

43. Landis, Koch. The measurement of observer agreement for categorical data. Biometrics. 1977;33:159–174.

44. Langlois, Meyerholz, Coleman. Oseltamivir treatment prevents the increased influenza virus disease severity and lethality occurring in chronic ethanol consuming mice. Alcohol Clin Exp Res. 2010;34:1425–1431.

45. Mann, Vahle, Keenan. International harmonization of toxicologic pathology nomenclature: an overview and review of basic principles. Toxicol Pathol. 2012;40:7S–13S.

46. Miao, Leaf, Treuting. Caspase-1–induced pyroptosis is an innate immune effector mechanism against intracellular bacteria. Nat Immunol. 2010;11:1136–1142.

47. Murthy, Adamcakova-Dodd, Perry. Modulation of reactive oxygen species by Rac1 or catalase prevents asbestos-induced pulmonary fibrosis. Am J Physiol Lung Cell Mol Physiol. 2009;297:L846–L855.

48. Olivier, Naumann, Goeken. Genetically modified species in research: opportunities and challenges for the histology core laboratory. J Histotechnol. 2012;35:63–67.

49. Pannucci, Wilkins. Identifying and avoiding bias in research. Plast Reconstr Surg. 2010;126:619–625.

50. Pearson, Kurien, Shu KSS. Histopathology grading systems for characterisation of human knee osteoarthritis—reproducibility, variability, reliability, correlation, and validity. Osteoarthritis Cartilage. 2011;19:324–331.

51. Pinson, Schoeb, Lindsey. Evaluation by scoring and computerized morphometry of lesions of early Mycoplasma pulmonis infection and ammonia exposure in F344/N rats. Vet Pathol. 1986;23:550–555.

52. Prinz, Schlange, Asadullah. Believe it or not: how much can we rely on published data on potential drug targets? Nat Rev Drug Discov. 2011;10:712.

53. Pritzker, Gay, Jimenez. Osteoarthritis cartilage histopathology: grading and staging. Osteoarthritis Cartilage. 2006;14:13–29.

54. Reproducible methods [editorial]. Nat Cell Biol. 2009;11:667.

55. Rollins, Meyerholz, Johnson. A forensic investigation into the etiology of bat mortality at a wind farm: barotrauma or traumatic injury? Vet Pathol. 2012;49:362–371.

56. Scheinin, Butler, Salway. Validation of the interleukin-10 knockout mouse model of colitis: antitumour necrosis factor–antibodies suppress the progression of colitis. Clin Exp Immunol. 2003;133:38–43.

57. Schulz, Chalmers, Hayes. Empirical evidence of bias: dimensions of methodological quality associated with estimates of treatment effects in controlled trials. JAMA. 1995;273:408–412.

58. Sellers. The gene or not the gene—that is the question: understanding the genetically engineered mouse phenotype. Vet Pathol. 2012;49:5–15.

59. Shackelford, Long, Wolf. Qualitative and quantitative analysis of nonneoplastic lesions in toxicology studies. Toxicol Pathol. 2002;30:93–96.

60. Shott. Statistics simplified: designing studies that answer questions. J Am Vet Med Assoc. 2011;238:55–58.

61. Shott. Statistics simplified: detecting statistical errors in veterinary research. J Am Vet Med Assoc. 2011;238:305–308.

62. Shott. Statistics simplified: wrapping it all up. J Am Vet Med Assoc. 2011;239:948–952.

63. Sica. Bias in research studies. Radiology. 2006;238:780–789.

64. Snider, Confer, Payton. Pulmonary histopathology of Cytauxzoon felis infections in the cat. Vet Pathol. 2010;47:698–702.

65. Steel RGD, Torrie, Dickey. Nonparametric statistics. In: Steel RGD, Torrie, Dickey, eds. Principles and Procedures of Statistics: A Biometrical Approach. 3rd ed. Boston, MA: WCB McGraw-Hill; 1997:563–588.

66. Stevens. On the theory of scales of measurement. Science. 1946;103:677–680.

67. Strasak, Zaman, Pfeiffer. Statistical errors in medical research—a review of common pitfalls. Swiss Med Wkly. 2007;137:44–49.

68. Thoolen, Maronpot, Harada. Proliferative and nonproliferative lesions of the rat and mouse hepatobiliary system. Toxicol Pathol. 2010;38:5S–81S.

69. Timbermont, Lanckriet, Dewulf. Control of Clostridium perfringens–induced necrotic enteritis in broilers by target-released butyric acid, fatty acids and essential oils. Avian Pathol. 2010;39:117–121.

70. Torrence, Brabb, Viney. Serum biomarkers in a mouse model of bacterial-induced inflammatory bowel disease. Inflamm Bowel Dis. 2008;14:480–490.

71. Vascellari, Giantin, Capello. Expression of Ki67, BCL-2, and COX-2 in canine cutaneous mast cell tumors: association with grading and prognosis. Vet Pathol. 2013;50:110–121.

72. Venturi, Sempoux, Bueno. Novel histologic scoring system for long-term allograft fibrosis after liver transplantation in children. Am J Transplant. 2012;12:2986–2996.

73. Viera, Garrett. Understanding interobserver agreement: the kappa statistic. Fam Med. 2005;37:360–363.

74. Wachtel, Shome, Sutherland. Derivation and validation of murine histologic alterations resembling asthma, with two proposed histologic grade parameters. BMC Immunol. 2009;10:58.

75. Ward, Thoolen. Grading of lesions. Toxicol Pathol. 2011;39:745–746.

76. Zeiss, Ward, Allore. Designing phenotyping studies for genetically engineered mice. Vet Pathol. 2012;49:24–31.