Comparative Pathologists: Ultimate Control Freaks Seeking Validation!

Abstract

Definable, reproducible, and meaningful are elemental features of grading/scoring systems, while thoroughness, accuracy, and consistency are quality indicators of pathology reports. The expertise of pathologists is significantly underutilized when it is limited to rendering diagnoses. The opportunity to provide guidance on animal model development, experimental design, optimal sample collection, and data interpretation not only contributes to job satisfaction but also, more importantly, promotes validation of the pathology data. Keys to validation include standard operating procedures, experimental controls, and standardized nomenclature applied throughout the experimental design and execution, tissue sampling, and slide preparation, as well as the creation or adaptation and application of semiquantitative grading/scoring systems. Diagnostic drift, thresholds, mental noise, and various diurnal fluctuations strongly influence the repeatability of grading/scoring systems used by the same or different pathologists. Quantitative image analyses are not plagued by the visual and cognitive traps that affect manual semiquantitative grading schemes but may still be affected by technical variables associated with necropsy, tissue sampling, and slide preparation. The validity of a grading scheme is ultimately assessed by its repeatability and biologic relevance, so it is important to correlate scores with comprehensive pathobiology data such as results of antemortem imaging, clinical pathology data, body and organ weights, and histopathologic evaluation of full tissue sets.

Keywords

controls, grading, quantitative image analysis, repeatability, reproducibility, scoring, tissue trimming, validation

Definable, reproducible, and meaningful are elemental features of grading/scoring systems, while thoroughness, accuracy, and consistency are quality indicators of pathology reports.^7,28 All of these characteristics complement the extreme attention to detail that is endogenous to most pathologists. However, nothing disrupts this serenity quite like a flat of slides submitted by a new customer for “blinded” evaluation, which is unaccompanied by experimental objectives and design. Internal chaos further ensues upon inspection of the slides, which may reveal sections of tissue that vary in size or number or exhibit questionable staining. The expertise of pathologists is significantly underutilized when it is limited to rendering diagnoses. The opportunity to provide guidance on animal model development, experimental design, optimal sample collection, and data interpretation not only contributes to job satisfaction but also, more importantly, promotes validation of the pathology data. Keys to validation include standard operating procedures to facilitate training of personnel, experimental controls, and standardized nomenclature applied throughout the experimental design and execution; consistent tissue sampling and slide preparation; as well as the creation or adaptation and application of reproducible semiquantitative grading/scoring systems. Special consideration is also given to variables that may influence quantitative analyses.

Experimental Design, Animal Considerations, and Controls

A clear understanding of the overall hypothesis, objectives of the study, and experimental outcomes informs the selection of the appropriate animal model, lesions to be graded/scored, and special histochemical and/or immunohistochemical/-fluorescence stains that may be applied. The Animal Research: Reporting of In Vivo Experiments (ARRIVE) Guidelines were developed by the National Centre for the Replacement, Refinement and Reduction of Animals in Research (NC3Rs) to improve the design, analysis, and reporting of research using animals.¹⁷ Through a checklist for authors and reviewers of manuscripts and grant applications, the inclusion of essential reportable information promotes experimental reproducibility and transparency. Detailed methodology is expected regarding the design of and procedures conducted on experimental and control groups; animal parameters such as species, strain, sex, and age; housing and husbandry conditions; and sample size with group allocation. The Experimental Design Assistant, also developed by NC3Rs, provides investigators with a tool that creates a visual representation of the experimental design, identifies potential problems, proposes enhancements, and assists with sample size calculation and randomization.⁶

Vital to any experiment is the design of the experimental and control groups, with the latter comprising various negative and positive controls (Table 1).^10,15 True negative controls are spared from any type of treatment and are therefore not expected to develop associated lesions. Sham controls are intended to mimic a surgical procedure (eg, laparotomy) or treatment (eg, placebo) in the absence of actually performing the procedure (eg, hepatectomy) or administering the test article. The administration of test articles dictates that vehicle controls receive the same volume via the same route to rule out specific effects attributed to the vehicle alone. To ensure that a response can be detected, positive controls are included whereby a known manipulation results in a reproducible effect (eg, phenobarbital administration and the development of proliferative liver lesions). Positive controls may often be combined with gold standard treatment regimens, referred to as comparative controls, in the evaluation of various drugs.¹⁵ It may be permissible to reduce the sample size for select control groups in certain studies. However, it is advisable to consult with the pathologist prior to the start of the experiment to confirm that the pair-contrast method of lesion grading/scoring will not be applied, which would necessitate identical sample sizes for comparison of individuals across all groups in pairs or sets.¹⁴

Table 1. Common Controls Used in Animal Model Research.

Type	Negative	Inactive/Sham	Positive	Comparative
Animal experiment	Spared from all manipulation	Mimics surgical procedure or treatment	Known manipulation results in reproducible effect	Gold standard care/treatment
Staining	Tissue lacking specific component or antigen; omission of primary antibody; replacement of primary antibody with serum or isotype immunoglobulins		Tissue containing specific component or antigen	Chromogen only; counterstain only

It cannot be overemphasized that animal groups should also be matched for strain/substrain, age, sex, and genotype through the use of littermates. Animals should also be procured from the same vendor when possible. Furthermore, husbandry considerations such as vivarium location (and therefore health status), caging, bedding, housing density, dark-light cycles, ambient temperature and humidity, diet, and water should be identical for all animals.

Necropsy and Tissue Sampling

Compliance with a number of best practices at the time of tissue harvest will help to minimize variability and promote reproducibility.²² Animal identifiers and groups, as well as the order in which they are necropsied, should be randomized. If a large number of animals necessitates sacrifice and tissue harvest over several days, a subset of animals from all groups should be processed each day. In general, submission of live animals is preferred, which permits recording of body and organ weights, collection of blood and other body fluids for clinical pathology testing, and evaluation of complete tissue sets. Taken together, this full complement of pathology data will facilitate the validation of the animal as a model of the disease in humans.
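
The scheduling advice above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `schedule_necropsy`, the animal identifiers, and the group labels are hypothetical, and the approach (shuffle each group, deal its members across days from a random starting day, then shuffle the order within each day) is one simple way to ensure every group is represented on every necropsy day.

```python
import random
from collections import defaultdict


def schedule_necropsy(animals, n_days, seed=None):
    """Assign animals to necropsy days with a randomized order within each day.

    animals: list of (animal_id, group) tuples (hypothetical identifiers).
    Returns a dict mapping day number (1-based) to an ordered list of
    (animal_id, group) tuples.
    """
    if n_days < 1:
        raise ValueError("n_days must be at least 1")
    rng = random.Random(seed)

    # Collect the animals belonging to each group.
    by_group = defaultdict(list)
    for animal_id, group in animals:
        by_group[group].append((animal_id, group))

    schedule = {day: [] for day in range(1, n_days + 1)}
    for group in sorted(by_group):
        members = by_group[group]
        rng.shuffle(members)
        # Deal the shuffled group across days, starting on a random day so
        # that no single day is systematically favored.
        offset = rng.randrange(n_days)
        for i, animal in enumerate(members):
            day = (i + offset) % n_days + 1
            schedule[day].append(animal)

    # Randomize the order in which animals are necropsied on each day.
    for day in schedule:
        rng.shuffle(schedule[day])
    return schedule


# Example: 3 groups of 6 animals processed over 3 days.
animals = [(f"{g}-{i}", g)
           for g in ("control", "vehicle", "treated")
           for i in range(1, 7)]
for day, order in schedule_necropsy(animals, n_days=3, seed=1).items():
    print(day, [animal_id for animal_id, _ in order])
```

Fixing the seed makes the schedule reproducible so that it can be recorded with the study methods. A group smaller than the number of days cannot appear on every day, which the sketch does not attempt to resolve.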

The Registry of Industrial Toxicology Data (RITA) and the North American Control Animal Database (NACAD) have published guidelines for the sampling and trimming of rodent tissues.^18,23 Sample size, direction of sectioning, and number of sections should be consistent between animals and groups. One should also exercise caution when using tissue sponges with thick tissues in shallow cassettes, which can induce artifacts. If stereology is intended, forethought is necessary to ensure collection of the entire organ or tissue and the systematic uniform random sampling required for 3-dimensional estimation.⁵ The interval between euthanasia or tissue anoxia and fixation should be minimized and maintained between animals.
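
As a minimal sketch of the systematic uniform random sampling mentioned above, the following Python function selects section indices from an exhaustively sectioned organ. The function name `surs_positions` and its parameters are illustrative assumptions rather than part of any cited guideline: a random start is drawn within the first sampling interval and every subsequent section at that fixed interval is taken, so each section has the same probability of being sampled.

```python
import random


def surs_positions(n_sections, interval, seed=None):
    """Return 0-based indices of sections chosen by systematic uniform
    random sampling.

    n_sections: total number of serial sections through the organ.
    interval:   sampling period (take every `interval`-th section).
    """
    if n_sections < 1 or interval < 1:
        raise ValueError("n_sections and interval must be positive")
    rng = random.Random(seed)
    # Random start within the first interval; this is the only random step.
    start = rng.randrange(interval)
    return list(range(start, n_sections, interval))


# Example: 100 serial sections, sampling every 10th section.
print(surs_positions(100, 10, seed=7))
```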

Slide Preparation

Virtually all steps in the preparation of histology slides can directly or indirectly affect the reproducibility and value of grading/scoring systems.²² The common use of formaldehyde and its cross-linking mechanism of fixation have been cause for concern regarding overfixation. However, it has been shown that underfixation is a more significant concern, which warrants ∼20 times the volume of fixative to tissue and optimal fixation times, in general, of 48–72 hours.^4,9,22 Processing protocols should be adjusted according to tissue type to ensure complete dehydration, clearing, and paraffin infiltration. Consistent embedding protocols that group tissues by size and texture will help to prompt recut requests in the event all expected tissues do not end up on a given slide.

Keys to consistent special histochemical and immunohistochemical/-fluorescence staining protocols are appropriate controls (Table 1), consistent vendor-sourced reagents appropriately selected for the intended species and application, preventive equipment maintenance, and optimized procedures.²² Positive controls represent sites known to express the component or antigen, ideally within the specimen; however, a separate slide is also adequate.^9,13,22,30 Awareness of expected staining patterns in various tissues with regard to quality, quantity, and localization will facilitate validation. Negative controls should distinguish between specific antigen binding by the primary antibody and nonspecific binding of the secondary antibody.^4,9,13,22,29 Substitution of the primary antibody with serum- or isotype-specific immunoglobulins at the same concentration as the primary antibody will accomplish the former, while exclusion of the primary antibody will achieve the latter. Tissues have varying degrees of endogenous peroxidases and biotin, so optimization should include tissues representing different levels or, at the very least, the exact grading/scoring system target tissue. Freshly cut slides sectioned at a consistent thickness help to minimize the impact on stain density, while replicate sections on the same slide and serial sections address batch staining consistency and intratissue variability, respectively. For water-soluble stains to adequately penetrate tissues, tissues must be thoroughly deparaffinized and rehydrated. Because of the large number of variables, batch staining by the same lab and using the same autostainer, antibody lot, and antigen retrieval method are preferred.

An internal quality assurance program whereby the pathologist and technologist routinely evaluate stained slides is essential to assess the adequacy of staining in experimental and control slides, as well as the overall quality of the slide including microtomy and coverslipping.^9,11,21 Adequacy of staining should itself be scored for staining intensity, uniformity, localization, and specificity, as well as background staining, counterstaining, and artifacts.^11,21 Internal quality assurance audits may also be supplemented by external quality assurance programs, such as HistoQIP administered by the National Society for Histotechnology; however, the emphasis on specific human tissues and markers may preclude justification of the expense.

Creating, Adapting, and Applying a Grading/Scoring System

When first presented with a new grading/scoring opportunity, the pathologist typically conducts an exhaustive literature review to identify established systems consistently used in humans or animals. The application of human grading schemes to animal tissues strives to uphold the relevance of the animal model to the human disease. While the intent of this literature review is to avoid “reinventing the wheel,” the reality is that grading systems published with a paucity of methodology details are highly prevalent. High citation frequencies are not necessarily evidence of a system’s exemplary or gold standard status. The paucity of published methodology details, combined with idiosyncrasies of the animal model, often necessitates adaptation of existing grading/scoring systems, which may or may not be validated.

As a pathologist creates or adapts a grading scheme, it is important to consider the number of features and severity levels and their thorough definitions using standardized nomenclature and representative photomicrographs to facilitate consensus.²⁷ A popular concept to illustrate the limits of the human brain when making visual judgments along a continuum is a rainbow, categorized by 7 colors: red, orange, yellow, green, blue, indigo, and violet.^8,24 However, intra- and interobserver repeatability is inversely correlated with the number of severity levels; when the number is excessive, categories overlap and adjacent severity levels become too similar to distinguish.^8,12,28 As such, by convention, severity levels are typically maintained between 3 and 5.^12,27,28 It should also be noted that human nature has the inherent tendency to avoid extremes of ranges.⁸ Specific, detailed definitions of each feature and the various severity levels with corresponding photographic documentation facilitate interobserver agreement while perhaps also compensating for differences in training and experience of pathologists. Where possible, standardized nomenclature as proposed in the International Harmonization of Nomenclature and Diagnostic Criteria (INHAND) for lesions of various organ systems in rats and mice, as well as other consensus and recommendation papers for proliferative preneoplastic and neoplastic lesions in genetically engineered mice, should be used to classify histologic features.²⁰

Upon development of a grading/scoring scheme, it is advisable to review all samples to assess how well they fit the system. Slides from pilot time-course studies are suitable for this purpose because of the limited number of slides and, more importantly, the presence of the full spectrum of lesions that can develop over time. Once the grading scheme has been refined as necessary, scoring of experimental slide sets may commence. A practical tiered approach is recommended whereby the initial evaluation is unblinded and followed by a targeted masked review of findings to ensure diagnostic criteria have been followed.^7,27 Commencing the evaluation with the control group(s) may prevent excessive lesion identification in experimental groups.

At this point, there are several causes of potential discord regardless of whether there are single or multiple raters. “Diagnostic drift” is attributed to the under- or overreporting of lesions that can occur when a single pathologist must evaluate a large number of samples over a prolonged period of time, grading/scoring schemes are not clearly defined, and/or multiple pathologists are involved.^7,12,22,27,28 The latter instance is particularly problematic given that different observers will have different thresholds for histopathologic findings, below which they are interpreted to be within normal limits.^7,20,27,28 Thresholds may be influenced by pathologist experience, type of study, and number of slides. Additional factors that may alter histopathologic evaluations include the rater’s inherent skill, education, training, experience, and mental noise, as well as diurnal fluctuations in mood, nutritional state, and energy.³²

The validity of a grading scheme is ultimately assessed by its repeatability between replicate measurements wherein a kappa value between –1 and +1 is calculated from intra- and interobserver agreement of nonparametric ordinal measurements.^8,12,24 Kappa values of –1 and +1 correspond to perfect disagreement and perfect agreement, respectively, while 0 corresponds to no relationship between the raters such that any consensus is due to chance. Arbitrary designations of 0.4 and 0.6 delineate the boundaries of poor, moderate, and excellent agreement.¹⁹ Additional statistical approaches have been developed to assess agreement with continuous measurements over time.³ We are reminded that biologic relevance is independent of repeatability, so it is important to correlate scores with comprehensive pathobiology data such as results of antemortem imaging, clinical pathology data, body and organ weights, and histopathologic evaluation of full tissue sets.^12,22
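
To make the agreement statistic concrete, the sketch below computes an unweighted Cohen's kappa for two raters who scored the same slides, using kappa = (observed agreement – chance agreement) / (1 – chance agreement). The function names `cohen_kappa` and `describe_agreement` and the example scores are illustrative assumptions, and assigning the boundary values 0.4 and 0.6 themselves to the "moderate" category is a choice made here rather than one specified in the text. Weighted variants of kappa, which give partial credit for near misses on an ordinal scale, are not shown.

```python
from collections import Counter


def cohen_kappa(scores_a, scores_b):
    """Unweighted Cohen's kappa for two raters scoring the same slides."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("both raters must score the same, nonzero number of slides")
    n = len(scores_a)

    # Proportion of slides on which the two raters assigned the same grade.
    observed = sum(a == b for a, b in zip(scores_a, scores_b)) / n

    # Agreement expected by chance, from each rater's grade frequencies.
    freq_a = Counter(scores_a)
    freq_b = Counter(scores_b)
    expected = sum(freq_a[grade] * freq_b[grade] for grade in freq_a) / (n * n)

    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical grade")
    return (observed - expected) / (1 - expected)


def describe_agreement(kappa):
    """Label agreement using the 0.4 and 0.6 boundaries (inclusive bounds assumed)."""
    if kappa < 0.4:
        return "poor"
    if kappa <= 0.6:
        return "moderate"
    return "excellent"


# Example: severity grades (0-3) assigned by two raters to 10 slides.
rater_a = [0, 1, 2, 2, 3, 1, 0, 2, 3, 1]
rater_b = [0, 1, 2, 1, 3, 1, 0, 2, 2, 1]
kappa = cohen_kappa(rater_a, rater_b)
print(round(kappa, 3), describe_agreement(kappa))
```

In this example the raters agree on 8 of 10 slides (observed agreement 0.80) against a chance agreement of 0.27, giving a kappa of approximately 0.73.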

Specific Considerations for Quantitative Analyses

Quantitative image analyses are not plagued by the visual and cognitive traps that affect manual semiquantitative grading schemes and result in objective and measurable parametric data.² However, the technical variables noted above for necropsy and tissue sampling, as well as slide preparation, still apply to quantitative digital image analyses, and their control is necessary to ensure optimal results.^1,22,25,26,31 While a pathologist can “read through” the effects of substandard tissue fixation, variably thick paraffin or frozen sections, tissue folds, uneven staining, and coverslipped dust, image analysis software cannot. Tissue staining must be optimized and performed in batches, preferably a single batch. In the event this is not possible, it is recommended to include the same tissue in different batches for quality control. In addition, stain controls should be performed for each batch to facilitate color deconvolution of each stain of interest (Table 1).²⁵

Digital scanning of all slides in a study should be performed with the same objective, typically either 20× or 40×, to ensure consistent resolution per pixel.³² Excessive mounting media, haphazardly placed coverslips, and thick paraffin sections will dictate the assignment of manual focus points instead of autofocusing methods. While brightfield slide scanning is approaching routine, oil immersion lenses require Z-stacking to view the same area at different fields of focus, and fluorescent scanners essentially require separate scans for each fluorophore.³¹

Unique considerations apply to sampling in digital slides, whereby analyses are typically performed on a limited subset of the entire digital scan, or region of interest (ROI).¹⁶ Extreme care should be taken in drawing ROIs as the recognition of microscopic tissue for pattern recognition training and/or analysis is only as specific as the ROI used to train the computer.^25,31 Some image analysis software packages permit users to develop algorithms of varying complexity while others only allow users to fine-tune established algorithms through the adjustment of certain parameters as they relate to nuclear and cell size and shape, diameter of blood vessel lumina, and so forth. There is intrinsic subjectivity in algorithm creation; however, variability is eliminated provided that subsequent analyses on the same tissue type are performed with the same algorithm. The setting of threshold pixel values, while seemingly arbitrary, can be informed by the results of manual semiquantitative scoring. Mark-ups of analyzed fields, during and following tuning, at relatively high magnification should be evaluated to ensure the accuracy of the settings.
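
The interplay between the ROI and the pixel threshold can be illustrated with a small Python sketch. It is not the algorithm of any particular image analysis package: the function name `positive_fraction`, the nested-list image representation, and the example values are assumptions made for illustration. Only pixels inside the ROI mask are counted, and a pixel is called positive when its stain intensity meets or exceeds the chosen threshold.

```python
def positive_fraction(image, roi_mask, threshold):
    """Fraction of ROI pixels whose stain intensity is >= threshold.

    image:    2D list of stain intensities (eg, 0-255 for one stain channel).
    roi_mask: 2D list of booleans with the same shape; True marks ROI pixels.
    """
    if len(image) != len(roi_mask):
        raise ValueError("image and roi_mask must have the same number of rows")

    roi_pixels = 0
    positive = 0
    for img_row, mask_row in zip(image, roi_mask):
        if len(img_row) != len(mask_row):
            raise ValueError("image and roi_mask rows must have the same length")
        for value, inside in zip(img_row, mask_row):
            if inside:
                roi_pixels += 1
                if value >= threshold:
                    positive += 1

    if roi_pixels == 0:
        raise ValueError("the ROI contains no pixels")
    return positive / roi_pixels


# Example: a 3 x 3 field in which the ROI covers the upper right 2 x 2 block.
image = [
    [10, 200, 220],
    [30, 180, 40],
    [250, 20, 90],
]
roi_mask = [
    [False, True, True],
    [False, True, True],
    [False, False, False],
]
print(positive_fraction(image, roi_mask, threshold=150))  # 3 of 4 ROI pixels
```

Note that the strongly stained pixel in the lower left corner (250) does not contribute because it lies outside the ROI, which is why the care taken in drawing ROIs matters as much as the threshold itself.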

Summary

The ultimate goal of histopathologic grading, whether semiquantitative or quantitative, is the generation of meaningful results in a reproducible manner. Multiple variables, both common and unique to the aforementioned analyses, must be controlled to ensure optimum experimental design and consistent tissue sampling and slide preparation, which are the basis for a valid, reproducible grading system.

Footnotes

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Krista M. D. La Perle, DVM, PhD

References

1. Aeffner, Wilson, Bolon, et al. Commentary: roles for pathologists in a high-throughput image analysis team. Toxicol Pathol. 2016;44(6):825–834.

2. Aeffner, Wilson, Martin, et al. The gold standard paradox in digital image analysis: manual versus automated scoring as ground truth. Arch Pathol Lab Med. 2017;141(9):1267–1275.

3. Barnhart, Haber, Lin. An overview on assessing agreement with continuous measurements. J Biopharm Stat. 2007;17(4):529–569.

4. Battifora. Quality assurance issues in immunohistochemistry. J Histotechnol. 1999;22(3):169–175.

5. Boyce, Dorph-Petersen, Lyck, et al. Design-based stereology: introduction to basic concepts and practice approaches for estimation of cell number. Toxicol Pathol. 2010;38(7):1011–1025.

6. Cressey. Better designs for animal studies. Nature. 2016;531(7592):128.

7. Crissman, Goodman, Hildebrandt, et al. Best practices guideline: toxicologic histopathology. Toxicol Pathol. 2004;32(1):126–131.

8. Cross. Grading and scoring in histopathology. Histopathology. 1998;33(2):99–106.

9. Eisen. Quality management in immunohistochemistry. Diagn Histopathol. 2008;14(7):299–307.

10. Festing MFW, Altman. Guidelines for the design and statistical analysis of experiments using laboratory animals. ILAR J. 2002;43(4):244–258.

11. Francis. Quality assurance in immunohistochemistry – an update. Pathology. 2010;42(suppl 1):S11–S16.

12. Gibson-Corley, Olivier, Meyerholz. Principles for valid histopathologic scoring in research. Vet Pathol. 2013;50(6):1007–1015.

13. Hewitt, Baskin, Frevert, et al. Controls for immunohistochemistry: the Histochemical Society’s standards of practice for validation of immunohistochemical assays. J Histochem Cytochem. 2014;62(10):693–697.

14. Holland. Analysis of unbiased histopathology data from rodent toxicity studies (or, are these groups different enough to ascribe it to treatment?). Toxicol Pathol. 2011;39(4):569–575.

15. Johnson, Besselsen. Practical aspects of experimental design in animal research. ILAR J. 2002;43(4):202–206.

16. Kayser, Schultz, Goldmann, et al. Theory of sampling and its application in tissue based diagnosis. Diagn Pathol. 2009;4:6.

17. Kilkenny, Browne, Cuthill, et al. Improving bioscience research reporting: the ARRIVE guidelines for reporting animal research. PLoS Biol. 2010;8(6):e1000412.

18. Kittel, Ruehl-Fehlert, Morawietz, et al. Revised guides for organ sampling and trimming in rats and mice – part 2. A joint publication of the RITA and NACAD groups. Exp Toxicol Pathol. 2004;55(6):413–431.

19. Landis, Koch. The measurement of observer agreement for categorical data. Biometrics. 1977;33(1):159–174.

20. Mann, Vahle, Keenan, et al. International harmonization of toxicologic pathology nomenclature: an overview and review of basic principles. Toxicol Pathol. 2012;40(4S):7S–13S.

21. Maxwell, McCluggage. Audit and internal quality control in immunohistochemistry. J Clin Pathol. 2000;53(12):929–932.

22. Meyerholz, Beck. Principles and approaches for reproducible scoring of tissue stains in research. Lab Invest. 2018;98(7):844–855.

23. Morawietz, Ruehl-Fehlert, Kittel, et al. Revised guides for organ sampling and trimming in rats and mice – part 3. A joint publication of the RITA and NACAD groups. Exp Toxicol Pathol. 2004;55(6):433–449.

24. Morris. Information and observer disagreement in histopathology. Histopathology. 1994;25(2):123–128.

25. Potts, Young, Voelker. The role and impact of quantitative discovery pathology. Drug Discov Today. 2010;15(21/22):943–950.

26. Riber-Hansen, Vainer, Steiniche. Digital image analysis: a review of reproducibility, stability and basic requirements for optimal results. Acta Pathol Microbiol Immunol Scand. 2012;120(4):276–289.

27. Schafer, Eighmy, Fikes, et al. Use of severity grades to characterize histopathologic changes. Toxicol Pathol. 2018;46(3):256–265.

28. Shackelford, Long, Wolf, et al. Qualitative and quantitative analysis of nonneoplastic lesions in toxicology studies. Toxicol Pathol. 2002;30(1):93–96.

29. Torlakovic, Francis, Garratt, et al. Standardization of negative controls in diagnostic immunohistochemistry: recommendations from the international ad hoc expert committee. Appl Immunohistochem Mol Morphol. 2014;22(4):241–252.

30. Torlakovic, Nielsen, Francis, et al. Standardization of positive controls in diagnostic immunohistochemistry: recommendations from the international ad hoc expert committee. Appl Immunohistochem Mol Morphol. 2015;23(1):1–18.

31. Webster, Dunstan. Whole-slide imaging and automated image analysis: considerations and opportunities in the practice of pathology. Vet Pathol. 2014;51(1):211–223.

32. Weinberger. How valuable is blind evaluation in histopathologic examinations in conjunction with animal toxicity studies? Toxicol Pathol. 1979;7(2):14–17.