Use of Severity Grades to Characterize Histopathologic Changes

Abstract

The severity grade is an important component of a histopathologic diagnosis in a nonclinical toxicity study that helps distinguish treatment-related effects from background findings and aids in determining adverse dose levels during hazard characterization. Severity grades should be assigned based only on the extent (i.e., amount and complexity) of the morphologic change in the examined tissue section(s) and be clearly defined in the pathology report for critical lesions impacting study interpretation. However, the level of detail provided and criteria by which severity grades are assigned can vary, which can lead to inappropriate comparisons and confusion when evaluating pathology results. To help address this issue, a Working Group of the Society of Toxicologic Pathology’s Scientific and Regulatory Policy Committee was formed to provide a “points to consider” article on the assignment and application of pathology severity grades. Overall, the Working Group supports greater transparency and consistency in the reporting of grading scales and provides recommendations to improve selection of diagnoses requiring more detailed severity criteria. This information should enhance the overall understanding by toxicologic pathologists, toxicologists, and regulatory reviewers of pathology findings and thereby improve effective communication in regulatory submissions.

Keywords

histopathology, severity grades, severity scores, adversity, hazard, risk assessment

Interpretation of histopathologic effects in nonclinical toxicity studies is a fundamental part of the hazard identification and risk assessment process for pharmaceuticals, foods, and chemicals. These findings are communicated through the pathologist’s morphologic diagnoses based on microscopic examination of standardized tissue sections and may include modifiers such as distribution (e.g., focal, multifocal, or diffuse), cell types involved (e.g., neutrophilic to modify an inflammatory process), or other features of the lesion. The severity grade represents an important part of the diagnosis for many findings, as it indicates the extent of a lesion and categorizes features into descriptive semiquantitative scales. This information adds relevant context to observed findings and comprises an important element of the pathology report (Shackelford et al. 2002). Along with incidence, the severity and dose-responsiveness of histopathologic findings are used to identify treatment-related effects and may contribute to the determination of dose levels that serve as the basis for establishing safety margins, safe human starting doses, and/or exposure limits.

Previous Recommendations and Best Practices Guidance of the Society of Toxicologic Pathology (STP) recommended that severity grades (also referred to as scores) should be definable, reproducible, and meaningful (Crissman et al. 2004; Morton et al. 2006). Additionally, the Crissman STP Best Practices Guidance stated that “A description of each of the various grades should be included in the narrative for target lesions where severity is critical to interpretation of the data.” “Critical” lesions include those in which consideration of adversity may be appropriate only at higher levels of severity, those potentially driving the critical dose effect levels such as the study no/lowest observed adverse effect level (NOAEL/LOAEL) or the highest nonseverely toxic dose, and those (relevant to critical dose effect levels) in which the incidence alone may not adequately reflect whether a treatment-related effect is present. More specialized histopathologic grading scales may also be needed for focused investigative studies of specific disease processes (Gibson-Corley, Olivier, and Meyerholz 2013). Historical examples include chronic progressive nephropathy (CPN; Hard, Betz, and Seely 2012) and fiber-induced lesions in the lung (McConnell and Davis 2002). In such cases, appropriately applied severity grades can inform whether a treatment-related exacerbation of a background change is present and add context to the interpretation of histopathologic findings. In addition, judiciously assigned severity grades can provide important information about dose–response relationships and whether a particular finding is within expected historical control limits.

The challenge for pathologists is to find a proper balance between overly generic and excessively detailed grading schemes. More general grading scales provide greater flexibility and efficiency for routine evaluations (Zbinden 1976) by avoiding exhaustive and potentially unnecessary review of features for every individual lesion type within different organ systems. It is simply not feasible or necessary to apply lesion-specific grading criteria to the vast majority of histopathologic findings. Moreover, in many private industrial laboratories and contract research organizations, the grading scale may be determined by the established pathology database system a company is using and thus may not easily allow for modification of the number of categories or the descriptive words associated with each grade. Any ad hoc criteria would in those cases need to be described in the methods or discussion section of the pathology report. Still, if regulatory reviewers are to base a LOAEL on a given effect at a higher severity but not at lower levels of severity, then some rationale for the basis of this decision needs to be provided by the pathologist. This latter issue is particularly important, given that different types of severity scales are used across laboratories and different severity criteria may be used among pathologists.

A Working Group of the STP Scientific and Regulatory Policy Committee (SRPC) was formed to review current issues and considerations related to severity grading in nonclinical study reports. The ultimate goal of this Working Group was to improve communication by the study pathologist and increase consistency in the assignment and description of severity grades. The scope of this points to consider article applies specifically to approaches for assessing and recording the severity of histopathologic changes in toxicity studies submitted for review by regulatory agencies of pharmaceuticals, agrochemicals, foods, consumer products, industrial chemicals, and other environmental contaminants. The grading scales specific to local reactions to medical devices (biocompatibility) are not addressed in this points to consider manuscript because they are well defined elsewhere (International Organization for Standardization 2016). However, the points to consider are applicable for any evaluation of systemic tissues distant from the implant/application site as part of a routine safety assessment of a medical device. In toxicologic pathology, the terms “grading” or “scoring” are often considered synonymous and are referred to in this article simply as grading. The intended audience for this document includes toxicologic pathologists, toxicologists, and regulatory reviewers. Topics include the routine use and criteria of severity grades, the influence of severity grades on the determination of adversity, and a brief review of statistical considerations for histopathologic grades.

Purpose and Function of Histopathologic Grading

Histopathologic severity grades should represent the extent of change detected in the microscopic sections available to the pathologist for evaluation (Table 1). “Extent” implies the amount of change with consideration given to distribution (focal, multifocal, and diffuse) of tissue or organ microscopically affected and/or the complexity of the morphologic change observed. For example, in an advanced or complex lesion, a pathologist may combine a spectrum of changes under one diagnosis and give it a higher severity grade, whereas a simple lesion that has few or no differing components may get a lower severity grade. The amount of the tissue/organ affected is characterized by a distribution determined microscopically and possibly macroscopically. For the latter case, a microscopic change may be given a higher severity grade if the macroscopic correlate was widespread during gross examination, whereas a similar microscopic change may be given a lower severity grade if the lesion was known to have a limited distribution or was inapparent at gross examination. Macroscopic observations may thus influence the assigned histopathologic grade.

Table 1. Considerations for Assigning Histopathologic Severity Grades.

Criteria	Consider when assigning a severity grade	Examples
Amount of tissue/organ affected (considering gross observations)	Yes	Necrosis affecting a minority of hepatocytes (i.e., single hepatocytes) vs. necrosis affecting a majority (i.e., confluent or bridging); full-thickness ulcer vs. partial-thickness erosion
Complexity or context of morphologic change	Yes	Inflammation comprised of only inflammatory cells vs. inflammation comprised of inflammatory cells associated with tissue injury and/or fibrosis
Distribution	Yes	Focal, multifocal, or diffuse, or subanatomic compartments (e.g., red pulp or white pulp of the spleen)
Organ weights and clinical pathology	No	Difficult to directly correlate to the extent of a microscopic change
Test article relationship	No	Some test article–related findings may have a low-severity grade. Conversely, a high-severity grade may be assigned to some changes not related to the test article (e.g., CPN in rats or traumatic injuries due to dosing accidents).
Human relevance	No	Thyroid hypertrophy/hyperplasia secondary to hepatocellular enzyme induction in rats
Biological impact	No	Neuronal necrosis vs. pancreatic acinar apoptotic necrosis
Organ function or clinical pathology	No	Proteinuria related to glomerular lesions
Health status of animal	No	Moribundity or increased mortality
Adversity	No	Some adverse changes may be minimal in severity (e.g., neuronal necrosis).
Reversibility	No	Reversal or lack of reversal should not influence the perceived extent of change.

The assignment of severity for a given grading scale should reflect only the histopathologic appearance in the context of microscopic observations as described above. Grading should be evaluated independent of the health status of the animal, mortality, reversibility, or any perceived relationships to test article administration, organ function, clinical pathology changes, biological impact, human relevance, or adversity. Similarly, organ weight changes should generally not be used to determine a histopathologic score, since they may not directly correlate with the specific microscopic diagnosis. While severity grades do not inherently indicate a treatment effect or correlate with adversity, the assigned severity grades can contribute to the pathologist’s conclusion about the NOAEL.

Histopathologic severity grades are a semiquantitative assessment of the relative extent of microscopic lesions. The toxicologic pathologist hones this skill through years of specialized residency and professional training (Markovits et al. 2013). The process of histopathologic slide interpretation in toxicologic pathology thus constitutes a scientific art, similar to diagnostic medicine, which incorporates professional judgment and experience. As stand-alone descriptors, severity grades may thus carry different meanings to pathologists, readers, or regulatory reviewers in various agencies and geographic regions. Documentation of grading criteria for critical findings helps the reviewer follow the pathologist’s observations and interpretation of changes, particularly for histopathologic changes that are important in determining adversity or dose-limiting effects. This documentation can provide key contextual information for routine diagnoses like inflammation, rare or odd findings, and more complex lesions that incorporate a spectrum of changes that may not be apparent from the stand-alone diagnostic term. Communication of grading criteria is also important when comparing critical findings of studies with the same compound in different test systems or species and across doses and exposure durations. Without suitably defined grading criteria for critical/key findings, reviewers may simply have to rely on a subjective understanding of grading descriptors, which may in some cases result in inconsistent and/or incorrect interpretations and lead to additional peer review by an independent pathology working group (PWG; Mann and Hardisty 2014; Wolf and Maack 2017; Wolf et al. 2014).

Assignment of severity grades is generally intended to capture the spectrum or range of findings within a given study. Grades are relative assessments, however, based on diverse types of morphologic information, from pathologic process to lesion extent. Standard grades such as “mild” or “moderate” do not necessarily have inherent meaning across lesions or studies, and the frames of reference may be influenced by numerous contextual factors including background findings, knowledge of a particular strain or species of animal model, and study type and duration. Moreover, grading approaches often vary among pathologists and testing laboratories. Because of these factors, when severity is critical for interpretation, an explanation of the defined criteria for any type of specific morphologic interpretation is needed for that grading scale.

In current practice, grading scales and categorical terms are not uniform across all pathology reporting systems. For example, a grade 1 lesion may be classified as “minimal” by some pathologists and “very slight” by others; because of these differences, the terms used with the grade assigned should be clearly stated in each report. In most data collection systems used for nonclinical safety (toxicity) studies, either a 4- or 5-point scale is most often encountered. Current efforts related to Standard for Exchange of Nonclinical Data (SEND) implementation and controlled terminology of the Clinical Data Interchange Standards Consortium for nonclinical studies may result in future guidance on selection and use of either a 4- or 5-point grading scale, but these issues are evolving at this point and considered outside the scope of this working group.
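The need to state the terms used with each grade can be illustrated with a minimal Python sketch. The two scale definitions below are hypothetical; only the grade 1 labels “minimal” and “very slight” come from the text above, and the remaining labels and the function name are assumptions. The sketch simply shows that a numeric grade is interpretable only together with the scale declared in the report.

```python
# Hypothetical scale definitions. Only the grade 1 labels "minimal" and
# "very slight" are taken from the text; all other labels are assumptions.
SCALE_A = {1: "minimal", 2: "mild", 3: "moderate", 4: "marked", 5: "severe"}
SCALE_B = {1: "very slight", 2: "slight", 3: "moderate", 4: "severe"}


def describe_grade(grade: int, scale: dict[int, str]) -> str:
    """Return a grade with its term and the size of the scale it belongs to."""
    if grade not in scale:
        raise ValueError(f"grade {grade} is not defined on this scale")
    # max(scale) is the highest grade number defined on the scale.
    return f"grade {grade}/{max(scale)} ({scale[grade]})"


print(describe_grade(1, SCALE_A))
print(describe_grade(1, SCALE_B))
```

Reporting the grade, the term, and the scale length together (for example, “grade 1/5 (minimal)”) is one possible way to keep the three pieces of information from being separated in tables.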

For some findings, it is sufficient to record histopathologic changes only as “present” without the assignment of a severity grade, if the severity of the change is difficult to assess in the tissue section and/or will not impact the interpretation of the finding or the relationship of that finding to the test article. Examples of findings often recorded as present include calculi, parasites, or congenital anomalies. The decision not to apply a severity grade is also appropriate when sectioning factors (e.g., free-floating elements) preclude consistent evaluation (Adams and Crabbes 2013). In nonclinical studies, neoplasms are not graded for severity but are recorded as present, with or without additional information such as single or multiple, benign or malignant, or metastatic. The list of findings that are typically recorded as present rather than graded can vary by test facility (e.g., cysts are not graded by some institutions). Therefore, in these cases, findings that are not graded should be defined by the test facility in standard operating procedures or other guidance documents used by that institution so that this information is available for reference by study and peer review pathologists.

Toxicologic pathology examinations often consist of a two-stage process, an initial “identification” stage and, if needed, a later “confirmation stage,” as described by Long and Hardisty (2012). In accordance with best practice guidelines (Crissman et al. 2004), the identification stage is performed in an unblinded fashion, meaning that the pathologist has knowledge of treatment groups and all available associated information (e.g., hematology results and organ weights). Lack of blinding during this stage should not be considered a study weakness or source of bias (e.g., during systematic review). During the confirmation stage, if needed, the pathologist evaluates the slides masked to the treatment group (blinded) and applies the specific diagnostic features previously determined in the identification stage (Long and Hardisty 2012). The initial knowledge of treatment groups in the identification stage enables the pathologist to determine the type and amount of baseline (background or spontaneous) changes that occur in control animals, from which possible treatment-related effects in dosed animals must be qualitatively and semiquantitatively differentiated. During the identification stage, some lesion types require “thresholding,” a practice defined as determining which findings will be recorded as morphologic changes and which will be considered variations in normal morphology and not be recorded. For example, macrophages are normally present in the alveolar spaces of the lung; however, various treatments may induce an increase in the number of these cells. Discerning when an actual treatment-related increase is present involves thresholding.

Based on this process, the diagnosis of many low-grade (grade 1 or minimal) lesions may depend on the pathologist’s individual threshold with regard to what is “normal” or “within normal biological variation” for a particular species/strain or animal age and can therefore be influenced by the training and personal experience of the study pathologist (Long and Hardisty 2012). This context precludes simple catchall rules for assigning or interpreting severity grades and highlights the idea that the study pathologist should provide morphologic details and grading criteria to enable study interpretation. These criteria are essential for an independent reviewer to fully understand the application of a particular severity category and also contribute to historical control databases. A common example is the presence of lymphoid depletion in the thymus. In a young animal in which physiologic thymic involution is not yet expected, this change may be considered severe, while in an older animal, its presence may be considered normal. Severity grades thus reflect study context as well as relation to concurrent controls.

For a given grading scale, however, assignment of severity should reflect only the morphologic features of that particular microscopic lesion. In other words, severity grades do not inherently indicate a treatment effect or correlate with adversity. Certain spontaneous age-related changes such as CPN in rodents may have high-severity grades, while other test article–related changes such as retinal degeneration may have low-severity grades but a greater impact on establishing the NOAEL/LOAEL. Furthermore, severity grades may be especially useful for interpretation of findings with high-background incidences in control animals such as immune cell infiltrates, thymic lymphoid depletion, or CPN (e.g., Hard, Betz, and Seely 2012). In such cases, a difference in severity between control and treated groups may be critical to the identification of a test article–related effect and no observed effect level (NOEL) or NOAEL/LOAEL. Carefully and properly defined severity grades can drive the determination of these critical dose effect levels, regardless of the grading scale terms applied for a given study, as long as clear morphologic criteria are presented.
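For a finding with a high background incidence, the contribution of severity can be sketched by tabulating incidence together with the number of animals at each grade. The Python example below uses invented per-animal grades for a single finding; the group names, the values, and the convention that 0 means “not observed” are all assumptions made for illustration and do not represent study data.

```python
from collections import Counter

# Hypothetical per-animal grades for one finding (0 = not observed).
# Group names and values are illustrative assumptions, not study data.
grades_by_group = {
    "control": [0, 1, 1, 0, 1, 1, 0, 1, 1, 1],
    "high dose": [1, 2, 3, 2, 1, 3, 2, 2, 0, 3],
}


def summarize(grades: list[int]) -> dict:
    """Return incidence and the count of animals at each recorded grade."""
    affected = [g for g in grades if g > 0]
    return {
        "incidence": f"{len(affected)}/{len(grades)}",
        "by_grade": dict(sorted(Counter(affected).items())),
    }


for group, grades in grades_by_group.items():
    print(group, summarize(grades))
```

In this invented data set, incidence is similar in the two groups (7/10 and 9/10), but all affected control animals are grade 1, whereas most affected high-dose animals are grade 2 or 3. The sketch reports counts per grade rather than a mean because the grades are ordered categories.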

Defining Grading Criteria

For consistent and transparent regulatory review, it is essential that pathology reports provide a clear rationale for the assignment of severity grades within the context of an individual study, particularly for critical/key lesions that may have an impact on establishing the NOAEL/LOAEL. Grades for these critical findings should be supported by explicit morphologic criteria to communicate the extent (distribution, amount, and complexity) of the finding and/or its various components and minimize impacts of thresholding or scale differences between pathologists (Long and Hardisty 2012; Shackelford et al. 2002). While severity grades for a given lesion should not be defined by presumed impact on organ or biological function, reserve capacity, overall health status of the animal, adversity, or reversibility, interpretive comments relating grades to these concepts may be helpful in the discussion of key findings.

Severity grading scales can be classified as generic, which lack explicit morphologic descriptions for each severity grade, or lesion-specific, in which each grade is defined by specific morphologic criteria. Whether generic or lesion-specific, grading scales will typically use the same terms (minimal, mild, etc.) within the same study. Generic grading criteria may be applicable to most, or even all, histopathologic findings in a standard nonclinical toxicity study. Less frequently, lesion-specific scales based on the spectrum of observed findings and/or criteria in established publications may be used. In such cases, scales should be referenced in the pathology report and annotated where needed to provide detailed study- or compound-specific information. For common background changes like CPN, standardized criteria may be developed within an organization or specialty and be presented in the pathology narrative when considered critical to the study interpretation. For other key findings in which specific grading scales are needed (e.g., those that may inform the NOAEL of a study), the criteria should be described in the pathology narrative so that they are reproducible and meaningful for other pathologists, toxicologists, and regulatory reviewers (Crissman et al. 2004).

Examples of generic and specific scales are provided in Table 2. For specific scales, several basic features may provide useful criteria to establish cutoffs between grades:

Table 2. Examples of Generic and Specific Histopathologic Grading Schemes in Multiple Formats.

Example 1. Generic Grading Criteria

Histopathologic grades were assigned as level 1 (minimal), 2 (mild), 3 (moderate), 4 (marked), or 5 (severe) based on an increasing extent and/or complexity of change, unless otherwise specified.

Example 2. Generic Grading Criteria

Grade 1/Minimal: The first (lowest) level of severity in an ordered list based on a four-level scale of minimal, mild, moderate, and severe.

Grade 2/Mild: The second level of severity in an ordered list based on a four-level scale of minimal, mild, moderate, and severe.

Grade 3/Moderate: The third level of severity in an ordered list based on a four-level scale of minimal, mild, moderate, and severe.

Grade 4/Severe: The fourth (highest) level of severity in an ordered list based on a four-level scale of minimal, mild, moderate, and severe.

Example 3. Specific Grading Criteria: Epidermal Hyperplasia of Dermal Dose Site

Within normal limits: <3 cells thick

Grade 1/Minimal: 3–4 cells thick

Grade 2/Slight: 5–6 cells thick

Grade 3/Moderate: 7–8 cells thick

Grade 4/Marked: 9–10 cells thick

Grade 5/Severe: >10 cells thick

Example 4. Specific Grading Criteria: Centrilobular Hepatocellular Necrosis

Grade 0: within normal limits

Grade 1 (minimal): approximately <5% of centrilobular hepatocytes are necrotic

Grade 2 (mild): approximately 5% to 20% of the liver is affected by centrilobular hepatocyte necrosis that is often circumferential

Grade 3 (moderate): approximately 20% to 40% of the liver is affected by centrilobular hepatocyte necrosis that is often circumferential and bridging

Grade 4 (marked): generally >50% of the liver is affected by centrilobular hepatocyte necrosis that is bridging, confluent, and often extends beyond centrilobular zones

Example 5. Specific Grading Criteria: Retinal Degeneration

With minimal retinal degeneration, the retina has loss of or ragged appearance of the photoreceptor layer. Mild retinal degeneration has loss of the photoreceptor layer and thinning of the outer nuclear layer. Occasional outer nuclear layer nuclei are displaced into the photoreceptor layer. With moderate retinal degeneration, the outer nuclear layer is notably thinned with accompanying thinning of the inner nuclear layer. Marked retinal degeneration has thinning and decreased cellularity of all layers of the retina.

lesion location, distribution, or pattern;

semiquantitative metrics such as proportion of the section, organ, or tissue affected;

shifts in process and spectrum/complexity of morphologic changes.
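As an illustration of a semiquantitative metric, the lesion-specific scale in Example 3 of Table 2 can be written as a simple lookup. The following Python sketch is not part of any reporting system; the function name, the use of whole-number cell layer counts, and the numbering of “within normal limits” as grade 0 are assumptions made for illustration.

```python
def grade_epidermal_hyperplasia(cell_layers: int) -> tuple[int, str]:
    """Map epidermal thickness (in cell layers) to the Example 3 grade.

    Cutoffs follow Example 3 of Table 2. Treating "within normal limits"
    as grade 0 is an assumption of this sketch.
    """
    if cell_layers < 0:
        raise ValueError("cell_layers must be non-negative")
    if cell_layers < 3:
        return 0, "within normal limits"
    if cell_layers <= 4:
        return 1, "minimal"
    if cell_layers <= 6:
        return 2, "slight"
    if cell_layers <= 8:
        return 3, "moderate"
    if cell_layers <= 10:
        return 4, "marked"
    return 5, "severe"


for layers in (2, 4, 7, 12):
    print(layers, grade_epidermal_hyperplasia(layers))
```

Scales built on process shifts or lesion complexity, such as Example 5, do not reduce to a single numeric cutoff in this way and rely on the narrative description of each grade.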

Histopathologic findings of higher severity (e.g., moderate, marked, or severe) often indicate either a greater area of change or changes that exhibit multiple features, and it is important to communicate in the pathology report how these features were captured in the pathology tables with severity grades. As lesion complexity increases, a pathologist may consider whether to maintain the same diagnostic term to cover more features or create a new diagnostic term. This lumping/splitting decision may occur in the same study, to account for variation among animals, or between studies in the same program, to account for differences in exposure duration, age, or chronicity of the disease process. A higher severity grade could reflect a constellation of related findings (“lumping” option) or, alternatively, be separated out into lower severity components (“splitting” option). In either scenario, it is critical that the pathology narrative clearly describes the basis for severity of critical findings.

In an example grading scale presented by Kaufmann et al. (2009), severity of laryngeal squamous metaplasia is determined by the distribution of metaplasia, number of squamous cell layers, and degree of keratinization. Other features such as hyperplasia, inflammation, and ulceration would be tallied as separate diagnoses but would be considered in the interpretation of laryngeal changes as being adverse or not. This type of splitting may not be readily apparent with a generic scheme, where criteria for each grade are not explicitly defined. Similarly, the extent of the tissue affected may not directly convey severity across different types of lesions. For example, inflammation comprising a low number of immune cells affecting a minor proportion of an organ section may be considered minimal, whereas ulceration or fibrosis affecting the same proportion may warrant a higher severity grade. In cases where numerical ranges are used as criteria (e.g., <10%, 10%-25%, etc.), these numbers should be clearly characterized as semiquantitative estimates to avoid the appearance of false precision unless backed by quantitative measurements.

Variability in Severity Grading

Inconsistencies in the application of severity grades can cause confusion in the toxicology review process. Intrastudy inconsistency can occur when grading is not similarly applied to microscopic observations within a study (“diagnostic drift” or “grading drift”; Shackelford et al. 2002). It is most commonly encountered when the microscopic evaluation takes place over an extended period of time, when multiple pathologists are assigned to one study, or when large numbers of animals are examined. For example, a two-year carcinogenicity bioassay in which one pathologist evaluates the males and another pathologist evaluates the females would involve all three of these factors. Grading should be carefully controlled through such means as peer review and reference to representative examples of different lesion severity grades throughout the course of microscopic evaluation. Study pathologists may also choose to reevaluate specific tissue sections, especially those of potential target tissues, prior to pathology finalization, in a targeted masked or blinded fashion, to confirm or refine observations and severity grades (Crissman et al. 2004). A pathology peer review by a second pathologist should be conducted via a process that detects and addresses diagnostic drift to limit intrastudy inconsistencies for critical findings, which will depend on grading scales with clearly defined morphologic criteria (Morton et al. 2010).

Interstudy differences may occur when grading is not consistently applied between studies. As mentioned above, this variability may result from different training, experience, thresholding levels, and professional opinions among study pathologists. What one pathologist might grade as minimal for a commonly observed background finding might be considered “within normal limits” by a different pathologist. Likewise, a “marked (4/5)” grade for one pathologist might be considered “moderate (3/5)” for another pathologist based on their experience with a particular compound class, which may lead to slight differences in semiquantitative thresholds. This example highlights the idea that categorical severity grades do not carry inherent meaning outside the context of a given study. Interstudy inconsistency may also be due to different pathologist preferences in recording observations for a particular disease state. In some cases, particular test facilities have standardized protocols to facilitate more consistent generation of historical control data. These different approaches can affect grading. For example, an observation of mild rodent CPN could also be recorded under the individual components (tubular basophilia, hyaline casts, tubular dilation, etc.). Each of these individual components will vary in this complex disease (say, from minimal to mild), even though the overall grade for CPN is mild. Interstudy inconsistency can also result from differences in the frame of reference that may be applied in specific circumstances. For example, at the lower end of the grading range, when there are very subtle differences, the pathologist could modify the criteria to identify the cutoff from normal to minimal and from minimal to mild. This type of thresholding should be explained in the pathology report. Pathology peer review and masked targeted slide evaluation may also be important in the evaluation of critical findings at the low end of the grading scale.

It is important to note that there will always be some degree of inconsistency in severity grading across studies, especially when evaluated by different pathologists. However, as long as each pathologist maintains consistent criteria within a given study, this interstudy variability should not impact the accurate identification of test article effects and adverse dose levels. The peer review process should evaluate whether there is consistency within a study and overall concordance between the study and peer review pathologists (Morton et al. 2010). In general, peer review pathologists will not be concerned when there are minor differences of opinion with the severity grades assigned by the study pathologist (i.e., within one grade of the peer reviewer’s grade), as long as the grades are consistent within the study, and do not affect overall identification of a dose response or interpretation of the pathology data (Long and Hardisty 2012). These one-grade differences between pathologists generally relate to differences in thresholding. However, if a peer review pathologist identifies inconsistent application of grading severities among animals within a study, this issue is invariably brought to the attention of the study pathologist(s), often requiring a reevaluation of the slides by the study pathologist. Correction of such inconsistencies at (or before) peer review should allow for more accurate determination of NOELs/NOAELs and may improve the consistency of severity grading by future pathologists evaluating and/or reviewing additional studies with the same compound or related compounds (Morton et al. 2010).
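The one-grade convention described above can be sketched as a simple comparison of paired grades. In the Python example below, the animal identifiers, the grades, and the function name are hypothetical, and the default tolerance of one grade is taken from the text. The sketch only screens for the size of a difference; it does not assess within-study consistency or whether a difference alters the dose response, both of which remain matters of professional judgment.

```python
def flag_discordant(study: dict[str, int], peer: dict[str, int],
                    tolerance: int = 1) -> list[str]:
    """Return animal IDs whose two grades differ by more than `tolerance`."""
    # Only animals graded by both pathologists are compared.
    shared = sorted(study.keys() & peer.keys())
    return [a for a in shared if abs(study[a] - peer[a]) > tolerance]


# Hypothetical animal IDs and grades (0 = not observed).
study_grades = {"A01": 1, "A02": 2, "A03": 4, "A04": 0}
peer_grades = {"A01": 2, "A02": 2, "A03": 2, "A04": 1}

print(flag_discordant(study_grades, peer_grades))
```

With these invented values, only animal A03 (grade 4 versus grade 2) exceeds the one-grade tolerance; the 0 versus 1 difference for A04 is the kind of thresholding difference discussed in the text.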

While the peer review process does not directly address interstudy differences, it is possible to conduct a post hoc pathology peer review via a PWG if it is considered essential to evaluate the severity grading of a specific finding across multiple studies (e.g., Wolf et al. 2014).

Toxicologists and regulatory reviewers should be cautious when making interstudy comparisons or pooling data for a specific severity grade from more than one study. There should be a general understanding that severity grades between studies often vary by one severity grade, that they may differ based on the study type and experimental model, and that comparisons across studies should consider carefully the rationale of the toxicologist and pathologist for determining which findings are test article–related and thus relevant to health guidance values such as the NOEL and NOAEL. The severity grades are tools applied for the purpose of aiding the overall process of determining these critical dose levels or response thresholds in study parameters within specific studies.

It should be recognized that in some cases, differences in grading of lesions between studies are actually due to inherent biological variability in the test system(s) or study type and not due to inconsistency in grading by the pathologist(s) involved. Contributing factors include differences in age and source of the animal test species, exposure levels and duration, route of administration, and normal biological variability. For example, two rats within a high-dose group in one study may develop the same lesion attributed to test article administration, with the lesion graded minimal in one animal and marked in the other. Such interanimal variation is expected, given the use of outbred species (dogs, mini-pigs, and nonhuman primates) and stocks (e.g., CD-1 mice and Sprague-Dawley and Wistar Han rats; Engelhardt, Gries, and Long 1993) in nonclinical toxicity studies. Therefore, it is not difficult to understand how different studies of the same test article can result in different distributions of severity grades for test article–related findings. Morphologic features of a lesion such as endometrial hyperplasia may be quite different in sexually mature animals from those in senescent animals or in different model species with different reproductive cycles. The biological context is simply different. Regardless of the study factors, the critical information derived from histopathologic evaluation of a study, which includes diagnoses as well as incidence and severity grades, should be the determination of the NOEL and NOAEL. This determination drives dose selection for future nonclinical studies and clinical trials.

Severity Grades Are Only One Part of the Weight of Evidence That Defines Adversity

Adversity levels are determined for administered dose or systemic exposure levels by a weight-of-evidence evaluation of all findings in a study or data package (Kerlin et al. 2016; Palazzi et al. 2016). There are cases in which an individual finding in isolation may be considered adverse. However, in many cases, a combination of changes, none of which may be adverse in isolation, may define adversity. While there are examples of individual histopathologic changes that may define adversity at any grade (e.g., neuronal necrosis), many histopathologic changes in isolation may not be sufficient to define adversity, except possibly at higher levels of severity. Lower levels of severity of these changes may define adversity only when accompanied by other findings (e.g., changes in associated clinical chemistry parameters). According to Palazzi et al. (2016), “[T]he primary consideration here is whether a given change could impair cell/tissue/organ function or reserve capacity to respond to additional challenge…[I]f the absence of a functional correlate can be clearly demonstrated then the lesion may not define adversity.” The key point is that a severity grade in and of itself does not define adversity. Conversely, the adverse nature of a change does not necessitate a higher severity grade (i.e., even changes of minimal severity may be adverse). The relationship of a severity grade to adversity must be determined by considering the nature of the histopathologic change and any accompanying findings. When a severity grade is important to assessing the adversity of a histopathologic change, the rationale for assigning the severity grade should be clearly reported (Kerlin et al. 2016; Table 1; Figure 1).

Figure 1. Decision tree for use of detailed grading criteria.

Statistics and Meta-analysis of Histopathologic Severity Grades

Although statistical analyses are routinely used to compare tumor incidence in carcinogenicity studies, the use of statistics with severity grades in toxicologic pathology is a controversial topic. Our goal here is not to evaluate or endorse specific statistical approaches for analyzing severity grades but rather to describe some of the challenges inherent in using severity grades to make comparisons across groups or studies. However, with rare exception, the use of statistics for ordinal and nonlinear severity grades is not recommended (Mann et al. 2012).

While the terms grading and scoring are generally considered to be synonymous in histopathology (and were assumed to be so in this document), they can carry different connotations. This Working Group favored the use of “grade” because the term “score” can be misinterpreted as a discrete quantitative measure, which is rarely the case in histopathology. Based on the extent of morphologic changes, a pathologist typically assigns a categorical number to a microscopic finding, 0 through 4 or 5, using a nonlinear (or ordinal) scale (Mann et al. 2012). While assigned grades can range from completely subjective to semiquantitative estimates of severity, they are not actual (or cardinal number) measurements (Shackelford et al. 2002). It may be more appropriate to assign a score when more quantitative methods such as morphometric image analysis are utilized. In toxicologic pathology, composite scores incorporating multiple histopathologic changes have also been developed for specific disease entities (e.g., as described in McConnell and Davis 2002), but these specialized scoring systems are not commonly used in routine nonclinical toxicity studies.

As discussed previously, pathology data are generally descriptive in nature and therefore inherently subjective. Because severity grades are neither continuous nor normally distributed, it is inappropriate to apply parametric statistical tests based on the group mean of the severity grade (Zbinden 1976). The use of the median as a descriptive statistic may also obscure low-incidence, treatment-related findings. Severity grades are thus best captured in the pathology report by presenting incidence for each grade within a group rather than a single numeric measure of central tendency. Severity grades are assigned on an ordinal basis (whole numbers), and a minor difference in grade (one-grade difference for individual lesions or less than one-grade difference on average) may not necessarily represent a meaningful group difference. With ordinal grading, differences in grading can easily occur for lesions near a break point between two grades. For lesions with multiple tissue changes, different pathologists may fall on either side of a grade difference (e.g., grade 2 mild vs. grade 3 moderate) depending on which tissue component they consider most important. Furthermore, an assigned grade represents the unequal assimilation of multiple criteria from a spectrum of histopathologic changes. Therefore, when small differences exist, any comparison of severity grades between studies, or even between groups, is generally of limited use.
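The recommendation to present incidence for each grade, rather than a single measure of central tendency, can be illustrated with a minimal Python sketch. The group names, grade data, and the helper `incidence_by_grade` are invented for illustration and assume a 0 to 5 ordinal scale in which 0 means no finding. In this made-up example the two groups share the same mean and median grade, yet only the incidence table shows that their grade distributions differ.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical labels for a 0-5 ordinal scale (0 = no finding).
GRADE_LABELS = ["none", "minimal", "mild", "moderate", "marked", "severe"]


def incidence_by_grade(grades, n_grades=len(GRADE_LABELS)):
    """Return the number of animals at each grade, indexed by grade."""
    counts = Counter(grades)
    return [counts.get(grade, 0) for grade in range(n_grades)]


# Invented data: ten animals per group, one grade per animal.
groups = {
    "control": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
    "high dose": [0, 0, 0, 0, 0, 0, 0, 0, 0, 2],
}

for name, grades in groups.items():
    # The mean and median are printed only to show what they hide;
    # they are not appropriate summaries of ordinal grades.
    print(name, "mean:", mean(grades), "median:", median(grades))
    for label, count in zip(GRADE_LABELS, incidence_by_grade(grades)):
        print(f"  {label}: {count}/{len(grades)}")
```

Tabulating counts per grade keeps the ordinal categories intact and avoids treating the grade numbers as measurements.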

Few publications have argued the merits and challenges of applying descriptive statistics to a more granular analysis of a specific finding (Holland and Holland 2011a; and subsequent letters to the editor by Wolf 2011; Ward and Thoolen 2011; Levin 2011; and Holland and Holland 2011b). One proposal is the categorical ranking of severity for a finding to support a potential effect or trend. However, in practical and general use, even the application of an appropriate nonparametric statistical test to rank-ordered severity grades is unlikely to alter the overall assessment of treatment-related effects. The power of nonclinical studies is rarely adequate to dismiss a potential signal as unlikely based on statistical analysis or to specifically highlight a signal based only on severity grade distribution.

In some cases, risk assessment groups have used statistical significance to evaluate potential treatment effects. For example, severity grades have been used in modeling of human risk based on statistical (categorical regression) software at the U.S. Environmental Protection Agency (e.g., https://www.epa.gov/bmds/catreg). This dose–response tool utilizes numerical severity grades to fit regression models and calculate the odds of developing a lesion of some severity or higher. In this application, a better understanding of the biological basis for severity grades is needed to help modelers use these modified severity grades to make decisions on how to best represent statistical associations between diagnoses. Given the variability between studies discussed previously, the STP Working Group considers these models to be of questionable value for routine nonclinical toxicity studies.
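To make concrete what "the odds of developing a lesion of some severity or higher" means, the sketch below computes the empirical odds directly from the grades in each dose group. This is not the categorical regression software referenced above and fits no model. The function name `odds_at_or_above`, the dose levels, and the grades are all invented for illustration.

```python
def odds_at_or_above(grades, threshold):
    """Empirical odds that an animal's grade is >= threshold.

    Odds = (animals at or above threshold) / (animals below threshold).
    Returns infinity when every animal is at or above the threshold.
    """
    if not grades:
        raise ValueError("At least one graded animal is required.")
    n_at_or_above = sum(1 for g in grades if g >= threshold)
    n_below = len(grades) - n_at_or_above
    if n_below == 0:
        return float("inf")
    return n_at_or_above / n_below


# Invented grades (0 = no finding) for five animals per dose group.
grades_by_dose = {
    0: [0, 0, 0, 0, 1],
    10: [0, 1, 1, 2, 2],
    100: [1, 2, 2, 3, 3],
}

for dose, grades in grades_by_dose.items():
    print(dose, odds_at_or_above(grades, threshold=2))
```

Because the result depends entirely on where each pathologist places the boundary between adjacent grades, the interstudy variability discussed above carries straight through to any odds calculated this way.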

Finally, a potential issue is that tools like SEND currently do not extract the narrative interpretations from pathology reports, which undermines the pathologist’s investment in contextualizing the severity grade data and their interpretation. The pathology community recognizes that as SEND and other “big data” tools become more readily available to industry and regulatory reviewers, the potential impact of interstudy inconsistencies may be magnified, producing unintended interpretations, trends, or causes for concern that may delay patient or consumer access to impactful treatments or products while questions from health authorities are addressed.

Conclusions

The primary goal of this Working Group was to improve the communication of severity grading criteria used by pathologists in toxicity study reports, thereby also increasing consistency in the assignment of severity grades, as well as subsequent interpretation of pathology data by peer review pathologists, toxicologists, and regulatory agencies. In this article, we have reviewed current issues and considerations pertaining to histopathologic severity grades in the context of recent recommendations on adversity determinations in toxicity studies (Kerlin et al. 2016; Palazzi et al. 2016). The following are summary points to consider for the assignment of grading scales, effective communication of grading criteria in reporting documents, and the interpretation of grades by reviewers.

Purpose and Function

Histopathology is an interpretive science based largely on the professional judgment of a highly trained toxicologic pathologist. While scientifically informed, morphologic diagnoses are nevertheless subjective in nature. Severity grades provide a semiquantitative assessment of the extent of a lesion based on morphologic criteria and thus represent an important descriptive component of many histopathologic findings.

Basis for Criteria

Grading criteria should be based exclusively on the morphologic changes observed within the target tissue and not on the presumed impact on organ or biological function, reserve capacity to respond to subsequent stress, overall health status of the animal, adversity, reversibility, presumed relevance to humans, or other types of data such as organ weights and clinical pathology.

Defining Grading Criteria

When the severity of critical test article–related changes potentially impacts the NOAEL/LOAEL, specific histopathologic grading criteria should be clearly described within the pathology report. This documentation is an important part of effectively communicating the pathologist’s interpretation of findings to subsequent reviewers. Without defined grading criteria, report readers must rely on a subjective understanding of grading descriptors, possibly resulting in inaccurate decisions.

Inherency

Histopathologic grading of lesion severity in a toxicity study provides important context for determining critical dose effect levels, which may then be used to inform risk assessment and dose selection in future studies. However, because the pathologist assigns severity criteria specifically for the study at hand, the grading categories (e.g., minimal, mild, moderate, marked, and severe) are relative terms, and are generally not intended to carry inherent meaning for different lesions within a study or similar lesions between studies.

Variability

Differences in grading within and between studies are unavoidable and due in part to various experimental factors and normal biological variability of the animal model. Overall variability can be reduced through reporting of detailed morphologic grading criteria (where applicable for critical findings) and a pathology peer review process that incorporates these criteria.

Adversity

Severity grades may contribute to the determination of adversity levels. However, severity grades do not necessarily have a direct relationship to adversity, and they should not be assigned based on any presumed relationship to adversity. When the severity grade is critical to the determination of adversity, the basis for assigning the severity grade should be clearly defined. Communication of the biological impact and potential adversity associated with histopathologic findings is most appropriate within the report narrative.

Statistical Evaluation

Formal statistical analyses of severity grades, including assignment of significance, are rarely appropriate in routine nonclinical studies or modeling across studies.

Footnotes

Authors’ Note

This article is a product of an STP Working Group and has been reviewed and approved by the SRPC. The article does not represent a formal best practice recommendation of the Society but provides key points to consider in designing or interpreting data from regulated toxicity and safety studies. The opinions expressed in this document are those of the authors and do not reflect views or policies of the employing institutions including the U.S. Food and Drug Administration. This article has also been reviewed by the U.S. Environmental Protection Agency and approved for publication. Approval does not signify that the contents reflect the views of the agency, and mention of trade names or commercial products does not constitute endorsement or recommendation for use.

Acknowledgments

The authors would like to thank the reviewers from the Scientific and Regulatory Policy Committee, U.S. Environmental Protection Agency, and U.S. Food and Drug Administration for their critical comments on this manuscript.

Author Contribution

All authors (KS, JE, JF, WH, RH, GL, EM, DP, MT, CW, SF) contributed to conception or design; data acquisition, analysis, or interpretation; drafting the manuscript; and critically revising the manuscript. All authors gave final approval and agreed to be accountable for all aspects of work in ensuring that questions relating to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

References

Adams, E. T., and Crabbes, T. A. (2013). Basic approaches in anatomic toxicologic pathology. In Haschek and Rousseaux’s Handbook of Toxicologic Pathology (Haschek, W. M., Rousseaux, C. G., Wallig, M. A., Bolon, and Ochoa, eds.), pp. 164–65. Academic Press, Amsterdam, the Netherlands.

Crissman, J. W., Goodman, D. G., Hildebrandt, P. K., Maronpot, R. R., Prater, D. A., Riley, J. H., Seaman, W. J., and Thake, D. C. (2004). Best practices guideline: Toxicologic histopathology. Toxicol Pathol 32, 126–31.

Engelhardt, J. A., Gries, C. L., and Long, G. G. (1993). Incidence of spontaneous neoplastic and nonneoplastic lesions in Charles River CD-1 mice varies with breeding origin. Toxicol Pathol 21, 538–41.

Gibson-Corley, K. N., Olivier, A. K., and Meyerholz, D. K. (2013). Principles for valid histopathologic scoring in research. Vet Pathol 50, 1007–15.

Hard, G. C., Betz, L. J., and Seely, J. C. (2012). Association of advanced chronic progressive nephropathy (CPN) with renal tubule tumors and precursor hyperplasia in control F344 rats from two-year carcinogenicity studies. Toxicol Pathol 40, 473–81.

Holland and Holland (2011a). Analysis of unbiased histopathology data from rodent toxicity studies (or, are these groups different enough to ascribe it to treatment?). Toxicol Pathol 39, 569–75.

Holland and Holland (2011b). Response to Letter to Editor by Dr. J. C. Wolf on “Analysis of unbiased histopathology data from rodent toxicity studies (or, are these groups different enough to ascribe it to treatment?).” Toxicol Pathol 39, 1138.

International Organization for Standardization. (2016). ISO 10993-6:2016 Biological evaluation of medical devices. Part 6: Tests for local effects after implantation. Accessed January 1, 2018. https://www.iso.org/standard/61089.html.

Kaufmann, Bader, Ernst, Harada, Hardisty, Kittel, Kolling, et al. (2009). 1st International ESTP Expert Workshop: “Larynx squamous metaplasia.” A re-consideration of morphology and diagnostic approaches in rodent studies and its relevance for human risk assessment. Exp Toxicol Pathol 61, 591–603.

Kerlin, Bolon, Burkhardt, Francke, Greaves, Meador, and Popp (2016). Scientific and regulatory policy committee: Recommended (“best”) practices for determining, communicating, and using adverse effect data from nonclinical studies. Toxicol Pathol 44, 147–62.

Levin (2011). Concerning the analysis of unbiased histopathology data. Toxicol Pathol 39, 1139.

Long, G. G., and Hardisty, J. F. (2012). Regulatory forum opinion piece: Thresholds in toxicologic pathology. Toxicol Pathol 40, 1079–81.

Mann, P. C., and Hardisty, J. H. (2014). Pathology working groups. Toxicol Pathol 42, 283–84.

Mann, P. C., Vahle, Keenan, C. M., Baker, J. F., Bradley, A. E., Goodman, D. G., Harada, et al. (2012). International harmonization of toxicologic pathology nomenclature: An overview and review of basic principles. Toxicol Pathol 40, 7s–13s.

Markovits, J. E., Bouchard, P. R., Clarke, C. J., and McMartin, D. N. (2013). Introduction to toxicologic pathology. In Toxicologic Pathology: Nonclinical Safety Assessment (Sahota, P. S., Popp, J. A., Hardisty, J. F., and Gopinath, eds.), pp. 77–96. CRC Press, Boca Raton, FL.

McConnell, E. E., and Davis, J. M. (2002). Quantification of fibrosis in the lungs of rats using a morphometric method. Inhal Toxicol 14, 263–72.

Morton, Kemp, R. K., Francke-Carroll, Jensen, McCartney, Monticello, T. M., Perry, et al. (2006). Best practices for reporting pathology interpretations within GLP toxicology studies. Toxicol Pathol 34, 806–9.

Morton, Sellers, R. S., Barale-Thomas, Bolon, George, Hardisty, J. F., Irizarry, et al. (2010). Recommendations for pathology peer review. Toxicol Pathol 38, 1118–27.

Palazzi, Burkhardt, J. E., Caplain, Dellarco, Fant, Foster, J. R., Francke, et al. (2016). Characterizing “adversity” of pathology findings in nonclinical toxicity studies: Results from the 4th ESTP international expert workshop. Toxicol Pathol 44, 810–24.

Shackelford, Long, Wolf, Okerberg, and Herbert (2002). Qualitative and quantitative analysis of nonneoplastic lesions in toxicology studies. Toxicol Pathol 30, 93–96.

Ward, J. M., and Thoolen (2011). Grading of lesions. Toxicol Pathol 39, 745–46.

Wolf, J. C. (2011). Counterpoint to “Analysis of unbiased histopathology data from rodent toxicity studies (or, are these groups different enough to ascribe to treatment?).” Toxicol Pathol 39, 1017–19.

Wolf, J. C., and Maack (2017). Evaluating the credibility of histopathology data in environmental endocrine toxicity studies. Environ Toxicol Chem 36, 601–11.

Wolf, J. C., Ruehl-Fehlert, Segner, H. E., Weber, and Hardisty, J. F. (2014). Pathology working group review of histopathologic specimens from three laboratory studies of diclofenac in trout. Aquat Toxicol 146, 127–36.

Zbinden (ed.) (1976). The role of pathology in toxicity testing. In Progress in Toxicology, pp. 8–18. Springer-Verlag, Berlin, Germany.