Recommendations for the Evaluation of Pathology Data in Nonclinical Safety Biomarker Qualification Studies

Abstract

A set of best practices for the conduct of histopathology evaluation in nonclinical safety studies was endorsed by the Society of Toxicologic Pathology (STP) in 2004. These best practices indicate that the study pathologist should have knowledge of the treatment group and access to all available study-related data for the animal from which the tissue was obtained. A new set of best practices for the conduct of histopathology review for safety biomarker qualification for nonclinical studies has been endorsed by the STP and is summarized in this document. These best practices are generally similar to those for nonclinical safety studies, specifically that the pathologist be “unblinded” or have access to study data. Although histopathology evaluation in biomarker qualification studies must be performed without knowledge of novel biomarker data, the study pathologist(s) should be involved in the attendant meta-analyses of these data. Blinded evaluation is an experimental tool in biomarker qualification studies that is appropriate only when well-defined criteria for specific histopathologic findings are identified prior to blinded review. This paper also considers the management of bias, the use of a tiered evaluation approach, the importance of using qualified pathologists and standard reporting, and the management of spontaneous findings.

Keywords

safety biomarker; histopathology; bias; qualification

Summary of Best Practices

Best practices for histopathology review of nonclinical safety studies (Crissman et al. 2004) generally apply to nonclinical safety biomarker qualification studies.

Histopathology evaluation in nonclinical safety biomarker qualification studies must be performed by a qualified pathologist using a consistent evaluation method.

Sampling bias in tissue selection and sectioning is inherent to histopathology evaluation and must be managed appropriately.

Changes indistinguishable from spontaneous findings in tissues must be identified and interpreted appropriately in biomarker studies.

A tiered approach to evaluation should be applied to nonclinical safety biomarker qualification studies, where appropriate.

During histopathologic evaluation in biomarker qualification studies, pathologists should be blinded to the data that are specific to the novel biomarker undergoing qualification; however, they should be involved in the attendant meta-analyses of these data.

Blinded histopathology evaluation is an experimental tool in biomarker qualification studies that is appropriate only under conditions in which well-defined criteria for specific desired end points have been achieved prior to the blinded review of said end point.

Introduction

Various consortia and working groups composed of professionals from industry, academia, and government institutions have undertaken or are undertaking nonclinical work to qualify safety biomarkers of tissue injury, function, and repair. Multiple groups have been working in parallel, and participants have developed a combination of questions, answers, and opinions with regard to study execution or preferred practices to achieve study end points (Burkhardt et al. 2010). Qualitative histopathology serves as a benchmark in these biomarker studies, since it is often the technique used to ultimately identify the pathologic process for which the biomarker is being developed. Similar to nonclinical safety studies, different practices may be used during histopathology evaluation within safety biomarker qualification studies. In nonclinical safety studies, histopathology is of particular interest, and the most favored or preferred practices have been reviewed extensively in previous publications (Crissman et al. 2004; Weinberger 1979). The application of these practices to nonclinical safety biomarker qualification studies has been considered less extensively. This paper will consider and discuss concepts relevant to biomarker qualification, and the use of qualitative histopathology as a benchmark, with a view to developing best practices for histopathology evaluation in nonclinical safety biomarker qualification studies.

Traditional biomarkers benefit from decades of nonclinical and clinical experience. This extensive experience is the foundation for confidence in the analytic and diagnostic accuracy of interpretation under a variety of normal physiologic and disease states. Even biomarkers that have emerged relatively recently have benefited from years of qualification. For example, the small nuclear protein cystatin C was proposed as a biomarker of reduced glomerular filtration rate in the early 1980s (Grubb et al. 1985). Since then, numerous investigations have been performed that influenced data interpretation, development of additional test platforms, meta-analyses of clinical studies, and multinational expert meetings (Filler et al. 2005). Potential causes of pre-analytical and analytical variability, including the effects of age, sex, and intra-individual variation, as well as the application of cystatin C to specific diseases or clinical conditions, have been examined. Such systematic steps in biomarker qualification require time and resources, but they are essential to definitively qualify or establish the value of biomarkers. Newer nonclinical safety biomarkers are rarely afforded such an extensive analysis before integration into safety evaluation.

Histopathology evaluation of tissues is a significant component in the assessment of biomarker performance. Histopathology, by virtue of its widespread use as a reference standard (or universally agreed upon method for confirming the status of a disease), is a key element for determining the predictive or diagnostic value of novel nonclinical safety biomarkers (or index test; Table 1) using meta-analytic methods such as receiver operating characteristic (ROC) analyses, likelihood ratios, and so on. The method by which histopathology evaluations and interpretations are performed, namely, the histopathology practices, is therefore important and is the focus of this manuscript. Two terms, repeatedly referenced throughout the manuscript, are defined here:

Table 1.

Types of bias and their effect on diagnostic performance.

Type of bias	Description	Effect on diagnostic performance
Reference standard	Knowledge of reference test result during evaluation of an index test	Overestimation
Information	Interpretation of reference test with knowledge of index test results	Overestimation
Incorporation	Index test is incorporated in the determination of the reference standard	Overestimation
Disease spectrum	Lack of generalization of disease spectrum of test population to general population	Overestimation
Misclassification	Incorrect classification by reference test or standard	Over- or underestimation
Expectation or observer review	Knowledge of dose group, exposure, or clinical history	Over- or underestimation

Note: Index test indicates test undergoing evaluation; reference standard indicates agreed method for confirmation of presence or absence of target condition. After Leeflang 2008.

Blinded Evaluation

A formal process in which the pathologist conducts a histopathologic evaluation unaware of (“blinded to”) the identity of the animal represented on the slide. In blinded review, the pathologist may issue the study report without knowing the identity of any of the specimens with respect to treatment, group assignment, or outcomes from in-life data in the study. Consequently, the pathologist may be unable to offer relevant interpretations within the report, as breaking the “blind” by identification of dose group after histopathologic evaluation is required for assessment of dose-response and integration of histopathology findings with other study data.

Targeted Masked Evaluation

An informal process by which the pathologist temporarily masks the slide identity in order to clarify or refine evaluation of treatment effects during the course of the open or “unblinded” review. The pathologist issues the report with full knowledge of the identity of every specimen examined in relation to treatment, group assignment, and in-life data. Consequently, the pathologist can provide context and interpretation of findings relative to concurrent or historical controls and to incidence and severity of any treatment effect(s).

Background Information to Support Best Practices in the Histopathology Evaluation of Nonclinical Safety Biomarker Qualification Studies

Role of Bias in Biomarker Evaluation

Histopathology is a specialized discipline in which considerations for best practices may have similarities to, but also important differences from, other disciplines used in biomarker evaluation. Histopathology evaluation in nonclinical safety studies is comparable in some aspects to diagnostic assessment in other observational platforms such as radiology or ultrasonography, in which qualitative or semiquantitative end points are evaluated by professionals trained in the discipline. Evaluations are applied on a continuous scale, but outcomes are interpreted on discrete ordinal (e.g., severity graded) or binary (e.g., presence or absence of a finding) scales and in the context of normal anatomy or tissue features. Although histopathology evaluation in nonclinical safety studies shares some similarities with that applied in diagnostic pathology, there are reasons why these similarities do not hold true for biomarker qualification studies. Diagnostic-type studies of spontaneous disease commonly use predefined or institutionally accepted microscopic diagnostic criteria for specific end points, such as for identification, grading, and staging of cancer or hepatic fibrosis. Definitions of these end points have generally evolved over time, as scientists have characterized time-course for disease and have inferred a prognostic value for various tissue changes. Because nonclinical biomarker studies to evaluate toxicity effects typically have less well-defined and often unique histologic changes, evaluations are generally conducted in the context of findings present in the concurrent control animals.

In both clinical and nonclinical studies, visual diagnosis is influenced by two different but overlapping components. The first is a perceptual or non-analytical assessment with rapid adjudication of a diagnosis. The second is a cognitive or analytical assessment with conscious integration of diagnostic features into a diagnosis (Norman et al. 1992). Although dissociation of perceptual from cognitive processing occurs at different levels in the diagnostic process, perceptual processing can dominate diagnostic assessment in visual diagnosis. Similarly, clinical information can influence readers of visual end points through both perceptual and analytical processes, yet the net effect of both components is positive and, in effect, improves diagnostic assessment (Loy and Irwig 2004).

Various types of bias can be recognized in the context of scientific investigation. Clinical history and histopathology parameters represent covariate influences that can affect diagnostic outcomes, and generally, these parameters are recognized and considered in statistical evaluation in observational studies. These parameters include prevalence or severity of observed findings, observer experience, and sample metadata. Bias, or the systematic deviation from the “truth,” is common to any scientific study and particularly observational studies, and thus it also must be considered in the data evaluation process (Grimes and Schultz 2002; Sica 2006). Numerous types of bias with overlapping and confusing nomenclature and that are often discipline specific have been described, and a generalized summary, including descriptions and outcomes related to diagnostic test evaluation, is provided in Table 1. For example, an increase in disease prevalence can exert a positive bias such that the test reader may have a greater likelihood of identifying a finding (context bias). Similarly, severity and prevalence of disease in a study conducted at a tertiary referral center can overestimate the marker’s diagnostic performance in the general population (spectrum bias).

It has been proposed that knowledge of treatment group during histopathology evaluation of a nonclinical safety study introduces a bias that can confound the diagnostic performance of a candidate novel biomarker in safety biomarker qualification studies (Dieterle et al. 2010). Of the types of bias shown in Table 1, expectation or observer review bias most closely relates to the question of blinding pathologists to treatment and dose group during nonclinical safety or biomarker qualification studies. In this context, this type of bias equates to the provision of limited signalment data with no information regarding toxicant exposure, or equivalently, of “disease.” For clinical observational studies, knowledge of clinical history or reference standard has been associated with improved diagnostic accuracy of the observational test (Brealey et al. 2007). This improved diagnostic accuracy, however, is generally not significant based on a meta-analysis of paired observational studies in which radiographs, computed tomography scans, or mammograms were evaluated in blinded fashion, relative to a reference standard by the same observers with or without knowledge of clinical history under identical circumstances and separated by a washout period (Loy and Irwig 2004). In diagnostic performance evaluation of qualitative end points, significant increases or a positive bias in reading accuracy generally occurs only when knowledge or incorporation of the results of a reference standard are used in diagnostic performance evaluation. Most importantly, although providing an evaluator with information such as toxicant exposure or clinical data can sensitize the evaluator in an observational study toward identification of a given finding, this informational bias is relevant only when the diagnostic accuracy of the observational end point itself is being evaluated.

Inter-observer variability is one of the largest sources of variance in observational studies in the clinic (Brealey et al. 2007; Rousselet et al. 2005). Similarly, inter-observer variability is of concern in nonclinical histopathology evaluation (e.g., variability in severity scores assigned to histopathologic findings) and can arise from differences in observer experience or historical control perspective and from interstudy variability associated with the testing facility or the test system, including animal strain, age, sex, or feeding regimen (Qin et al. 2007; Roe 1988). Evaluation of concurrent control animals helps to address these differences and to control for interstudy variability in prevalence of background or spontaneous histopathology findings. Knowledge of control animal identity also provides the opportunity to anchor perceptual histopathology impressions within this context of inherent biological variability in control animals (Dieterle et al. 2010). In nonclinical safety studies in which histopathology is used as the reference standard for biomarker qualification, knowledge of dose group minimizes observer variation and ensures diagnostic consistency within and across studies. As such, unblinded histopathology evaluations in nonclinical studies improve the accuracy of the reference standard (i.e., histopathology evaluation) and, therefore, improve assessment of the diagnostic performance of the index test.
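As an illustration of how inter-observer variability in severity scores might be quantified, the following minimal sketch computes an unweighted Cohen's kappa for two readers. The choice of kappa, the function name `cohens_kappa`, and the severity grades are assumptions made for illustration only; they are not taken from the studies cited above.

```python
from collections import Counter

def cohens_kappa(scores_a, scores_b):
    """Unweighted Cohen's kappa for two readers scoring the same slides."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("both readers must score the same non-empty set of slides")
    n = len(scores_a)
    # Observed agreement: fraction of slides given the same grade by both readers.
    observed = sum(a == b for a, b in zip(scores_a, scores_b)) / n
    # Chance agreement: sum over grades of the product of marginal frequencies.
    freq_a = Counter(scores_a)
    freq_b = Counter(scores_b)
    expected = sum(freq_a[g] * freq_b[g] for g in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both readers assigned one identical grade to every slide
    return (observed - expected) / (1.0 - expected)

# Hypothetical severity grades (0 = none ... 3 = marked) for eight slides.
pathologist_1 = [0, 0, 1, 1, 2, 2, 3, 0]
pathologist_2 = [0, 1, 1, 1, 2, 3, 3, 0]
kappa = cohens_kappa(pathologist_1, pathologist_2)
print(f"kappa = {kappa:.3f}")
```

Because severity grades are ordinal, a weighted kappa that penalizes large disagreements more than adjacent-grade disagreements may be a better fit in practice; the unweighted form is used here only to keep the sketch short.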

The perspective paper by D. F. Ransohoff (2005) entitled “Bias as a threat to the validity of cancer molecular-marker research” is often referenced in the discussion of blinded histopathology evaluation for safety biomarker studies and, for the purposes of this article, helps exemplify the salient aforementioned points regarding observational studies. This opinion paper discusses identification and subsequent handling of bias during evaluation of cancer molecular biomarkers in observational clinical studies based on outcome from randomized controlled trials. The applicability of perspectives from this paper, however, to histopathology assessment in safety biomarker qualification studies has important caveats and limitations. First, unlike randomized controlled trials or clinical observational studies, safety biomarker qualification studies use age- and sex-matched laboratory animals randomized into treated and control groups, thereby eliminating a “bias of inequality” that often occurs in clinical studies.

Second, randomization of sample analysis is routinely used in nonclinical studies to minimize the potential for “bias of unequal assessment of results.” Ransohoff does state that such tightly controlled, experimental studies, when properly designed, have better internal validity and thus a reduced risk of bias. It is only for less-controlled studies, such as clinical studies of heterogeneous human populations, that Ransohoff proposes blinding and randomization of both subjects and sample analyses to reduce the potential for bias.

Third, Ransohoff concedes that a proper and thorough description or standardization of operating procedures can greatly minimize bias. To this end, in safety biomarker qualification studies, the increased use of histopathology lexicons and best practices will clearly define and standardize the histopathology procedures in question here.

Fourth, a tiered approach for histopathologic evaluation in nonclinical safety studies (Crissman et al. 2004) is proposed herein for safety biomarker qualification studies. This approach provides a means to address potential bias by allowing the pathologist first to accurately discern and characterize the findings in an unblinded fashion, and only after a comprehensive understanding of the finding(s) of interest, perform a targeted masked evaluation, if indicated.

Fifth and perhaps most importantly, the approach of blinding pathologists to the biomarker data for the biomarker being tested, rather than to treatment group, in effect blinds them to the end point of interest, namely, the value of the safety biomarker. This approach is consistent with recommendations for independence of the index test (i.e., biomarker) and reference standard (i.e., histopathology) in diagnostic test evaluation, as outlined by the STARD (Standards for Reporting of Diagnostic Accuracy) initiative and others (Bossuyt et al. 2003; Deeks 1999; Ransohoff 2005).

Finally, Ransohoff clearly highlights the expense of blinded analysis and advises that blinded evaluation should be reserved for situations in which a specific question has to be answered based on promising preliminary data, such as late-stage clinical development. Thus, the proposed approach for histopathology assessment in safety biomarker qualification studies is in accord with this last consideration from Ransohoff.

Spontaneous Findings in Biomarker Studies

Exacerbation of spontaneous findings based on the identification of dose-responsive increases in incidence and/or severity of these findings is well recognized as a consequence or treatment-related effect in nonclinical safety evaluation. Beyond strain differences in the incidence of spontaneous findings, incidence of common spontaneous findings varies between studies as well as within and among institutions. However, although the prevalence of spontaneous findings within control animal populations can vary across studies or even within strains of animals, the severity of spontaneous findings is generally of low magnitude. Accordingly, dose-response in incidence and/or severity relative to concurrent and historical controls is relied on to discriminate treatment-related findings from spontaneous findings.

Assessment of diagnostic accuracy of candidate biomarkers is dependent on both the quality of the assay and the accuracy of the reference standard. Therefore, the STP considers consistency in histopathology evaluation, both in nonclinical safety evaluation and biomarker qualification studies, to be of greater importance than blinding of pathologists to treatment group. Further, we propose that common spontaneous findings of minor severity have minimal impact on the diagnostic performance of the associated biomarker. Finally, these minor findings are completely assessed during evaluation of diagnostic performance by ROC analysis, as this method systematically evaluates all possible sensitivity and specificity pairs within a dataset for all possible diagnostic thresholds of the marker being evaluated.
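The idea that ROC analysis evaluates every sensitivity and specificity pair across all thresholds can be sketched as follows. This is a minimal illustration, not the analysis used in any cited study: the function names, the example values, and the assumption that higher marker values indicate a positive result are all hypothetical.

```python
def roc_points(values, labels):
    """Return (false positive rate, sensitivity) pairs for every threshold.

    values: biomarker results (assumed here: higher means more abnormal)
    labels: 1 if the reference standard (histopathology) is positive, else 0
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative animal")
    points = [(0.0, 0.0)]
    # Each distinct observed value serves as a cut-off: positive if value >= t.
    for t in sorted(set(values), reverse=True):
        tp = sum(1 for v, y in zip(values, labels) if v >= t and y == 1)
        fp = sum(1 for v, y in zip(values, labels) if v >= t and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical fold-change values; the first three animals have no finding.
marker = [0.8, 1.0, 1.5, 1.2, 2.7, 6.4]
finding = [0, 0, 0, 1, 1, 1]
curve = roc_points(marker, finding)
print(f"AUC = {auc(curve):.3f}")
```

An AUC near 0.5 indicates that the marker does not discriminate between animals with and without the finding, whereas values approaching 1.0 indicate good discrimination.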

An example useful to illustrate this position is random multifocal or single-cell hepatocellular necrosis in rat liver, a spontaneous finding that is known to be exacerbated by treatment, which might be associated with increases in serum alanine aminotransferase (ALT) activity. Histopathology and serum ALT activity data from a recent hepatotoxicogenomics project were evaluated by ROC analysis for comparison of spontaneous versus treatment-related hepatocellular necrosis. In brief, male Sprague-Dawley rats (N = 3,204) were given 182 different treatments for four or fourteen days. Study end points collected on Day 5 or Day 15 included histopathology, clinical pathology, body and relative liver weight, and liver gene expression (Ennulat et al. 2010). Numeric data were normalized across studies to concurrent control means, such that individual values for treated animals were expressed as a fold change of the control mean.
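The normalization step described above can be illustrated with a short sketch. The function name, the dictionary layout, and the ALT values are hypothetical and chosen only to show the arithmetic; they are not data from the project.

```python
from statistics import mean

def fold_change_vs_controls(study):
    """Express each treated animal's value as a fold change of the
    concurrent control mean for the same study."""
    control_mean = mean(study["control"])
    if control_mean == 0:
        raise ValueError("control mean is zero; fold change is undefined")
    return [value / control_mean for value in study["treated"]]

# Hypothetical serum ALT activities (U/L) from a single study.
study = {
    "control": [40.0, 50.0, 60.0],   # concurrent control animals
    "treated": [50.0, 135.0, 320.0], # treated animals from the same study
}
fold_changes = fold_change_vs_controls(study)
print([round(fc, 2) for fc in fold_changes])
```

Normalizing within each study before pooling places values from different studies on a common scale, so that differences in baseline between studies do not masquerade as treatment effects.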

Spontaneous or non–dose-responsive single-cell or multifocal random hepatocellular necrosis occurred in eighty-three rats given sixty-two different treatments, and treatment-related or dose-responsive hepatocellular necrosis was identified in seventy-nine rats given twenty-two different treatments. No histopathology findings were identified in 1,660 rats given 179 different treatments. Diagnostic performance and magnitude of change of ALT and other conventional hepatobiliary markers are summarized in Table 2, and ROC plots for ALT are provided in Figure 1.

Table 2.

Serum analyte values and liver cell necrosis in rat studies.

	Liver histopathology	Dose-responsive random multifocal or single-cell necrosis (n = 79)	Spontaneous random multifocal or single-cell necrosis (n = 83)	No abnormalities detected (n = 1,658)
ALT	AUC (SE)	0.785 (0.04)	0.562 (0.04)	0.605 (0.01)
	95CI	0.716 – 0.854	0.48 – 0.57	0.59 – 0.63
	MFC (SD)	6.4 (8.8)	1 (0.5)	1 (0.5)
	median	2.7	0.9	0.9
AST	AUC (SE)	0.801 (0.03)	0.566 (0.04)	0.625 (0.01)
	95CI	0.743 – 0.858	0.48 – 0.53	0.61 – 0.63
	MFC (SD)	4.3 (6.2)	1.1 (0.5)	1.1 (0.5)
	median	2.2	1	1
Tbili	AUC (SE)	0.743 (0.03)	0.53 (0.05)	0.627 (0.01)
	95CI	0.688 – 0.798	0.44 – 0.59	0.61 – 0.6
	MFC (SD)	4.9 (10.2)	1.3 (1)	1.1 (0.9)
	median	1.7	1	1
SBA	AUC (SE)	0.754 (0.03)	0.59 (0.04)	0.6 (0.01)
	95CI	0.704 – 0.804	0.51 – 0.57	0.58 – 0.56
	MFC (SD)	3.6 (6.5)	0.8 (0.5)	1 (1.2)
	median	1.8	0.8	0.8
ALP	AUC (SE)	0.528 (0.04)	0.565 (0.05)	0.565 (0.01)
	95CI	0.449 – 0.607	0.48 – 0.52	0.54 – 0.54
	MFC (SD)	1.2 (0.7)	1 (0.3)	1 (0.2)
	median	1	0.9	1
GGT	AUC (SE)	0.592 (0.03)	0.515 (0.03)	0.536 (0.01)
	95CI	0.539 – 0.644	0.47 – 0.53	0.52 – 0.55
	MFC (SD)	2 (6.7)	1.2 (1.6)	9.9 (60.3)
	median	0	0.5	0
Chol	AUC (SE)	0.673 (0.03)	0.532 (0.04)	0.552 (0.01)
	95CI	0.605 – 0.741	0.45 – 0.54	0.53 – 0.55
	MFC (SD)	1.6 (0.9)	1.1 (0.5)	1.1 (0.4)
	median	1.4	1	1.1
Trig	AUC (SE)	0.646 (0.04)	0.535 (0.05)	0.546 (0.01)
	95CI	0.573 – 0.719	0.44 – 0.63	0.52 – 0.57
	MFC (SD)	1.7 (1.4)	1.1 (1.2)	1 (0.6)
	median	1.4	0.9	0.9

Abbreviations: 95CI, 95% confidence interval; ALP, alkaline phosphatase; ALT, alanine aminotransferase; AST, aspartate aminotransferase; AUC, area under the curve; Chol, cholesterol; GGT, gamma glutamyl transferase; MFC, mean fold change of concurrent control mean; SBA, serum bile acids; SD, standard deviation; n, number of animals; SE, standard error; Tbili, total bilirubin; Trig, triglycerides.

Figure 1.

ROC curves for serum ALT concentrations for rats with treatment-related or dose-responsive random multifocal or single-cell necrosis (red), spontaneous or background random multifocal or single-cell necrosis (green), or no abnormalities detected (blue).

Based on these data, ALT had no diagnostic utility for spontaneous random multifocal or single-cell hepatocellular necrosis based on magnitude of ALT AUC or serum ALT activity relative to respective control values. In contrast, treatment-related or dose-responsive exacerbation of random multifocal or single-cell hepatocellular necrosis was associated with significantly increased ALT AUC and serum ALT activity.

Thus, the presence of spontaneous histopathology findings in safety biomarker qualification studies requires reference to a control group to “normalize” histopathology findings and discriminate treatment-related from spontaneous findings. Further, all biomarker data are included in a comprehensive ROC analysis, and based on this unbiased analysis, spontaneous histopathology findings have minor impact on diagnostic performance of candidate biomarkers. Most importantly, consistency in the histopathology evaluation (i.e., accuracy of the reference standard) is of greater importance than blinding of pathologists to treatment group in both nonclinical safety evaluation and biomarker qualification studies.

Overview of Blinded Evaluation and Its Appropriate Application for Biomarker Qualification

Blinded histopathology evaluation, as defined herein, is a relatively common practice employed on a selective basis in research. Requirements include the complete and thorough morphologic characterization of histopathology finding(s) associated with a disease or toxicant, and the development of a predefined set of criteria to evaluate and score each characteristic of the finding. In general, blinded histopathology is applicable under the following three scenarios.

Evaluation of Well-Characterized Diseases or Animal Models

Dixon et al. (2004) demonstrated the effect of weight loss on non-alcoholic fatty liver disease by blinded assessment of non-alcoholic steatohepatitis using the method proposed by the American Association of Liver Diseases. This method is a published and validated scoring system for non-alcoholic fatty liver disease and steatohepatitis (Kleiner et al. 2005). Similarly, Trebino et al. (2003) studied the effect of inducible prostaglandin E synthase on acute and chronic inflammation in knockout mice, including an experimental model of human rheumatoid arthritis. The effect of PGE synthase was studied by blinded histopathologic scoring of stifle joints using a well-defined and widely used scoring technique for osteoarthritis (Mankin et al. 1971).

Quantitative Analysis

A blinded histopathology evaluation is often used when microscopic features can be quantified and compared across treatment groups as end points on a continuous scale. For example, Konikoff et al. (2006) conducted a placebo-controlled, double-blinded trial to study the effect of fluticasone propionate on pediatric eosinophilic esophagitis. Effectiveness of treatment was measured by counting the number of intraepithelial eosinophils in all 400× high-power fields in a single histologic section of each biopsy specimen. One might envision a comparable approach in which a well-characterized effect of a test article can be studied by counting elements such as specific cell types, apoptotic bodies, mitotic figures, and so on, or by measuring thickness, height, or surface area of particular cells or cell layers, particularly through the use of a stereologic approach.

Analysis for Identification of a Specific Diagnosis

Tokumaru et al. (2004) compared the performance of a panel of hypermethylation markers for glutathione S-transferase against the reference standard of a histopathology diagnosis of adenocarcinoma performed by a single pathologist. By blinded evaluation, the biopsy specimens were assigned a Gleason score, which is a well-established method of grading prostate adenocarcinomas for prognosis (Gleason 1977). In this case, the pathologist was blinded to biomarker results but was expected to conduct a purely diagnostic exercise using well-established and accepted criteria.

The common theme in the above examples is that blinded evaluation as the first tier of histopathology is effective only when the end point is well defined, quantified, and calibrated before the analysis is initiated. Such characterization of histopathology end points is often achieved by methods that have been developed over years of observation, application, and periodic re-evaluation and have been subjected to the rigors of a peer-reviewed publication process.

The same principles can be applied to the histopathology evaluation of safety biomarker studies, even when the end point is not well defined, by using targeted masked evaluation as a second layer in a tiered approach. The first layer in this tiered approach is an “unblinded” analysis to characterize the finding by comparing treated and control specimens, with the objectives to recognize especially subtle findings such as the increase in incidence or severity of spontaneous findings, and to develop scoring criteria. Following this first tier analysis, the pathologist may re-examine selected or all treated groups without knowledge of either animal or group identity (i.e., a targeted masked evaluation). This method allows confirmation of subtle treatment-related findings that can be consistently differentiated from those that occur in controls.
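The mechanics of a targeted masked evaluation, in which slide identity is temporarily hidden and later restored, might be sketched as follows. The coding scheme, function names, and slide identifiers are hypothetical; in practice the masking is typically handled by a colleague or by laboratory procedures rather than by the pathologist's own script.

```python
import random

def mask_slides(slide_ids, seed=None):
    """Assign a neutral code to each slide in random order.

    Returns a key mapping masked code -> original slide identity. The key
    is held back from the pathologist until scoring is complete.
    """
    rng = random.Random(seed)
    order = list(slide_ids)
    rng.shuffle(order)  # random order so codes carry no group information
    return {f"M{idx:03d}": slide_id for idx, slide_id in enumerate(order, start=1)}

def unmask(scores_by_code, key):
    """Map scores recorded against masked codes back to slide identities."""
    return {key[code]: score for code, score in scores_by_code.items()}

# Hypothetical slide identifiers from control (C) and treated (T) animals.
slides = ["C-01", "C-02", "T-01", "T-02"]
key = mask_slides(slides, seed=1)

# The pathologist scores each masked code (placeholder grade of 0 here).
masked_scores = {code: 0 for code in key}
scores_by_slide = unmask(masked_scores, key)
```

Keeping the key separate from the scoring sheet is what allows the masked scores to be compared afterward with the first-tier unblinded findings.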

This tiered approach is often used as a tool by pathologists in evaluation of nonclinical safety studies and serves the dual purpose of maintaining the sensitivity of the pathologist in recognizing subtle changes while also providing an unbiased assessment. When this process is further combined with the rigors of a peer review process, the results and interpretations are of high quality and consistency. It is the opinion of the STP that the process of targeted masked evaluation, when justified by the study pathologist, is a true blinded evaluation, as it ensures removal of expectation bias and ensures validity of histopathology interpretation (Table 1).

Best Practices in the Histopathology Evaluation of Nonclinical Safety Biomarker Qualification Studies

Best practices for histopathology review of nonclinical safety studies (Crissman et al. 2004) generally apply to biomarker qualification studies.

The Society of Toxicologic Pathology has endorsed a set of best practices relating to the methodology applied to histopathology evaluation within the context of nonclinical safety studies to support clinical studies and new product registration (Crissman et al. 2004). Key elements of these best practices indicate that the study pathologist should have: (1) knowledge of the treatment group from which the sample was obtained; and (2) complete knowledge of all available in-life and pathology data that relate to the animal from which the tissue was obtained. This informed type of analysis often is referred to as “unblinded,” because the study pathologist is not naive to the study data during the slide evaluation.

Additionally, these best practices describe a tiered approach in that the primary histopathology review can, at the discretion of the pathologist, be followed by a targeted masked evaluation to resolve subtle quantitative or qualitative variables across treatment groups. Subsequently, a peer review of histopathology is conducted in an unblinded fashion, during or after which the blinded or targeted masked evaluation can be used at the discretion of the peer reviewer.

Histopathology evaluation in biomarker qualification studies must be performed by a qualified pathologist using a consistent evaluation method.

Similar to the practice of many other medical subspecialties, proficiency in pathology is gained through intensive academic experience, hands-on training, and repeated rigorous diagnostic application. This expertise is gained initially through extensive apprenticeship with experienced pathologists who already are proficient in the discipline, and this skill subsequently is refined throughout the entire professional life of a pathologist.

In the United States, most veterinary pathologists are diplomates of the American College of Veterinary Pathologists (ACVP), and the diplomate status is obtained through a certifying examination offered by the ACVP. This certifying examination is recognized by the American Board of Veterinary Specialties of the American Veterinary Medical Association. Eligibility for this certification process requires demonstration of broad and comprehensive training under the supervision of pathologists holding ACVP diplomate status. A smaller proportion of toxicologic pathologists is not affiliated with the ACVP, but these pathologists tend to be experts with extensive experience in the discipline. A similar certifying body, the European College of Veterinary Pathologists (ECVP), has been developed with the intent to ensure high standards of training and competency of pathologists in Europe, and more recently, this model has been adopted in Japan, South Korea, and India (Bolon et al. 2011). In South American countries such as Brazil, Argentina, and Colombia, current emphasis is on improving the training of veterinary pathologists through collaborations with established organizations such as the C. L. Davis Foundation. During this training, a critical step is learning to understand and assimilate the extremely wide variability of normal microscopic structure or biological variation of various organs (compounded for a veterinary pathologist as a result of interspecies and interstrain variations in these features and their associated physiology). Additional critical learning is to understand and detect manifestations of important cellular processes in microscopic sections and to develop an active vocabulary to accurately describe a finding. At a more advanced stage, the pathologist learns to recognize complex morphologic changes in tissues and then to infer the presence of a particular disease manifestation.

When arriving at a diagnosis in experimental pathology settings (such as in vivo nonclinical safety studies), the pathologist draws extensively on training and experience. Given the wide range of novel test articles evaluated in nonclinical safety studies, pathologists frequently encounter unique and subtle morphologic manifestations of tissue injury against a background of extensive variations of normal morphology. Reliance on training and memory alone to distinguish normal from abnormal can lead to diagnostic inconsistency, as both inherently include a learning bias. Thus, it is important that the first-tier evaluation be made by comparison with concurrent study controls in order to accurately discriminate treatment-related findings. Study controls also serve to minimize reader variability and to maximize diagnostic consistency.

Sampling bias in tissue selection and sectioning is inherent to histopathology evaluation and must be managed appropriately.

Laboratories conducting nonclinical safety studies normally have stringent standard operating procedures that standardize tissue sampling techniques, including spatial orientation and plane of sectioning (Bucci 2002). These practices of standardization, which are purposely biased to provide “optimum routine sections,” are critical to the systematic and consistent qualitative evaluation of histopathology findings, and departure from the standard can introduce variability that is extraneous to treatment-related effects (Bucci 2002).

Given the importance of consistent section presentation, as described above, this intentional sampling bias is necessary to facilitate detection of subtle changes between treatment groups and to maximize both the sensitivity and selectivity of a qualitative histopathology evaluation.

Changes indistinguishable from spontaneous findings in tissues must be identified and interpreted accurately in biomarker studies.

As with biological variation of clinical pathology parameters in normal or control animals in nonclinical safety studies (Carakostas and Banerjee 1990), variability in both the prevalence and character of spontaneous histopathology findings in control animals is well recognized. Biological variability in either biochemical or histopathology data is influenced by many variables including, but not limited to, age, sex, strain, source, and diet (Qin et al. 2007; Roe 1988).

Identification of spontaneous histopathology findings generally does not require knowledge of dose group but does require knowledge of control group. In addition, these spontaneous findings may bear little relationship to changes in traditional clinical pathology parameters. This was illustrated previously in the example of random multifocal or single-cell hepatocellular necrosis, in which spontaneous or non–dose-responsive single-cell hepatocellular necrosis had no impact on the diagnostic performance of ALT, based on either ROC analysis or serum activity relative to controls. In contrast, diagnostic accuracy and magnitude of change in ALT values were significantly higher for a dose-responsive, morphologically identical finding that was treatment related and comparable to other treatment-related manifestations of necrosis.
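The ROC comparison described above can be illustrated with a minimal Python sketch. It computes the area under the ROC curve as the probability that a randomly chosen animal with the finding has a higher ALT value than a randomly chosen animal without it, with ties counted as one half. The function name and all ALT values are hypothetical and are not data from the cited studies.

```python
def roc_auc(positive_values, negative_values):
    """Area under the ROC curve by pairwise comparison (Mann-Whitney form)."""
    if not positive_values or not negative_values:
        raise ValueError("both groups need at least one value")
    wins = 0.0
    for p in positive_values:
        for n in negative_values:
            if p > n:
                wins += 1.0   # finding-positive animal ranks higher
            elif p == n:
                wins += 0.5   # ties count one half
    return wins / (len(positive_values) * len(negative_values))


# Hypothetical ALT activities (U/L), invented for illustration only.
controls = [30, 35, 40, 45]
spontaneous_necrosis = [32, 38, 41, 44]
treatment_related_necrosis = [90, 120, 150, 200]

print(roc_auc(spontaneous_necrosis, controls))        # 0.5625
print(roc_auc(treatment_related_necrosis, controls))  # 1.0
```

With these invented values, the spontaneous finding yields an area near 0.5 (no discrimination from controls), whereas the treatment-related finding yields 1.0. The sketch shows only the form of the calculation, not the magnitude of any real effect.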

A tiered approach to evaluation should be applied to biomarker qualification studies, where appropriate.

Biomarker qualification is a progressive process requiring a systematic approach that is determined by many factors. Initially, data may be collected that support the use of the biomarker in a specific context dictated by the study design used for qualification. These data may qualify a biomarker for use in a defined population. With additional study that includes assessment of diagnostic criteria, the biomarker may be qualified or further validated for use as a diagnostic/prognostic clinical tool on an individual patient basis, depending upon validation status of the analytic assay. Hence, the criteria for qualification or validation are driven by both the context in which the marker is being used and the stage of assay development.

The approach to evaluation of histopathology findings in tissue is driven by the context in which the biomarker will be used. In studies conducted to qualify biomarkers of tissue injury for use in nonclinical safety assessment, classical regulatory toxicology study designs have been used with qualitative histopathology as the anchoring reference standard. The employed histopathology approaches have been optimized and proven to provide sufficient sensitivity to assign treatment-related histopathology findings to a given dose/exposure (Crissman et al. 2004). In this context, quantitative end points that reflect global organ function, such as biochemical or biomarker data, may show strong correlation with qualitative histopathology on a dose/exposure or group basis; however, the correlation may sometimes fail on an individual animal basis. Incomplete concordance at the individual animal level should not negate the use of the biomarker for an application to detect tissue injury or altered organ function on a dose group basis. This lack of concordance may derive from many causes including, but not limited to, timing of sampling of biochemical data relative to onset of injury, development stage or accuracy of the biomarker assay, histologic sampling, or subthreshold severity of tissue injury. Thus, improved individual animal correlation may require subsequent additional analysis using refined or more sensitive morphologic approaches such as immunohistochemistry, in situ hybridization, electron microscopy, or stereological analyses to provide morphologic correlates for changes that are subtle and below the threshold of sensitivity of routine qualitative histopathology. The application of such sensitive techniques is seldom warranted for initial qualification but may be useful to address specific applications or questions.
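The distinction between group-level and individual-animal concordance can be sketched in a few lines of Python. The record layout, the field names, and the rule that a dose group is called positive when any animal in it is positive are assumptions made purely for illustration; a real study would define these calls in its protocol.

```python
from collections import defaultdict


def individual_concordance(records):
    """Fraction of animals whose biomarker call matches the histopathology call."""
    agree = sum(1 for r in records
                if r["histo_positive"] == r["biomarker_positive"])
    return agree / len(records)


def group_concordance(records):
    """Fraction of dose groups whose biomarker call matches histopathology.

    Assumption: a group is positive if any animal in it is positive.
    """
    by_group = defaultdict(list)
    for r in records:
        by_group[r["group"]].append(r)
    agree = 0
    for animals in by_group.values():
        histo = any(a["histo_positive"] for a in animals)
        marker = any(a["biomarker_positive"] for a in animals)
        if histo == marker:
            agree += 1
    return agree / len(by_group)


# Hypothetical animals, invented for illustration only.
records = [
    {"group": "control", "histo_positive": False, "biomarker_positive": False},
    {"group": "control", "histo_positive": False, "biomarker_positive": False},
    {"group": "high", "histo_positive": True, "biomarker_positive": False},
    {"group": "high", "histo_positive": False, "biomarker_positive": True},
]

print(individual_concordance(records))  # 0.5
print(group_concordance(records))       # 1.0
```

In this invented example the two high-dose animals disagree individually, yet the dose group as a whole is flagged by both histopathology and the biomarker, which is the situation the paragraph above describes.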

A successfully qualified novel biomarker that has significantly improved sensitivity over traditional biomarkers could theoretically enable clinical studies that may not have otherwise been possible owing to safety concerns. It is therefore critical that histopathology data from preclinical toxicity studies supporting biomarker qualification be as accurate as possible and, in particular, that the histopathology evaluation be conducted in a way that minimizes the bias that could result in overestimation of biomarker sensitivity. The overestimation of biomarker sensitivity, based on histopathology, is possible only if the histopathologist has concurrent access to the values for the biomarker undergoing qualification. Thus, in the context of nonclinical safety biomarker qualification studies, blinding of the pathologist to the novel biomarker result is an effective measure to guard against conscious or unconscious expectation bias and is therefore recommended as the standard of practice for these types of studies. This approach is recommended by the STARD initiative (Bossuyt et al. 2003) and is routinely performed in nonclinical biomarker qualification studies (Dieterle et al. 2010). Further, upon completion of biomarker studies, pathologist input in meta-analyses of biomarker data is critical to accurately contextualize findings (e.g., drug-induced vs. spontaneous nephropathy), prioritize primary versus secondary pathologic processes, or distill redundant or continuous findings into shared morphologic diagnoses, to provide greater power to statistical analyses.
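One way to picture this selective blinding is as a filter on the data released to the pathologist: treatment group, in-life data, and traditional clinical pathology remain visible, while the values of the biomarker under qualification are withheld. The Python sketch below assumes a simple dictionary per animal; the function name, the field names, and the values are hypothetical.

```python
def pathologist_view(animal_record, novel_biomarker_fields):
    """Return a copy of the record without the novel biomarker values."""
    return {field: value
            for field, value in animal_record.items()
            if field not in novel_biomarker_fields}


# Hypothetical animal record; every field name and value is illustrative.
animal = {
    "animal_id": "A001",
    "dose_group": "high",
    "body_weight_g": 312,
    "alt_u_per_l": 95,
    "novel_marker_x": 4.2,
}

released = pathologist_view(animal, {"novel_marker_x"})
print(sorted(released))
# ['alt_u_per_l', 'animal_id', 'body_weight_g', 'dose_group']
```

The withheld values would be rejoined with the histopathology findings only after the slide evaluation is complete, at which point the pathologist contributes to the meta-analysis.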

Blinded histopathology evaluation may be appropriate in cases where the toxicity model is well characterized and accepted scoring criteria have been determined from previous studies, and where these requirements are established prior to the conduct of the biomarker qualification study. Specifically, all of the following criteria should be met if blinded evaluation is to be considered:

a. Spontaneous findings in the target tissue are well characterized.

b. Treatment dose/exposure yields a consistent response (or range of responses).

c. Time-course for the tissue response is known and documented.

d. Qualitative and quantitative tissue responses are well documented or defined.

e. Detailed criteria for characterizing and scoring tissue findings have been specified and broadly accepted.

f. Expected tissue findings are well illustrated and available to the pathologist (either representative slides or an extensive set of images, both with corresponding scores assigned).

In the event that one or more of the above criteria cannot be met, blinded histopathology evaluation is not appropriate and thus should not be exercised.
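The all-or-nothing nature of criteria a through f can be expressed as a simple checklist. The Python sketch below is only an illustration of that decision rule; the class, its field names, and the helper functions are hypothetical, and the judgment behind each yes/no answer remains with the pathologist.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class BlindedReviewCriteria:
    """Hypothetical yes/no record for criteria a-f listed above."""
    spontaneous_findings_characterized: bool  # a
    dose_response_consistent: bool            # b
    time_course_documented: bool              # c
    tissue_responses_defined: bool            # d
    scoring_criteria_accepted: bool           # e
    reference_material_available: bool        # f


def unmet_criteria(criteria):
    """Names of the criteria that are not satisfied."""
    return [f.name for f in fields(criteria) if not getattr(criteria, f.name)]


def blinded_review_appropriate(criteria):
    """Blinded evaluation is considered only if every criterion is met."""
    return not unmet_criteria(criteria)


# Hypothetical study in which the time course is not yet documented.
study = BlindedReviewCriteria(True, True, False, True, True, True)
print(blinded_review_appropriate(study))  # False
print(unmet_criteria(study))              # ['time_course_documented']
```

Listing the unmet criteria, rather than returning only a yes/no answer, makes it clear which gap in model characterization would need to be closed before a blinded design could be reconsidered.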

Conclusion

Qualification of safety biomarkers in nonclinical studies remains a topic of importance and one in which continued and comprehensive research activities are anticipated. Thorough review of the practices surrounding histopathology evaluation within safety biomarker qualification studies has prompted questions that are addressed herein. The significant body of experience from other applications of histopathology in the context of well-controlled studies, coupled with personal or institutional experience, was used to provide a set of best practices to guide the histopathology evaluation within safety biomarker qualification studies.

Footnotes

The recommendations contained in this manuscript have been endorsed by the Society of Toxicologic Pathology. The Regulatory Forum is designed to stimulate broad discussion of topics relevant to regulatory issues in toxicologic pathology. Readers of Toxicologic Pathology are encouraged to send their thoughts on these articles or ideas for new topics to regulatoryforum@toxpath.org.

John Burkhardt, Abbott Laboratories: regular salary as an employee and use of office materials.

Karamjeet Pandher, Pfizer: regular salary as an employee and use of office materials.

Phil Solter, University of Illinois: regular salary as an employee and use of office materials.

Sean Troth, Merck: regular salary as an employee and use of office materials.

Rogley Boyce, Amgen: regular salary as an employee and use of office materials.

Tanja Zabka, Roche: regular salary as an employee and use of office materials.

Daniela Ennulat, GlaxoSmithKline: regular salary as an employee and use of office materials.

References

Bolon, Barale-Thomas, Bradley, Ettlin R. A., Franchi C. A., George, Giusti A. M., Hall, Jacobsen, Konishi, Ledieu, Morton, Park J. H., Scudamore C. L., Tsuda, Vijayasarathi S. K., Wijnands M. V. (2011). International recommendations for training future toxicologic pathologists participating in regulatory-type, nonclinical toxicity studies. Exp Toxicol Pathol 63, 187–95.

Bossuyt P. M., Reitsma J. B., Bruns D. E., Gatsonis C. A., Glasziou P. P., Irwig L. M., Lijmer J. G., Moher, Rennie, de Vet H. C. W. (2003). Towards complete and accurate reporting of studies of diagnostic accuracy: The STARD Initiative. Clin Chem 49, 1–6.

Brealey S. D., Scally A. J., Hahn, Godfrey (2007). Evidence of reference standard related bias in studies of plain radiography reading performance: A meta-regression. Br J Radiol 80, 406–13.

Bucci T. J. (2002). Basic techniques. In Handbook of Toxicologic Pathology, 2nd ed. (Haschek W. M., Rousseaux C. G., Wallig M. A., eds.), vol. 1, chap. 8, pp. 171–85. Academic Press, San Diego, CA.

Burkhardt J. E., Ennulat, Pandher, Solter, Troth, Waite Boyce, Zabka (2010). Letter to editor: Topic of histopathology in biomarker qualification studies. Toxicol Pathol 38, 666–67.

Carakostas M. C., Banerjee A. K. (1990). Interpreting rodent clinical laboratory data in safety assessment studies: Biological and analytical components of variation. Fundam Appl Toxicol 15, 744–53.

Crissman J. W., Goodman D. G., Hildebrandt P. K., Maronpot R. R., Prater D. A., Riley J. H., Seaman W. J., Thake D. C. (2004). Best practices guideline: Toxicologic pathology. Toxicol Pathol 32, 126–31.

Deeks J. J. (1999). Using evaluations of diagnostic tests: Understanding their limitations and making the most of available evidence. Ann Oncol 10, 761–68.

Dieterle, Sistare, Goodsaid, Papaluca, Ozer J. S., Webb C. P., Baer, Senagore, Schipper M. J., Vonderscher, Sultana, Gerhold D. L., Phillips J. A., Maurer, Carl, Laurie, Harpur, Sonee, Ennulat, Holder, Andrews-Cleavenger, Gu Y. Z., Thompson K. L., Goering P. L., Vidal J. M., Abadie, Maciulaitis, Jacobson-Kram, Defelice A. F., Hausner E. A., Blank, Thompson, Harlow, Throckmorton, Xiao, Taylor, Vamvakas, Flamion, Lima B. S., Kasper, Pasanen, Prasad, Troth, Bounous, Robinson-Gravatt, Betton, Davis M. A., Akunda, McDuffie J. E., Suter, Obert, Guffroy, Pinches, Jayadev, Blomme E. A., Beushausen S. A., Barlow V. G., Collins, Waring, Honor, Snook, Lee, Rossi, Walker, Mattes (2010). Renal biomarker qualification submission: A dialog between the FDA-EMEA and Predictive Safety Testing Consortium. Nature Biotechnol 28, 455–62.

Dixon J. B., Bhathal J. S., Hughes N. R., O’Brien P. E. (2004). Non-alcoholic fatty liver disease: Improvement in liver histologic analysis with weight loss. Hepatology 39, 1647–54.

Ennulat, Magid-Slav, Rehm, Tatsuoka (2010). Diagnostic performance of traditional hepatobiliary markers of drug-induced liver injury in the rat. Toxicol Sci 116, 397–412.

Filler, Bökenkamp, Hofmann, Le Bricon, Martínez-Brú, Grubb (2005). Cystatin C as a marker of GFR—history, indications, and future research. Clin Biochem 38, 1–8.

Gleason D. F. (1977). The Veteran’s Administration Cooperative Urologic Research Group: Histologic grading and clinical staging of prostatic carcinoma. In Urologic Pathology: The Prostate (Tannenbaum M., ed.), pp. 171–98. Lea and Febiger, Philadelphia, PA.

Grimes D. A., Schultz K. F. (2002). Bias and causal associations in observational research. Lancet 359, 248–52.

Grubb, Simonsen, Sturfelt, Truedsson, Thysell (1985). Serum concentration of cystatin C, factor D and β2-microglobulin as a measure of glomerular filtration rate. Acta Med Scand 218, 499–503.

Kleiner D. E., Brunt E. M., Van Natta M. L., Behling, Constos M. J., Cummings O. W. (2003). Design and validation of a histological scoring system for non-alcoholic fatty liver disease. Hepatology 41, 1313–21.

Konikoff M. R., Noel R. J., Blanchard, Kirby, Jameson S. C. (2006). A randomized, double-blind, placebo-controlled trial of fluticasone propionate for pediatric eosinophilic esophagitis. Gastroenterology 131, 1381–91.

Leeflang M. M., Deeks J. J., Gatsonis, Bossuyt P. M. M. (2008). Systematic reviews of diagnostic test accuracy. Ann Intern Med 149, 889–97.

Loy C. T., Irwig (2004). Accuracy of diagnostic tests read with and without clinical information: a systematic review. JAMA 292, 1602–9.

Mankin, Dorfman, Lippiello, Zarins (1971). Biochemical and metabolic abnormalities in articular cartilage from osteo-arthritic human hips. II. Correlation of morphology with biochemical and metabolic data. J Bone Jt Surg Am 53, 523–27.

Norman G. R., Coblentz C. L., Brooks L. R., Babcock C. J. (1992). Expertise in visual diagnosis: A review of the literature. Acad Med 67, S78–S83.

Qin L. Q., Wang J. Y., Kaneko, Sato, Wang P. Y. (2007). One-day restriction changes hepatic metabolism and potentiates the hepatotoxicity of carbon tetrachloride and chloroform in rats. Tohoku J Exp Med 212, 379–87.

Ransohoff D. F. (2005). Bias as a threat to the validity of cancer molecular-marker research. Nature Rev Cancer 5, 142–49.

Roe F. J. C. (1988). Toxicity testing: some principles and some pitfalls in histopathological evaluation. Hum Toxicol 7, 405–10.

Rousselet M. C., Michalak, Dupré, Croué, Bedossa, Saint-André J. P., Calès, Hepatitis Network (2005). Sources of variability in histological scoring of chronic viral hepatitis. Hepatology 41, 257–64.

Sica G. T. (2006). Bias in research studies. Radiology 238, 780–89.

Tokumaru, Harden S. V., Sun, Yamashita, Epstein J. I., Sidransky (2004). Optimal use of a panel of methylation markers with GSTP1 hypermethylation in the diagnosis of prostate adenocarcinoma. Clin Cancer Res 10, 5518–22.

Trebino C. A., Stock, Gibbons C. P., Naiman, Roach M. L., Wachtmann T. S., Pandher, Lapointe J-M., Carter, Thomas N. A., Durtschi, Saha, Jakobsson P-J., Hambor J. E., McNeish, Carty T. J., Audoly L. P. (2003). Impaired inflammatory and pain responses in mice lacking an inducible prostaglandin E synthase. Proc Natl Acad Sci 100, 9044–49.

Weinberger M. A. (1979). How valuable is blind evaluation in histopathologic examinations in conjunction with animal toxicity studies? Toxicol Pathol 7, 14–17.