Abstract
Background
Dysplasia in Barrett's esophagus (BE) biopsies is associated with low observer agreement among general pathologists. Therefore, expert review is advised. We are developing a web-based, national expert review panel for histological review of BE biopsies.
Objective
The aim of this study was to create benchmark quality criteria for future members.
Methods
Five expert BE pathologists, each with 10–30 years of BE experience and a weekly caseload of 5–10 BE cases (of which 25% dysplastic), assessed a set of 60 digitalized cases enriched for dysplasia. Each case contained all slides from one endoscopy (non-dysplastic BE (NDBE), n = 21; low-grade dysplasia (LGD), n = 20; high-grade dysplasia (HGD), n = 19). All cases were randomized and assessed twice, followed by group discussions to create a consensus diagnosis. Outcome measures were: percentage of ‘indefinite for dysplasia’ (IND) diagnoses, intra-observer agreement, and agreement with the consensus ‘gold standard’ diagnosis.
Results
The mean percentage of IND diagnoses was 8% (95% prediction interval (PI) 3–14%), the mean intra-observer agreement was 0.84 (95% PI 0.66–1.02), and the mean agreement with the consensus diagnosis was 90% (95% PI 82–98%).
Conclusion
Expert pathology review of BE is characterized by a limited percentage of IND diagnoses, consistent assessments, and high agreement with a consensus gold standard diagnosis. These benchmark quality criteria will be used to assess the performance of other pathologists joining our panel.
Key summary
Barrett's esophagus (BE) with low-grade dysplasia (LGD) is an independent risk factor for the development of esophageal cancer. Interobserver agreement for the diagnosis of LGD by general pathologists is low. Review of LGD cases by expert pathologists can accurately stratify patients according to progression risk. However, what constitutes an expert pathologist has not yet been defined. We propose to quantify the expertise of pathologists assessing dysplastic BE through the establishment of benchmark values for four quality criteria. Adhering to these benchmark quality criteria can improve the uniformity of interpretation of dysplastic BE and serve as useful criteria in a teaching environment.
Introduction
In BE, the normal stratified squamous epithelium of the distal esophagus is replaced by columnar epithelium containing intestinal metaplasia. BE is a known risk factor for esophageal adenocarcinoma (EAC), especially when dysplasia is present. BE biopsies are graded according to the modified Vienna criteria for gastrointestinal neoplasms.1 The grading of dysplasia in BE biopsies is difficult and associated with low observer agreement, because the morphological changes are gradual along the metaplasia-dysplasia-carcinoma sequence. Since endoscopic management of BE patients depends on the dysplasia grade,2–6 BE guidelines advise that all diagnoses of dysplasia be reviewed by an expert gastrointestinal (GI) pathologist.2–7 We have shown that such expert pathology review of BE biopsies has a significant impact on the management and outcome of patients.8–10 Based on this, and to implement recent BE guidelines, we set up a national digital review panel for dysplastic BE biopsy cases. This panel uses digital microscopy slides and is supported by all 15 expert BE pathologists from the eight BE expert centres in the Netherlands. The core of the panel consists of five pathologists who have worked together as a group for many years and who all have extensive experience in the field of BE neoplasia.11–13 One of the problems in creating such an expert panel is that pathology expertise is not easily quantified. In earlier publications, we have used the following qualifications for an expert BE pathologist: an actively practising histopathologist who has been dedicated to the field of Barrett's for a minimum of 5 years, has a minimum BE biopsy caseload of five cases per week of which ≥25% are dysplastic, has participated in multiple training programmes, is considered an expert by his or her peers, and has co-authored or peer-reviewed publications in this field.9,10,14–20
There are plans to expand the panel to include 10 other dedicated GI pathologists, working at the eight BE expert centres in the Netherlands. These pathologists have not been collaborating as intensively as the core group; therefore, they are currently participating in a structured self-assessment programme with multiple group discussions before joining the review panel. The goal of the current study was to establish quality parameters for our national digital BE review panel. For this, the five core expert BE pathologists reviewed all slides from all biopsies taken from 60 BE whole-endoscopy cases, followed by group discussions to create a consensus ‘gold standard’ diagnosis for all cases. The aim was to define benchmark quality criteria for future pathologists who wish to join this panel.
Materials and methods
Slide selection and scanning
We selected all formalin-fixed, paraffin-embedded tissue blocks and/or slides of 60 BE endoscopy procedures. The case set was enriched for dysplastic cases. Thirty-nine cases with an original diagnosis of LGD (n = 20) or HGD (n = 19) had been sent to our centre for consultation between 2012 and 2014. These 39 dysplastic cases were supplemented with 21 consecutive NDBE cases from a community hospital in the Amsterdam region. All cases were anonymized. Every case contained at least a Hematoxylin & Eosin (HE)-stained slide and a corresponding p53 immunohistochemically stained slide (clone DO-7+BP53-12, #MS-738-P, Thermo Fisher Scientific, Waltham, MA, USA). For each case, all slides were fully digitalized using a scanner with a ×20 microscope objective (Slide, Olympus, Tokyo, Japan). The scans were checked for focus and sharpness by the study coordinator and re-scanned if necessary. Subsequently, the slides were anonymized, randomized, renamed and stored on a secure server. The digital slides were viewed during the study with the virtual slide system ‘Digital Slidebox 4.5’ (http://dsb.amc.nl/dsb/login.php, Slidepath, Leica Microsystems, Dublin, Ireland).
Assessors
The core expert pathology panel consisted of five pathologists (FJWtK, CAS, SLM, MV, GJAO). They have been dedicated to the field of BE for a minimum of 10 years (range 10–30 years) and have a minimum caseload of 5–10 cases per week of which 25% are dysplastic. All pathologists have participated in the Dutch Barrett advisory committee for many years9,11,12 and are actively practising pathologists. All pathologists participated in multiple training programmes for endoscopists and pathologists (www.best-academia.eu) and each has co-authored more than 10 peer-reviewed publications in this field.8,9,12,15,17,18,20–25
Histologic assessment and earlier joint assessments and group discussions
The expert BE pathologists scored cases according to the modified Vienna criteria for gastrointestinal neoplasms.1,26 In a previous comparative study, they demonstrated that their histological assessment of glass slides and digitalized slides yielded comparable results.13 For the current study, the pathologists independently assessed all cases twice in random order, with a wash-out time of at least 1 month between the two rounds. They individually logged onto the virtual slide system to assess the cases. The study coordinator supervised all assessments and recorded the pathologists’ answers on a case record form. Diagnostic possibilities were: NDBE; LGD; HGD; or ‘indefinite for dysplasia’ (IND). After the two assessment rounds, a group discussion was held in which cases that did not have an agreement of 4/5 or 5/5 pathologists were discussed. After discussion, all cases had a diagnostic agreement of 4/5 or 5/5 pathologists, and these diagnoses were considered as the consensus gold standard diagnosis of each case.
Outcome measurements
The outcome measurements were: (1) the percentage of diagnoses ‘indefinite for dysplasia’ per pathologist; (2) the intra-observer agreement per pathologist; and (3) the percentage agreement with the consensus gold standard diagnosis per pathologist. The percentage of IND diagnoses was depicted as the mean percentage over two assessment rounds, per pathologist. The intra-observer agreement was measured in kappa (see below) and was calculated by comparing each pathologist's first and second assessment per case. The percentage agreement with the consensus gold standard diagnosis was defined as the proportion of correct diagnoses per pathologist when comparing these to the consensus gold standard diagnosis (Supplementary Figure 1, diagonal). The cases that were not in agreement with the consensus gold standard diagnosis were either overdiagnosed (i.e. given a higher diagnosis by the pathologist than the consensus gold standard diagnosis) or underdiagnosed (i.e. given a lower diagnosis by the pathologist than the consensus gold standard diagnosis; see Supplementary Figure 1, lower left and upper right triangles). An additional focus was put on the cases diagnosed in consensus as HGD that were misdiagnosed as NDBE by the individual pathologist (see the darker square at the top right corner of Supplementary Figure 1). All calculations were carried out by using the mean of the two assessment rounds.
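The over/underdiagnosis labelling described above can be sketched as follows. This is a minimal illustration, not the study's actual analysis code; the ordinal ranks (including placing IND between NDBE and LGD, which the paper notes is not always morphologically accurate) and the example cases are assumptions for demonstration only.

```python
# Hypothetical ordinal ranks for the four diagnostic categories.
# Note: the paper cautions that IND does not always sit morphologically
# between NDBE and LGD; this ordering is an illustrative assumption.
RANK = {"NDBE": 0, "IND": 1, "LGD": 2, "HGD": 3}

def classify(pathologist_dx: str, gold_dx: str) -> str:
    """Label one case as agreement, overdiagnosis, or underdiagnosis
    relative to the consensus gold standard diagnosis."""
    delta = RANK[pathologist_dx] - RANK[gold_dx]
    if delta == 0:
        return "agree"          # diagonal of the 4 x 4 cross-table
    return "over" if delta > 0 else "under"

# Hypothetical (pathologist, gold standard) pairs for three cases.
cases = [("LGD", "LGD"), ("HGD", "LGD"), ("NDBE", "HGD")]
labels = [classify(p, g) for p, g in cases]
```

The last pair (NDBE scored against a gold standard of HGD) corresponds to the clinically most consequential error highlighted in the paper.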
Statistical analysis
We studied the variation in the outcome parameters among our five core expert pathologists in order to use these as benchmark quality criteria for other pathologists joining the expert panel. For this, we considered them as a random sample from a hypothetical population of expert BE pathologists and assumed a normal distribution for the values. We calculated the mean and standard deviation (SD), from which the 95% prediction interval (PI) was constructed as a 2.776 × SD range around the mean (n = 5 pathologists yields 4 degrees of freedom). We assumed no statistical difference between pathologists if all values fell within this prediction interval. For the calculation of the intra-observer agreement, we used Cohen's kappa, a statistical measure of agreement adjusted for chance agreement.27,28 We used three diagnostic categories (NDBE; LGD + HGD; IND) and assigned custom weights to (dis)agreements, since the spectrum of morphological changes does not necessarily follow the diagnostic categories 1 to 4. For example, IND is ranked ‘2’ but is not always situated between NDBE (1) and LGD (3).13,29 Agreement was assigned a score of 1, disagreements between NDBE and HGD a score of 0, and all other disagreements a score of 0.5. Due to the possibility of skewed marginal totals, the maximum possible kappa per cross table does not always equal 1; therefore, the agreement is also depicted as a fraction of the maximum possible kappa. The agreement was traditionally categorized as follows: a value of zero or less indicates agreement no better than chance alone (‘poor’); 0.00–0.20, ‘slight’; 0.21–0.40, ‘fair’; 0.41–0.60, ‘moderate’; 0.61–0.80, ‘substantial’; 0.81–1.00, ‘almost perfect’.30 The percentage agreement with the consensus gold standard diagnosis was calculated by cross-tabulating the pathologist's diagnoses against the consensus gold standard diagnoses in a 4 × 4 table (NDBE; IND; LGD; HGD; see Supplementary Figure 1 and Supplementary Tables 1 and 2).
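The prediction-interval calculation above (mean ± 2.776 × SD, using the two-sided 95% t-quantile for 4 degrees of freedom) can be sketched as follows. The per-pathologist agreement values are hypothetical, chosen only to illustrate the arithmetic; the study's own values differ.

```python
import statistics

# Hypothetical percentage-agreement values for five pathologists
# (illustrative only; not the study's actual data).
values = [88.0, 92.0, 86.0, 93.0, 91.0]

mean = statistics.mean(values)
sd = statistics.stdev(values)  # sample SD: n - 1 = 4 degrees of freedom

# Two-sided 95% t-quantile for 4 df, as used in the paper.
T_4DF = 2.776

lower = mean - T_4DF * sd
upper = mean + T_4DF * sd
```

Note that this follows the paper's formulation (mean ± t × SD); a strict prediction interval for a single new observation would additionally multiply the SD by sqrt(1 + 1/n).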
Since the management of both LGD and HGD as cancer precursors is the same in the Netherlands, these two categories were grouped. The statistical analyses were performed using the Statistical Package for the Social Sciences (SPSS 24.0, IBM Corp., Armonk, NY, USA). The custom-weighted kappa was computed using the program AgreeStat (version 2013.2, Advanced Analytics, LLC, Gaithersburg, MD, USA).
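The custom-weighted kappa described above can be sketched as follows. This is an illustrative re-implementation, not the AgreeStat computation itself; the cross-table counts are hypothetical, and the exact placement of the zero weight among the three grouped categories (here: NDBE versus the combined LGD/HGD category) is an assumption based on the weighting rules stated in the text.

```python
import numpy as np

def weighted_kappa(table, weights):
    """Cohen's kappa with custom agreement weights.

    table:   square cross-table of counts (rows = round 1, cols = round 2).
    weights: same-shaped matrix of agreement scores (1 = full credit).
    """
    table = np.asarray(table, dtype=float)
    n = table.sum()
    p_obs = (weights * table).sum() / n
    # Expected cell counts under independence of the two ratings.
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    p_exp = (weights * expected).sum() / n
    return (p_obs - p_exp) / (1.0 - p_exp)

# Assumed weight matrix for the categories [NDBE, IND, LGD/HGD]:
# agreement = 1, NDBE vs LGD/HGD = 0, all other disagreements = 0.5.
W = np.array([[1.0, 0.5, 0.0],
              [0.5, 1.0, 0.5],
              [0.0, 0.5, 1.0]])

# Hypothetical cross-table of one pathologist's two assessment rounds
# over 60 cases (illustrative counts only).
counts = np.array([[20, 1, 0],
                   [1, 3, 1],
                   [0, 1, 33]])

kappa = weighted_kappa(counts, W)
```

The fraction of the maximum possible kappa mentioned in the text would be obtained by dividing this value by the kappa of the best-achievable table given the same marginal totals.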
Results
Baseline characteristics of samples in case set
Median age of patients at diagnosis was 66 years (IQR 58–71) and 73% were male. Cases contained a median of five slides (IQR 3–9), from a median of two levels (IQR 1–4), with four biopsies per level (IQR 3–4.5).
Percentage of diagnoses ‘indefinite for dysplasia’
Table: Percentage of cases diagnosed as ‘indefinite for dysplasia’ by each of the five core pathologists (mean over two assessment rounds) for the complete case set (n = 60), with 95% prediction interval.
Intra-observer agreement over all cases (n = 60)
Table: Intra-observer agreement (custom-weighted Cohen's kappa) of the five core pathologists for the complete case set (n = 60), in three categories (non-dysplastic BE; indefinite for dysplasia; low-grade/high-grade dysplasia), with 95% prediction interval and the maximum possible kappa per cross table.
Percentage agreement of the five core pathologists with the gold standard diagnosis
Table: Percentage agreement of the five core pathologists with the consensus gold standard diagnosis (mean over two assessment rounds) for the complete case set (n = 60), with 95% prediction interval, including the percentage of high-grade dysplasia cases misdiagnosed as non-dysplastic BE.
Post-hoc analysis on cases with a baseline diagnosis of dysplasia (n = 39)
Table: Percentage of cases diagnosed as ‘indefinite for dysplasia’ by each of the five core pathologists (mean over two assessment rounds) for cases with a baseline diagnosis of low-grade or high-grade dysplasia (n = 39), with 95% prediction interval.
Table: Intra-observer agreement (custom-weighted Cohen's kappa) of the five core pathologists for cases with a baseline diagnosis of low-grade or high-grade dysplasia (n = 39), in three categories (non-dysplastic BE; indefinite for dysplasia; low-grade/high-grade dysplasia), with 95% prediction interval and the maximum possible kappa per cross table.
Table: Percentage agreement of the five core pathologists with the consensus gold standard diagnosis (mean over two assessment rounds) for cases with a baseline diagnosis of low-grade or high-grade dysplasia (n = 39), with 95% prediction interval, including the percentage of high-grade dysplasia cases misdiagnosed as non-dysplastic BE.
Discussion
The aim of this study was to define benchmark quality criteria for the assessment of BE biopsies for our national digital review panel. For this purpose, our five core expert BE pathologists reviewed all slides of 60 whole-endoscopy BE cases enriched for dysplasia. After their individual assessments, they discussed discrepant cases and agreed on a consensus gold standard diagnosis for all cases. Our five core expert BE pathologists were found to have a mean percentage of IND diagnoses of 8% (95% PI: 3–14), a mean intra-observer agreement of 0.84 (95% PI: 0.66–1.02) and a mean agreement with the consensus gold standard diagnosis of 90% (95% PI: 82–98). The scenario with the largest clinical consequences, i.e. a consensus diagnosis of HGD misdiagnosed as NDBE, was a rare event. When we focused on the cases most relevant to our future panel, namely those with a baseline diagnosis of LGD or HGD (n = 39), results were similar. For clinical decision making, the distinction between LGD and HGD has limited consequences, since confirmed LGD has the same management as HGD. The distinction between NDBE, IND and LGD is the one that pathologists find most difficult and also the one with the biggest impact on further patient management.
Table: Values for the benchmark quality criteria (percentage of ‘indefinite for dysplasia’ diagnoses; intra-observer agreement; agreement with the consensus gold standard diagnosis; high-grade dysplasia misdiagnosed as non-dysplastic BE), based on the 95% prediction intervals of the five core pathologists.
This study has a number of unique features. First, the pathologists participating in this study are the top BE pathologists of the Netherlands, all with an international reputation in this field. This is the second study in the line of the national digital BE review panel that they are performing as a group. Second, the case set consists of whole-endoscopy cases (all slides from all biopsy levels of one endoscopy), was fully digitalized and only contains review cases from clinical practice. There were two assessment rounds with an adequate wash-out time and the pathologists held group discussions afterwards to discuss all discrepant cases and create a consensus gold standard diagnosis for every case. This digital case set of dysplastic BE cases will be made available in a teaching and testing environment to allow pathologists in- or outside the Netherlands to evaluate whether or not they meet the aforementioned benchmark quality criteria.
A limitation of our study is that the benchmark values for the chosen quality criteria generated in this study are only applicable to this particular case set, since they depend on this particular distribution of diagnoses. In addition, although we feel that our choice of criteria (how often indefinite, how confident, i.e. intra-observer agreement, and how accurate compared with a consensus diagnosis) is logical, some may argue that this choice is subjective. We feel that these benchmark quality criteria are currently the best to quantify expertise in diagnosing BE dysplasia in biopsy samples. In conclusion, our study shows that expert BE pathologists reach high levels of agreement when assessing a dysplastic, whole-endoscopy case set of BE cases. Their agreement scores have generated benchmark values for four quality criteria, namely: (1) the percentage of IND diagnoses; (2) the intra-observer agreement; (3) the percentage agreement compared with a consensus gold standard diagnosis; and (4) the percentage of cases of HGD misdiagnosed as NDBE. The values for these benchmark quality criteria set by our five core pathologists and digital dysplastic BE case set will be used to assess if other pathologists can join our national digital review panel.
Supplemental Material
Supplementary Tables 1 and 2 for ‘Development of benchmark quality criteria for assessing whole-endoscopy Barrett's esophagus biopsy cases’ by MJ van der Wel, LC Duits, E Klaver, RE Pouw, CA Seldenrijk, GJA Offerhaus, M Visser, FJW ten Kate, JG Tijssen, JJGHM Bergman and SL Meijer in United European Gastroenterology Journal.
Footnotes
Declaration of conflicting interests
None declared.
Funding
This research received no specific grant from any funding agency in the public, commercial or not-for-profit sectors.
Ethics approval
Since the materials used in this study were anonymized, the medical ethical committee of the AMC waived the need for approval.
Informed consent
Since the materials used in this study were anonymized, no informed consent was obtained.
Supplementary materials
The research materials supporting this publication can be accessed through the supplementary materials and/or by contacting Myrtle J van der Wel at
References