Abstract
The gold standard for immunohistochemical analyses is manual assessment by two specialist pathologists. This process is time-consuming, highly dependent on the respective evaluator, and often difficult to reproduce. Image analysis software such as ImageJ, QuPath, or CellProfiler, which employs machine learning and/or deep learning to perform biomarker analyses, offers a potential solution to these problems. The objective of our study was to evaluate whether digital assessment using the open-source software QuPath is comparable to manual evaluation and to examine the inter-evaluator variability between the two manual evaluators and between two software-based evaluations. Six tissue microarrays (TMAs) were constructed for a cohort of 309 patients with primary oral squamous cell carcinoma (OSCC). The tumor tissue and corresponding non-lesional squamous epithelial mucosa specimens were immunohistochemically stained for Ki67, as a nuclear marker; the epidermal growth factor receptor (EGF-R), as a membranous marker; and the major histocompatibility complex class I (MHC-I) heavy chain (HC), which is expressed on the membrane and in the cytoplasm. The staining pattern was analyzed by two experienced, independent manual evaluators and by QuPath. The percentage of positive cells for Ki67 and the histoscore (H-score), based on the percentage of positive cells and their staining intensity, for EGF-R and MHC-I were determined as final values. The results yielded high to excellent Spearman correlation coefficients for all three biomarkers (p<0.001) in lesional and non-lesional tissues. The Bland–Altman plots demonstrated a high degree of agreement between manual and software-based analysis as well as between evaluators, indicating high comparability of the evaluation methods.
However, a prerequisite for proper software-based analysis is an accurate, time-consuming annotation of each specimen, which requires users with a comprehensive understanding of histology and extensive training in QuPath. Once these requirements are met, software-based analysis offers advantages for large-scale biomarker studies: the stainings can be compared objectively and reproducibly, which leads to greater accuracy, and established settings can be reused across similar analyses without further operator input.
Introduction
Biomarkers are becoming an increasingly important tool in tumor diagnostics, prognosis, and the prediction of therapy response and resistance, and they often represent the rationale for the development of personalized cancer therapies.1,2 The implementation of tissue microarrays (TMAs) for biomarker analyses has made it possible to investigate molecular and immunological characteristics in large cohorts of tumor patients. Immunohistochemical staining of TMAs enables the study of risk factors, molecular tumor subtypes, biomarkers, and therapeutic targets and represents a well-established method for the cost- and time-efficient evaluation of protein expression in numerous tissue samples. However, TMAs also have some disadvantages when compared with whole tissue analysis, as the selected tissue areas represent only a small part of the tumor and give no insight into tumor heterogeneity.3–5 The conventional approach to histopathological evaluation of tissue sections is manual assessment by pathologists, which is susceptible to a number of potential biases, including context bias, number preference, subjectivity, inter-evaluator variability, inattentional blindness, recognition errors, search satisfaction, and diagnostic drift.6–10 Furthermore, manual assessment is time-consuming and limited to qualitative or semiquantitative methods, which restricts the accuracy and reproducibility of the data. Despite these issues, the manual, predominantly qualitative or semiquantitative evaluation of biomarker expression in tissue samples by an experienced pathologist is still regarded as the gold standard.6
Therefore, there is a need to develop computational methods. The advent of artificial intelligence (AI) and machine learning in the field of pathology has created new opportunities for the development of digital evaluation methods.11 These methods enable more standardized diagnostic protocols and the establishment of novel biomarker approaches and can increase the efficiency and accuracy of the evaluation process.12,13 QuPath is an open-source software for digital image analysis that can objectively and reproducibly analyze whole-mount sections or TMA-derived immunohistochemical images.14 The software can distinguish cells of various origins and functions and create quantitative scores from categorical and continuous variables, such as the percentage of positive cells or cell density. In recent years, technical innovations and software updates have facilitated the development of advanced methodologies for cell and tissue segmentation, thereby providing the basis for generating accurate results.
In the present study, QuPath version 0.4.3 was employed to analyze the expression level of three biomarkers with distinct cellular localization (the proliferation marker Ki67, epidermal growth factor receptor [EGF-R], and major histocompatibility complex class I [MHC-I] heavy chain [HC]) in 309 oral squamous cell carcinomas (OSCCs) with different tumor staging and grading, with non-lesional squamous epithelial mucosa specimens as controls. The biomarkers were selected because of their well-established immunohistochemistry (IHC) protocols, their variety of staining patterns (nuclear, membranous, and cytoplasmic), and their different expression levels in OSCCs and non-lesional squamous epithelial mucosa specimens. In accordance with the gold standard, two independent evaluators conducted the QuPath analyses as well as the manual assessment of all samples. The objective of this study was to evaluate whether the results obtained with QuPath are comparable to the manual evaluation by two experienced independent investigators and to assess the interrater variability between the two manual evaluators. Furthermore, the interrater variability of two separate QuPath evaluations was analyzed. It is noteworthy that automatic scoring requires users with histology knowledge and extensive training in QuPath. However, once these requirements are met, QuPath exhibits greater accuracy than manual scoring, particularly when analyzing large patient cohorts.
Materials and Methods
Study Population
The study cohort consists of 309 patients with primary OSCC who underwent immunohistochemical evaluation using the three biomarkers Ki67, EGF-R, and MHC-I HC. All patients were diagnosed between 1995 and 2015 and either had an initial biopsy or received surgical resection at the Department of Oral and Maxillofacial Surgery, University Hospital Halle (Saale), Germany. Non-lesional squamous epithelial mucosa specimens were available in 266/309 patient cases. A summary of demographic and clinicopathologic data is presented in Table 1. The study was approved by the ethics committee of the Medical Faculty of Martin-Luther-University Halle-Wittenberg (2017-81 and 2020-103) and carried out in compliance with the Declaration of Helsinki.
Demographic and Clinicopathologic Data of the 309 Patients With Oral Squamous Cell Carcinoma (OSCC).
Abbreviations: NA, not available.
UICC, Union for International Cancer Control (2021).
TNM, tumor, node, metastasis (2010).
Generation of TMAs
Formalin-fixed, paraffin-embedded (FFPE) biopsies from OSCC patients were collected for the generation of TMAs. Briefly, sections of the FFPE blocks were stained with hematoxylin and eosin (H&E) and assessed by an experienced head and neck squamous cell carcinoma pathologist (D.B.) to mark representative regions. For each patient, two representative 0.6-mm tissue cores were obtained from tumor regions and their corresponding normal squamous epithelia. Using these annotations, the TMA Grand Master (3DHISTECH, Budapest, Hungary) automatically created TMA blocks of 1-mm diameter tissue cores. In the majority of cases, further cores were excised for the generation of parallel TMAs. Each block contains between 165 and 220 tissue cores. In total, three TMAs each were generated for tumor tissue and for normal squamous epithelium.
Immunohistochemical Analysis
FFPE tissue samples were subjected to conventional IHC on a Bond-III automated immunostainer (Leica Biosystems Nussloch GmbH, Wetzlar, Germany) following the manufacturer’s protocol with the Bond Polymer Refine Detection Kit (DS9800CN). The following primary antibodies were incubated for 15 min: Ki67 (MM1; Leica Biosystems; Nussloch, Germany; ready to use), EGF-R (31G7; Diagnostic BioSystems, Pleasanton, CA; dilution: 1:100), and MHC-I HC (HC-10; Thermo Scientific, Waltham, MA; dilution: 1:2500). Antigen retrieval was performed at 98°C in buffer solution, pH 9.0, for Ki67 and in buffer solution, pH 6.0, for MHC-I HC (20 min), and as proteolytic-induced epitope retrieval with proteinase K for EGF-R (10 min).
Manual Scoring of IHC Data
An independent evaluation was performed by two experienced evaluators (C.W. and H.H.). Samples that presented technical challenges were discussed after scoring to ensure the correctness and reproducibility of the results. For EGF-R and MHC-I, a semiquantitative histoscore (H-score) was calculated for cytoplasmic and/or membranous staining by weighting the percentage of positive cells with the intensity of DAB pigmentation (0 = no staining; 1 = weak; 2 = moderate; 3 = strong), yielding a final H-score between 0 and 300. The H-score was calculated as follows: H-score = (3 × % of strong staining cells) + (2 × % of moderate staining cells) + (1 × % of weak staining cells).15 Ki67 was evaluated based on the percentage of positive-staining nuclei, ranging from 0% to 100%, regardless of color intensity. The mean values of the two evaluators were used as the final manual score. A patient case comprised one or two tissue cores, depending on whether single cores were lost during the generation of the TMA or the staining process or were not analyzable due to tissue folds, deficient tissue, or the wrong type of tissue. Insufficient tissue cores were excluded from further investigations.
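The H-score formula above can be sketched in a few lines of Python. This is a minimal illustration only: the function names `h_score` and `manual_final_score` and the example percentages are assumptions made for this sketch and are not taken from the study.

```python
def h_score(pct_weak, pct_moderate, pct_strong):
    """H-score = 1 x %weak + 2 x %moderate + 3 x %strong (range 0 to 300)."""
    for pct in (pct_weak, pct_moderate, pct_strong):
        if not 0 <= pct <= 100:
            raise ValueError("each percentage must lie between 0 and 100")
    # The three positive fractions cannot add up to more than all cells.
    if pct_weak + pct_moderate + pct_strong > 100 + 1e-9:
        raise ValueError("positive fractions cannot exceed 100% in total")
    return 1 * pct_weak + 2 * pct_moderate + 3 * pct_strong


def manual_final_score(score_evaluator_1, score_evaluator_2):
    """Final manual score: mean of the two evaluators' scores."""
    return (score_evaluator_1 + score_evaluator_2) / 2


# Hypothetical core: 20% weak, 30% moderate, 10% strong, 40% negative cells.
example = h_score(pct_weak=20, pct_moderate=30, pct_strong=10)  # 20 + 60 + 30
```

Negative cells enter the score only implicitly, as the remainder that is not weak, moderate, or strong, which is why the maximum of 300 is reached only when every cell is scored as strongly positive.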
Digital Image Analysis
The digital evaluation was performed by two evaluators (M.B. and H.H.). All TMA slides were digitized using the NanoZoomer-SQ whole slide scanner (C13140-21; Hamamatsu Photonics, Hamamatsu, Japan) in 40× scanning mode. Semiautomatic digital image analyses were conducted using the open-source software QuPath14 (version 0.4.3). The scanned images were imported as Brightfield H-DAB, and stain vectors for hematoxylin and DAB were estimated for each biomarker separately based on a representative region (Appendix Fig. A1). A grid comprising all tissue cores was created with the TMA dearrayer, and unsuitable cores (see “Manual Scoring of IHC Data”) were manually excluded. The commands “cell detection” and “cell + membrane detection” identified all cells on the slides using a custom algorithm. For the biomarkers with cytoplasmic and membranous staining (EGF-R and MHC-I HC), the software used the hematoxylin-stained cell nuclei as the reference for detection. For nuclear staining (Ki67), a different cell detection option based on the optical density sum was required. The software employed multiple settings for nuclei separation and size, as well as cell morphology, to achieve more detailed cell detection. The extension of each cell was calculated by defining a maximum distance around its nucleus, limited by neighboring nuclei or membrane staining. The software was then trained to differentiate between tumor/normal squamous epithelium and stroma. A two-way random classifier was trained using measurements of cell morphology and staining characteristics from 20 to 25 manually annotated regions per slide. The command “Set cell intensity classifications” was employed to adjust the immunostaining threshold between Ki67-negative and Ki67-positive cells and, in the case of EGF-R and MHC-I HC, the thresholds for negative, weak, moderate, and strong positive staining of cells.
Thresholds were determined by color deconvolution16 for mean nuclear DAB intensity (Ki67), maximum membranous DAB intensity (EGF-R), or maximum cytoplasmic DAB intensity (MHC-I HC). Figure 1 shows the different steps of cell segmentation. Finally, all segmentation and staining artifacts, such as DAB flecks, were manually removed from the analysis by deleting the affected cells, as illustrated in Figure 2. The digital scoring followed the same scoring systems as the manual scoring, and the results were exported as CSV files.
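The step from per-cell DAB intensities to a core-level score can be illustrated as follows. This is a sketch under stated assumptions, not QuPath code: the threshold values, the function names, and the example intensities are hypothetical, and a cell whose value equals a threshold is assumed to fall into the higher class.

```python
from bisect import bisect_right


def classify_intensity(dab_value, thresholds):
    """Return 0 (negative), 1 (weak), 2 (moderate), or 3 (strong).

    `thresholds` holds the three ascending cutoffs weak/moderate/strong.
    """
    return bisect_right(thresholds, dab_value)


def h_score_from_cells(dab_values, thresholds):
    """H-score of one core from the DAB value of every detected cell."""
    if not dab_values:
        raise ValueError("core contains no detected cells")
    counts = [0, 0, 0, 0]
    for value in dab_values:
        counts[classify_intensity(value, thresholds)] += 1
    total = len(dab_values)
    # Weight the percentage of cells in each class by its intensity (0 to 3).
    return sum(cls * 100.0 * counts[cls] / total for cls in range(4))


def percent_positive(dab_values, threshold):
    """Ki67-style score: percentage of cells at or above a single cutoff."""
    if not dab_values:
        raise ValueError("core contains no detected cells")
    return 100.0 * sum(v >= threshold for v in dab_values) / len(dab_values)


hypothetical_thresholds = [0.2, 0.4, 0.6]  # assumed optical density cutoffs
cells = [0.1, 0.25, 0.5, 0.7]              # one cell per intensity class
core_h_score = h_score_from_cells(cells, hypothetical_thresholds)
```

Because every cell is classified individually, a shift in any of the cutoffs moves cells between classes across the whole slide, which is why the choice of thresholds affects all scores in the same direction.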

QuPath: Cell segmentation steps. (A) Exemplary unprocessed TMA core with Ki67 staining. (B) Detection of all cells on the core with the command “cell detection.” (C) Segmentation of individual cells as tumor/squamous epithelium and stroma; tumoral/epithelial cells are red, stromal cells are green. (D) Classifying positive (red and dark green) and negative (blue and light green) stained cells with the command “Set cell intensity classifications.” Magnification 1:40 and 1:100; scale bars: 100 µm.

Correction of artifacts. The image shows a staining artifact (A) before and (B) after manual removal of the cells that QuPath detected within the artifact. Magnification 1:100; scale bar: 100 µm.
Statistical Analysis
Microsoft Excel (Version 2108, Microsoft, Redmond, WA), SPSS (Version 27, IBM, Armonk, NY), and GraphPad Prism (Version 10.2.1, GraphPad Software, Boston, MA) were used for statistical analysis. The percentage of positive cells (Ki67) and the H-score (EGF-R, MHC-I) were evaluated for each tissue core by two evaluators manually and semi-automatically using QuPath. The majority of cases consisted of two tissue cores, which were combined into a case using the mean value.
Assuming that manual evaluation is the reference standard, distinct analyses were conducted to assess and compare the evaluation methods. The mean values of the manual scoring were compared with the mean results from QuPath. In addition, the two manually derived scores and the two semi-automatically derived scores were each compared with one another to assess interrater reliability. The comparisons were conducted at the core and case level in accordance with the clinical approach. If a value was missing in either method, the core or case was excluded from the analysis. Analyses were performed for each biomarker, with samples of tumor tissue and normal squamous epithelium evaluated separately.
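The aggregation of cores into cases and the exclusion of cases lacking a value in either method can be sketched as follows. The data layout (a dictionary mapping a patient identifier to its list of core scores, with `None` for an excluded core), the function names, and all values are assumptions made for illustration.

```python
from statistics import mean


def case_score(core_scores):
    """Mean of the evaluable cores of one patient; None if none is evaluable."""
    valid = [score for score in core_scores if score is not None]
    return mean(valid) if valid else None


def paired_case_scores(manual_cores, qupath_cores):
    """Return {patient_id: (manual, qupath)} for cases scored by both methods."""
    pairs = {}
    for patient_id in manual_cores.keys() & qupath_cores.keys():
        manual = case_score(manual_cores[patient_id])
        qupath = case_score(qupath_cores[patient_id])
        # A case missing in either method is dropped from the comparison.
        if manual is not None and qupath is not None:
            pairs[patient_id] = (manual, qupath)
    return pairs


# Hypothetical H-scores; None marks a core that was excluded.
manual_cores = {"P1": [100, 120], "P2": [None, 80], "P3": [None, None]}
qupath_cores = {"P1": [110, 130], "P2": [90, None], "P3": [40, 50]}
pairs = paired_case_scores(manual_cores, qupath_cores)
```

In this example, patient P2 is still evaluable from a single core per method, whereas P3 is excluded because no manual value exists.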
The Spearman correlation coefficient was used to facilitate a comparative analysis between QuPath results and manually determined values as well as between the two manual scores and the two QuPath scores for each biomarker. Bland–Altman plots were employed for a more precise comparison of the two distinct evaluation methods. Furthermore, the median and mean expression scores and the interquartile range for each biomarker were compared at the core level.
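Both statistics can be sketched in plain Python. The study itself used SPSS and GraphPad Prism, so the following is only an illustration of the underlying calculations: the function names and example scores are hypothetical, and the difference is assumed to be taken as manual minus QuPath, so that a negative bias means higher QuPath values.

```python
from statistics import mean, stdev


def average_ranks(values):
    """1-based ranks; tied values receive the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the tie-corrected ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def bland_altman(a, b):
    """Return (bias, lower limit, upper limit) using bias +/- 1.96 x SD."""
    diffs = [p - q for p, q in zip(a, b)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)
    return bias, bias - spread, bias + spread


manual = [10, 20, 30, 40, 50]   # hypothetical manual scores per core
qupath = [12, 25, 31, 47, 55]   # hypothetical QuPath scores per core
rho = spearman_rho(manual, qupath)
bias, lower, upper = bland_altman(manual, qupath)
```

In this example the two methods rank all cores identically, so rho is 1, while the Bland–Altman bias still shows that the QuPath values are 4 units higher on average.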
Results
For the analysis of the three selected biomarkers, six TMAs of OSCC tissues consisting of 618 tissue cores from 309 patients and non-lesional squamous epithelial mucosa specimens from 266/309 patients were stained. On average, 17.3% of the tumor cores and 24.3% of the normal tissue cores were excluded from further investigations due to a lack of evaluable tissue or the inability to qualitatively evaluate the tissue. In total, 1540 tumor and 1228 squamous epithelium tissue cores across all biomarkers were evaluated. Due to the availability of two samples per patient, the tumor tissue from 91.1% of patients and normal squamous epithelium from 87.1% of patients could be analyzed (Appendix Table A1). The analysis of each biomarker required approximately 6 to 8 hr for each manual evaluation and 19 to 22 hr for the evaluation conducted with QuPath. Thus, in total, it took around 42 hr for both manual evaluations and almost 60 hr for the software-supported evaluation process.
Core-Level Comparison
As summarized in Table 2, the median and mean expression values of Ki67, EGF-R, and MHC-I in the tumors were highly comparable between the two manual evaluators and the QuPath analyses. In contrast, in the QuPath evaluation of the squamous epithelium, the EGF-R and MHC-I values of the two investigators differed. Analyses further revealed an approximately 3-fold increased expression of Ki67 and a 6- to 8-fold increased expression of EGF-R in tumor tissues compared with adjacent non-lesional squamous epithelial mucosa specimens. Unexpectedly, the MHC-I expression was 1.5-fold higher in the tumor samples than in the control group.
Median and Mean Expression Values of Three Selected Biomarkers at the Core Level.
Abbreviations: SD, standard deviation; IQR, interquartile range Q3–Q1; EGF-R, epidermal growth factor receptor; MHC-I HC, major histocompatibility complex class I heavy chain; Ki67: percentage of positive cells (0–100%); EGF-R and MHC-I HC: H-score (0–300).
The correlations between manual scores and QuPath results from all tissue cores as well as the interrater correlations between the two manual evaluators and between the two QuPath evaluations demonstrated high to excellent positive correlation coefficients, with values ranging from ρ = 0.68 (interrater reliability manual, EGF-R, squamous epithelium) to ρ = 0.98 (interrater reliability QuPath, MHC-I HC, tumor, and squamous epithelium), p<0.001 for all biomarkers (Table 3). The manual interrater coefficients (ρ ranging 0.68–0.96) were slightly lower than the coefficients between manual and software-supported evaluations (ρ ranging 0.78–0.95), while the QuPath interrater coefficients (ρ ranging 0.89–0.98) were slightly higher. No significant differences between the analyses of tumor tissue and non-lesional squamous epithelial mucosa specimens were detected.
Correlation of the Immunohistochemical Manual and Software-Derived Evaluation at the Core Level.
Correlation of immunohistochemical evaluation between manual and software-derived (QuPath) scores per core. The data are presented in the form of Spearman’s ρ correlation coefficient with a corresponding 95% confidence interval in brackets.
Abbreviations: EGF-R, epidermal growth factor receptor; MHC-I HC, major histocompatibility complex class I heavy chain.
Comparison of the scoring results using Bland–Altman plots demonstrated good agreement not only between the two methods but also in the interrater comparisons for all biomarkers (Fig. 3). The mean differences between the manual gold standard and QuPath were −3.8% (squamous epithelium: −1.9%) for Ki67, −5.2 H-score units (−12.0) for EGF-R, and −4.1 H-score units (−9.5) for MHC-I HC. The scores of normal squamous epithelia showed a tendency toward increasing differences at higher values due to higher results in QuPath. Equivalent analyses for manual interrater agreement yielded mean differences of −3.7% (squamous epithelium: −0.1%) for Ki67, −5.8 H-score units (−4.1) for EGF-R, and −7.7 H-score units (−13.7) for MHC-I HC. The mean differences for QuPath interrater agreement were −0.7% (squamous epithelium: −0.13%) for Ki67, −1.9 H-score units (−11.9) for EGF-R, and 11.0 H-score units (−29.3) for MHC-I HC. Compared with the limits of agreement between the manual evaluators, the limits of agreement between the gold standard and QuPath were closer for Ki67 and MHC-I HC (tumor) and further apart for EGF-R and MHC-I HC (squamous epithelium).

Core level: Bland–Altman plots. Bland–Altman plots of agreement between manual and software-based scores (A1–A3, B1–B3), interrater agreement in manual evaluation (A4–A6, B4–B6), and interrater agreement in QuPath evaluation (A7–A9, B7–B9) at the core level for tumor tissue (A) and normal squamous epithelium (B). The x-axis shows the average evaluation score of the two assessment methods, the two manual evaluators, or the two QuPath evaluators. Mean differences and limits of agreement (1.96 × standard deviation ≙ 95% confidence interval) are presented on the y-axis.
Case-Level Comparison
The majority of the OSCC patients were represented by two cores. For case-level comparison, the values of both cores were combined into one case using their mean, and the same statistical evaluation methods were applied as at the core level. Case analyses demonstrated a high level of agreement with the results of the core level (Appendix Tables A2, A3 and Fig. A2). The correlation coefficients at case level were even slightly higher, ranging from ρ = 0.71 (interrater reliability manual, EGF-R, squamous epithelium) to ρ = 0.99 (interrater reliability QuPath, MHC-I HC, tumor) (Appendix Table A3), p<0.001 for all markers. The Bland–Altman plots showed smaller mean differences and closer limits of agreement for all constellations (Appendix Fig. A2).
Discussion
The identification of reliable biomarkers for the prognosis of patients’ risks, stratification, and prediction of therapy response has been suggested to be of critical importance for the individualization of tumor therapies. For decades, one conventional approach to biomarker analysis has been the manual scoring of IHC results. To gain insight into whether digital pathology algorithms could be used as a standard procedure for quantifying protein expression levels in tumor tissue specimens, this study compared the IHC scoring of three biomarkers in OSCC and non-lesional squamous epithelial mucosa specimens between manual evaluation and the software QuPath and assessed the interrater variability in the manual and QuPath-driven results. For this purpose, Ki67, EGF-R, and MHC-I HC were chosen as biomarkers.
For biomarker analyses in large patient cohorts, it is definitely advantageous to generate TMAs, as this allows the simultaneous staining of numerous tissue samples under identical conditions. In addition, expression of the different biomarkers could be evaluated on almost identical tissue areas through serial tissue sections.17 Staining of multiple tissue cores per patient proved valuable because it enabled the evaluation of a greater number of patients. Despite the loss of ~17% of tumor cores (non-lesional squamous epithelial mucosa specimens: 24%) during the staining process, at least ~91% (87%) of the cases could be evaluated (Appendix Table A1). Nevertheless, the use of TMAs also has some disadvantages, including the limited representation of intratumoral heterogeneity, which can result in a false interpretation of biomarker expression, for example, of PD-L1.18
Concerning the direct comparison between manual and QuPath software–based scoring, it is noteworthy that QuPath analysis, at approximately 2 to 4 hr of evaluation per TMA slide, did not save time. This is due to the intensive processing of TMA slides in QuPath, such as cell counting, segmentation settings (identification of erythrocytes, glandular tissue, and tumor/squamous epithelium), and the removal of artifacts, which is a time-consuming and demanding process. These challenges were further aggravated by other factors. For example, the membrane detection of the algorithm might not be optimal, because in some cases the correct cell boundaries were not identified during the cell segmentation process (Fig. 4). During the selection of the segmentation settings for the semi-automated analysis, challenges arose from the inherent diversity of tumor size and shape as well as staining intensity. This might lead to either undersegmentation or oversegmentation of cells. Nevertheless, recognition errors were also experienced by pathologists, thereby contributing to inter-evaluator variability of (membrane) staining. Both software-based and manual scoring can benefit from extensive training with large datasets to increase accuracy.

Different quality of membrane detection in QuPath. Representative images of EGF-R-stained OSCC with (A, B) a case with well-detected cell membranes and (C, D) a case with poorly detected cell membranes. In (A, C), the unannotated scans are shown; in (B, D), QuPath annotations are shown. Colors of detected cells correspond to the staining intensity (blue = negative, yellow = weak, orange = moderate, red = strong, green = stroma). Magnification 1:200; scale bar: 50 µm.
These technical challenges call for future improvements of the software that make segmentation more reliable, so that nuclei are assessed accurately, and that accelerate the procedure. In general, the evaluation of IHC by QuPath requires an operator who is well trained in histology (not necessarily a specialist pathologist) and who has received intensive training in the use of the program. However, once the settings are established, they can be reused for similar analyses without additional operator input, which not only saves time but also increases accuracy.
In manual scoring, the use of preferred numbers ending in 0 and 5 is a well-known phenomenon, and especially for high values, rounded numbers are favored when estimating the percentage of positively stained cells. This phenomenon of number preference has been discussed in previous studies6,9 and results in pseudo-continuous values. Multiplying the percentages by the staining intensity further increases this error in the H-score. In contrast, QuPath works with continuous values from 0% to 100% to score positively stained cells. Furthermore, in manual analysis, the staining intensity is often generalized to the entire core, while the software calculates the percentage of low-, moderate-, and high-positive cells in each core. Manual evaluation, but not the QuPath analysis, might also be influenced by the staining intensity of surrounding cells. In detail, we found that weakly stained cells in the immediate neighborhood of strongly positive cells were more often (falsely) scored as negative by the manual evaluation, resulting in lower scores. Furthermore, the manual evaluators tended to score all cells as positive when the tissue was generally stained over a large area, whereas QuPath also considered isolated weakly stained or negative cells for scoring. The maximum H-score of 300 was assigned 56 times (evaluator 1) or 28 times (evaluator 2), respectively, by the manual scorers and never by QuPath [34 values (evaluator 1) or 16 values (evaluator 2) between 270 and 300]. The same principle also applies to cores that are almost completely negative. These samples were often ranked as negative by the manual evaluators and as weakly positive by QuPath due to a few weakly positive cells. Especially in non-lesional squamous epithelial mucosa specimens, QuPath detected slightly higher values for Ki67 and EGF-R than the manual evaluation.
This might additionally be explained by a personal bias due to people’s tendency to seek and interpret information based on their knowledge and/or hypothesis.19 For example, it is expected that the number of Ki67-positive cells in normal tissue will be very low and that the staining of Ki67 and EGF-R is mainly limited to the basal cell layer.20–23 Therefore, the immunohistochemical scores for both biomarkers were assumed to be relatively low, which corresponded to the evaluators’ results. In contrast, QuPath evaluated all cells equally regardless of the tissue type.
Currently, clinical biomarker evaluation based on IHC staining is performed by manual counting, partly with the aid of digital support systems, which leads to semiquantitative rather than fully quantitative evaluations. However, for some biomarkers, there are cutoffs that require a fully quantitative evaluation (e.g., Ki67-positive cells in the context of neuroendocrine and gastrointestinal stromal tumors, PD-L1 in the context of several approvals of immuno-oncological therapies). Thus, a reevaluation of these cutoffs is urgently required if software-based IHC scoring is to be implemented autonomously.
In conclusion, manual scoring of biomarker expression does not require the expertise of a specialist pathologist, but semiquantitative evaluation of biomarkers, including the assessment of staining intensity and the percentage of positive cells, is only possible for a person with intensive histological training.
Regarding the median and mean values of the manual evaluators and QuPath, it is noteworthy that the software mostly detects slightly higher values, as already reported in another study,24 which has been attributed to highly sensitive threshold calibration in the setting process. Errors in software-based evaluations tend to be systematic. For example, misconfigured thresholds can bias the results toward higher or lower values. In manual evaluation, recognition errors often occur sporadically in both directions and average out, making them hard to spot. Systematic deviations in software-based evaluations can be identified at a control level and (manually) readjusted, for example, by modifying threshold values. The implementation of an independent review by a second person, or the collaborative determination of the segmentation settings and threshold values, might obviate the need for the double evaluation that is customary in the manual gold standard. In the context of further research projects and clinical application, this information should be considered when adjusting the thresholds.
Previous studies have compared different software programs (QuPath, Definiens, Cell Signaling, ImageJ) with each other and with manual evaluation, and have compared manual evaluations with each other. These studies have demonstrated moderate to excellent correlation coefficients.25–28 Generally, evaluator variability can be divided into three categories: intra-evaluator variability, inter-evaluator variability, and inter-laboratory variability.29 In our study, the comparative analyses of the two manual evaluators and of the two separate QuPath analyses correspond to inter-evaluator variability. A strength of our study is the parallel analysis of manual/software-supported evaluation and the interrater considerations, which enables a direct comparison of the detected variabilities. Furthermore, the software was tested on different tissue types by analyzing tumor tissue as well as non-lesional squamous epithelial mucosa specimens. With regard to the correlation coefficients, there were only minimal differences between the different approaches, which do not indicate superiority of one experimental design. However, correlation coefficients alone are of very limited informative value. The main problem is that a high correlation coefficient does not imply a high degree of agreement.30 In addition, the coefficients depend on the range of the measurement values and their distribution.31 Pattern recognition and imitation have been named as reasons for the continued common use of correlation coefficients.32
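The gap between correlation and agreement can be made concrete with a small constructed example. All values and the function name `spearman_no_ties` are hypothetical; the rank formula used here is the textbook shortcut that is valid only when no values are tied.

```python
from statistics import mean


def spearman_no_ties(x, y):
    """Spearman's rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1)); no ties allowed."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        result = [0] * len(values)
        for rank, index in enumerate(order, start=1):
            result[index] = rank
        return result

    rank_x, rank_y = ranks(x), ranks(y)
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n * n - 1))


method_a = [40, 90, 150, 210, 260]        # hypothetical H-scores, method A
method_b = [s + 30 for s in method_a]     # same ranking, constant +30 offset

rho = spearman_no_ties(method_a, method_b)
bias = mean(a - b for a, b in zip(method_a, method_b))
```

Here the correlation is perfect even though method B scores every sample 30 H-score units higher, a systematic deviation that only the mean difference reveals.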
To address the limitations of correlation coefficients for assessing the agreement of two evaluation methods, we employed the Bland–Altman method, which has been demonstrated to be the most suitable statistical tool for comparing the manual evaluation with QuPath and for considering the interrater variability of the manual evaluation.30,32,33 The analyses of these Bland–Altman plots show that the software-supported evaluation by QuPath is highly comparable with the manual gold standard. In particular, for the nuclear staining of Ki67, QuPath shows very good agreement with the gold standard. The limits of agreement are even closer than those between the different manual scorers. With regard to EGF-R, the limits of agreement between software-based and manual evaluation were somewhat wider. In the case of MHC-I, the results were ambiguous: compared with the manual interrater comparison, the evaluation of tumor tissue performed better and the evaluation of non-lesional squamous epithelial mucosa specimens performed worse. A potential explanation for this discrepancy is the less precise membrane detection of the software. In non-lesional squamous epithelial mucosa specimens, MHC-I expression was more often observed on the membrane rather than in the cytoplasm than was the case in tumor tissue. Furthermore, a comparative analysis of the two QuPath analyses was conducted and presented in Bland–Altman plots. These results were in accordance with the interrater results of the manual evaluation. In particular, the mean difference for MHC-I in normal tissues of 29.8 H-score units was noteworthy. It can be hypothesized that the thresholds used by the two QuPath operators differed slightly, resulting in marginally lower or higher overall H-scores, which demonstrates that a precise setting of the thresholds is crucial for ensuring consistent and comparable evaluations.
As a fundamental technical remark, it should be noted that aspects of slide preparation, such as slide thickness, tissue age, and staining protocol, can influence the quality of IHC.34 This can affect the results of both manual and software-assisted evaluation with regard to inter-evaluator and inter-laboratory variability. The implementation of standard operating procedures for slide preparation and IHC will help to minimize this potential source of error. Nevertheless, all tissues should be critically examined by experienced manual evaluators and QuPath operators during the evaluation process.
A comparison of biomarker expression in tumor tissues is a central element of translational research and requires an assessment that is as objective as possible. According to our results, combining representative tumor tissue specimens in unique TMAs enabled simultaneous immunostaining of hundreds of tissue specimens, and digital evaluation of the staining results proved to be an optimal setting for such large cohorts of samples. However, this procedure still requires evaluators with in-depth experience in histopathology, as the settings for cell detection and staining intensity, tissue segmentation, and artifact correction are demanding. In particular, the available digital image analysis tools are not yet optimized for membranous staining, which leads to a time-consuming annotation process. Automating these correction processes would require large training and validation sets to differentiate accurately between cell types and to detect artifacts. It is also noteworthy that completely automated methods, such as piNET, might avoid the biases introduced by the semi-automated QuPath method. This deep learning algorithm might have an advantage because it is automated and requires no user input, allowing the analysis of large cohorts with minimal introduction of bias.
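The sensitivity of the H-score to the staining-intensity settings can be illustrated with a small sketch. It assumes the conventional H-score definition (1 × % weakly, 2 × % moderately, and 3 × % strongly stained cells, range 0 to 300). The per-cell intensity values, the threshold triples, and the function name `h_score` are hypothetical and do not reproduce the QuPath settings used in our analysis.

```python
def h_score(intensities, thresholds):
    """Compute an H-score (0-300) from per-cell staining intensities.

    thresholds = (weak, moderate, strong): a cell is graded 1+, 2+, or 3+
    when its intensity reaches the respective cut-off, otherwise negative.
    """
    weak, moderate, strong = thresholds
    counts = [0, 0, 0, 0]  # cells graded negative, 1+, 2+, 3+
    for value in intensities:
        if value >= strong:
            counts[3] += 1
        elif value >= moderate:
            counts[2] += 1
        elif value >= weak:
            counts[1] += 1
        else:
            counts[0] += 1
    total = len(intensities)
    # Sum of grade * percentage of cells with that grade.
    return sum(grade * 100.0 * count / total for grade, count in enumerate(counts))


# Hypothetical mean staining intensities of ten detected cells.
cells = [0.05, 0.15, 0.22, 0.28, 0.35, 0.41, 0.47, 0.58, 0.66, 0.80]

# Two hypothetical operators choose slightly different intensity cut-offs.
setting_a = (0.20, 0.40, 0.60)
setting_b = (0.25, 0.45, 0.65)

print(h_score(cells, setting_a), h_score(cells, setting_b))
```

In this constructed example, shifting each cut-off by 0.05 changes the H-score of the same cells from 150 to 130, which shows how small differences in threshold setting can translate into a systematic offset between operators.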
Despite the disadvantages discussed above, digital image analysis will in the future provide an important tool for biomarker analysis in tissue specimens for diagnostic and research purposes.
Footnotes
Appendix
Correlation of Marker Expression Between Manual and Software-Derived (QuPath) Scores at the Case Level.
| Comparison | Tumor: Ki67 | Tumor: EGF-R | Tumor: MHC-I HC | Squamous Epithelium: Ki67 | Squamous Epithelium: EGF-R | Squamous Epithelium: MHC-I |
|---|---|---|---|---|---|---|
| Manual/QuPath | 0.92 (0.90–0.93) | 0.92 (0.90–0.94) | 0.95 (0.94–0.96) | 0.93 (0.91–0.93) | 0.80 (0.75–0.85) | 0.95 (0.94–0.96) |
| Interrater reliability manual | 0.90 (0.88–0.92) | 0.92 (0.90–0.93) | 0.94 (0.92–0.95) | 0.91 (0.89–0.93) | 0.71 (0.63–0.77) | 0.96 (0.94–0.97) |
| Interrater reliability QuPath | 0.89 (0.87–0.91) | 0.94 (0.93–0.95) | 0.99 (0.99–0.99) | 0.92 (0.89–0.94) | 0.92 (0.89–0.93) | 0.98 (0.09–0.99) |
The data are presented in the form of Spearman’s rho correlation coefficient with a corresponding 95% confidence interval in brackets.
Abbreviations: EGF-R, epidermal growth factor receptor; MHC-I, major histocompatibility complex class I; HC, heavy chain.
Acknowledgements
We would like to thank Maria Heise for excellent secretarial help.
Competing Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Author Contributions
CW, BS, and MB participated in the conception and design of the work. AE, DB, HH, and AW were involved in the compilation of the patient cohort, generation of the tissue microarrays, and implementation of immunohistochemical staining. HH, CW, MB, and AW participated in data analysis and interpretation. The manuscript was drafted by HH, BS, CW, and AW. All authors contributed to the critical revision of the manuscript and approved its final version.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
