Comparison of Manual Versus QuPath Software–based Immunohistochemical Scoring Using Oral Squamous Cell Carcinoma as a Model

Abstract

The gold standard for immunohistochemical analyses is manual assessment by two specialist pathologists. This process is time-consuming, highly dependent on the respective evaluator, and often difficult to reproduce. The use of image analysis software, such as ImageJ, QuPath, or CellProfiler, which employ machine learning and/or deep learning mechanisms to perform biomarker analyses, offers a potential solution to these problems. The objective of our study is to evaluate whether digital assessment using the open-source software QuPath is comparable to manual evaluation and to examine the inter-evaluator variability between the two manual evaluators and between two software-based evaluations. Six tissue microarrays (TMAs) were constructed for a cohort of 309 patients with primary oral squamous cell carcinoma (OSCC). The tumor tissue and corresponding non-lesional squamous epithelial mucosa specimens were immunohistochemically stained for the biomarkers Ki67, as a nuclear marker; the epidermal growth factor receptor (EGF-R), as a membranous marker; and the major histocompatibility complex class I (MHC-I) heavy chain (HC), expressed on the membrane and in the cytoplasm. The staining pattern was analyzed by two experienced, independent manual evaluators and by QuPath. The percentage of positive cells, for Ki67, and the histoscore (H-score) based on the percentage of positive cells and their staining intensity, for EGF-R and MHC-I, were determined as final values. The results yielded high to excellent Spearman correlation coefficients for all three biomarkers (p<0.001) in lesional and non-lesional tissues. The Bland–Altman plots demonstrated a high degree of agreement between manual and software-based analysis as well as between evaluators, indicating that the evaluation methods are highly comparable. However, a prerequisite for a proper software-based analysis is an accurate, time-consuming annotation of each specimen, which requires users with a comprehensive understanding of histology and extensive training in QuPath. Once these requirements are met, the software-based analysis offers advantages for large-scale biomarker studies: the stainings can be compared objectively and reproducibly, which leads to greater accuracy, and established conditions can be reused across similar analyses without further operator input.

Keywords

biomarker; digital pathology; head and neck squamous cell carcinoma; immunohistochemistry

Introduction

Biomarkers are becoming an increasingly important tool in tumor diagnostics, prognosis, and the prediction of therapy responses and resistances, and they often represent the rationale for the development of personalized cancer therapies.^1,2 The implementation of tissue microarrays (TMAs) for biomarker analyses has made it possible to investigate molecular and immunological characteristics in large cohorts of tumor patients. Immunohistochemical staining of TMAs enables the study of risk factors, molecular tumor subtypes, biomarkers, and therapeutic targets and represents a well-established method for the cost- and time-efficient evaluation of protein expression in numerous tissue samples. However, TMAs also have some disadvantages when compared with whole tissue analysis, as the selected tissue areas represent only a small part of the tumor and give no insight into tumor heterogeneity.^3–5 The conventional approach to histopathological evaluation of tissue sections is manual assessment by pathologists, which is susceptible to a number of potential biases, including context bias, number preference, subjectivity, inter-evaluator variability, inattentional blindness, recognition errors, search satisfaction, and diagnostic drift.^6–10 Furthermore, manual assessment is time-consuming and limited to qualitative or semiquantitative methods, which restricts the accuracy and reproducibility of the data. Despite these issues, the manual, predominantly qualitative or semiquantitative evaluation of biomarker expression in tissue samples by an experienced pathologist is currently still regarded as the gold standard.⁶

Therefore, there is a need to develop computational methods. The advent of artificial intelligence (AI) and machine learning in the field of pathology has created new opportunities for the development of digital evaluation methods.¹¹ It enables more standardized diagnostic protocols and the establishment of novel biomarker approaches, and it can increase the efficiency and accuracy of the evaluation process.^12,13 QuPath is an open-source software for digital image analysis that can objectively and reproducibly analyze whole-mount sections or TMA-derived immunohistochemical images.¹⁴ The software makes it possible to distinguish cells of various origins and functions and to create quantitative scores from categorical and continuous variables, such as the percentage of positive cells or cell density. In recent years, technical innovations and updates of the software have facilitated the development of advanced methodologies for cell and tissue segmentation, thereby providing the basis for generating accurate results.

In the present study, QuPath version 0.4.3 was employed to analyze the expression level of three biomarkers with distinct cellular localization, namely the proliferation marker Ki67, epidermal growth factor receptor (EGF-R), and major histocompatibility complex class I (MHC-I) heavy chain (HC), in 309 oral squamous cell carcinomas (OSCCs) with different tumor staging and grading and in non-lesional squamous epithelial mucosa specimens as controls. The biomarkers were selected due to well-established immunohistochemistry (IHC) protocols, their variety of staining patterns (nuclear, membranous, and cytoplasmic), and their different expression levels in OSCCs and non-lesional squamous epithelial mucosa specimens. In accordance with the gold standard, two independent evaluators each conducted the QuPath analyses and the manual assessment of all samples. The objective of this study was to evaluate whether the results of the software-based evaluation with QuPath are comparable to the manual evaluation by two experienced independent investigators and to assess the interrater variability between the two manual evaluators. Furthermore, the interrater variability of two separate QuPath evaluations was analyzed. It is noteworthy that the automatic scoring requires users with histology knowledge and extensive training in QuPath. However, once these requirements are met, QuPath exhibits greater accuracy than manual scoring, in particular when analyzing large patient cohorts.

Materials and Methods

Study Population

The study cohort consists of 309 patients with primary OSCC who underwent immunohistochemical evaluation using the three biomarkers Ki67, EGF-R, and MHC-I HC. All patients were diagnosed between 1995 and 2015 and either had an initial biopsy or received surgical resection at the Department of Oral and Maxillofacial Surgery, University Hospital Halle (Saale), Germany. Non-lesional squamous epithelial mucosa specimens were available in 266/309 patient cases. A summary of demographic and clinicopathologic data is presented in Table 1. The study was approved by the ethics committee of the Medical Faculty of Martin-Luther-University Halle-Wittenberg (2017-81 and 2020-103) and carried out in compliance with the Declaration of Helsinki.

Table 1.

Demographic and Clinicopathologic Data of the 309 Patients With Oral Squamous Cell Carcinoma (OSCC).

Parameter	Number of Cases	%
Total	309
Gender
Female	75	24.3
Male	234	75.7
Age (years)
<60	168	54.4
≥60	141	45.6
T-Stage
1	81	26.2
2	97	31.4
3	41	13.3
4	88	28.5
NA	2	0.7
N-Stage
0	147	47.6
1	57	18.5
2	100	32.4
3	4	1.3
NA	1	0.3
M-Stage
0	296	95.8
1	13	4.2
Histological grading
1	36	11.7
2	191	61.8
3	82	26.5
UICC
I	62	20.1
II	43	13.9
III	61	19.7
IV	142	46.0
NA	1	0.3

Abbreviations: NA, not available.

UICC, Union Internationale Contre le Cancer (2021).

TNM, tumor, node, metastasis (2010).

Generation of TMAs

Formalin-fixed, paraffin-embedded (FFPE) biopsies from OSCC patients were collected for the generation of TMAs. Briefly, sections of the FFPE blocks were stained with hematoxylin and eosin (H&E) and assessed by an experienced head and neck squamous cell carcinoma pathologist (D.B.) to mark representative regions. For each patient, two representative 0.6-mm tissue cores were obtained from tumor regions and their corresponding normal squamous epithelia. Using these annotations, the TMA Grand Master (3DHISTECH, Budapest, Hungary) automatically created TMA blocks of 1-mm diameter tissue cores. In the majority of cases, further cores were excised for the generation of parallel TMAs. Each block contains between 165 and 220 tissue cores. In total, three TMAs each were generated for tumor tissue and for normal squamous epithelium.

Immunohistochemical Analysis

FFPE tissue samples were subjected to conventional IHC on a Bond-III automated immunostainer (Leica Biosystems Nussloch GmbH, Wetzlar, Germany) following the manufacturer’s protocol with the Bond Polymer Refine Detection Kit (DS9800CN). The following primary antibodies were incubated for 15 min: Ki67 (MM1; Leica Biosystems; Nussloch, Germany; ready to use), EGF-R (31G7; Diagnostic BioSystems, Pleasanton, CA; dilution: 1:100), and MHC-I HC (HC-10; Thermo Scientific, Waltham, MA; dilution: 1:2500). Antigen retrieval was performed at 98°C in buffer solution, pH 9.0, for Ki67 and in buffer solution, pH 6.0, for MHC-I HC (20 min), and as proteolytic-induced epitope retrieval with proteinase K for EGF-R (10 min).

Manual Scoring of IHC Data

An independent evaluation was performed by two experienced evaluators (C.W. and H.H.). Samples that presented technical challenges were discussed after scoring to ensure the correctness and reproducibility of the results. For EGF-R and MHC-I, a semiquantitative histoscore (H-score) was calculated for cytoplasmic and/or membranous staining by multiplying the intensity of DAB pigmentation (0 = no staining; 1 = weak; 2 = moderate; 3 = strong) by the percentage of positive cells, giving a final H-score ranging from 0 to 300. The H-score was calculated as follows: H-score = (3 × % of strong staining cells) + (2 × % of moderate staining cells) + (1 × % of weak staining cells).¹⁵ Ki67 was evaluated based on the percentage of positive-staining nuclei, ranging from 0% to 100%, regardless of color intensity. The mean values of the two evaluators were used as the final manual score. A patient case comprised one or two tissue cores, depending on whether single cores were lost during the generation of the TMA or the staining process or were not analyzable due to tissue folds, deficient tissue, or the wrong type of tissue. Such insufficient tissue cores were excluded from further investigations.
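The H-score formula and the averaging of the two evaluators can be illustrated with a minimal Python sketch. The function names and the example percentages are hypothetical and chosen only for illustration; they are not taken from the study data.

```python
def h_score(pct_weak, pct_moderate, pct_strong):
    """H-score = 1 x % weak + 2 x % moderate + 3 x % strong cells (range 0-300)."""
    total_positive = pct_weak + pct_moderate + pct_strong
    if min(pct_weak, pct_moderate, pct_strong) < 0 or total_positive > 100:
        raise ValueError("percentages must be non-negative and sum to at most 100")
    return 1 * pct_weak + 2 * pct_moderate + 3 * pct_strong


def final_manual_score(score_evaluator_1, score_evaluator_2):
    """Final manual score: mean of the two evaluators' scores for one core."""
    return (score_evaluator_1 + score_evaluator_2) / 2


# Hypothetical core: 20% weak, 30% moderate, 10% strong, 40% negative cells.
example = h_score(20, 30, 10)  # 20 + 60 + 30 = 110
```

A core in which every cell stains strongly reaches the maximum of 300, and a completely negative core scores 0.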

Digital Image Analysis

The digital evaluation was performed by two evaluators (M.B. and H.H.). All TMA slides were digitized utilizing the NanoZoomer-SQ whole slide scanner (C13140-21; HAMAMATSU, Hamamatsu, Japan) at 40× scanning mode. Semiautomatic digital image analyses were conducted using the open-source software QuPath version 0.4.3.¹⁴ The scanned images were imported as Brightfield H-DAB, estimating stain vectors for hematoxylin and DAB for each biomarker separately based on a representative region (Appendix Fig. A1). A grid comprising all tissue cores was created with the assistance of the TMA dearrayer, and unsuitable cores (as explained under “Manual Scoring of IHC Data”) were manually removed. The commands “cell detection” and “cell + membrane detection” identified all cells on the slides using a custom algorithm. In the case of cytoplasmic or membranous staining biomarkers (EGF-R and MHC-I HC), the software's reference value was the hematoxylin-stained cell nuclei. In the case of nuclear staining (Ki67), another cell detection option based on the optical density sum was required. The software employed multiple settings for nuclei separation and size, as well as cell morphology, to achieve more detailed cell detection. The extension of cells was calculated by defining a maximum distance around each nucleus, limited by neighboring nuclei or membrane staining. The software was then trained to differentiate between tumor/normal squamous epithelium and stroma. A two-way random classifier was trained using measurements of 20 to 25 manually annotated regions per slide regarding cell morphology and staining characteristics. The command “Set cell intensity classifications” was employed to adjust the immunostaining threshold for Ki67-negative and Ki67-positive cells and, in the case of EGF-R and MHC-I HC, to adjust for negative, weak, moderate, and strong positive staining of cells. Thresholds were determined by color deconvolution¹⁶ for mean nuclear DAB intensity (Ki67), maximum membranous DAB intensity (EGF-R), or maximum cytoplasmic DAB intensity (MHC-I HC). Figure 1 shows the different steps of cell segmentation. Finally, all segmentation and staining artifacts, such as DAB flecks, were manually removed from the analysis by deleting the affected cells, as illustrated in Figure 2. The digital scoring followed the same scoring systems as the manual evaluation, and the results were exported as CSV files.

Figure 1.

QuPath: Cell segmentation steps. (A) Exemplary unprocessed TMA core with Ki67 staining. (B) Detection of all cells on the core with the command “cell detection.” (C) Segmentation of individual cells as tumor/squamous epithelium and stroma; tumoral/epithelial cells are red, stromal cells are green. (D) Classification of positive (red and dark green) and negative (blue and light green) stained cells with the command “Set cell intensity classifications.” Magnification 1:40 and 1:100; scale bars: 100 µm.

Figure 2.

Correction of artifacts. The image shows a staining artifact (A) before removing QuPath detected cells within the artifact and (B) after manually removing cells. Magnification 1:100; scale bar: 100 µm.

Statistical Analysis

Microsoft Excel (Version 2108, Microsoft, Redmond, WA), SPSS (Version 27, IBM, Armonk, NY), and GraphPad Prism (Version 10.2.1, GraphPad Software, Boston, MA) were used for statistical analysis. The percentage of positive cells (Ki67) and the H-score (EGF-R, MHC-I) were evaluated for each tissue core by two evaluators manually and semi-automatically using QuPath. The majority of cases consisted of two tissue cores, which were combined into a case using the mean value.

With the assumption that manual evaluation is the reference standard, distinct analyses were conducted to assess and compare the evaluation methods. The mean values of the manual scoring were compared with the mean results from QuPath. In addition, the two manually and the two semi-automatically derived scores were compared with each other to validate the reliability of the interrater similarity. The comparisons were conducted at the core and case level in accordance with the clinical approach. In the absence of values in either method, the core or case was excluded from the analysis. Analyses were performed for each biomarker with samples of tumor tissue and normal squamous epithelium, which were separately evaluated.
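The aggregation of cores into cases and the exclusion of cores or cases without values in either method can be sketched as follows. This is a minimal illustration under stated assumptions: the data layout (a dictionary mapping a case identifier to a list of core scores, with `None` for a missing or excluded core) and all names are hypothetical and do not reflect the actual spreadsheet structure used in the study.

```python
from statistics import mean


def case_score(core_scores):
    """Mean of the evaluable cores of one case; None marks a missing core.

    Returns None when no core of the case is evaluable.
    """
    usable = [score for score in core_scores if score is not None]
    return mean(usable) if usable else None


def paired_cases(manual, qupath):
    """Keep only the cases that have a case-level value in both methods."""
    pairs = {}
    for case_id in manual.keys() & qupath.keys():
        manual_value = case_score(manual[case_id])
        qupath_value = case_score(qupath[case_id])
        if manual_value is not None and qupath_value is not None:
            pairs[case_id] = (manual_value, qupath_value)
    return pairs


# Hypothetical scores for three cases with up to two cores each.
manual_scores = {"P1": [20, 30], "P2": [None, 10], "P3": [None, None]}
qupath_scores = {"P1": [24.0, 28.0], "P2": [12.0, None], "P3": [5.0, 6.0]}
pairs = paired_cases(manual_scores, qupath_scores)
```

In this example, case P3 is dropped because it has no manual value, while P2 is retained on the basis of its single evaluable core.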

The Spearman correlation coefficient was used to facilitate a comparative analysis between QuPath results and manually determined values as well as between the two manual scores and the two QuPath scores for each biomarker. Bland–Altman plots were employed for a more precise comparison of the two distinct evaluation methods. Furthermore, the median and mean expression scores and the interquartile range for each biomarker were compared at the core level.
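The quantities shown in a Bland–Altman plot can be computed as in the following Python sketch. It assumes that differences are taken as first method minus second method (for example, manual minus QuPath) and that the sample standard deviation is used; the function name and the example scores are hypothetical.

```python
from statistics import mean, stdev


def bland_altman(method_a, method_b):
    """Return (mean difference, lower limit, upper limit) for paired scores.

    Differences are computed as method_a - method_b (assumed sign convention).
    Limits of agreement are mean difference +/- 1.96 x SD of the differences.
    """
    if len(method_a) != len(method_b) or len(method_a) < 2:
        raise ValueError("need at least two paired measurements")
    differences = [a - b for a, b in zip(method_a, method_b)]
    mean_difference = mean(differences)
    sd = stdev(differences)  # sample standard deviation (n - 1)
    return mean_difference, mean_difference - 1.96 * sd, mean_difference + 1.96 * sd


# Hypothetical Ki67 percentages for four cores (manual vs. software-based).
manual = [10, 20, 30, 40]
software = [12, 24, 33, 47]
bias, lower, upper = bland_altman(manual, software)
```

In a plot, each core is drawn at the average of its two scores on the x-axis and at its difference on the y-axis, with horizontal lines at the mean difference and the two limits of agreement.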

Results

For the analysis of the three selected biomarkers, six TMAs of OSCC tissues consisting of 618 tissue cores from 309 patients and non-lesional squamous epithelial mucosa specimens from 266/309 patients were stained. On average, 17.3% of the tumor cores and 24.3% of the normal tissue cores were excluded from further investigations due to a lack of evaluable tissue or the inability to qualitatively evaluate the tissue. In total, 1540 tumor and 1228 squamous epithelium tissue cores across all biomarkers were evaluated. Due to the availability of two samples per patient, the tumor tissue from 91.1% of patients and normal squamous epithelium from 87.1% of patients could be analyzed (Appendix Table A1). The analysis of every biomarker required approximately 6 to 8 hr for each manual evaluation and 19 to 22 hr for the evaluation conducted with QuPath. Thus, in total, it took around 42 hr for both manual evaluations and almost 60 hr for the software-supported evaluation process.

Core-Level Comparison

As summarized in Table 2, the median and mean expression values of Ki67, EGF-R, and MHC-I in the tumors were highly comparable between the two manual evaluators and the QuPath analyses. In contrast, in the QuPath evaluation of the squamous epithelium, the EGF-R and MHC-I values differed between the two investigators. Analyses further revealed an approximately 3-fold increased expression of Ki67 and a 6- to 8-fold increased expression of EGF-R in tumor tissues compared with adjacent non-lesional squamous epithelial mucosa specimens. Unexpectedly, the MHC-I expression was 1.5-fold higher in the tumor samples than in the control group.

Table 2.

Median and Mean Expression Values of Three Selected Biomarkers at the Core Level.

	Manual 1	Manual 2	Manual Mean	QuPath 1	QuPath 2	QuPath Mean
Tumor
Ki67
Median	20	20	20	24.4	25.9	25.3
Mean	22.8	26.5	24.6	28.1	29.7	28.4
SD	16.7	20.1	17.9	19.5	18.0	18.6
IQR	10–30	10–35	11.3–32.5	13.3–37.8	15.4–37.7	14.7–27.5
EGF-R
Median	10	20	15	19.3	26.1	23.1
Mean	56.3	62.3	58.4	63.5	66.0	64.4
SD	92.4	88.9	89.3	85.7	79.8	81.9
IQR	0–60	0–80	0–65	2.5–92.4	9.2–92.7	6.1–90.7
MHC-I HC
Median	90	112.5	97.5	121.3	109.9	115.8
Mean	110.2	117.2	111.4	122.1	110.9	116.3
SD	95.5	82.8	86.9	76.8	73.2	74.6
IQR	20–180	40–180	32.5–180	60.6–185.7	45.35–168.8	52.5–179.1
Squamous epithelium
Ki67
Median	7	5	6	6.6	8.1	7.4
Mean	7.5	7.5	7.5	9.2	9.3	9.3
SD	6.1	7.2	6.4	8.9	7.4	7.9
IQR	2–12	2–10	2–11	2.3–13.9	4.0–12.81	3.1–13.4
EGF-R
Median	0	5	2.5	4.4	15.7	9.3
Mean	6.8	11.1	8.9	14.9	26.9	20.56
SD	18.4	19.7	19.1	28.1	32.2	29.1
IQR	0–0	0–12.5	0–10	0.5–17.0	4.8–38.5	2.8–27.7
MHC-I HC
Median	50	80	68.8	86.2	48.9	67.0
Mean	78.0	92.3	84.8	89.9	60.1	74.7
SD	77.5	72.8	74.1	61.1	52.5	56.5
IQR	10–125	25–150	20–135	36.9–133.5	14.9–93.2	25.9–114.2

Abbreviations: SD, standard deviation; IQR, interquartile range Q1–Q3; EGF-R, epidermal growth factor receptor; MHC-I HC, major histocompatibility complex class I heavy chain; Ki67: percentage of positive cells (0–100%); EGF-R and MHC-I HC: H-score (0–300).

The correlations between manual scores and QuPath results from all tissue cores as well as the interrater correlations between the two manual evaluators and the two QuPath evaluations demonstrated high to excellent positive correlation coefficients with values ranging from ρ = 0.68 (interrater reliability manual, EGF-R, squamous epithelium) to ρ = 0.98 (interrater reliability QuPath, MHC-I HC, tumor, and squamous epithelium), p<0.001 for all biomarkers (Table 3). The manual interrater coefficients (ρ ranging 0.68–0.96) were found to be slightly lower than the coefficients between manual and software-supported evaluations (0.78–0.95), while the QuPath interrater coefficients (ρ ranging 0.89–0.98) were found to be slightly higher. No significant differences in the analyses of tumor and non-lesional squamous epithelial mucosa specimen were detected.

Table 3.

Correlation of the Immunohistochemical Manual and Software-Derived Evaluation at the Core Level.

	Tumor			Squamous Epithelium
	Ki67	EGF-R	MHC-I HC	Ki67	EGF-R	MHC-I HC
Manual/QuPath	0.91(0.89–0.92)	0.91(0.89–0.92)	0.95(0.93–0.95)	0.92(0.90–0.93)	0.78(0.74–0.82)	0.95(0.94–0.98)
Interrater reliability manual	0.89(0.87–0.91)	0.90(0.89–0.92)	0.92(0.90–0.93)	0.90(0.88–0.92)	0.68(0.62–0.73)	0.96(0.95–0.96)
Interrater reliability QuPath	0.96(0.95–0.97)	0.94(0.93–0.95)	0.98(0.98–0.98)	0.89(0.87–0.91)	0.90(0.88–0.92)	0.98(0.98–0.98)

Correlation of immunohistochemical evaluation between manual and software-derived (QuPath) scores per core. The data are presented in the form of Spearman’s ρ correlation coefficient with a corresponding 95% confidence interval in brackets.

Abbreviations: EGF-R, epidermal growth factor receptor; MHC-I HC, major histocompatibility complex class I heavy chain.

Comparison of the scoring results using Bland–Altman plots demonstrated good agreement not only between the two methods but also in the interrater comparisons for all biomarkers (Fig. 3). The mean differences between the manual gold standard and QuPath were −3.8% (squamous epithelium: −1.9%) for Ki67, −5.2 H-score units for EGF-R (−12.0), and −4.1 H-score units for MHC-I HC (−9.5). The analyses of the scores of normal squamous epithelia showed a tendency toward increasing differences at higher values due to higher results in QuPath. Equivalent analyses for manual interrater agreement yielded mean differences of −3.7% (squamous epithelium: −0.1%) for Ki67, −5.8 H-score units (−4.1) for EGF-R, and −7.7 H-score units (−13.7) for MHC-I HC. The mean differences for QuPath interrater agreement were found to be −0.7% (squamous epithelium: −0.13%) for Ki67, −1.9 H-score units (−11.9) for EGF-R, and 11.0 H-score units (−29.3) for MHC-I HC. Limits of agreement between the gold standard and QuPath were closer for the evaluation of Ki67 and MHC-I HC (tumor) than between the manual evaluators and further apart for EGF-R and MHC-I HC (squamous epithelium).

Figure 3.

Core level: Bland–Altman plots. Bland–Altman plots of agreement between manual and software-based scores (A1–A3, B1–B3), interrater agreement in manual evaluation (A4–A6, B4–B6), and interrater agreement in QuPath evaluation (A7–A9, B7–B9) at the core level for tumor tissue (A) and normal squamous epithelium (B). The x-axis shows the average evaluation score of the two assessment methods, the two manual evaluators, or the two QuPath evaluators. Mean differences and limits of agreement (1.96 × standard deviation ≙ 95% confidence interval) are presented on the y-axis.

Case-Level Comparison

The majority of the OSCC patients were represented by two cores. For case-level comparison, the values of both cores were combined into one case using the mean, and the same statistical evaluation methods were applied as at the core level. Case analyses demonstrated a high level of agreement with the results of the core level (Appendix Tables A2, A3 and Fig. A2). The correlation coefficients at case level were even slightly higher, ranging from ρ = 0.71 (interrater reliability manual, EGF-R, squamous epithelium) to ρ = 0.99 (interrater reliability QuPath, MHC-I HC, tumor) (Appendix Table A3), p<0.001 for all markers. The Bland–Altman plots showed smaller mean differences and closer limits of agreement for all constellations (Appendix Fig. A2).

Discussion

The identification of reliable biomarkers for the prognosis of patients’ risks, stratification, and prediction of therapy response has been suggested to be of critical importance for the individualization of tumor therapies. For decades, one conventional approach to biomarker analysis has been the manual scoring of IHC results. To determine whether digital pathology algorithms could be used as a standard procedure for the quantification of protein expression levels in tumor tissue specimens, this study compared the IHC scoring of three biomarkers in OSCC and non-lesional squamous epithelial mucosa specimens between manual evaluation and the software QuPath and assessed the interrater variability in the manual and QuPath-driven results. For this purpose, Ki67, EGF-R, and MHC-I HC were chosen as biomarkers.

For biomarker analyses in large patient cohorts, it is clearly advantageous to generate TMAs, as this allows the simultaneous staining of numerous tissue samples under equal conditions. In addition, expression of the different biomarkers could be evaluated on almost identical tissue areas through serial tissue sections.¹⁷ Staining of multiple tissue cores per patient demonstrated its value in enabling the evaluation of a greater number of patients. Despite the loss of ~17% of tumor cores (non-lesional squamous epithelial mucosa specimens: 24%) during the staining process, at least ~91% (87%) of the cases could be evaluated (Appendix Table A1). Nevertheless, the use of TMAs also has some disadvantages, including the limited representation of intratumoral heterogeneity, which can result in a false interpretation of biomarker expression, for example, of PD-L1.¹⁸

Concerning the direct comparison between manual and QuPath software–based scoring, it is noteworthy that QuPath analysis, with approximately 2 to 4 hr of evaluation per TMA slide, did not save time. This is due to the intensive processing of TMA slides in QuPath, such as cell counting, segmentation settings (identification of erythrocytes, glandular tissue, and tumor/squamous epithelium), and the removal of artifacts, which is a time-consuming and demanding process. These challenges were further aggravated by other factors. For example, the membrane detection of the algorithm might not be optimal, because in some cases the correct cell boundaries were not identified during the cell segmentation process (Fig. 4). During the selection of the segmentation settings for the semi-automated analysis, challenges arose from the inherent diversity of tumor size and shape as well as of the staining intensity. This might lead to either undersegmentation or oversegmentation of cells. Nevertheless, recognition errors were also experienced by pathologists, thereby contributing to inter-evaluator variability of (membrane) staining. Both software-based and manual scoring can benefit from extensive training with large datasets to increase accuracy.

Figure 4.

Different quality of membrane detection in QuPath. Representative images of EGF-R-stained OSCC with (A, B) a case with well-detected cell membranes and (C, D) a case with poorly detected cell membranes. In (A, C), the unannotated scans are shown; in (B, D), QuPath annotations are shown. Colors of detected cells correspond to the staining intensity (blue = negative, yellow = weak, orange = moderate, red = strong, green = stroma). Magnification 1:200; scale bar: 50 µm.

These technical challenges require future improvements of the software to enhance its reliability in segmentation processes, to accurately assess nuclei, and to accelerate the procedure. In general, the evaluation of IHC by QuPath requires an operator who is well trained in histology (not necessarily a specialist pathologist) and who has received intensive training in the use of the program. However, once the conditions are established, they can be reused for similar analyses without additional input from the operator, which not only saves time but also increases the accuracy.

In manual scoring, the use of preferred numbers ending in 0 and 5 is a well-known phenomenon, and especially for high values, rounded numbers are favored to estimate the percentage of positively stained cells. This phenomenon of number preference has been discussed in previous studies^6,9 and results in pseudo-continuous values. Multiplying the percentages by the staining intensity further increases this error in the H-score. In contrast, QuPath works with continuous values from 0% to 100% to score positively stained cells. Furthermore, when manual analysis is chosen, the staining intensity is often generalized to the entire core, while the software calculates the percentage of low-, moderate-, and high-positive cells in each core. Manual evaluation, but not the QuPath analysis, might also be influenced by the staining intensity of surrounding cells. In detail, we found that weakly stained cells in the immediate neighborhood of strongly positive stained cells were more often (falsely) scored as negative by the manual evaluation, resulting in lower scores. Furthermore, the manual evaluators tended to score all cells as positive when the tissue was generally stained over a large area, whereas QuPath also considered isolated weakly stained or negative cells for scoring. The maximum H-score of 300 was assigned 56 times (evaluator 1) or 28 times (evaluator 2) by the manual scorers and never by QuPath [34 values (evaluator 1) or 16 values (evaluator 2) between 270 and 300]. The same principle also applies to cores that are almost completely negative. These samples were often ranked as negative by the manual evaluators and as weakly positive by QuPath due to a few weakly positive cells. Especially in non-lesional squamous epithelial mucosa specimens, QuPath detected slightly higher values for Ki67 and EGF-R than the manual evaluation. This might additionally be explained by a personal bias due to people’s tendency to seek and interpret information based on their knowledge and/or hypothesis.¹⁹ For example, it is expected that the number of Ki67-positive cells in normal tissue will be very low and that the staining of Ki67 and EGF-R is mainly limited to the basal cell layer.^20–23 Therefore, the immunohistochemical scores for both biomarkers were assumed to be relatively low, which corresponded to the evaluators’ results. In contrast, QuPath evaluated all cells equally regardless of the tissue type.

Currently, clinical biomarker evaluation based on IHC staining is performed by manual counting, in part with the aid of digital support systems, which leads to semiquantitative rather than fully quantitative evaluations. However, for some biomarkers, there are cutoffs that require a fully quantitative evaluation (e.g., Ki67-positive cells in the context of neuroendocrine and gastrointestinal stromal tumors, PD-L1 in the context of several approvals of immuno-oncological therapies). Thus, a reevaluation of these cutoffs is urgently required if software-based IHC scoring is to be implemented autonomously.

In conclusion, manual scoring of biomarker expression does not require the expertise of a specialist pathologist, but semiquantitative evaluation of biomarkers, including the assessment of the staining intensity and percentage of positive cells, is only possible for a person with intensive histological training.

Regarding the median and mean values of the manual evaluators and QuPath, it is noteworthy that the software mostly detects slightly higher values, as already reported in another study,²⁴ which has been attributed to highly sensitive threshold calibration in the setting process. Errors in software-based evaluations tend to be systematic. For example, misconfigured thresholds can lead to consistently higher or lower results. In manual evaluation, recognition errors often occur sporadically in both directions and average out, making them hard to spot. Systematic deviations in software-based evaluations can be identified at a control level and (manually) readjusted, for example, by modifying threshold values. The implementation of an independent review by a second person, or the collaborative determination of the segmentation settings and threshold values, might obviate the necessity for the double evaluation that is customary in the manual gold standard. In the context of further research projects and clinical application, this information should be considered in the adjustment of the thresholds.

Previous studies have compared different software programs (QuPath, Definiens, Cell Signaling, ImageJ) with each other and with manual evaluation, or manual evaluations with each other. These studies have demonstrated moderate to excellent correlation coefficients.^25–28 Generally, evaluator variability can be divided into three categories: intra-evaluator variability, inter-evaluator variability, and inter-laboratory variability.²⁹ In our study, the comparative analyses of the two manual evaluators and of the two separate QuPath analyses correspond to inter-evaluator variability. A strength of our study is the parallel analysis of manual/software-supported evaluation and the interrater considerations, which enables a direct comparison of the detected variabilities. Furthermore, the software was tested on different tissue types by analyzing tumor tissue as well as non-lesional squamous epithelial mucosa specimens. With regard to the correlation coefficients, there were only minimal differences between the different approaches, which do not indicate superiority of one experimental design. However, interpreting the correlation coefficients alone is of very limited value. The main problem is that a high correlation coefficient does not necessarily imply a high degree of agreement.³⁰ In addition, the coefficients depend on the range of measurement values and their distribution.³¹ Pattern recognition and imitation were specified as reasons for the continued common use of correlation coefficients.³²
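The gap between correlation and agreement can be made concrete with a small Python sketch. The scores are invented for illustration and the constant offset of 50 H-score units is an arbitrary assumption; the rank-based Spearman coefficient is implemented here with average ranks for ties.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = mean(x), mean(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical H-scores: method B reads every core 50 units higher than A.
scores_a = [10, 40, 90, 150, 220]
scores_b = [score + 50 for score in scores_a]
rho = spearman_rho(scores_a, scores_b)
mean_difference = mean(a - b for a, b in zip(scores_a, scores_b))
```

Here the rank order of the cores is identical in both methods, so the correlation is perfect, although the two methods disagree by 50 units on every single core.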

To address the limitations of correlation coefficients in assessing the agreement of two evaluation methods, we employed the Bland–Altman method, which has been demonstrated to be the most suitable statistical tool for comparing the manual evaluation with QuPath and for assessing the interrater variability of the manual evaluation.³⁰,³²,³³ The analyses of these Bland–Altman plots show that the software-supported evaluation by QuPath is highly comparable with the manual gold standard. In particular, for the nuclear staining of Ki67, QuPath shows very good agreement with the gold standard: the limits of agreement are even narrower than those between the different manual scorers. For EGF-R, the limits of agreement between software-based and manual evaluation indicated a somewhat less consistent performance of QuPath. In the case of MHC-I, the results were ambiguous: compared with the manual interrater comparison, the software performed better in tumor tissue and worse in non-lesional squamous epithelial mucosa specimens. A potential explanation for this discrepancy is the less precise membrane detection of the software, since in non-lesional squamous epithelial mucosa specimens MHC-I expression was observed on the membrane rather than in the cytoplasm more often than in tumor tissue. Furthermore, the two QuPath analyses were compared with each other and presented in Bland–Altman plots. These results were in accordance with the interrater results of the manual evaluation. In particular, the mean deviation of the differences for MHC-I in normal tissue, 29.8 H-score units, was noteworthy. It can be hypothesized that the thresholds used by the two QuPath operators differed slightly, resulting in a marginally lower or higher overall H-score, which demonstrates that precise threshold setting is crucial for consistent and comparable evaluations.
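
For orientation, the quantities read from a Bland–Altman plot can be computed as in the following Python sketch: the bias is the mean of the paired differences, and the 95% limits of agreement are the bias ± 1.96 standard deviations of those differences. The H-score values are hypothetical and do not reproduce the study data.

```python
from statistics import mean, stdev


def bland_altman(scores_a, scores_b):
    """Return (bias, lower limit, upper limit) for paired measurements."""
    differences = [b - a for a, b in zip(scores_a, scores_b)]
    bias = mean(differences)
    half_width = 1.96 * stdev(differences)  # sample standard deviation
    return bias, bias - half_width, bias + half_width


# Hypothetical H-scores for the same eight cores (illustrative values only).
manual = [110.0, 150.0, 90.0, 200.0, 175.0, 130.0, 60.0, 240.0]
qupath = [118.0, 155.0, 101.0, 204.0, 190.0, 128.0, 72.0, 247.0]

bias, lower, upper = bland_altman(manual, qupath)
print(f"bias = {bias:.1f} H-score units")
print(f"limits of agreement = {lower:.1f} to {upper:.1f}")
```

A bias clearly different from zero points to a systematic offset between the methods, whereas the width of the limits of agreement reflects the random scatter. Comparing these widths between method pairs is what allows statements such as narrower limits for software versus manual scoring than between two manual scorers.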

As a fundamental technical remark, slide preparation, including section thickness, tissue age, and staining protocol, can influence the quality of IHC.³⁴ This can affect the results of both manual and software-assisted evaluation with regard to inter-evaluator and inter-laboratory variability. The implementation of standard operating procedures for slide preparation and IHC will help to minimize this potential source of error. Nevertheless, all tissues should be critically examined by experienced manual evaluators and QuPath operators during the evaluation process.

A comparison of biomarker expression in tumor tissues is a central element of translational research and requires an assessment that is as objective as possible. According to our results, combining representative tumor tissue specimens in TMAs enabled simultaneous immunostaining of hundreds of tissue specimens, and digital evaluation of the staining results proved to be an optimal approach for such large cohorts of samples. However, this procedure still requires evaluators with in-depth experience in histopathology, as the settings for cell detection and staining intensity, tissue segmentation, and artifact correction are demanding. In particular, the available digital image analysis tools are not yet optimized for membranous staining, which leads to a time-consuming annotation process. Automating these correction processes would require large training and validation sets to accurately differentiate between cell types and to detect artificial defects. It is also noteworthy that completely automated methods, such as piNET, might avoid the biases introduced by the semi-automated QuPath method. This deep learning algorithm might have an advantage because it is automated and requires no user input, allowing the analysis of large cohorts with minimal introduction of bias.

Despite the disadvantages discussed above, digital image analysis will in the future provide an important tool for biomarker analysis in tissue specimens for diagnostic and research purposes.

Footnotes

Appendix

Appendix Table A3.

Correlation of Marker Expression Between Manual and Software-Derived (QuPath) Scores at the Case Level.

	Tumor			Squamous Epithelium
	Ki67	EGF-R	MHC-I HC	Ki67	EGF-R	MHC-I
Manual/QuPath	0.92 (0.90–0.93)	0.92 (0.90–0.94)	0.95 (0.94–0.96)	0.93 (0.91–0.93)	0.80 (0.75–0.85)	0.95 (0.94–0.96)
Interrater reliability manual	0.90 (0.88–0.92)	0.92 (0.90–0.93)	0.94 (0.92–0.95)	0.91 (0.89–0.93)	0.71 (0.63–0.77)	0.96 (0.94–0.97)
Interrater reliability QuPath	0.89 (0.87–0.91)	0.94 (0.93–0.95)	0.99 (0.99–0.99)	0.92 (0.89–0.94)	0.92 (0.89–0.93)	0.98 (0.09–0.99)

The data are presented in the form of Spearman’s rho correlation coefficient with a corresponding 95% confidence interval in brackets.

Abbreviations: EGF-R, epidermal growth factor receptor; MHC-I, major histocompatibility complex class I; HC, heavy chain.

Acknowledgements

We would like to thank Maria Heise for excellent secretarial help.

Competing Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Author Contributions

CW, BS, and MB participated in the conception and design of the work. AE, DB, HH, and AW were involved in the compilation of the patient cohort, generation of the tissue microarrays, and implementation of immunohistochemical staining. HH, CW, MB, and AW participated in data analysis and interpretation. The manuscript was drafted by HH, BS, CW, and AW. All authors contributed to the critical revision of the manuscript and approved its final version.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iDs

Hannah Horbas

Marcus Bauer

Barbara Seliger

References

1. Cancer biomarker detection: recent achievements and challenges. Chem Soc Rev. 2015;44(10):2963–97.

2. Hristova, Chan DW. Cancer biomarker discovery and translation: proteomics and beyond. Expert Rev Proteomics. 2019;16(2):93–103.

3. Freier, Joos, Flechtenmacher, Devens, Benner, Bosch, Lichter, Hofele. Tissue microarray analysis reveals site-specific prevalence of oncogene amplifications in head and neck squamous cell carcinoma. Cancer Res. 2003;63(6):1179–82.

4. Kononen, Bubendorf, Kallioniemi, Bärlund, Schraml, Leighton, Torhorst, Mihatsch, Sauter, Kallioniemi OP. Tissue microarrays for high-throughput molecular profiling of tumor specimens. Nat Med. 1998;4(7):844–7.

5. Taylor CR. Immunomicroscopy: a diagnostic tool for the surgical pathologist. 2024. Available from: https://cir.nii.ac.jp/crid/1130000797961714560

6. Aeffner, Wilson, Martin, Black, Hendriks CLL, Bolon, Rudmann, Gianani, Koegler, Krueger, Young GD. The gold standard paradox in digital image analysis: manual versus automated scoring as ground truth. Arch Pathol Lab Med. 2017;141(9):1267–75.

7. Raffone, Srinivasan, van Leeuwen. The interplay of attention and consciousness in visual search, attentional blink and working memory consolidation. Philos Trans R Soc Lond B Biol Sci. 2014;369(1641):20130215.

8. Raab SS. Improving patient safety by examining pathology errors. Clin Lab Med. 2004;24(4):849–63.

9. Wen, Kramer, Hoey, Hanley, Usher RH. Terminal digit preference, random error, and bias in routine clinical measurement of blood pressure. J Clin Epidemiol. 1993;46(10):1187–93.

10. Silcocks PB. Measuring repeatability and validity of histological diagnosis—a brief review with some practical examples. J Clin Pathol. 1983;36(11):1269–75.

11. Shafi, Parwani AV. Artificial intelligence in diagnostic pathology. Diagn Pathol. 2023;18(1):109.

12. Bera, Schalper, Rimm, Velcheti, Madabhushi. Artificial intelligence in digital pathology—new tools for diagnosis and precision oncology. Nat Rev Clin Oncol. 2019;16(11):703–15.

13. Steiner, MacDonald, Liu, Truszkowski, Hipp, Gammage, Thng, Peng, Stumpe MC. Impact of deep learning assistance on the histopathologic review of lymph nodes for metastatic breast cancer. Am J Surg Pathol. 2018;42(12):1636–46.

14. Bankhead, Loughrey, Fernández, Dombrowski, McArt, Dunne, McQuaid, Gray, Murray, Coleman, James, Salto-Tellez, Hamilton PW. QuPath: open source software for digital pathology image analysis. Sci Rep. 2017;7(1):16878.

15. McCarty Jr, Miller, Cox, Konrath, McCarty Sr. Estrogen receptor analyses. Correlation of biochemical and immunohistochemical methods using monoclonal antireceptor antibodies. Arch Pathol Lab Med. 1985;109(8):716–21.

16. Ruifrok, Johnston DA. Quantification of histochemical staining by color deconvolution. Anal Quant Cytol Histol. 2001;23(4):291–9.

17. Merseburger, Hennenlotter, Horstmann, Kuczyk, Stenzl. Die Tissue Microarray-Technik als neues “high throughput-tool” für den Nachweis differentieller Proteinexpression. J Für Urol Urogynäkologie. 2003;10(3):5–8.

18. Rasmussen, Lelkaitis, Håkansson, Vogelius, Johannesen, Fischer, Bentzen, Specht, Kristensen, von Buchwald, Wessel, Friborg. Intratumor heterogeneity of PD-L1 expression in head and neck squamous cell carcinoma. Br J Cancer. 2019;120(10):1003–6.

19. Koriat, Lichtenstein, Fischhoff. Reasons for confidence. J Exp Psychol [Hum Learn]. 1980;6:107–8.

20. Shirasuna, Hayashido, Sugiyama, Yoshioka, Matsuya. Immunohistochemical localization of epidermal growth factor (EGF) and EGF receptor in human oral mucosa and its malignancy. Virchows Arch A Pathol Anat Histopathol. 1991;418(4):349–53.

21. Iamaroon, Khemaleelakul, Pongsiriwet, Pintong. Co-expression of p53 and Ki67 and lack of EBV expression in oral squamous cell carcinoma. J Oral Pathol Med. 2004;33(1):30–6.

22. Rajeswari, Saraswathi TR. Expression of epithelial growth factor receptor in oral epithelial dysplastic lesions. J Oral Maxillofac Pathol. 2012;16(2):183–8.

23. Swain, Nishat, Ramachandran, Raghuvanshi, Behura, Kumar. Comparative evaluation of immunohistochemical expression of MCM2 and Ki67 in oral epithelial dysplasia and oral squamous cell carcinoma. J Cancer Res Ther. 2022;18(4):997–1002.

24. Zhang, Gao, Xiang, Zhang, Liu WP. Prognostic value and computer image analysis of p53 in mantle cell lymphoma. Ann Hematol. 2022;101(10):2271–9.

25. Baker, Bret-Mounet, Wang, Veta, Zheng, Collins, Eliassen, Tamimi, Heng YJ. Immunohistochemistry scoring of breast tumor tissue microarrays: a comparison study across three software applications. J Pathol Inform. 2022;13:100118.

26. Ram, Vizcarra, Whalen, Deng, Painter, Jackson-Fisher, Pirie-Shepherd, Xia, Powell EL. Pixelwise H-score: a novel digital image analysis-based metric to quantify membrane biomarker expression from immunohistochemistry images. PLoS ONE. 2021;16(9):e0245638. Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8475990/

27. Bankhead, Fernández, McArt, Boyle, Loughrey, Irwin, Harkin, James, McQuaid, Salto-Tellez, Hamilton PW. Integrated tumor identification and automated scoring minimizes pathologist involvement and provides new insights to key biomarkers in breast cancer. Lab Invest. 2018;98(1):15–26.

28. Acs, Pelekanou, Bai, Martinez-Morilla, Toki, Leung SCY, Nielsen, Rimm DL. Ki67 reproducibility using digital image analysis: an inter-platform and inter-operator study. Lab Invest. 2019;99(1):107–17.

29. Conway, Dobson, O’Grady, Kay, Costello, O’Shea. Virtual microscopy as an enabler of automated/quantitative assessment of protein expression in TMAs. Histochem Cell Biol. 2008;130(3):447–63.

30. Zaki, Bulgiba, Ismail NA. Statistical methods used to test for agreement of medical instruments measuring continuous variables in method comparison studies: a systematic review. PLoS ONE. 2012;7(5):e37908.

31. Janse, Hoekstra, Jager, Zoccali, Tripepi, Dekker, van Diepen. Conducting correlation analysis: important limitations and pitfalls. Clin Kidney J. 2021;14(11):2332–7.

32. Bland, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet. 1986;327(8476):307–10.

33. Doğan NÖ. Bland-Altman analysis: a paradigm to understand correlation and agreement. Turk J Emerg Med. 2018;18(4):139–41.

34. Chlipala, Butters, Brous, Fortin, Archuletta, Copeland, Bolon. Impact of preanalytical factors during histology processing on section suitability for digital image analysis. Toxicol Pathol. 2021;49(4):755–72.