Abstract
Objectives
The Gastrointestinal (GI) Pathology subspecialty at our medical center is the highest-volume service in our Anatomic Pathology division, with colon polyps constituting 40% of specimens. The 2021 revisions to colorectal cancer screening guidelines lowered the screening age to 45, which may increase endoscopy procedures and specimen volume. We therefore sought to develop an artificial intelligence (AI) classifier to triage colorectal polyp specimens.
Methods
We retrospectively searched our pathology database from 2021 to 2023 for colon polyps with the following 12 final diagnoses: normal, lymphoid aggregate, inflammatory polyp, hyperplastic polyp, sessile serrated adenoma (with and without dysplasia), traditional serrated adenoma, tubular adenoma, tubulovillous adenoma, villous adenoma, high grade dysplasia (in any adenoma type), and invasive carcinoma. A total of 1191 (759 neoplastic and 432 nonneoplastic) representative slides were scanned using a Leica Aperio GT450 LV1 scanner at 40x magnification. Images were used to train a multi-scale cross-attention multiple instance learning (MsCAMIL) network, a weakly supervised transformer-based model, to perform binary/triage (neoplastic versus nonneoplastic) and 12-way/final diagnosis classifications. Slides were randomly assigned for training (N=715, 60%), validation (N=119, 10%), and testing (N=357, 30%). An additional 40 slides collected from two subsequent clinical service days were included to represent routine clinical cases. In addition, 40 external slides from multiple outside academic and private practice institutions were scanned on both Leica Aperio GT450 LV1 and Leica Aperio ScanScope AT2 scanners to compare performance.
Results
We assessed the following diagnostic performance metrics: the macro-averaged F1-score (F1 = 2 × (precision × recall)/(precision + recall)), micro-averaged accuracy (mACC), and macro-averaged specificity. In binary classification, the model achieved its highest accuracy and F1-score, 95.18% and 97.42%, in archived and routine clinical cases, respectively, with the best discriminatory performance when patches covering the same microscopic fields at 10x and 20x magnifications were combined. In 12-way classification, the F1-score was reduced to 74% and 57% in archived and routine cases, respectively. In the external set, the binary classification performance decreased to 86%, with misclassifications occurring primarily in slides from a single outside institution.
Conclusion
In daily clinical workflow, MsCAMIL networks can function as an efficient triage system by screening colon polypectomy specimens for neoplasm. In binary classification, the classifier showed >95% specificity and accuracy in both archived and routine clinical cases. Multi-institutional collaborations will be necessary to expand patient population and further validate this tool for clinical use.
Introduction
Because colorectal cancer (CRC) deaths can be prevented by early endoscopic identification and removal of precancerous polyps, screening colonoscopy plays a central role in cancer surveillance in both male and female populations.1,2 CRC is the third most common cancer in both males and females in the United States and ranks second in cancer deaths.3 Concerningly, CRC incidence rates have doubled in recent years among 20- to 49-year-olds, leading to the creation of a new category of early-onset colorectal cancers that present at an advanced stage of disease.4 In 2021, the American Cancer Society revised its 2009 CRC screening guidelines, recommending that individuals at average risk for colorectal cancer begin screening at the age of 45 instead of 50.2,5,6 Since this recommendation took effect, screening colonoscopy encounters across our health system increased from 10,367 in 2023 to 11,566 in 2024, an increase of about 12%.
Gastroenterologists have leveraged artificial intelligence by launching endoscopic platforms for real-time polyp histology and computer-aided detection systems that integrate with most endoscopy systems to improve endoscopic visualization of dysplasia/neoplasia.7,8 Although these new endoscopic technologies are slightly more sensitive, they increase the detection of both neoplastic and non-neoplastic lesions, which increases the inspection and polypectomy workload of endoscopists as well as the number of excised specimens the pathologist must examine.9
The local Gastrointestinal (GI) Pathology service has the highest volume of any anatomic pathology subspecialty in our institution, processing approximately 125 formalin-fixed paraffin-embedded (FFPE) tissue blocks per day (range, 90-248 blocks per day); colon polyps account for approximately 40% of the total service volume. To address this mounting challenge and improve the efficiency of our clinical workflow, we sought to develop an artificial intelligence (AI) classifier specifically designed for the triage of colorectal polyps, with a particular focus on dysplastic/adenomatous lesions. With this classifier, we aimed to prioritize neoplastic specimens over non-neoplastic specimens so that molecular testing that informs treatment decisions, such as microsatellite instability and HER2 testing, could be activated promptly.
Methods
Model Development
We developed a transformer-based Multi-Scale Cross-Attention Multiple Instance Learning (MsCAMIL) framework for weakly supervised whole-slide image (WSI) classification, in which each slide was represented as a bag of image patches and only slide-level labels were used for training. The model was designed to integrate information across magnifications, reflecting routine pathologist review of tissue architecture at low power and cytologic detail at higher power.
WSIs were tiled into non-overlapping 512 × 512 pixel patches at selected magnifications. The best-performing binary model used 10× and 20× magnifications, although 5×, 10×, and 20× were also evaluated. Background patches were excluded using an HSV saturation threshold. Each retained patch was resized to 224 × 224 pixels and encoded with a frozen QuiltNet backbone, yielding a 768-dimensional embedding that was projected to a shared 256-dimensional feature space.
The architecture comprised three components: an instance-level encoder, a Multi-Scale Cross-Attention (MsCA) fusion module, and a transformer-based MIL aggregator with a classification head. MsCA modeled interactions between magnification-specific patch embeddings before slide-level aggregation using 4-head cross-attention in a 256-dimensional space. Both high-to-low and low-to-high information flow were evaluated, and the best-performing binary model used the high-to-low configuration with 10× and 20× scales. The MsCAMIL architecture and training hyperparameters are summarized in Supplemental Table 1.
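The cross-magnification fusion step can be illustrated with a deliberately simplified sketch. It shows single-head scaled dot-product cross-attention in plain Python, with no learned projections, residual connections, or normalization, whereas the model described here uses 4 heads in a 256-dimensional space. The toy embeddings, the function names, and the choice of which magnification supplies the queries are assumptions made for illustration only and are not taken from the authors' implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: patch embeddings at the target magnification (n_q x d)
    keys, values: patch embeddings at the source magnification (n_k x d)
    Returns (fused, weights): fused is n_q x d, weights is n_q x n_k.
    """
    d = len(queries[0])
    out_dim = len(values[0])
    fused, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        attn = softmax(scores)
        weights.append(attn)
        fused.append([sum(a * v[j] for a, v in zip(attn, values)) for j in range(out_dim)])
    return fused, weights

# Toy 2-dimensional embeddings (hypothetical values).
low_mag = [[1.0, 0.0], [0.0, 1.0]]               # e.g. two 10x patches
high_mag = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # e.g. three 20x patches

# Assumption: low-magnification tokens act as queries and gather
# information from high-magnification tokens.
fused_low, attn_weights = cross_attention(low_mag, high_mag, high_mag)
```

Each low-magnification token receives a weighted mixture of high-magnification tokens, and the weights in each row sum to 1. Swapping the query and key/value arguments gives the opposite direction of information flow.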
The fused embeddings were then processed by a MIL transformer with 2 self-attention encoder blocks, 4 attention heads, and a feed-forward hidden dimension of 512. A learnable CLS token was used to derive the slide-level representation, and Pyramid Positional Encoding Generator (PPEG) was applied to preserve spatial relationships among patches. The final slide representation was passed through layer normalization and a linear classification head to predict either 2 classes for binary classification or 12 classes for multiclass diagnosis.
The model was trained in PyTorch using the Adam optimizer with a learning rate of 1 × 10⁻⁵ and weight decay of 1 × 10⁻⁵ for 150 epochs. Mixed-precision training was used, and an effective batch size of 4 was achieved through gradient accumulation. Data were split into training (60%), validation (10%), and held-out test (30%) sets stratified by diagnosis, and the best model was selected using validation macro F1-score. For classification, the network was optimized using cross-entropy loss applied to the slide-level output in both the binary and multiclass settings. For a training sample with one-hot ground-truth class label y and predicted probability vector p̂ = softmax(z), the loss was defined as ℒ_CE = −∑_{c=1}^{C} y_c log(p̂_c), where C denotes the number of classes. No auxiliary losses, label smoothing, or class-weighting schemes were applied.
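As a worked illustration of the loss defined above, the following minimal sketch evaluates the cross-entropy for a single slide-level output in plain Python. With a one-hot label the sum reduces to the negative log-probability of the true class. The logit values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits z."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, true_class):
    """L_CE = -sum_c y_c * log(p_c); with one-hot y this is -log(p_true)."""
    probs = softmax(logits)
    return -math.log(probs[true_class])

# Binary setting (C = 2): hypothetical logits favoring the true class 0.
binary_loss = cross_entropy([2.0, 0.0], true_class=0)

# 12-way setting (C = 12): an uninformative prediction with equal logits
# gives a loss of log(12), a useful reference point for an untrained classifier.
uniform_loss = cross_entropy([0.0] * 12, true_class=3)
```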
Data Acquisition and Processing
With institutional IRB approval, a retrospective search of the institutional electronic health record database (Epic SlicerDicer, WI, USA) was performed to include cases from November 1, 2023 to December 18, 2023, looking for colon polyps with the following final diagnoses: normal (N, NPD), lymphoid aggregate (LNagg), inflammatory polyp (IP), hyperplastic polyp (HP), sessile serrated adenoma/lesion (SSA), SSA with conventional dysplasia (SSA-dys), traditional serrated adenoma (TSA), tubular adenoma (TA), tubulovillous adenoma (TVA), villous adenoma (VA), high grade dysplasia (HGD), and invasive carcinoma. A total of 1191 (759 neoplastic and 432 nonneoplastic) representative hematoxylin and eosin (H&E) slides were scanned using a Leica Aperio GT450 LV1 scanner at 40x magnification into svs format, without patient identifiers.
To test the classifier’s performance in real-world conditions and its robustness to variations in tissue processing, an additional test set of 40 slides from colorectal biopsy/polypectomy cases was collected from two subsequent clinical service days in January 2024 to represent routine clinical cases at our institution. Furthermore, to specifically evaluate the impact of different fixation and staining protocols, a third set comprising 40 review slides from multiple outside academic and private practice institutions was scanned on both Leica Aperio GT450 LV1 and Leica Aperio ScanScope AT2 scanners. This multi-institutional set allowed assessment of classifier performance across variable histologic preparation methods.
Preprocessing
Each WSI in svs format was tiled into a sequence of 512 × 512 pixel non-overlapping patches at 5x, 10x, and 20x magnifications (Figure 1), and background patches (i.e., saturation < 15) were discarded. Because patch extraction and filtering were automated, manual annotation of regions of interest (ROI) was not required.
Whole slide imaging in pyramid structure incorporating data from different magnification objectives: 5x, 10x, and 20x
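The tiling and background-filtering step can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' pipeline. It assumes that the saturation threshold of 15 is on a 0-255 scale, that saturation is averaged over each patch, and that partial tiles at the image border are dropped; none of these details is stated in the text. Reading pixels from an svs file is omitted, and the pixel values below are hypothetical.

```python
import colorsys

TILE = 512           # patch edge length in pixels (from the text)
SAT_THRESHOLD = 15   # saturation cutoff from the text; a 0-255 scale is assumed

def tile_origins(width, height, tile=TILE):
    """Top-left corners of non-overlapping tiles that fit fully inside the image."""
    return [(x, y)
            for y in range(0, height - tile + 1, tile)
            for x in range(0, width - tile + 1, tile)]

def mean_saturation(pixels):
    """Mean HSV saturation (0-255 scale) of an iterable of 8-bit RGB tuples."""
    total = 0.0
    count = 0
    for r, g, b in pixels:
        _, s, _ = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        total += s * 255.0
        count += 1
    return total / count

def is_tissue(pixels, threshold=SAT_THRESHOLD):
    """Keep a patch only if its mean saturation reaches the threshold."""
    return mean_saturation(pixels) >= threshold

# A 1024 x 1536 pixel level yields a 2 x 3 grid of tiles.
origins = tile_origins(1024, 1536)

# Hypothetical patches: near-white glass background versus pink-purple stain.
white_patch = [(250, 250, 250)] * 16
stained_patch = [(200, 120, 180)] * 16
keep_white = is_tissue(white_patch)
keep_stained = is_tissue(stained_patch)
```

Because H&E-stained tissue is strongly colored while empty glass is nearly gray, a saturation threshold separates the two without any manual region-of-interest annotation.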
Deep Learning Algorithm Development and Validation
To model data for the weakly supervised Multiple Instance Learning (MIL) approach, the training data are organized into “bags” of instances (patches tiled from each WSI), and each bag is assigned a single label. We developed a novel transformer-based model architecture called MsCAMIL, which jointly learns to distill relevant latent information between representative magnifications and to aggregate features in the correlated MIL setting toward training a classifier (Figure 2).10 The first module, which learns the interactions between bags of patches at varying scales, is modeled using MsCA, and the MIL aggregator is modeled using a set of self-attention blocks.11 We use QuiltNet, a contrastive vision-language model trained on histopathology images and text, to encode tiled patches at the patch level into latent 768-dimensional embeddings for training.12
Workflow of the multi-scale cross attention network (MsCA) multiple instance learning (MIL) framework
MsCAMIL essentially distills information from one magnification to the magnification of choice for modeling (Supplemental Figure 1), mimicking the behavior of pathologists when they zoom in or out during case reviews. MsCAMIL projects all inputs to a 256-dimensional feature space and uses 4 attention heads for both cross-attention and self-attention layers. We trained using cross-entropy loss for 150 epochs with the Adam optimizer, a widely used method that adaptively adjusts the learning rate for each parameter to improve training efficiency and convergence. The learning rate was set to 0.00001, with a batch size of 1 and 4 gradient accumulation steps, giving an effective batch size of 4.13
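The relationship between a batch size of 1, 4 gradient accumulation steps, and an effective batch size of 4 can be checked with a toy example. The sketch below uses a one-parameter linear model with squared-error loss and a plain gradient step in place of Adam, purely to keep the arithmetic transparent. The data, the model, and the 1/4 scaling of each micro-batch gradient are illustrative assumptions, not details from the training code.

```python
def grad_squared_error(w, x, y):
    """d/dw of (w*x - y)^2 for a single sample."""
    return 2.0 * (w * x - y) * x

samples = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 9.0)]  # hypothetical (x, y) pairs
ACCUM_STEPS = 4   # gradient accumulation steps (from the text)
LR = 1e-5         # learning rate (from the text)
w = 0.5           # hypothetical initial parameter

# Reference: gradient of the mean loss over one batch of 4 samples.
full_batch_grad = sum(grad_squared_error(w, x, y) for x, y in samples) / len(samples)

# Accumulation: one sample per step, each gradient scaled by 1/ACCUM_STEPS,
# with a parameter update only after every ACCUM_STEPS samples.
accumulated = 0.0
accumulated_grad = None
w_updated = w
for step, (x, y) in enumerate(samples, start=1):
    accumulated += grad_squared_error(w, x, y) / ACCUM_STEPS
    if step % ACCUM_STEPS == 0:
        accumulated_grad = accumulated
        w_updated = w - LR * accumulated  # plain gradient step; Adam omitted for brevity
        accumulated = 0.0
```

The accumulated gradient equals the gradient of a true batch of 4, which is why accumulation lets one whole-slide bag at a time fit in memory while the optimizer still sees a batch-of-4 update.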
Modeling Tasks
We trained the model to accomplish two tasks:
A) Binary classification that functions as a triage system: Neoplastic or Non-Neoplastic.
B) 12-way classification across the following distinct neoplastic and non-neoplastic classes to specify a diagnosis:
• Neoplastic: “SSA”: 0, “TA”: 1, “TSA”: 2, “TSA+TVA”: 2, “TSA+TA”: 2, “TVA”: 3, “VA”: 4, “SSA-Dys”: 5, “SSA+TA”: 6, “HGD”: 7, “cancer”: 7
• Normal: “LNagg”: 8, “NPD”: 9, “N”: 9, “normal”: 9, “HP”: 10, “IP”: 11
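The label scheme above maps several diagnosis strings onto a shared class index, and the binary triage label follows from which group an index belongs to. A minimal sketch of that mapping is shown below. The dictionary entries are copied from the list above, while the function names and the 0/1 coding of the binary label are assumptions made for illustration.

```python
# Diagnosis string -> 12-way class index, as listed in the text.
LABEL_TO_CLASS = {
    "SSA": 0, "TA": 1, "TSA": 2, "TSA+TVA": 2, "TSA+TA": 2, "TVA": 3, "VA": 4,
    "SSA-Dys": 5, "SSA+TA": 6, "HGD": 7, "cancer": 7,
    "LNagg": 8, "NPD": 9, "N": 9, "normal": 9, "HP": 10, "IP": 11,
}

# Indices 0-7 are listed under "Neoplastic"; indices 8-11 are the remaining group.
NEOPLASTIC_CLASSES = frozenset(range(0, 8))

def to_multiclass(label):
    """Return the 12-way class index for a diagnosis string."""
    return LABEL_TO_CLASS[label]

def to_binary(label):
    """Return 1 for neoplastic and 0 for non-neoplastic (coding is an assumption)."""
    return 1 if LABEL_TO_CLASS[label] in NEOPLASTIC_CLASSES else 0

num_classes = len(set(LABEL_TO_CLASS.values()))
```

For example, `to_binary("HGD")` gives 1 and `to_binary("HP")` gives 0, and mixed labels such as "TSA+TVA" share the TSA index.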
Training and Validation Data Distribution
We evaluated our models on several test sets to assess their performance and generalizability. First, we tested on internal whole-slide images (WSIs) that were consistent with the data used for training, including similar collection and processing workflows. Next, we tested on two additional test sets to examine how well the models performed on different data sources. The first set included 40 slides from routine clinical service at our institution, representing an in-distribution test within the same environment. The second set consisted of slides obtained from outside institutions, scanned using two different Leica slide scanners (Aperio GT450 LV1 and Aperio ScanScope AT2), to assess the model’s robustness to variations in slide preparation and scanning equipment.
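The metrics used in this study (macro-averaged F1-score, accuracy, and macro-averaged specificity) can all be derived from a confusion matrix on any of these test sets. The sketch below shows one way to compute them in plain Python. The confusion matrix is a hypothetical example and does not reproduce any result from this study, and treating undefined ratios as 0 is an assumption.

```python
def per_class_counts(confusion, k):
    """TP, FP, FN, TN for class k; confusion[i][j] counts true class i predicted as j."""
    n = len(confusion)
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp
    fp = sum(confusion[i][k] for i in range(n)) - tp
    total = sum(sum(row) for row in confusion)
    tn = total - tp - fn - fp
    return tp, fp, fn, tn

def safe_div(num, den):
    """Return num/den, or 0.0 when the denominator is zero (an assumed convention)."""
    return num / den if den else 0.0

def evaluate(confusion):
    """Macro F1, overall accuracy, and macro specificity from a confusion matrix."""
    n = len(confusion)
    f1_scores, specificities = [], []
    for k in range(n):
        tp, fp, fn, tn = per_class_counts(confusion, k)
        precision = safe_div(tp, tp + fp)
        recall = safe_div(tp, tp + fn)
        f1_scores.append(safe_div(2 * precision * recall, precision + recall))
        specificities.append(safe_div(tn, tn + fp))
    total = sum(sum(row) for row in confusion)
    accuracy = sum(confusion[k][k] for k in range(n)) / total
    return {
        "macro_f1": sum(f1_scores) / n,
        "accuracy": accuracy,
        "macro_specificity": sum(specificities) / n,
    }

# Hypothetical binary confusion matrix (rows: true class, columns: predicted class).
toy_confusion = [[18, 2],
                 [1, 19]]
metrics = evaluate(toy_confusion)
```

Macro averaging gives each class equal weight regardless of its frequency, which matters in the 12-way task where some diagnoses are rare.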
Results
MsCAMIL Results With Internal Evaluation
Metrics in Binary (Neoplastic Versus Nonneoplastic) and 12-Way (Final Pathology Diagnosis) Classifications With Internal Retrospective UWMC Cases

(A) Low power view of the 3mm sessile sigmoid colon polyp that was misclassified by MsCAMIL binary classification (4x magnification). (B) In 20x magnification, there is one mildly dilated gland at the base (white arrow) and there is increased apoptotic debris in the surface epithelium (black arrow). (C) High power view of the surface epithelium with increased apoptotic debris mimicking pseudostratification of an adenoma/dysplasia (40x magnification)
Multi-Institutional External Evaluation
Distribution of 40 External Slides From 12 Outside Institutions
Performance of Aperio GT450 LV1 and Aperio ScanScope AT2 in Binary (Neoplastic Versus Nonneoplastic) and 12-Way (Final Pathology Diagnosis) Classifications With Internal UWMC 40 Cases From Routine Clinical Service and External 40 Cases From Outside Institutions
Performance on the 40 outside slides scanned with the AT2 was slightly lower than with the GT450, with an AUC of 85.16% in binary classification. Eight slides were misclassified, and the error affected clinical management (colonoscopy surveillance) for only one patient. Of the misclassified slides, 5 (62.5%) originated from the same specialty clinic/practice in Washington whose slides were also misclassified on the GT450 scanner, further supporting the suggestion that this institution’s H&E staining protocol yields poor scanned image quality and subsequently hinders MsCAMIL performance.
Statistical Analysis
We used a two-tailed Z-test for proportions with a two-sided significance level of P ≤ 0.05 to compare the performance of local pathologists and the model on both internal and external slide sets. The model achieved an accuracy of 97.5% for binary classification in 40 routine internal cases when compared to human pathologist interpretations. The two-tailed Z-test showed no statistically significant difference in accuracy between the model and pathologists in these cases (P = 0.31), consistent with the model’s near-perfect performance in identifying neoplastic cases.
For binary classification of 40 external slides scanned on the GT450, the performance difference between the model and pathologists reached statistical significance (P = 0.02), although model accuracy on this dataset remained high. For the 40 external cases scanned on the AT2, the difference was more pronounced (P = 0.003), highlighting a potential impact of scanner variability on model performance.
In contrast, for multi-class classification tasks (12-way and 8-way) focusing on the specific diagnosis, the model’s performance was significantly different from that of the pathologists across all datasets (P < 0.001). This suggests limitations in the model’s ability to recognize subtle histologic features necessary to differentiate between various types of polyps.
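The comparison of proportions described in this section can be sketched with a pooled two-proportion Z-test. The text does not state which variant of the Z-test was used, so the pooled form below is an assumption, as are the function name and the example counts. The counts 39 of 40 for the model and 40 of 40 for the pathologist reference are one possible reading of the 97.5% accuracy on the routine internal set.

```python
import math

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided pooled Z-test for equality of two proportions.

    Returns (z, p_value). The p-value uses the standard normal tail,
    computed as erfc(|z| / sqrt(2)).
    """
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if standard_error == 0:
        return 0.0, 1.0  # identical, degenerate proportions
    z = (p_a - p_b) / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Assumed counts: model correct on 39 of 40 slides, reference correct on 40 of 40.
z_stat, p_val = two_proportion_z_test(39, 40, 40, 40)
```

Under these assumed counts the test gives a two-sided P value of approximately 0.31. With only 40 slides per set, a single additional misclassification shifts the P value substantially, so these comparisons should be read with the small sample size in mind.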
Correlation of Confusion Matrixes/Error Analysis With Clinicopathologic Findings
In the set of 40 routine internal clinical cases, with an AUC of 99.23%, only 1 case was misclassified in binary classification; this misclassification would have changed the recommended frequency of future colonoscopy surveillance for the patient. In this case, the classifier predicted SSA, with histology showing focally dilated glands and increased apoptotic bodies (Figure 3B and C). A diagnosis of SSA would have shortened the recommended interval for repeat colonoscopy to 5 years, whereas the pathologist’s diagnosis of hyperplastic polyp would have led to a recommendation for repeat colonoscopy at the normal 10-year interval.
Applying the 12-way classifier to the 40 routine internal clinical cases led to a decreased AUC of 55.83%, with 14 misdiagnosed cases; however, recommended colonoscopy surveillance intervals would have changed in only three patients, because other parts of the patients’ colonoscopy samples were appropriately classified with binary classification. Comparison of predictions with ground truth in the 40 routine internal slides showed increased confusion between SSA and other entities; SSA is a challenging diagnosis for both general and subspecialized pathologists.14
In the external slide set, comprising community samples scanned with the GT450 scanner, the 12-way classification AUC increased slightly to 67.04%; however, in addition to the confusion involving SSA cases, 4 cancer cases were misdiagnosed as lymphoid aggregate and SSA-dys. Because errors in cancer detection have a dramatic impact on patient care and clinical outcome, we also trained an 8-way classification task restricted to neoplastic entities; with this scheme, diagnostic error increased, with two additional cancer cases missed in the external slide set.
On slides scanned with the AT2, 12-way classification performance decreased further to 61.21% compared with the GT450 scanner; in addition to the misdiagnosis of cancer cases as adenomatous/dysplastic lesions, cancer cases were sometimes called normal colonic mucosa with lymphoid aggregates. With the limited 8-way neoplastic classification scheme, the cancer cases that had been misdiagnosed as normal lymphoid aggregate were corrected, albeit with an increase in cancer cases being diagnosed as adenomatous/dysplastic lesions by the classifier.
Discussion
In Anatomic Pathology, digital pathology has recently transformed the traditional evaluation of histology from glass slides to whole-slide imaging and a digital workflow. This shift has made it possible to integrate computational pathology, including task-specific artificial intelligence, machine learning, and convolutional neural networks, into historical signout workflows.15,16 There are models that can provide primary diagnoses or predict prognosis across diverse populations and slide preparation methods, which can be useful across institutions.17 Given the limited number of studies on colorectal polyps, we sought to create an efficient triage system that is trained on all the entities encountered in colorectal mucosal lesions, including the challenging diagnosis of sessile serrated adenoma/lesion.17,18 In daily routine clinical workflow, the MsCAMIL network can function as an efficient triage system by screening colon polypectomy and biopsy specimens with 97.50% macroaveraged accuracy, misclassifying one serrated polyp, which is a common diagnostic challenge among subspecialized gastrointestinal pathologists. MsCAMIL sorted the cancer and high-grade dysplasia cases correctly, supporting its utility for triaging neoplastic cases that need subsequent orders and ancillary testing.
Our institution is a tertiary care organization that serves a 5-state region comprising Washington, Wyoming, Alaska, Montana, and Idaho, known locally as the WWAMI region. Within our external dataset focusing on the Pacific Northwest region, the binary classification performance of MsCAMIL was consistent, with 87.5% and 80% macroaveraged accuracy on the Leica Aperio GT450 LV1 and Leica Aperio ScanScope AT2, respectively. In our limited external dataset, we encountered slides from one particular outside institution whose H&E protocol may have hindered the classifier’s performance on both Leica Aperio slide scanners. A nationwide multi-institutional collaboration spanning different geographic regions of the United States will be necessary to cover a wider range of tissue preparations and to further diversify the patient pool before this tool can be fully integrated into clinical use.
In addition to multicenter nationwide collaboration, we plan to incorporate the MsCAMIL triage system into our clinical workflow to triage colorectal polyps in the pathologists’ work queue. Before deployment, MsCAMIL needs to meet the recommendations for clinical decision support.18 Therefore, in the predeployment phase, the classifier model would need to be evaluated to ensure there is no bias in the population studied. In our internal dataset, the majority of patients identified themselves as White Americans or of Northern European/European descent and as non-Hispanic or Latino/a or Latinx, with a male to female ratio of 1:1. Although the future is digital and computational pathology, there are many factors to consider when integrating machine learning into a pathology clinical workflow that strives for patient-centered and personalized medicine.19-21
In the past few years, MIL has been utilized for classification of WSIs in digital pathology.22 With multiheaded self-attention-based models that are well suited to parallelization, training has become faster, taking as little as 12 hours; therefore, larger datasets from nationwide collaborations can be used to train MsCAMIL.23 MsCAMIL is a multi-scale vision transformer that effectively combines image patches at different magnifications, in a manner similar to a pathologist reviewing a slide, and can be easily integrated into existing clinical workflows.22,24 Small-scale studies have shown that deep learning techniques can use histological features to detect gene mutations and predict patient survival from colon biopsy samples.25,26 With the increased incidence of early-onset colon cancers, integration of both histology and molecular findings is necessary to stratify patients’ risk of progression to colon cancer and to predict patient outcomes.27 Studies that integrate molecular studies and histologic findings from MIL can explore potential novel biomarkers and therapeutic targets to improve colorectal cancer survival rates.
Conclusion
In conclusion, MsCAMIL demonstrated near-perfect, pathologist-level performance in triaging colonoscopy samples from routine clinical cases at our institution. Extending MsCAMIL to nationwide datasets will help triage cases in pathologists’ work queues for faster turnaround times and increased efficiency in large-volume clinical practices. With the increase in early-onset colon cancers and endoscopy procedures, it is essential to incorporate a triage classifier into the pathology clinical workflow.
Supplemental Material
Supplemental material - Multi-Scale Cross-Attention Multiple Instance Learning Network for Automated Classification of Colorectal Polyps
Supplemental material for Multi-Scale Cross-Attention Multiple Instance Learning Network for Automated Classification of Colorectal Polyps by Wisdom Ikezogwo, Yongjun Liu, Kareem Hosny, Jonathan Henricksen, Robert Ricciotti, Meenal Rawlani , Paul Swanson, Patrick C. Mathias, Noah G. Hoffman, Geoffrey Baird, Luis F. Gonzalez-Cuyar, Linda Shapiro, Deepti Reddi in Cancer Informatics
Footnotes
Acknowledgement
An earlier version of this work was presented as a platform presentation at the United States and Canadian Academy of Pathology 2025 Annual Meeting (USCAP 2025).
Ethical Considerations
Human subjects research was approved by the University of Washington Institutional Review Board, Federalwide Assurance no. FWA00006878. The IRB protocol approval number is 15111. The IRB deemed the screening of records and images as Minimal Risk, category 5, and determined that consent and HIPAA Authorization for this activity may be waived. Human subject activities were performed in accordance with the regulatory requirements laid down in U.S. Code of Federal Regulations, Title 45 Department of Health and Human Services Part 46, Protection of Human Subjects.
Authors contributions
All authors contributed to the study conception and design. Material preparation, data collection, and analysis were performed by Wisdom Ikezogwo, Yongjun Liu, Jonathan Henricksen, and Deepti Reddi. The first draft of the manuscript was written by Wisdom Ikezogwo, Yongjun Liu, and Deepti Reddi, and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript. W.I., Y.J., and D.R. performed study concept and design; all authors were involved in discussion, writing, review, and revision of the paper. All authors read and approved the final paper.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Supplemental Material
Supplemental material for this article is available online.
Appendix
References
