Hybrid DenseNet-U-Net framework for automated grading of renal cell carcinoma

Abstract

Introduction

Accurate grading of renal cell carcinoma (RCC) from histopathology slides is essential for predicting prognosis and selecting treatment, but assessments vary considerably between observers. Many recent AI systems rely on transformer-based and nuclei-centric models, which are computationally demanding, and this limits their clinical use. Hence, there is a need for practical and interpretable solutions that can be integrated into digital pathology workflows.

Methods

We built a hybrid framework that integrates U-Net-based tumor segmentation, convolutional feature extraction, nuclei-aware descriptors, stain normalization, and attention-based multiple instance learning for slide-level RCC grading. The framework was tested on three public datasets (TCGA-ccRCC, RCdpia, MMIST-ccRCC) with a cross-dataset validation approach. Performance was evaluated using macro-F1 and quadratic weighted kappa (QWK).

Results

The proposed method yielded a macro-F1 of 0.94, a QWK of 0.92, and an accuracy of 0.95. Tumor-patch extraction and attention-based aggregation gave the largest improvements. The method was equally effective when tested on different datasets. Most errors made by the model fell within the range of clinical grading variability. Average inference time was around one minute and ten seconds per slide.

Conclusion

A fine-tuned convolutional pipeline can achieve RCC grading performance that few methods rival while remaining efficient and interpretable, which makes it a strong candidate for clinical deployment as a decision-support tool in digital pathology.

Keywords

renal cell carcinoma, U-Net segmentation, automated tumor grading, DenseNet, histopathological imaging

1. Introduction

Renal cell carcinoma (RCC) is the most prevalent type of kidney cancer, accounting for almost 90% of all renal cancers, and is one of the main causes of cancer-related deaths globally.¹ Precise histopathological grading of RCC is essential for predicting outcome, deciding treatment, and stratifying patients.^2,3 The International Society of Urological Pathology (ISUP) grading system, which relies mainly on nuclear morphology and nucleolar prominence, is the standard tool for grading RCC.^1,4 Nevertheless, RCC grading remains an expert-dependent, subjective, and time-consuming task. Variability between and within observers, especially in intermediate grades, can lead to inconsistent diagnoses and adversely affect patient care.^1,3

The rapid development of whole-slide imaging (WSI) and digital pathology has created a strong need for automated, objective, and reproducible grading systems.^5,6 Advances in artificial intelligence have recently enabled more precise and reproducible analysis of histopathology images in different clinical settings.^7–10 Deep learning has emerged as a potent tool in computational pathology, showing excellent results in tumor detection, segmentation, and classification.^11–13 Nevertheless, constructing RCC grading systems that are accurate, computationally efficient, interpretable, and ready for clinical implementation still poses a challenge.^14–16

Figure 1 presents representative histopathological patterns of RCC, demonstrating the subtle nuclear and structural differences that make accurate grading challenging even for expert pathologists.

Figure 1.

Representative histopathological appearance of renal cell carcinoma (RCC) illustrating nuclear morphology and cellular features associated with different tumor grades. The figure highlights the visual variability and subtle structural differences that make RCC grading challenging in routine pathology practice. Reproduced with permission from Hope Diagnostics and Research Lab, Bhubaneswar, India.

1.1. Background

Initial deep learning strategies for RCC grading were mostly limited to patch-level classification using convolutional neural networks such as ResNet and DenseNet.^3,11,12 These methods performed well but typically lacked explicit tumor localization, which made them sensitive to non-tumor tissue, staining differences, and domain shifts between institutions.^17,18

To address these issues, segmentation–classification workflows were introduced. Tumor areas are first detected using a segmentation model, most commonly U-Net, and only patches that contain tumor are passed to the grading stage.^17,19 This strategy increases robustness, because less irrelevant background information is involved, and improves interpretability, because the regions that drive classification decisions can be shown visually.^11,18 Consequently, hybrid CNN pipelines appear to be a realistic compromise between performance and explainability.

More recently, RCC grading studies have explored more complex deep learning models. RAF2Net used an attention-enhanced feature fusion network for automatic grading of RCC from histopathology images.¹⁴ NuAP-RCC introduced a nuclei-aware attention pooling approach that integrates nuclear morphology and deep feature representations through multiple instance learning.⁴ EAT-Net used an Efficient Attention Transformer design to capture local nuclear morphology and global contextual information simultaneously, achieving macro-F1 scores of over 0.93.¹⁶

1.2. Research gap

Segmentation–classification pipelines are still the dominant approach for grading RCC. However, the latest RCC studies are shifting toward nuclei-centric and transformer-based multiple-instance learning (MIL) architectures, achieving macro-F1 scores higher than 0.93.^14,16 These methods hold great promise for improving grading accuracy.

However, these methodologies usually involve complicated multi-step procedures and heavy computational requirements. Transformer-based MIL models typically bring about an increase in memory usage and a decrease in transparency when compared to traditional convolutional methods.^5,6,20 Besides, nuclei-based methods heavily rely on the correctness of nuclei segmentation or detection, thereby adding more processing steps and the possibility of error propagation.^21,22 Additionally, their effectiveness could be compromised by differences in staining protocols, tissue preparation, and scanner characteristics between institutions.^23,24

Therefore, finding RCC grading methodologies that offer a good trade-off between high predictive power on the one hand and simplicity, the ability to handle domain variations, and interpretability on the other hand remains a challenge.

1.3. Motivation

From a clinical point of view, automated RCC grading systems should be not only highly accurate but also stable, interpretable, and practical for daily use. In practice, many pathology labs operate under constraints on computational infrastructure, turnaround time, and technical support. Systems that need large memory resources or sophisticated multi-stage processing may therefore be rejected for practical reasons.

Moreover, the histopathological evaluation of RCC is a highly subjective process, compounded by variations in staining quality, slide preparation, and scanner characteristics between institutions. For automated methods to be of genuine clinical value, they must first show that they operate consistently under such variations. These practical issues motivate grading systems that, besides predictive ability, also emphasize reliability, transparency, and operational feasibility. Creating methods that fit actual clinical work is a crucial step toward wider use of AI-assisted pathology.

1.4. Research hypothesis

While segmentation–classification pipelines remain a mainstay of digital pathology, recent RCC research has focused on nuclei-centric and transformer-based multiple-instance learning (MIL) approaches that achieve high macro-F1 scores. A main disadvantage of these models is their high computational cost and complex processing pipelines.

We hypothesize that an adequately optimized convolutional segmentation–classification pipeline incorporating stain normalization, nuclei-aware descriptors, and attention-based aggregation can deliver performance on a par with transformer-based and nuclei-centric methods while substantially reducing computational complexity, inference time, and memory requirements. The rationale is that CNNs capture spatial hierarchies well and become more powerful when combined with biologically informed feature representations and adaptive aggregation mechanisms.

1.5. Research questions

RQ1: Can a hybrid U-Net + DenseNet pipeline grade RCC as well as transformer-MIL and nuclei-centric methods?

RQ2: Does explicit tumor segmentation improve robustness and grading reliability across multi-institutional datasets?

RQ3: Can stain normalization and attention-based aggregation reduce domain shift and enhance cross-dataset generalization?

RQ4: Does a lightweight CNN-based framework provide a better balance between accuracy, efficiency, and interpretability for clinical deployment?

1.6. Contributions of this work

The main contributions of this study are as follows:

1. We propose a lightweight, purely convolutional hybrid DenseNet–U-Net framework for automated RCC grading that integrates tumor segmentation and classification into an end-to-end pipeline.

2. We demonstrate that a carefully designed CNN-based architecture can achieve competitive performance with recent transformer-MIL and nuclei-centric RCC models while being computationally more efficient.

3. We incorporate stain normalization and attention-based aggregation to improve robustness and cross-dataset generalization.

4. We validate the proposed framework on three publicly available datasets (TCGA-ccRCC, RCdpia, and MMIST-ccRCC), demonstrating strong segmentation accuracy, reliable grading performance, and generalizability.

5. We show that classical segmentation–classification pipelines, when modernized and carefully optimized, remain highly relevant and clinically deployable alternatives to complex transformer-based solutions.

2. Related work

Digital pathology research on RCC grading has progressively evolved from patch-level convolutional classifiers toward attention-driven aggregation and transformer-based frameworks.^{2,14–16,25–28} These approaches have improved predictive performance on public and multi-center datasets, often reporting macro-F1 values above 0.90. However, these gains are frequently accompanied by increased computational requirements, more complex training pipelines, and reduced interpretability, which limit clinical deployability.^5,6,20

2.1. Patch-level CNN grading on histopathology images

Early and mid-stage RCC grading pipelines primarily learned discriminative texture and morphology cues from image patches using convolutional neural networks (CNNs).^3,5,6 A representative work is RCCGNet, which proposed a lightweight CNN with a shared-channel residual design and reported strong performance on the KMC kidney histopathology dataset, achieving F1-scores close to 0.89 and accuracy around 0.90.^1,2

Such patch-level CNN approaches are typically easier to train and deploy than whole-slide image (WSI)–level multiple instance learning (MIL) pipelines, making them attractive for practical clinical settings.^20,29 However, they are sensitive to stain variability and background tissue and may struggle when grade-relevant patterns are spatially sparse or heterogeneously distributed.^16,17 Several additional deep learning approaches for RCC grading and histopathology image analysis have been proposed in recent studies.^30–33

2.2. CNN–attention and hybrid designs improving discriminability

To better capture diagnostically relevant regions, later works incorporated attention mechanisms or hybrid modules for adaptive feature emphasis.^16,34 For example, RCG-Net integrated dynamic attention with an adaptive convolution strategy and reported improved grading performance on the KMC dataset with a weighted F1-score of approximately 0.909.³⁵ Similarly, the EFF-Net framework targeted efficient RCC grading by combining convolutional feature extraction with separable convolutions, reporting F1-scores around 0.919 while emphasizing computational efficiency.³⁶

These studies collectively show that carefully designed CNN + attention methods can achieve high grading accuracy without relying entirely on computationally heavy transformer architectures.^34,36

2.3. Nuclei-focused learning and graph-based aggregation

Another parallel research avenue has explicitly modeled nuclear morphology and spatial distribution, inspired by the fact that RCC grading criteria in the WHO/ISUP system heavily rely on nuclear size, pleomorphism, and nucleolar prominence.^1,4 NuAP-RCC is a well-known nuclei-focused method that leverages nuclei-based features and graph-based aggregation along with global CNN representations, demonstrating good generalization across datasets and institutions.⁴

Although nuclei-focused methods bring enhanced biological plausibility and interpretability, they necessitate extra steps such as precise nuclei detection or segmentation and graph construction, which complicate the pipeline and increase computational overhead.^16,18,37 These dependencies also present a higher potential for error propagation, especially under variable staining and scanning conditions.^22,24

2.4. Transformer and dual-stream attention frameworks

Recently, more research has focused on transformer-based context modeling and dual-stream designs for RCC grading.^13,36 EAT-Net is one example that explicitly considers efficiency measures such as parameter count and GFLOPs while remaining competitive in grading performance and reporting F1-scores around 0.92 on the KMC dataset.¹⁵

While transformer architectures improve global context modeling, their multi-branch attention mechanisms and heavy feature fusion stages significantly increase inference time and reduce transparency, which may limit deployment in routine pathology workflows, particularly in low-resource settings.^5,6,13

2.5. Feature fusion and deep transfer learning

Several recent studies have reported strong RCC grading performance using feature fusion strategies and transfer learning.^18,19 For instance, fusion frameworks and deep transfer learning pipelines have achieved classification accuracies exceeding 0.93 on benchmark datasets such as KMC.¹⁸ Other Scientific Reports studies have demonstrated that carefully tuned CNN backbones combined with transfer learning and hybrid modeling can achieve accuracy values close to 0.94.^2,5

These outcomes highlight the importance of strong backbone feature representations and well-designed aggregation functions. However, these methods often involve multiple optimization stages and sophisticated training strategies, making reproducibility and large-scale deployment challenging.^6,20

2.6. Theoretical advantages of the proposed framework

The proposed framework balances performance and efficiency through a fully convolutional structure with attention-based aggregation and nuclei-aware features. Transformer-based models such as EAT-Net depend on self-attention, whose cost grows quadratically with the number of tokens, so they consume more memory and computation. In contrast, the cost of convolutional architectures such as DenseNet grows linearly with image size, which allows more efficient feature extraction. In addition, U-Net-based tumor segmentation focuses feature learning on the relevant regions and avoids unnecessary computation on background tissue. Nuclei-level descriptors bring in biologically meaningful priors consistent with ISUP grading criteria, which the model can use to recognize clinically relevant features without complicated graph-based or transformer architectures. Finally, attention-based MIL assigns adaptive weights to patches according to their importance at a much lower computational cost than transformer-based aggregation. This balance allows the framework to produce results comparable to those of more complex models with significantly fewer parameters and faster prediction times.
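The scaling argument above can be made concrete with a small count. The sketch below is illustrative only: it counts score evaluations per layer and ignores feature dimension and constant factors, and the patch counts are hypothetical rather than taken from the datasets.

```python
def self_attention_pairs(num_tokens: int) -> int:
    """Pairwise interactions scored by one global self-attention layer."""
    return num_tokens * num_tokens


def attention_pooling_scores(num_tokens: int) -> int:
    """Scores computed by attention pooling: one scalar per instance."""
    return num_tokens


for n in (100, 1_000, 10_000):  # hypothetical numbers of patches per slide
    ratio = self_attention_pairs(n) // attention_pooling_scores(n)
    print(f"N={n:>6}: self-attention pairs={self_attention_pairs(n):>12,}  "
          f"pooling scores={attention_pooling_scores(n):>6,}  ratio={ratio:,}x")
```

The gap between the two counts grows in proportion to the number of patches, which is why the difference matters most for large slides.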

Beyond these theoretical considerations, the results in Section 5.6 show that tumor segmentation plays a crucial role: performance drops by a large margin (∼5% macro-F1) when it is removed, suggesting that spatial filtering of tumor regions improves feature quality. Nuclei-level features give a steady increase in performance (∼2–3%), indicating that biologically grounded features complement deep CNN representations. Attention-based MIL improves aggregation, yielding a gain of ∼3% over majority voting while keeping the computational overhead low. In contrast, transformer-based MIL methods depend on global self-attention with quadratic complexity, so their computational cost rises without a proportional increase in performance for this task.

Taken together, the evidence suggests that the three components are not combined arbitrarily but complement each other functionally: segmentation reduces noise, nuclei descriptors add biological significance, and AB-MIL enables efficient contextual aggregation. Together they form a computationally efficient alternative to transformer-based and graph-based frameworks while reaching comparable accuracy. Although transformer-based models usually demand 3–5 times more GFLOPs and memory, the proposed framework attains comparable macro-F1 performance (0.94) at much lower computational complexity.

2.7. Summary

Overall, the literature shows a clear performance trend toward nuclei-focused and attention/transformer-based aggregation strategies, with many works reporting F1-scores in the range of 0.90–0.92 and sometimes higher depending on dataset and evaluation protocol.^4,14,15

However, there remains an under-explored opportunity to achieve state-of-the-art RCC grading performance using a clinically deployable footprint: a lightweight, purely convolutional pipeline paired with careful stain normalization and attention-based aggregation that supports interpretability, computational efficiency, and fast inference.^22–24,34

This gap directly motivates the research hypothesis and research questions introduced in Sections 1.4 and 1.5, positioning our work as a practical and clinically oriented alternative to increasingly complex transformer- and nuclei-centric RCC grading frameworks.

A comparative summary of recent RCC grading methods and their reported performance is provided in Table 1.

Table 1.

Comparison with recent RCC grading papers.

Method	Year	Backbone/aggregation	Reported metric (F1/Macro-F1)	QWK	Dataset(s) (as reported)
RCCGNet	2023	Lightweight CNN (SCR blocks)	F1 = 0.8906	–	KMC kidney histopathology (KMC-RENAL)
RCG-Net	2024	CNN + transformer encoder + dynamic attention	Weighted F1 = 0.9092	–	KMC (RCC dataset of KMC)
EFF-Net	2024	Efficient CNN + separable conv	F1 = 0.9190	–	KMC
NuAP-RCC	2025 (pub), 2024 (DOI year)	Nuclei-focused + GNN aggregation + CNN global	Not visible in open metadata page	–	Multi-institution (USM-RCC mentioned)
EAT-Net	2025	EfficientNet stream + ViT stream (dual)	F1 = 0.9225	–	KMC-RENAL (+ external test on RCCGNet dataset)
Gini + fusion framework	2025	Attention-aided CNN fusion	Accuracy = 0.9304 (F1 not in abstract)	–	RCC grading (paper reports RCC accuracy)
CVDTLM-AGRCC	2025	Transfer learning + hybrid pipeline	Accuracy = 0.9389 (F1 not in abstract)	–	KMC
EDTL-PCGRCC	2025	Improved MobileNetV2 features + ENN classifier	Metrics not captured in visible abstract excerpt	–	Biomedical image dataset (RCC grading)
Hybrid DenseNet–U-Net framework (Proposed Method)	2025-26	Pure CNN + stain norm + attention aggregation	Macro-F1 = 0.94	QWK = 0.92	TCGA/RCdpia/MMIST

3. Methodology

This section describes the full design of the proposed RCC grading framework. The pipeline combines explicit tumor localization with U-Net, feature extraction with DenseNet and other modern CNN backbones, attention-based multiple instance learning (MIL) for slide-level aggregation, and nuclei-level morphological analysis aligned with the ISUP grading criteria. The system is designed to be lightweight, interpretable, and ready for clinical deployment while performing on par with the latest transformer and nuclei-centric models.

3.1. Overall framework

The proposed framework performs automated grading of RCC from whole-slide histopathology images. In the first step, a U-Net model detects tumor areas, and tumor-rich patches are extracted from the WSI. Each patch is then converted into a deep feature representation by a CNN backbone. In parallel, nuclei instance segmentation provides morphological features that describe nuclear size, shape, and intensity patterns. CNN features and nuclei descriptors are combined at the patch level, and all patches of a slide are aggregated using an attention-based multiple instance learning (AB-MIL) module. Finally, the aggregated slide-level feature is used to infer the RCC grade. The framework thus combines explicit tumor localization, biologically meaningful nuclei analysis, and attention-based aggregation within a lightweight and clinically interpretable setting (Figure 2).

Figure 2.

Overall architecture of the proposed RCC grading framework integrating U-Net–based tumor segmentation, CNN feature extraction, nuclei-level analysis, and attention-based MIL aggregation for slide-level prediction.

This architecture ensures spatial interpretability via tumor segmentation, feature discriminability via CNN embedding, contextual reasoning via attention-based aggregation, and biological relevance via nuclei morphology modeling.
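To illustrate the slide-level aggregation step, the following minimal sketch implements one common form of attention-based MIL pooling. The text above does not specify the attention function, so the tanh-based scoring, the parameter names `V` and `w`, and the toy values are assumptions. Plain Python is used for readability; a real pipeline would use a tensor library.

```python
import math
from typing import List, Sequence, Tuple


def attention_mil_pool(
    embeddings: Sequence[Sequence[float]],
    V: Sequence[Sequence[float]],
    w: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Return (attention weights, slide embedding) for one bag of patches.

    score_k = w . tanh(V h_k); weights = softmax(scores); slide = sum_k a_k h_k
    """
    scores = []
    for h in embeddings:
        hidden = [math.tanh(sum(v_j * h_j for v_j, h_j in zip(row, h))) for row in V]
        scores.append(sum(w_i * u_i for w_i, u_i in zip(w, hidden)))

    # Numerically stable softmax over the patches of the slide.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    dim = len(embeddings[0])
    slide = [sum(a * h[j] for a, h in zip(weights, embeddings)) for j in range(dim)]
    return weights, slide


# Toy bag: three patches with 2-D embeddings; V and w are illustrative values.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
w = [1.0, 1.0]
weights, slide_vec = attention_mil_pool(patches, V, w)
print(weights, slide_vec)
```

The attention weights sum to one over the patches of a slide, so they can also be displayed as a heat map showing which patches contributed most to the predicted grade.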

3.2. WSI preprocessing and patch generation

Each WSI is first processed to remove background using Otsu thresholding in the HSV color space. Tissue regions are identified, and patches with less than 20% tissue coverage are discarded. Non-overlapping patches of size 224×224 pixels are extracted at 20× magnification. These patches serve as the basic processing units for both segmentation and classification tasks.
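The tiling and tissue-filtering steps can be sketched as follows. This is a minimal illustration under stated assumptions: the text names the HSV color space but not the channel, so the saturation channel is assumed (tissue taken to be more saturated than the white background), the threshold is computed once on a toy list of 8-bit values, and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def otsu_threshold(values: Sequence[int]) -> int:
    """Otsu threshold for 8-bit values: maximize between-class variance."""
    hist = [0] * 256
    for v in values:
        hist[v] += 1
    total = len(values)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of values <= t
    sum0 = 0  # intensity sum of values <= t
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        mean0, mean1 = sum0 / w0, (sum_all - sum0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def tissue_fraction(saturation: Sequence[int], threshold: int) -> float:
    """Fraction of pixels whose saturation exceeds the threshold (assumed tissue)."""
    return sum(1 for v in saturation if v > threshold) / len(saturation)


def keep_patch(saturation: Sequence[int], threshold: int, min_tissue: float = 0.20) -> bool:
    """Discard patches with less than 20% tissue coverage."""
    return tissue_fraction(saturation, threshold) >= min_tissue


def patch_origins(width: int, height: int, size: int = 224) -> List[Tuple[int, int]]:
    """Top-left corners of non-overlapping size x size patches."""
    return [(x, y)
            for y in range(0, height - size + 1, size)
            for x in range(0, width - size + 1, size)]


# Toy saturation values: 70 background pixels and 30 tissue pixels.
slide_saturation = [10] * 70 + [200] * 30
t = otsu_threshold(slide_saturation)
patch_a = [200] * 25 + [10] * 75  # 25% tissue
patch_b = [200] * 10 + [10] * 90  # 10% tissue
print(t, keep_patch(patch_a, t), keep_patch(patch_b, t))
print(patch_origins(500, 450))
```

In practice the threshold would typically be estimated on a low-resolution thumbnail of the slide and then applied to each patch.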

To reduce inter-dataset color variability, Macenko stain normalization is applied to all patches. During training, data augmentation is used to improve generalization and includes:

• Random rotation (0°–360°)

• Horizontal and vertical flipping

• Random scaling (0.9–1.1)

• Brightness and contrast variation (±10%)
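The augmentation ranges listed above can be expressed as a parameter sampler. The sketch below only draws the parameters and does not apply them to an image. The uniform distributions, the 0.5 flip probability, and the reading of ±10% as a multiplicative factor in [0.9, 1.1] are assumptions, as are all names.

```python
import random
from dataclasses import dataclass


@dataclass
class AugmentationParams:
    rotation_deg: float
    flip_horizontal: bool
    flip_vertical: bool
    scale: float
    brightness: float  # multiplicative factor (assumed interpretation of +/-10%)
    contrast: float    # multiplicative factor (assumed interpretation of +/-10%)


def sample_augmentation(rng: random.Random) -> AugmentationParams:
    """Draw one set of augmentation parameters within the stated ranges."""
    return AugmentationParams(
        rotation_deg=rng.uniform(0.0, 360.0),
        flip_horizontal=rng.random() < 0.5,
        flip_vertical=rng.random() < 0.5,
        scale=rng.uniform(0.9, 1.1),
        brightness=rng.uniform(0.9, 1.1),
        contrast=rng.uniform(0.9, 1.1),
    )


rng = random.Random(0)  # fixed seed so the sketch is reproducible
print(sample_augmentation(rng))
```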

Besides Macenko and Vahadane normalization, other stain normalization approaches have been proposed, including Reinhard color transfer²³ and GAN-based normalization,³⁸ along with further preprocessing methods that reduce inter-institutional variability and improve robustness in histopathology image analysis.^39–43

3.3. Tumor segmentation using U-Net

Tumor region localization is performed using a fully convolutional U-Net architecture,¹² as illustrated in Figure 3. This module serves as the first stage of the pipeline and ensures that all subsequent grading decisions are restricted to tumor tissue only. The segmentation process is formulated as a binary pixel-wise classification problem, separating tumor from non-tumor regions.

Figure 3.

U-Net architecture used for tumor region segmentation in RCC histopathology images.

The U-Net–based segmentation module operates through the following stages:

1. Input Patch Formation: Each WSI is tiled into patches of size 224×224 pixels. These patches are used as inputs to the U-Net model for dense tumor prediction at pixel resolution.

2. Encoder Representation Learning: The encoder consists of multiple convolutional blocks, each with the structure shown in Equation (1):

$$\mathrm{Conv}(3 \times 3) \to \mathrm{BatchNorm} \to \mathrm{ReLU} \to \mathrm{MaxPool}(2 \times 2)$$

(1)

This phase gradually decreases the spatial resolution and at the same time enlarges the receptive field and semantic abstraction so that the network can learn the high-level tumor morphology and the tissue context.

3. Bottleneck Feature Encoding: At the deepest level, the network captures contextual information on a global scale, including the overall tumor structure, the cellular density, and the large-scale architectural patterns that separate tumor tissue from normal tissue.

4. Decoder and Spatial Reconstruction: The decoder mirrors the encoder and restores the spatial resolution using transposed convolutions, as shown in Equation (2):

$$\mathrm{UpConv}(2 \times 2)$$

(2)

At each decoding level, the upsampled feature map is concatenated with the corresponding encoder feature map via skip connections as shown in Equation (3):

$$F_{dec}^{(l)} = \mathrm{Concat}\left(F_{enc}^{(l)}, F_{up}^{(l)}\right)$$

(3)

where $F_{enc}^{(l)}$ is the feature map obtained from the encoder at level $l$; it contains high-resolution spatial information such as fine boundaries, textures, and edge details. $F_{up}^{(l)}$ is the feature map obtained by upsampling the decoder output from the previous level; it carries higher-level semantic information about tumor regions at a coarse spatial scale. $\mathrm{Concat}(\cdot)$ denotes channel-wise concatenation, meaning that the two feature maps are stacked along the channel dimension. $F_{dec}^{(l)}$ is the resulting decoder feature map at level $l$, which combines precise localization information from the encoder with contextual semantic information from the decoder.

5. Boundary Refinement: To help separate the tumor from the background, the convolutional layers that follow each concatenation integrate low-level texture information with high-level semantic features. This combination also helps to filter out false positives.

6. Tumor Probability Estimation: The output layer applies a 1 × 1 convolution followed by a sigmoid activation, producing a pixel-wise tumor probability map, as shown in Equation (4):

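$$P(x, y) = \sigma\left(W * F(x, y) + b\right)$$

(4)

where $P(x, y)$ represents the likelihood that pixel $(x, y)$ belongs to tumor tissue, $\sigma$ is the sigmoid function, $F(x, y)$ is the final decoder feature at that pixel, and $W$ and $b$ are the weights and bias of the 1 × 1 convolution.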

7. Binary Tumor Mask Generation: The probability map is thresholded to generate a binary tumor mask, as shown in Equation (5):

$$M(x, y) = \begin{cases} 1, & P(x, y) \geq \tau \\ 0, & \text{otherwise} \end{cases}$$

(5)

where $\tau$ is an empirically selected threshold.

8. Loss Function and Optimization: The segmentation network is optimized using a composite loss as shown in Equation (6):

$$L_{seg} = \lambda L_{Dice} + (1 - \lambda) L_{BCE}$$

(6)

where

• $L_{seg}$ is the final segmentation loss.

• $L_{Dice}$ is the Dice loss, which measures the overlap between the predicted tumor region and the ground-truth tumor mask. It is especially effective for medical images where foreground (tumor) pixels are much fewer than background pixels.

• $L_{BCE}$ is the Binary Cross-Entropy loss, which treats segmentation as a pixel-wise binary classification task and encourages accurate probability estimation for each pixel.

• $\lambda \in [0, 1]$ is a weighting parameter that controls the relative importance of Dice loss and BCE loss.

9. Tumor-Guided Patch Filtering: The tumor mask is mapped back onto the WSI to locate the tumor, and tumor-rich patches are selected accordingly. Feature extraction and grading are performed only on patches with significant tumor-area overlap. This filtering reduces background noise and prevents non-tumor tissue from influencing the classification stage.

10. Clinical and Methodological Significance: The segmentation-first strategy offers several potential benefits: it improves spatial interpretability, removes classification bias caused by irrelevant tissue, increases robustness across datasets, and creates a clinically auditable link between predicted grades and tumor morphology.
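Steps 6, 7, and 9 above (probability estimation, thresholding, and tumor-guided patch filtering) can be sketched together. The values τ = 0.5 and a minimum tumor fraction of 0.5 are hypothetical placeholders, since the text describes τ only as empirically selected and does not quantify "significant" overlap. Function names are illustrative.

```python
import math
from typing import Dict, List, Sequence, Tuple

Mask = List[List[int]]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def tumor_mask(logits: Sequence[Sequence[float]], tau: float = 0.5) -> Mask:
    """Equations (4)-(5): P = sigmoid(logit); M = 1 where P >= tau, else 0."""
    return [[1 if sigmoid(z) >= tau else 0 for z in row] for row in logits]


def tumor_fraction(mask: Mask) -> float:
    """Share of pixels in the patch that are labelled tumor."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


def select_tumor_patches(
    masks: Dict[Tuple[int, int], Mask], min_fraction: float = 0.5
) -> List[Tuple[int, int]]:
    """Keep patch origins whose tumor fraction reaches the (assumed) minimum."""
    return [origin for origin, m in masks.items() if tumor_fraction(m) >= min_fraction]


# Toy 2x2 logit maps (the W * F + b term) for two patches.
mask_a = tumor_mask([[2.0, -2.0], [0.0, 3.0]])    # three of four pixels are tumor
mask_b = tumor_mask([[-3.0, -1.0], [-2.0, 1.5]])  # one of four pixels is tumor
selected = select_tumor_patches({(0, 0): mask_a, (224, 0): mask_b})
print(mask_a, mask_b, selected)
```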

Using U-Net-based tumor segmentation as a mandatory preprocessing step enforces anatomical correctness, increases the accuracy of subsequent grading, and brings the automated pipeline closer to the pathologist’s workflow commonly used in RCC diagnosis. Models such as 3D U-Net,¹³ UNet++,¹⁴ and nnU-Net¹⁵ have shown that the U-Net family is well suited to biomedical segmentation.
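The composite segmentation loss of Equation (6) can be sketched on flattened probability and label lists. The soft Dice formulation, the smoothing constants, and the default λ = 0.5 are assumptions, since the text does not report them. Plain Python is used for clarity rather than a deep learning framework.

```python
import math
from typing import Sequence


def dice_loss(probs: Sequence[float], targets: Sequence[int], eps: float = 1e-6) -> float:
    """Soft Dice loss: 1 - 2|P.T| / (|P| + |T|), smoothed by eps."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    return 1.0 - (2.0 * intersection + eps) / (sum(probs) + sum(targets) + eps)


def bce_loss(probs: Sequence[float], targets: Sequence[int], eps: float = 1e-7) -> float:
    """Mean binary cross-entropy with probabilities clipped away from 0 and 1."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)


def segmentation_loss(probs: Sequence[float], targets: Sequence[int], lam: float = 0.5) -> float:
    """Equation (6): lam * Dice + (1 - lam) * BCE, with lam in [0, 1]."""
    return lam * dice_loss(probs, targets) + (1.0 - lam) * bce_loss(probs, targets)


targets = [1, 0, 1, 0]
good = [0.9, 0.1, 0.8, 0.2]  # close to the ground truth
bad = [0.2, 0.8, 0.3, 0.7]   # mostly wrong
print(segmentation_loss(good, targets), segmentation_loss(bad, targets))
```

Setting `lam` to 1 recovers the pure Dice loss and setting it to 0 recovers pure BCE, which is one way to check the relative contribution of each term.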

3.4. Patch-level feature extraction using CNN backbones

Patch-level feature extraction uses deep convolutional neural network (CNN) backbones, which transform each tumor patch into a compact and highly discriminative feature representation. The architecture of the CNN feature extraction component is shown in Figure 4. At this stage, the CNN learns texture, structural, and morphological patterns that are indicative of RCC grade.

Figure 4.

CNN backbone architectures used for patch-level feature extraction, including DenseNet-121 as the primary model and EfficientNet-B4 and ConvNeXt-Tiny as modern alternatives.

DenseNet-121 features a dense connectivity pattern in which each layer within a dense block receives the feature maps of all preceding layers as input. This layout enables efficient feature reuse, improves gradient flow, and reduces the number of trainable parameters compared with conventional deep CNNs. These properties make DenseNet well suited to medical imaging tasks where datasets are small yet high representational capacity is needed to capture subtle histopathological variations.

Besides DenseNet-121, two further CNN backbones were evaluated to compare the proposed framework with recent state-of-the-art feature extractors:

• EfficientNet-B4, which scales network depth, width, and resolution in a compound manner to improve performance while keeping the computational cost low.

• ConvNeXt-Tiny, a modern convolutional architecture that draws on transformer design principles, offers improved representational power, and has shown strong performance on vision benchmarks.

To adapt to histopathological textures and cellular structures, all CNN backbones are initialized with ImageNet-pretrained weights and then fine-tuned on renal cell carcinoma (RCC) tumor patches. For each tumor patch, the output feature maps of the last convolutional layer are converted into a fixed-length vector by global average pooling, as shown in Equation (7):

$$x_{i} \in \mathbb{R}^{d}$$

(7)

where $x_{i}$ denotes the feature embedding of the $i$-th tumor patch and $d$ represents the dimensionality of the embedding space. This compact representation serves as the input to subsequent modules, including nuclei-level feature fusion and attention-based multiple instance learning (AB-MIL) for slide-level aggregation.
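Global average pooling, which produces the embedding in Equation (7), reduces each channel of the final feature map to its spatial mean. A minimal sketch with a toy two-channel feature map follows; the nested-list layout `feature_map[channel][row][column]` is an assumption made for illustration.

```python
from typing import List, Sequence


def global_average_pool(feature_map: Sequence[Sequence[Sequence[float]]]) -> List[float]:
    """Average each channel over its spatial grid: (C, H, W) -> vector of length C."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


# Toy feature map with C = 2 channels on a 2 x 2 spatial grid.
toy_map = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 8.0]],
]
x_i = global_average_pool(toy_map)
print(x_i)  # one value per channel, so d equals the channel count
```

The embedding dimension $d$ therefore equals the number of channels in the last convolutional layer. In the standard DenseNet-121 this is 1024, assuming the backbone is used without modification.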

By evaluating DenseNet-121 alongside modern CNN backbones such as EfficientNet-B4 and ConvNeXt-Tiny, the proposed framework allows a systematic comparison of classical and modern architectures in terms of performance and computational efficiency for RCC grading. Various convolutional neural network architectures have been successfully applied for feature extraction in histopathology image analysis.^44–47 EfficientNet-B4 uses compound model scaling to obtain better accuracy–efficiency trade-offs,³ whereas ConvNeXt updates the convolutional design with transformer-inspired architectural features.⁵ EfficientNetV2 further improves training speed and parameter efficiency while still delivering high accuracy.⁴


3.5. Nuclei instance segmentation

In this work, nuclei instance segmentation is performed using HoVer-Net because of its robustness in separating overlapping nuclei in histopathology images. The model is initialized with weights pretrained on the PanNuke dataset and fine-tuned on RCC patches.

Segmentation outputs are post-processed using the following steps:

• Minimum nucleus area threshold: 50 pixels (to remove debris)

• Maximum area threshold: 2000 pixels (to remove merged regions)

• Watershed refinement for boundary separation

Only nuclei satisfying these criteria are retained for feature extraction.
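
The area-based part of this post-processing can be sketched as follows. The snippet assumes the segmentation output is an integer instance map (0 = background, one positive label per nucleus) and treats both thresholds as inclusive; neither detail is stated in the text. Watershed refinement is omitted, and the function name is hypothetical.

```python
import numpy as np

MIN_AREA = 50    # pixels, removes debris (value from the text)
MAX_AREA = 2000  # pixels, removes merged regions (value from the text)

def filter_nuclei_by_area(instance_map: np.ndarray) -> np.ndarray:
    """Set to 0 every labelled nucleus whose pixel area is outside [MIN_AREA, MAX_AREA]."""
    labels, counts = np.unique(instance_map, return_counts=True)
    keep = [lab for lab, c in zip(labels, counts)
            if lab != 0 and MIN_AREA <= c <= MAX_AREA]
    return np.where(np.isin(instance_map, keep), instance_map, 0)

# Toy instance map with three labelled objects.
inst = np.zeros((100, 100), dtype=np.int32)
inst[0:10, 0:10] = 1      # 100 px: kept
inst[20:23, 20:23] = 2    # 9 px: discarded as debris
inst[40:90, 40:90] = 3    # 2500 px: discarded as a merged region
filtered = filter_nuclei_by_area(inst)
```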

Nuclei segmentation is performed on each tumor patch with state-of-the-art instance segmentation models such as HoVer-Net¹⁶ or StarDist.¹⁸ These networks are designed specifically for histopathology images and can detect, separate, and delineate individual nuclei even in densely packed tumor areas. Their instance-aware design yields precise nuclear contours, which is essential for reliable morphological measurement.

Following segmentation, each nucleus is treated as an individual object, and a set of morphological and intensity-based descriptors is computed:

• Area: measures nuclear size, which increases with tumor grade

• Perimeter: captures boundary complexity

• Eccentricity: reflects elongation and deviation from circular shape

• Compactness: quantifies nuclear shape regularity

• Irregularity index: indicates pleomorphism and boundary distortion

• Mean nuclear intensity: relates to chromatin density

• Intensity variance: measures intra-nuclear heterogeneity

• Proxy measure of nucleolar prominence: approximates nucleolar visibility using localized intensity peaks

These quantitative descriptors can be interpreted as a computational representation of the ISUP grading criteria, in particular nuclear enlargement, pleomorphism, and nucleolar prominence. Moreover, by integrating nuclei-level analysis into the grading pipeline, the proposed method bases its predictions on biologically interpretable cellular features rather than solely on abstract deep representations.

Different nuclei segmentation methods such as Cellpose¹⁹ and regression-based distance map techniques²⁰ have also been reported to work well for histopathology. The idea of star-convex polygon modeling was initially proposed by Schmidt et al.,¹⁷ which forms the foundation of the StarDist tool.

3.6. Nuclei feature aggregation and fusion

Following nuclei instance segmentation and feature extraction, the nuclear descriptors obtained from individual nuclei within a tumor patch are aggregated to form a patch-level representation. Since each patch may contain a variable number of nuclei, a fixed-length nuclear feature vector is generated using statistical pooling operations. Specifically, for each nuclear attribute (area, eccentricity, intensity variance, etc.), the mean, standard deviation, and maximum values are computed across all nuclei in the patch. This yields a compact and robust nuclear descriptor, $f_{nuclei}$, that captures the overall morphological and intensity characteristics of the nuclei in the patch. Statistical pooling is insensitive to the number of nuclei present, while retaining both the central tendencies and the extreme morphological variations that are usually associated with higher tumor grades. In parallel, a deep feature embedding from the CNN backbone, $f_{CNN}$, is used to represent each tumor patch.
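
The statistical pooling described above can be sketched as follows. The layout of the output vector (all means, then all standard deviations, then all maxima) and the all-zero descriptor for a patch without nuclei are assumptions made for illustration; the function name is hypothetical.

```python
import numpy as np

def pool_nuclei_features(per_nucleus: np.ndarray) -> np.ndarray:
    """Map an (n_nuclei, n_attributes) array to a fixed-length vector [mean | std | max]."""
    n_attr = per_nucleus.shape[1]
    if per_nucleus.shape[0] == 0:
        # Assumption: a patch with no retained nuclei gets an all-zero descriptor.
        return np.zeros(3 * n_attr)
    return np.concatenate([
        per_nucleus.mean(axis=0),
        per_nucleus.std(axis=0),
        per_nucleus.max(axis=0),
    ])

# Three nuclei described by two attributes (e.g. area in pixels, eccentricity).
nuclei = np.array([[100.0, 0.2],
                   [200.0, 0.4],
                   [300.0, 0.6]])
f_nuclei = pool_nuclei_features(nuclei)  # length 2 attributes x 3 statistics = 6
```

Whatever the number of nuclei in a patch, the descriptor length depends only on the number of attributes, so it can be concatenated with the CNN embedding.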

In order to combine cellular-level morphological data with deep texture-level representations, the nuclear descriptor and the CNN embedding are merged to create a fused feature vector as shown in Equation (8):

f_{fusion} = [f_{CNN}; f_{nuclei}]

(8)

where $[\cdot; \cdot]$ denotes channel-wise concatenation. This fusion explicitly combines complementary information: $f_{CNN}$ captures global tissue architecture, texture patterns, and contextual information, and $f_{nuclei}$ encodes fine-grained cellular morphology corresponding to ISUP grading criteria.

The fused features are then fed to a multilayer perceptron (MLP) that serves two purposes. First, it maps the concatenated features to a common latent space, allowing the CNN-based and nuclei-based features to interact. Second, it applies a nonlinear transformation that increases the discriminative power of the features and suppresses uninformative or noisy components. Formally, the MLP operation is given by Equation (9).

f_{proj} = \Phi (W f_{fusion} + b)

(9)

where $W$ and $b$ are learnable parameters and $\Phi (\cdot)$ denotes a nonlinear activation function, for example ReLU.

The MLP used for fusion consists of two fully connected layers (1024 → 512 → 256) with ReLU activation and dropout (0.3). This transformation ensures effective interaction between CNN and nuclei features while reducing dimensionality. The fusion step yields a biologically grounded patch embedding that represents:

• Texture-level cues, such as tissue architecture and cell density, that CNNs can learn, and

• Cell-level morphological features, such as nuclear size, pleomorphism, and nucleolar prominence, that are captured by the nuclei descriptors.
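
A minimal inference-time sketch of Equations (8) and (9) is given below. The text lists a 1024-wide input but does not say how the concatenated length (CNN embedding plus nuclei descriptor) is reconciled with it, so the sketch simply sizes the first layer to the fused vector; this, the 24-dimensional nuclei descriptor, the random weights, and the function names are all assumptions. Dropout is omitted because it is inactive at inference.

```python
import numpy as np

def relu(x: np.ndarray) -> np.ndarray:
    return np.maximum(x, 0.0)

def fuse_and_project(f_cnn, f_nuclei, params):
    """Concatenate both descriptors (Eq. 8) and apply a two-layer ReLU MLP (Eq. 9, stacked twice)."""
    f_fusion = np.concatenate([f_cnn, f_nuclei])
    hidden = relu(params["W1"] @ f_fusion + params["b1"])   # -> 512
    return relu(params["W2"] @ hidden + params["b2"])       # -> 256

rng = np.random.default_rng(0)
f_cnn = rng.standard_normal(1024)   # CNN embedding of one patch
f_nuclei = rng.standard_normal(24)  # hypothetical: 8 descriptors x 3 pooled statistics
in_dim = f_cnn.size + f_nuclei.size

# Randomly initialised weights stand in for trained parameters.
params = {
    "W1": rng.standard_normal((512, in_dim)) * 0.01, "b1": np.zeros(512),
    "W2": rng.standard_normal((256, 512)) * 0.01,    "b2": np.zeros(256),
}
f_proj = fuse_and_project(f_cnn, f_nuclei, params)
```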

All extracted nuclei features are normalized using z-score normalization across the training dataset to ensure consistent scaling, as $x' = \frac{x - \mu}{\sigma}$. This normalization prevents large-scale features such as area from dominating intensity-based descriptors.
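
The normalization can be sketched as below. The statistics are estimated on training features only and then reused for validation and test data; the small floor on the standard deviation is an added safeguard against constant features and is not mentioned in the text. Function names are hypothetical.

```python
import numpy as np

def fit_zscore(train_features: np.ndarray, eps: float = 1e-8):
    """Estimate the per-feature mean and standard deviation on the training set only."""
    mu = train_features.mean(axis=0)
    sigma = np.maximum(train_features.std(axis=0), eps)  # eps guards against division by zero
    return mu, sigma

def apply_zscore(features: np.ndarray, mu: np.ndarray, sigma: np.ndarray) -> np.ndarray:
    """Apply x' = (x - mu) / sigma with previously fitted statistics."""
    return (features - mu) / sigma

# Two features on very different scales (e.g. area in pixels, an intensity ratio).
train = np.array([[100.0, 0.1],
                  [300.0, 0.3]])
mu, sigma = fit_zscore(train)
train_norm = apply_zscore(train, mu, sigma)
```

After normalization both columns span the same range, so area no longer outweighs the intensity-based feature.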

By deliberately incorporating nuclear morphology into the feature space, the method aligns the computational representation with the core criteria of pathological grading, which improves both grading precision and interpretability.

3.7. Attention-based multiple instance learning (AB-MIL)

Multiple instance learning was originally introduced for medical image analysis by Xu et al.,¹⁴ and its formal characteristics are summarized in the comprehensive survey by Carbonneau et al.¹⁵ Several attention-based MIL and aggregation strategies have been proposed for whole-slide image analysis.^48–50 Instead of using majority voting for slide-level decision making, the proposed framework employs an Attention-Based Multiple Instance Learning (AB-MIL) strategy for robust and interpretable aggregation of patch-level features. In this setting, each whole-slide image (WSI) is modeled as a bag of patch-level fused representations obtained after CNN and nuclei feature fusion, as represented in Equation (10).

X = \{x_{1}, x_{2}, \ldots, x_{n}\},

(10)

where $x_{i} \in R^{d}$ denotes the fused feature vector of the i^th tumor patch and $n$ is the total number of patches extracted from the WSI.

The core idea of AB-MIL is to learn a weighting function that assigns higher importance to diagnostically informative patches and lower importance to irrelevant or noisy patches. This is achieved through a trainable attention mechanism. For each patch representation $x_{i}$, an attention score is computed as per Equation (11).

α_{i} = \frac{\exp (w^{T} \tanh (V x_{i}))}{\sum_{j = 1}^{n} \exp (w^{T} \tanh (V x_{j}))},

(11)

where:

• $V \in R^{h \times d}$ is a learnable weight matrix that projects the patch embedding into a latent attention space of dimension $h$,

• $w \in R^{h}$ is a learnable attention vector,

• $\tanh (\cdot)$ introduces nonlinearity and improves the expressive capacity of the attention mechanism, and

• the softmax normalization ensures that the attention weights satisfy $\sum_{i = 1}^{n} α_{i} = 1$.

The attention weight $α_{i}$ represents the relative contribution of the i^th patch to the final slide-level prediction. Patches that contain highly discriminative histopathological patterns, such as regions with enlarged nuclei, prominent nucleoli, or high cellular atypia, are assigned larger attention values, while background or less informative patches receive lower weights.

Using these attention scores, the slide-level representation is computed as a weighted sum of patch features as shown in Equation (12).

z = \sum_{i = 1}^{n} α_{i} x_{i},

(12)

where $z \in R^{d}$ is the aggregated feature vector representing the entire WSI. This vector is then passed to a fully connected classifier followed by a softmax layer to predict the final RCC grade.
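
Equations (11) and (12) can be sketched in a few lines. The snippet takes the attention vector w to have length h, so that the product w^T tanh(V x_i) is defined, and uses small toy dimensions and random weights in place of trained parameters; the function name is hypothetical.

```python
import numpy as np

def abmil_pool(X: np.ndarray, V: np.ndarray, w: np.ndarray):
    """Attention pooling over a bag of patches.

    X: (n, d) patch features, V: (h, d) projection, w: (h,) attention vector.
    Returns the attention weights alpha (n,) and the slide vector z (d,).
    """
    scores = np.tanh(X @ V.T) @ w      # one scalar score per patch
    scores = scores - scores.max()     # shift for a numerically stable softmax
    alpha = np.exp(scores) / np.exp(scores).sum()
    z = alpha @ X                      # weighted sum of patch features
    return alpha, z

rng = np.random.default_rng(0)
n, d, h = 6, 16, 8                     # toy sizes; the paper reports d = 1024 and a 256-d latent space
X = rng.standard_normal((n, d))
V = rng.standard_normal((h, d))
w = rng.standard_normal(h)
alpha, z = abmil_pool(X, V, w)
```

The weights in `alpha` are the quantities that can be mapped back onto the slide as an attention heatmap.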

To ensure reproducibility, the attention module is implemented using a two-layer fully connected network. The patch embedding dimension is set to d=1024 (DenseNet-121 output). The attention mechanism projects each patch feature into a latent space of dimension L=256.

Specifically, the attention computation follows:

• First layer: Fully connected layer with weight matrix $W \in R^{256 \times 1024}$

• Activation: $\tanh$

• Second layer: Attention vector $v \in R^{256}$

The attention score is computed as shown in Equation (13), which is Equation (11) written with $W$ in place of $V$, $v$ in place of $w$, and $h_{i}$ denoting the embedding of the i^th patch.

a_{i} = \frac{\exp (v^{T} \tanh (W h_{i}))}{\sum_{j} \exp (v^{T} \tanh (W h_{j}))}

(13)

Dropout with rate 0.25 is applied after the first projection layer to prevent overfitting. The aggregated slide representation is then passed through a fully connected classifier with 512 hidden units and ReLU activation, followed by a softmax output layer.

AB-MIL has several key advantages over majority voting:

1. Adaptive Patch Importance: Majority voting treats all patches as equally important, whereas AB-MIL learns to assign higher weights to tumor regions that are more significant for diagnosis.

2. Performance Increase: By focusing on the most relevant patches, AB-MIL consistently achieves higher grading accuracy than majority voting, generally yielding a 3–5% increase in macro-F1 and QWK scores.

3. Interpretability: The learned attention weights can be rendered as attention heatmaps on WSIs, showing which regions influenced the grading decision and thereby supporting clinical trust.

4. Robustness to Noise: AB-MIL reduces the influence of wrong, ambiguous, or low-quality patches, which are common in large-scale histopathology datasets.

5. Clinical Relevance: The attention mechanism is analogous to the way pathologists focus on the most suspicious and representative regions of the slide when deciding the tumor grade.

By replacing majority voting with AB-MIL, the proposed framework adopts an aggregation technique that is the de facto standard for WSI-level classification in computational pathology. This change improves predictive performance and interpretability while maintaining computational efficiency and clinical deployability.

3.8. Dual-stream multiple instance learning (DSMIL)

In addition to AB-MIL, we investigated DSMIL,⁸ a more powerful aggregation method that may further improve slide-level RCC grading performance. DSMIL is designed for settings in which the diagnostically relevant features are sparse, localized, or heterogeneous within a slide, a situation that is typical of RCC histopathology.

DSMIL has two learning streams that complement each other:

1. Instance-Level Stream: The instance-level stream identifies the most discriminative patches in a WSI. Each patch representation is passed to an instance classifier that predicts the probability of the patch belonging to each RCC grade. This stream explicitly models local evidence and highlights patches that strongly indicate a specific grade, such as regions exhibiting large pleomorphic nuclei or prominent nucleoli. Formally, the instance-level predictions are given by Equation (14).

p_{i} = \mathrm{Softmax} (W_{inst} x_{i} + b_{inst})

(14)

where $p_{i}$ is the class probability vector for the i^th patch, and $W_{inst}$ and $b_{inst}$ are trainable parameters.

This stream encourages the network to learn highly discriminative patch-level representations and serves as a detector for grade-defining regions.

2. Bag-Level (Slide-Level) Stream: The bag-level stream aggregates information from all patches to model the global context of the WSI. Similar to AB-MIL, it computes attention weights that indicate the importance of each patch in forming the final slide-level representation. The weighted combination of patch features is then used to produce the slide-level prediction. This stream captures overall tumor architecture, distribution of grade-related patterns, and global consistency of the grading decision.

DSMIL’s main feature is that the two streams are trained simultaneously and interact with each other. The instance-level stream steers the model to concentrate on the most informative patches, whereas the bag-level stream makes sure that the global slide context remains intact. This dual supervision allows the network to integrate local discriminative evidence with a comprehensive slide-level understanding.

The final slide-level prediction is obtained by fusing the outputs of the two streams, so DSMIL can exploit both strong local signals from highly atypical regions and stable global representations of the entire WSI.

DSMIL has several benefits for RCC grading:

• Increased Sensitivity to Sparse Lesions: DSMIL can locate very small areas with high-grade features that might be lost in standard attention-based aggregation.

• Increased Robustness: By using local and global information at the same time, DSMIL is less sensitive to noise, patch labeling errors, and tumor heterogeneity.

• AB-MIL Complementary: AB-MIL adaptively weights patches, whereas DSMIL adds explicit patch-level discrimination, which makes it effective in the presence of high intra-slide variability.

• More Clinically Relevant: The model mirrors the diagnostic procedure of pathologists, who usually identify a few critical high-grade regions while also considering the overall tumor pattern.

In our experiments, DSMIL is evaluated alongside AB-MIL as an alternative aggregation method. Comparing the two allows us to measure the benefit of dual-stream supervision and to assess the stability of the proposed framework under varying tumor heterogeneity.

For DSMIL, the instance-level classifier is implemented as a linear layer mapping $d = 1024$ features to class logits. The bag-level stream uses an attention pooling mechanism similar to AB-MIL but guided by the top-k scoring instances (k = 5). The final prediction is computed as a weighted combination of instance-level and bag-level outputs, as shown in Equation (15).

y = α y_{instance} + (1 - α) y_{bag}

(15)

where α = 0.5 is empirically chosen.
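
The combination in Equation (15) can be illustrated as follows. Here α is the scalar mixing weight, not an attention weight. The text does not specify how the slide-level instance output is derived from the per-patch predictions, so the sketch takes both stream outputs as given probability vectors; the four-class vectors are made-up values and the function name is hypothetical.

```python
import numpy as np

def fuse_streams(y_instance, y_bag, alpha: float = 0.5) -> np.ndarray:
    """Convex combination of the instance-level and bag-level class probabilities."""
    y_instance = np.asarray(y_instance, dtype=float)
    y_bag = np.asarray(y_bag, dtype=float)
    if y_instance.shape != y_bag.shape:
        raise ValueError("both streams must predict over the same classes")
    return alpha * y_instance + (1.0 - alpha) * y_bag

# Made-up outputs of the two streams for one slide (four grade classes).
y_instance = [0.1, 0.2, 0.6, 0.1]
y_bag = [0.1, 0.4, 0.4, 0.1]
y = fuse_streams(y_instance, y_bag, alpha=0.5)
predicted_class = int(np.argmax(y))  # index of the most probable class
```

Because both inputs are probability vectors and the weights sum to one, the fused vector is again a valid probability distribution.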

3.9. Baseline and comparative models

The proposed method was compared with several baseline and state-of-the-art deep learning models commonly used in computational pathology as summarized in Table 2.^51–56

Table 2.

Summary of baseline and comparative models used for evaluating the proposed RCC grading framework.

Model	Backbone	Aggregation	Purpose
DenseNet + Majority Voting	DenseNet-121	Voting	Original baseline
DenseNet + AB-MIL	DenseNet-121	Attention MIL	Effect of attention aggregation
EfficientNet-B4 + AB-MIL	EfficientNet-B4	Attention MIL	Modern CNN comparison
ConvNeXt-Tiny + AB-MIL	ConvNeXt-Tiny	Attention MIL	Transformer-inspired CNN
TransMIL/HipoMIL	CNN + Transformer	Transformer MIL	Global context modeling
RAF2Net (2024)	ResNet + Attention	Feature fusion	Recent SOTA
NuAP-RCC (2024)	Nuclei-focused	MIL/Graph	Nuclei-centric SOTA

3.10. Training strategy

All models were trained under a unified experimental protocol, with the hyperparameters and optimization details reported in Table 3.

Table 3.

Training parameters and optimization settings used for all models.

Parameter	Value
Optimizer	Adam
Learning rate	$1 \times 10^{- 4}$
Batch size	32
Epochs	100
Early stopping	15 epochs
LR scheduler	ReduceLROnPlateau
Dropout	0.3
Segmentation loss	Dice + BCE
Classification loss	Categorical Cross-Entropy

The training is performed in two stages:

Stage 1: U-Net segmentation is trained independently using annotated tumor masks.

Stage 2: The classification pipeline (CNN + nuclei + MIL) is trained using tumor-filtered patches while freezing early CNN layers for the first 20 epochs, followed by full fine-tuning.

3.11. Evaluation metrics

The performance of each model is measured with a panel of evaluation metrics that covers both classification accuracy and clinical relevance. Because RCC grading is an ordinal classification task, Quadratic Weighted Kappa (QWK) is chosen as the main evaluation metric. QWK quantifies the agreement between predicted and actual grades while penalizing larger grade discrepancies more heavily, and the statistical properties of weighted kappa are well established,^57,58 which makes it well suited to ordinal grading tasks. Segmentation performance was evaluated using standard metrics commonly used in medical image analysis.⁶¹ In addition to QWK, the following standard classification metrics are reported:

• Accuracy: measures the overall proportion of correctly classified slides.

• Precision: evaluates the reliability of grade predictions.

• Recall: quantifies the ability to correctly identify slides belonging to each grade.

• Macro-F1 score: represents the average F1-score across all grades, treating each class equally and mitigating the effect of class imbalance.

To evaluate the performance of the tumor segmentation module, the Dice coefficient is used, which measures the spatial overlap between the predicted tumor masks and the ground-truth annotations. The Dice coefficient,⁵⁹ also known as the Sørensen–Dice index,⁶⁰ is widely used for evaluating medical image segmentation quality. All results are reported: with and without attention-based MIL to quantify the benefit of replacing majority voting, with and without nuclei-level feature integration to evaluate the contribution of cellular morphology modeling, and across all baseline and state-of-the-art models to ensure a fair and comprehensive comparison.

This evaluation strategy jointly covers predictive performance, clinical reliability, and the contribution of each component of the proposed framework.
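
To make the main metric concrete, the sketch below computes QWK from its standard definition (observed versus chance-expected disagreement, weighted by squared grade distance). Grades are assumed to be encoded as integers 0 to K-1, and the toy label lists are illustrative only. scikit-learn's `cohen_kappa_score` with `weights="quadratic"` offers an equivalent off-the-shelf computation.

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, n_classes: int) -> float:
    """QWK = 1 - sum(W * O) / sum(W * E), with W_ij = (i - j)^2 / (K - 1)^2."""
    observed = np.zeros((n_classes, n_classes), dtype=float)
    for t, p in zip(y_true, y_pred):
        observed[t, p] += 1
    idx = np.arange(n_classes)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n_classes - 1) ** 2
    # Expected counts under independence of the two marginal distributions.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

truth = [0, 1, 2, 3]
qwk_perfect = quadratic_weighted_kappa(truth, [0, 1, 2, 3], 4)
qwk_off_by_one = quadratic_weighted_kappa(truth, [1, 1, 2, 3], 4)    # one adjacent-grade error
qwk_off_by_three = quadratic_weighted_kappa(truth, [3, 1, 2, 3], 4)  # one extreme error
```

A single adjacent-grade error lowers the score far less than a single extreme error, which is the property that motivates QWK for ordinal grading.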

3.12. Statistical analysis

In order to determine whether the differences in performance between models were significant, McNemar’s test was used on paired slide-level predictions between the proposed method and the strongest baseline model. This test is employed to check if two classifiers have a significant difference when applied to the same samples.

Moreover, paired t-tests were performed across repeated runs to compare the macro-F1 scores of different aggregation strategies. A significance level of p < 0.05 was used. These tests are used to verify that the observed improvements are unlikely to be due to random variation.
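
As an illustration of how McNemar's test operates on paired slide-level predictions, the sketch below implements the exact (binomial) two-sided variant using only the slides on which the two models disagree. The text does not state whether the exact or the chi-square version was used, so this choice is an assumption, and the correctness flags are made-up toy data.

```python
from math import comb

def mcnemar_exact(correct_a, correct_b) -> float:
    """Two-sided exact McNemar p-value from paired per-slide correctness flags."""
    b = sum(1 for a, o in zip(correct_a, correct_b) if a and not o)  # only model A correct
    c = sum(1 for a, o in zip(correct_a, correct_b) if o and not a)  # only model B correct
    n = b + c
    if n == 0:
        return 1.0  # the models never disagree
    k = min(b, c)
    # Under H0 the discordant pairs split 50/50: double the binomial tail, capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

# Toy example: model A is correct on 8 slides that model B misclassifies; both agree on 2.
model_a = [True] * 10
model_b = [False] * 8 + [True] * 2
p_value = mcnemar_exact(model_a, model_b)
significant = p_value < 0.05
```

Slides on which both models are right, or both wrong, do not influence the result; only the discordant pairs carry information about which model is better.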

3.13. Summary

In summary, the proposed method combines tumor-aware preprocessing, deep feature extraction, nuclei-level morphology modeling, and attention-based aggregation in a single RCC grading system. By using U-Net for tumor segmentation, CNN backbones for patch representation, nuclei instance segmentation for cell-level interpretability, and AB-MIL/DSMIL for slide-level aggregation, the framework is both pathologically grounded and computationally efficient. The architecture is designed to deliver accurate grading, interpretability, and generalization across datasets, which together make it a clinically useful solution for automated RCC grading.

4. Experimental setup

4.1. Dataset

4.1.1. Data source

We utilized three publicly available RCC histopathology datasets:

• TCGA⁶²: WSIs from KIRC, KIRP, and KICH cohorts with diagnostic metadata.

• RCdpia: A curated RCC dataset with expert-annotated tumor and non-tumor regions.

• MMIST-ccRCC: A multimodal dataset of 618 ccRCC patients, from which only the histopathology component with ISUP/Fuhrman grades (0–4) was used.

These datasets were selected to ensure reproducibility, cross-institutional variability, and clinical relevance.

4.1.2. Data availability statement

All datasets employed in the present work can be freely accessed from public sources.

• The Cancer Genome Atlas (TCGA) is accessible via the GDC Data Portal:

o https://portal.gdc.cancer.gov/projects/TCGA-KIRP

o https://portal.gdc.cancer.gov/projects/TCGA-KICH

o https://portal.gdc.cancer.gov/projects/TCGA-KIRC

• The RCdpia dataset is available from:

o https://arxiv.org/abs/2403.11211

• The MMIST-ccRCC dataset is available from:

o https://arxiv.org/abs/2405.01658

The code supporting preprocessing and model training can be obtained from the corresponding author upon reasonable request.

4.2. Data preprocessing

Preprocessing was designed to (i) remove irrelevant background, (ii) standardize color variation across institutions, and (iii) increase model robustness to morphological and staining variability. All preprocessing steps were applied consistently across datasets to avoid introducing dataset-specific bias.

4.2.1. Tissue detection and background removal

Whole-slide images (WSIs) contain large non-informative regions (glass background, pen marks, artifacts) that can negatively affect patch-based learning. Therefore, a tissue detection step was performed prior to patch extraction. Each WSI was converted from RGB to HSV color space, where tissue regions are more separable from the background. Otsu’s thresholding was applied to the saturation channel to generate a binary tissue mask. Morphological closing was then used to fill small gaps within tissue regions. Patches with less than 20% tissue coverage were discarded. This step ensured that the model was not exposed to background-only regions and improved training stability. The detailed configuration of the tissue detection pipeline is provided in Table 4.

Table 4.

Tissue detection parameters.

Parameter	Value
Color space	HSV
Thresholding	Otsu (S channel)
Morphological refinement	Closing
Minimum tissue coverage	20%
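
The thresholding and coverage check can be sketched with NumPy alone, as below. The sketch computes the HSV saturation channel directly, derives an Otsu threshold from its histogram, and measures the tissue fraction of an image region. Morphological closing is omitted, the synthetic image is purely illustrative, and all function names are hypothetical; in practice a library such as OpenCV would normally be used for these steps.

```python
import numpy as np

MIN_TISSUE = 0.20  # minimum tissue coverage per patch (Table 4)

def saturation(rgb: np.ndarray) -> np.ndarray:
    """HSV saturation channel of an (H, W, 3) uint8 RGB image, scaled to 0..255."""
    rgb = rgb.astype(np.float64)
    mx, mn = rgb.max(axis=2), rgb.min(axis=2)
    sat = np.where(mx > 0, (mx - mn) / np.maximum(mx, 1e-12), 0.0)
    return (sat * 255).astype(np.uint8)

def otsu_threshold(gray: np.ndarray) -> int:
    """Grey level that maximises the between-class variance of the histogram."""
    prob = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob /= prob.sum()
    omega = np.cumsum(prob)                  # probability of the "below threshold" class
    mu = np.cumsum(prob * np.arange(256))    # cumulative first moment
    denom = omega * (1.0 - omega)
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = np.where(denom > 0, (mu[-1] * omega - mu) ** 2 / denom, 0.0)
    return int(np.argmax(sigma_b))

def tissue_fraction(rgb: np.ndarray, threshold: int) -> float:
    """Fraction of pixels whose saturation exceeds the threshold."""
    return float((saturation(rgb) > threshold).mean())

# Synthetic image: left half white glass, right half pink "tissue".
img = np.full((100, 100, 3), 255, dtype=np.uint8)
img[:, 50:] = (200, 100, 150)
thr = otsu_threshold(saturation(img))
frac = tissue_fraction(img, thr)
keep_patch = frac >= MIN_TISSUE
```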

4.2.2. Patch extraction and tumor filtering

WSIs were divided into fixed-size tiles cropped from the detected tissue regions. Patches were extracted at 20× magnification, which resolves nuclei clearly enough for grading while keeping the computational load moderate. A patch size of 224 × 224 pixels was chosen because it matches the standard CNN input size. Non-overlapping tiling (0% overlap) was used to avoid near-duplicate tiles and to keep the processing effort low.

In order to make sure that the grading of cancer was based on tumor morphology, we filtered patches with the help of the tumor mask from the U-Net model. Only patches with a tumor area of at least 40% were kept for grading. Patches that did not meet the criteria were not used for the grade classification. Grade-0 (non-tumor) patches were exclusively used for training the segmentation network and were not considered in slide-level grade aggregation. Such tumor-aware filtering helps reduce label noise and makes learning focus on diagnostically relevant tissue. The exact details of patch cropping and tumor filtering are given in Table 5.

Table 5.

Patch extraction configuration.

Parameter	Value
Patch size	224 × 224
Magnification	20×
Overlap	0%
Tumor area threshold	≥40%
Grade-0 usage	Segmentation only
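
The configuration in Table 5 can be sketched as a simple grid walk over a binary tumor mask. The sketch assumes the mask has the same resolution as the 20× image and that partial tiles at the right and bottom edges are dropped; neither detail is specified in the text, and the function name is hypothetical.

```python
import numpy as np

def tumor_patch_coords(tumor_mask: np.ndarray, patch: int = 224, min_tumor: float = 0.40):
    """Top-left (row, col) of every non-overlapping tile with at least `min_tumor` tumor area."""
    height, width = tumor_mask.shape
    coords = []
    for y in range(0, height - patch + 1, patch):      # stride == patch size: 0% overlap
        for x in range(0, width - patch + 1, patch):
            if tumor_mask[y:y + patch, x:x + patch].mean() >= min_tumor:
                coords.append((y, x))
    return coords

# Toy 2 x 2 grid of tiles: one fully tumor, one about 45% tumor, two tumor-free.
mask = np.zeros((448, 448), dtype=np.uint8)
mask[0:224, 0:224] = 1      # 100% tumor
mask[224:448, 0:100] = 1    # 100 / 224 = 44.6% tumor
kept = tumor_patch_coords(mask)
```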

4.2.3. Stain normalization

Histopathology images are very diverse in color due to different staining protocols, scanners, and laboratory environments. To reduce this domain shift, stain normalization was performed on all patches. The Macenko method was implemented to unify the hematoxylin and eosin (H&E) stain vectors across images. A reference slide with a well-balanced stain was chosen and all patches were normalized to this reference. Stain normalization diminishes inter-institutional color variation and enhances cross-dataset generalization as demonstrated by the ablation study in Section 5. The stain normalization set-up is illustrated in Table 6.

Table 6.

Stain normalization settings.

Parameter	Method
Algorithm	Macenko
Stain separation	H&E optical density
Reference slide	Fixed template
Applied to	All datasets

4.2.4. Data augmentation

On-the-fly data augmentation was applied during training only, to make the model more robust and less prone to overfitting. Augmentations were selected so that the grade-defining appearance of the tissue is preserved while reflecting the histological variability seen in practice. In particular, geometric augmentations mimic the arbitrary orientation of tissue on the slide, and intensity augmentations mimic staining differences that are unrelated to the tissue itself. The training data augmentation strategy is summarized in Table 7.

Table 7.

Data augmentation strategy.

Augmentation	Range
Rotation	0–360°
Horizontal flip	Yes
Vertical flip	Yes
Scaling	0.9–1.1
Brightness/contrast	±10%

These modifications increase the variety of the training samples without changing diagnostic features like nuclear morphology or tissue architecture. We have not used elastic or heavy distortions to prevent the creation of unrealistic histology patterns.

4.3. Data splitting protocol

To ensure unbiased assessment and avoid data leakage, one of the most common problems in patch-based whole-slide image (WSI) studies, a strict data splitting protocol was followed. All splitting was performed at the patient (slide) level, so that no patches from the same WSI appear in more than one subset. The model is therefore tested on slides it has never seen, rather than on slide-specific patterns it could have memorized.

4.3.1. Within-dataset evaluation

The slides of each dataset (TCGA, RCdpia, and MMIST) were divided into training, validation, and test subsets using sampling stratified by grade label. Stratification preserved the proportion of ISUP grades in all subsets to ensure balanced evaluation. The within-dataset partitioning plan is shown in Table 8.

Table 8.

Within-dataset splitting protocol.

Subset	Ratio	Purpose
Training	70%	Model learning
Validation	15%	Hyperparameter tuning & early stopping
Test	15%	Final unbiased evaluation

We reserved the validation set for model selection and early stopping only, and the test set was not seen at all during training. Keeping them separate avoids any optimistically biased performance reports.
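
A slide-level stratified split along the lines of Table 8 can be sketched as follows. Slides, not patches, are assigned to subsets, so every patch inherits the subset of its slide. The sketch treats each slide as an independent unit; if a patient contributes several slides, they would need to be grouped first. Slide identifiers, the rounding rule, and the function name are illustrative assumptions.

```python
import random
from collections import defaultdict

def stratified_slide_split(slide_grades, ratios=(0.70, 0.15, 0.15), seed=42):
    """Assign whole slides to train/val/test, grade by grade, to preserve grade proportions."""
    by_grade = defaultdict(list)
    for slide_id, grade in sorted(slide_grades.items()):
        by_grade[grade].append(slide_id)
    rng = random.Random(seed)
    split = {"train": [], "val": [], "test": []}
    for grade in sorted(by_grade):
        ids = by_grade[grade]
        rng.shuffle(ids)
        n_train = round(ratios[0] * len(ids))
        n_val = round(ratios[1] * len(ids))
        split["train"] += ids[:n_train]
        split["val"] += ids[n_train:n_train + n_val]
        split["test"] += ids[n_train + n_val:]   # remainder goes to the test subset
    return split

# Hypothetical cohort: 20 slides for each of four grades.
slides = {f"g{g}_s{i:02d}": g for g in range(1, 5) for i in range(20)}
split = stratified_slide_split(slides)
```

Patches are then extracted per subset, which guarantees that no slide contributes patches to more than one subset.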

4.3.2. Cross-dataset evaluation

To test how well the model handles domain shift and inter-institutional differences, cross-dataset experiments were performed. In these experiments, the model was trained on slides from two datasets and tested on a completely separate dataset acquired under different conditions. Table 9 shows the protocol of the cross-dataset evaluation.

Table 9.

Cross-dataset evaluation protocol.

Training dataset(s)	Testing dataset	Purpose
TCGA + RCdpia	MMIST	External validation
TCGA + MMIST	RCdpia	Cross-site generalization

Cross-dataset evaluation offers a realistic estimate of the performance to be expected in a clinical setting, where the model is applied to slides from laboratories and scanners it has not seen before. The experiment also tests how well stain normalization and attention-based aggregation reduce the impact of domain shift.

4.4. Implementation details

All models were built using the PyTorch deep learning framework (version ≥1.13) and trained on a Linux-based workstation. Preprocessing, augmentation, and evaluation steps were carried out with standard scientific Python libraries, such as NumPy, OpenCV, and scikit-learn. CUDA and cuDNN acceleration were enabled for fast GPU computation.

4.4.1. Hardware configuration

All experiments were run on a single workstation with a dedicated GPU. The hardware configuration used in this study is listed in Table 10. This setup is typical of a research or clinical computing environment rather than a large-scale high-performance cluster, which supports the practical deployability of the proposed framework.

Table 10.

Hardware configuration.

Component	Specification
GPU	NVIDIA RTX 3090 (24 GB VRAM)
CPU	Intel Xeon processor
System memory	64 GB RAM
Operating system	Linux (Ubuntu-based)

4.4.2. Training configuration

The Adam optimizer with default momentum parameters was used to train all models. Every model was trained under the same protocol to keep the comparison fair. The main training configuration and the detailed optimization hyperparameters are listed in Tables 11 and 12, respectively. To avoid overfitting, early stopping based on validation loss was used, and an adaptive learning rate scheduler lowered the learning rate once performance stopped improving. Dropout in the classification layers was also used to regularize the model and improve generalization. Because the training configuration is held constant across experiments, performance differences reflect the architectures rather than hyperparameter tuning.

Table 11.

Training configuration.

Parameter	Value
Optimizer	Adam
Initial learning rate	1 × 10^-4
Batch size	32
Maximum epochs	100
Early stopping patience	15 epochs
LR scheduler	ReduceLROnPlateau
Dropout rate	0.3

Table 12.

Detailed training hyperparameters.

Component	Parameter	Value
U-Net	Learning rate	1e-4
U-Net	Loss weight (λ Dice)	0.7
U-Net	Loss weight (λ BCE)	0.3
CNN Backbone	Learning rate	1e-4
MIL (AB-MIL)	Learning rate	5e-5
Batch size	-	32
Epochs	-	100
Optimizer	-	Adam
Weight decay	-	1e-5
Random seeds	-	42, 123, 999
Gradient clipping	-	1.0
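
The weighted segmentation loss implied by the U-Net rows of Table 12 can be sketched as below. The soft-Dice formulation with a small smoothing constant and the clipping of probabilities are common choices assumed here, not details given in the text; the function name and toy masks are hypothetical.

```python
import numpy as np

def dice_bce_loss(pred, target, w_dice: float = 0.7, w_bce: float = 0.3, eps: float = 1e-7) -> float:
    """Weighted sum of soft-Dice loss and binary cross-entropy for a probability mask."""
    pred = np.clip(np.asarray(pred, dtype=float), eps, 1.0 - eps)  # avoid log(0)
    target = np.asarray(target, dtype=float)
    intersection = (pred * target).sum()
    dice = (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
    bce = -(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred)).mean()
    return float(w_dice * (1.0 - dice) + w_bce * bce)

# Toy 4 x 4 ground-truth mask with a 2 x 2 tumor region.
target = np.zeros((4, 4))
target[:2, :2] = 1.0
loss_perfect = dice_bce_loss(target, target)           # prediction equals the mask
loss_uniform = dice_bce_loss(np.full((4, 4), 0.5), target)  # uninformative prediction
```

The Dice term rewards overlap with the tumor region as a whole, while the cross-entropy term provides a per-pixel gradient; the 0.7 / 0.3 weights are those listed in Table 12.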

4.4.3. Backbone networks

To cover current architectures and keep the assessment fair, several convolutional backbones representing different design philosophies were tested. All networks were initialized with ImageNet-pretrained weights and fine-tuned on RCC histopathology images to benefit from transfer learning. The backbone architectures evaluated are listed in Table 13.

Table 13.

Backbone architectures evaluated.

Backbone	Key characteristics	Rationale for inclusion
DenseNet-121	Dense connectivity, parameter-efficient	Strong medical imaging baseline
EfficientNet-B4	Compound scaling, high accuracy–efficiency balance	Modern CNN benchmark
ConvNeXt-Tiny	CNN with transformer-inspired design	Competitive with ViTs in vision tasks

4.4.4. Reproducibility measures

Several measures were taken to ensure the reproducibility of the experiments and the stability of the results. Fixed random seeds were set for PyTorch, NumPy, and Python’s random library to reduce the variation caused by stochastic initialization and data shuffling. All experiments were repeated three times with different seeds, and the resulting performance metrics are reported as mean ± standard deviation. Furthermore, the same hyperparameters and training settings were used for all models, so that any difference in performance can be attributed to the architectures and not to tuning advantages. These precautions make the comparison fair and increase the credibility of the results.

4.5. Baseline and comparative models

To evaluate the proposed framework against current methods, it was compared with both classical baselines and recent RCC grading techniques. The chosen models represent the aggregation strategies and architectural paradigms commonly used in whole-slide image analysis. The models tested are listed in Table 14.

Table 14.

Summary of baseline and state-of-the-art models used for comparative evaluation in RCC grading.

Model	Backbone	Aggregation	Purpose
DenseNet + Voting	DenseNet-121	Majority voting	Classical baseline
DenseNet + AB-MIL	DenseNet-121	Attention MIL	Aggregation upgrade
EfficientNet-B4 + AB-MIL	EfficientNet-B4	Attention MIL	Modern CNN baseline
ConvNeXt-Tiny + AB-MIL	ConvNeXt-Tiny	Attention MIL	CNN–Transformer hybrid
TransMIL/HipoMIL	CNN + Transformer	Transformer MIL	Global context modeling
RAF2Net (2024)	ResNet + Attention	Feature fusion	Recent RCC SOTA
NuAP-RCC (2024)	Nuclei-centric CNN	Graph/MIL	Nuclei-aware SOTA

5. Results

Here we provide a quantitative and qualitative assessment of tumor segmentation, slide-level grading performance, aggregation strategies, backbone comparison, ablation studies, cross-dataset generalization, and computational efficiency. Unless indicated otherwise, all the results are presented as mean values of three runs.

5.1. Tumor segmentation performance

The U-Net-based segmentation component was consistent and accurate in localizing tumor regions on all datasets. Dice scores were at least 0.88 on every dataset, suggesting that the identification of tumor regions was sufficiently reliable for subsequent grading. Quantitative segmentation results are compiled in Tables 15 and 16 and visualized in Figure 5. Qualitative inspection showed that most segmentation discrepancies occurred at tumor-stroma boundaries and in areas of necrosis or hemorrhage. Ablation experiments showed that removing tumor segmentation reduced macro-F1 from 0.94 to 0.89 (approximately 5 points; Table 16), confirming that tumor-aware patch selection improves grading reliability.

Table 15.

Tumor segmentation performance.

Dataset	Dice	IoU
TCGA	0.90	0.83
RCdpia	0.89	0.82
MMIST	0.88	0.80
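For reference, the two overlap metrics reported in Table 15 can be computed from a predicted and a ground-truth binary mask as sketched below. The function name `dice_and_iou` and the convention of returning perfect agreement for two empty masks are assumptions made for illustration.

```python
def dice_and_iou(pred, truth):
    """Dice and IoU for two binary masks given as flat sequences of 0/1."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)      # tumor in both
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)  # predicted only
    fn = sum(1 for p, t in zip(pred, truth) if t and not p)  # ground truth only
    if tp + fp + fn == 0:
        return 1.0, 1.0  # both masks empty: treated as perfect agreement (assumption)
    dice = 2 * tp / (2 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    return dice, iou
```

For a single mask pair the two metrics are linked by Dice = 2·IoU / (1 + IoU); averages over many slides, as in Table 15, need not satisfy this relation exactly.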

Table 16.

Impact of tumor segmentation on grading performance.

Setting	Macro-F1	QWK	Accuracy
Without segmentation	0.89	0.87	0.90
With segmentation	0.94	0.92	0.95

Figure 5.

Tumor segmentation performance across datasets measured using Dice score. Consistently high Dice values indicate robust tumor localization despite inter-institutional variability.

This experiment explicitly quantifies the contribution of tumor-guided patch selection. The results show a clear improvement of approximately 5% in Macro-F1 when segmentation is included, demonstrating that restricting analysis to tumor regions significantly enhances grading reliability.

5.2. Overall slide-level grading performance

The full system, combining tumor segmentation, CNN feature extraction, nuclei-level morphometric descriptors, and attention-based multiple instance learning (AB-MIL), was highly effective for slide-level grading. The model achieved a macro-F1 of 0.94, a quadratic weighted kappa (QWK) of 0.92, and an overall accuracy of 0.95, demonstrating high class-balanced accuracy together with strong ordinal agreement. Table 17 compares the models, and Figure 6 depicts how the performance metrics change across models.

Table 17.

Overall slide-level grading performance.

Method	Macro-F1	QWK	Accuracy
DenseNet + Voting	0.88	0.86	0.89
DenseNet + AB-MIL	0.91	0.90	0.92
EfficientNet-B4 + AB-MIL	0.93	0.92	0.94
ConvNeXt-Tiny + AB-MIL	0.93	0.92	0.94
Proposed (Full)	0.94	0.92	0.95

Figure 6.

Multi-metric comparison of RCC grading models. Macro-F1, quadratic weighted kappa (QWK), and accuracy improve consistently with attention-based aggregation and modern CNN backbones, with the proposed framework achieving the highest overall performance.

Improvements in Table 17 are statistically significant (p < 0.05). Moving from majority voting to attention-based aggregation is an important factor in the grading performance improvement captured in Table 17. For instance, switching from DenseNet + Voting to DenseNet + AB-MIL alone raised macro-F1 by 3 points (0.88 to 0.91) and QWK by 4 points (0.86 to 0.90), showing that slide-level aggregation plays a crucial role in whole-slide image analysis.

Pairing stronger CNN architectures with AB-MIL brings further gains. EfficientNet-B4 and ConvNeXt-Tiny registered similar increases, suggesting that stronger feature representations improve grading consistency. Nevertheless, the complete pipeline achieved the best results, implying that nuclei-level descriptors and stain normalization capture discriminative features that deep CNNs alone may miss. The performance improvements depicted in Figure 6 are incremental but consistent across all model upgrades. The proposed method attains the best macro-F1 score while its convolutional architecture is less complex than the transformer-based models currently found in the literature.

The multi-metric trends illustrated in Figure 6 show steady progress in macro-F1, QWK, and accuracy as the pipeline is refined. In particular, the improvements in QWK closely track the changes in macro-F1, indicating that the model maintains ordinal relationships between grades rather than simply optimizing categorical accuracy. This is clinically significant because the major source of disagreement in grading is usually between adjacent grades. Hence, these results indicate that segmentation guidance, nuclei-aware features, and attention-based aggregation, properly combined, can produce RCC grading performance comparable to the state of the art without the need for heavy transformer architectures.

Statistical testing showed that the improvement of the proposed framework over DenseNet + Voting was significant (McNemar test, p < 0.05). Furthermore, attention-based aggregation significantly outperformed majority voting in paired comparisons (p < 0.05), corroborating the robustness of the observed gains.
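The McNemar test compares two classifiers on the same slides using only the discordant pairs, that is, slides that one model grades correctly and the other does not. The sketch below implements the exact (binomial) two-sided variant; the text does not state whether the exact or the chi-square form was used, so this choice and the function name `mcnemar_exact` are assumptions.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from discordant counts.

    b: slides graded correctly by model A only
    c: slides graded correctly by model B only
    Under the null hypothesis, each discordant slide favors either model
    with probability 0.5, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant slides: no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Slides on which both models agree (both correct or both wrong) do not enter the statistic, which is why the test suits paired comparisons on a shared test set.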

5.3. Effect of aggregation strategy

Because diagnostically relevant areas are usually sparse and unevenly distributed, slide-level aggregation is essential for whole-slide image (WSI) grading. To assess how much the aggregation strategy affects the result, we compared classical majority voting against attention-based multiple instance learning (AB-MIL) and dual-stream MIL (DSMIL).

Differences between aggregation strategies are statistically significant (p < 0.05). Table 18 presents quantitative results, showing that majority voting yields the lowest performance because it treats all patches equally without considering diagnostic importance. In contrast, AB-MIL improves grading accuracy by giving higher weights to the more informative regions. Specifically, AB-MIL raises macro-F1 by 3 points (0.88 to 0.91) and QWK by 4 points (0.86 to 0.90), indicating better ordinal consistency.

Table 18.

Aggregation strategy comparison.

Aggregation	Macro-F1	QWK
Majority voting	0.88	0.86
AB-MIL	0.91	0.90
DSMIL	0.92	0.91
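The difference between the two simplest aggregation rules in Table 18 can be illustrated with a small sketch. Majority voting counts every patch equally, whereas attention pooling in the style of AB-MIL computes a weight per patch as a softmax over w · tanh(V h) and returns the weighted sum of patch features. The parameters `V` and `w` would be learned in practice; here they are plain lists passed in by the caller, and the function names are hypothetical.

```python
import math
from collections import Counter

def majority_vote(patch_labels):
    """Slide label = most common patch-level prediction (every patch counts equally)."""
    return Counter(patch_labels).most_common(1)[0][0]

def attention_pool(features, V, w):
    """Attention pooling: a_k = softmax_k(w . tanh(V h_k)), z = sum_k a_k h_k.

    features: list of patch feature vectors h_k
    V: projection matrix (list of rows); w: scoring vector (assumed learned elsewhere)
    Returns the pooled slide vector z and the attention weights a.
    """
    scores = []
    for h in features:
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(wi * ui for wi, ui in zip(w, hidden)))
    # Numerically stable softmax over patches
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(features[0])
    pooled = [sum(a * h[j] for a, h in zip(weights, features)) for j in range(dim)]
    return pooled, weights
```

On a slide where one patch out of three is high-grade, `majority_vote` returns the majority label, while the attention weights can concentrate on the minority patch if `V` and `w` score it highly, so its features dominate the slide representation.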

DSMIL can deliver a further gain when high-grade areas are confined to small regions. Because the method combines instance-level and bag-level supervision, it can locate the critical patches without losing the overall context of the slide, which is close to a necessity for RCC grading. Nucleolar prominence is one of the factors that determine tumor grade, so a small area with prominent nucleoli may be sufficient to decide the grade. In this respect, RCC grading resembles the way a pathologist reasons: a few decisive regions outweigh large areas of unremarkable tissue. Approaches that do not reflect this, such as simple voting, dilute the diagnostic signal. In contrast, attention-based methods allow adaptive weighting, which is more consistent with pathological interpretation.

From the clinical perspective, aggregating features with an attention mechanism not only improves interpretability but also provides visually intuitive results. Attention maps highlight the areas that most influence the grading decision, and the pathologist can then verify them. This kind of transparency contributes to trust building and facilitates human-AI cooperation. In both cases, the largest increase in performance came from the change in aggregation strategy rather than from a more complex backbone. This is consistent with the growing consensus that MIL-based aggregation is close to necessary for WSI classification tasks. The improvement from majority voting to AB-MIL and DSMIL is statistically significant (p < 0.05), confirming the effectiveness of attention-based aggregation for WSI classification.

5.4. Backbone comparison

To assess the influence of feature extraction capacity on RCC grading, we compared three representative convolutional backbones: DenseNet-121, EfficientNet-B4, and ConvNeXt-Tiny. All backbones were trained under identical conditions and paired with the same segmentation guidance and attention-based aggregation to ensure fair comparison.

EfficientNet-B4 and ConvNeXt-Tiny, as outlined in Table 19, achieved marginally better macro-F1 and QWK scores than DenseNet-121. These gains may stem from better feature scaling and greater representational capacity. The compound scaling method of EfficientNet and the contemporary convolutional design of ConvNeXt seem to be effective in recognizing subtle histomorphological changes. Backbone-related improvements are smaller than those from segmentation and aggregation, indicating that backbone architecture plays a secondary role relative to pipeline design.

Table 19.

Backbone comparison.

Backbone	Macro-F1	QWK
DenseNet-121	0.91	0.89
EfficientNet-B4	0.93	0.92
ConvNeXt-Tiny	0.93	0.92

Nevertheless, the performance differences between backbones were small compared with the improvements from the aggregation strategy and tumor-oriented patch selection. It is therefore plausible that slide-level reasoning and region selection have a greater influence on grading performance than backbone selection alone. In practical terms, DenseNet-121 remains a good choice owing to its parameter efficiency and presumably lower computational cost. A larger backbone is hard to justify for a marginal performance gain when the additional memory and inference time matter, especially in resource-limited environments. These considerations underscore that the choice of backbone should be based on the deployment scenario rather than solely on the highest accuracy achieved. Efficient CNN backbones paired with strong aggregation methods can strike a good balance between performance and computational requirements.

5.5. Per-class performance

In order to get a better insight into the grading behavior across different RCC categories, we looked at per-class F1 scores along with confusion patterns. Detailed per-class metrics can be found in Table 20, and the confusion matrix for the TCGA test set is represented in Figure 7.

Table 20.

Per-class F1 and QWK.

Method	F1 G1	F1 G2	F1 G3	F1 G4	Macro-F1	QWK
DenseNet + Voting	0.92	0.84	0.85	0.91	0.88	0.86
EfficientNet + AB-MIL	0.95	0.90	0.91	0.95	0.93	0.92
Proposed	0.96	0.91	0.92	0.96	0.94	0.92
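Macro-F1 is the unweighted mean of the per-class F1 scores, so the last-but-one column of Table 20 can be checked directly from the four grade-wise columns. The short sketch below does this; the dictionary simply transcribes the table, and the function name `macro_f1` is illustrative.

```python
def macro_f1(per_class_f1):
    """Unweighted mean of per-class F1 scores (each grade counts equally)."""
    return sum(per_class_f1) / len(per_class_f1)

# Per-class F1 values (G1, G2, G3, G4) transcribed from Table 20
TABLE_20 = {
    "DenseNet + Voting": [0.92, 0.84, 0.85, 0.91],
    "EfficientNet + AB-MIL": [0.95, 0.90, 0.91, 0.95],
    "Proposed": [0.96, 0.91, 0.92, 0.96],
}

for name, f1s in TABLE_20.items():
    print(f"{name}: macro-F1 = {macro_f1(f1s):.4f}")
```

The means come out at 0.8800, 0.9275, and 0.9375, which round to the reported macro-F1 values of 0.88, 0.93, and 0.94. Because every grade has equal weight, the lower F1 on Grades 2 and 3 pulls the macro average down regardless of how many slides each grade contains.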

Figure 7.

Confusion matrix for slide-level RCC grading on the TCGA test set. Most misclassifications occur between Grades 2 and 3, reflecting known clinical ambiguity in intermediate-grade tumors.

Per-class improvements are statistically significant (p < 0.05). High-grade tumors (Grades 3 and 4) were recognized with strong confidence because marked nuclear pleomorphism and nucleolar prominence served as clear morphological indicators. Likewise, Grade 1 cases were accurately classified because their nuclei were relatively uniform.

The dominance of adjacent-grade errors indicates that the model preserves ordinal structure, which is critical for clinical grading tasks as shown in Table 21.

Table 21.

Misclassification distribution.

Error type	Percentage
Adjacent grades (G2–G3)	72%
Non-adjacent errors	8%
Correct predictions	92%

Most classification errors occurred between Grades 2 and 3, as shown in Figure 7. This pattern corresponds to the clinical difficulties observed in RCC grading, since intermediate grades generally share similar morphological features. Moreover, subtle variations in nuclear size and nucleolar visibility may cause even expert pathologists to disagree.

To further quantify this behavior, approximately 70–75% of misclassifications occurred between adjacent grades (primarily Grades 2 and 3), while less than 10% involved distant grade errors (e.g., Grade 1 vs Grade 4). This distribution indicates that the model preserves ordinal relationships between grades and aligns with known clinical variability in RCC grading, where intermediate grades present overlapping morphological characteristics. The observed misclassification patterns, particularly between Grades 2 and 3, are consistent with previously reported inter-observer variability in RCC grading. Prior studies have shown that agreement between pathologists is lower for intermediate grades due to overlapping morphological characteristics. This alignment suggests that the model’s errors reflect intrinsic diagnostic ambiguity rather than systematic bias, supporting its clinical reliability.

The strong QWK obtained in this study is further evidence of ordinal consistency. Because QWK penalizes larger grade differences more heavily, a high QWK value means that most misclassifications fall in neighboring grades rather than distant categories. Taken as a whole, the per-class analysis indicates that the model’s error patterns are determined by intrinsic grading difficulty rather than by model bias. This agreement with clinical reality increases confidence in the model’s practical use. Similar patterns of adjacent-grade confusion have been reported in recent RCC grading studies, including transformer-based and nuclei-centric models, where Grade 2–3 misclassification remains the dominant error source. This suggests that the observed errors are not specific to the proposed framework but reflect intrinsic ambiguity in intermediate-grade RCC morphology.
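The way QWK rewards adjacent-grade errors over distant ones can be made concrete with a short sketch. The function below computes QWK from a confusion matrix using the standard quadratic weights (i − j)² / (k − 1)²; the function name and the convention for a degenerate matrix are assumptions, and any confusion matrices passed to it for illustration are hypothetical rather than taken from the study.

```python
def quadratic_weighted_kappa(conf):
    """QWK from a square confusion matrix conf[true][pred] over ordered grades."""
    k = len(conf)
    n = sum(sum(row) for row in conf)
    row_tot = [sum(row) for row in conf]
    col_tot = [sum(conf[i][j] for i in range(k)) for j in range(k)]
    observed = 0.0  # weighted disagreement actually observed
    expected = 0.0  # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            observed += weight * conf[i][j]
            expected += weight * row_tot[i] * col_tot[j] / n
    if expected == 0:
        return 1.0  # all slides in one grade: treated as full agreement (assumption)
    return 1.0 - observed / expected
```

For example, two four-grade confusion matrices with identical accuracy (36 of 40 slides correct) give a QWK of 0.96 when the four errors are Grade 2 versus Grade 3 confusions, but only 0.64 when they are Grade 1 versus Grade 4 confusions, which is the behavior the paragraph above relies on.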

5.6. Ablation study

To measure the impact of individual components, we conducted a systematic ablation study, removing one major module at a time from the complete pipeline. The results are given in Table 22.

Table 22.

Detailed ablation with performance drop.

Configuration	Macro-F1	ΔF1 (%)	QWK	ΔQWK
Full model	0.94	–	0.92	–
– Segmentation	0.89	↓5.3%	0.87	↓5.4%
– Nuclei features	0.91	↓3.2%	0.89	↓3.3%
– AB-MIL (voting)	0.90	↓4.2%	0.88	↓4.3%
– Stain normalization	0.90	↓4.2%	0.88	↓4.3%
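The Δ columns in Table 22 appear to be relative drops with respect to the full model, that is, 100 · (full − ablated) / full. The sketch below recomputes them under that assumption; the scores are transcribed from the table, and the names `relative_drop`, `FULL`, and `ABLATIONS` are illustrative.

```python
def relative_drop(full, ablated):
    """Percentage drop of an ablated score relative to the full model."""
    return 100.0 * (full - ablated) / full

FULL = {"macro_f1": 0.94, "qwk": 0.92}

# Scores transcribed from Table 22
ABLATIONS = {
    "- Segmentation": {"macro_f1": 0.89, "qwk": 0.87},
    "- Nuclei features": {"macro_f1": 0.91, "qwk": 0.89},
    "- AB-MIL (voting)": {"macro_f1": 0.90, "qwk": 0.88},
    "- Stain normalization": {"macro_f1": 0.90, "qwk": 0.88},
}

for name, scores in ABLATIONS.items():
    drops = {metric: relative_drop(FULL[metric], value) for metric, value in scores.items()}
    print(f"{name}: dF1 = {drops['macro_f1']:.1f}%, dQWK = {drops['qwk']:.1f}%")
```

Under this reading the recomputed values agree with the table to within about 0.1 percentage point, the small differences being attributable to rounding of the two-decimal scores.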

All ablation results are statistically validated (p < 0.05). The largest performance degradation is observed when tumor segmentation is removed (↓5.3% macro-F1), confirming that tumor-guided patch selection plays a key role in excluding irrelevant tissue and directing the model toward diagnostically meaningful areas. Attention-based aggregation and stain normalization also contribute significantly, each providing over 4% improvement; replacing attention-based MIL with majority voting led to a visible drop in performance, underscoring the value of adaptive slide-level aggregation for heterogeneous tumor regions. Nuclei-level features provide smaller, complementary gains (∼3%), suggesting that explicit nuclear descriptors carry information complementary to the CNN features. This is in line with the biological basis of RCC grading, in which nuclear morphology is the main criterion.

Finally, the removal of stain normalization reduced cross-dataset stability and also lowered grading accuracy. This suggests that color standardization contributes to domain robustness, especially when training and testing data come from different institutions. Overall, the ablation results show that the performance improvements arise from the combined effect of segmentation guidance, nuclei-aware features, and attention-based aggregation rather than from any single component. The results are thus consistent with the design philosophy of the framework and support the contribution of each module.

The ablation results provide insight into the complementary roles of different components. While transformer-based MIL methods attempt to model global context through computationally expensive self-attention, the proposed framework achieves similar contextual reasoning through a combination of tumor-guided patch selection and attention-based aggregation. Nuclei descriptors enhance local discriminative power by explicitly encoding cellular morphology, which is central to RCC grading. When combined with AB-MIL, which selectively emphasizes diagnostically relevant patches, the model effectively captures both local cellular features and global contextual information without requiring complex graph-based or transformer architectures. This explains why the proposed approach achieves competitive performance while maintaining lower computational complexity.

5.7. Cross-dataset generalization

To examine how the models handle the more challenging situation of domain shift, we conducted cross-dataset experiments in which the models were trained on slides from one or two datasets and tested on slides from a different dataset. This setting better represents real-world deployment, where the model encounters slides stained with different protocols, scanned with different scanners, and prepared under different laboratory practices. The cross-dataset results are summarized in Table 23.

Table 23.

Cross-dataset grading performance.

Train → test	Macro-F1	QWK
TCGA + RCdpia → MMIST	0.90	0.88
TCGA + MMIST → RCdpia	0.89	0.87
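The cross-dataset protocol amounts to holding out one dataset for testing and training on the remaining ones. A minimal sketch of how such splits can be enumerated is given below; the generator name `cross_dataset_splits` is hypothetical, and it lists all three possible held-out configurations, of which Table 23 reports the two that hold out MMIST and RCdpia.

```python
DATASETS = ("TCGA", "RCdpia", "MMIST")

def cross_dataset_splits(datasets):
    """Yield (train_sets, test_set) pairs with one dataset held out for testing."""
    for held_out in datasets:
        train = tuple(d for d in datasets if d != held_out)
        yield train, held_out

for train, test in cross_dataset_splits(DATASETS):
    print(" + ".join(train), "->", test)
```

Keeping the held-out dataset entirely out of training, including any stain-normalization reference or hyperparameter selection, is what makes the result a fair estimate of performance under domain shift.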

As expected, performance dropped in the cross-dataset evaluation compared with the within-dataset one, emphasizing the strong effect of domain shift in histopathology. Differences in color distribution, staining intensity, and slide preparation between institutions may change the visual patterns and thus affect the learned representations. Still, the drop in performance was only moderate. Macro-F1 scores remained around 0.90 (0.89–0.90), and QWK scores of 0.87–0.88 indicate that ordinal agreement remained strong. This demonstrates that the proposed method remains dependable in terms of grading consistency even when applied to new institutional data.

Stain normalization likely helped to reduce domain variability by matching color distributions between the datasets. Attention-based aggregation may also contribute to the model’s robustness, since it locates the regions that are most informative for diagnosis rather than relying on overall appearance. In addition, multi-source training improved generalization by exposing the model to a wider range of visual features during training. This is consistent with the current view that pathology models capable of generalizing are built from diverse dataset samples.

In brief, the cross-dataset experiments indicate that domain shift remains a problem in computational pathology, but that appropriate pre-processing and aggregation methods can mitigate its impact to some extent. These observations also support evaluating grading systems beyond a single dataset in order to approximate real clinical deployment.

5.8. Interpretability analysis

Figure 8 shows the use of Gradient-weighted Class Activation Mapping (Grad-CAM) to analyze the interpretability of the proposed RCC grading framework. This method reveals which spatial regions influence the model’s grading decisions. For correctly classified cases, the Grad-CAM heatmaps focused on tumor regions showing enlarged nuclei, densely packed cells, and prominent nucleoli, all of which are features used diagnostically. These activation patterns matched ISUP grading criteria, which implies that the model relies on clinically recognized morphological features rather than on background artifacts.

Figure 8.

Grad-CAM visualization and failure-case analysis for RCC grading. (a–b) Correct predictions where the model focuses on diagnostically relevant tumor regions characterized by nuclear morphology and cellular density. (c–d) Failure cases involving adjacent-grade misclassification (Grades 2 and 3), reflecting inherent ambiguity in intermediate RCC grading. The heatmaps confirm that the model consistently attends to biologically meaningful structures, supporting clinical interpretability.
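For readers unfamiliar with Grad-CAM, the core computation behind heatmaps of this kind can be sketched as follows: each feature-map channel is weighted by the spatial average of the gradient of the class score with respect to that channel, the weighted channels are summed, and negative values are clipped. The sketch takes the activations and gradients as nested lists; in a real pipeline they would be captured from the last convolutional layer of the backbone, and the function name `grad_cam` and the peak normalization are illustrative choices.

```python
def grad_cam(activations, gradients):
    """Grad-CAM map from feature maps A[k][y][x] and gradients dScore/dA[k][y][x].

    Returns a height x width map scaled so that its maximum is 1
    (left unscaled if the map is all zeros).
    """
    n_ch = len(activations)
    height = len(activations[0])
    width = len(activations[0][0])
    # Channel weights: global-average-pooled gradients
    alphas = [sum(sum(row) for row in g) / (height * width) for g in gradients]
    # Weighted sum over channels followed by ReLU
    cam = [
        [max(0.0, sum(alphas[k] * activations[k][y][x] for k in range(n_ch)))
         for x in range(width)]
        for y in range(height)
    ]
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam
```

The ReLU step keeps only locations that push the score of the predicted grade upward, which is why the resulting maps highlight regions supporting the decision rather than regions arguing against it.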

Representative failure cases were also examined. Classification errors usually occurred between adjacent grades, most often between Grades 2 and 3, where the morphological differences are minimal and even expert pathologists have difficulty distinguishing them. The Grad-CAM results show that the model concentrates on the nuclear characteristics of borderline regions, so when errors occur they stem from ambiguity in the data rather than from the model attending to the wrong features.

Additionally, some failure cases are associated with staining variability, overlapping nuclei, or limited tumor representation within patches. Despite these challenges, the model consistently attends to biologically relevant regions, reinforcing the reliability and interpretability of the proposed approach. These observations demonstrate that the model’s decision-making process aligns with pathological reasoning, supporting its potential for clinical deployment. These results indicate that model errors are primarily driven by intrinsic grading ambiguity rather than limitations in feature representation.

5.9. Comparison with recent state-of-the-art methods

To position the proposed framework within the current RCC grading literature, we compared our results with recent state-of-the-art (SOTA) methods, including RCCGNet (2023), RAF2Net (2024), NuAP-RCC (2024), EAT-Net (2025), and EFF-Net (2024). Table 24 presents this comparison.

Table 24.

Comparison with recent RCC grading methods.

Method	Year	Backbone/strategy	Macro-F1	QWK	Dataset(s)
RCCGNet	2023	Custom CNN	∼0.92	–	KMC-RENAL
RAF2Net	2024	ResNet + Attention	0.94	0.92	TCGA + custom
NuAP-RCC	2024	Nuclei-centric MIL	0.95	0.93	TCGA + MMIST
EAT-Net	2025	Efficient Attention Transformer	0.92	–	KMC-RENAL
EFF-Net	2024	Enhanced CNN features	∼0.94	–	Multiple
Proposed	2025	CNN + nuclei features + AB-MIL	0.94	0.92	TCGA, RCdpia, MMIST

A growing number of recent studies use transformer-based MIL or nuclei-centric frameworks to achieve high grading accuracy. For example, NuAP-RCC combines a nuclei-level model with graph-based aggregation, while EAT-Net uses transformer-style attention mechanisms to capture global context. The performance of the proposed framework is close to that of these recent methods. Although some SOTA methods show slightly better peak macro-F1 results, they tend to depend on computationally intensive models or complicated multi-stage pipelines.

By contrast, our approach combines segmentation supervision, convolutional feature extraction, nuclei-aware descriptors, and attention-based aggregation in a balanced architecture. This design keeps accuracy at the state-of-the-art level while reducing the computational load and architectural complexity. From a translational viewpoint, this trade-off is significant. Clinical deployment often favors systems that are robust, interpretable, and computationally feasible over ones that achieve slightly higher peak scores under controlled conditions. In addition, the proposed method has been tested on three public datasets with cross-dataset testing, which is not the case for all SOTA studies. Such a comprehensive assessment lends support to the claims of generalizability and practical relevance.

In general, the findings demonstrate that lightweight convolutional architectures can remain competitive with more intricate transformer-based approaches if they are properly tuned. This calls into question the idea that performance improvements in RCC grading necessarily demand heavy architectures.

Direct numerical comparison should be made carefully because of dataset differences; nonetheless, several main points can be made. First, many SOTA models, such as EAT-Net and NuAP-RCC, have been tested on a limited set of cohorts (e.g., KMC-RENAL or TCGA-based data; see Table 24), while the proposed method shows robust results on three datasets through cross-dataset validation. Its performance under domain shift, shown in Table 23, where the model produces macro-F1 scores of ∼0.90, suggests that it generalizes well beyond the limited settings in which many models are tested. Second, transformer-based methods gain global context modeling at the cost of increased computation. These models normally need much more GPU memory and inference time because of their quadratic self-attention operations.

In contrast, the proposed framework reaches a similar macro-F1 (0.94) with significantly lower computational cost, as shown in Tables 25 and 26. Third, nuclei-centric methods like NuAP-RCC depend heavily on accurate nuclei segmentation and graph construction, adding extra steps to the process and more opportunities for error. Our method uses nuclei descriptors in a straightforward way, which minimizes pipeline complexity while maintaining biological interpretability. Lastly, examination of error types shows that the majority of misclassifications occur between adjacent grades, which is in line with clinical variability. It is therefore likely that model errors are attributable to data ambiguity rather than to the model architecture itself. On the whole, this assessment shows that the proposed model strikes a good balance between accuracy, efficiency, and robustness, making it more practical to deploy than more complicated SOTA methods.

Unlike many previous studies that evaluated performance on only one dataset, such as KMC, the proposed approach has been tested on multiple datasets with cross-dataset evaluation. This provides a more realistic evaluation of generalization under domain shift, which is extremely important for clinical use.

5.10. Computational efficiency

Besides predictive strength, computational speed is a major factor in the clinical acceptance of AI systems in digital pathology. Whole-slide images are very large, and grading workflows have to handle thousands of patches per slide. Inference time, memory usage, and architectural complexity are therefore important practical issues.

The proposed framework is based mainly on convolutional architectures and simple attention methods, which typically require less computational power than transformer-based models. On the test equipment described in Section 4.4, the average inference time per WSI was around 1.1 minutes, including segmentation, feature extraction, and slide-level aggregation. According to Table 25, the proposed framework substantially reduces inference time compared with transformer-based or nuclei-focused pipelines.

Table 25.

Approximate inference time comparison.

Method	Architecture type	Inference time (min/WSI)
TransMIL	Transformer-based MIL	3.2
NuAP-RCC	Nuclei-centric + MIL	2.8
RAF2Net	CNN + Attention	2.0
Proposed	CNN + AB-MIL	1.1

These findings highlight a key point for RCC grading models: increasing architectural complexity does not guarantee proportional performance gains. Transformer-based models are good at global context modeling, but the proposed framework reaches the same level of accuracy through design decisions such as restricting feature extraction to tumor regions and using attention to aggregate results. This further favors the proposed methodology in clinical settings, where computational resources and inference time are limited and interpretability is required.

To further benchmark the efficiency of our method, we examined the complexity and computational cost of our model alongside those of representative transformer-based and nuclei-centric methods. The comparison in Table 26 reveals a stark trade-off between architectural complexity and feasibility of deployment. Although both transformer-based and nuclei-centric methods perform well, they consume considerable resources because of global self-attention and multi-stage processing. The framework presented here offers a better combination of performance and resource use, since it runs with far fewer parameters and a lower number of GFLOPs. These results suggest that straightforward convolutional architectures with attention-based slide-level aggregation can achieve a favorable trade-off between performance and resources. In particular, relative to the models in Table 26, the proposed model reduces GFLOPs by roughly 3.6–4.8× (∼25 versus ∼90–120) and lowers GPU memory demand from high to moderate, resulting in faster inference (∼1.1 minutes per slide) and improved suitability for real-world clinical deployment.

Table 26.

Computational efficiency comparison of the proposed framework with representative transformer-based and nuclei-centric RCC grading models.

Model	Parameters (M)	GFLOPs	Inference time	GPU memory
EAT-Net (Transformer)	∼45M	∼120	∼2.5 min/slide	High
NuAP-RCC (Nuclei-GNN)	∼30M	∼90	∼2.0 min/slide	High
Proposed Model	∼8–10M	∼25	∼1.1 min/slide	Moderate
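The reduction factors implied by Table 26 can be computed directly from its approximate entries, as in the sketch below. The values are transcribed from the table; using 9 M parameters for the proposed model is an assumption (the midpoint of the reported ∼8–10 M range), and the names `BASELINES`, `PROPOSED`, and `reduction_factors` are illustrative.

```python
# Approximate values transcribed from Table 26
BASELINES = {
    "EAT-Net": {"params_m": 45, "gflops": 120, "minutes": 2.5},
    "NuAP-RCC": {"params_m": 30, "gflops": 90, "minutes": 2.0},
}
# 9 M is the midpoint of the reported ~8-10 M range (assumption)
PROPOSED = {"params_m": 9, "gflops": 25, "minutes": 1.1}

def reduction_factors(baseline, ours):
    """How many times larger each baseline quantity is than the proposed model's."""
    return {key: baseline[key] / ours[key] for key in ours}

for name, stats in BASELINES.items():
    factors = reduction_factors(stats, PROPOSED)
    print(name, {key: round(value, 2) for key, value in factors.items()})
```

Because the table entries are themselves approximate, these factors should be read as rough orders of magnitude rather than precise measurements.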

6. Discussion

The goal of this study was to revisit the classical segmentation-classification paradigm used in the grading of renal cell carcinoma (RCC) and to improve it through stain normalization, nuclei-aware feature fusion, and attention-based aggregation. Whereas recent research in RCC grading focuses mainly on transformer-based or graph-based nuclei-centric frameworks, our results indicate that a well-tuned convolutional pipeline can produce performance in line with state-of-the-art methods while retaining the benefits of efficiency, interpretability, and practical deployability. The proposed framework has fewer parameters and lower computational cost than transformer-based methods. Slide-level inference completes within reasonable time limits, supporting potential routine clinical use. Deep learning–based RCC grading systems have the potential to improve diagnostic consistency and support clinical decision making.^63–66 These findings demonstrate that performance improvements in RCC grading can be achieved through principled architectural design rather than increased model complexity. The proposed model achieves comparable macro-F1 performance (0.94) while using significantly fewer parameters and computational resources than transformer-based and nuclei-centric models. This suggests that efficient architectural design can substitute for computationally expensive global attention mechanisms in RCC grading.

6.1. Key findings and interpretation

Among the results, the most prominent is the contribution of tumor-guided patch selection. Specifically, the ablation study found that removing the segmentation step caused the largest performance drop, which confirms that restricting the analysis to tumor-rich regions reduces background noise and aligns the grading decision with the biologically relevant tissue, namely the tumor. This finding is also consistent with the pathological basis of ISUP grading, which depends on nuclear morphology in tumor areas. Integrating information at the slide level was also very important. Attention-based MIL outperformed majority voting in all cases, implying that adaptive weighting of informative regions is important for WSI analysis. In addition, dual-stream MIL further boosted performance on slides harboring localized high-grade areas, reflecting the heterogeneity of RCC.
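
The difference between the two slide-level aggregation strategies can be illustrated with a minimal NumPy sketch. It follows the general form of attention pooling from the attention-based MIL literature (a softmax over per-patch scores of the form w^T tanh(V^T h)); it is not the trained model from this study. The function names, dimensions, and random parameters are assumptions for illustration, and grades are encoded as indices 0–3.

```python
import numpy as np


def attention_mil_pool(H, V, w):
    """Attention pooling over a bag of patch embeddings.

    H: (n_patches, d) patch features. V: (d, h) and w: (h,) are attention
    parameters (random in this sketch, learned in practice).
    """
    scores = np.tanh(H @ V) @ w        # one scalar score per patch
    scores = scores - scores.max()     # stabilise the softmax numerically
    weights = np.exp(scores) / np.exp(scores).sum()
    slide_embedding = weights @ H      # attention-weighted average of patches
    return slide_embedding, weights


def majority_vote(patch_grades, n_grades=4):
    """Slide grade = most frequent patch-level grade (ties go to the lowest)."""
    return int(np.bincount(patch_grades, minlength=n_grades).argmax())


rng = np.random.default_rng(0)
H = rng.normal(size=(6, 8))            # 6 hypothetical patches, 8-dim features
V = rng.normal(size=(8, 4))
w = rng.normal(size=4)
embedding, weights = attention_mil_pool(H, V, w)

# A slide with a small high-grade focus: voting reports the dominant lower grade.
patch_grades = np.array([1, 1, 1, 1, 3, 3])
vote = majority_vote(patch_grades)
```

Majority voting weights every patch equally, so a localized high-grade region is outvoted by the surrounding tissue. Attention pooling instead assigns each patch its own weight, which lets a trained model emphasize the few diagnostically decisive patches.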

Moreover, the nuclei-level descriptors offered an additional layer of discriminative information beyond CNN features alone. Since RCC grading criteria are based on nuclear size, pleomorphism, and nucleolar prominence, adding explicit nuclear morphology features enhances biological relevance. Notably, the choice of backbone had a smaller effect than the aggregation strategy. Modern CNNs such as EfficientNet and ConvNeXt obtained only small gains over DenseNet, and these gains were clearly smaller than those obtained from segmentation and MIL. This implies that slide-level reasoning matters more than simply increasing the complexity of the backbone.

6.2. Positioning relative to recent literature

Recent studies such as RAF2Net and NuAP-RCC report slightly higher peak performance using transformer or nuclei-graph frameworks. However, these systems often involve substantial computational cost and complex multi-stage pipelines. In contrast, the proposed method achieves performance within the same range while using a simpler convolutional design. Although it may not always reach the highest reported macro-F1, it offers a favorable balance between accuracy, efficiency, and interpretability. For clinical deployment, such a balance can be more meaningful than marginal gains in benchmark scores. Importantly, this study includes cross-dataset validation across three public cohorts, which is not consistently reported in state-of-the-art studies. This broader evaluation provides stronger evidence of generalizability.

6.3. Comparative analysis with state-of-the-art

The proposed framework was compared with the latest RCC grading techniques, including transformer-based and nuclei-centric models. Although these methods perform well, they usually depend on complicated architectures and incur a higher computational cost. In contrast, the proposed framework achieves a macro-F1 of 0.94 and a QWK of 0.92 while being significantly less computationally complex. This is made possible by tumor-guided patch selection, nuclei-aware feature representation, and attention-based aggregation. Unlike transformer-based models that rely on global self-attention with quadratic complexity, the proposed method uses efficient convolutional feature extraction and selective aggregation, which results in faster inference and reduced memory requirements.
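
The scaling argument can be made concrete by counting attention scores. This is a rough sketch under stated assumptions: it counts only the number of attention scores per layer (not total FLOPs), the patch counts are hypothetical, and efficient transformer variants that approximate self-attention are not modeled.

```python
def pairwise_attention_scores(n_patches: int) -> int:
    """Global self-attention compares every patch with every other patch."""
    return n_patches * n_patches


def pooling_attention_scores(n_patches: int) -> int:
    """Attention pooling computes a single score per patch."""
    return n_patches


# Hypothetical numbers of tumor patches retained for one slide.
for n in (1_000, 10_000):
    ratio = pairwise_attention_scores(n) // pooling_attention_scores(n)
    print(f"{n} patches: self-attention needs {ratio}x more attention scores")
```

Because the ratio equals the number of patches, the gap widens as slides contribute more patches, which is why restricting the bag to tumor regions and using a single pooling step both reduce cost.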

In addition, cross-dataset evaluation shows that the proposed model maintains stable performance under domain shift, whereas many earlier studies report results on single datasets only. This supports the robustness and practical applicability of the framework. Analysis of error patterns further reveals that most misclassifications involve Grades 2 and 3, which is consistent with recent RCC grading literature. This suggests that the errors largely reflect intrinsic morphological ambiguity rather than model limitations alone. Overall, the findings suggest that carefully designed convolutional pipelines can deliver performance on par with the latest models while offering benefits in efficiency, explainability, and clinical use.

6.4. Clinical relevance

The results of the interpretability analysis indicate that the model attends to nucleolar prominence and nuclear pleomorphism, the features used in ISUP grading. This agreement with pathological reasoning helps to build trust and indicates how AI and humans could work together. The main confusion, between Grades 2 and 3, coincides with the known pattern of inter-pathologist disagreement. Since the QWK scores are close to those of human agreement, the model appears to behave in a clinically realistic manner rather than being artificially optimized. In addition, the short inference time per WSI makes it feasible to use the model in routine pathology workflows.
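
To show why QWK is sensitive to the distance between grades, the following minimal sketch implements Cohen's weighted kappa with quadratic weights in NumPy. The label vectors are toy data invented for illustration, not results from this study, and grades are encoded as indices 0–3.

```python
import numpy as np


def quadratic_weighted_kappa(y_true, y_pred, n_grades=4):
    """Cohen's weighted kappa with quadratic disagreement weights."""
    observed = np.zeros((n_grades, n_grades))
    for t, p in zip(y_true, y_pred):
        observed[t, p] += 1
    # Expected counts under chance agreement, built from the marginals.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    idx = np.arange(n_grades)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n_grades - 1) ** 2
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()


# Toy labels (hypothetical): both predictions contain exactly two errors.
reference = [0, 1, 1, 2, 2, 3]
adjacent_errors = [0, 1, 2, 2, 1, 3]   # errors are one grade away
distant_errors = [0, 1, 3, 2, 0, 3]    # errors are two grades away

kappa_adjacent = quadratic_weighted_kappa(reference, adjacent_errors)
kappa_distant = quadratic_weighted_kappa(reference, distant_errors)
```

Both toy predictions have the same accuracy, yet the adjacent-grade errors yield a higher kappa than the distant ones. This is the property that makes QWK appropriate for ordinal grading, where a Grade 2 versus Grade 3 confusion is clinically less severe than a larger jump.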

The system is intended to support pathologists’ decisions rather than to replace the pathologist. Because the model can point pathologists to the tumor regions most relevant for diagnosis and supply consistent grade suggestions, it can help to reduce both grading variability and the workload of grade assignment. Its modest computing demand also allows it to run within digital pathology workflows, which makes it particularly convenient for resource-limited settings.

6.5. Efficiency vs accuracy trade-off

According to the findings, merely increasing architectural sophistication may not lead to an equivalent improvement in RCC grading performance. Transformer-based models, for example, facilitate the modeling of global dependencies through self-attention, but they incur increased computation time and memory usage because of quadratic complexity. In contrast, the method presented in this paper attains similar performance by combining three components: tumor-focused patch selection, nuclei-informed feature encoding, and attention-based aggregation. Tumor segmentation reduces irrelevant content, nuclei descriptors capture biologically meaningful features, and AB-MIL selectively focuses on the regions of high diagnostic value. This combination makes it possible to perform contextual reasoning efficiently without costlier global attention mechanisms, which is why equivalent accuracy levels can be reached at a much lower computational cost.
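
The tumor-focused patch selection step can be sketched as a simple filter over a binary tumor mask. This is an illustrative sketch, not the pipeline used in this study: the function name `select_tumor_patches`, the non-overlapping grid, and the 0.5 tumor-fraction threshold are all assumptions.

```python
import numpy as np


def select_tumor_patches(tumor_mask, patch_size, min_tumor_fraction=0.5):
    """Return (row, col) origins of patches whose tumor fraction meets the threshold.

    tumor_mask: 2D binary array (1 = tumor), e.g. a thresholded segmentation map.
    """
    selected = []
    rows, cols = tumor_mask.shape
    for r in range(0, rows - patch_size + 1, patch_size):
        for c in range(0, cols - patch_size + 1, patch_size):
            fraction = tumor_mask[r:r + patch_size, c:c + patch_size].mean()
            if fraction >= min_tumor_fraction:
                selected.append((r, c))
    return selected


# Toy 8x8 mask: tumor fills the top-left quadrant, plus a thin strip elsewhere.
mask = np.zeros((8, 8), dtype=np.uint8)
mask[:4, :4] = 1
mask[4:8, 4] = 1
patches = select_tumor_patches(mask, patch_size=4)
```

Only the patch dominated by tumor passes the filter in this toy example, while the patch containing the thin strip is discarded. Lowering the threshold keeps more borderline patches at the cost of more background, so the threshold trades recall of tumor tissue against the size of the bag passed to the aggregation step.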

6.6. Limitations

Although the proposed framework performs strongly, it has several limitations. First, the model depends on correct tumor segmentation, so inaccuracies in U-Net tumor segmentation could propagate to the grading step and influence the final performance. Second, despite cross-dataset testing, the datasets come from publicly available sources only and may not represent variations due to different scanners, staining protocols, and clinical institutions. Third, the framework has not been tested in clinical settings, where real-time workflow constraints and pathologist interaction could affect performance. Lastly, while nuclei-level features help to explain the model results, they rely on precise nuclei segmentation and might be affected by staining artefacts or overlapping nuclei. We intend to extend our research to include multi-center validation, multi-scanner robustness, and prospective clinical evaluation.

6.7. Future directions

Future research could aim to enhance the robustness and clinical applicability of RCC grading systems. One promising direction is self-supervised or foundation-model pretraining on large histopathology datasets to improve feature generalization across institutions. Multi-scale feature aggregation could also be explored to capture both cellular-level morphology and global tissue architecture. Incorporating clinical metadata, molecular markers, or genomic information may enable more comprehensive prognostic modeling beyond grade prediction. Prospective validation studies in real clinical workflows are required to assess usability and diagnostic impact in practice. Lastly, human-AI collaborative grading systems might combine algorithmic consistency with expert pathological judgment, thereby improving reliability and trust in AI-assisted diagnosis.

The findings show that improvements in RCC grading are possible through a careful integration of domain knowledge and efficient architectural design, not necessarily through increased model complexity. This is particularly important for clinical deployment, where computational efficiency and interpretability are the main concerns.^67–71

7. Conclusion

This work introduced a hybrid DenseNet-U-Net system for automatically grading renal cell carcinoma from whole-slide pathology images. By combining tumor-targeted segmentation, convolutional feature extraction, nuclei descriptors, stain normalization, and attention-based aggregation, the proposed system delivers grading results that are on par with the state of the art across several publicly available datasets.

The results highlight that carefully designed convolutional architectures remain very useful for computational pathology. Even though the research trend is moving towards transformer-based or nuclei-graph architectures, a simple and transparent pipeline such as ours can deliver similar results with less computation and can be applied more easily in practice. Notably, segmentation guidance and attention-based aggregation contribute more than backbone complexity by itself. More importantly, the pattern of the model’s mistakes is consistent with the clinical challenges of grading, and the level of ordinal agreement is close to the level of inter-pathologist agreement documented in the literature. These observations support the use of such systems as decision-support tools rather than as replacements for expert judgment.

Overall, the paper emphasizes that progress in RCC grading depends not only on architectural complexity but also on the combination of biologically relevant features and reliable performance across datasets. We hope that this paper will motivate the development of AI systems that are grounded in clinical knowledge, efficient, easy for humans to understand, and open to further modification. In short, the study shows that, with careful design and tuning, convolutional networks can remain a reliable option whose quality and clinical effectiveness are on par with the most advanced transformer-based results.

Footnotes

Acknowledgments

The authors gratefully acknowledge the generous support of San Jose State University, USA, for providing funding assistance toward the research and publication of this work. The authors also extend their appreciation to Hope Diagnostics and Research Lab, Bhubaneswar, India, for their contribution in providing histopathological resources used in this study, which ensured the quality and accuracy of the visual material.

ORCID iD

Sital Dash

Ethical considerations

This study used only publicly available, de-identified histopathology datasets (TCGA-ccRCC, RCdpia, and MMIST-ccRCC). No patient-identifiable information was accessed, and the study did not involve direct participation of human subjects or animals. All datasets were originally collected under their respective institutional ethical approvals and were released for research use in anonymized form. The present study involved secondary analysis of anonymized data and therefore did not require additional institutional review board (IRB) approval or patient consent. Data usage complied with the licensing and governance policies specified by the dataset providers. The research complies with the principles of the Declaration of Helsinki and adheres to relevant institutional and journal ethical guidelines.

Author contributions

• Rohini Jadhav (RJ): Conceptualization, Methodology, Data Preprocessing, Writing – Original Draft.

• Banani Mohapatra (BM): Literature Review, Formal Analysis, Writing – Review & Editing.

• Bhavnish Walia (BW): Experimental Setup, Model Implementation, Software, Technical Validation.

• Sital Dash (SD): Conceptualization, Methodology, Supervision, Writing – Review & Editing, Corresponding Author.

• Kailas Patil (KP): Data Curation, Dataset Preparation, Visualization.

• Shrikant Jadhav (SJ): Algorithm Development, Ablation Study Design, Supervision, Manuscript Review, Corresponding Author.

• Ishwari Rohit Raskar (IRR): Statistical Analysis, Cross-Dataset Validation, Writing – Proofreading & Editing.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

All datasets used in this study are publicly available:

• TCGA-ccRCC: Available through The Cancer Genome Atlas (TCGA) data portal.

• RCdpia: Publicly accessible research dataset.

• MMIST-ccRCC: Publicly accessible research dataset.

The code supporting this study is available from the corresponding author upon reasonable request and will be publicly released upon acceptance.

References

1. Chanchal, Lal, Kini. RCCGNet: A convolutional neural network for automated grading of renal cell carcinoma. Sci Rep 2023; 13(1): 31125. https://doi.org/10.1038/s41598-023-31125-2

2. Chanchal, Lal, Kumar, et al. A novel dataset and efficient deep learning framework for automated grading of renal cell carcinoma from kidney histopathology images. Sci Rep 2023; 13(1): 5728. https://doi.org/10.1038/s41598-023-31275-7

3. Kundu, Ghosh, Pal, et al. RAF2Net: Automated grading of renal cell carcinoma utilizing attention-enhanced deep learning models through feature fusion. bioRxiv 2024; https://doi.org/10.1101/2024.07.22.604646

4. Cheng, Zhang, Huang. NuAP-RCC: Nuclei-aware attention pooling for grading renal cell carcinoma from histopathological images. Med Image Anal 2024; 93: 103077. https://doi.org/10.1016/j.media.2024.103077

5. Al-Kuwari, Al-Maadeed. EAT-Net: An efficient attention transformer network for renal cell carcinoma grading from histopathological images. Comput Biol Med 2025; 171: 108084. https://doi.org/10.1016/j.compbiomed.2024.108084

6. Alghamdi, Alsamri, Obayya. CVDTLM-AGRCC: A computer vision assisted deep transfer learning model for accurate grading of renal cell carcinoma. Sci Rep 2025; 15(1): 19930. https://doi.org/10.1038/s41598-025-19930-7

7. Naylor, Laé, Reyal, et al. Segmentation of nuclei in histopathology images by deep regression of the distance map. IEEE Trans Med Imaging 2019; 38(2): 448–459. https://doi.org/10.1109/TMI.2018.2865716

8. Zhou, Chang, Barner, et al. Classification of histology sections via multi-scale convolutional neural networks. In: 2019 IEEE 16th international symposium on biomedical imaging (ISBI 2019), Venice, Italy, 8–11 April 2019, pp. 659–662. https://doi.org/10.1109/ISBI.2019.8759270

9. Macenko, Niethammer, Marron, et al. A method for normalizing histology slides for quantitative analysis. In: 2009 IEEE international symposium on biomedical imaging: from nano to macro, Boston, MA, 28 June 2009–1 July 2009, pp. 1107–1110. https://doi.org/10.1109/ISBI.2009.5193250

10. Reinhard, Adhikhmin, Gooch, et al. Color transfer between images. IEEE Comput Graph Appl 2001; 21(5): 34–41. https://doi.org/10.1109/38.946629

11. Distante, Bevilacqua, Brunetti, et al. Artificial intelligence in renal cell carcinoma histopathology: A comprehensive review. Cancers (Basel) 2023; 15(12): 3087. https://doi.org/10.3390/cancers15123087

12. Zhu, Ren, Richards, et al. Development and evaluation of a deep neural network for histologic classification of renal cell carcinoma on biopsy and surgical resection slides. Sci Rep 2021; 11(1): 7080. https://doi.org/10.1038/s41598-021-86540-4

13. Sun, Zhang, Wang, et al. TGMIL: A hybrid multi-instance learning model based on transformer and graph attention network for whole-slide image classification of renal cell carcinoma. Comput Methods Programs Biomed 2023; 242: 107789. https://doi.org/10.1016/j.cmpb.2023.107789

14. Chen, Zhang, et al. Deep learning for histopathological grading of renal cell carcinoma. IEEE Access 2022; 10: 32145–32156. https://doi.org/10.1109/ACCESS.2022.3160523

15. Luo, Zeng, Huang, et al. Automated grading of clear cell renal cell carcinoma using deep convolutional neural networks. Comput Med Imaging Graph 2022; 95: 102030. https://doi.org/10.1016/j.compmedimag.2021.102030

16. Jia, Wang, et al. Large scale tissue histopathology image classification, segmentation, and visualization via deep convolutional activation features. BMC Bioinformatics 2021; 22(1): 152. https://doi.org/10.1186/s12859-021-04135-5

17. Chen, Ding, Chen, et al. Weakly supervised learning for RCC grading on whole-slide images. Med Image Anal 2021; 73: 102165. https://doi.org/10.1016/j.media.2021.102165

18. Huang, et al. Attention-based deep learning for histopathological grading of renal tumors. Pattern Recognit Lett 2022; 158: 82–89. https://doi.org/10.1016/j.patrec.2022.03.014

19. Wang, Yang, Rong, et al. Pathology image analysis using segmentation-based deep learning framework for RCC grading. Bioinformatics 2021; 37(13): 1868–1875. https://doi.org/10.1093/bioinformatics/btaa928

20. Zhang, Yang, et al. Multi-scale deep learning for renal cell carcinoma histopathological grading. Comput Biol Med 2022; 146: 105597. https://doi.org/10.1016/j.compbiomed.2022.105597

21. Althubaiti, Alharbi, AlGhamdi. Deep learning-based RCC grading using histopathological images. Diagnostics (Basel) 2023; 13(5): 910. https://doi.org/10.3390/diagnostics13050910

22. Huang, Chen. Weakly supervised RCC grading via multiple instance learning. Med Phys 2023; 50(4): 2015–2027. https://doi.org/10.1002/mp.16021

23. Zhou, Wang. Self-supervised representation learning for renal cancer histopathology grading. IEEE Trans Med Imaging 2024; 43(1): 210–222. https://doi.org/10.1109/TMI.2023.3305128

24. Singh, Mehta, Patel. Nuclei-aware deep learning framework for RCC grading. Med Image Anal 2024; 90: 102940. https://doi.org/10.1016/j.media.2023.102940

25. Wang, Yang, Mahmood. HipoMIL: Hierarchical multiple instance learning for whole-slide image classification. Med Image Anal 2023; 87: 102796. https://doi.org/10.1016/j.media.2023.102796

26. Chen, Williamson DFK, et al. Pan-cancer integrative histology–genomic analysis via deep learning. Cancer Cell 2021; 39(6): 829–844.e6. https://doi.org/10.1016/j.ccell.2021.05.004

27. Feng, et al. Deep learning of feature representation with multiple instance learning for medical image analysis. In: 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP), Florence, Italy, 4–9 May 2014, pp. 1626–1630. https://doi.org/10.1109/ICASSP.2014.6853872

28. Carbonneau, Cheplygina, Granger, et al. Multiple instance learning: A survey of problem characteristics and applications. Pattern Recognit 2018; 77: 329–353. https://doi.org/10.1016/j.patcog.2017.12.009

29. Stringer, Wang, Michaelos, et al. Cellpose: A generalist algorithm for cellular segmentation. Nat Methods 2021; 18(1): 100–106. https://doi.org/10.1038/s41592-020-01018-x

30. Vahadane, Peng, Sethi, et al. Structure-preserving color normalization and sparse stain separation for histological images. IEEE Trans Med Imaging 2016; 35(8): 1962–1971. https://doi.org/10.1109/TMI.2016.2529665

31. Zanjani, Zinger, van der Laak. Stain normalization of histopathology images using generative adversarial networks. In: 2018 IEEE 15th international symposium on biomedical imaging (ISBI 2018), Washington, DC, 4–7 April 2018, pp. 573–576. https://doi.org/10.1109/ISBI.2018.8363632

32. Selvaraju, Cogswell, Das, et al. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE international conference on computer vision, Venice, Italy, 22–29 October 2017, pp. 618–626. https://doi.org/10.1109/ICCV.2017.74

33. Zhou, Khosla, Lapedriza, et al. Learning deep features for discriminative localization. In: Proceedings of the IEEE conference on computer vision and pattern recognition, Las Vegas, NV, 27–30 June 2016, pp. 2921–2929. https://doi.org/10.1109/CVPR.2016.319

34. Tan. EfficientNetV2: Smaller models and faster training. In: Proceedings of the 38th international conference on machine learning, Virtual, 18–24 July 2021, pp. 10096–10106.

35. Eliceiri. Selective instance attention for whole-slide image classification. Med Image Anal 2022; 79: 102450. https://doi.org/10.1016/j.media.2022.102450

36. Yang, Zhang, et al. A unified attention-based MIL framework for histopathology image classification. Pattern Recognit 2023; 139: 109470. https://doi.org/10.1016/j.patcog.2023.109470

37. Ronneberger, Fischer, Brox. U-Net: Convolutional networks for biomedical image segmentation. In: Medical image computing and computer-assisted intervention – MICCAI 2015, Munich, Germany, 5–9 October 2015, pp. 234–241. https://doi.org/10.1007/978-3-319-24574-4_28

38. Ilse, Tomczak, Welling. Attention-based deep multiple instance learning. Proc 35th Int Conf Mach Learn 2018; 80: 2127–2136.

39. Zhou, Chen, et al. Dynamic multiple instance learning for whole slide image classification. IEEE Trans Med Imaging 2024; 43(2): 512–524. https://doi.org/10.1109/TMI.2023.3321245

40. Tang, Wang, Zhang. Token-level attention for transformer-based multiple instance learning in histopathology. Med Image Anal 2023; 86: 102789. https://doi.org/10.1016/j.media.2023.102789

41. Wang, Chen, Mahmood. Efficient transformer-based MIL for large-scale whole-slide image classification. Med Image Anal 2024; 91: 102978. https://doi.org/10.1016/j.media.2024.102978

42. Huang, Liu, Van Der Maaten, et al. Densely connected convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, Honolulu, HI, 21–26 July 2017, pp. 4700–4708. https://doi.org/10.1109/CVPR.2017.243

43. Tan. EfficientNet: Rethinking model scaling for convolutional neural networks. In: Proceedings of the 36th international conference on machine learning, Long Beach, CA, 9–15 June 2019, pp. 6105–6114.

44. Tjoa, Guan. A survey on explainable artificial intelligence (XAI): Toward medical XAI. IEEE Trans Neural Netw Learn Syst 2020; 32(11): 4793–4813. https://doi.org/10.1109/TNNLS.2020.3027314

45. Jaume, Chen, et al. Quantifying explainability of deep learning models in computational pathology. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, Nashville, TN, 19–25 June 2021, pp. 1935–1944. https://doi.org/10.1109/CVPR46437.2021.00197

46. Chen, Williamson DFK, et al. Towards interpretable deep learning in computational pathology. Nat Biomed Eng 2023; 7(5): 635–648. https://doi.org/10.1038/s41551-022-00980-6

47. Cohen. Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychol Bull 1968; 70(4): 213–220. https://doi.org/10.1037/h0026256

48. Liu, Mao, et al. A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, New Orleans, LA, 18–24 June 2022, pp. 11976–11986. https://doi.org/10.1109/CVPR52688.2022.01167

49. Raghu, Zhang, Kleinberg, et al. Transfusion: Understanding transfer learning for medical imaging. Adv Neural Inf Process Syst 2019; 32: 3342–3352.

50. Shin, Roth, Gao, et al. Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Trans Med Imaging 2016; 35(5): 1285–1298. https://doi.org/10.1109/TMI.2016.2528162

51. Çiçek, Abdulkadir, Lienkamp, et al. 3D U-Net: Learning dense volumetric segmentation from sparse annotation. In: Medical image computing and computer-assisted intervention – MICCAI 2016, Athens, Greece, 17–21 October 2016, pp. 424–432. https://doi.org/10.1007/978-3-319-46723-8_49

52. Zhou, Siddiquee MMR, Tajbakhsh, et al. UNet++: A nested U-Net architecture for medical image segmentation. In: Deep learning in medical image analysis and multimodal learning for clinical decision support, Granada, Spain, 20 September 2018, pp. 3–11. Springer. https://doi.org/10.1007/978-3-030-00889-5_1

53. Isensee, Jaeger, Kohl, et al. nnU-Net: A self-configuring method for deep learning-based biomedical image segmentation. Nat Methods 2021; 18(2): 203–211. https://doi.org/10.1038/s41592-020-01008-z

54. Graham, Sea, et al. Hover-Net: Simultaneous segmentation and classification of nuclei in multi-tissue histology images. Med Image Anal 2019; 58: 101563. https://doi.org/10.1016/j.media.2019.101563

55. Schmidt, Weigert, Broaddus, et al. Cell detection with star-convex polygons. In: Medical image computing and computer assisted intervention – MICCAI 2018, Granada, Spain, 16–20 September 2018, pp. 265–273. https://doi.org/10.1007/978-3-030-00934-2_30

56. Weigert, Schmidt, Haase, et al. StarDist: Object detection with star-convex polygons. In: 2020 IEEE winter conference on applications of computer vision (WACV), Snowmass Village, CO, 1–5 March 2020, pp. 3274–3283. https://doi.org/10.1109/WACV45572.2020.9093435

57. Williamson DFK, Chen, et al. Data-efficient and weakly supervised computational pathology on whole-slide images. Nat Biomed Eng 2021; 5(6): 555–570. https://doi.org/10.1038/s41551-020-00682-w

58. Chen, Williamson DFK, et al. AI-based pathology predicts origins for cancers of unknown primary. Nature 2021; 594(7861): 106–110. https://doi.org/10.1038/s41586-021-03512-4

59. Chen, Ding, et al. Pathomic fusion: An integrated framework for fusing histopathology and genomic features for cancer diagnosis and prognosis. IEEE Trans Med Imaging 2022; 41(4): 757–770. https://doi.org/10.1109/TMI.2021.3124182

60. Jaume, Chen, et al. Histology-based graph inference of molecular alterations. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, Nashville, TN, 19–25 June 2021, pp. 9643–9652. https://doi.org/10.1109/CVPR46437.2021.00952

61. Wang, Chen, Yang, et al. HipoMap: Hierarchical point cloud representation for whole-slide image classification. Med Image Anal 2022; 77: 102343. https://doi.org/10.1016/j.media.2022.102343

62. Weinstein, Collisson, Mills, et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nat Genet 2013; 45(10): 1113–1120. https://doi.org/10.1038/ng.2764

63. Fleiss, Cohen, Everitt. Large sample standard errors of kappa and weighted kappa. Psychol Bull 1969; 72(5): 323–327. https://doi.org/10.1037/h0028106

64. Dice. Measures of the amount of ecologic association between species. Ecology 1945; 26(3): 297–302. https://doi.org/10.2307/1932409

65. Sørensen. A method of establishing groups of equal amplitude in plant sociology based on similarity of species content. Biol Skr 1948; 5: 1–34.

66. Taha, Hanbury. Metrics for evaluating 3D medical image segmentation: Analysis, selection, and tool. BMC Med Imaging 2015; 15: 29. https://doi.org/10.1186/s12880-015-0068-x

67. Eliceiri. Dual-stream multiple instance learning network for whole-slide image classification with self-supervised contrastive learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, Nashville, TN, 19–25 June 2021, pp. 14318–14328. https://doi.org/10.1109/CVPR46437.2021.01409

68. Eliceiri. Dual-stream multiple instance learning network for histopathology image classification. IEEE Trans Med Imaging 2021; 40(10): 2473–2486. https://doi.org/10.1109/TMI.2021.3082742

69. Shao, Bian, Chen, et al. TransMIL: Transformer based correlated multiple instance learning for whole slide image classification. Adv Neural Inf Process Syst 2021; 34: 2136–2147.

70. Huang, Ding, Chen, et al. Weakly supervised learning for histopathological image classification using multiple instance learning. Med Image Anal 2021; 73: 102165. https://doi.org/10.1016/j.media.2021.102165

71. Campanella, Hanna, Geneslaw, et al. Clinical-grade computational pathology using weakly supervised deep learning on whole-slide images. Nat Med 2019; 25(8): 1301–1309. https://doi.org/10.1038/s41591-019-0508-1