Abstract
Introduction
Accurate grading of renal cell carcinoma (RCC) from histopathology slides is essential for predicting prognosis and selecting treatment, yet assessments vary considerably between observers. Existing AI systems rely mainly on transformer-based and nuclei-centric models, whose high computational demands limit their clinical use. Hence, there is a need for practical and easy-to-understand solutions that can be integrated into digital pathology workflows.
Methods
We built a hybrid framework that integrates U-Net-based tumor segmentation, convolutional feature extraction, nuclei-aware descriptors, stain normalization, and attention-based multiple instance learning for slide-level RCC grading. The framework was tested on 3 public datasets (TCGA-ccRCC, RCdpia, MMIST-ccRCC) with a cross-dataset validation approach. The metrics used for the performance evaluation were macro-F1 and quadratic weighted kappa (QWK).
Results
The proposed method achieved a macro-F1 of 0.94, a QWK of 0.92, and an accuracy of 0.95. Tumor patch extraction and attention-based aggregation produced the largest improvements. The method remained effective across the different datasets. Most model errors fell within the range of variability seen in clinical grading. Average inference time was approximately 70 seconds per slide.
Conclusion
A fine-tuned convolutional pipeline can deliver RCC grading performance rivaled by few methods while remaining efficient and interpretable, making it a strong candidate for clinical deployment as a decision-support tool in digital pathology.
Keywords
1. Introduction
Renal cell carcinoma (RCC) is the most prevalent type of kidney cancer, accounting for almost 90% of all renal cancers, and is one of the main causes of cancer-related deaths globally.1 The precise histopathological grading of RCC is of great importance for predicting outcome, deciding treatment, and stratifying patients.2,3 The International Society of Urological Pathology (ISUP) grading system, which mainly relies on nuclear morphology and nucleolar prominence, is the standard tool for grading RCC.1,4 Nevertheless, RCC grading remains an expert-dependent, subjective, and time-consuming task. Variability between and within observers, especially in intermediate grades, can lead to inconsistent diagnoses and adversely affect patient care.1,3
The rapid development of whole-slide imaging (WSI) and digital pathology has created a strong need for automated, objective, and reproducible grading systems.5,6 Advances in artificial intelligence have recently enabled more precise and reproducible analysis of histopathology images in different clinical settings.7–10 Deep learning has emerged as a potent tool in computational pathology, showing excellent results in tumor detection, segmentation, and classification.11–13 Nevertheless, constructing RCC grading systems that are accurate, computationally efficient, understandable, and ready for clinical implementation still poses a challenge.14–16
Figure 1 presents representative histopathological patterns of RCC, demonstrating the subtle nuclear and structural differences that make accurate grading challenging even for expert pathologists.
Figure 1. Representative histopathological appearance of renal cell carcinoma (RCC) illustrating nuclear morphology and cellular features associated with different tumor grades. The figure highlights the visual variability and subtle structural differences that make RCC grading challenging in routine pathology practice. Reproduced with permission from Hope Diagnostics and Research Lab, Bhubaneswar, India.
1.1. Background
Initial deep learning strategies for RCC grading were mostly limited to patch-level categorization using convolutional neural networks like ResNet and DenseNet.3,11,12 Such methods demonstrated good results but they typically did not account for explicit tumor localization, which resulted in them being quite sensitive to non-tumor tissue, staining differences and domain changes between institutions.17,18
To tackle these issues, segmentation–classification workflows were introduced. Tumor areas are first detected using segmentation models, most commonly U-Net, and only patches that contain tumor are sent to the grading stage.17,19 This strategy increases robustness, because less irrelevant background information is involved, and improves interpretability, because the regions that led to classification decisions can be visualized.11,18 Consequently, hybrid CNN pipelines appear to be a realistic compromise between performance and explainability.
Recently, RCC grading studies have explored more complex deep learning models. RAF2Net deployed an attention-enhanced feature fusion network for automatic grading of RCC from histopathology images.14 NuAP-RCC introduced a nuclei-aware attention pooling approach that integrates nuclear morphology and deep feature representations through multiple instance learning.4 EAT-Net featured an Efficient Attention Transformer design to simultaneously capture local nuclear morphology and global contextual information, achieving macro-F1 scores of over 0.93.16
1.2. Research gap
Segmentation–classification pipelines are still the dominant approach for grading RCC. However, the latest RCC-targeting studies are shifting their focus towards nuclei-centric and transformer-based multiple-instance learning (MIL) architectures, achieving macro-F1 scores higher than 0.93.14,16 These methods hold great promise for enhancing grading accuracy.
However, these methodologies usually involve complicated multi-step procedures and heavy computational requirements. Transformer-based MIL models typically bring about an increase in memory usage and a decrease in transparency when compared to traditional convolutional methods.5,6,20 Besides, nuclei-based methods heavily rely on the correctness of nuclei segmentation or detection, thereby adding more processing steps and the possibility of error propagation.21,22 Additionally, their effectiveness could be compromised by differences in staining protocols, tissue preparation, and scanner characteristics between institutions.23,24
Therefore, finding RCC grading methodologies that offer a good trade-off between high predictive power on the one hand and simplicity, the ability to handle domain variations, and interpretability on the other hand remains a challenge.
1.3. Motivation
From a clinical point of view, automated RCC grading systems should be not only highly accurate but also stable, interpretable, and practical for daily use. Many pathology labs operate under constraints on computational infrastructure, turnaround time, and technical support, so systems that need large memory resources or sophisticated multi-stage processing may be rejected for practical reasons.
Moreover, the histopathological evaluation of RCC is complicated by variations in staining quality, slide preparation, and scanner characteristics between institutions. For automated methods to be of genuine clinical value, they must first demonstrate that they operate consistently under such variations. These practical issues motivate grading systems that, besides predictive ability, also prioritize reliability, transparency, and operational feasibility. Creating methods that fit actual clinical work is a crucial step toward wider use of AI-assisted pathology.
1.4. Research hypothesis
While segmentation–classification pipelines remain the mainstay of digital pathology, recent RCC research has focused on nuclei-centric and transformer-based multiple-instance learning (MIL) approaches that achieve high macro-F1 scores. The main disadvantages of these models are their high computational resource needs and complex processing pipelines.
We hypothesize that an adequately optimized convolutional segmentation–classification pipeline incorporating stain normalization, nuclei-aware descriptors, and attention-based aggregation should deliver performance on a par with transformer-based and nuclei-centric methods while substantially reducing computational complexity, inference time, and memory requirements. The rationale is that CNNs capture spatial hierarchies well and become even more powerful when combined with biologically informed feature representations and adaptive aggregation mechanisms.
1.5. Research questions
RQ1: Is it possible for a hybrid U-Net + DenseNet pipeline to perform RCC grading equally well as transformer-MIL and nuclei-centric methods?
RQ2: Does explicit tumor segmentation improve robustness and grading reliability across multi-institutional datasets?
RQ3: Can stain normalization and attention-based aggregation reduce domain shift and enhance cross-dataset generalization?
RQ4: Does a lightweight CNN-based framework provide a better balance between accuracy, efficiency, and interpretability for clinical deployment?
1.6. Contributions of this work
The main contributions of this study are as follows:
1. We propose a lightweight, purely convolutional hybrid DenseNet–U-Net framework for automated RCC grading that integrates tumor segmentation and classification into an end-to-end pipeline.
2. We demonstrate that a carefully designed CNN-based architecture can achieve competitive performance with recent transformer-MIL and nuclei-centric RCC models while being computationally more efficient.
3. We incorporate stain normalization and attention-based aggregation to improve robustness and cross-dataset generalization.
4. We validate the proposed framework on three publicly available datasets (TCGA-ccRCC, RCdpia, and MMIST-ccRCC), demonstrating strong segmentation accuracy, reliable grading performance, and generalizability.
5. We show that classical segmentation–classification pipelines, when modernized and carefully optimized, remain highly relevant and clinically deployable alternatives to complex transformer-based solutions.
2. Related work
Digital pathology research on RCC grading has progressively evolved from patch-level convolutional classifiers toward attention-driven aggregation and transformer-based frameworks.2,14–16,25–28 These approaches have improved predictive performance on public and multi-center datasets, often reporting macro-F1 values above 0.90. However, these gains are frequently accompanied by increased computational requirements, more complex training pipelines, and reduced interpretability, which limit clinical deployability.5,6,20
2.1. Patch-level CNN grading on histopathology images
Early and mid-stage RCC grading pipelines primarily learned discriminative texture and morphology cues from image patches using convolutional neural networks (CNNs).3,5,6 A representative work is RCCGNet, which proposed a lightweight CNN with a shared-channel residual design and reported strong performance on the KMC kidney histopathology dataset, achieving F1-scores close to 0.89 and accuracy around 0.90.1,2
Such patch-level CNN approaches are typically easier to train and deploy than whole-slide image (WSI)–level multiple instance learning (MIL) pipelines, making them attractive for practical clinical settings.20,29 However, they are sensitive to stain variability and background tissue and may struggle when grade-relevant patterns are spatially sparse or heterogeneously distributed.16,17 Several additional deep learning approaches for RCC grading and histopathology image analysis have been proposed in recent studies.30–33
2.2. CNN–attention and hybrid designs improving discriminability
To better capture diagnostically relevant regions, later works incorporated attention mechanisms or hybrid modules for adaptive feature emphasis.16,34 For example, RCG-Net integrated dynamic attention with an adaptive convolution strategy and reported improved grading performance on the KMC dataset with a weighted F1-score of approximately 0.909. 35 Similarly, the EFF-Net framework targeted efficient RCC grading by combining convolutional feature extraction with separable convolutions, reporting F1-scores around 0.919 while emphasizing computational efficiency. 36
These studies collectively show that carefully designed CNN + attention methods can achieve high grading accuracy without relying entirely on computationally heavy transformer architectures.34,36
2.3. Nuclei-focused learning and graph-based aggregation
Another parallel research avenue has explicitly modeled nuclear morphology and spatial distribution, inspired by the fact that RCC grading criteria in the WHO/ISUP system heavily rely on nuclear size, pleomorphism, and nucleolar prominence.1,4 NuAP-RCC is a well-known nuclei-focused method that leverages nuclei-based features and graph-based aggregation along with global CNN representations, demonstrating good generalization across datasets and institutions. 4
Although nuclei-focused methods bring enhanced biological plausibility and interpretability, they necessitate extra steps such as precise nuclei detection or segmentation and graph construction, which complicate the pipeline and increase computational overhead.16,18,37 These dependencies also present a higher potential for error propagation, especially under variable staining and scanning conditions.22,24
2.4. Transformer and dual-stream attention frameworks
Recently, more research has focused on transformer-based context modeling and dual-stream designs for RCC grading.13,36 EAT-Net is one example that explicitly considers efficiency measures such as parameter count and GFLOPs while remaining competitive in grading performance and reporting F1-scores around 0.92 on the KMC dataset. 15
While transformer architectures improve global context modeling, their multi-branch attention mechanisms and heavy feature fusion stages significantly increase inference time and reduce transparency, which may limit deployment in routine pathology workflows, particularly in low-resource settings.5,6,13
2.5. Feature fusion and deep transfer learning
Several recent studies have reported strong RCC grading performance using feature fusion strategies and transfer learning.18,19 For instance, fusion frameworks and deep transfer learning pipelines have achieved classification accuracies exceeding 0.93 on benchmark datasets such as KMC. 18 Other Scientific Reports studies have demonstrated that carefully tuned CNN backbones combined with transfer learning and hybrid modeling can achieve accuracy values close to 0.94.2,5
These outcomes highlight the importance of strong backbone feature representations and well-designed aggregation functions. However, these methods often involve multiple optimization stages and sophisticated training strategies, making reproducibility and large-scale deployment challenging.6,20
2.6. Theoretical advantages of the proposed framework
The proposed framework balances performance and efficiency through a fully convolutional feature extraction structure combined with attention-based aggregation and nuclei-aware features. Transformer-based models like EAT-Net depend on self-attention mechanisms whose cost is quadratic in the number of tokens, so they consume more memory and computational resources. In contrast, the cost of convolutional architectures like DenseNet scales linearly with image size, which allows more efficient feature extraction. In addition, U-Net-based tumor segmentation focuses feature learning on the relevant regions and thereby avoids unnecessary computation on background tissue. Nuclei-level descriptors bring in biologically meaningful priors consistent with ISUP grading criteria, which the model can use to recognize clinically relevant features without requiring complicated graph-based or transformer architectures. Finally, attention-based MIL assigns weights adaptively according to the importance of each patch at a much lower computational cost than transformer-based aggregation methods. This balance allows the framework to produce results comparable to those of more complex models with significantly fewer parameters and faster prediction times.
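To make the scaling argument concrete, the following minimal sketch counts pairwise interactions for global self-attention over a bag of patches against the single score per patch needed by attention pooling. The function names are hypothetical, and the counts ignore embedding width and constant factors, so they illustrate only the growth rate, not measured cost.

```python
def self_attention_pair_count(num_patches: int) -> int:
    """Query-key interactions when every patch attends to every patch."""
    return num_patches * num_patches


def attention_pooling_score_count(num_patches: int) -> int:
    """Attention scores when each patch receives exactly one weight."""
    return num_patches


if __name__ == "__main__":
    # Illustrative bag sizes only; real slides vary widely.
    for n in (100, 1_000, 10_000):
        pairs = self_attention_pair_count(n)
        scores = attention_pooling_score_count(n)
        print(f"{n} patches: {pairs} pairs vs {scores} scores")
```

The gap between the two counts grows in proportion to the bag size, which is why slide-level aggregation over thousands of patches is where the difference matters most.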
The results in Section 5.6 support this design. Tumor segmentation plays a crucial role: performance drops by a large margin (∼5% macro-F1) when it is removed, suggesting that spatial filtering of tumor regions yields better feature quality. Nuclei-level features give a steady increase (∼2–3%) in performance, indicating that biologically grounded features complement deep CNN representations. Attention-based MIL improves accuracy by ∼3% over majority voting while keeping the computational overhead low. In contrast, transformer-based MIL methods depend on global self-attention with quadratic complexity, so the computational cost rises without a proportional increase in performance for this task.
Taken together, this evidence suggests that the three components are not combined arbitrarily but complement each other functionally: segmentation decreases noise, nuclei descriptors add biological significance, and AB-MIL allows efficient contextual aggregation. Together they form a computationally efficient alternative to transformer-based and graph-based frameworks while reaching comparable accuracy. Whereas transformer-based models usually demand 3–5 times higher GFLOPs and memory, the proposed framework attains comparable macro-F1 performance (0.94) at much lower computational complexity.
2.7. Summary
Overall, the literature shows a clear performance trend toward nuclei-focused and attention/transformer-based aggregation strategies, with many works reporting F1-scores in the range of 0.90–0.92 and sometimes higher depending on dataset and evaluation protocol.4,14,15
However, there remains an under-explored opportunity to achieve state-of-the-art RCC grading performance using a clinically deployable footprint: a lightweight, purely convolutional pipeline paired with careful stain normalization and attention-based aggregation that supports interpretability, computational efficiency, and fast inference.22–24,34
This gap directly motivates the research hypothesis and questions introduced in Sections 1.4 and 1.5, positioning our work as a practical and clinically oriented alternative to increasingly complex transformer- and nuclei-centric RCC grading frameworks.
Comparison with recent RCC grading papers.
3. Methodology
This section describes the full design of the proposed RCC grading framework. The pipeline combines explicit tumor localization with U-Net, feature extraction with DenseNet and other modern CNN backbones, attention-based multiple instance learning (MIL) for slide-level aggregation, and nuclei-level morphological analysis aligned with the ISUP grading criteria. The system is designed to be lightweight, interpretable, and ready for clinical deployment while still performing on par with the latest transformer and nuclei-centric models.
3.1. Overall framework
The proposed framework performs automated grading of RCC from whole-slide histopathology images. In the first step, a U-Net model is used to detect tumor areas, and tumor-rich patches are extracted from the WSI. Each of these patches is then converted into a deep feature representation by a CNN backbone. In parallel, nuclei instance segmentation is performed to obtain morphological features that describe nuclear size, shape, and intensity patterns. CNN features and nuclei descriptors are then combined at the patch level, and all patches of a slide are aggregated using an attention-based multiple instance learning (AB-MIL) module. Finally, the aggregated slide-level feature is used to infer the RCC grade. The framework thus exploits explicit tumor localization, biologically meaningful nuclei analysis, and modern attention-based aggregation within a lightweight and clinically interpretable setting (Figure 2).
Figure 2. Overall architecture of the proposed RCC grading framework integrating U-Net–based tumor segmentation, CNN feature extraction, nuclei-level analysis, and attention-based MIL aggregation for slide-level prediction.
This architecture ensures spatial interpretability via tumor segmentation, feature discriminability via CNN embedding, contextual reasoning via attention-based aggregation, and biological relevance via nuclei morphology modeling.
3.2. WSI preprocessing and patch generation
Each WSI is first processed to remove background using Otsu thresholding in the HSV color space. Tissue regions are identified, and patches with less than 20% tissue coverage are discarded. Non-overlapping patches of size 224×224 pixels are extracted at 20× magnification. These patches serve as the basic processing units for both segmentation and classification tasks.
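The background removal and tissue-coverage rule can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: it assumes that Otsu's threshold is computed on the saturation channel (the text specifies only the HSV color space), and the function names are hypothetical. A practical pipeline would run the same logic on a downsampled slide thumbnail with an array library.

```python
import colorsys
from typing import List, Sequence, Tuple


def saturation_channel(rgb_pixels: Sequence[Tuple[int, int, int]]) -> List[int]:
    """Convert 8-bit RGB pixels to 8-bit HSV saturation values."""
    return [
        int(round(colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)[1] * 255))
        for r, g, b in rgb_pixels
    ]


def otsu_threshold(values: Sequence[int], levels: int = 256) -> int:
    """Return the threshold t maximizing between-class variance (class 0: v <= t)."""
    hist = [0] * levels
    for v in values:
        hist[v] += 1
    total = len(values)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue
        if w1 == 0:
            break
        mu0, mu1 = sum0 / w0, (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def tissue_fraction(patch_saturation: Sequence[int], threshold: int) -> float:
    """Fraction of pixels whose saturation exceeds the threshold (treated as tissue)."""
    return sum(1 for s in patch_saturation if s > threshold) / len(patch_saturation)


def keep_patch(patch_saturation: Sequence[int], threshold: int,
               min_tissue: float = 0.20) -> bool:
    """Discard patches with less than 20% tissue coverage, as stated in the text."""
    return tissue_fraction(patch_saturation, threshold) >= min_tissue
```

Saturation is a common choice because slide background is close to white and therefore nearly unsaturated, while stained tissue is not.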
To reduce inter-dataset color variability, Macenko stain normalization is applied to all patches. During training, data augmentation is used to improve generalization and includes: • Random rotation (0°–360°) • Horizontal and vertical flipping • Random scaling (0.9–1.1) • Brightness and contrast variation (±10%)
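The augmentation ranges listed above can be expressed as a parameter sampler. The sketch below is illustrative only: the 0.5 flip probability and the reading of "±10%" as a multiplicative factor in [0.9, 1.1] are assumptions, and the class and function names are hypothetical. The geometric transforms themselves would be applied by an image library.

```python
import random
from dataclasses import dataclass


@dataclass
class AugmentationParams:
    rotation_deg: float
    flip_horizontal: bool
    flip_vertical: bool
    scale: float
    brightness: float
    contrast: float


def sample_augmentation(rng: random.Random) -> AugmentationParams:
    """Draw one set of augmentation parameters within the stated ranges."""
    return AugmentationParams(
        rotation_deg=rng.uniform(0.0, 360.0),   # random rotation, 0 to 360 degrees
        flip_horizontal=rng.random() < 0.5,     # assumed 50% flip probability
        flip_vertical=rng.random() < 0.5,       # assumed 50% flip probability
        scale=rng.uniform(0.9, 1.1),            # random scaling, 0.9 to 1.1
        brightness=rng.uniform(0.9, 1.1),       # assumed multiplicative +/-10%
        contrast=rng.uniform(0.9, 1.1),         # assumed multiplicative +/-10%
    )


def adjust_intensity(value: float, brightness: float, contrast: float) -> float:
    """Apply contrast around mid-gray, then brightness, clamped to [0, 255]."""
    adjusted = ((value - 127.5) * contrast + 127.5) * brightness
    return max(0.0, min(255.0, adjusted))
```

Passing an explicit `random.Random` instance keeps the sampled augmentations reproducible across training runs.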
In addition to Macenko and Vahadane normalization, other approaches such as Reinhard color transfer 23 and GAN-based normalization 38 have been proposed. Several additional stain normalization and preprocessing approaches have been proposed to reduce inter-institutional variability and improve robustness in histopathology image analysis.39–43
3.3. Tumor segmentation using U-Net
Tumor region localization is performed using a fully convolutional U-Net architecture,12 as illustrated in Figure 3. This module serves as the first stage of the pipeline and ensures that all subsequent grading decisions are restricted to tumor tissue only. The segmentation process is formulated as a binary pixel-wise classification problem, separating tumor from non-tumor regions.
Figure 3. U-Net architecture used for tumor region segmentation in RCC histopathology images.
The U-Net–based segmentation module operates through the following stages:
1. Input Patch Formation: Each WSI is tiled into patches of size 224×224 pixels. These patches are used as inputs to the U-Net model for dense tumor prediction at pixel resolution.
2. Encoder Representation Learning: The encoder consists of multiple convolutional blocks; Equation (1) gives the content of each block.
This stage gradually decreases the spatial resolution while enlarging the receptive field and the level of semantic abstraction, so that the network can learn high-level tumor morphology and tissue context.
3. Bottleneck Feature Encoding: At the deepest level, the network captures contextual information on a global scale, comprising the overall tumor structure, cellular density, and the large-scale architectural patterns that separate tumorous tissue from normal tissue.
4. Decoder and Spatial Reconstruction: The decoder mirrors the encoder and restores the spatial resolution with transposed convolutions, as shown in Equation (2):
At each decoding level, the upsampled feature map is concatenated with the corresponding encoder feature map via skip connections as shown in Equation (3):
5. Boundary Refinement: In order to help separate the tumor from the background, the post-concatenation convolutional layers integrate low-level texture information with high-level semantic features. Besides, their hybrid function helps to filter out false positives.
6. Tumor Probability Estimation: The output combines a 1 × 1 convolution and a sigmoid activation, resulting in a pixel-wise tumor probability map as shown in Equation (4):
7. Binary Tumor Mask Generation: The probability map is thresholded to generate a binary tumor mask, as shown in Equation (5):
8. Loss Function and Optimization: The segmentation network is optimized using a composite loss as shown in Equation (6):
9. Tumor-Guided Patch Filtering: The tumor mask is mapped back onto the WSI to locate the tumor, and tumor-rich patches are selected accordingly. Feature extraction and grading are performed only on patches with significant tumor area overlap. This filtering reduces background noise and prevents non-tumor tissue from influencing the classification stage.
10. Clinical and Methodological Significance: The segmentation-first strategy has the potential to give various benefits, such as clarifying the spatial interpretability, removing the classification bias caused by the irrelevant tissue, increasing the robustness of the method across different datasets, and creating a clinically auditable connection between the predicted grades and the tumor morphology.
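Stages 6, 7, and 9 above (probability estimation, mask generation, and tumor-guided patch filtering) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the 0.5 probability threshold and the 0.5 minimum tumor fraction are placeholder values, since the text specifies neither, and all function names are hypothetical.

```python
import math
from typing import Dict, List

Grid = List[List[float]]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def probability_map(logits: Grid) -> Grid:
    """Pixel-wise tumor probability from the final-layer logits (stage 6)."""
    return [[sigmoid(v) for v in row] for row in logits]


def binary_mask(prob_map: Grid, threshold: float = 0.5) -> List[List[int]]:
    """Threshold the probability map; 0.5 is an assumed value (stage 7)."""
    return [[1 if p >= threshold else 0 for p in row] for row in prob_map]


def tumor_fraction(mask: List[List[int]]) -> float:
    """Share of pixels in a patch labeled as tumor."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


def select_tumor_patches(masks_by_patch: Dict[str, List[List[int]]],
                         min_tumor_fraction: float = 0.5) -> List[str]:
    """Keep patch ids whose tumor overlap meets the assumed minimum (stage 9)."""
    return [pid for pid, mask in masks_by_patch.items()
            if tumor_fraction(mask) >= min_tumor_fraction]
```

Both thresholds trade recall against purity: lowering them admits more tumor-border patches at the cost of more background tissue reaching the grading stage.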
By incorporating U-Net-based tumor segmentation as a mandatory preprocessing step, the framework enforces anatomical correctness, increases the accuracy of subsequent grading, and brings the automated pipeline closer to the pathologist's workflow commonly used in RCC diagnosis. Models such as 3D U-Net,13 UNet++,14 and nnU-Net15 have shown that the U-Net family is well suited to biomedical segmentation.
3.4. Patch-level feature extraction using CNN backbones
Patch-level feature extraction uses deep convolutional neural network (CNN) backbones, which transform each tumor patch into a concise and highly discriminative feature representation. The architecture of the CNN feature extraction component is shown in Figure 4. At this stage, the CNN learns texture, structural, and morphological patterns that are highly indicative of RCC grade.
Figure 4. CNN backbone architectures used for patch-level feature extraction, including DenseNet-121 as the primary model and EfficientNet-B4 and ConvNeXt-Tiny as modern alternatives.
DenseNet-121 features a dense connectivity pattern whereby each layer gets its inputs from all the layers that came before it. This layout goes a long way in enabling efficient feature reuse, making the gradients flow better, and cutting down the number of trainable parameters as compared to regular deep CNNs. These attributes render DenseNet most apt to be used in medical imaging tasks where, on the one hand, the datasets are small and, on the other hand, a high representational capacity is needed to uncover very subtle histopathological variations.
Besides DenseNet-121, two more CNN backbones were evaluated to compare the proposed framework with recent state-of-the-art feature extractors: • EfficientNet-B4, which scales network depth, width, and resolution in a compound manner to improve performance while keeping the computational cost low. • ConvNeXt-Tiny, a modern convolutional architecture that draws inspiration from transformer design principles, offering improved representational power and strong performance on vision benchmarks.
To adapt to histopathological textures and cellular structures, all CNN backbones are initialized with ImageNet-pretrained weights and fine-tuned on renal cell carcinoma (RCC) tumor patches. As shown in Equation (7), the output feature maps of the last convolutional layer for each tumor patch are converted into a fixed-length vector by global average pooling.
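Global average pooling has a standard closed form. Writing the last feature map of patch $i$ as $F_i \in \mathbb{R}^{C \times H \times W}$ (the symbol names here are illustrative assumptions and may differ from the paper's notation), each channel is averaged over its spatial positions:

```latex
f_i[c] \;=\; \frac{1}{H\,W} \sum_{h=1}^{H} \sum_{w=1}^{W} F_i[c, h, w],
\qquad c = 1, \dots, C
```

For a standard DenseNet-121 with a 224×224 input, the final feature map has $C = 1024$ channels on a 7×7 grid, so each patch yields a 1024-dimensional vector.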
Combining DenseNet-121 with state-of-the-art CNN backbones such as EfficientNet-B4 and ConvNeXt-Tiny, the suggested framework provides a means to methodically compare classical and modern architectures for their strengths in performance and computational efficiency in RCC grading. Various convolutional neural network architectures have been successfully applied for feature extraction in histopathology image analysis.44–47 EfficientNet-B4 uses compound model scaling to obtain better accuracy–efficiency trade-offs, 3 whereas ConvNeXt updates the convolutional design by utilizing transformer-inspired architectural features. 5 EfficientNetV2 also increases training speed and parameter efficiency while still delivering high accuracy. 4
3.5. Nuclei instance segmentation
In this work, nuclei instance segmentation is performed using HoVer-Net due to its robustness in separating overlapping nuclei in histopathology images. The model is used with pretrained weights on the PanNuke dataset and fine-tuned on RCC patches.
Segmentation outputs are post-processed using the following steps: • Minimum nucleus area threshold: 50 pixels (to remove debris) • Maximum area threshold: 2000 pixels (to remove merged regions) • Watershed refinement for boundary separation
Only nuclei satisfying these criteria are retained for feature extraction.
Nuclei segmentation is performed on each tumor patch with state-of-the-art instance segmentation models such as HoVer-Net16 or StarDist.18 These networks are specifically designed for histopathology images and can detect, separate, and delineate individual nuclei even in densely packed tumor areas. Their instance-aware design allows precise extraction of nucleus contours, which is essential for reliable morphological measurement.
Following segmentation, each nucleus is treated as an individual object, and a set of morphological and intensity-based descriptors is computed: • Area: measures nuclear size, which increases with tumor grade • Perimeter: captures boundary complexity • Eccentricity: reflects elongation and deviation from circular shape • Compactness: quantifies nuclear shape regularity • Irregularity index: indicates pleomorphism and boundary distortion • Mean nuclear intensity: relates to chromatin density • Intensity variance: measures intra-nuclear heterogeneity • Proxy measure of nucleolar prominence: approximates nucleolar visibility using localized intensity peaks
These quantitative descriptors can be interpreted as a computer-based representation of ISUP grading criteria, especially matching the criteria of nuclear enlargement, pleomorphism, and nucleolar prominence. Moreover, through the integration of nuclei-level analysis into the grading pipeline, the proposed method will make predictions based on biologically interpretable cellular features instead of simply deep abstract representations.
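The area thresholds and several of the descriptors listed above can be computed directly from a nucleus contour and its pixel intensities. The sketch below is a minimal illustration, not the authors' code. It assumes that each nucleus is given as a polygon of (x, y) vertices, that compactness is defined as 4πA/P² (one common definition, equal to 1 for a circle), and that the area bounds are inclusive. Eccentricity, the irregularity index, and the nucleolar prominence proxy are omitted because the text does not define them precisely.

```python
import math
from statistics import fmean, pvariance
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]


def polygon_area(contour: Sequence[Point]) -> float:
    """Area enclosed by a closed polygon (shoelace formula)."""
    n = len(contour)
    s = 0.0
    for i in range(n):
        x1, y1 = contour[i]
        x2, y2 = contour[(i + 1) % n]
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0


def polygon_perimeter(contour: Sequence[Point]) -> float:
    """Sum of edge lengths, including the closing edge."""
    n = len(contour)
    return sum(math.dist(contour[i], contour[(i + 1) % n]) for i in range(n))


def nucleus_descriptors(contour: Sequence[Point],
                        intensities: List[float]) -> Dict[str, float]:
    """Size, shape, and intensity descriptors for one segmented nucleus."""
    area = polygon_area(contour)
    perimeter = polygon_perimeter(contour)
    return {
        "area": area,
        "perimeter": perimeter,
        "compactness": 4.0 * math.pi * area / perimeter ** 2,  # assumed definition
        "mean_intensity": fmean(intensities),
        "intensity_variance": pvariance(intensities),
    }


def passes_area_filter(area: float, min_area: float = 50.0,
                       max_area: float = 2000.0) -> bool:
    """Remove debris (< 50 px) and merged regions (> 2000 px)."""
    return min_area <= area <= max_area
```

Applying `passes_area_filter` before computing the remaining descriptors avoids spending time on objects that will be discarded anyway.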
Different nuclei segmentation methods such as Cellpose 19 and regression-based distance map techniques 20 have also been reported to work well for histopathology. The idea of star-convex polygon modeling was initially proposed by Schmidt et al., 17 which forms the foundation of the StarDist tool.
3.6. Nuclei feature aggregation and fusion
Following nuclei instance segmentation and feature extraction, the nuclear descriptors obtained from individual nuclei within a tumor patch are aggregated to form a patch-level representation. Since each patch may contain a variable number of nuclei, a fixed-length nuclear feature vector is generated using statistical pooling operations. Specifically, for each nuclear attribute (area, eccentricity, intensity variance, etc.), the mean, standard deviation, and maximum values are computed across all nuclei in the patch. This yields a compact and robust nuclear descriptor.
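The statistical pooling step can be sketched as below. This is a minimal illustration with hypothetical names. It assumes the population standard deviation and a zero vector for patches that contain no retained nuclei; the text specifies neither choice.

```python
from statistics import fmean, pstdev
from typing import Dict, List, Sequence


def pool_nuclear_features(nuclei: Sequence[Dict[str, float]],
                          attributes: Sequence[str]) -> List[float]:
    """Return [mean, std, max] per attribute, concatenated in a fixed order."""
    pooled: List[float] = []
    for name in attributes:
        values = [nucleus[name] for nucleus in nuclei]
        if not values:
            # Assumption: patches without nuclei get a zero descriptor.
            pooled.extend([0.0, 0.0, 0.0])
            continue
        pooled.extend([fmean(values), pstdev(values), max(values)])
    return pooled
```

With A attributes the descriptor always has 3·A entries, regardless of how many nuclei the patch contains, which is what allows it to be concatenated with the CNN embedding.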
To combine cellular-level morphological data with deep texture-level representations, the nuclear descriptor and the CNN embedding are concatenated to create a fused feature vector, as shown in Equation (8).
The fused features are subsequently fed to a multilayer perceptron (MLP) that serves two purposes. The first is to map the concatenated features to a common latent space, which enables the CNN-based and nuclei-based features to interact. The second is to apply a nonlinear feature transformation that increases discriminative power and reduces the influence of uninformative or noisy components. Formally, the MLP operation is represented by Equation (9).
The MLP used for fusion consists of two fully connected layers (1024 → 512 → 256) with ReLU activation and dropout (0.3). This transformation ensures effective interaction between CNN and nuclei features while reducing dimensionality. The patch fusion approach yields a biologically grounded patch embedding that represents:
• Texture-level cues, such as tissue architecture and cell density, that CNNs can learn, and
• Cell-level morphological features, such as nuclear size, pleomorphism, and nucleolar prominence, that are captured by nuclei descriptors.
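All extracted nuclei features are normalized using z-score normalization, with statistics computed across the training dataset, to ensure consistent scaling: $z = (x - \mu) / \sigma$, where $\mu$ and $\sigma$ denote the mean and standard deviation of the feature over the training set.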
By deliberately incorporating nuclear morphology into the feature space, this method aligns the computational representation with the core concepts of pathological grading, which improves both grading precision and interpretability.
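The pooling, normalization, and concatenation steps of this section can be illustrated with a minimal, framework-free sketch. The function names (`pool_nuclei`, `zscore`, `fuse`), the dictionary layout of a nucleus, the use of the population standard deviation, the zero vector for patches without nuclei, and the choice to normalize the pooled descriptor are all assumptions made for illustration; they are not taken from the authors' implementation.

```python
import math

def pool_nuclei(nuclei, attributes):
    """Pool per-nucleus attributes into [mean, std, max] per attribute."""
    pooled = []
    for name in attributes:
        values = [nucleus[name] for nucleus in nuclei]
        if not values:
            # Assumption: a patch without detected nuclei maps to zeros.
            pooled.extend([0.0, 0.0, 0.0])
            continue
        mean = sum(values) / len(values)
        # Assumption: population standard deviation (divide by n).
        std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
        pooled.extend([mean, std, max(values)])
    return pooled

def zscore(vector, train_mean, train_std, eps=1e-8):
    """Standardize a descriptor with statistics estimated on the training set."""
    return [(x - m) / (s + eps) for x, m, s in zip(vector, train_mean, train_std)]

def fuse(cnn_embedding, nuclear_descriptor):
    """Concatenate the CNN embedding and the nuclear descriptor."""
    return list(cnn_embedding) + list(nuclear_descriptor)

# Toy patch with two nuclei and two attributes (hypothetical values).
toy_nuclei = [{"area": 40.0, "eccentricity": 0.5},
              {"area": 60.0, "eccentricity": 0.7}]
toy_descriptor = pool_nuclei(toy_nuclei, ["area", "eccentricity"])
toy_fused = fuse([0.0, 0.0, 0.0, 0.0], toy_descriptor)
```

With two attributes the pooled descriptor has six entries, so its length does not depend on how many nuclei a patch contains.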
3.7. Attention-based multiple instance learning (AB-MIL)
Multiple instance learning was originally introduced for medical image analysis by Xu et al.,14 and its formal characteristics are summarized in the comprehensive survey by Carbonneau et al.15
Several attention-based MIL and aggregation strategies have been proposed for whole-slide image analysis.48–50 Instead of using majority voting for slide-level decision making, the proposed framework employs an Attention-Based Multiple Instance Learning (AB-MIL) strategy for robust and interpretable aggregation of patch-level features. In this setting, each whole-slide image (WSI) is modeled as a bag of patch-level fused representations obtained after CNN and nuclei feature fusion, as represented in Equation (10).
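The core idea of AB-MIL is to learn a weighting function that assigns higher importance to diagnostically informative patches and lower importance to irrelevant or noisy patches. This is achieved through a trainable attention mechanism. For each patch representation $h_k$ in a slide with $K$ patches, an unnormalized attention score is computed and passed through a softmax over all patches of the slide. The softmax normalization ensures that the attention weights satisfy $a_k \ge 0$ and $\sum_{k=1}^{K} a_k = 1$.
The attention weight $a_k$ therefore expresses the relative contribution of patch $k$ to the slide-level representation.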
Using these attention scores, the slide-level representation is computed as a weighted sum of patch features as shown in Equation (12).
To ensure reproducibility, the attention module is implemented using a two-layer fully connected network. The patch embedding dimension is set to d=1024 (DenseNet-121 output). The attention mechanism projects each patch feature into a latent space of dimension L=256.
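Specifically, the attention computation follows:
• First layer: fully connected layer with weight matrix $V \in \mathbb{R}^{L \times d}$
• Activation: element-wise nonlinearity (tanh in the standard AB-MIL formulation)
• Second layer: attention vector $w \in \mathbb{R}^{L}$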
The attention score is computed as shown in Equation (13).
Dropout with rate 0.25 is applied after the first projection layer to prevent overfitting. The aggregated slide representation is then passed through a fully connected classifier with 512 hidden units and ReLU activation, followed by a softmax output layer.
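The attention pooling described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' code: the function name `attention_pool` and the symbols `V` and `w` are chosen here, tanh is assumed as the activation (as in the standard AB-MIL formulation), dropout and the downstream classifier are omitted, and toy dimensions replace d = 1024 and L = 256.

```python
import math

def attention_pool(bag, V, w):
    """Attention pooling over a bag of patch vectors.

    bag: list of K patch vectors, each of length d
    V:   L x d projection matrix (list of L rows)
    w:   attention vector of length L
    Returns (attention weights, slide-level representation).
    """
    scores = []
    for h in bag:
        # First layer followed by tanh (assumed activation).
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        # Second layer: scalar attention score for this patch.
        scores.append(sum(a * z for a, z in zip(w, hidden)))
    # Softmax over patches; subtracting the maximum keeps exp() stable.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Slide representation: attention-weighted sum of patch vectors.
    dim = len(bag[0])
    slide = [sum(a * h[i] for a, h in zip(weights, bag)) for i in range(dim)]
    return weights, slide

# Toy bag with three patches of dimension 2 (hypothetical values).
toy_bag = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_V = [[1.0, 0.0], [0.0, 1.0]]
toy_w = [1.0, 1.0]
toy_weights, toy_slide = attention_pool(toy_bag, toy_V, toy_w)
```

In the toy bag the third patch receives the largest weight because both of its hidden activations are non-zero, which is the behaviour that later allows the weights to be rendered as a heatmap.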
AB-MIL has several key advantages over the majority voting rule:
1. Adaptive Patch Importance: Majority voting treats all patches as equally important, whereas AB-MIL learns to assign higher weights to tumor regions that are more significant for diagnosis.
2. Performance Increase: By concentrating on the most relevant patches, AB-MIL consistently improves grading accuracy and generally leads to a 3-5% increase in the Macro-F1 and QWK scores.
3. Interpretability: The learned attention weights can be converted into attention heatmaps on WSIs, which show the regions that affected the grading decision and can thereby increase clinical trust.
4. Robustness to Noise: AB-MIL reduces the impact of wrong, ambiguous, or low-quality patches, which are typical in large-scale histopathology datasets.
5. Clinical Relevance: The attention mechanism is analogous to the way pathologists focus on the most suspicious and representative regions of the slide when deciding the tumor grade.
By replacing majority voting with AB-MIL, the proposed framework adopts an aggregation technique that is the de facto standard for WSI-level classification in computational pathology. This change improves predictive performance and interpretability while maintaining computational efficiency and clinical deployability.
3.8. Dual-stream multiple instance learning (DSMIL)
In addition to AB-MIL, we extended our investigation to DSMIL,8 a more powerful aggregation method that could potentially yield even better slide-level RCC grading performance. DSMIL is designed for situations in which the diagnostically relevant features are sparse, localized, or heterogeneous within a tumor tissue slide, as is typical in RCC histopathology.
DSMIL has two learning streams that complement each other:
1. Instance-Level Stream: The instance-level stream identifies the most discriminative patches in a WSI. Each patch representation is forwarded to an instance classifier that predicts how likely the patch is to belong to a given RCC grade. This stream explicitly models local evidence and highlights patches that strongly indicate a specific grade, such as regions exhibiting large pleomorphic nuclei or prominent nucleoli. Formally, instance-level predictions can be represented as Equation (14). This stream encourages the network to learn highly discriminative patch-level representations and serves as a detector for grade-defining regions.
2. Bag-Level (Slide-Level) Stream: The bag-level stream aggregates information from all patches to model the global context of the WSI. Similar to AB-MIL, it computes attention weights that indicate the importance of each patch in forming the final slide-level representation. The weighted combination of patch features is then used to produce the slide-level prediction. This stream captures overall tumor architecture, distribution of grade-related patterns, and global consistency of the grading decision.
DSMIL’s main feature is that the two streams are trained simultaneously and interact with each other. The instance-level stream steers the model to concentrate on the most informative patches, whereas the bag-level stream makes sure that the global slide context remains intact. This dual supervision allows the network to integrate local discriminative evidence with a comprehensive slide-level understanding.
The final slide-level prediction fuses the outputs of the two streams, so DSMIL can exploit both strong local signals from highly atypical regions and stable global representations from the entire WSI.
DSMIL has several benefits for RCC grading:
• Increased Sensitivity to Sparse Lesions: DSMIL can locate very small areas showing high-grade features that might be lost in a standard attention-based aggregation.
• Increased Robustness: By using local and global information at the same time, DSMIL is less affected by noise, patch labeling errors, and tumor heterogeneity.
• Complementary to AB-MIL: AB-MIL adaptively weights patches, whereas DSMIL adds explicit patch-level discrimination, which makes it effective in the presence of high intra-slide variability.
• More Clinically Relevant: The model resembles the diagnostic procedure of pathologists, who usually spot a few critical high-grade regions but also consider the overall tumor pattern.
In our experiments, DSMIL is evaluated alongside AB-MIL as an alternative aggregation method. Comparing the two allows us to measure the benefit of dual-stream supervision and to assess the stability of the proposed framework under varying tumor heterogeneity.
For DSMIL, the instance-level classifier is implemented as a linear layer mapping d = 1024 features to class logits. The bag-level stream uses an attention pooling mechanism similar to AB-MIL but guided by the top-k scoring instances (k = 5). The final prediction is computed as a weighted combination of instance-level and bag-level outputs as shown in Equation (15).
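The final combination step can be sketched as follows. This covers only the top-k selection and the weighted fusion; the bag-level logits are taken as given. Several details are assumptions made for illustration because they are not stated in the text: patches are ranked by their largest class logit, the instance-level output is the average of the top-k logits, and the mixing weight `lam` is a hypothetical parameter. The function name `dsmil_combine` is likewise chosen here.

```python
def dsmil_combine(instance_logits, bag_logits, k=5, lam=0.5):
    """Fuse instance-level and bag-level logits for one slide.

    instance_logits: K x C list of per-patch class logits
    bag_logits:      length-C list from the bag-level stream
    Returns (indices of the top-k patches, fused logits).
    """
    if not instance_logits:
        raise ValueError("a slide needs at least one patch")
    k = min(k, len(instance_logits))
    # Assumption: rank patches by their strongest class logit.
    ranked = sorted(range(len(instance_logits)),
                    key=lambda i: max(instance_logits[i]),
                    reverse=True)
    top = ranked[:k]
    n_classes = len(bag_logits)
    # Assumption: instance-level output is the mean over the top-k patches.
    instance_out = [sum(instance_logits[i][c] for i in top) / k
                    for c in range(n_classes)]
    # Weighted combination of the two streams; lam is hypothetical.
    fused = [lam * a + (1.0 - lam) * b
             for a, b in zip(instance_out, bag_logits)]
    return top, fused

# Toy slide with three patches and two classes (hypothetical values).
toy_top, toy_fused_logits = dsmil_combine(
    [[0.1, 0.9], [2.0, 0.0], [0.2, 0.3]], [0.0, 1.0], k=2, lam=0.5)
```

Setting `lam` to 0 reduces the sketch to pure bag-level prediction, which makes the contribution of the instance stream easy to isolate in an ablation.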
3.9. Baseline and comparative model
3.10. Training strategy
Training parameters and optimization settings used for all models.
The training is performed in two stages: Stage 1: U-Net segmentation is trained independently using annotated tumor masks. Stage 2: The classification pipeline (CNN + nuclei + MIL) is trained using tumor-filtered patches while freezing early CNN layers for the first 20 epochs, followed by full fine-tuning.
3.11. Evaluation metrics
The performance of each model is measured by a comprehensive panel of evaluation metrics that consider both classification accuracy and clinical relevance. As RCC grading is an ordinal classification task, Quadratic Weighted Kappa (QWK) is chosen as the main evaluation metric. QWK quantifies the agreement between the predicted and actual grades while penalizing larger misclassification errors more heavily, which makes it well suited to ordinal grading tasks. The statistical characteristics of weighted Kappa are well established.57,58 The Dice coefficient,59 also known as the Sørensen–Dice index,60 is widely used for evaluating medical image segmentation quality. Segmentation performance was evaluated using standard metrics commonly used in medical image analysis.61
In addition to QWK, the following standard classification metrics are reported: • Accuracy: measures the overall proportion of correctly classified slides. • Precision: evaluates the reliability of grade predictions. • Recall: quantifies the ability to correctly identify slides belonging to each grade. • Macro-F1 score: represents the average F1-score across all grades, treating each class equally and mitigating the effect of class imbalance.
To evaluate the performance of the tumor segmentation module, the Dice coefficient is used, which measures the spatial overlap between the predicted tumor masks and the ground-truth annotations. All results are reported (i) with and without attention-based MIL, to quantify the benefit of replacing majority voting, (ii) with and without nuclei-level feature integration, to evaluate the contribution of cellular morphology modeling, and (iii) across all baseline and state-of-the-art models, to ensure a fair and comprehensive comparison.
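For reference, the Dice coefficient between a predicted mask $A$ and a ground-truth mask $B$ is $2|A \cap B| / (|A| + |B|)$. A minimal sketch for binary masks stored as nested lists is given below; the function name and the convention of returning 1.0 when both masks are empty are assumptions made for illustration.

```python
def dice_coefficient(pred_mask, true_mask):
    """Dice overlap between two binary masks given as nested lists."""
    pred = [bool(v) for row in pred_mask for v in row]
    true = [bool(v) for row in true_mask for v in row]
    if len(pred) != len(true):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p and t for p, t in zip(pred, true))
    size = sum(pred) + sum(true)
    # Assumption: two empty masks count as perfect agreement.
    return 1.0 if size == 0 else 2.0 * intersection / size

# Toy 2 x 2 masks sharing one of two foreground pixels each.
toy_dice = dice_coefficient([[1, 1], [0, 0]], [[1, 0], [1, 0]])
```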
This evaluation strategy jointly considers predictive performance, clinical reliability, and the contribution of each component of the proposed framework.
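The two headline metrics can be computed directly from slide-level labels. The sketch below assumes grades are encoded as integers 0 to C-1 and uses the usual quadratic weights $w_{ij} = (i-j)^2 / (C-1)^2$; the function names are chosen for illustration, and in practice an established library implementation would normally be preferred.

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """QWK from integer labels in the range 0..n_classes-1."""
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    total = len(y_true)
    row = [sum(r) for r in observed]
    col = [sum(observed[i][j] for i in range(n_classes))
           for j in range(n_classes)]
    num = 0.0
    den = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            num += weight * observed[i][j]
            # Expected count under independent marginals.
            den += weight * row[i] * col[j] / total
    if den == 0.0:
        raise ValueError("kappa is undefined when only one grade occurs")
    return 1.0 - num / den

def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in range(n_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n_classes

# Toy example: one adjacent-grade error versus one distant-grade error.
truth = [0, 1, 2, 3]
qwk_adjacent = quadratic_weighted_kappa(truth, [0, 2, 2, 3], 4)
qwk_distant = quadratic_weighted_kappa(truth, [3, 1, 2, 3], 4)
```

In this toy case both predictions contain exactly one error and therefore have the same accuracy, yet the adjacent-grade error is penalized far less than the distant one, which is why QWK suits ordinal grading.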
3.12. Statistical analysis
In order to determine whether the differences in performance between models were significant, McNemar’s test was used on paired slide-level predictions between the proposed method and the strongest baseline model. This test is employed to check if two classifiers have a significant difference when applied to the same samples.
Moreover, paired t-tests were performed between repeated runs to compare macro-F1 scores of different aggregation strategies. A significance level of p < 0.05 was used. These tests were used to check that the observed improvements are unlikely to be due to chance.
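McNemar's test only uses the discordant slides, that is, slides on which exactly one of the two classifiers is correct. The sketch below implements the exact (binomial) two-sided variant; the text does not state whether the exact or the chi-square version was used, so this choice, like the function name, is an assumption made for illustration.

```python
from math import comb

def mcnemar_exact(correct_a, correct_b):
    """Exact two-sided McNemar test on paired correctness indicators.

    correct_a, correct_b: sequences of booleans, one entry per slide.
    Returns (b, c, p_value), where b counts slides only A got right
    and c counts slides only B got right.
    """
    b = sum(bool(a) and not bool(o) for a, o in zip(correct_a, correct_b))
    c = sum(bool(o) and not bool(a) for a, o in zip(correct_a, correct_b))
    n = b + c
    if n == 0:
        return b, c, 1.0
    # Binomial tail with success probability 0.5, doubled for two sides.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2.0 * tail)

# Toy example: model A alone is right on 8 slides, model B alone on 2.
toy_a = [True] * 8 + [False] * 2
toy_b = [False] * 8 + [True] * 2
toy_b_count, toy_c_count, toy_p = mcnemar_exact(toy_a, toy_b)
```

Slides on which both models agree do not enter the statistic, so a large test set with few disagreements can still yield a non-significant result.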
3.13. Summary
In summary, the proposed method combines tumor-aware preprocessing, deep feature extraction, nuclei-level morphology modeling, and attention-based aggregation in a single RCC grading system. By using U-Net for tumor segmentation, CNN backbones for patch representation, nuclei instance segmentation for cell-wise interpretability, and AB-MIL/DSMIL for slide-level aggregation, the framework is pathologically grounded yet computationally efficient. The architecture is designed to deliver grading accuracy, interpretability, and generalization across datasets, which together make the solution clinically valuable for automating RCC grading.
4. Experimental setup
4.1. Dataset
4.1.1. Data source
We utilized three publicly available RCC histopathology datasets:
• TCGA 62: WSIs from KIRC, KIRP, and KICH cohorts with diagnostic metadata.
• RCdpia: A curated RCC dataset with expert-annotated tumor and non-tumor regions.
• MMIST-ccRCC: A multimodal dataset of 618 ccRCC patients, from which only the histopathology component with ISUP/Fuhrman grades (0–4) was used.
These datasets were selected to ensure reproducibility, cross-institutional variability, and clinical relevance.
4.1.2. Data availability statement
All datasets employed in the present work can be freely accessed from public sources. • The Cancer Genome Atlas (TCGA) is accessible via the GDC Data Portal • RCdpia Dataset is Available from • MMIST-ccRCC Dataset is available from
The code supporting preprocessing and model training can be obtained from the corresponding author upon a justified request.
4.2. Data preprocessing
Preprocessing was designed to (i) remove irrelevant background, (ii) standardize color variation across institutions, and (iii) increase model robustness to morphological and staining variability. All preprocessing steps were applied consistently across datasets to avoid introducing dataset-specific bias.
4.2.1. Tissue detection and background removal
Tissue detection parameters.
4.2.2. Patch extraction and tumor filtering
WSIs were divided into fixed-size tiles cropped from tissue regions. Patches were extracted at 20× magnification, which resolves nuclei clearly while keeping the computational cost manageable. A patch size of 224 × 224 pixels was chosen because it matches standard CNN input sizes. Non-overlapping tiling (0% overlap) was used to avoid near-duplicate tiles and to keep the processing effort low.
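The tiling grid itself is simple to state in code. The sketch below only enumerates the top-left corners of 224 × 224 tiles over a region of given size; reading pixels from the slide and discarding tiles outside the tissue or tumor mask are not shown. The function name `tile_origins` and the decision to drop partial tiles at the right and bottom borders are assumptions made for illustration.

```python
def tile_origins(width, height, patch=224, overlap=0.0):
    """Top-left corners of fixed-size tiles covering a width x height region."""
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    stride = max(1, int(patch * (1.0 - overlap)))
    # Assumption: partial tiles at the borders are dropped.
    xs = range(0, width - patch + 1, stride)
    ys = range(0, height - patch + 1, stride)
    return [(x, y) for y in ys for x in xs]

# Toy region of 500 x 450 pixels at the extraction magnification.
toy_tiles = tile_origins(500, 450)
```

With 0% overlap the stride equals the patch size, so every pixel belongs to at most one tile.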
Patch extraction configuration.
4.2.3. Stain normalization
Stain normalization settings.
4.2.4. Data augmentation
Data augmentation strategy.
These modifications increase the variety of the training samples without changing diagnostic features like nuclear morphology or tissue architecture. We have not used elastic or heavy distortions to prevent the creation of unrealistic histology patterns.
4.3. Data splitting protocol
To ensure unbiased assessment and avoid data leakage, which is one of the most common problems in patch-based whole-slide image (WSI) studies, a strict data splitting protocol was adopted. All splitting was performed at the patient (slide) level, which means that no patches from the same WSI appear in more than one subset. In this way, the model is tested on slides it has never seen before rather than on slide-specific patterns it could have memorized.
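A leakage-free split can be obtained by shuffling patient identifiers rather than patches or slides, as in the sketch below. The split fractions, the seed, the function name, and the slide-to-patient mapping are hypothetical values chosen for illustration; they are not the proportions used in this study.

```python
import random

def split_by_patient(slide_to_patient, fractions=(0.7, 0.15, 0.15), seed=0):
    """Assign whole patients to train/val/test so no patient spans subsets."""
    patients = sorted(set(slide_to_patient.values()))
    random.Random(seed).shuffle(patients)
    n_train = int(fractions[0] * len(patients))
    n_val = int(fractions[1] * len(patients))
    groups = {
        "train": set(patients[:n_train]),
        "val": set(patients[n_train:n_train + n_val]),
        "test": set(patients[n_train + n_val:]),
    }
    # Every slide follows its patient into exactly one subset.
    return {name: sorted(s for s, p in slide_to_patient.items() if p in members)
            for name, members in groups.items()}

# Toy cohort: 10 patients with 2 slides each (hypothetical identifiers).
toy_slides = {f"slide_{p}_{s}": f"patient_{p}"
              for p in range(10) for s in range(2)}
toy_splits = split_by_patient(toy_slides)
```

Patches inherit the subset of their parent slide, so patch-level leakage is excluded by construction.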
4.3.1. Within-dataset evaluation
Within-dataset splitting protocol.
We reserved the validation set for model selection and early stopping only, and the test set was not seen at all during training. Keeping them separate avoids any optimistically biased performance reports.
4.3.2. Cross-dataset evaluation
Cross-dataset evaluation protocol.
Cross-dataset evaluation offers a realistic assessment of the expected performance of a model in a clinical setting, that is, when the model is subjected to the slides from the labs and scanners that have not been seen before. The experiment also tests how well stain normalization and attention-based aggregation can help to reduce the impact of domain shift.
4.4. Implementation details
All models were built using the PyTorch deep learning framework (version ≥1.13) and trained on a Linux-based workstation. Preprocessing, augmentation, and evaluation steps were carried out with standard scientific Python libraries, such as NumPy, OpenCV, and scikit-learn. CUDA and cuDNN acceleration was enabled for fast GPU computation.
4.4.1. Hardware configuration
Hardware configuration.
4.4.2. Training configuration
Training configuration.
Detailed training hyperparameters.
4.4.3. Backbone networks
Backbone architectures evaluated.
4.4.4. Reproducibility measures
Several measures were taken to ensure the reproducibility of the experiments and the stability of the results. Fixed random seeds were applied to PyTorch, NumPy, and Python's random library to reduce the variation caused by stochastic initialization and data shuffling. All experiments were repeated in three separate runs with different seeds, and the resulting performance metrics are reported as mean ± standard deviation to show the variation. Furthermore, the same hyperparameters and training settings were used for all models so that any difference in performance can be attributed to the architectures rather than to tuning advantages. These precautions make the comparison fair and increase the credibility of the results.
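The seeding and reporting procedure can be sketched as follows. The helper names are chosen for illustration, the NumPy and PyTorch seeding calls are applied only if those libraries are installed, and the use of the sample standard deviation is an assumption because the text does not specify which estimator was used. The scores in the example are placeholders, not results from this study.

```python
import random
import statistics

def set_seed(seed):
    """Seed Python, and NumPy / PyTorch when they are available."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass

def summarize(scores):
    """Mean and sample standard deviation over repeated runs."""
    return statistics.mean(scores), statistics.stdev(scores)

# Placeholder macro-F1 values for three runs (not results from this study).
toy_mean, toy_std = summarize([0.90, 0.92, 0.94])
print(f"macro-F1: {toy_mean:.2f} ± {toy_std:.2f}")
```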
4.5. Baseline and comparative models
Summary of baseline and state-of-the-art models used for comparative evaluation in RCC grading.
5. Results
Here we provide a quantitative and qualitative assessment of tumor segmentation, slide-level grading performance, aggregation strategies, backbone comparison, ablation studies, cross-dataset generalization, and computational efficiency. Unless indicated otherwise, all the results are presented as mean values of three runs.
5.1. Tumor segmentation performance
Tumor segmentation performance.
Impact of tumor segmentation on grading performance.

Tumor segmentation performance across datasets measured using Dice score. Consistently high Dice values indicate robust tumor localization despite inter-institutional variability.
This experiment explicitly quantifies the contribution of tumor-guided patch selection. The results show a clear improvement of approximately 5% in Macro-F1 when segmentation is included, demonstrating that restricting analysis to tumor regions significantly enhances grading reliability.
5.2. Overall slide-level grading performance
Overall slide-level grading performance.

Multi-metric comparison of RCC grading models. Macro-F1, quadratic weighted kappa (QWK), and accuracy improve consistently with attention-based aggregation and modern CNN backbones, with the proposed framework achieving the highest overall performance.
Improvements in Table 17 are statistically significant (p < 0.05). Moving from majority voting to attention-based aggregation is an important factor in the grading performance improvement, as shown in Table 15. For instance, switching from DenseNet + Voting to DenseNet + AB-MIL alone resulted in a 3% rise in macro-F1 together with a substantial QWK increase, which shows that slide-level aggregation plays a crucial role in whole-slide image analysis.
Combining other CNN architectures with AB-MIL brings further improvements. EfficientNet-B4 and ConvNeXt-Tiny register similar increases, suggesting that stronger feature representations improve grading consistency. Nevertheless, the complete pipeline reached the best results, implying that nuclei-level descriptors and stain normalization capture discriminative features that a deep CNN alone may miss. The performance improvements depicted in Figure 6 are incremental but consistent across all model upgrades. The proposed method attains the best macro-F1 score while its convolutional architecture is less complex than the transformer-based models currently found in the literature.
The multi-metric trends illustrated in Figure 6 show steady progress in Macro-F1, QWK, and accuracy as the pipeline is refined. In particular, the gains in QWK closely track the changes in macro-F1, indicating that the model maintains ordinal relationships between grades rather than simply optimizing categorical accuracy. This is clinically significant because the major source of disagreement in grading is usually between adjacent grades. Hence, these results indicate that segmentation guidance, nuclei-aware features, and attention-based aggregation, properly combined, can produce RCC grading performance comparable to the state of the art without the need for computationally heavy transformer architectures.
Statistical testing showed that the improvement of the proposed framework over DenseNet + voting was significant (McNemar test, p < 0.05). Furthermore, attention-based aggregation significantly outperformed majority voting in paired comparisons (p < 0.05), which further corroborates the robustness of the observed gains.
5.3. Effect of aggregation strategy
Since diagnostically relevant areas are usually sparse and unevenly distributed, slide-level aggregation is essential for whole-slide image (WSI) grading. To assess how much the aggregation strategy affects the result, we compare classical majority voting against attention-based multiple instance learning (AB-MIL) and dual-stream MIL (DSMIL).
Aggregation strategy comparison.
In addition, DSMIL may deliver a further increase in performance when high-grade areas are confined to very small regions. Because the method combines instance-level and bag-level supervision, it can locate critical patches without losing the overall context of the image, which is close to a necessity in RCC grading. Since nucleolar prominence is one of the factors that determine the tumor grade, a small area with prominent nucleoli may be sufficient to decide the grade. RCC grading therefore follows a pattern of reasoning in which a few focal findings can outweigh the rest of the slide. Simple voting-based approaches do not mirror this reasoning and tend to dilute such focal diagnostic signals. In contrast, attention-based methods allow adaptive weighting, which is more consistent with pathological interpretation.
From the clinical perspective, aggregating features with an attention mechanism not only improves interpretability but also provides visually intuitive results. Attention maps highlight the areas that most influence the grading decision, and the pathologist ultimately verifies them. This transparency contributes to trust building and facilitates human-AI cooperation. In both cases, the largest increase in performance came from the change in aggregation strategy rather than from a more complex backbone. This is consistent with the growing consensus that MIL-based aggregation is close to indispensable for WSI classification tasks. The improvement from majority voting to AB-MIL and DSMIL is statistically significant (p < 0.05), confirming the effectiveness of attention-based aggregation for WSI classification.
5.4. Backbone comparison
To assess the influence of feature extraction capacity on RCC grading, we compared three representative convolutional backbones: DenseNet-121, EfficientNet-B4, and ConvNeXt-Tiny. All backbones were trained under identical conditions and paired with the same segmentation guidance and attention-based aggregation to ensure fair comparison.
Backbone comparison.
Nevertheless, the differences in performance between the backbones were small compared with the improvements obtained through the aggregation strategy and tumor-oriented patch selection. Hence, slide-level reasoning and region selection appear to have a greater influence on grading performance than the choice of backbone. In practice, DenseNet-121 remains a good choice owing to its parameter efficiency and presumably lower computational cost. Using a larger backbone for a marginal performance gain is hard to justify when the additional memory and inference time matter, especially in resource-limited environments. These considerations underscore that the backbone should be chosen according to the deployment scenario rather than solely on the highest accuracy achieved. Efficient CNN backbones paired with strong aggregation methods can strike a good balance between performance and computational requirements.
5.5. Per-class performance
Per-class F1 and QWK.

Confusion matrix for slide-level RCC grading on the TCGA test set. Most misclassifications occur between Grades 2 and 3, reflecting known clinical ambiguity in intermediate-grade tumors.
Per-class improvements are statistically significant (p < 0.05). High-grade tumors (Grades 3 and 4) were recognized with strong confidence, since marked nuclear pleomorphism and nucleolar prominence serve as clear morphological indicators. Likewise, Grade 1 cases were accurately classified because their nuclei are relatively uniform.
Misclassification distribution.
The bulk of classification errors occurred between Grades 2 and 3, as demonstrated in Figure 7. This pattern corresponds to the clinical difficulties observed in RCC grading, since intermediate grades generally share similar morphological features. Moreover, subtle variations in nuclear size and nucleolar visibility may cause even expert pathologists to disagree.
To further quantify this behavior, approximately 70–75% of misclassifications occurred between adjacent grades (primarily Grades 2 and 3), while less than 10% involved distant grade errors (e.g., Grade 1 vs Grade 4). This distribution indicates that the model preserves ordinal relationships between grades and aligns with known clinical variability in RCC grading, where intermediate grades present overlapping morphological characteristics. The observed misclassification patterns, particularly between Grades 2 and 3, are consistent with previously reported inter-observer variability in RCC grading. Prior studies have shown that agreement between pathologists is lower for intermediate grades due to overlapping morphological characteristics. This alignment suggests that the model’s errors reflect intrinsic diagnostic ambiguity rather than systematic bias, supporting its clinical reliability.
The strong QWK obtained in this study is further evidence of ordinal consistency. Since QWK weighs larger grade differences more heavily, a high QWK value means that most misclassifications fall on neighboring grades rather than on distant categories. Taken as a whole, the per-class analysis indicates that the model's error patterns are determined by intrinsic grading difficulty rather than by model bias. Because this agrees with clinical reality, it increases confidence in the model's practical use. Similar patterns of adjacent-grade confusion have been reported in recent RCC grading studies, including transformer-based and nuclei-centric models, where Grade 2–3 misclassification remains the dominant error source. This suggests that the observed errors are not specific to the proposed framework but reflect intrinsic ambiguity in intermediate-grade RCC morphology.
5.6. Ablation study
Detailed ablation with performance drop.
All ablation results are statistically validated (p < 0.05). The largest performance degradation is observed when tumor segmentation is removed (↓5.3% Macro-F1), confirming the critical role of tumor-guided patch selection in excluding irrelevant tissue and directing the model toward diagnostically meaningful areas. Attention-based aggregation and stain normalization also contribute significantly, each providing over 4% improvement. Nuclei-level features provide complementary gains (∼3%): excluding them lowers performance only slightly, which suggests that the explicit nuclear descriptors carry information complementary to the CNN features. This is in line with the biological basis of RCC grading, in which nuclear morphology is the main criterion. Replacing attention-based MIL with majority voting also led to a visible drop in performance, which underscores the value of adaptive slide-level aggregation for handling heterogeneous tumor regions.
Finally, removing stain normalization decreased cross-dataset stability and slightly reduced grading accuracy. This suggests that color standardization contributes to domain robustness, especially when training and testing data come from different institutions. Overall, the ablation results show that the performance improvements stem from the combined effect of segmentation guidance, nuclei-aware features, and attention-based aggregation rather than from any single component, which supports the design of the framework and the relevance of each module.
The ablation results provide insight into the complementary roles of different components. While transformer-based MIL methods attempt to model global context through computationally expensive self-attention, the proposed framework achieves similar contextual reasoning through a combination of tumor-guided patch selection and attention-based aggregation. Nuclei descriptors enhance local discriminative power by explicitly encoding cellular morphology, which is central to RCC grading. When combined with AB-MIL, which selectively emphasizes diagnostically relevant patches, the model effectively captures both local cellular features and global contextual information without requiring complex graph-based or transformer architectures. This explains why the proposed approach achieves competitive performance while maintaining lower computational complexity.
5.7. Cross-dataset generalization
Cross-dataset grading performance.
As expected, performance dropped in the cross-dataset evaluation compared with the within-dataset one, which emphasizes the strong effect of domain shift in histopathology. Because color distribution, staining intensity, and slide preparation differ between institutions, they change the visual patterns and thus affect the learned representations. Still, the drop in performance was only moderate. Macro-F1 scores remained around 0.90, while QWK scores suggested that ordinal agreement remained essentially unchanged. This shows that the proposed method continues to grade consistently even when applied to data from entirely new institutions.
Stain normalization most likely helped to reduce domain variability by matching color distributions between the datasets. Attention-based aggregation may also contribute to the model's robustness, as it locates the regions that are most informative for diagnosis rather than relying on overall appearance. In addition, multi-source training improved generalization by exposing the model to a wider range of visual features during training. This agrees with the current view that generalizable pathology models require training on diverse datasets.
In brief, the cross-dataset experiments indicate that domain shift remains a problem in computational pathology, but that appropriate pre-processing and aggregation choices can partly mitigate its negative impact. These observations also underline why grading systems should be evaluated beyond a single dataset, in settings closer to real clinical deployment.
5.8. Interpretability analysis
Figure 8 shows Gradient-weighted Class Activation Mapping (Grad-CAM) visualizations used to analyze the interpretability of the proposed RCC grading framework. Grad-CAM highlights the image regions that most influence each grading decision. For correctly classified cases, the heatmaps concentrated on tumor regions with enlarged nuclei, densely packed cells, and prominent nucleoli, all of which are features used in diagnosis. These activation patterns match ISUP grading criteria, which suggests that the model relies on clinically recognized morphological features rather than on background artifacts.
Grad-CAM visualization and failure-case analysis for RCC grading. (a–b) Correct predictions where the model focuses on diagnostically relevant tumor regions characterized by nuclear morphology and cellular density. (c–d) Failure cases involving adjacent-grade misclassification (Grades 2 and 3), reflecting inherent ambiguity in intermediate RCC grading. The heatmaps confirm that the model consistently attends to biologically meaningful structures, supporting clinical interpretability.
To better understand the model, representative failure cases were also examined. Misclassifications occurred mostly between adjacent grades, and most often between Grades 2 and 3, where morphological differences are minimal and even expert pathologists can have difficulty telling the grades apart. The Grad-CAM maps show that in these borderline cases the model still concentrates on nuclear characteristics, which suggests that the errors stem from ambiguity in the data rather than from the model misreading the features.
Some failure cases are also associated with staining variability, overlapping nuclei, or limited tumor representation within patches. Despite these challenges, the model consistently attends to biologically relevant regions, which supports the reliability and interpretability of the proposed approach. Taken together, these observations indicate that the model’s decision-making aligns with pathological reasoning and that its errors are driven primarily by intrinsic grading ambiguity rather than by limitations in feature representation, supporting its potential for clinical deployment.
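The Grad-CAM heatmaps discussed in this section follow a simple computation: each feature-map channel is weighted by the spatial average of the gradient of the class score with respect to that channel, the weighted maps are summed, and a ReLU keeps only positive evidence. The sketch below reproduces that computation on small nested lists. The activations and gradients are made-up numbers; in practice they would be read from the last convolutional layer of the network, and the final rescaling to [0, 1] is a common visualization convention assumed here.

```python
def grad_cam(activations, gradients):
    """Compute a Grad-CAM map from one layer's activations and gradients.

    activations, gradients: lists of C feature maps, each an H x W nested list.
    Returns an H x W map rescaled to [0, 1] for display.
    """
    channels = len(activations)
    height = len(activations[0])
    width = len(activations[0][0])
    # Channel weight = spatial mean of the class-score gradient for that channel
    alphas = [sum(sum(row) for row in grad) / (height * width) for grad in gradients]
    # Weighted sum over channels, followed by ReLU
    cam = [
        [
            max(0.0, sum(alphas[k] * activations[k][i][j] for k in range(channels)))
            for j in range(width)
        ]
        for i in range(height)
    ]
    # Rescale to [0, 1] so the map can be overlaid on the patch as a heatmap
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam

# Toy example: channel 0 supports the predicted grade, channel 1 argues against it
activations = [[[1, 0], [0, 0]], [[0, 0], [0, 1]]]
gradients = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
heatmap = grad_cam(activations, gradients)
```

In the toy example only the top-left location is highlighted, because the bottom-right activation belongs to a channel with a negative weight and is removed by the ReLU.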
5.9. Comparison with recent state-of-the-art methods
Comparison with recent RCC grading methods.
A growing number of recent studies use transformer-based MIL or nuclei-centric frameworks to achieve high grading accuracy. For example, NuAP-RCC combines nuclei-level modeling with graph-based aggregation, while EAT-Net uses transformer-style attention mechanisms to capture global context. The performance of the proposed framework is close to that of these recent methods. Although some SOTA methods report slightly higher peak macro-F1, they tend to depend on computationally intensive models or complicated multi-stage pipelines.
By contrast, our approach combines segmentation supervision, convolutional feature extraction, nuclei-aware descriptors, and attention-based aggregation in a balanced architecture. This design keeps accuracy close to the state of the art while reducing computational load and architectural complexity. From a translational viewpoint, this trade-off matters: clinical deployment often favors systems that are robust, interpretable, and computationally feasible over ones that achieve slightly higher peak scores under controlled conditions. In addition, the proposed method was evaluated on three public datasets with cross-dataset testing, which is not the case for all SOTA studies. This broader assessment supports the claims of generalizability and practical relevance.
Overall, the findings show that lightweight convolutional architectures, when properly tuned, can remain competitive with more intricate transformer-based approaches. This calls into question the assumption that performance improvements in RCC grading necessarily require heavy architectures.
Direct numerical comparison must be made with caution because the datasets differ, but several points can still be made. First, many SOTA models, such as EAT-Net and NuAP-RCC, have been evaluated mainly on a single dataset (e.g., KMC or TCGA), whereas the proposed method shows robust results on three datasets under cross-dataset validation. Its performance under domain shift, with macro-F1 scores of approximately 0.90 (Table 20), indicates that it generalizes well, which has not been demonstrated for models tested in more limited settings. Second, transformer-based methods model global context at the price of increased computation: their quadratic self-attention operations typically demand much more GPU memory and inference time.
In contrast, the proposed framework reaches a similar macro-F1 (0.94) at significantly lower computational cost, as shown in Table 22. Third, nuclei-centric methods such as NuAP-RCC depend heavily on accurate nuclei segmentation and graph construction, which adds processing steps and more opportunities for error. Our method uses nuclei descriptors in a more straightforward way, which reduces pipeline complexity while preserving biological interpretability. Finally, the error analysis shows that most misclassifications occur between adjacent grades, in line with clinical variability. Model errors are therefore more likely attributable to data ambiguity than to the model architecture itself. On the whole, this assessment indicates that the proposed model strikes a good balance between accuracy, efficiency, and robustness, making it a more practical candidate for deployment than more complex SOTA methods.
Unlike many previous studies that evaluated performance on a single dataset such as KMC, the proposed approach was tested on multiple datasets with cross-dataset evaluation. This gives a more realistic assessment of generalization under domain shift, which is essential for clinical use.
5.10. Computational efficiency
Besides predictive performance, computational speed is a major factor in the clinical acceptance of AI systems in digital pathology. Whole-slide images are very large, and grading workflows must handle thousands of patches per slide. Inference time, memory usage, and architectural complexity are therefore important practical concerns.
Approximate inference time comparison.
These findings highlight a key trade-off in RCC grading models: added architectural complexity does not yield proportional performance gains. Transformer-based models are effective at modeling global context, but the proposed framework reaches a comparable level of accuracy through targeted design decisions, such as restricting feature extraction to tumor regions and using attention to aggregate patch-level results. This is a further argument for the proposed methodology in clinical settings, where computational resources and inference time are limited and interpretability is required.
Computational efficiency comparison of the proposed framework with representative transformer-based and nuclei-centric RCC grading models.
6. Discussion
The goal of this study was to revisit the classical segmentation-classification paradigm for grading renal cell carcinoma (RCC) and to strengthen it with stain normalization, nuclei-aware feature fusion, and attention-based aggregation. Whereas recent RCC grading research focuses mainly on transformer-based or graph-based nuclei-centric frameworks, our results indicate that a well-tuned convolutional pipeline can achieve performance in line with state-of-the-art methods while retaining the benefits of efficiency, interpretability, and practical deployability. The proposed framework achieves comparable macro-F1 (0.94) with significantly fewer parameters and lower computational cost than transformer-based and nuclei-centric models. Slide-level inference completes within reasonable time limits, which supports potential routine clinical use. Deep learning–based RCC grading systems have the potential to improve diagnostic consistency and support clinical decision making.63–66 These findings suggest that performance improvements in RCC grading can be achieved through principled architectural design rather than increased model complexity, and that efficient design can reduce the need for computationally expensive global attention mechanisms.
6.1. Key findings and interpretation
The most prominent finding concerns tumor-guided patch selection. In the ablation study, removing the segmentation step caused the largest performance drop, which confirms that restricting the analysis to tumor-rich regions reduces background noise and aligns the grading decision with the biologically relevant tissue. This is consistent with the pathological basis of ISUP grading, which depends on nuclear morphology in tumor areas. Slide-level aggregation was also very important. Attention-based MIL outperformed majority voting in all cases, implying that adaptive weighting of informative regions is essential for WSI analysis. In addition, dual-stream MIL further improved performance on slides harboring localized high-grade areas, reflecting the heterogeneity of RCC.
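Tumor-guided patch selection can be sketched as a filter over a binary tumor mask: the slide is tiled into patches, and only patches whose tumor fraction exceeds a threshold are passed on to feature extraction. The non-overlapping tiling, the 0.5 threshold, and the function name below are assumptions made for illustration and are not the settings reported in this study.

```python
def select_tumor_patches(mask, patch_size, min_tumor_fraction=0.5):
    """Return top-left (row, col) coordinates of patches that are tumor-rich.

    mask: 2D nested list of 0/1 values, where 1 marks tumor pixels
          (for example, a thresholded segmentation output).
    patch_size: side length of the square, non-overlapping patches.
    min_tumor_fraction: hypothetical threshold on the share of tumor pixels.
    """
    height, width = len(mask), len(mask[0])
    kept = []
    for r in range(0, height - patch_size + 1, patch_size):
        for c in range(0, width - patch_size + 1, patch_size):
            tumor_pixels = sum(
                mask[i][j]
                for i in range(r, r + patch_size)
                for j in range(c, c + patch_size)
            )
            if tumor_pixels / (patch_size * patch_size) >= min_tumor_fraction:
                kept.append((r, c))
    return kept

# Toy 4 x 4 mask with one tumor-rich quadrant and scattered tumor pixels elsewhere
toy_mask = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
]
selected = select_tumor_patches(toy_mask, patch_size=2)
```

Here only the top-left patch is kept; the two patches containing a single tumor pixel fall below the threshold and are discarded along with the background patch.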
Nuclei-level descriptors provided a further layer of discriminative information beyond CNN features alone. Because RCC grading criteria are based on nuclear size, pleomorphism, and nucleolar prominence, adding explicit nuclear morphology features enhances biological relevance. Notably, the choice of backbone had less influence than the aggregation strategy. Modern CNNs such as EfficientNet and ConvNeXt yielded only small gains over DenseNet, and these gains were clearly smaller than those obtained from segmentation and MIL. This implies that slide-level reasoning matters more than increasing backbone complexity.
6.2. Positioning relative to recent literature
Recent studies such as RAF2Net and NuAP-RCC report slightly higher peak performance using transformer or nuclei-graph frameworks. However, these systems often involve substantial computational cost and complex multi-stage pipelines. In contrast, the proposed method achieves performance within the same range while using a simpler convolutional design. Although it may not always reach the highest reported macro-F1, it offers a favorable balance between accuracy, efficiency, and interpretability. For clinical deployment, such balance can be more meaningful than marginal gains in benchmark scores. Importantly, this study includes cross-dataset validation across three public cohorts, which is not consistently reported in all SOTA studies. This broader evaluation provides stronger evidence of generalizability.
6.3. Comparative analysis with state-of-the-art
The proposed framework was compared with recent RCC grading techniques, including transformer-based and nuclei-centric models. Although these methods perform well, they usually depend on complicated architectures and higher computational cost. The proposed framework achieves a macro-F1 of 0.94 and a QWK of 0.92 with significantly lower computational complexity. This is made possible by tumor-guided patch selection, nuclei-aware feature representation, and attention-based aggregation. Unlike transformer-based models that rely on global self-attention with quadratic complexity, the proposed method uses efficient convolutional feature extraction and selective aggregation, which results in faster inference and reduced memory requirements.
Cross-dataset evaluation also shows that the proposed model maintains stable performance under domain shift, whereas many earlier studies report results only on single datasets. This strengthens the case for its robustness and practical applicability. In addition, the analysis of error patterns reveals that most misclassifications involve Grades 2 and 3, in agreement with recent RCC grading literature. The errors are therefore more plausibly due to intrinsic morphological ambiguity than to model limitations. Overall, the findings suggest that carefully designed convolutional pipelines can deliver performance on par with the latest models while offering advantages in efficiency, explainability, and clinical usability.
6.4. Clinical relevance
The interpretability analysis confirms that the model attends to nucleolar prominence and nuclear pleomorphism, the features used in ISUP grading. This agreement with pathological reasoning helps build trust and indicates how AI and pathologists could work together. The main confusion, between Grades 2 and 3, coincides with known inter-pathologist disagreement. Because the QWK scores are close to reported levels of human agreement, the model appears to behave in a clinically realistic manner rather than being artificially optimized. The short inference time per WSI also makes it feasible to use the model in routine pathology workflows.
The system is intended to support pathologists’ decisions rather than to replace the pathologist. By pointing pathologists to the tumor regions most relevant for diagnosis and supplying consistent grade suggestions, it can help reduce both grading variability and grading workload. Its modest computing demand also allows it to run within digital pathology workflows, which makes it particularly suitable for resource-limited settings.
6.5. Efficiency vs accuracy trade-off
The findings indicate that merely increasing architectural sophistication may not bring a proportional improvement in RCC grading performance. Transformer-based models can model global dependencies through self-attention, but the quadratic complexity of self-attention increases computation time and memory usage. In contrast, the method presented in this paper attains similar performance by combining three components: tumor-focused patch selection, nuclei-informed feature encoding, and attention-based aggregation. Tumor segmentation removes irrelevant content, nuclei descriptors capture biologically meaningful features, and AB-MIL selectively focuses on the regions of high diagnostic value. This combination enables efficient contextual reasoning without costlier global attention mechanisms, which is why comparable accuracy can be reached at much lower computational cost.
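The scaling argument can be made concrete with a short count of attention scores. For a slide with N patches, full self-attention compares every patch with every other patch (N × N scores per layer and head), whereas attention-based MIL pooling assigns one score per patch (N scores). The patch counts below are hypothetical round numbers in the "thousands of patches" range; the count ignores feature dimensions, the number of layers and heads, and efficient-attention variants, so it illustrates only the growth rate and is not a measured cost.

```python
def self_attention_pairs(num_patches):
    """Pairwise attention scores for full self-attention over all patches."""
    return num_patches * num_patches

def abmil_scores(num_patches):
    """Attention scores for AB-MIL pooling: one scalar per patch."""
    return num_patches

# Hypothetical slide sizes, in patches
for n in (1_000, 5_000, 10_000):
    ratio = self_attention_pairs(n) // abmil_scores(n)
    print(
        f"{n:>6,} patches: {self_attention_pairs(n):>12,} pairwise scores "
        f"vs {abmil_scores(n):>6,} AB-MIL scores ({ratio:,}x)"
    )
```

The ratio between the two counts equals the number of patches, so the gap widens as slides get larger or patches get smaller.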
6.6. Limitations
Although the proposed framework performs strongly, it has several limitations. First, the model depends on accurate tumor segmentation, so errors in the U-Net segmentation can propagate to the grading step and affect final performance. Second, despite cross-dataset testing, all datasets come from publicly available sources and may not capture the variation introduced by different scanners, staining protocols, and clinical institutions. Third, the framework has not been tested in clinical settings, where real-time workflow constraints and pathologist interaction could affect performance. Finally, although nuclei-level features help explain the model’s results, they rely on precise nuclei segmentation and may be affected by staining artefacts or overlapping nuclei. We intend to extend this research with multi-center validation, multi-scanner robustness testing, and prospective clinical evaluation.
6.7. Future directions
Future research could aim to improve the robustness and clinical applicability of RCC grading systems. One promising direction is self-supervised or foundation-model pretraining on large histopathology datasets to improve feature generalization across institutions. Multi-scale feature aggregation could also be explored to capture both cellular-level morphology and global tissue architecture. Incorporating clinical metadata, molecular markers, or genomic information may enable more comprehensive prognostic modeling beyond grade prediction alone. Prospective validation studies in real clinical workflows are required to assess usability and diagnostic impact in practice. Finally, human-AI collaborative grading systems could combine algorithmic consistency with expert pathological judgment, improving reliability and trust in AI-assisted diagnosis.
The findings show that RCC grading can be improved through careful integration of domain knowledge and efficient architectural design, not necessarily through increased model complexity. This matters for clinical deployment, where computational efficiency and interpretability are primary concerns.67–71
7. Conclusion
This work introduced a hybrid DenseNet-U-Net system for automatically grading renal cell carcinoma from whole-slide pathology images. By combining tumor-targeted segmentation, convolutional feature extraction, nuclei descriptors, stain normalization, and attention-based aggregation, the proposed system delivers grading results on par with the state of the art across several publicly available datasets.
The results highlight that carefully designed convolutional architectures remain very useful for computational pathology. Even though the research trend is moving towards transformer-based or nuclei-graph architectures, a simple and transparent pipeline such as ours can deliver similar results with less computation and can be applied more easily in practice. Notably, segmentation guidance and attention-based aggregation contribute more than backbone complexity alone. Moreover, the pattern of the model’s errors is consistent with the clinical challenges of grading, and the level of ordinal agreement is close to the inter-pathologist agreement documented in the literature. These findings support the use of such systems as decision-support tools rather than as replacements for expert judgment.
Overall, the paper emphasizes that progress in RCC grading depends not only on architectural complexity but also on combining biologically relevant features with reliable performance across datasets. We hope that this work will motivate the development of AI systems that are grounded in clinical knowledge, efficient, easy for humans to understand, and open to further modification. In short, the study shows that, with careful design and tuning, convolutional networks can remain a reliable option for clinical-grade RCC grading, with results on par with the most advanced transformer-based approaches.
Footnotes
Acknowledgments
The authors gratefully acknowledge the generous support of San Jose State University, USA, for providing funding assistance toward the research and publication of this work. The authors also extend their appreciation to Hope Diagnostics and Research Lab, Bhubaneswar, India, for their contribution in providing histopathological resources and
used in this study, which ensured the quality and accuracy of the visual material.
Ethical considerations
This study used only publicly available, de-identified histopathology datasets (TCGA-ccRCC, RCdpia, and MMIST-ccRCC). No patient-identifiable information was accessed. All datasets were originally collected under their respective institutional ethical approvals and were released for research use in anonymized form. The present study involved secondary analysis of anonymized data, with no direct participation of human subjects or animals, and therefore did not require additional institutional review board (IRB) approval or informed consent. Data usage complied with the licensing and governance policies specified by the dataset providers. The research complies with the principles of the Declaration of Helsinki and adheres to relevant institutional and journal ethical guidelines.
Author contributions
• Rohini Jadhav (RJ): Conceptualization, Methodology, Data Preprocessing, Writing – Original Draft.
• Banani Mohapatra (BM): Literature Review, Formal Analysis, Writing – Review & Editing.
• Bhavnish Walia (BW): Experimental Setup, Model Implementation, Software, Technical Validation.
• Sital Dash (SD): Conceptualization, Methodology, Supervision, Writing – Review & Editing, Corresponding Author.
• Kailas Patil (KP): Data Curation, Dataset Preparation, Visualization.
• Shrikant Jadhav (SJ): Algorithm Development, Ablation Study Design, Supervision, Manuscript Review, Corresponding Author.
• Ishwari Rohit Raskar (IRR): Statistical Analysis, Cross-Dataset Validation, Writing – Proofreading & Editing.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
All datasets used in this study are publicly available:
• TCGA-ccRCC: Available through The Cancer Genome Atlas (TCGA) data portal.
• RCdpia: Publicly accessible research dataset.
• MMIST-ccRCC: Publicly accessible research dataset.
The code supporting this study is available from the corresponding author upon reasonable request and will be publicly released upon acceptance.
