Abstract
Several deep learning approaches have been proposed to address the challenges in computational pathology by learning structural details in an unbiased way. Transfer learning makes it possible to start from the learned representation of a pretrained model, which can be used directly or fine-tuned for a new domain. However, in histopathology, the problem domain is tissue-specific and putting together a labelled data set is challenging. On the other hand, whole slide-level annotations, such as biomarker levels, are much easier to obtain. We compare two pretrained models, one histology-specific and one trained on ImageNet, on various computational pathology tasks. We show that a domain-specific model (HistoNet) contains richer information for biomarker classification, localization of biomarker-relevant morphology within a slide, and the prediction of expert-graded features. We use a weakly supervised approach to discriminate slides based on biomarker level and simultaneously predict which regions contribute to that prediction. We employ multitask learning to show that learned representations correlate with morphological features graded by expert pathologists. All of these results are demonstrated in the context of renal toxicity in a mechanistic study of compound toxicity in rat models. Our results emphasize the importance of histology-specific models and their knowledge representations for solving a wide range of computational pathology tasks.
Introduction
The interpretation of histopathology slides by pathologists remains a mainstay of nonclinical toxicology studies. Histopathology is currently undergoing a revolution with the digitalization of glass slides into whole slide images (WSIs) and the development of digital pathology as a potential support for examination at the microscope. For decades, image analysis has been used mostly in experimental pathology to measure tissue elements identified by a pathologist or revealed by molecular localization. It has, however, failed to reach the mainstream, possibly because of its high dependence on preanalytical parameters (fixation, staining, image acquisition, etc) and its low reproducibility between observers and studies. The development of deep learning (DL), especially convolutional neural networks (CNNs), opens new opportunities for the quantitation of histological objects and beyond, as it has shown great success at image segmentation and classification. 1 –7 Deep learning methods are based on artificial neural networks that were inspired by the architecture of human brain neural circuits. Modern DL architectures are able to automatically extract features relevant to the particular task they are trained to solve, directly from the training data set. However, they often contain millions of trainable parameters, which makes their internal logic difficult to grasp and means that they require large amounts of data.
Convolutional neural networks are a subclass of artificial neural networks that are specifically designed for computer vision tasks. The main characteristic of CNNs is their ability to detect patterns irrespective of their location in an image. Often, synthetic changes such as rotations, color space variations, and noise are added to the original images in order to make the CNN model more generally applicable to images outside of the original training data set. As the input image passes through the layers of a CNN, incrementally more complex patterns are captured to aid in the final classification. Applied to histology images, such detectable patterns can be cells and tissue structures, even if they appear in different locations and orientations within the image. Unlike traditional image analysis techniques, which often require manual feature engineering (ie, human translation of visual characteristics into a set of features and constraints for a computer program), CNNs are able to solve image classification tasks by automatically extracting simple features, such as the outline of an object or the border of a cell or a tissue region, as well as higher-level concepts, such as groups of cellular features within a particular region of a tissue. Generic low-to-medium level features are learned in the early layers of the neural network, whereas the more abstract, high-level concepts are learned in the deeper layers and typically form the visual representation of an image. Several CNN architectures have been proposed for classification, pixel segmentation, and image generation tasks. 2 –6,8 –11 Typically, the last set of convolutional layers in a discriminative CNN offers a condensed lower-dimensional representation of the input image, which captures the main features needed for the classification task.
This lower-dimensional vector representation is referred to, in different publications, as the learned representation, the projection on to the embedding space, the latent space vector, and also the embedding vector.
A key characteristic of CNNs is that the low-to-mid-level features learned from 1 domain (eg, one type of tissue) are often generic and can be easily transferred to another domain. This technique is known as transfer learning and is widely used in the computer vision community. Collecting large amounts of images and training DL models is not only time-consuming but also extremely computationally expensive. Transfer learning can circumvent this limitation by repurposing previously learned, problem-specific knowledge representations to solve tasks in an unrelated domain, while only retraining a small fraction of all parameters on a few corresponding examples in the new domain. Domain-specific representation learning and transfer learning are active research topics in the biomedical imaging field. 12 –15
Since the collection and annotation of digital tissue slides are time-consuming and cumbersome, one looks for other imaging domains, where the image data and corresponding annotations are abundant and thus one could try to use models trained on such data for transfer learning. One large publicly available data set is ImageNet, which consists of more than 14 million images labelled with over 21,000 classes that are often used for benchmarking visual object recognition and image classification. 16 Deep learning models that have been pretrained on ImageNet have been used as a starting point to solve a number of computer vision-related tasks. 2 –6,8 –11 However, the ImageNet data set does not contain any histopathology specific content. Recently, a novel DL model trained on 46 different tissue types of normal rat histology was reported, which could provide transferable knowledge for many image analysis tasks on histology images. 17
Applying DL methods in digital histopathology has some specific challenges. First, WSIs are much larger (in the gigapixel range) than images found in other domains. Second, slide annotation requires expert-level knowledge and can be very time-consuming. Moreover, tissues are heterogeneous collections of cell types and elementary structures of various size. This requires expert knowledge to identify them unequivocally using standard staining protocols.
Because of their size and the associated high computational costs, WSIs cannot be used directly as input for artificial neural networks. To overcome this limitation, WSIs are divided into smaller, manageable subregions in a process called tiling. Tiling can be performed at different magnifications and yields, at high resolutions, thousands of smaller images per WSI on average, called tiles (also referred to as patches).
Annotating the WSI at the level of single tiles is particularly time-consuming and challenging. The lack of annotations at the level of tiles prevents the training of fully supervised classifiers and makes WSIs perfect candidates for weakly supervised approaches. In this context, nonclinical toxicologic pathology represents a domain of opportunity, as lesions are graded and reported, if not always at the slide, at least at the organ level.
Several groups have implemented weakly supervised methods to address lesion localization and classification tasks in histology. These methods include recurrent neural networks, 18,19 attention mechanisms, 19,20 and multiple instance learning (MIL). 18 –23 Multiple instance learning is a weakly supervised approach that aims to classify labeled bags, each containing many instances, but with a label being available only for the bag. In the context of histopathology, the labeled bags refer to WSIs with an associated diagnosis (ie, label) determined by pathologists (ie, reported lesions), whereas the instances refer to individual tiles (regions of the slides actually corresponding to lesions or neighboring normal tissue). For the development of computational pathology models, MIL has the potential to circumvent the scarcity of expert annotations at the pixel level (ie, manual outlines by pathologists) by automatically identifying tiles that contain lesions, given only slide-level annotations as an input. In addition, MIL offers the possibility to extract and localize the tiles containing the lesions associated with the diagnosis from the slides. These have a high probability of association with the global slide label for the WSI and correspondence with the fields containing the lesions associated with the diagnosis. This could contribute to the development of lesion detection models and validation of exploratory biomarkers of tissue damage by pointing to lesions associated with the biomarker, hence enabling phenotypic anchoring.
In this study, we use 2 pretrained CNNs (ResNet-50): the first model was pretrained on the publicly available ImageNet data set, 1,16 and the second, named HistoNet, 17 was pretrained on histology slides from several normal rat tissues. These models are employed to extract visual representations from previously unseen histology images, and their transfer learning performance is compared for lesion classification and localization in the context of mechanistic rat toxicology studies with nephrotoxic compounds. 24,25
We show that learned representations from domain-specific models such as HistoNet could be used directly (ie, without retraining) as input for subsequent models aiming at distinguishing between tissue samples associated with biomarker levels (eg, urinary kidney injury molecule 1 [KIM-1] protein and Kim-1 messenger RNA [mRNA] expression in the kidney), solely on the basis of the average morphological features present in those images. We also show, using a simple logistic regression model, that a domain-specific representation space such as HistoNet performs significantly better than models derived from natural images (eg, ImageNet). Starting from a representation space as opposed to the traditional end-to-end learning models that train directly from the images themselves drastically cuts down on the training and exploration time for such models.
Interpretability is key to building confidence in models applied to computational pathology. We show the capability of our model to identify subregions within the WSI that contribute most to the classification (phenotype anchoring) without needing any additional manual annotation. We propose a weakly supervised DL approach to identify key regions of interest containing lesions. The results can take the shape of predicted heatmaps or virtual staining image overlays. We present the results from a multitask model that can predict graded tubular features from averaged representations per WSI.
Material and Methods
Tissues and Histopathology Data
Paraffin blocks, hematoxylin and eosin (H&E)-stained slides, and histopathology, urinary biomarker, and mRNA data from 5 previously reported, independently processed preclinical toxicology studies using nephrotoxicants (cisplatin, gentamycin, vancomycin, puromycin, doxorubicin) were retrieved from archives. 24,25
In situ Hybridization
Templates for Kim-1 (havcr1) riboprobe synthesis were generated by reverse transcription polymerase chain reaction (RT-PCR) from rat kidney mRNA using self-priming oligonucleotide primers flanked in 5′ by SP6- and T3-promoter recognition sequences (Supplementary Figure 2). The purified polymerase chain reaction (PCR) product was transcribed using T3-RNA polymerase (antisense) and SP6-RNA polymerase (sense) at 37 °C for 2 hours using dNTP containing digoxigenin (DIG)-UTP according to the manufacturer recommendations (Roche Diagnostics Schweiz AG). The quality and quantity of the riboprobe were evaluated using the 2100 BioAnalyzer (Agilent Technologies).
In situ hybridization (ISH) was performed using Ventana Discovery XT (Roche Diagnostics Schweiz AG) with Roche Diagnostics reagents. Briefly, formalin-fixed paraffin-embedded sections were deparaffinized and rehydrated using the EZprep solution. Pretreatment steps were done with the RiboMap kit following the manufacturer instructions. Cell conditioning (demasking) was performed by heat retrieval cycles in RiboCC solution followed by a complementary enzymatic digestion (protease 3 for 16 minutes at 37 °C). Hybridization was performed using 200 µL of RiboHybe solution containing a DIG-riboprobe concentration of 20 ng at 65 °C for 6 hours. After hybridization, sections were washed with saline-sodium citrate buffer. The DIG-label probe was detected using an alkaline phosphatase-conjugated sheep anti-DIG antibody (Roche Diagnostics Schweiz AG) at 1/500 for 32 minutes at 37 °C followed by chromogenic detection using BlueMap kit for 6 hours. Sections were counterstained using ISH nuclear fast red for 4 minutes, then mounted in glycerol-gelatin mounting medium (Sigma-Aldrich Chemie GmbH).
Domain-Specific Learned Representations: HistoNet Pretrained Model
HistoNet 17 is a set of deep neural networks characterizing the diversity of normal tissues that provide rich learned representations that can be extended to wider problems in computational pathology. The networks have been trained on 1,690 slides with rat tissue samples from 6 preclinical toxicology studies where tissue regions were outlined and annotated by pathologists into 46 different tissue classes. From these annotated regions, small tiles of 224 × 224 pixels at 6 different levels of magnification were sampled. For each magnification level, a separate network was then trained on the task of classifying the tissue type of each tile using 4 studies as training set and 2 studies as test set.
Data Set
A data set of 349 slides covering 49 different combinations of treatment, dose level, and time point (described in Supplementary Figure 1) was extracted from the original study slides scanned with a Hamamatsu (NanoZoomer 2.0 HT, scanning software NDP-Scan Version 2.5, Hamamatsu Photonics) at 40×.
The WSI set was split into a training and validation set (85% = 296 slides) and a test set (15% = 53 slides). The training and validation set was further split into training (80% = 236 slides) and validation (20% = 60 slides) subsets using a 5-fold cross-validation splitting strategy. To ensure that the training, validation, and test sets were well balanced and that no bias toward 1 specific group was introduced during the training of the MIL part (see below), the groups were stratified using the predicted classes, treatment, dose level, and time point features.
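The stratified fold assignment described above can be sketched as follows. This is a minimal illustration and not the splitting code used in the study: the function name `stratified_folds`, the round-robin strategy, and the toy strata are assumptions made for the example. It only shows how slides sharing the same combination of class, treatment, dose level, and time point can be spread evenly across 5 folds.

```python
import random
from collections import defaultdict


def stratified_folds(strata, n_folds=5, seed=0):
    """Assign each slide index to a fold, spreading every stratum across folds.

    strata: one hashable, sortable label per slide,
            e.g. (class, treatment, dose level, time point).
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for idx, label in enumerate(strata):
        by_stratum[label].append(idx)

    folds = [[] for _ in range(n_folds)]
    next_fold = 0
    for label in sorted(by_stratum):
        members = by_stratum[label]
        rng.shuffle(members)  # random order within a stratum
        for idx in members:
            # Round-robin assignment keeps fold sizes within 1 of each other.
            folds[next_fold].append(idx)
            next_fold = (next_fold + 1) % n_folds
    return folds


# Toy example: 296 slides with hypothetical (class, treatment) strata.
toy_strata = [(i % 2, "vehicle" if i % 3 == 0 else "cisplatin") for i in range(296)]
toy_folds = stratified_folds(toy_strata)
fold_sizes = [len(f) for f in toy_folds]
```

With 296 slides and 5 folds, one fold holds 60 slides and the others 59, so holding out the largest fold leaves 236 slides for training, in line with the counts quoted above.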
For each WSI, tiles containing tissue were separated from tiles containing background using Otsu’s method for automatic thresholding at low magnification (1.25×, 8.064 microns per pixel or mpp), which produced a binary black-and-white image. To prevent small artifacts in the field of view from being counted as tissue, the binary images were further processed. The parameters of these image processing operations, including the number of pixels, were determined empirically and quality controlled by randomly checking samples across the data set. Briefly, holes between white pixels of the binary image were filled, then the image was eroded by 80 pixels and finally dilated back by 80 pixels. The positions of pixels corresponding to tissue were recorded and mapped to the resolution level used for tile extraction (tiling process).
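For readers unfamiliar with Otsu’s method, the thresholding step can be illustrated with a short pure-Python sketch. It assumes 8-bit grey values and that tissue is darker than the slide background; the function name and the toy pixel values are hypothetical, and the hole filling, erosion, and dilation steps are not shown.

```python
def otsu_threshold(pixels, levels=256):
    """Return the grey level t that maximizes the between-class variance.

    Pixels with value <= t form one class, pixels with value > t the other.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    weight_low, sum_low = 0, 0
    for t in range(levels):
        weight_low += hist[t]
        sum_low += t * hist[t]
        if weight_low == 0:
            continue  # no pixels in the low class yet
        weight_high = total - weight_low
        if weight_high == 0:
            break  # every pixel is already in the low class
        mean_low = sum_low / weight_low
        mean_high = (sum_all - sum_low) / weight_high
        between_var = weight_low * weight_high * (mean_low - mean_high) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


# Toy image: dark tissue pixels (30, 40) and bright background (220, 230).
toy_pixels = [30] * 50 + [40] * 50 + [220] * 100 + [230] * 100
threshold = otsu_threshold(toy_pixels)
tissue_mask = [1 if p <= threshold else 0 for p in toy_pixels]
```

The threshold falls between the two intensity modes, so the 100 dark pixels are marked as tissue and the 200 bright pixels as background.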
Tiles of size 224 × 224 pixels were extracted at the 10× equivalent magnification (1.008 mpp) for both the WSI and the mapped binary image. Only tiles containing more than 20% of tissue were retained for further analysis.
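The tile filter can be sketched as below. The example relies on the fact that the mask resolution (8.064 mpp) is 8 times coarser than the tiling resolution (1.008 mpp), so that one 224 × 224 tile corresponds to a 28 × 28 block of mask pixels. The function name `select_tiles` and the list-of-lists mask are assumptions for illustration; the study maps the mask to the tiling level instead, which is equivalent when the scale factor is an exact integer.

```python
TILE_PX = 224      # tile edge in pixels at 10x (from the text)
MASK_MPP = 8.064   # resolution of the binary tissue mask (1.25x)
TILE_MPP = 1.008   # resolution used for tile extraction (10x)
MIN_TISSUE = 0.20  # tiles must contain more than 20% tissue


def select_tiles(mask):
    """Return (row, col) grid indices of tiles with more than MIN_TISSUE tissue.

    mask: 2D list of 0/1 values at mask resolution (1 = tissue).
    """
    scale = round(MASK_MPP / TILE_MPP)  # 8 tile-level pixels per mask pixel
    block = TILE_PX // scale            # 28 mask pixels per tile edge
    n_rows = len(mask) // block
    n_cols = len(mask[0]) // block
    kept = []
    for r in range(n_rows):
        for c in range(n_cols):
            tissue = sum(
                mask[y][x]
                for y in range(r * block, (r + 1) * block)
                for x in range(c * block, (c + 1) * block)
            )
            if tissue / (block * block) > MIN_TISSUE:
                kept.append((r, c))
    return kept
```

A tile is kept only when strictly more than 20% of its mask block is tissue, which for a 28 × 28 block means at least 157 of 784 mask pixels.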
To extract its corresponding embedding (morphological signature), each 224 × 224 pixel tile was passed forward (prediction mode) through a CNN. We used either a ResNet-50 pretrained on the ImageNet database 16 or the ResNet-50 presented in the HistoNet work. 17 The HistoNet ResNet-50 was pretrained on a set of histological studies containing only normal rat tissues (see the original work for more information 17 ). The output of the penultimate layer of the ResNet-50 (2048-dimensional) was extracted for both pretrained models (ResNet-50 pretrained on ImageNet and ResNet-50 pretrained with HistoNet). This 2048-dimensional vector corresponds to the embedding vector.
For creating heatmaps representing ISH signal intensity, ISH WSIs scanned at 40× were downsampled to 14.112 mpp. As the maximum absorbance of the BlueMap chromogen is at 620 nm, the red channel of the image was used as an approximation of a monochromatic image for the calculation of the BlueMap optical density. A gaussian mixture model was used to identify the average intensity of the background and unstained tissue. The image was then transformed into specific optical density (SOD) using the calculated intensity of unstained tissue, 26 then binned by (16 × 16) to reach a resolution comparable with the heatmaps generated by the MIL models (1 pixel = 1 tile of 224 × 224 pixels at 1.008 mpp).
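The resolution matching and the binning step can be illustrated with a small sketch. It assumes that optical density follows the usual Beer-Lambert form, -log10(I / I_ref), with the unstained tissue intensity as reference, and that binning means averaging non-overlapping blocks; the exact SOD definition of reference 26 and the Gaussian mixture step are not reproduced. All function names are hypothetical.

```python
import math

ISH_BIN_UM = 14.112 * 16  # edge of one binned ISH pixel, in microns
TILE_UM = 224 * 1.008     # edge of one tile at 10x, in microns


def optical_density(intensity, reference):
    """Optical density relative to a reference intensity (assumed form).

    Intensities are clamped to >= 1 to avoid log(0); negative values
    (pixels brighter than the reference) are clipped to 0.
    """
    return max(0.0, -math.log10(max(intensity, 1) / reference))


def bin_mean(image, factor=16):
    """Average non-overlapping factor x factor blocks of a 2D list."""
    rows, cols = len(image) // factor, len(image[0]) // factor
    return [
        [
            sum(
                image[r * factor + dy][c * factor + dx]
                for dy in range(factor)
                for dx in range(factor)
            ) / (factor * factor)
            for c in range(cols)
        ]
        for r in range(rows)
    ]
```

Both edge lengths evaluate to 225.792 microns, which is why a 16 × 16 bin of the downsampled ISH image covers the same area as one tile of the MIL heatmap.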
Logistic Regression for the Classification of Averaged Embeddings
All embedding vectors corresponding to all extracted tiles containing tissue were averaged for each of the 2048 features per WSI. This results in an averaged embedding vector of length 2048 per WSI. The training and test sets used were split as described in the Data Set section.
Average embedding vectors were normalized (z-score) and used as inputs to a dense layer consisting of 1 neuron with a sigmoid activation function. A binary cross-entropy loss function was used to optimize for logistic classification. The model was trained for 500 epochs, and the performance was measured on the test set.
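The pipeline of this section (tile averaging, z-scoring, and a single sigmoid neuron trained with binary cross-entropy) can be sketched in plain Python. This is a toy illustration under stated assumptions: the optimizer is plain full-batch gradient descent with a learning rate of 0.1, which the text does not specify, the embeddings have 3 features instead of 2048, and all function names are hypothetical.

```python
import math
import statistics


def average_embedding(tile_embeddings):
    """Average the tile embeddings of one slide, feature by feature."""
    n = len(tile_embeddings)
    return [sum(col) / n for col in zip(*tile_embeddings)]


def zscore_columns(rows):
    """Z-score each feature across slides; constant features map to 0."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) or 1.0 for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, xi):
    """Output of a single sigmoid neuron."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)


def train_logistic(x, y, epochs=500, lr=0.1):
    """Full-batch gradient descent on the binary cross-entropy loss."""
    w, b = [0.0] * len(x[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(x, y):
            err = predict(w, b, xi) - yi  # dLoss/dz for sigmoid + BCE
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(x) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(x)
    return w, b


# Toy data: 4 slides, 2 tiles each, 3 features per tile.
toy_slides = [
    [[0, 1, 5], [0, 3, 5]],
    [[1, 2, 5], [1, 2, 5]],
    [[4, 0, 5], [6, 0, 5]],
    [[6, 1, 5], [6, 1, 5]],
]
toy_labels = [0, 0, 1, 1]
features = zscore_columns([average_embedding(s) for s in toy_slides])
weights, bias = train_logistic(features, toy_labels)
probabilities = [predict(weights, bias, f) for f in features]
```

Because the classifier sees only one vector per slide, training takes seconds even on a CPU, which is the practical advantage of starting from a precomputed representation space.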
Multiple Instance Learning
Multiple instance learning is an approach that allows learning about tiles (eg, regions from a slide) from labels assigned only at the slide level. Technically, they are called weak labels, assigned to bags of instances (ie, slides) rather than to every instance (ie, tile). 20,27 –32 The present MIL models are trained over multiple examples of slides with different labels (eg, lesion, no lesion) to learn about differences between the tiles and also to combine them together into a slide-level prediction. Multiple instance learning needs a scoring function that assigns a probability of a certain label (eg, contains a lesion or not) to tiles from a WSI. Such a scoring function has to be agnostic to the order in which the model processes the tiles during training (ie, spatially permutation invariant). How to form such a function is well described for MIL. 27,28 Briefly, the scoring function is made of 3 sequential constituents, each acting on the output of the previous part. The first is a model (ImageNet or HistoNet) that generates a representation for a tile and maps it to a single numeric score. The second is a ranking or aggregation function that takes a set of these tile scores in any order and generates a vector of sorted score values. The third function is a classifier providing a global prediction for each WSI. The parameters for these 3 functions are learned and updated through the training process.
Here the approach was applied to identifying regions containing lesions given a binary classification at the slide level between normal and pathological WSIs. The image from each WSI was cropped into tiles (224 × 224 pixels) and representations (later referred to as embeddings) were generated using one of the pretrained models (HistoNet or ImageNet-based) as a feature extractor. These embeddings were then used as inputs to train MIL models to identify pathological slides based on biomarker values and localizing lesions. For this study, the input layer was fed with 2048-dimensional features’ embeddings from 1000 randomly selected tiles for each of the 10 digital tissue slides within a single minibatch (Table 1). The input layer was followed by a 1-dimensional (1D) convolutional layer (implemented as 2-dimensional convolution with 1 × 1 filter, for convenience), which summarized the 2048-dimensional embeddings into a single aggregated value per tile. A reshape layer was added for convenience to convert the 1D convolution values into the right tensor shape, and a sigmoid activation function was used to provide input into the following Lambda layer. This layer performed a maximum (max) ranking of the top n tiles (n = 10 in the example in Table 1), and only the top values were considered for forward and backward propagation.
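The tile scoring and max-ranking steps can be sketched as follows. A 1 × 1 convolution with a single filter is equivalent to a weighted sum of the embedding features plus a bias, applied to each tile independently; the ranking then keeps the n highest scores. The function names and the 2-feature toy embeddings are assumptions for illustration, and the sketch covers only the forward pass.

```python
import math


def tile_scores(embeddings, weights, bias):
    """Score each tile: weighted sum of its embedding plus bias, then sigmoid."""
    return [
        1.0 / (1.0 + math.exp(-(sum(w * e for w, e in zip(weights, emb)) + bias)))
        for emb in embeddings
    ]


def top_n(scores, n):
    """Max-ranking: the n highest scores in descending order."""
    return sorted(scores, reverse=True)[:n]


# Toy bag of 4 tiles with 2 features each.
toy_bag = [[0, 0], [2, 0], [0, 2], [1, 0]]
ranked = top_n(tile_scores(toy_bag, weights=[1, -1], bias=0), n=2)
ranked_shuffled = top_n(tile_scores(toy_bag[::-1], weights=[1, -1], bias=0), n=2)
```

Sorting makes the output independent of the order in which tiles are presented, which is the permutation invariance required of the scoring function.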
Multiple Instance Learning—Model Architecture.a,b
Abbreviation: MIL, multiple instance learning.
a Our proposed MIL model expects an input of 2048-dimensional embeddings from 1000 randomly selected tiles for each of 10 digital tissue slides in a single minibatch. For each image, the MIL model outputs the 2 estimated class probabilities, which are used to separate normal from pathological cases.
b In deep learning, learnable parameters are the parameters that are updated during backpropagation. In this case, the “Input,” “Reshape,” and “Max-Rank (Lambda)” layers do not contain learnable parameters; thus, their count is 0. For the Conv2D layer, the formula to calculate the number of parameters is: (width of the current layer filter × height of the current layer filter × number of filters in the previous layer + 1) × number of filters in the current layer. In our case, (1 × 1 × 2048 + 1) × 1 = 2049. For the dense layers, the formula to calculate the number of parameters is: number of neurons in the current layer × number of neurons in the previous layer + 1 × number of neurons in the current layer. In our case, for the 3 dense layers, the respective calculations are as follows: 200 × 10 + 1 × 200 = 2200; 100 × 200 + 1 × 100 = 20,100; 1 × 100 + 1 × 1 = 101. Assuming merely 1 convolutional filter and just 10 top-ranked tiles, our illustrated architecture exhibits 2049 + 2200 + 20,100 + 101 = 24,450 trainable parameters, allowing for fast learning.
The following 2 fully connected layers (dense) operated as a multilayer perceptron taking the ranked tiles as input, combining them in a nonlinear fashion using sigmoid activation functions, to output a 100-dimensional feature vector. Finally, the output layer consisted of 1 neuron aggregating the previously obtained features into a binary classification of the WSI (normal or pathological). The MIL model had 24,450 trainable parameters and was optimized using a cross-entropy loss function.
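The parameter count of this architecture can be checked with a few lines of arithmetic. The helper names are hypothetical; the layer sizes (2048 input features, 1 convolutional filter, 10 top-ranked tiles, dense layers of 200, 100, and 1 neurons) are those given in the text.

```python
def conv2d_params(kernel_w, kernel_h, in_channels, n_filters):
    """Weights plus one bias per filter."""
    return (kernel_w * kernel_h * in_channels + 1) * n_filters


def dense_params(n_in, n_out):
    """One weight per input-output pair plus one bias per output neuron."""
    return n_out * n_in + n_out


param_counts = {
    "conv_1x1": conv2d_params(1, 1, 2048, 1),  # 2049
    "dense_200": dense_params(10, 200),        # 2200
    "dense_100": dense_params(200, 100),       # 20100
    "output": dense_params(100, 1),            # 101
}
total_params = sum(param_counts.values())      # 24450
```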
After training, the MIL model was used to compute a score reflecting the probability that each tile of a WSI contains lesions, even if this WSI was not used for the training. This probability was derived from the aggregated values from the 1D convolution layer corresponding to each tile and was visualized as a heatmap that was overlaid on the original image to highlight the tiles containing lesions.
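Turning per-tile scores into a heatmap amounts to writing each score back to the grid position of its tile. The sketch below assumes that each retained tile is identified by its (row, column) index in the tiling grid; the function name and the use of `None` for tiles without tissue are illustrative choices, and the color mapping and overlay are left to a plotting library.

```python
def score_grid(tile_positions, scores, n_rows, n_cols, fill=None):
    """Place one score per tile into a 2D grid; other cells keep `fill`."""
    grid = [[fill] * n_cols for _ in range(n_rows)]
    for (row, col), score in zip(tile_positions, scores):
        grid[row][col] = score
    return grid


# Two scored tiles on a 2 x 3 tiling grid.
toy_heatmap = score_grid([(0, 0), (1, 2)], [0.9, 0.1], n_rows=2, n_cols=3)
```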
Multitask Learning
Our multitask model considers the prediction of each individual feature as an independent task. However, all tasks are learned from the same average embeddings, meaning that 1 task can benefit from the learning of another task. Hence, we define our multitask loss function as the average over all individual cross-entropy losses:
where $pw_i$ is the positive weight and $nw_i$ is the negative weight of class or task $i$, respectively. For each task $i$, we aim to minimize the error between the actual label $y_i$ and the predicted label $f(x)_i$, with $x$ being the embedding that is shared across all tasks.
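One form of the loss that is consistent with these definitions is given below. It is a sketch that assumes a class-weighted binary cross-entropy for each task, averaged over $T$ tasks; the symbol $T$ and the exact normalization are assumptions introduced here, not taken from the original.

```latex
\mathcal{L}(x, y) = -\frac{1}{T} \sum_{i=1}^{T}
  \Bigl[ pw_i \, y_i \log f(x)_i
       + nw_i \, (1 - y_i) \log\bigl(1 - f(x)_i\bigr) \Bigr]
```

Under this form, $pw_i$ scales the penalty for missing a positive label of task $i$ and $nw_i$ the penalty for missing a negative one, which is a common way to compensate for class imbalance within a task.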
Our multitask model can be seen as a nonlinear function f, which consists of 2 hidden layers with 400 and 200 neurons, respectively, using a rectified linear unit activation. The output layer uses SoftMax activation and the number of neurons matches the number of classes or tasks. We have trained our model with the Adam optimizer and the described multitask loss function for 100 epochs.
Results
Prediction of Lesion-Bearing Slides Using Logistic Regression Based on Representations From CNN Models
The data set used in this study was extracted from previous studies designed for the validation and qualification of biomarkers of toxic renal injury in rats. 24,25 Only a subset of the original studies reported histopathological lesions. Biomarkers were selected in order to focus on lesions possibly detected by KIM-1 (also known as havcr1), a specific and sensitive biomarker of tubular damage. 24,25,33 Whole slide images (296 training, 53 test) were labeled (normal = class 0, pathological = class 1) based on a cutoff threshold applied either to the urinary concentration of KIM-1 at the time of euthanasia or to Kim-1 mRNA CT values obtained by RT-PCR on kidney tissue (Supplementary Figure 1 and Figure 2, respectively). A 2048-element vector representation was generated for each tile from each WSI, using the ImageNet and HistoNet pretrained CNN models (Figure 1A). Tile-level vectors were averaged into an aggregated representation at the WSI level, separately for each model.
Visual inspection of these vector profiles for all the WSIs showed that, for the same data set, the averaged representations from ImageNet are much sparser than those from HistoNet (Figure 1B).

Preprocessing workflow and heatmaps of matrices of average embeddings generated using pretrained models with ImageNet or HistoNet weights. A, Whole slide images (WSIs) are individually selected and a grid is applied to extract the tiles, also referred to as patches, as described in the Methods section. Only tiles containing tissue are selected. Each tile is fed forward through a CNN pretrained on ImageNet (1) or HistoNet (2) to generate an embedding vector of size 2048. The final embedding matrix per WSI is of size m × 2048, where m corresponds to the total number of tiles in the WSI. The average embedding per WSI is generated by computing the average of all m tiles for each WSI. B, The average embedding matrices generated using the pretrained ImageNet and HistoNet models are visualized as heatmaps. Each row represents a different slide from the 349 available slides, each column a different feature of the 2048 embedding feature vector. Black indicates values equal to 0, whereas colors indicate columns containing values higher than 0. Brighter colors correspond to higher values. CNN indicates convolutional neural network; WSI, whole slide image.
To evaluate the ability of representations learned by CNNs to discriminate between 2 groups of slides, we used a logistic regression model trained against the urinary protein concentration and the mRNA tissue expression of KIM-1. Classifiers were trained independently for each biomarker (KIM-1 urinary concentration and Kim-1 mRNA). For each such classifier, the experiment was performed separately with the average pretrained ImageNet representations and again with the histology-specific average representations from HistoNet. The models were evaluated on the test data set, and the results are presented in the form of confusion matrices and areas under the receiver operating characteristic curve (ROC-AUC) in Figure 2. Both pretrained representations contained enough information for simple predictive models to be effective, but the model trained on HistoNet embeddings performed better than the one trained on ImageNet embeddings for both the urinary KIM-1 and the Kim-1 mRNA classification tasks.
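The two evaluation metrics used here can be computed as in the following sketch. ROC-AUC is written in its pairwise form, the fraction of (positive, negative) slide pairs in which the positive slide receives the higher score, with ties counted as one half. The function names, the 0.5 decision threshold for the confusion matrix, and the toy scores are assumptions for illustration and are not the study's data.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def confusion_matrix(labels, scores, threshold=0.5):
    """Return [[TN, FP], [FN, TP]] for binary labels (0 = normal, 1 = pathological)."""
    matrix = [[0, 0], [0, 0]]
    for label, score in zip(labels, scores):
        matrix[label][int(score >= threshold)] += 1
    return matrix


# Toy example with 2 normal and 2 pathological slides.
toy_y = [0, 0, 1, 1]
toy_p = [0.1, 0.6, 0.4, 0.9]
toy_auc = roc_auc(toy_y, toy_p)
toy_cm = confusion_matrix(toy_y, toy_p)
```

In the toy example, 3 of the 4 positive-negative pairs are ordered correctly, giving a ROC-AUC of 0.75. The confusion matrix depends on the chosen threshold, whereas the ROC-AUC does not.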

Confusion matrices and receiver operating characteristic (ROC) results from the prediction of the logistic regression of averaged embeddings per slide for ImageNet and HistoNet. A, Confusion matrix between true labels and predicted labels by a logistic regression classifier trained on average ImageNet embeddings for the urinary KIM-1 concentration. B, Same as (A), except that the logistic regression was trained on average HistoNet embeddings. C, Same as (A) but for Kim-1 mRNA expression. D, Same as (B) but for Kim-1 mRNA expression. E, Receiver operating characteristic curve for the urinary KIM-1 classifiers. F, Same as (E) with Kim-1 mRNA expression as biomarker. KIM indicates kidney injury molecule 1; mRNA, messenger RNA.
To remove any model initialization bias, the same process was repeated for 50 to 500 epochs in steps of 50 epochs, with 10 sets of randomly initialized model weights for each epoch setting (Figure 3). Although ImageNet embeddings have good discriminative power, the HistoNet embeddings consistently performed better and showed less variability across model weight initializations. Moreover, models trained against KIM-1 urinary protein concentration had a higher accuracy than those trained against Kim-1 mRNA expression. To explore this effect, Kim-1 in situ hybridization (ISH) stainings were compared between animals showing moderate Kim-1 CT values in the absence of histological lesions and with low KIM-1 urinary concentration (Figure 3C and D), and animals showing lesions as well as high urinary values (Figure 3E and F). This revealed ISH-positive foci devoid of perceptible lesions by H&E (Figure 3C and D). This suggests that a model trained to identify lesions on H&E slides based on mRNA values may encounter fewer tiles containing lesions than expected from the CT values, whereas a model trained against urinary concentrations will be presented with more lesions perceptible by H&E.

Comparison of logistic regression classifier performance on ImageNet versus HistoNet embeddings. A, Comparison of logistic regression classifier performance on ImageNet versus HistoNet embeddings to separate the slides by KIM-1 urinary concentration. The classifier weights are initialized randomly, 10 repeats per epoch over 10 different epoch settings. B, Same as (A) but for Kim-1 mRNA levels. C and D, Early/limited injury where lesions are imperceptible by conventional histopathology (C: H&E), low urinary concentration of KIM-1 but Kim-1 expression can be detected in some tubular sections (D: Kim-1 ISH). E and F, Overt lesions by conventional histopathology (E: H&E) are associated with clear elevation of urinary concentration of KIM-1 and extensive tissue expression (F: Kim-1 ISH). H&E indicates hematoxylin and eosin; KIM, kidney injury molecule 1; mRNA, messenger RNA.
Overall, the results suggest that aggregated representations have good discriminative power and that domain-specific representations (eg, those based on HistoNet) work better than representations from a general model.
Weakly Supervised Learning Enables Attention and Hence Interpretability at the Tile Level
To identify lesions on WSIs, an MIL model was trained to rank the most relevant tiles from WSIs based on their morphology that provided the best separation of the slides based on the KIM-1 urinary concentration and Kim-1 mRNA levels, respectively (Figure 4).

A 2-step weakly supervised multiple instance learning approach. For training the model, batches of 10 randomly picked WSIs with 1000 randomly sampled tiles each are prepared. Whole slide images highlighted in blue and red represent normal and histopathological cases, respectively. Each tile is converted to its corresponding embedding vector containing 2048 features. The input tensor of size 10 × 1000 × 2048 is fed sequentially into the MIL architecture. The Lambda (n-max) layer (light blue, third layer from the left) contains the maximum scoring function that extracts the n top-ranked samples from the vector. The binary cross-entropy loss function is used to calculate the difference between the predictions and the true global labels. During training, backpropagation computes the gradient of the loss function and updates the weights of the trainable layers of the network. Once trained, the MIL architecture can be used for prediction. For predicting scores, a new WSI that was not part of the training set is taken: all its tiles (t) are sampled and converted to their corresponding embedding vectors of size 2048 each. The trained layers of the MIL architecture (dark blue, first and second layers from the left) are extracted and the input tensor of size t × 2048 is used to generate 1 score per tile. The values of the scores are converted to their respective colors on a colormap and overlaid on the original H&E WSI. H&E indicates hematoxylin and eosin; WSI, whole slide image.
The number of tiles considered (n ranks) was kept constant during training and the effect of the number of top tiles on the accuracy and ROC-AUC was evaluated for each model (Tables 2 and 3). For each number of top tiles, the performance was estimated as the 5-fold cross-validated average on the test set. Under these conditions, the classifier based on KIM-1 urinary concentration performed better than the classifier based on Kim-1 mRNA, with ROC-AUC values between 0.9127 and 0.9731 and between 0.8646 and 0.8993, respectively. The number of top tiles (n ranks) had an impact on the performance of the KIM-1 urinary concentration classifier: performance was lower with fewer top tiles and higher with more top tiles. The number of top tiles did not influence the performance of the Kim-1 mRNA classifier.
Classification Performance on the Test Set of KIM-1 Urinary Concentration Classifier at the Slide Level Assessed by Multiple Metrics for the MIL.a,b
Abbreviations: AUC-ROC, area under the curve of the receiver operating characteristic; KIM-1, kidney injury molecule 1; MIL, multiple instance learning.
a The metrics are reported as an average over 5-fold cross-validation. Accuracy and AUC-ROC are reported ± the standard deviation of the mean. A constant learning rate of 0.00015 was used across all experiments.
b Boldface values: highest values for Accuracy and AUC-ROC.
Classification Performance on the Test Set of Kim-1 mRNA classifier at the slide level assessed by multiple metrics for the MIL.a,b
Abbreviations: AUC-ROC, area under the curve of the receiver operating characteristic; KIM-1, kidney injury molecule 1; MIL, multiple instance learning; mRNA, messenger RNA.
a The metrics are reported as an average over 5-fold cross-validation. Accuracy and AUC-ROC are reported ± the standard deviation of the mean. A constant learning rate of 0.000125 was used across all experiments.
b Boldface values: highest values for Accuracy and AUC-ROC.
Subsequently, the trained layers of the MIL architecture were extracted to provide a score to the individual tiles within a WSI that had never before been seen by the classifier. Tile scores were visualized as a color-coded overlay on the original WSI (Figure 5).

Localization of predicted lesions. A, Overview of an original WSI from the test set. B, Same as (A) but with a heatmap overlaid on the WSI showing the score assigned to each tile by the KIM-1 urinary concentration classifier. The color represents the score of the prediction, that is, the confidence the classifier has that a tile contains a lesion. Dark colors indicate low probabilities of lesions (normal tissue) and brighter colors indicate higher probabilities of lesions. C, Examples of extracted tiles at 1 mpp with high prediction scores for containing lesions. KIM-1 indicates kidney injury molecule 1; WSI, whole slide image.
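The overlay described in this caption amounts to placing one score per tile on a grid and mapping it to a color. The sketch below is a hypothetical illustration: the function names, the assumption that scores lie in [0, 1], and the simple dark-to-bright linear ramp are not taken from the original work, which does not specify its colormap.

```python
import numpy as np

def scores_to_heatmap(tile_rows, tile_cols, scores, grid_shape):
    """Place per-tile scores on a 2D grid; positions without a tile stay NaN."""
    heat = np.full(grid_shape, np.nan, dtype=float)
    heat[tile_rows, tile_cols] = scores
    return heat

def to_rgb(heat, low=(0, 0, 64), high=(255, 255, 0)):
    """Linear ramp from a dark color (low score) to a bright color (high score)."""
    t = np.clip(np.nan_to_num(heat, nan=0.0), 0.0, 1.0)[..., None]
    rgb = (1.0 - t) * np.array(low) + t * np.array(high)
    rgb[np.isnan(heat)] = 0                      # background (no tile) rendered black
    return rgb.astype(np.uint8)

# Two tiles on a 2 x 3 grid: one scored 0.0 (normal), one scored 1.0 (lesion).
heat = scores_to_heatmap(np.array([0, 1]), np.array([0, 2]), np.array([0.0, 1.0]), (2, 3))
rgb = to_rgb(heat)
```

The resulting array has one pixel per tile; to overlay it on the H&E image it would be upsampled by the tile size and alpha-blended, which is omitted here.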
The model predictions were validated using both a pathologist’s review of individual tiles and the comparison of the heatmaps with Kim-1 ISH at the slide level. First, a set of randomly picked tiles with high prediction scores for each of the 2 classifiers was presented to a pathologist. The images presented to the pathologist contained a slightly broader region around the original tile to provide enough context for interpretation (Figure 5C). The accuracies were 97.5% and 94.17% for the classifiers based on KIM-1 urinary concentration and Kim-1 mRNA concentration, respectively (n = 120 for each case). In both cases, the classifiers identified the tiles containing lesions with high certainty.
Second, the Kim-1 ISH WSI image was converted into an intensity map corresponding to the SOD of the ISH signal. Then, in order to compare it with the distribution of the ranked tiles overlaid on the original slide, the ISH SOD map was binned to match its pixels to the size of the tile overlay and displayed using the same color scale (Figure 6). In the absence of lesions and low biomarker values, the heatmaps did not highlight any region within the WSI. In cases containing low grade lesions and moderate Kim-1 mRNA expression, the heatmaps from both predictive models and ISH SOD highlighted the inner stripe of the outer medulla. In this situation, the number and the scores of tiles were lower when predictive models were based on urinary concentration than when they were based on mRNA expression of Kim-1. More severe lesions and higher biomarker values showed heatmap signal expanding toward the cortex with similar coverage for models based on urinary concentration and mRNA expression (Figure 6).
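The binning step can be illustrated with a short block-averaging sketch. The use of the mean within each block and the dropping of incomplete edge blocks are assumptions; the text states only that the SOD map was binned to match the tile overlay.

```python
import numpy as np

def bin_to_tiles(sod_map, tile_px):
    """Average a 2D intensity map over non-overlapping tile_px x tile_px blocks."""
    rows = sod_map.shape[0] // tile_px
    cols = sod_map.shape[1] // tile_px
    cropped = sod_map[: rows * tile_px, : cols * tile_px]   # drop incomplete edge blocks
    blocks = cropped.reshape(rows, tile_px, cols, tile_px)
    return blocks.mean(axis=(1, 3))                         # one value per tile-sized block

# Toy 4 x 4 map binned into 2 x 2 blocks.
sod = np.arange(16, dtype=float).reshape(4, 4)
binned = bin_to_tiles(sod, 2)
```

After binning, each pixel of the SOD map covers the same area as one tile of the prediction overlay, so the two maps can be displayed with the same color scale and compared region by region.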

Comparison between model predictions and Kim-1 ISH. Examples of sections with no lesions and no expression of Kim-1 (No); low expression, low urinary concentration, and low grade lesions (Low); or high expression, high urinary concentration, and high-grade lesions (High). For each representative slides, heatmaps representing predictions of model based on urinary concentration (UC) or RNA expression are compared with Kim-1 ISH staining and specific optical density (SOD). Note that model predictions were derived from an H&E slide whereas ISH was performed on another section collected deeper in the paraffin block, hence the images show slightly different shapes. H&E indicates hematoxylin and eosin; ISH, in situ hybridization.
Unsurprisingly, the cumulative distribution of positive detection of lesions was the highest for Kim-1 mRNA expression, followed by KIM-1 urinary concentration, and the lowest for reported histopathology grades. When compared to these data, the performance of the models predicting lesions from H&E images trained against Kim-1 mRNA expression and urinary concentration was higher than that of histopathology (Figure 7A). Also, predictions based on both models showed a strong relationship with histopathology grades (Figure 7B).

Relation between biomarker values, prediction scores, and histopathology. A, Distribution of normalized biomarker values, slide-level ranking scores, and histopathology gradings along a synthetic axis representing the rank of the slide in the 6-dimensional space defined by these parameters. The red vertical line represents the detection threshold of pathology lesions (grade >1). B, Relationship between histopathology tubular lesions (grade) and slide-level ranking score for mRNA and urinary protein concentration (UC)-based models. mRNA indicates messenger RNA.
Multitask Semantic Feature Interpretation From Visual Learned Representations
Finally, to explore the relevance of learned representations to the different morphological types of lesions present in the slide set (Figure 8), a multitask model was trained to classify simultaneously all of the possible lesions in a WSI, directly from the average embeddings obtained through the CNN models.

Frequency of tubular features. The tubular lesions are either present or not. The frequency denotes the ratio between the number of tissue slides that show a particular lesion and the overall number of samples. Normal kidney sections were included under the “no lesion reported” class.
The multitask model achieved high scores across the multiple classes, suggesting that the average HistoNet embeddings correlate well with the manually graded features. Figure 9 illustrates the average performance scores achieved over 5-fold cross-validation. We report both the area under the ROC curve (ROC-AUC) and the area under the precision-recall curve (PR-AUC). The model performed particularly well for the identification of necrosis even though descriptive features of such lesions are likely not learned during training on normal histology. The model also distinguished between tubular epithelial changes such as sloughing, dilation, and vacuolation.
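To make the multitask setup concrete, the following sketch shows one possible form: a shared layer on top of the slide-level average embedding, followed by one sigmoid output per lesion type, each trained with its own binary cross entropy. The layer sizes, the tanh activation, and the four task names are illustrative assumptions; the actual architecture is not specified in this section.

```python
import numpy as np

TASKS = ["necrosis", "sloughing", "dilation", "vacuolation"]   # illustrative subset

def multitask_forward(embeddings, w_shared, b_shared, w_heads, b_heads):
    """(slides, features) -> (slides, tasks) matrix of per-lesion probabilities."""
    hidden = np.tanh(embeddings @ w_shared + b_shared)   # representation shared by all tasks
    logits = hidden @ w_heads + b_heads                  # one output column per task
    return 1.0 / (1.0 + np.exp(-logits))

def multitask_bce(y_true, y_prob, eps=1e-7):
    """Binary cross entropy per task, plus the mean over tasks used as the total loss."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    per_task = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob), axis=0)
    return per_task, float(per_task.mean())

# Untrained (all-zero) weights give probability 0.5 for every lesion type.
n_slides, n_features, n_hidden = 6, 32, 8
embeddings = np.ones((n_slides, n_features))
probs = multitask_forward(
    embeddings,
    np.zeros((n_features, n_hidden)), np.zeros(n_hidden),
    np.zeros((n_hidden, len(TASKS))), np.zeros(len(TASKS)),
)
targets = np.tile(np.array([1.0, 0.0, 0.0, 1.0]), (n_slides, 1))   # hypothetical lesion labels
per_task_loss, total_loss = multitask_bce(targets, probs)
```

Because each lesion is an independent present/absent output, a slide can carry several lesions at once, which matches the per-lesion frequencies shown in Figure 8.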

Average performance scores on validation set. We plot the ROC and prescore, in blue and magenta, which account for the “area under the ROC curve” and “area under the precision-recall curve,” respectively. The bars illustrate mean values and the intervals show the standard deviation for 5-fold cross-validation. ROC indicates receiver operating characteristic.
Discussion
Evaluating the Discriminative Power of Domain-specific Embedding Using a Logistic Regression Model
Using classical machine learning models, we show that slide-level average embeddings result in aggregated representations that have discriminative power for certain biomarkers. To the best of our knowledge, we show for the first time that in the context of histopathology images, domain-specific representations work significantly better (Figures 2 and 3) than representations from a model trained on a different domain. This is further substantiated by the observation that HistoNet embeddings contain more information relevant to histology images than those from ImageNet embeddings (Figure 1B). This opens up the possibility of exploring these embeddings for associating morphological changes in WSIs with other biomarkers. We note that the simple logistic regression model reaches over 90% accuracy on the test set using fewer training epochs for the HistoNet embeddings versus the ImageNet embeddings (Figure 3). The variability of the test-set accuracy is also lower for HistoNet than for the ImageNet embeddings.
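As a minimal sketch of this pipeline, the example below mean-pools tile embeddings into one vector per slide and fits a logistic regression by gradient descent. The data are synthetic and the helper names, learning rate, and epoch count are hypothetical; it illustrates the technique only and does not reproduce the reported results.

```python
import numpy as np

def slide_embedding(tile_embeddings):
    """Mean-pool tile embeddings of shape (tiles, features) into one slide vector."""
    return tile_embeddings.mean(axis=0)

def fit_logistic_regression(x, y, lr=0.5, epochs=300):
    """Plain batch gradient descent on the binary cross entropy loss."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        error = p - y                          # gradient of the loss with respect to the logit
        w -= lr * (x.T @ error) / len(y)
        b -= lr * error.mean()
    return w, b

# Synthetic slides: 50 tiles x 8 features each; positive slides have shifted features.
rng = np.random.default_rng(0)
labels = np.array([i % 2 for i in range(40)], dtype=float)
slides = np.stack([
    slide_embedding(rng.normal(loc=label, size=(50, 8)))
    for label in labels
])
w, b = fit_logistic_regression(slides, labels)
predicted = (slides @ w + b > 0).astype(float)
accuracy = float((predicted == labels).mean())
```

Averaging discards the position of each tile, which is why the tile-level localization discussed later requires the separate MIL model.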
We also found that the accuracy was higher for a model trained against urinary KIM-1 concentration than for a model trained against Kim-1 mRNA expression in the tissue (Figures 2 and 3). One possible explanation could reside in the biology of Kim-1. Kim-1 is normally absent from renal tubular epithelial cells and its expression is induced by noxious stimuli. With persistent or intense stimulation, the KIM-1 protein is expressed at the apical border of the proximal tubule cells and eventually shed into the lumen, making it detectable in the urine. 33 Kim-1 mRNA induction is a very early phenomenon and therefore it can in some instances be evidenced by ISH and PCR while no lesion can be evidenced on an H&E section (Figures 3C and D). Therefore, during training, a model meant to identify lesions on H&E slides against Kim-1 expression may be exposed to fewer tiles containing actual lesions than a model trained against KIM-1 urinary concentration. While this formally represents a lower accuracy at detecting perceptible lesions, the comparison between model predictions and Kim-1 ISH through heatmaps (Figure 6) suggests that the model trained against Kim-1 mRNA identifies areas where lesions are subtle and underdetected by the pathologist. What is shown here for urinary KIM-1 concentration and Kim-1 mRNA expression opens the possibility to explore more biomarkers, including those characterized by a narrower separation between the normal and histopathological cases. From a modeling perspective, HistoNet embeddings could be used to train the same model on multiple tasks simultaneously (as we have demonstrated for the multitask modeling of tubular features), so as to increase its discriminative power on hard-to-separate cases. It could also be expanded to the discrimination of lesions associated with a biomarker from others (eg, distinguish between tubular and glomerular lesions in association with Kim-1).
Additionally, this approach could help anchor biomarkers to histopathology when molecular localization is not possible or suboptimal.
Weakly Supervised Learning Enables Localization and Interpretability at the Tile Level
Model interpretability is an open research topic in DL. Several approaches exist for identifying the pixels in an image that are most relevant to the prediction, for example, Grad-CAM 26 and DeepExplain. 27 In our work, we have limited the interpretation of image features to the ranking of individual tiles based on the relevance of their representation (also called morphological signature) to the model prediction.
We have developed and implemented a 2-step approach for the identification of renal lesions using a weakly supervised MIL framework based on learned representations from a pretrained holistic histopathological DL model. Weakly supervised approaches have been previously reported to yield great results in the context of cancers for the identification of regions containing tumors. 18,21,23,34 In these examples, the authors chose to place an ImageNet-pretrained CNN model in front of, and directly connected to, the MIL architecture and used end-to-end training to achieve global slide classification from sampled tiles. This means that both the CNN and the MIL architectures are trained simultaneously. Since the most recent CNNs can have more than 100 layers, end-to-end training can be time-consuming. In our approach, an already pretrained CNN model was used, which means that only the few layers of the MIL architecture needed to be trained. This not only greatly reduces the training time but also provides a common starting point for any exploratory analyses without retraining the entire CNN architecture.
The approach we have chosen makes use of a maximum scoring function on a small number of top predictive tiles to predict the global biomarker expression label for a given WSI. Multiple scoring functions exist for the general use of MIL. In the context of histopathology, maximum (max), maximum and minimum (min-max), and attention-based scoring functions have been previously successfully reported. 18,20,21,23,35 In this work, we have used a ranking layer using the maximum function. Unlike previous reports in oncology searching for markers associated with good or poor prognosis, where a combination of minimum and maximum functions is meaningful, there is no minimum evidence in the context of lesion detection. That is, a lesion is positively differentiated from the normal tissue and there is little sense in using minimum evidence of a lesion to define the normal tissue.
The number of top tiles considered by the scoring function was shown to have an impact on the classification performance in the case of the KIM-1 urinary concentration classifier but not in the case of the Kim-1 mRNA classifier. These latter results are comparable to those reported previously for the identification of tumor regions in WSIs. 18,21,23 In a previously published study, the authors reported that, when using a min-max scoring function, using only 5 top and bottom tiles was sufficient for correctly predicting global WSI labels. 23 In this case, the number of tiles had a low influence on the performance of the global WSI label classification. One possible explanation for this difference could lie with the lower sensitivity of the urinary concentration of KIM-1 in comparison to the mRNA expression, as discussed above.
The classifiers were trained on WSIs from different studies using multiple compounds with different nephrotoxic profiles across multiple dose levels and time points. They were able to identify tubular lesions and the regions that are highly correlated with Kim-1 mRNA expression independently of the compounds and their respective nephrotoxic profiles. The models identified clear tubular lesions with high specificity in samples from animals treated with classical tubular toxicants (cisplatin, gentamicin, and vancomycin). Moreover, only regions corresponding to tubular lesions were picked up by the model in samples from animals treated with glomerular toxicants (puromycin, doxorubicin). It would be interesting to assess if these classifiers could detect lesions caused by other nephrotoxicants such as furosemide or lithium, which are both known tubular and collecting duct toxicants.
One limitation of the current work is that we have restricted our approach to the identification of renal lesions. We believe that since the HistoNet model was trained on a large collection of 46 normal tissues, the learned representations from such a model should be easily applicable to the exploration of other tissues as well. Also, the methodology presented here could be extended to other biomarkers (eg, cystatin C, β2 microglobulin), a broader spectrum of lesions, other tissues, or even other contexts, either in toxicology or in general pathology (eg, disease markers, genetic abnormalities, etc). 24,25,36
The ability of MIL to identify relatively small regions of the slides (in our case squares of 224 µm × 224 µm) associated with the elevation of a particular biomarker has the potential to facilitate the characterization of precursor lesions. This would apply also to the detection of particular combinations of elementary lesions associated with the elevation of a biomarker, thereby facilitating its phenotypic anchoring. In the context of nonclinical toxicology studies, this would facilitate the identification of treatment-related lesions in low-dose groups, their relationship with biomarker values, and potentially improve candidate compound selection.
The detection and characterization of early elementary lesions at the microscope have a low sensitivity. Additional, and sometimes numerous, molecular stains are required for certain diagnoses. This requires multiple sections that may not contain the lesion of interest, especially if it is small. In this context, MIL-based models could allow the detection of areas presenting features associated with these markers and quantify them from a single section. This could become an important tool for research and the study of underlying pathophysiological mechanisms.
Multitask Semantic Feature Interpretation From Visual Learned Representations
Training a model to predict multiple graded features simultaneously allows us to ascertain how well the averaged HistoNet embeddings correlate with expert-graded morphological features. We have shown that the features contained in the averaged embeddings correlate well with pathologist-graded features such as tubular necrosis/apoptosis, glomerular changes, interstitial inflammation, tubular cell vacuolation, and tubular cell sloughing, even considering the possibility for intra- and interobserver variability in human grading. Such a model could become the basis of a tool to assist pathologists in grading, hence improving the reproducibility of the grading systems between studies or even tissues.
Regarding the multitask model performance, it is known that the AUC can be an overoptimistic measure of model performance in cases of highly unbalanced classes. Hence, we base our discussion on the values for the PR-AUC. While a ROC-AUC of 0.5 is associated with a random classifier, the same is not always true for the PR-AUC especially for classes with a high negative class balance, as in our setup. Recall that the binary classification for a task is the comparison of each class or graded feature against all of the other classes. The performance of the model on each task is significantly above the PR-AUC threshold for a random classifier but higher values closer to one are better.
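The difference between the two chance levels can be checked with a small pure-Python example. For a classifier that carries no information, the ROC-AUC is 0.5 regardless of class balance, whereas the area under the precision-recall curve equals the fraction of positive samples. The sketch below uses the step-wise (average precision) estimate of the PR-AUC, which is one common choice and an assumption here; the 5% prevalence is a hypothetical value.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve, one step per distinct threshold."""
    n_pos = sum(labels)
    tp = fp = 0
    prev_recall = 0.0
    area = 0.0
    for threshold in sorted(set(scores), reverse=True):
        for y, s in zip(labels, scores):
            if s == threshold:                 # samples newly predicted positive
                tp += y
                fp += 1 - y
        recall = tp / n_pos
        area += (recall - prev_recall) * tp / (tp + fp)
        prev_recall = recall
    return area

# Hypothetical rare lesion: 5 positive slides out of 100.
labels = [1] * 5 + [0] * 95
uninformative = [0.5] * 100                    # same score for every slide
perfect = [float(y) for y in labels]

auc_chance = roc_auc(labels, uninformative)
pr_chance = average_precision(labels, uninformative)
auc_perfect = roc_auc(labels, perfect)
pr_perfect = average_precision(labels, perfect)
```

In this toy setting a PR-AUC of, say, 0.4 for a lesion present in 5% of slides would already be well above its chance level of 0.05, which is why each task should be read against its own class frequency (Figure 8).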
This approach has a dual purpose: (1) given a predictive model based on the current learned representation, to inform pathologists about the features in which they can have high confidence and (2) to inform the machine learning model about which features have either insufficient examples or inherent variability. An application of this effort could, of course, be the observer-independent grading of slides for lesion types for which the model has high confidence in its prediction. In a future effort, one can envision an active learning approach that would allow for the correction of uncertain labels while at the same time conditioning the model to learn better representations to discriminate among high-confidence labels.
Conclusion
Our study shows that learned representations from a pretrained domain-specific model can be readily used to explore the relevance of the morphological changes in tissue to biomarker level differences and to pathologist-graded tissue features. Moreover, we show that it is possible to provide localized scoring for those discriminative features. These may be easily projected onto the original H&E-stained WSI to facilitate the visual validation of the morphological changes associated with a specific biomarker.
We have shown that a domain-specific model such as HistoNet has a richer set of embedding features than models trained with images from other domains, although for certain classification tasks, for example, discriminating between urinary concentration KIM-1 classes, such nonspecific embeddings may be sufficient.
In summary, while training directly from a large corpus of carefully labelled WSIs inspires confidence for a specific classification task, such end-to-end training is computationally intensive and, as shown here, not entirely necessary. The field of computational pathology benefits from domain-specific representations such as HistoNet that allow researchers to rapidly explore deeper questions regarding tissue morphology, without the need to learn such representations from scratch. This is of particular relevance for small data sets. Weakly supervised approaches built upon such prelearned representations or embeddings open up the possibility to make morphological predictions at a local level, with global slide-level information only.
Moreover, with the rapid adoption of spatially resolved molecular profiling technologies such as spatial transcriptomics, the MIL approaches could be extended to models trained to predict gene expression profiles from small histology sections. The aggregated transcriptomics profile would be localized only to those regions that show a difference in morphological structure between groups of slides being studied. Similarly, local morphological representations can serve as end points to explore relationships to compound structure for predictive models of compound toxicity and potentially compound efficacy.
Such a rapid modeling capability brings machine learning and expert human knowledge together, for a richer exploration of biomarker predictions that have an explainable morphological foundation and opens up the possibility for linking morphological change in histology to a host of relevant data modalities.
Supplemental Material
Supplemental Material, sj-docx-1-tpx-10.1177_0192623320987202 - Biomarker-Based Classification and Localization of Renal Lesions Using Learned Representations of Histology—A Machine Learning Approach to Histopathology
Supplemental Material, sj-docx-1-tpx-10.1177_0192623320987202 for Biomarker-Based Classification and Localization of Renal Lesions Using Learned Representations of Histology—A Machine Learning Approach to Histopathology by Christophe A. C. Freyre, Stephan Spiegel, Caroline Gubser Keller, Marc Vandemeulebroecke, Holger Hoefling, Valerie Dubost, Emre Cörek, Pierre Moulin and Imtiaz Hossain in Toxicologic Pathology
Supplemental Material, sj-docx-2-tpx-10.1177_0192623320987202 for Biomarker-Based Classification and Localization of Renal Lesions Using Learned Representations of Histology—A Machine Learning Approach to Histopathology by Christophe A. C. Freyre, Stephan Spiegel, Caroline Gubser Keller, Marc Vandemeulebroecke, Holger Hoefling, Valerie Dubost, Emre Cörek, Pierre Moulin and Imtiaz Hossain in Toxicologic Pathology
Supplemental Material, sj-tif-1-tpx-10.1177_0192623320987202 for Biomarker-Based Classification and Localization of Renal Lesions Using Learned Representations of Histology—A Machine Learning Approach to Histopathology by Christophe A. C. Freyre, Stephan Spiegel, Caroline Gubser Keller, Marc Vandemeulebroecke, Holger Hoefling, Valerie Dubost, Emre Cörek, Pierre Moulin and Imtiaz Hossain in Toxicologic Pathology
Supplemental Material, sj-tif-2-tpx-10.1177_0192623320987202 for Biomarker-Based Classification and Localization of Renal Lesions Using Learned Representations of Histology—A Machine Learning Approach to Histopathology by Christophe A. C. Freyre, Stephan Spiegel, Caroline Gubser Keller, Marc Vandemeulebroecke, Holger Hoefling, Valerie Dubost, Emre Cörek, Pierre Moulin and Imtiaz Hossain in Toxicologic Pathology
Supplemental Material, sj-tif-3-tpx-10.1177_0192623320987202 for Biomarker-Based Classification and Localization of Renal Lesions Using Learned Representations of Histology—A Machine Learning Approach to Histopathology by Christophe A. C. Freyre, Stephan Spiegel, Caroline Gubser Keller, Marc Vandemeulebroecke, Holger Hoefling, Valerie Dubost, Emre Cörek, Pierre Moulin and Imtiaz Hossain in Toxicologic Pathology
Supplemental Material, sj-tif-4-tpx-10.1177_0192623320987202 for Biomarker-Based Classification and Localization of Renal Lesions Using Learned Representations of Histology—A Machine Learning Approach to Histopathology by Christophe A. C. Freyre, Stephan Spiegel, Caroline Gubser Keller, Marc Vandemeulebroecke, Holger Hoefling, Valerie Dubost, Emre Cörek, Pierre Moulin and Imtiaz Hossain in Toxicologic Pathology
Footnotes
Authors’ Note
Christophe A. C. Freyre, Stephan Spiegel, Pierre Moulin and Imtiaz Hossain have contributed equally to this article.
Acknowledgments
The authors would like to thank Carlotta Caroli, Nicholas Kelley, and Shahram Ebadollahi from the Novartis AI for Life Residency Program. The authors would also like to thank Chintan Parmar, Eric Durand, and Xian Zhang from the Novartis Institutes for Biomedical Research (NIBR) for fruitful discussions.
Declaration of Conflicting Interests
The author(s) declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: CF, SS, PM, IH, CGK, MV, HH, and VD are employed by Novartis and hold shares in the company.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Supplemental Material
Supplemental material for this article is available online.
References
