Abstract
BACKGROUND
Continued improvement in deep learning methodologies has increased the rate at which deep neural networks are being evaluated for medical applications, including diagnosis of lung cancer. However, there has been limited exploration of the underlying radiological characteristics that the network relies on to identify lung cancer in computed tomography (CT) images.
OBJECTIVE
In this study, we used a combination of image masking and saliency activation maps to systematically explore the contributions of both parenchymal and tumor regions in a CT image to the classification of indeterminate lung nodules.
METHODS
We selected individuals from the National Lung Screening Trial (NLST) with solid pulmonary nodules 4–20 mm in diameter. Segmentation masks were used to generate three distinct datasets: 1) an Original Dataset containing the complete low-dose CT scans from the NLST, 2) a Parenchyma-Only Dataset in which the tumor regions were covered by a mask, and 3) a Tumor-Only Dataset in which only the tumor regions were included.
RESULTS
The Original Dataset significantly outperformed the Parenchyma-Only Dataset and the Tumor-Only Dataset with an AUC of 80.80
CONCLUSION
We conclude that network performance is linked to textural features of nodules such as kurtosis, entropy and intensity, as well as morphological features such as sphericity and diameter. Furthermore, textural features are more positively associated with malignancy than morphological features.
Introduction
The ability of deep neural networks (DNNs) to extract high-level features from images has allowed them to garner widespread attention and adoption in various real-world tasks [1,2,3]. In the case of lung cancer, DNNs have achieved performance comparable to, and sometimes better than, that of trained radiologists [4]. DNNs evaluate voxel intensity relationships and construct features that are subsequently used to address a classification problem. However, since these features are not predefined, and their attribution to the endpoint is rapidly convoluted within the network layers, it is difficult to know what image characteristics contribute most heavily to the classification [5,6,7,8]. This intrinsic black-box nature of DNNs militates against trust in their diagnoses, especially when they do not agree with physician opinion.
Various methodologies have been created to address network interpretability, including saliency activation maps and feature perturbation. The saliency activation map is a visualization technique that highlights the regions or features in an image that a DNN pays most attention to when making its classification decisions [9,10,11]. However, this leaves the interpretation of which features are being identified as important to the human observer, making it open to confirmation bias. Alternatively, perturbation of the individual features identified by a CNN can show the relative contributions that each feature makes to network performance [12,13,14], but it is often difficult to interpret these features in terms of meaningful human notions. It thus remains challenging to determine if a DNN is capturing known biologic relationships such as, for example, the link between parenchymal lung disease and lung cancer [15,16,17,18,19]. The roles of such known relationships have been studied in support vector machines, random forests, and multi-layer perceptrons [20], but in these cases the features were manually extracted. Their roles in CNNs, which extract features automatically, remain uncertain.
Accordingly, in the present study we perturbed images by masking segmented regions, and combined this with saliency activation maps to systematically explore the contribution of parenchymal and tumor regions in CT images to the classification of indeterminate lung nodules. In particular, we investigated the nodule characteristics associated with false-negatives and false-positives in order to gain insight into the failure modes of CNNs.
Methods
Dataset
We selected a subset of images containing indeterminate lung nodules from the National Lung Screening Trial (NLST) dataset (2). The University of Vermont Institutional Review Board determined the use of NLST data to be human subject exempt following the National Cancer Institute Data Agreement (NLST-163). Individuals screened in the NLST had a smoking history of greater than 30 pack-years and had quit smoking less than 15 years prior. Using the low-dose computed tomography (LDCT) branch of the NLST, we selected individuals with nodules less than 20 mm in diameter. This reduced the influence of diameter on the likelihood of malignancy, since solitary nodules with diameters between 20 and 30 mm are known to be associated with a risk of malignancy greater than approximately 50% [21]. Additionally, images with multiple nodules or subsolid nodules were excluded from the dataset. These criteria resulted in a final dataset of 3,533 annotated 3-dimensional LDCT images from the total of 54,000 images in the NLST dataset (Fig. 1).
Demographic and scanning parameters of study cohorts.

Flow diagram showing the inclusion and exclusion criteria for final dataset using the National Lung Screening Trial dataset (NLST) [36].
Of the 3,533 patients in the final dataset, 354 were found to have positive diagnoses for lung cancer (Table 1). To balance the dataset for training, 354 patients were randomly selected from those with benign nodules, giving a total of 708 nodules. A 64
Nodules were segmented semi-automatically from regions of interest (ROI) using the Chest Imaging Platform (CIP) [22,23]. Nodule boundaries were automatically detected by the CIP followed by manual adjustments based on secondary visual inspection by a trained radiologist. First-order radiomics, such as energy, entropy, and skewness, along with morphologic radiomics, such as nodule sphericity and maximal diameter, were extracted from the tumor regions in each image. Low attenuation areas below
Training and testing
Normalization was applied to all images prior to being processed by our miniaturized Inception module [24,25]. This architecture was selected to allow for multiscale features to be extracted and concatenated together to minimize information loss. To train the model, a cross-entropy loss function was utilized alongside an ADAM optimizer. Stratified K-fold cross validation was utilized to generate 10 unique training/validation/testing dataset combinations. Training and testing were repeated 10 times on the 10 unique combinations of images. Specificity and sensitivity were extracted from each training-testing instance along with a receiver operating characteristic curve (ROC). The general performance of each approach was evaluated using the area under the curve (AUC) of the ROC.
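The fold assignment and AUC evaluation described above can be illustrated with a minimal, standard-library sketch. It is not the code used in the study: the function names `stratified_folds` and `roc_auc` are hypothetical, the sketch only assigns samples to folds (it does not reproduce the training/validation/testing partitioning), and the AUC is computed through its rank interpretation rather than by integrating an ROC curve.

```python
import random


def stratified_folds(labels, k=10, seed=0):
    """Assign sample indices to k folds so each fold keeps a similar class ratio.

    Hypothetical helper: indices of each class are shuffled and dealt
    round-robin across the folds.
    """
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def roc_auc(labels, scores):
    """AUC as the probability that a random malignant case (label 1)
    receives a higher score than a random benign case (label 0).
    Ties count as one half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present to compute an AUC")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

Each fold can then serve as the held-out test set in turn, with the per-fold AUC values averaged to summarize performance.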
Lastly, we selected the network with the lowest least-absolute-square error by calculating the average AUC. This network was used to evaluate how much attention the CNN placed on each pixel in each image from its gradient-weighted class activation map (Grad-CAM) [9,10]. All Grad-CAMs were separated into classification groups (true-positives, false-positives, true-negatives, and false-negatives) in order to determine those traits that most impacted network performance for each group.
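For readers unfamiliar with Grad-CAM, the sketch below shows the core computation in its commonly published form: each feature map of a chosen convolutional layer is weighted by the spatial average of the class-score gradient with respect to that map, the weighted maps are summed, and negative values are clipped. It assumes the activations and gradients have already been obtained from the network (for example through a framework's backward pass). The function name `grad_cam` and the final rescaling to [0, 1] are illustrative choices, not details taken from this study.

```python
def grad_cam(activations, gradients):
    """Compute a Grad-CAM heat map from precomputed tensors.

    activations: list of K feature maps, each an H x W nested list.
    gradients:   list of K maps of d(class score)/d(activation), same shape.
    Returns an H x W nested list scaled to [0, 1] (all zeros if nothing
    is positively attributed).
    """
    if not activations or len(activations) != len(gradients):
        raise ValueError("activations and gradients must have the same, non-zero length")
    height = len(activations[0])
    width = len(activations[0][0])

    # Channel weights: global average pooling of the gradients.
    weights = [sum(sum(row) for row in grad) / (height * width) for grad in gradients]

    # Weighted sum of the feature maps.
    cam = [[0.0] * width for _ in range(height)]
    for weight, fmap in zip(weights, activations):
        for i in range(height):
            for j in range(width):
                cam[i][j] += weight * fmap[i][j]

    # ReLU keeps only regions that support the class of interest.
    cam = [[max(value, 0.0) for value in row] for row in cam]

    # Optional rescaling for display (an assumption of this sketch).
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam
```

In practice the resulting map is upsampled to the input resolution and overlaid on the CT slice, which is how attention to the nodule versus the parenchyma can be compared across classification groups.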
Statistical analysis
A two-sample

Axial slice from a low-dose computed tomography (LDCT) image showing (a) the original LDCT scan, (b) the segmented tumor map, (c) the parenchyma-only image, and (d) the tumor-only image.
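The relationship between panels (a) through (d) can be sketched as a simple masking operation on a single slice. This is a minimal illustration under stated assumptions: the function name `split_by_mask` is hypothetical, and `fill_value` is a placeholder intensity, since the value used to cover masked regions is not stated here.

```python
def split_by_mask(image, mask, fill_value=0.0):
    """Derive parenchyma-only and tumor-only slices from one 2D slice.

    image: nested list (rows of intensities) for the original slice.
    mask:  nested list of the same shape, truthy inside the segmented nodule.
    fill_value: assumed placeholder written into the hidden region.
    Returns (parenchyma_only, tumor_only); the inputs are not modified.
    """
    if len(image) != len(mask):
        raise ValueError("image and mask must have the same number of rows")
    parenchyma_only = []
    tumor_only = []
    for image_row, mask_row in zip(image, mask):
        if len(image_row) != len(mask_row):
            raise ValueError("image and mask rows must have the same length")
        # Hide the nodule, keep everything else.
        parenchyma_only.append(
            [fill_value if inside else value for value, inside in zip(image_row, mask_row)]
        )
        # Keep the nodule, hide everything else.
        tumor_only.append(
            [value if inside else fill_value for value, inside in zip(image_row, mask_row)]
        )
    return parenchyma_only, tumor_only
```

Note that the parenchyma-only slice still reveals the outline of the masked region, so the nodule's size and shape remain visible to a network trained on it.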

Distribution of the area under the curve (AUC) across datasets for 100 iterations.
Figure 3 compares the testing diagnostic performances of the Original Dataset, the Parenchyma-Only Dataset, and the Tumor-Only Dataset. The mean AUC for each dataset was 80.80
Number of individuals in each classification group for a given approach using the same testing dataset (n = 137).
The classification results from the best performing network comprised four distinct groups using the maximum probability of the network's output: true positives, false positives, false negatives, and true negatives. Table 2 shows the number of individuals in each group for the Original Dataset, the Parenchyma-Only Dataset, and the Tumor-Only Dataset using the same testing data. Consistent true positives can be observed across all datasets, with the primary difference between the datasets being false classification.
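The grouping described above can be illustrated as follows. This is a sketch under assumptions: the network output is taken to be a pair of class probabilities (benign, malignant), the predicted class is the one with the larger probability, and ties are assigned to benign. The function names are hypothetical.

```python
def classification_groups(labels, probabilities):
    """Count true/false positives and negatives.

    labels: iterable of 0 (benign) or 1 (malignant).
    probabilities: iterable of (p_benign, p_malignant) pairs.
    The predicted class is the one with the maximum probability;
    ties go to benign (an assumption of this sketch).
    """
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for label, (p_benign, p_malignant) in zip(labels, probabilities):
        predicted = 1 if p_malignant > p_benign else 0
        if predicted == 1 and label == 1:
            counts["TP"] += 1
        elif predicted == 1 and label == 0:
            counts["FP"] += 1
        elif predicted == 0 and label == 0:
            counts["TN"] += 1
        else:
            counts["FN"] += 1
    return counts


def sensitivity_specificity(counts):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    malignant = counts["TP"] + counts["FN"]
    benign = counts["TN"] + counts["FP"]
    if malignant == 0 or benign == 0:
        raise ValueError("both classes must be present")
    return counts["TP"] / malignant, counts["TN"] / benign
```

Applying the same function to the predictions from each dataset on a shared test set yields per-dataset counts that can be compared directly.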

Grad-CAM images from the original dataset and parenchyma-only dataset showing network attention for malignant and benign nodules based on class label.
Grad-CAM images from the Original Dataset show that the attention of the CNN was focused on the nodule when malignancy was diagnosed and moved to the parenchyma when nodules were considered benign (Fig. 4). Grad-CAM images from the Parenchyma-Only Dataset show a similar shift in attention from adjacent regions of the parenchyma to the border of the masked tumor in cases of malignancy versus more distant parenchyma in the case of benign nodules.
Nodule diameter, sphericity, intensity, entropy, skewness, kurtosis, gray levels, y-position, and z-position with relation to the carina were significantly different between true positives and true negatives (see Supplement A for
Mean and standard error across the demographic and first-order radiomics features extracted from the original image for classification groups (true positives, false negatives, false positives, and true negatives).
Deep neural networks and the growing availability of big data have allowed for rapid improvements in the accuracy of computer-aided diagnostic tools (CADx) at the cost of interpretability [26,27]. Various methods for model interpretability have been proposed in order to address their black-box nature. Approaches such as concept vectors [5,8,28,29] and attention-based, perturbation-based, and expert knowledge methodologies [27,30] have been explored to improve trust in classification results produced by DNNs. From a clinician's perspective, confidence in a classification result is bolstered by model interpretability that provides a clear reason for a decision. Model interpretability can also be useful for improving the performance of DNNs. For example, we showed in the present study that a combination of image perturbation via masking together with attention-based methodologies provides insight into the features associated with early signs of malignancy that may not be considered in the Lung-RADS guidelines.
Comparing the results shown in Table 3 to published data such as that of Zhu P. and Ogino M., we found that nodule diameter remains positively correlated with nodule malignancy [27,31,32]. This is best illustrated when comparing the size of true-positive and true-negative nodules. Interestingly, true-positive nodules were found to be significantly larger than false-positive and false-negative nodules in the Original Dataset (Supplement A). However, in the Tumor-Only Dataset, nodule diameter was not significantly different between true-positives and false-positives. This suggests that excluding parenchymal features increases the attention of the network on nodule diameter, allowing for larger benign nodules to be misclassified as malignant nodules.
Characteristics of nodule morphology such as shape and spiculation have been shown to provide clues to the likelihood of malignancy [33]. In our analysis, morphological features were significantly different in true-positive nodules compared to false-positives, false-negatives, and true-negatives in both the Original Dataset and the Parenchyma-Only Dataset (Table 3 & Supplement Table A). In these datasets, true-positives were less spherical in nature than other classification groups. This differs from findings by Zhu P. and Ogino M., suggesting an additional CT biomarker of interest [27]. This significant difference disappears when comparing true-negatives to false-positives and false-negatives, suggesting that nodule morphology plays an important role in nodule classification and contributes substantially to nodule misclassification in the Original and Parenchyma-Only datasets (Supplement A). Furthermore, the true-positives in Fig. 4 suggest that attention of the DNN was focused primarily on the tumor-parenchyma border, ignoring distant features of emphysematous or fibrotic tissue.
The presence of chronic inflammatory lung diseases such as emphysema or pulmonary fibrosis has been associated with an increased risk of nodule malignancy [18]. Interestingly, the DNN does not seem to weigh the presence of emphysema as a significant CT biomarker for malignancy. For the Original Dataset, low attenuation areas below
Similarities in the regions of attention in the Grad-CAM images between the Original Dataset and Parenchyma-Only Dataset show that the DNN paid considerable attention to the tumor-parenchyma interface, as seen in Fig. 4, suggesting that it relied not only on diameter but also on morphologic image biomarkers such as nodule sphericity. Therefore, the difference in performance between the Tumor-Only Dataset and the Original Dataset (Fig. 3) may be attributable to significant additional information present at the local interface between the nodule and the parenchyma.
Density and textural features such as nodule entropy, skewness, and kurtosis were significantly different between true-positive and true-negative nodules in the Original and Tumor-Only datasets. This supports findings by the GaX model where nodule roughness was positively associated with malignancy [27]. Our findings therefore suggest that textural and density features should be considered as potential image biomarkers in addition to the nodule diameter in screening guidelines such as the Lung-RADS [34].
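To make the textural and density features named above concrete, the following sketch computes first-order energy, entropy, skewness, and kurtosis from the intensities inside a nodule mask, using their usual statistical definitions. It is illustrative only: the function name `first_order_features` is hypothetical, the histogram `bin_width` is an assumed discretization parameter, and kurtosis is reported without subtracting 3. The extraction settings actually used in the study may differ.

```python
import math
from collections import Counter


def first_order_features(values, bin_width=25.0):
    """Compute first-order statistics for a flat list of voxel intensities.

    energy   = sum of squared intensities
    entropy  = Shannon entropy (base 2) of the binned intensity histogram
    skewness = third central moment / variance ** 1.5
    kurtosis = fourth central moment / variance ** 2 (not excess kurtosis)
    """
    n = len(values)
    if n == 0:
        raise ValueError("at least one voxel is required")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        raise ValueError("skewness and kurtosis are undefined for a constant region")

    # Discretize intensities into fixed-width bins before computing entropy.
    histogram = Counter(math.floor(v / bin_width) for v in values)
    entropy = -sum((c / n) * math.log2(c / n) for c in histogram.values())

    return {
        "energy": sum(v * v for v in values),
        "entropy": entropy,
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2,
    }
```

Higher entropy indicates a more heterogeneous intensity distribution within the nodule, which is the sense in which textural "roughness" is discussed above.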
We found significant differences in performance between the Original Dataset and both the Tumor-Only and Parenchyma-Only datasets. The significant drop in performance of the Parenchyma-Only Dataset can be attributed to the exclusion of tumor textural and density features. These features are important, as demonstrated by the Tumor-Only Dataset performance versus that of the Parenchyma-Only Dataset. However, the performance of the Parenchyma-Only Dataset demonstrates that morphologic and parenchymal features contain critical information related to nodule malignancy that is not currently included in the Lung-RADS assessment. Prior studies have explored the relative importance of parenchymal and nodular features for nodule classification achieved by various machine learning approaches, including artificial neural networks [20,35,36]. There has been limited study of the characteristics associated with solid pulmonary nodule classification in DNNs, and how modifications to the training set lead to changes in these characteristics [37,38]. Current research focuses on minimizing false-positives with limited consideration given to which image biomarkers present within a training dataset could be influencing outcomes.
The findings of this study, although confirming existing work, suffer from several limitations. First, the results presented herein are based on the selective population within the NLST dataset, which consists primarily of heavy smokers. A more comprehensive understanding of why features related to emphysema (laa950) were not selected could be achieved by investigating a cohort of subjects with a higher prevalence of emphysema. In particular, this could elucidate whether this behavior is specific to the dataset we used in the present study or if it is due to lower signal intensity from emphysematous regions that fail to capture the attention of the network. At the same time, nodule characteristics should not be ignored, as significant differences between true-positives and false-negatives demonstrate that the network tends to flag larger, higher intensity, and less spherical nodules as malignant. Additionally, the networks were provided with the central slices of the nodules and not the complete 3D region of interest (ROI), potentially missing critical information in nearby slices. It is also important to note that this study exclusively addresses solid nodules and does not address the influence of ground-glass opacities and part-solid nodules on the identified textural CT biomarkers. Inclusion of ground-glass opacities or part-solid nodules could reduce the influence of textural features related to malignancy classification. To combat this, curriculum and transfer learning approaches could be utilized to teach a network to recognize specific pulmonary structures such as local vasculature as well as definable disease states [39,40]. Furthermore, a selection bias could be impacting the performance of the network as the study focuses on solitary pulmonary nodules and does not evaluate instances where multiple nodules appear in close proximity to one another. 
Lastly, the performance of the Parenchyma-Only Dataset is likely inflated, as masking the nodule still preserved characteristics of the nodule's shape and size. Therefore, the overall contribution of nodule diameter and shape cannot be properly evaluated. It is therefore unlikely that the networks we investigated would be able to evaluate the likelihood of future malignancy from pre-cancerous parenchymal features arising prior to the development of an actual nodule, in contrast to recent results using SYBIL [41]. An important distinction between our work and SYBIL is that the task of our model is to predict the likelihood of malignancy for an existing nodule and to evaluate the differential effect of the nodule versus the surrounding parenchyma, while SYBIL provides a prediction regarding the likelihood of future cancers and the development of existing nodules in a holistic fashion.
Conclusion
Using a combination of Grad-CAM, image perturbation via masking, and radiomics, we have demonstrated where in an image the attention of a DNN is focused depending on which regions of an image are removed. Unsurprisingly, nodule maximum diameter remained a highly selected image biomarker for nodule classification across all datasets. Textural and density features were highly selected in the Original and Tumor-Only datasets, while morphologic features were more commonly selected in the Parenchyma-Only Dataset. The results of this investigation thus imply that network performance is tied to textural features such as nodule kurtosis, entropy, and intensity, and morphologic features such as nodule sphericity and diameter. Our findings imply that current screening guidelines may be improved through incorporation of additional image biomarkers related to malignancy [34]. Our findings also suggest that the majority of the information selected for malignant nodule classification is to be found at the tumor-parenchyma interface. Nevertheless, the features selected by CNNs for nodule classification are likely dependent on the dataset [27]; hence, mixing data from multiple sources could improve model generalizability [42].
Supplemental Material
Supplemental material, sj-docx-1-10.3233_cbm-230444 for LDCT image biomarkers that matter most for the deep learning classification of indeterminate pulmonary nodules by Axel H. Masquelin, Nick Cheney, Raúl San José Estépar, Jason H.T. Bates and C. Matthew Kinsey in Cancer Biomarkers
Footnotes
Acknowledgments
This work was supported by the NIH K23 HL133476, NCI grant F31 CA268908, and NCI grant F99 CA274713. The content is solely the responsibility of the authors and does not represent the official view of the National Cancer Institute.
Author contributions
AHM, CMK, NC, RSJE and JHTB conceived the study. AHM interpreted and analyzed the data. AHM prepared the manuscript. All authors reviewed, revised, and approved the manuscript.
Funding
NIH K23 HL133476, NCI F31 CA268908, NCI F99 CA274713.
Ethics approval
Not applicable.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
Availability of data and material
Data generated or analyzed during the study are available from the corresponding author by request.
Conflict of interest
AHM is a consultant and equity holder for Predictive Wear LLC. JHTB consults for Johnson & Johnson on approaches to treating lung cancer. CMK is a consultant for Olympus America, Nanology, Johnson and Johnson, and consultant and equity holder for Quantitative Imaging Solutions. He reports grants from the NIH, the DECAMP Consortium (funded by Johnson and Johnson through Boston University), and a patent pending for “Bates JM and Kinsey CM. Methods for Computational Modeling to Guide Intratumoral Therapy.” RSJE is a consultant and equity holder for Quantitative Imaging Solutions.
Supplementary data
The supplementary files are available to download from http://dx.doi.org/10.3233/CBM-230444.
