
Abstract

Background:

Accurate differentiation between benign and malignant breast tumors is critical for early diagnosis and treatment planning. Traditional approaches often rely on whole-image processing; however, the tumor contour contains rich morphological cues that can independently support malignancy assessment. Leveraging these contour-based features using artificial intelligence (AI) can enhance diagnostic specificity and interpretability.

Objective:

This study aims to evaluate the diagnostic potential of tumor outline features extracted from mammographic images using deep learning models, with a focus on interpreting their variations numerically and biologically. It also investigates whether combining deep features (ensemble approach) can improve classification accuracy.

Methods:

A public dataset of 100 mammography tumor contours was analyzed. Eight deep learning models (ResNet50, Xception65, VGG16, AlexNet, DenseNet, GoogLeNet, Inception-v3, and a feature-level ensemble) were used for feature extraction. These features were then classified using 5 machine learning algorithms: SVM, KNN, DT, Naive Bayes, and a shallow neural network. Performance metrics included accuracy, sensitivity, specificity, and precision.

Results:

Xception65 with Naive Bayes achieved 97.97% accuracy, while the feature ensemble with an ensemble classifier achieved 96.96% accuracy, 95.45% sensitivity, and 98.48% specificity. Naive Bayes consistently outperformed other classifiers in integrating deep contour features.

Conclusion and Clinical Interpretation:

Tumor contour-based analysis provides biologically meaningful indicators of malignancy—such as irregularity, spiculation, and shape complexity—without relying on full pixel intensity. The results demonstrate that outline-driven AI analysis can enhance breast cancer screening by offering a low-complexity, high-performance diagnostic tool. Future integration into clinical workflows may aid radiologists in real-time and reduce false positives in mammographic diagnosis.

Keywords

tumor outline tracking; variation indices; AI mapping; ensemble deep learning; medical image analysis

Introduction

Breast cancer is the most frequently diagnosed cancer and a leading cause of cancer-related death among women worldwide, with an estimated 2.3 million new cases and approximately 670 000 deaths in 2022, according to the latest global cancer statistics. Understanding its clinical behavior and developing precise diagnostic tools are therefore crucial for improving patient outcomes. The World Health Organization (WHO) consistently reports¹ that breast cancer accounts for a substantial proportion of cancer morbidity and mortality in women, highlighting persistent global challenges in early detection and accurate diagnosis despite advances in screening and treatment. Feng et al² further outlined the role of genomic instability and cellular signaling pathways in the initiation and progression of breast cancer. Clinical staging criteria, as first formalized by Devitt,³ provide a systematic way to categorize disease severity, yet imaging and histological accuracy are foundational to this process.

Recently, researchers have shifted their attention to finer aspects of tumor morphology, particularly the tumor outline or margin, as an underexplored yet information-rich feature for classification. While many studies focus on the entire region of interest (ROI), contour-based classification strategies have shown that irregular tumor boundaries often correlate strongly with malignancy. Thus, leveraging the outline through AI mapping provides a novel and computationally efficient way to classify tumors with minimal pre-processing.

Despite these advancements, challenges remain. Tumor heterogeneity, imaging noise, inter-observer variability in annotations, and dataset bias all continue to hinder generalizability. However, with the availability of large annotated datasets and increasing computational power, integrating advanced convolutional neural networks (CNNs), ensemble learning, and outline-focused analysis is a promising direction. Although mammography remains the standard imaging modality for population-level screening, limitations in segmentation and interpretation accuracy continue to impede optimal clinical performance. This underscores the need for innovative, interpretable machine learning approaches that not only improve classification performance but also provide clinically meaningful insights into image-derived features.

In this study, we investigate a novel method that focuses on tumor outline tracking using AI mapping. Eight models (the pre-trained CNN architectures ResNet50, Xception65, VGG16, AlexNet, DenseNet, GoogLeNet, and Inception-v3, plus an ensemble model) are applied to benchmark outline classification accuracy. By analyzing shape-based variation indices derived from tumor contours, we provide a numerical interpretation of classification performance, aiming to improve early diagnosis and treatment planning.

Related Works

Table 1 summarizes representative studies across contour-based analysis, deep learning on ROI, ensemble modeling, and multimodal approaches, explicitly highlighting their limitations and the remaining gaps that motivate the proposed contour-driven, interpretable framework.

Table 1.

Summary of Related Studies on Breast Cancer Analysis, Limitations, and Identified Gaps.

Category	Study (Ref.)	Main approach	Key contribution	Main limitations/Gaps
Contour/Morphological Analysis	Rangayyan and Nguyen⁴	Fractal dimension analysis	Quantified border irregularity as a malignancy indicator	Relies on handcrafted features; limited scalability and no deep feature learning
Contour/Morphological Analysis	Hmida et al⁵	Fuzzy clustering + active contours	Improved segmentation in noisy mammograms	Sensitive to initialization; limited interpretability of final decision
Contour/Morphological Analysis	Liu et al⁶	Differential evolution + slime mold optimization	Precise multilevel segmentation	Computationally expensive; not linked to classification interpretability
Contour/Morphological Analysis	Ranjbarzadeh et al⁷	Review of segmentation techniques	Highlighted importance of dataset quality and preprocessing	No unified framework linking segmentation to interpretable classification
Deep Learning on ROI-based Inputs	Shamai et al⁸	CNN-based prediction	Extended CNNs to prognostic marker prediction	Limited explainability of learned representations
Deep Learning on ROI-based Inputs	Ashwini et al⁹	CNN architecture comparison	Demonstrated strength of ResNet50	Focused on accuracy, not interpretability
Deep Learning on ROI-based Inputs	Ramadan¹⁰	CNN + domain-specific augmentation	Improved robustness	Heavily dependent on preprocessing heuristics
Deep Learning on ROI-based Inputs	Oyelade and Ezugwu¹¹	Wavelet-CNN hybrid	Combined texture and deep features	Increased complexity; limited clinical interpretability
Deep Learning on ROI-based Inputs	Razali et al¹²	CNN optimization via activation functions	Faster convergence and higher accuracy	No insight into feature-level relevance
Ensemble and Hybrid Models	Shovon et al¹³	Interpretable ensemble	Reduced inter-class confusion	Designed for histopathology, not mammography contours
Ensemble and Hybrid Models	Yan et al¹⁴	Weighted ensemble classifiers	High sensitivity and specificity	Feature fusion not explicitly interpretable
Ensemble and Hybrid Models	Nagalakshmi¹⁵	Multistage DNN ensemble	Improved dense tissue segmentation	Complex architecture; limited transparency
Ensemble and Hybrid Models	Rani et al¹⁶	CNN + SVM hybrid	Combined deep features with classical decision boundaries	Focused on performance rather than feature meaning
Classical Machine Learning Approaches	Babu and Jerome¹⁷	Gaussian mixture models	Competitive accuracy with low resources	Limited capacity for complex feature abstraction
Classical Machine Learning Approaches	Al-Fahaidy et al¹⁸	ML classifiers on mammograms	Reliable results on small datasets	Performance sensitive to feature engineering
Classical Machine Learning Approaches	Hamed et al¹⁹	K-means segmentation + SVM	Improved accuracy via handcrafted features	Lacks deep representation learning
Classical Machine Learning Approaches	Sakib et al²⁰	ML classifier comparison	Identified optimal models for small datasets	No linkage to anatomical interpretability
System-Level & Multimodal Studies	Yoo et al²¹	AI-integrated radiological workflows	Improved clinical throughput	Focused on systems, not feature interpretability
System-Level & Multimodal Studies	Santos et al²²	Review of CAD & AI	Highlighted automation trends	Limited discussion on explainable decision-making
System-Level & Multimodal Studies	Bouzarjomehri et al²³	Multimodal mammography fusion	Increased diagnostic sensitivity	Requires multiple imaging modalities
System-Level & Multimodal Studies	Islam et al²⁴	Hybrid deep fusion model	Robust to noise and variability	High model complexity
System-Level & Multimodal Studies	Saadi et al²⁵	Texture & edge analysis	Adaptable to microcalcification detection	Not validated on mammographic contours

Methodology

Dataset Description

This study utilizes a curated collection of 100 digital mammography image folders, each containing high-resolution grayscale images alongside clinical metadata and tumor segmentation masks. The database was made publicly available²⁶ to support tumor contour analysis and breast cancer research initiatives. Each folder represents a distinct clinical case and includes expertly delineated tumor outlines performed by experienced radiologists. These outlines serve as the ground truth for evaluating shape-based features and classification outcomes. The dataset provides a diverse range of tumor morphologies, spanning benign to malignant categories, and includes variations in margin sharpness, spiculation, and asymmetry.

Preprocessing and Outline Extraction

To ensure uniformity in input data, all mammography images were resized to 224 × 224 pixels, preserving the original aspect ratio via zero-padding. Histogram equalization was applied to enhance contrast, followed by median filtering to reduce noise. Tumor outlines were extracted from the segmentation masks using a morphological edge detection algorithm and converted into binary contour maps.
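The aspect-ratio-preserving resize described above can be illustrated with a minimal sketch. It computes only the scaled size and the zero-padding margins for a 224 × 224 canvas; the function name `letterbox_geometry` and the choice to center the image on the canvas are assumptions made for illustration, not details given in the study.

```python
def letterbox_geometry(width, height, target=224):
    """Scaled size and zero-padding needed to fit an image into a
    target x target square without changing its aspect ratio.

    Centering the image on the canvas is an assumption of this sketch.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    # Scale so that the longer side exactly fills the target size.
    scale = target / max(width, height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    # Remaining pixels on each axis are filled with zeros.
    pad_w = target - new_w
    pad_h = target - new_h
    left, top = pad_w // 2, pad_h // 2
    return {
        "size": (new_w, new_h),
        "pad": (left, top, pad_w - left, pad_h - top),  # left, top, right, bottom
    }


geometry = letterbox_geometry(448, 224)
print(geometry)  # {'size': (224, 112), 'pad': (0, 56, 0, 56)}
```

The actual pixel resampling, histogram equalization, and median filtering would typically be delegated to an image-processing library and are not shown here.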

Each outline was encoded as a 2-dimensional coordinate sequence, which was further processed into shape descriptors such as perimeter-to-area ratio, compactness, fractal dimension, curvature variation, and radial distance signatures. These descriptors were used to quantify the morphological variability of the tumors and served as the basis for the AI-based mapping process.
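As an illustration of how some of these descriptors can be derived from a coordinate sequence, the sketch below computes perimeter, area, perimeter-to-area ratio, compactness, and a radial distance signature for a closed polygonal outline. The exact formulas used in the study are not stated; the compactness definition 4πA/P² and the use of the vertex mean as the centroid are assumptions of this example.

```python
import math


def shape_descriptors(contour):
    """Basic shape descriptors for a closed outline.

    contour: ordered list of (x, y) vertices; the last vertex is
    implicitly connected back to the first.
    """
    n = len(contour)
    if n < 3:
        raise ValueError("a closed outline needs at least 3 vertices")
    perimeter = 0.0
    twice_area = 0.0
    for i in range(n):
        x0, y0 = contour[i]
        x1, y1 = contour[(i + 1) % n]
        perimeter += math.hypot(x1 - x0, y1 - y0)
        twice_area += x0 * y1 - x1 * y0  # shoelace formula term
    area = abs(twice_area) / 2.0
    if area == 0:
        raise ValueError("degenerate outline with zero area")
    # Centroid approximated by the mean of the vertices (assumption).
    cx = sum(x for x, _ in contour) / n
    cy = sum(y for _, y in contour) / n
    radial = [math.hypot(x - cx, y - cy) for x, y in contour]
    mean_r = sum(radial) / n
    radial_sd = math.sqrt(sum((r - mean_r) ** 2 for r in radial) / n)
    return {
        "perimeter": perimeter,
        "area": area,
        "perimeter_to_area": perimeter / area,
        "compactness": 4 * math.pi * area / perimeter ** 2,  # 1.0 for a circle
        "radial_signature": radial,
        "radial_sd": radial_sd,  # spread of radial distances; 0 for regular shapes
    }


square = [(0, 0), (2, 0), (2, 2), (0, 2)]
print(shape_descriptors(square)["compactness"])  # pi / 4, about 0.785
```

Under this definition, more irregular or spiculated outlines yield lower compactness and a larger spread of radial distances than smooth, round ones.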

Feature Mapping With Pretrained Deep Networks

The extracted outline images and their derived descriptors were fed into 8 pretrained models, each configured to extract high-dimensional feature embeddings for classification: 7 convolutional neural networks (CNNs), namely ResNet50, Xception65, VGG16, AlexNet, DenseNet121, GoogLeNet (Inception-v1), and Inception-v3, and an ensemble model combining the top 3 performing networks via soft voting. Each CNN was pre-trained on ImageNet and fine-tuned using a transfer learning strategy, with the final fully connected layer modified for binary classification (benign vs malignant). The dataset was split into 80% for training, 10% for validation, and 10% for testing.
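Soft voting, as mentioned for the ensemble model, averages the class probabilities predicted by several networks before thresholding. The following is a minimal sketch under the assumption that each network outputs one malignancy probability per sample; the function name `soft_vote`, the equal weighting, and the 0.5 threshold are illustrative choices rather than settings reported in the study.

```python
def soft_vote(prob_lists, threshold=0.5):
    """Average per-model malignancy probabilities and threshold them.

    prob_lists: one list per model, each holding P(malignant) for the
    same samples in the same order. Equal model weights are assumed.
    Returns (averaged probabilities, predicted labels) with 1 = malignant.
    """
    if not prob_lists:
        raise ValueError("at least one model is required")
    n_samples = len(prob_lists[0])
    if any(len(p) != n_samples for p in prob_lists):
        raise ValueError("all models must score the same samples")
    n_models = len(prob_lists)
    averaged = [
        sum(p[i] for p in prob_lists) / n_models for i in range(n_samples)
    ]
    labels = [1 if a >= threshold else 0 for a in averaged]
    return averaged, labels


# Three hypothetical networks scoring two samples.
averaged, labels = soft_vote([[0.9, 0.2], [0.6, 0.4], [0.3, 0.3]])
print(labels)  # [1, 0]
```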

Evaluation Metrics

Performance was evaluated based on the following standard classification metrics: accuracy, sensitivity, specificity, precision, F1-score, and area under the curve (AUC). Each model’s outputs were compared against the ground truth labels derived from histopathology reports. Statistical significance was tested using McNemar’s test for model comparison, and confidence intervals (95%) were calculated for all primary metrics.
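For reference, the threshold-based metrics listed above follow directly from the four confusion-matrix counts, with malignant treated as the positive class. The sketch below uses the standard textbook definitions; the function name and the example counts are hypothetical and are not taken from the study's results.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary metrics from confusion-matrix counts.

    tp/fn: malignant cases classified correctly/incorrectly;
    tn/fp: benign cases classified correctly/incorrectly.
    Undefined ratios (zero denominator) are returned as NaN.
    """
    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": ratio(tp, tp + fn),   # recall on malignant cases
        "specificity": ratio(tn, tn + fp),   # recall on benign cases
        "precision": ratio(tp, tp + fp),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


# Hypothetical counts for 100 malignant and 100 benign outlines.
print(classification_metrics(tp=95, fp=2, tn=98, fn=5))
```

AUC is omitted because it requires ranked scores rather than a single confusion matrix.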

Ensemble Feature Integration

To improve classification performance and reduce model variance, an ensemble strategy was employed. Deep features from the top 3 CNNs (based on validation accuracy) were concatenated and fed into a meta-classifier—a gradient-boosted decision tree (GBDT). This approach allowed for leveraging the complementary strengths of different architectures while maintaining interpretability.
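Feature-level fusion of this kind amounts to joining, for each sample, the embedding vectors produced by the selected networks into one longer vector. The sketch below shows only that concatenation step with plain Python lists; the function name and the toy vectors are hypothetical, and training the meta-classifier is left to a library of choice (for example, a gradient boosting implementation such as scikit-learn's `GradientBoostingClassifier`, which is an assumption and not confirmed by the source).

```python
def concatenate_features(*feature_sets):
    """Join per-sample feature vectors from several extractors.

    Each argument is a list of feature vectors (one per sample), with
    samples in the same order across extractors. Vector lengths may
    differ between extractors.
    """
    if not feature_sets:
        raise ValueError("at least one feature set is required")
    n_samples = len(feature_sets[0])
    if any(len(fs) != n_samples for fs in feature_sets):
        raise ValueError("all extractors must cover the same samples")
    return [
        [value for fs in feature_sets for value in fs[i]]
        for i in range(n_samples)
    ]


# Toy embeddings for two samples from three hypothetical networks.
net_a = [[0.1, 0.2], [0.3, 0.4]]
net_b = [[1.0], [2.0]]
net_c = [[5.0, 6.0, 7.0], [8.0, 9.0, 10.0]]
fused = concatenate_features(net_a, net_b, net_c)
print(len(fused), len(fused[0]))  # 2 6
```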

Overview of the Proposed Workflow

Figure 1 illustrates the complete pipeline of the proposed framework, providing a clear and sequential overview of the methodological steps. The process begins with the original dataset, consisting of 165 benign and 90 malignant samples. To address class imbalance and enhance generalization capability, class-specific data augmentation is applied. Subsequently, multiple pre-trained convolutional neural networks are utilized as deep feature extractors, enabling robust and discriminative representation learning. The extracted high-dimensional features are then subjected to a feature selection stage to reduce redundancy and improve computational efficiency. Finally, the optimized feature subsets are evaluated using several machine learning classifiers to ensure comprehensive performance assessment and comparative analysis. This structured visualization clarifies the logical progression of the proposed approach and enhances the transparency and reproducibility of the study.
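To make the class-balancing step concrete, the sketch below computes how many additional samples each class would need to reach a common target size. The source does not state the target counts or the augmentation operations; balancing to the majority class size is purely an assumption of this example, as is the function name `augmentation_plan`.

```python
def augmentation_plan(class_counts, target=None):
    """Number of extra samples per class needed to reach `target`.

    class_counts: mapping of class label to original sample count.
    target: desired per-class count; defaults to the largest class
    (an assumption, since the study does not report its target).
    """
    if target is None:
        target = max(class_counts.values())
    plan = {}
    for label, count in class_counts.items():
        if count <= 0:
            raise ValueError(f"class {label!r} has no samples")
        extra = max(0, target - count)
        plan[label] = {
            "extra_samples": extra,
            "factor": (count + extra) / count,  # growth relative to original
        }
    return plan


plan = augmentation_plan({"benign": 165, "malignant": 90})
print(plan["malignant"]["extra_samples"])  # 75
```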

Figure 1.

Class-specific data augmentation, deep feature extraction using pre-trained CNNs, feature selection, and final classification using multiple machine learning models.

Results and Discussion: A Comparative Interpretation Analysis

Interpretation of Tumor Classification Results Using ResNet50 Network

The classification performance of tumor outlines extracted through the ResNet50 convolutional neural network was evaluated using 5 classical machine learning classifiers—Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Decision Tree (DT), Naive Bayes (NB), and Neural Network (NN)—as well as an ensemble of these methods. The outputs, as summarized in Table 2, reflect a comprehensive comparative assessment in terms of accuracy, sensitivity, specificity, and precision.

Table 2.

Tumor Classification Performance Using ResNet50 Deep Features (Mean ± SD; P < .05).

Output/Category	Accuracy	Specificity	Sensitivity	Precision
SVM1	79.04	93.93	64.14	91.36
SVM2	91.16	92.92	89.39	92.67
SVM3	83.58	89.89	77.27	88.43
SVM4	86.61	81.81	91.41	83.41
SVM5	95.95	95.95	95.95	95.95
SVM	87.27 ± 6.56	90.90 ± 5.53	83.63 ± 12.90	90.36 ± 4.73
KNN1	78.53	91.91	65.15	88.96
KNN2	90.15	90.40	89.89	90.35
KNN3	82.07	88.88	75.25	87.13
KNN4	83.08	76.26	89.89	79.11
KNN5	92.42	89.89	94.94	90.38
KNN	85.25 ± 5.82	87.47 ± 6.36	83.03 ± 12.41	87.19 ± 4.71
DT1	77.02	83.83	70.20	81.28
DT2	91.66	90.90	92.42	91.04
DT3	76.76	81.31	72.22	79.44
DT4	81.06	80.30	81.81	80.59
DT5	90.15	89.89	90.40	89.94
DT	83.33 ± 7.14	85.25 ± 4.89	81.41 ± 10.15	84.46 ± 5.56
NB1	82.82	85.85	79.79	84.94
NB2	91.41	95.45	87.37	95.05
NB3	81.31	86.86	75.75	85.22
NB4	88.13	84.84	91.41	85.78
NB5	96.21	98.98	93.43	98.93
NB	87.97 ± 6.14	90.40 ± 6.39	85.55 ± 7.57	89.98 ± 6.55
NN1	75.50	95.45	55.55	92.43
NN2	91.16	92.92	89.39	92.67
NN3	80.05	86.86	73.23	84.79
NN4	82.82	76.26	89.39	79.01
NN5	91.41	89.39	93.43	89.80
NN	84.19 ± 6.99	88.18 ± 7.43	80.20 ± 15.81	87.74 ± 5.82
Ensemble1	79.79	94.94	64.64	92.75
Ensemble2	92.17	94.94	89.39	94.65
Ensemble3	84.09	90.90	77.27	89.47
Ensemble4	86.86	82.82	90.90	84.11
Ensemble5	95.95	97.47	94.44	97.39
Ensemble	87.77 ± 6.41	92.22 ± 5.75	83.33 ± 12.28	91.67 ± 5.12

SVM demonstrated robust overall performance, with an average accuracy of 87.27% ± 6.56, and high specificity (90.90%) and precision (90.36%). The best individual result was observed in SVM5, which achieved an outstanding accuracy, sensitivity, specificity, and precision all at 95.95%, indicating excellent consistency in classification. The high precision across all variants implies reliable identification of true positives with minimal false alarms.
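The summary rows in Table 2 are consistent with the mean and sample standard deviation (n − 1 denominator) of the five per-output values. A short check using Python's standard library and the SVM1 to SVM5 accuracy values from the table reproduces the reported 87.27 ± 6.56; the helper name `summarize` is illustrative.

```python
from statistics import mean, stdev


def summarize(values, digits=2):
    """Return (mean, sample SD) of per-output scores, rounded."""
    return round(mean(values), digits), round(stdev(values), digits)


# Accuracy of SVM1..SVM5 with ResNet50 features, copied from Table 2.
svm_accuracy = [79.04, 91.16, 83.58, 86.61, 95.95]
print(summarize(svm_accuracy))  # (87.27, 6.56)
```

The same helper can be applied to any other column to check a summary row against its five outputs.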

The KNN classifier achieved slightly lower results than SVM, with an overall accuracy of 85.25% ± 5.82. Although the specificity (87.47%) and sensitivity (83.03%) remained competitive, the wide standard deviation in sensitivity (±12.41) suggests that KNN’s performance varied substantially across different test folds. Nevertheless, KNN5 came close to SVM5, delivering 92.42% accuracy and 94.94% sensitivity (versus 95.95% for both in SVM5), demonstrating potential for strong performance when hyperparameters are optimized.

The DT classifier had a moderate performance profile, with accuracy averaging 83.33% ± 7.14, the lowest among the classifiers. Although DT2 and DT5 reached accuracy values above 90%, the variability across the DT configurations lowered the mean. The precision of 84.46% ± 5.56 indicates a higher rate of false positives, especially in DT1 and DT3.

NB exhibited competitive results, achieving an average accuracy of 87.97% ± 6.14, sensitivity of 85.55% ± 7.57, and precision of 89.98% ± 6.55. The highest single output was from NB5, with a near-perfect accuracy (96.21%), specificity (98.98%), and precision (98.93%). These values indicate NB’s strong probabilistic modeling capability when paired with deep feature extraction, although the performance depends on data distribution assumptions.

The standalone NN models yielded an overall accuracy of 84.19% ± 6.99. While the specificity (88.18%) and precision (87.74%) were relatively consistent, the sensitivity dropped to 80.20% ± 15.81, reflecting potential limitations in detecting true positives in some configurations, notably NN1. However, NN5 and NN2 delivered excellent performance near the top end, confirming that optimization plays a crucial role in NN success.

The ensemble approach, integrating the outputs of the above classifiers, resulted in consistent and robust performance. The average accuracy reached 87.77% ± 6.41, while specificity (92.22% ± 5.75) and precision (91.67% ± 5.12) were the highest among all methods, with statistical significance (P < .05) over DT. Ensemble5 demonstrated near-optimal performance with an accuracy of 95.95%, a specificity of 97.47%, and a precision of 97.39%, close to but slightly below NB5 (96.21%, 98.98%, and 98.93%, respectively).

These results validate the advantage of ensemble learning in aggregating the strengths and minimizing the weaknesses of individual classifiers. Despite the moderate variance in sensitivity, the ensemble model shows a well-balanced and reliable classification behavior when paired with deep ResNet50 features.

Interpretation of Tumor Classification Results Using Xception65 Network

The results presented in Table 3 demonstrate the effectiveness of the Xception65-based feature extraction in conjunction with various classification algorithms for distinguishing between benign and malignant tumors in mammography images. Among all classifiers, SVM and ensemble learning methods consistently outperformed others in terms of specificity and precision, indicating their superior capability in correctly identifying non-malignant cases and reducing false positives.

Table 3.

Tumor Classification Performance Using Xception65 Deep Features (Mean ± SD; P < .05).

Output/Category	Accuracy	Specificity	Sensitivity	Precision
SVM1	74.49	91.41	57.57	87.02
SVM2	92.67	94.44	90.90	94.24
SVM3	83.58	89.89	77.27	88.43
SVM4	85.10	84.84	85.35	84.92
SVM5	96.71	96.96	96.46	96.95
SVM	86.51 ± 8.62	91.51 ± 4.62	81.51 ± 15.15	90.31 ± 5.07
KNN1	73.98	92.42	55.55	88.00
KNN2	89.39	88.88	89.89	89.00
KNN3	82.07	89.39	74.74	87.57
KNN4	82.07	80.30	83.83	80.97
KNN5	93.68	91.91	95.45	92.19
KNN	84.24 ± 7.59	88.58 ± 4.88	79.89 ± 15.63	87.54 ± 4.10
DT1	70.45	80.30	60.60	75.47
DT2	82.57	83.83	81.31	83.41
DT3	78.28	88.88	67.67	85.89
DT4	78.28	74.74	81.81	76.41
DT5	92.67	89.39	95.95	90.04
DT	80.45 ± 8.11	83.43 ± 6.14	77.47 ± 13.75	82.25 ± 6.23
NB1	76.76	78.28	75.25	77.60
NB2	88.88	95.45	82.32	94.76
NB3	81.06	87.87	74.24	85.96
NB4	83.83	84.34	83.33	84.18
NB5	97.97	99.49	96.46	99.47
NB	85.70 ± 8.15	89.09 ± 8.51	82.32 ± 8.89	88.39 ± 8.71
NN1	73.48	92.42	54.54	87.80
NN2	87.37	84.84	89.89	85.57
NN3	83.08	84.34	81.81	83.93
NN4	80.80	78.28	83.33	79.32
NN5	92.42	90.90	93.93	91.17
NN	83.43 ± 7.11	86.16 ± 5.67	80.70 ± 15.43	85.56 ± 4.42
Ensemble1	76.26	93.93	58.58	90.62
Ensemble2	91.16	93.93	88.38	93.58
Ensemble3	86.61	93.93	79.29	92.89
Ensemble4	83.83	84.84	82.82	84.53
Ensemble5	97.47	97.97	96.96	97.95
Ensemble	87.07 ± 7.94	92.92 ± 4.84	81.21 ± 14.30	91.92 ± 4.91

Specifically, the Ensemble classifier achieved the highest mean accuracy (87.07% ± 7.94), specificity (92.92% ± 4.84), and precision (91.92% ± 4.91), although its mean sensitivity (81.21% ± 14.30) was marginally below that of NB (82.32% ± 8.89) and SVM (81.51% ± 15.15). This high specificity and precision suggest that ensemble methods, which combine predictions from multiple models, are particularly effective in handling the complex patterns present in tumor outlines. The performance of the SVM classifier was also notably strong, with an average accuracy of 86.51% and similarly high specificity (91.51%) and precision (90.31%), reinforcing its robustness for binary classification tasks in medical imaging.

In contrast, DT classifiers exhibited the lowest performance across all metrics, particularly in specificity and precision. This indicates a higher likelihood of false positives and less reliable generalization, possibly due to overfitting tendencies or limitations in capturing subtle nonlinear relationships within the extracted features.

Other classifiers, such as KNN, NB, and NNs, showed intermediate performance. While they offered reasonably balanced sensitivity and specificity, they were generally outperformed by SVM and ensemble approaches in both consistency and magnitude of classification metrics.

Overall, these findings highlight the importance of selecting advanced ensemble or kernel-based classifiers when leveraging deep features from Xception65 for mammographic tumor analysis. The statistical differences observed (P < .05), particularly between DT and ensemble/SVM methods, underscore the significance of model selection in improving diagnostic performance.
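Pairwise comparisons of two classifiers on the same test cases are commonly made with McNemar's test, which uses only the discordant cases (those that one model classifies correctly and the other does not). The source does not say which variant was applied; the sketch below implements the exact two-sided binomial form as one reasonable option, with hypothetical counts.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: cases model A classified correctly and model B incorrectly.
    c: cases model B classified correctly and model A incorrectly.
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    if b < 0 or c < 0:
        raise ValueError("counts must be non-negative")
    n = b + c
    if n == 0:
        return 1.0  # no discordant cases: no evidence of a difference
    k = min(b, c)
    # Probability of a split at least as extreme as the observed one.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts between two classifiers.
print(mcnemar_exact(1, 9) < 0.05)  # True
print(mcnemar_exact(5, 5))         # 1.0
```

With many discordant pairs, the chi-square approximation with continuity correction is often used instead; the exact form is preferred when the counts are small.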

Interpretation of Tumor Classification Results Using VGG16 Network

The classification results obtained from the VGG16 network, as shown in Table 4, reveal a consistent and stable performance across all classifiers. The performance metrics (accuracy, specificity, sensitivity, and precision) demonstrated no statistically significant differences among the classifiers at the .05 significance level, suggesting that the VGG16-extracted features provided a balanced representation of tumor characteristics suitable for a wide range of classification algorithms.

Table 4.

Tumor Classification Performance Using VGG16 Deep Features (Mean ± SD; P < .05).

Output/Category	Accuracy	Specificity	Sensitivity	Precision
SVM1	74.74	95.45	54.04	92.24
SVM2	90.40	94.44	86.36	93.95
SVM3	83.83	84.84	82.82	84.53
SVM4	89.39	84.34	94.44	85.77
SVM5	96.96	96.96	96.96	96.96
SVM	87.07 ± 8.32	91.21 ± 6.11	82.92 ± 17.14	90.69 ± 5.35
KNN1	73.48	91.91	55.05	87.20
KNN2	88.38	88.88	87.87	88.77
KNN3	85.85	91.41	80.30	90.34
KNN4	83.33	75.75	90.90	78.94
KNN5	91.66	91.91	91.41	91.87
KNN	84.54 ± 6.91	87.97 ± 6.95	81.11 ± 15.23	87.42 ± 5.05
DT1	73.48	91.41	55.55	86.61
DT2	88.38	89.39	87.37	89.17
DT3	80.05	78.78	81.31	79.31
DT4	84.84	79.29	90.40	81.36
DT5	91.41	87.37	95.45	88.31
DT	83.63 ± 7.08	85.25 ± 5.85	82.02 ± 15.66	84.95 ± 4.37
NB1	74.49	78.78	70.20	76.79
NB2	86.36	91.91	80.80	90.90
NB3	80.30	83.83	76.76	82.60
NB4	86.11	83.83	88.38	84.54
NB5	95.95	95.95	95.95	95.95
NB	84.64 ± 7.98	86.86 ± 6.92	82.42 ± 10.03	86.16 ± 7.44
NN1	70.95	94.44	47.47	89.52
NN2	89.39	92.92	85.85	92.39
NN3	80.80	79.29	82.32	79.90
NN4	85.10	78.78	91.41	81.16
NN5	93.68	91.41	95.95	91.78
NN	83.98 ± 8.73	87.37 ± 7.68	80.60 ± 19.24	86.95 ± 5.97
Ensemble1	74.74	96.46	53.03	93.75
Ensemble2	90.40	95.45	85.35	94.94
Ensemble3	84.09	85.85	82.32	85.34
Ensemble4	88.13	83.83	92.42	85.11
Ensemble5	96.96	97.47	96.46	97.44
Ensemble	86.86 ± 8.23	91.81 ± 6.44	81.91 ± 17.09	91.31 ± 5.72

The SVM classifier achieved the highest mean accuracy (87.07% ± 8.32) and sensitivity (82.92% ± 17.14), with a specificity of 91.21% ± 6.11 and a precision of 90.69% ± 5.35; on the latter two metrics it was marginally below the Ensemble. These results indicate SVM’s strength in distinguishing between benign and malignant tumors using the relatively shallow but finely tuned hierarchical features from VGG16.

The Ensemble method followed closely with comparable performance—accuracy of 86.86% ± 8.23 and specificity of 91.81% ± 6.44—while maintaining high precision (91.31% ± 5.72). This suggests that ensemble learning, which integrates predictions from multiple base learners, benefits from the moderate-depth features of VGG16 by enhancing generalization and minimizing classification variance.

Classifiers like KNN and NB also performed reasonably well, with accuracies of 84.54% ± 6.91 and 84.64% ± 7.98, respectively. These results demonstrate the robustness of VGG16 features in supporting both distance-based and probabilistic classification schemes. Interestingly, DT and NN classifiers yielded slightly lower performance in precision and specificity, potentially due to overfitting or their sensitivity to the limited dataset size and feature redundancy.

Another notable observation is the overall balance in mean sensitivity across classifiers, which hovered around 80% to 83% in most cases. This balance is crucial in medical diagnosis, where minimizing false negatives (ie, failing to detect malignancies) is of paramount importance. The SVM5 and Ensemble5 configurations exhibited near-perfect classification, with SVM5 at 96.96% on all four metrics and Ensemble5 between 96.46% and 97.47%, reflecting the model’s ability to capture and generalize highly discriminative features when properly configured.

In summary, the VGG16 network demonstrated stable and competitive performance across diverse classifiers. The similar mean metrics across classifiers suggest its robustness and general compatibility with various classification paradigms, making it a reliable feature extractor in mammographic tumor analysis. These findings reinforce the utility of moderate-depth convolutional networks like VGG16 in medical imaging applications, particularly where model interpretability and stability are prioritized.

Interpretation of Tumor Classification Results Using AlexNet Network

The results of tumor classification using features extracted by the AlexNet architecture are presented in Table 5. Across the various classifiers, the model demonstrated robust performance, with classification metrics showing relatively low variance. Among individual classifiers, SVM and Ensemble models achieved the highest overall performance, highlighting the strength of deep feature representations derived from AlexNet.

Table 5.

Tumor Classification Performance Using AlexNet Deep Features (Mean ± SD; P < .05).

Output/Category	Accuracy	Specificity	Sensitivity	Precision
SVM1	74.49	87.87	61.11	83.44
SVM2	91.16	95.45	86.86	95.02
SVM3	83.08	88.88	77.27	87.42
SVM4	84.59	79.79	89.39	81.56
SVM5	95.20	93.93	96.46	94.08
SVM	85.70 ± 7.79	89.19 ± 6.16	82.22 ± 13.66	88.31 ± 6.09
KNN1	72.47	86.36	58.58	81.11
KNN2	90.15	91.41	88.88	91.19
KNN3	78.28	86.86	69.69	84.14
KNN4	81.56	76.26	86.86	78.53
KNN5	92.17	90.40	93.93	90.73
KNN	82.92 ± 8.22	86.26 ± 6.00	79.59 ± 14.87	85.14 ± 5.67
DT1	75.50	82.82	68.18	79.88
DT2	87.87	92.92	82.82	92.13
DT3	77.52	80.80	74.24	79.45
DT4	79.04	72.72	85.35	75.78
DT5	89.14	88.88	89.39	88.94
DT	81.81 ± 6.25	83.63 ± 7.77	80.00 ± 8.63	83.24 ± 6.94
NB1	80.05	82.32	77.77	81.48
NB2	89.89	94.94	84.84	94.38
NB3	77.77	88.38	67.17	85.25
NB4	83.08	78.78	87.37	80.46
NB5	96.46	98.48	94.44	98.42
NB	85.45 ± 7.66	88.58 ± 8.27	82.32 ± 10.36	88.00 ± 8.00
NN1	76.26	89.39	63.13	85.61
NN2	88.63	91.91	85.35	91.35
NN3	82.07	86.86	77.27	85.47
NN4	79.04	75.75	82.32	77.25
NN5	92.67	89.89	95.45	90.43
NN	83.73 ± 6.79	86.76 ± 6.41	80.70 ± 11.86	86.02 ± 5.60
Ensemble1	78.78	90.40	67.17	87.50
Ensemble2	91.66	95.45	87.87	95.08
Ensemble3	82.07	88.38	75.75	86.70
Ensemble4	83.58	78.28	88.88	80.36
Ensemble5	96.96	96.46	97.47	96.50
Ensemble	86.61 ± 7.48	89.79 ± 7.27	83.43 ± 11.94	89.23 ± 6.62

The SVM classifier yielded an average accuracy of 85.70 ± 7.79%, with specificity of 89.19 ± 6.16%, sensitivity of 82.22 ± 13.66%, and precision of 88.31 ± 6.09%, indicating balanced performance in detecting both benign and malignant tumors. Notably, SVM5 was the strongest SVM output (accuracy of 95.20%, sensitivity 96.46%, precision 94.08%), suggesting its suitability for high-confidence classification tasks.

The KNN classifier exhibited slightly lower performance with an average accuracy of 82.92 ± 8.22%, although it maintained relatively good sensitivity (79.59 ± 14.87%) and precision (85.14 ± 5.67%), which may still make it valuable in clinical settings where interpretability is prioritized.

The DT classifier had a modest average accuracy of 81.81 ± 6.25%, while the NB classifier (85.45 ± 7.66%) performed comparably to SVM, with NB5 achieving near-perfect classification (accuracy 96.46%, specificity 98.48%, precision 98.42%), though this might indicate overfitting or data-specific bias.

The NN model also performed well (accuracy 83.73 ± 6.79%), and NN5 stood out with 92.67% accuracy and 95.45% sensitivity, affirming its capacity to detect tumors with high recall.

The Ensemble approach, combining predictions from multiple classifiers, showed strong and stable results: average accuracy 86.61 ± 7.48%, specificity 89.79 ± 7.27%, sensitivity 83.43 ± 11.94%, and precision 89.23 ± 6.62%. These results further validate the benefit of integrating diverse classifiers for improved generalization and robustness.

Overall, AlexNet features proved effective for tumor classification, with SVM, Ensemble, and NB classifiers leveraging these features most successfully. However, caution is advised regarding the slight variance in sensitivity among classifiers, which could impact the detection of malignant cases if not properly calibrated in clinical deployment.

Interpretation of Tumor Classification Results Using DenseNet Network

The tumor classification results based on DenseNet feature extraction demonstrate notable performance across all machine learning classifiers (Table 6). The Ensemble classifier achieved the highest overall accuracy (87.72 ± 7.46%), specificity (91.91 ± 5.85%), sensitivity (83.53 ± 14.24%), and precision (91.34 ± 5.32%). This consistent performance indicates DenseNet’s ability to extract highly informative and generalizable features from tumor outlines, which contributes to superior classification outcomes when coupled with ensemble-based decision strategies.

Table 6.

Tumor Classification Performance Using DenseNet Deep Features (Mean ± SD; P < .05).

Output/Category	Accuracy	Specificity	Sensitivity	Precision
SVM1	76.26	95.45	57.07	92.62
SVM2	90.90	92.42	89.39	92.18
SVM3	81.56	85.85	77.27	84.53
SVM4	85.85	79.79	91.91	81.98
SVM5	97.72	97.47	97.97	97.48
SVM	86.46 ± 8.29	90.20 ± 7.29	82.72 ± 16.20	89.76 ± 6.35
KNN1	77.77	92.42	63.13	89.28
KNN2	88.13	87.87	88.38	87.93
KNN3	79.29	89.39	69.19	86.70
KNN4	81.56	77.27	85.85	79.06
KNN5	91.91	89.89	93.93	90.29
KNN	83.73 ± 6.05	87.37 ± 5.88	80.10 ± 13.23	86.65 ± 4.46
DT1	73.48	90.90	56.06	86.04
DT2	87.62	85.85	89.39	86.34
DT3	78.53	79.29	77.77	78.97
DT4	85.35	80.80	89.89	82.40
DT5	95.70	93.43	97.97	93.71
DT	84.14 ± 8.55	86.06 ± 6.15	82.22 ± 16.30	85.49 ± 5.49
NB1	82.57	95.45	69.69	93.87
NB2	89.39	92.92	85.85	92.39
NB3	83.08	90.40	75.75	88.75
NB4	86.11	81.31	90.90	82.94
NB5	96.71	98.48	94.94	98.42
NB	87.57 ± 5.79	91.71 ± 6.54	83.43 ± 10.51	91.28 ± 5.81
NN1	76.26	96.46	56.06	94.06
NN2	89.39	88.88	89.89	89.00
NN3	79.04	84.84	73.23	82.85
NN4	84.59	80.30	88.88	81.86
NN5	94.44	93.93	94.94	94.00
NN	84.74 ± 7.42	88.88 ± 6.57	80.60 ± 15.95	88.35 ± 5.86
Ensemble1	78.53	94.94	62.12	92.48
Ensemble2	91.16	92.42	89.89	92.22
Ensemble3	83.08	89.89	76.26	88.30
Ensemble4	87.87	83.33	92.42	84.72
Ensemble5	97.97	98.98	96.96	98.96
Ensemble	87.72 ± 7.46	91.91 ± 5.85	83.53 ± 14.24	91.34 ± 5.32
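
The summary rows in Table 6 can be reproduced from the five per-run values. The sketch below is a minimal check in Python, assuming the reported SD is the sample standard deviation (n − 1 denominator); that is an inference from the numbers rather than something stated in this section, and the variable names are illustrative.

```python
from statistics import mean, stdev

# Per-run accuracies (%) taken from rows SVM1..SVM5 of Table 6.
svm_accuracy = [76.26, 90.90, 81.56, 85.85, 97.72]

# Assumption: the table reports the sample SD (n - 1 denominator).
avg = mean(svm_accuracy)
sd = stdev(svm_accuracy)

print(f"SVM accuracy: {avg:.2f} ± {sd:.2f}")
```

Under this assumption the result agrees with the SVM summary row (86.46 ± 8.29), and the same check can be applied to any other classifier or metric in the table.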

Among the individual classifiers, SVM also yielded robust results, with a mean accuracy of 86.46 ± 8.29% and precision of 89.76 ± 6.35%, suggesting that the decision margins constructed by the SVM align well with DenseNet-derived features. Furthermore, SVM5 reached the highest accuracy of any individual (non-ensemble) classifier run (97.72%), second only to Ensemble5 (97.97%), further supporting the strength of this pairing.

The NB classifier ranked just behind the Ensemble classifier, with a mean accuracy of 87.57 ± 5.79% and high precision (91.28 ± 5.81%). The simplicity of NB, combined with DenseNet features that may approximate Gaussian distributions, likely contributed to this performance. DT, KNN, and NN classifiers also provided competitive results, albeit with lower mean accuracy and large standard deviations in sensitivity, notably for DT (82.22 ± 16.3%) and NN (80.60 ± 15.95%); SVM sensitivity was similarly variable (82.72 ± 16.2%).

Overall, DenseNet feature extraction demonstrates strong classification potential, with ensemble models offering the best performance trade-off between precision and sensitivity. These findings suggest that DenseNet is highly effective in capturing the structural complexity of tumor outlines for benign versus malignant discrimination.

Interpretation of Tumor Classification Results Using GoogLeNet Network

The tumor classification results derived from GoogLeNet feature extraction demonstrate consistent and competitive performance across all classifiers (Table 7). The average accuracy values for SVM, KNN, DT, NB, NN, and Ensemble classifiers ranged between 80.95% and 85.95%, with NB (85.95 ± 7.88%) and Ensemble (85.55 ± 8.58%) achieving the highest mean accuracies. These results indicate the robustness of GoogLeNet-derived features in distinguishing between benign and malignant tumor outlines.

Table 7.

Tumor Classification Performance Using GoogLeNet Deep Features (Mean ± SD; P < .05).

Output/Category	Accuracy	Specificity	Sensitivity	Precision
SVM1	75.75	93.93	57.57	90.47
SVM2	84.09	80.30	87.87	81.69
SVM3	86.11	92.42	79.79	91.32
SVM4	83.58	76.26	90.90	79.29
SVM5	96.21	95.95	96.46	95.97
SVM	85.15 ± 7.33	87.77 ± 8.87	82.52 ± 15.19	87.75 ± 7
KNN1	67.17	89.39	44.94	80.90
KNN2	85.10	86.36	83.83	86.01
KNN3	80.80	87.37	74.24	85.46
KNN4	81.06	76.26	85.85	78.34
KNN5	90.65	84.84	96.46	86.42
KNN	80.95 ± 8.68	84.84 ± 5.08	77.07 ± 19.62	83.43 ± 3.61
DT1	70.95	85.35	56.56	79.43
DT2	86.11	87.87	84.34	87.43
DT3	82.32	89.39	75.25	87.64
DT4	78.78	69.19	88.38	74.15
DT5	92.67	88.88	96.46	89.67
DT	82.17 ± 8.11	84.14 ± 8.5	80.20 ± 15.26	83.66 ± 6.61
NB1	74.74	93.93	55.55	90.16
NB2	89.89	96.96	82.82	96.47
NB3	86.86	96.46	77.27	95.62
NB4	82.57	73.23	91.91	77.44
NB5	95.70	94.94	96.46	95.02
NB	85.95 ± 7.88	91.11 ± 10.06	80.80 ± 15.99	90.94 ± 7.94
NN1	71.96	91.91	52.02	86.55
NN2	89.39	91.41	87.37	91.05
NN3	83.08	90.40	75.75	88.75
NN4	84.34	81.81	86.86	82.69
NN5	94.94	92.42	97.47	92.78
NN	84.74 ± 8.54	89.59 ± 4.41	79.89 ± 17.37	88.36 ± 3.95
Ensemble1	72.72	92.92	52.52	88.13
Ensemble2	88.13	91.41	84.84	90.81
Ensemble3	86.61	93.43	79.79	92.39
Ensemble4	83.83	77.27	90.40	79.91
Ensemble5	96.46	95.45	97.47	95.54
Ensemble	85.55 ± 8.58	90.10 ± 7.31	81.01 ± 17.23	89.35 ± 5.92

In terms of specificity, NB (91.11 ± 10.06%) and Ensemble (90.10 ± 7.31%) again outperformed the other classifiers, indicating a strong capability in correctly identifying negative (benign) samples. The sensitivity scores, representing true positive rates, varied more widely across classifiers and across runs, with Ensemble (81.01 ± 17.23%) and NB (80.80 ± 15.99%) maintaining a balance between sensitivity and specificity, which is critical in clinical applications to minimize false negatives.

Precision values, which represent the proportion of correctly identified positive instances among all positive predictions, were consistently high across methods, with NB (90.94 ± 7.94%) and Ensemble (89.35 ± 5.92%) performing best. This suggests that GoogLeNet features support reliable positive predictions with relatively few false positives.
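
The four metrics discussed above follow directly from the counts of a binary confusion matrix. The sketch below is a minimal Python illustration, treating malignant as the positive class; the function name is hypothetical, and the counts (198 cases per class) are an assumption chosen so that the ratios line up with row SVM1 of Table 7, not figures stated in the text.

```python
def classification_metrics(tp: int, fn: int, tn: int, fp: int) -> dict[str, float]:
    """Return accuracy, specificity, sensitivity and precision in percent."""
    total = tp + fn + tn + fp
    return {
        "accuracy": 100.0 * (tp + tn) / total,
        "specificity": 100.0 * tn / (tn + fp),  # benign cases correctly identified
        "sensitivity": 100.0 * tp / (tp + fn),  # malignant cases correctly identified
        "precision": 100.0 * tp / (tp + fp),    # correct among predicted malignant
    }


# Hypothetical counts: 198 cases per class is an assumption, not from the text.
metrics = classification_metrics(tp=114, fn=84, tn=186, fp=12)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

With these assumed counts, each computed value lies within 0.01 percentage points of the SVM1 row (75.75, 93.93, 57.57, 90.47). With equal class sizes, accuracy is simply the mean of sensitivity and specificity, which is why a run with low sensitivity can still report moderate accuracy.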

Despite the slight variations, no statistically significant differences were observed among classifiers in terms of the 4 performance metrics (P > .05). Overall, these findings suggest that GoogLeNet is an effective deep architecture for extracting relevant tumor morphology features from outline data. Its integration with ensemble and probabilistic classifiers, such as NB, provides an optimal balance between sensitivity, specificity, and precision, making it suitable for real-world diagnostic support systems.

Interpretation of Tumor Classification Results Using Inception-v3 Network

The classification outcomes obtained through features extracted from the Inception-v3 network reveal a consistently high level of diagnostic performance across various machine learning classifiers (Table 8). Among the evaluated models, the Ensemble method demonstrated superior performance, particularly in terms of specificity and precision. Statistically significant improvements were observed in specificity when comparing the Ensemble and NB classifiers with both DT and NN methods (P < .05). Moreover, the SVM classifier also exhibited significantly higher specificity compared to the DT classifier (P < .05), emphasizing its strong discriminatory power in identifying benign cases.

Table 8.

Tumor Classification Performance Using Inception-v3 Deep Features (Mean ± SD; P < .05).

Output/Category	Accuracy	Specificity	Sensitivity	Precision
SVM1	80.80	93.93	67.67	91.78
SVM2	89.64	93.43	85.85	92.89
SVM3	81.56	91.41	71.71	89.30
SVM4	87.12	82.82	91.41	84.18
SVM5	96.21	93.43	98.98	93.77
SVM	87.07 ± 6.32	91.01 ± 4.68	83.13 ± 13.2	90.39 ± 3.85
KNN1	75.50	91.41	59.59	87.40
KNN2	84.84	86.86	82.82	86.31
KNN3	79.04	85.85	72.22	83.62
KNN4	84.34	82.32	86.36	83.00
KNN5	90.65	86.36	94.94	87.44
KNN	82.87 ± 5.82	86.56 ± 3.25	79.19 ± 13.65	85.56 ± 2.1
DT1	74.49	87.87	61.11	83.44
DT2	82.07	82.82	81.31	82.56
DT3	77.52	84.34	70.70	81.87
DT4	83.33	79.29	87.37	80.84
DT5	91.16	87.37	94.94	88.26
DT	81.71 ± 6.36	84.34 ± 3.52	79.09 ± 13.4	83.39 ± 2.88
NB1	79.79	91.91	67.67	89.33
NB2	86.86	92.42	81.31	91.47
NB3	76.51	87.87	65.15	84.31
NB4	88.38	88.88	87.87	88.77
NB5	97.22	97.97	96.46	97.94
NB	85.75 ± 8.07	91.81 ± 3.95	79.69 ± 13.29	90.36 ± 4.97
NN1	78.03	91.91	64.14	88.81
NN2	85.35	87.37	83.33	86.84
NN3	77.27	82.82	71.71	80.68
NN4	84.09	79.29	88.88	81.10
NN5	93.93	89.89	97.97	90.65
NN	83.73 ± 6.73	86.26 ± 5.17	81.21 ± 13.47	85.61 ± 4.52
Ensemble1	79.79	94.94	64.64	92.75
Ensemble2	88.88	93.43	84.34	92.77
Ensemble3	80.80	90.40	71.21	88.12
Ensemble4	87.62	84.84	90.40	85.64
Ensemble5	97.22	95.95	98.48	96.05
Ensemble	86.86 ± 7.05	91.91 ± 4.47	81.81 ± 13.83	91.07 ± 4.14

In terms of precision, the Ensemble approach significantly outperformed NN, KNN, and DT models (P < .05), suggesting its robustness in reducing false positives. Furthermore, both SVM and NB classifiers showed significantly higher precision values relative to DT (P < .05), indicating their better reliability in positive prediction accuracy.

Interestingly, no statistically significant differences were observed across the classifiers regarding overall accuracy and sensitivity, suggesting that most models performed comparably well in correctly identifying both tumor classes. The Ensemble classifier achieved the highest mean precision (91.07 ± 4.14%) and the highest specificity (91.91 ± 4.47%), while also maintaining a balanced sensitivity (81.81 ± 13.83%), reinforcing its effectiveness as a comprehensive classifier. Overall, the results underscore the potential of the Inception-v3 network as a robust feature extractor in mammographic tumor classification, particularly when coupled with ensemble techniques that enhance classification reliability and reduce misdiagnosis.

Interpretation of Tumor Classification Results Using Feature Ensemble

The analysis of tumor classification performance using the feature Ensemble revealed a consistent and relatively high level of performance across all classification models (Table 9). However, no statistically significant differences were observed among the methods in terms of accuracy, specificity, sensitivity, or precision (P > .05). This finding indicates that, unlike the results obtained from individual feature extractors (eg, Xception65 or Inception-v3), the Ensemble feature representation produced a more balanced feature space, reducing variability among the downstream classifiers.

Table 9.

Tumor Classification Performance Using Feature Ensemble (Mean ± SD; P < .05).

Output/Category	Accuracy	Specificity	Sensitivity	Precision
SVM1	72.97	92.42	53.53	87.60
SVM2	91.91	95.95	87.87	95.60
SVM3	80.30	83.83	76.76	82.60
SVM4	85.35	80.80	89.89	82.40
SVM5	95.70	96.46	94.94	96.41
SVM	85.25 ± 9.07	89.89 ± 7.17	80.60 ± 16.53	88.92 ± 6.8
KNN1	71.71	92.42	51.01	87.06
KNN2	90.90	91.41	90.40	91.32
KNN3	81.06	90.90	71.21	88.67
KNN4	82.82	80.80	84.84	81.55
KNN5	92.42	88.88	95.95	89.62
KNN	83.78 ± 8.36	88.88 ± 4.7	78.68 ± 17.99	87.65 ± 3.74
DT1	79.04	90.90	67.17	88.07
DT2	89.39	88.88	89.89	89.00
DT3	79.29	84.34	74.24	82.58
DT4	80.55	70.70	90.40	75.52
DT5	93.18	92.92	93.43	92.96
DT	84.29 ± 6.55	85.55 ± 8.89	83.03 ± 11.6	85.63 ± 6.76
NB1	82.32	92.92	71.71	91.02
NB2	88.88	94.44	83.33	93.75
NB3	85.35	93.43	77.27	92.16
NB4	86.86	82.32	91.41	83.79
NB5	97.22	99.49	94.94	99.47
NB	88.13 ± 5.62	92.52 ± 6.27	83.73 ± 9.63	92.04 ± 5.64
NN1	69.69	91.91	47.47	85.45
NN2	91.16	94.94	87.37	94.53
NN3	75.50	76.76	74.24	76.16
NN4	84.09	77.77	90.40	80.26
NN5	92.17	90.90	93.43	91.13
NN	82.52 ± 9.8	86.46 ± 8.53	78.58 ± 18.87	85.51 ± 7.55
Ensemble1	73.98	93.43	54.54	89.25
Ensemble2	92.17	96.46	87.87	96.13
Ensemble3	83.58	89.89	77.27	88.43
Ensemble4	86.36	80.80	91.91	82.72
Ensemble5	96.96	98.48	95.45	98.43
Ensemble	86.61 ± 8.76	91.81 ± 6.96	81.41 ± 16.49	90.99 ± 6.32

Among all classifiers, the Naïve Bayes (NB) model achieved the highest mean accuracy (88.13 ± 5.62%), specificity (92.52 ± 6.27%), and precision (92.04 ± 5.64%), suggesting its suitability for scenarios requiring a high true negative rate and positive predictive value. NB also achieved the highest mean sensitivity (83.73 ± 9.63%), reinforcing its potential for identifying malignant cases without substantially increasing the false positive rate.

The SVM and Ensemble classifiers followed closely in performance, achieving mean accuracies of 85.25% and 86.61%, respectively. Their specificity and precision were also comparable to NB, supporting their robustness across balanced datasets. KNN and NN models demonstrated slightly lower sensitivity (78.68% and 78.58%, respectively), indicating a potential challenge in detecting certain malignant tumors when using those classifiers.

Notably, DT models maintained a balanced trade-off between metrics but underperformed relative to NB and Ensemble methods. The standard deviations for accuracy, specificity, and precision were moderate (3.74 to 9.8 percentage points across classifiers), supporting the stability of the feature ensemble approach, although sensitivity remained considerably more variable between trials (SD of 9.63 to 18.87 percentage points).

Overall, the Ensemble feature method appears to neutralize outlier effects and reduce classifier dependency, offering a generalized and reliable representation for tumor classification tasks. Its consistent performance across diverse machine learning models makes it an effective choice for robust, real-world deployment in clinical decision support systems.
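
One common way to build a feature-level ensemble is to concatenate the feature vectors that each network produces for the same outline. The sketch below illustrates that idea in plain Python. Concatenation is an assumption made for illustration, since this section does not restate the fusion rule, and the function name and toy values are hypothetical.

```python
def concatenate_features(feature_sets: list[list[list[float]]]) -> list[list[float]]:
    """Join per-extractor feature vectors sample by sample.

    feature_sets[k][i] is the vector that extractor k produced for sample i.
    """
    n_samples = len(feature_sets[0])
    if any(len(fs) != n_samples for fs in feature_sets):
        raise ValueError("every extractor must cover the same samples")
    fused = []
    for i in range(n_samples):
        row: list[float] = []
        for fs in feature_sets:
            row.extend(fs[i])  # append this extractor's features for sample i
        fused.append(row)
    return fused


# Toy vectors standing in for deep features of two outlines (hypothetical values).
extractor_a = [[0.1, 0.2], [0.3, 0.4]]
extractor_b = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
fused = concatenate_features([extractor_a, extractor_b])
print(fused)
```

Each fused row keeps one entry per original feature, so every downstream classifier sees the same combined representation, which is consistent with the reduced classifier dependency described above.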

Comparative Summary of Tumor Classification Models

In this comprehensive analysis of 7 deep learning architectures and one feature-level ensemble method for breast tumor classification, several performance insights emerge based on key evaluation metrics:

Among all models, Xception65 and the Ensemble approach demonstrated the highest mean accuracy values of 88.13% and 86.61%, respectively. These models effectively classified both benign and malignant cases. ResNet50 and Inception-v3 also showed competitive accuracy, while AlexNet and VGG16 recorded the lowest performance, indicating their relatively limited representational capacity in this task.

In terms of specificity, which reflects the model’s ability to correctly identify benign cases and avoid false positives, the Ensemble model (91.81%) and Xception65 (92.52%) outperformed other networks. High specificity is critical in clinical settings to prevent unnecessary anxiety or interventions for healthy individuals.

Sensitivity measures the model’s ability to correctly identify malignant tumors. Once again, Ensemble, Xception65, and Inception-v3 delivered strong sensitivity values, while AlexNet and VGG16 lagged behind. These results underline the clinical reliability of advanced models in minimizing false negatives.

Precision indicates the proportion of correctly identified malignant cases out of all predicted malignant cases. Xception65 (92.04%) and the Ensemble model (90.99%) led this category, suggesting that their false-positive rates for malignant prediction were low, vital for reducing overtreatment.

The Ensemble method, which aggregates multiple traditional classifiers (SVM, KNN, DT, NB, and NN), offered a balanced and robust performance across all metrics. Its strength lies in combining diverse algorithms, reducing variance, and enhancing generalization. Notably, Xception65, with its deep and complex convolutional architecture, showed performance metrics very close to the Ensemble, making it a strong standalone candidate for tumor diagnosis in clinical applications.

The heatmap in Figure 2 provides a visual benchmark of these models across all key indices, facilitating quick identification of the best-performing classifiers for practical deployment. It is intended as a compact, end-to-end summary of the comparative performance of the evaluated pipelines across the 4 key clinical metrics (accuracy, specificity, sensitivity, and precision). For each deep feature extractor (ResNet50, Xception65, VGG16, AlexNet, DenseNet, GoogLeNet, and Inception-v3), the values reported in the heatmap correspond to the final selected configuration for that feature extractor, that is, the single downstream classifier setting retained for cross-model comparison after the full set of candidate classifiers had been evaluated (SVM, KNN, DT, NB, NN, and the ensemble-of-classifiers strategy). Importantly, each heatmap cell reports the metric values obtained from the same selected configuration, ensuring that accuracy, specificity, sensitivity, and precision remain internally consistent and directly comparable across rows. The selection criterion and reporting rule are stated explicitly in the text and caption, and the percentages displayed in Figure 2 are derived from, and can be cross-checked against, the corresponding results tables for each network (Tables 2-9), thereby enabling transparent tracking of how the key findings and concluding remarks follow from the numerical evidence.

Figure 2 presents the comparative performance of the deep feature extraction models, where all metrics were computed using the support vector machine classifier to ensure methodological consistency and fair comparison across feature sets. By fixing the classifier, the figure isolates the impact of different deep feature representations on classification performance, eliminating variability that could arise from using multiple learning algorithms. This standardized evaluation strategy allows a clearer interpretation of how each convolutional neural network contributes to discriminative power and removes ambiguity about which classification algorithm generated the reported metrics.

Figure 2.

Final remarks: comparative summary of tumor classification models.
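
As a rough illustration of the classifier-level Ensemble mentioned in this summary, the sketch below combines the labels predicted by the five base classifiers through a simple majority vote. The voting rule is an assumption, since the aggregation scheme is not specified in this section, and the function name and predictions are hypothetical.

```python
from collections import Counter


def majority_vote(predictions: dict[str, list[str]]) -> list[str]:
    """Combine per-classifier label lists into one label per sample."""
    names = list(predictions)
    n_samples = len(predictions[names[0]])
    combined = []
    for i in range(n_samples):
        # Count the label each classifier assigned to sample i.
        votes = Counter(predictions[name][i] for name in names)
        combined.append(votes.most_common(1)[0][0])
    return combined


# Hypothetical predictions for three tumor outlines.
predictions = {
    "SVM": ["malignant", "benign", "benign"],
    "KNN": ["malignant", "benign", "malignant"],
    "DT": ["benign", "benign", "malignant"],
    "NB": ["malignant", "benign", "benign"],
    "NN": ["malignant", "malignant", "malignant"],
}
ensemble_labels = majority_vote(predictions)
print(ensemble_labels)
```

With five voters and two classes a tie cannot occur, so no tie-breaking rule is needed in this binary setting.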

Conclusion

This study evaluated the performance of 8 deep learning models—ResNet50, Xception65, VGG16, AlexNet, DenseNet, GoogLeNet, Inception-v3, and a feature-level ensemble method—in the binary classification of breast tumors (benign vs malignant) using mammographic tumor outlines.

Key Findings

The results indicate that the Xception65 network and the feature ensemble approach consistently achieved superior performance across all major metrics, including accuracy, specificity, sensitivity, and precision. Xception65 demonstrated the highest precision (92.04%) and strong balance across metrics, while the ensemble approach showed remarkable stability and generalization, with accuracy reaching 86.61%, specificity 91.81%, sensitivity 81.41%, and precision 90.99%. In contrast, traditional models like VGG16 and AlexNet underperformed, highlighting their limited suitability for this task.

Applications

The findings of this study can directly support the development of automated diagnostic tools in breast cancer screening programs, especially for pre-screening or second-opinion systems in radiology clinics. By relying solely on tumor outlines—a low-dimensional and privacy-preserving representation—these models offer an efficient alternative to full-image processing, suitable for deployment in resource-limited environments.

Limitations

Despite promising results, the study faces several limitations. First, the analysis was based on a dataset of 100 mammographic image folders, which, although publicly available and valuable, may not fully represent the variability in clinical settings. Second, using only tumor outlines may ignore critical texture and contextual information available in the full mammogram. Additionally, the study does not account for real-time clinical constraints such as model interpretability and inference latency.

Future Directions

Future work should explore the integration of hybrid features, combining shape outlines with radiomic and deep texture features, to boost diagnostic robustness. Expanding the dataset and validating models on multi-center clinical data will be essential for generalization. Moreover, incorporating explainable AI (XAI) techniques could increase clinician trust and transparency in decision-making. Finally, deployment-oriented studies that measure runtime efficiency, hardware requirements, and user interfaces for radiologists will be key steps toward real-world clinical adoption.

Beyond achieving competitive classification performance, this study contributes to the growing body of interpretable medical imaging research by explicitly linking model outcomes to contour-driven, numerically defined lesion characteristics. Consistent with the principles of interpretable machine learning, the proposed framework emphasizes that predictive accuracy alone is insufficient unless accompanied by clinically meaningful explanations. By leveraging precisely annotated lesion boundaries from a high-quality mammography dataset, the analysis demonstrates how boundary-based variation indices (contour descriptors) encode diagnostically relevant information related to lesion irregularity and malignancy. This contour-centric perspective enables a transparent interpretation of model behavior, providing insight into why certain classification decisions emerge rather than merely reporting how well they perform. As such, the proposed approach offers a reproducible and explainable pathway for integrating deep learning into breast cancer assessment workflows, bridging the gap between black-box prediction²⁷ and clinically interpretable evidence.²⁸,²⁹

Footnotes

Acknowledgements

The AI tool ChatGPT was used to check the grammar of specific sections of the manuscript.

ORCID iD

Hamidreza Mortazavy Beni

Ethical Considerations

All procedures performed in studies involving human participants were in accordance with the institutional and/or national research committee’s ethical standards and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards.

Author Contributions

The author solely contributed to the conception and design of the study, data analysis, interpretation of results, and preparation and revision of the manuscript.

Funding

The author received no financial support for the research, authorship, and/or publication of this article.

Declaration of Conflicting Interests

The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

Data will be available upon reasonable request.

References

1. World Health Organization (WHO). Breast cancer. 2023. https://www.who.int/news-room/fact-sheets/detail/breast-cancer

2. Feng, Spezia, Huang, et al. Breast cancer development and progression: risk factors, cancer stem cells, signaling pathways, genomics, and molecular pathogenesis. Genes Dis. 2018;5(2):77-106.

3. Devitt JE. The clinical stages of breast cancer–what do they mean? Can Med Assoc J. 1967;97(21):1257-1262.

4. Rangayyan, Nguyen TM. Fractal analysis of contours of breast masses in mammograms. J Digit Imaging. 2007;20:223-237.

5. Hmida, Hamrouni, Solaiman, Boussetta. Breast mass segmentation in mammograms combining fuzzy c-means and active contours. In Proceedings of the Tenth International Conference on Machine Vision (ICMV 2017). SPIE; 2018.

6. Liu, Zhao, et al. Performance optimization of differential evolution with slime mould algorithm for multilevel breast cancer image segmentation. Comput Biol Med. 2021;138:104910.

7. Ranjbarzadeh, Dorosti, Jafarzadeh Ghoushchi, et al. Breast tumor localization and segmentation using machine learning techniques: overview of datasets, findings, and methods. Comput Biol Med. 2023;152:106443.

8. Shamai, Livne, Polónia, et al. Deep learning-based image analysis predicts PD-L1 status from H&E-stained histopathology images in breast cancer. Nat Commun. 2022;13(1):6753.

9. Ashwini, Suguna, Vadivelan. Detection and classification of breast cancer types using VGG16 and ResNet50 deep learning techniques. Int J Electr Comput Eng. 2024;14(5):5481-5488.

10. Ramadan SZ. Using convolutional neural network with cheat sheet and data augmentation to detect breast cancer in mammograms. Comput Math Methods Med. 2020;2020:9523404.

11. Oyelade, Ezugwu AE. A novel wavelet decomposition and transformation convolutional neural network with data augmentation for breast cancer detection using digital mammogram. Sci Rep. 2022;12(1):5913.

12. Razali, Isa, Sulaiman, Osman MK. Optimization of ReLU activation function for deep-learning-based breast cancer classification on mammogram images. In Proceedings of the IEEE International Conference on Automatic Control and Intelligent Systems (I2CACIS). IEEE; 2024.

13. Shovon MSH, Mridha, Hasib, Alfarhood, Safran, Che. Addressing uncertainty in imbalanced histopathology image classification of HER2 breast cancer: an interpretable ensemble approach with threshold filtered single instance evaluation (SIE). IEEE Access. 2023;11:122238-122251.

14. Yan, Huang, Pedrycz, Hirota. Automated breast cancer detection in mammography using ensemble classifier and feature weighting algorithms. Expert Syst Appl. 2023;227:120282.

15. Nagalakshmi. Breast cancer semantic segmentation for accurate breast cancer detection with an ensemble deep neural network. Neural Process Lett. 2022;54(6):5185-5198.

16. Rani, Singh, Virmani. Hybrid computer aided diagnostic system designs for screen film mammograms using DL-based feature extraction and ML-based classifiers. Expert Syst. 2023;40(7):e13309.

17. Babu, Jerome SA. Automatic breast cancer detection using HGMMEM algorithm with DELMA classification. Multimed Tools Appl. 2023;82(17):26771-26795.

18. Al-Fahaidy FAK, Al-Fuhaidi, AL-Darouby, AL-Abady, AL-Qadry, AL-Gamal. A diagnostic model of breast cancer based on digital mammogram images using machine learning techniques. Appl Comput Intell Soft Comput. 2022;2022:1-17.

19. Hamed, El Rahman, Amin, Tolba. Deep learning in breast cancer detection and classification. In Proceedings of the International Conference on Artificial Intelligence and Computer Vision (AICV2020). Springer; 2020.

20. Sakib, Yasmin, Tanzeem, Shorna. Breast cancer detection and classification: A comparative analysis using machine learning algorithms. In Proceedings of the Third International Conference on Communication, Computing and Electronics Systems (ICCCES 2021). Springer; 2022.

21. Yoo, Lee, Kim, Hwang, Park, Kang. Integrating deep learning into CAD/CAE system: Generative design and evaluation of 3D conceptual wheel. Struct Multidiscipl Optim. 2021;64(4):2725-2747.

22. Santos, Ferreira Júnior, Wada, Tenório APM, Barbosa MHN, Marques PMA. Artificial intelligence, machine learning, computer-aided diagnosis, and radiomics: advances in imaging towards to precision medicine. Radiol Bras. 2019;52(6):387-396.

23. Bouzarjomehri, Barzegar, Rostami, Keshavarz, Asghari, Azad ST. Multi-modal classification of breast cancer lesions in digital mammography and contrast enhanced spectral mammography images. Comput Biol Med. 2024;183:109266.

24. Islam, Hasib, Mridha, Alfarhood, Safran, Bhuyan MK. Fusing global context with multiscale context for enhanced breast cancer classification. Sci Rep. 2024;14(1):27358.

25. Saadi, Ranjbarzadeh, Kazemi, et al. Osteolysis: a literature review of basic science and potential computer-based image processing detection methods. Comput Intell Neurosci. 2021;2021:4196241.

26. Loizidou, Skouroumouni, Savvidou, et al. Breast Masses Dataset with Precisely Annotated Sequential Mammograms. Zenodo; 2024. doi:10.5281/zenodo.11446259.

27. Molnar, ed. Interpretable Machine Learning: A Guide for Making Black Box Models Explainable. 3rd ed. 2025. https://christophm.github.io/interpretable-ml-book

28. Mortazavy Beni, Mortazavi, Islam. Thermal/fluid Characteristics of Inline Stacked Plain-Weave Screen as Solar Powered Stirling Engine Heat Regenerators. IET Renewable Power Generation, 2022. doi:10.1049/rpg2.12405

29. Beni, Asaei FY. Benign vs malignant tumors classification from tumor outlines in mammography scans using artificial intelligence techniques. Comput Biol Med. 2025;197(Pt B):111118.

Numerical Interpretation of Variation Indices of Tumor Outline Tracking Using Artificial Intelligence Mapping
