ENAS-B: Combining ENAS With Bayesian Optimization for Automatic Design of Optimal CNN Architectures for Breast Lesion Classification From Ultrasound Images

Abstract

Efficient Neural Architecture Search (ENAS) is a recent development in searching for optimal cell structures for Convolutional Neural Network (CNN) design. It has been successfully used in various applications including ultrasound image classification for breast lesions. However, the existing ENAS approach optimizes only cell structures, not the whole CNN architecture or its trainable hyperparameters. This paper presents a novel framework for automatic design of CNN architectures that combines the strengths of ENAS and Bayesian Optimization in two steps. First, we use ENAS to search for optimal normal and reduction cells. Second, with the optimal cells and a suitable hyperparameter search space, we adopt Bayesian Optimization to find the optimal depth of the network and the optimal configuration of the trainable hyperparameters. To test the validity of the proposed framework, a dataset of 1522 breast lesion ultrasound images is used for the searching and modeling. We then evaluate the robustness of the proposed approach by testing the optimized CNN model on three external datasets consisting of 727 benign and 506 malignant lesion images. We further compare the CNN model with the default ENAS-based CNN model, and then with CNN models based on the state-of-the-art architectures. The results (error rate of no more than 20.6% on internal tests and 17.3% on average of external tests) show that the proposed framework generates robust and light CNN models.

Keywords

breast lesion classification convolutional neural networks efficient neural architecture search Bayesian optimization ultrasound image classification

Introduction

Breast cancer is one of the most common cancer types.¹ It is the second deadliest cancer for women.² Previous studies show that early detection of breast cancers followed by appropriate treatment is responsible for a 38% reduction in mortality rate from 1989 to 2018.¹ Ultrasound (US) imaging is safe and less costly than other imaging modalities such as Magnetic Resonance Imaging (MRI), and is hence widely used in breast cancer diagnosis. The clinical needs as well as technological advances in deep learning have motivated us to develop a new automated recognition approach for classifying breast lesions into benign or malignant types.

In recent years, Computer-Aided Diagnosis (CAD) systems have been applied to medical image analysis including classifying ultrasound images of breast lesions.³ At the same time, deep learning with Convolutional Neural Networks (CNNs) has shown great success in natural image classification, and many CNN architectures such as VGGNet⁴ and GoogLeNet⁵ have been designed. Because of model complexity and the shortage of annotated medical images, most existing research focuses on customizing the existing CNN architectures to the medical images via transfer learning.³ However, such customized CNN models are still inherently large and complex with an increased risk of model overfitting. Attempts have also been made to design CNN architectures specifically for breast lesion classification from US images. An architecture (CNN3) of three convolutional layers followed by batch normalization, ReLU and max pooling was proposed.⁶ Another architecture (CNN4) of four convolutional layers with filters of different sizes and numbers was also reported.⁷ More recently, the Fus2Net⁸ architecture was proposed, consisting of three convolutional layers followed by two consecutive modules, each of which consists of several convolutional layers using filters of different sizes. Despite all the efforts already made in building and customizing CNN architectures for breast lesion image classification via manual designs of the layers and hyperparameters, the need for accurate, robust, and light CNN models remains.

CNN architecture design involves setting many hyperparameters. Manually obtaining the optimal settings for them is challenging and time-consuming.⁹ Therefore, the interest in automatic search for optimal CNN architectures is increasing. Several approaches, such as Genetic Algorithms (GA), Reinforcement Learning (RL) and Bayesian Optimization (BO), have been developed.¹⁰ Neural Architecture Search (NAS) is an RL-based framework,⁹ but it is computationally expensive because the number of architectural options to explore grows exponentially. Efficient Neural Architecture Search (ENAS) overcomes this limitation through weight sharing during the search phase.¹¹ In ENAS, a single CNN known as the Supernet, containing all operations within a search space, is trained, and the generated CNNs share the trained weights of the Supernet. Two types of search space can be used by the RNN controller within the ENAS framework: the macro space, where the controller searches for an entire network, or the micro space, where the controller generates cells containing operations and the connections between them. Evidence shows that the micro search space is more efficient.¹¹

Automatic search of CNN architectures has been attempted for medical images recently. A hybrid NAS framework for classifying and segmenting thyroid cancer from ultrasound images was proposed in Qian et al.¹² ENAS with micro search space was adopted for breast lesion classification from US images.¹³ The generalization gap of ENAS models was further investigated.¹⁴ Nevertheless, the ENAS approach has its own limitations. First, the number of blocks of cells is still determined manually through trials. Secondly, trainable hyperparameters critical for designing effective and efficient CNN architectures are manually set by trials.

This paper addresses these limitations by adopting Bayesian Optimization for optimizing the number of blocks of ENAS cells and trainable hyperparameters. Bayesian Optimization, as an efficient method for optimizing noisy and expensive functions, provides a better approach than other optimizers to model uncertainty and allow exploration and exploitation to be automatically balanced during the search.¹⁰ The paper therefore proposes a novel automatic “end-to-end” CNN design framework by combining ENAS cells with Bayesian Optimization search. To evaluate this framework, the optimized classification model is tested on images captured by US machines of different makes and from different medical centers in different countries. A further comparison is made between our model and state-of-the-art models based on hand-crafted architectures.

Materials and Methods

Data Collection and Preparation

In this study, five datasets of US images of breast lesions were used. Four were collected by our sponsor from three hospitals in Shanghai, China (Pudong New Area People’s Hospital, No.6 Hospital and No.10 Hospital) after ethical approvals by the hospitals. The ground truth for each image (benignity or malignancy) is based on pathology reports. Experienced radiologists from the hospitals manually cropped the region of interest (RoI) for each US image in every dataset. An RoI bounding box image was generated and used as the input image. The fifth is a public domain dataset (BUSI) collected from a hospital in Egypt with associated class labels and cropped lesion areas.¹⁵ All images were captured using US machines of different makes (Siemens, Toshiba, GE, Philips and LOGIQ E9). This research was granted ethics approval by the Research and Ethics Committees of University of Buckingham. The datasets are split into two collections:

(1) Modeling Dataset: Two of the four datasets from two of the three hospitals in China respectively containing 1102 images (726 benign and 376 malignant) and 420 images (278 benign and 142 malignant) were merged into a dataset of 1522 images. This set is used for developing ENAS-B.

(2) External Test Sets: The BUSI dataset (External A) consists of 565 images (355 benign and 210 malignant). The other two datasets (External B and External C) from two of the three hospitals in China respectively consist of 500 images (300 benign and 200 malignant) and 168 images (72 benign and 96 malignant). The three datasets were separately used for testing purposes. Figure 1 shows some examples of US images from the datasets.

Figure 1.

Samples of RoI images for modeling and external test sets.

A new dataset of ultrasound images of breast lesions has recently become available.¹⁶ It consists of 109 images of benign and 123 images of malignant lesions, all of which have been confirmed by histopathologic results including fine needle aspiration, core needle, or open biopsies. After removing the images with artifacts, 207 images (95 benign and 112 malignant) were used as another external test set (External D).

Bayesian Optimization for ENAS-Based Architecture Design

The proposed framework is shown in Figure 2. It consists of three main phases. Phase I is a general preparation of US images including RoI (i.e., the lesion region) cropping, image resizing, and increasing the number of training examples. Phase II is intended to obtain an optimized backbone deep CNN (DCNN) architecture and a set of optimized trainable hyperparameters. Phase III finally uses the optimized architecture and hyperparameters to train a classification model.

Figure 2.

The proposed framework for automatic CNN model designs for breast lesion classification from US images.

Image preparation

The RoI image of the lesion was cropped from the whole US image for accurate recognition. A free-hand cropping tool reported in Zhu et al.¹⁷ was used by the radiologists to identify, collect and store the coordinates of the pixel points on the border of a lesion. A rectangular bounding box was then generated for each lesion by fitting the border points into a minimum area rectangle. The tumor microenvironment (TME) is the cellular environment in which a tumor exists, and it includes various components such as immune cells, blood vessels, fibroblasts, and extracellular matrix.¹⁸ The TME plays a crucial role in cancer progression and can have a significant impact on disease management and diagnosis.¹⁹ Therefore, accurate assessment of the TME in breast cancer plays a pivotal role in disease management when utilizing ultrasound images. Selecting the RoI is a crucial factor in the development of machine learning models for breast cancer classification from ultrasound images.²⁰ Furthermore, integrating information about the TME into these models by adding a margin around the RoI enhances model performance.²¹

Based on the work in Zhu et al.¹⁷ and Hassan et al.,²² a margin of 8% of the lesion width and height was then added to obtain the final cropped RoI image. To satisfy the training requirements of our proposed framework, the cropped RoI images were resized to 100 × 100 pixels.
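The margin step can be sketched as follows. This is a minimal illustration only, not the cropping tool of Zhu et al.: it assumes an axis-aligned bounding box, assumes that the 8% margin is added on each side of the box, and clamps the result to the image borders. The function name `expand_roi` and the example coordinates are hypothetical.

```python
def expand_roi(box, image_size, margin=0.08):
    """Expand a lesion bounding box by a fractional margin, clamped to the image.

    box        -- (left, top, right, bottom) in pixels, right/bottom exclusive
    image_size -- (width, height) of the full US image
    margin     -- fraction of the lesion width/height added on EACH side
                  (assumption: the text does not say whether 8% is per side)
    """
    left, top, right, bottom = box
    img_w, img_h = image_size
    pad_x = round((right - left) * margin)   # horizontal padding in pixels
    pad_y = round((bottom - top) * margin)   # vertical padding in pixels
    return (
        max(0, left - pad_x),
        max(0, top - pad_y),
        min(img_w, right + pad_x),
        min(img_h, bottom + pad_y),
    )


# Hypothetical 200 x 150 pixel lesion box inside a 640 x 480 image.
roi = expand_roi((100, 50, 300, 200), image_size=(640, 480))
print(roi)  # (84, 38, 316, 212)
```

The expanded box would then be cropped and resized to 100 × 100 pixels with any image library.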

Searching and training a complex DCNN also requires large datasets. One way to meet the requirement is to enlarge the training set through data augmentation. Two augmentation methods reported in Zhu et al.¹⁷ were adopted. The geometric method uses image mirroring and rotation (by 90, 180 and 270 degrees), and the singular value decomposition (SVD) method retains 45%, 35%, and 25% of the top singular values respectively. Together, the methods generate seven additional images from one RoI image.
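The geometric part of the augmentation can be illustrated with a small sketch. It is an assumption-laden illustration rather than the authors' code: images are represented as nested lists, the mirror is taken left to right, and rotations are clockwise; neither the mirroring axis nor the rotation direction is stated in the text. The three SVD-based variants are not covered here, so the sketch yields four of the seven additional images.

```python
def mirror(img):
    """Flip the image left to right."""
    return [row[::-1] for row in img]


def rotate90(img):
    """Rotate the image by 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]


def geometric_augment(img):
    """Return the mirrored image and the 90, 180 and 270 degree rotations."""
    r90 = rotate90(img)
    r180 = rotate90(r90)
    r270 = rotate90(r180)
    return [mirror(img), r90, r180, r270]


# Tiny 2 x 2 stand-in for an RoI image.
roi = [[1, 2],
       [3, 4]]
augmented = geometric_augment(roi)
print(len(augmented))  # 4
```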

Automatic search and optimization

Phase II of our framework consists of three stages as shown in Figure 2. At Stage 1, the ENAS method is used to search for the optimal internal structures of normal and reduction cells. At Stage 2, the optimized cells are stacked in a process controlled by the Bayesian Optimization algorithm, creating a sequential layer structure of the cells for the whole network. At Stage 3, Bayesian Optimization is again employed to optimize trainable hyperparameters within the optimized network structure, creating the final optimized DCNN architecture for modeling.

A. Optimal cells search using ENAS

The ENAS micro approach consists of two stages.¹¹ The first stage searches for an optimal pair of Normal (N) and Reduction (R) cells in a pre-defined architecture (i.e., the Supernet) based on validation accuracy. The Supernet consists of a 3 × 3 standard convolution layer named stem conv and seven cells (1N, 1R, 1N, 1R, 3N). The default search operations of ENAS¹¹ were provided to the ENAS controller. We set the mini-batch size to 8 and kept all other hyperparameters at the ENAS defaults.¹⁰ The RNN controller is trained for 150 epochs, and each epoch generates 10 pairs of N and R cells. In the searching stage, the Modeling Dataset (see Section on Data Collection and Preparation) was used under a single split policy (see Section on Results). Figure 3 shows the optimal cell structures found by ENAS on the modeling dataset.

Figure 3.

Example of optimal cell structure (Normal and Reduction cells) generated from our data set.

B. ENAS-B

The proposed ENAS-B search involves three key elements: a backbone architecture, a search space and a search strategy. First, we define a CNN backbone architecture with optimisable structural hyperparameters and their search spaces. Second, we perform automatic architecture search (the first optimization stage) using Bayesian optimization to identify the optimal number of normal cells (or the depth of the architecture) that results in a new architecture called ENAS-B-1. Finally, we use ENAS-B-1 as a backbone architecture, define optimisable training hyperparameters and their search space, and perform automatic architecture search using Bayesian optimization to optimize training hyperparameters. This second optimization stage results in ENAS-B. It is worth noting that the first and the second stages use the same Bayesian optimization algorithm but with different inputs.

Backbone Architecture. We define the backbone architecture ( $B_{A}$ ) as follows. A stem convolution layer, that is, a convolutional layer with 108 filters of size 3 × 3 stride 1 followed by ReLU and batch normalization, is included immediately after the input layer. The architecture then contains several blocks. Each block consists of one or more normal cells and one reduction cell. The output of the final layer (final normal cell) is followed by Global Average Pooling (GAP) for reducing the feature map dimensionality. The final layer consists of two nodes for the two classes followed by SoftMax for classification. Since the reduction cells are used as a pooling layer to reduce the feature map size by half, to control the output size, the backbone architecture in this study has two reduction cells determined by the input image size and the intention to avoid input vanishing. Figure 4 shows the proposed backbone architecture.

Figure 4.

Backbone architecture (B_A) for Bayesian optimization search.

Structural Hyperparameter Search Space. The structural search by Bayesian Optimization aims at utilizing the optimized normal and reduction cells within the backbone architecture ($B_{A}$). Specifically, the Bayesian Optimizer searches for the optimal number of normal cells in each of the three blocks (Block 1 ($d_{1}$), Block 2 ($d_{2}$), Block 3 ($d_{3}$)) in Figure 4. Thus, the structural search space is the number of normal cells per block, $d_{i}$. The search range for each $d_{i}$ is defined as Min = 1, Max = 5 and step = 1. Given this setting, the deepest architecture has 15 normal and 2 reduction cells, while the shallowest has 3 normal and 2 reduction cells. The full details of the Bayesian Optimization algorithm are presented in the following Search Strategy section.
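To make the size of the structural search space concrete, the following sketch enumerates every $(d_{1}, d_{2}, d_{3})$ setting and builds the corresponding cell layout. The helper name `build_cell_sequence` and the string labels "N" and "R" are hypothetical; the layout (normal cells of one block, then a reduction cell before the next block) follows the description of the backbone architecture.

```python
from itertools import product

D_MIN, D_MAX = 1, 5  # search range for each d_i, step = 1


def build_cell_sequence(d1, d2, d3):
    """Cell layout of the backbone: three blocks of normal cells (N)
    separated by the two reduction cells (R)."""
    return ["N"] * d1 + ["R"] + ["N"] * d2 + ["R"] + ["N"] * d3


# Every structural hyperparameter setting (d1, d2, d3).
search_space = list(product(range(D_MIN, D_MAX + 1), repeat=3))

shallowest = build_cell_sequence(1, 1, 1)
deepest = build_cell_sequence(5, 5, 5)

print(len(search_space))  # 125 candidate depth settings
print(shallowest)         # ['N', 'R', 'N', 'R', 'N']
print(len(deepest))       # 17 cells: 15 normal + 2 reduction
```

With only 125 candidate settings, an exhaustive search would require training 125 networks, whereas the Bayesian search evaluates 30.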

Trainable Hyperparameters Search Space. A suitable search space of trainable hyperparameters is needed as the input for the Bayesian optimizer to build the optimal CNN architecture. In this paper, the search space $(L_{r}, Opz, L_{f}, W_{i}, Drp, L_{n}, L_{2r})$ is composed of Learning rate $L_{r}$, Optimization $Opz$, Loss function $L_{f}$, Weight Initialization $W_{i}$, Dropout Rate $Drp$, Layer Normalization $L_{n}$, and L2 regularization $L_{2r}$. Based on the literature¹² and our knowledge in deep learning architecture design, the following values and ranges of hyperparameters for the search space are defined: $L_{r}$: [0.00001, 0.001]; $Opz$: (Adam, SGD, RMSprop)¹²; $L_{f}$: (Sparse Categorical Cross-Entropy (SCCE), Binary Cross-Entropy (BCE)); $W_{i}$: (He normal, Glorot normal); $Drp$: [0%, 90%]; $L_{n}$: (Batch Normalization, Group Normalization (4)); and $L_{2r}$: [0.00001, 0.001].
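The search space above can be written down as a plain data structure, together with a sampler for the random seed points used to initialize the search. This is a sketch under stated assumptions: the dictionary keys are hypothetical names, and sampling the learning rate and L2 coefficient on a logarithmic scale is an assumption, since the text gives only the ranges.

```python
import math
import random

# Ranges and choices as listed in the text; key names are hypothetical.
SEARCH_SPACE = {
    "learning_rate": (1e-5, 1e-3),
    "optimizer": ("Adam", "SGD", "RMSprop"),
    "loss": ("SCCE", "BCE"),
    "weight_init": ("He normal", "Glorot normal"),
    "dropout": (0.0, 0.9),
    "normalization": ("Batch Normalization", "Group Normalization"),
    "l2": (1e-5, 1e-3),
}
# Assumption: ranges spanning orders of magnitude are sampled log-uniformly.
LOG_SCALE = {"learning_rate", "l2"}


def sample_setting(rng):
    """Draw one random hyperparameter setting from SEARCH_SPACE."""
    setting = {}
    for name, domain in SEARCH_SPACE.items():
        if isinstance(domain[0], str):          # categorical choice
            setting[name] = rng.choice(domain)
        elif name in LOG_SCALE:                 # log-uniform range
            low, high = domain
            setting[name] = 10 ** rng.uniform(math.log10(low), math.log10(high))
        else:                                   # uniform range
            setting[name] = rng.uniform(*domain)
    return setting


rng = random.Random(0)
seeds = [sample_setting(rng) for _ in range(3)]  # three initial seed points
for s in seeds:
    print(s)
```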

Search Strategy: The Bayesian Optimization is conducted in two sequential stages. Given $B_{A}$ and our definition of the structural search space, the Bayesian Optimizer first searches for the optimal number of N cells in each block. The Bayesian Optimization algorithm consists of six steps. In Step 1, a hyperparameter setting $S_{s}$ is defined as one set of possible values of the optimizable structural hyperparameters $(d_{1}, d_{2}, d_{3})$. It is therefore written as $S_{s} = \{S_{s1}, \dots, S_{si}, \dots, S_{sj}\}$, where $S_{si}$ is the value of the optimizable parameter $i$ in the hyperparameter setting $S_{s}$ and $j$ is the number of hyperparameters being optimized ($j = 3$ in the first search stage). In Step 2, we define an objective function $f(S_{s})$ as the validation accuracy (the accuracy on the held-out validation set of the model built from the backbone architecture with hyperparameter setting $S_{s}$), which is to be maximized. In Step 3, the Bayesian optimizer randomly selects $t$ hyperparameter settings, known as the initial seed points, which it examines before starting the search process. We set $t = 3$ in our experiment. Using the three initial hyperparameter settings, the Bayesian optimizer trains the backbone architecture to calculate the objective function $f(S_{s})$. In Step 4, the Bayesian optimizer builds the surrogate model $G(S_{s})$, which is based on Gaussian Process Regression. In Step 5, given the initialized $G(S_{s})$, the Bayesian optimizer uses Expected Improvement as the acquisition function to select the next hyperparameter setting, that is, the setting with the highest expected improvement over the best value of the objective function observed so far. The Expected Improvement is defined as follows:

$$A_{q}(S_{s}) = \mathbb{E}\left[\max\left(G(S_{s}) - f'(S_{s}),\ 0\right)\right]$$

where $f'(S_{s})$ is the best observed value of the objective function and $G(S_{s})$ is the posterior distribution of the surrogate model. At each iteration, the hyperparameter setting $S_{s}$ that maximizes $A_{q}(S_{s})$ is selected as the next setting for evaluation. The surrogate model is updated with the newly evaluated hyperparameter setting after each iteration. The search process is repeated for 30 iterations; the number of iterations was determined empirically in this study. Finally, in Step 6, the architecture that provides the highest classification accuracy from the Bayesian Optimization search is selected and named ENAS-B-1. Figure 5 shows the ENAS-B-1 architecture.
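When the surrogate posterior at a setting is Gaussian with mean $\mu$ and standard deviation $\sigma$, the Expected Improvement has the well-known closed form $(\mu - f')\,\Phi(z) + \sigma\,\phi(z)$ with $z = (\mu - f')/\sigma$, where $\Phi$ and $\phi$ are the standard normal CDF and PDF. The sketch below evaluates this form and picks the next setting among a few candidates. It illustrates the acquisition step only: the Gaussian Process fit is omitted, and the posterior means and standard deviations are hypothetical numbers chosen for illustration, not values from the experiments.

```python
import math


def normal_pdf(z):
    """Standard normal probability density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mu, sigma, best):
    """Closed-form E[max(G - best, 0)] for G ~ Normal(mu, sigma^2)."""
    if sigma <= 0.0:
        # No posterior uncertainty: the improvement is deterministic.
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    return (mu - best) * normal_cdf(z) + sigma * normal_pdf(z)


# Hypothetical posterior (mean, std) of validation accuracy per (d1, d2, d3).
candidates = {
    (1, 1, 1): (0.80, 0.01),
    (2, 1, 3): (0.78, 0.05),
    (5, 5, 5): (0.75, 0.02),
}
best_so_far = 0.79  # hypothetical best observed validation accuracy

next_setting = max(
    candidates,
    key=lambda s: expected_improvement(*candidates[s], best_so_far),
)
print(next_setting)  # (2, 1, 3)
```

In this toy case the setting with the highest posterior mean is not selected: the larger uncertainty of (2, 1, 3) gives it a higher expected improvement, which is how the acquisition function balances exploration against exploitation.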

Figure 5.

Fixed backbone architecture (ENAS-B-1) for trainable hyperparameters search stage.

In the second stage of the optimization, given the architecture of ENAS-B-1 as the backbone and our definition of the trainable search space, the Bayesian Optimizer searches for the optimal trainable hyperparameter setting. In particular, the same Steps 1 to 6 are used, but with $S_{s} = (L_{r}, Opz, L_{f}, W_{i}, Drp, L_{n}, L_{2r})$ and $j = 7$. This stage of the search results in an optimal CNN architecture, ENAS-B. The Results section provides the details of the optimized trainable hyperparameters.

An interesting alternative to the two-stage search described earlier is a combined search strategy in which optimal combinations of the number of normal cells and the trainable hyperparameters are searched jointly by the Bayesian Optimizer. We further explore this alternative search strategy and compare the resulting architecture with ENAS-B. Further details of the architecture obtained are given in the Results section.

Experiment Setups

Experiments have been conducted to find the optimal CNN architectures and evaluate the classification performance of ENAS-B models. All experiments were run on a workstation with an Intel Xeon(R) W-2102 CPU @ 2.90 GHz and 16.0 GB RAM. The Modeling Dataset (see Data Collection and Preparation) was used for searching for the optimal cells in ENAS, searching for the optimal number of normal cells and trainable hyperparameters using Bayesian Optimization, and finally training the ENAS-B model from scratch. A 5-fold stratified cross validation protocol was followed. At each iteration, the modeling data was split into 20% for testing (Internal test) and 80% for training. The training part was further split into 10% for validation and 90% for training. One split out of the five was used for the optimization (see Automatic Search and Optimization section). To determine the classification error rates, all five folds were used. The imbalance ratio between benignity and malignancy (1.92:1) was upheld in the modeling and searching stages based on the findings in Ahmed et al.¹⁴ All images were pre-processed, and the training set was enlarged using the pre-processing and data augmentation methods described in the Section on Data Collection and Preparation.
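The split protocol can be sketched with a small stratified splitter. This is an illustration under assumptions, not the authors' pipeline: the class counts (1004 benign, 518 malignant) are obtained by summing the two modeling datasets, the function `stratified_folds` is hypothetical, and stratifying the inner 90/10 validation split is an assumption.

```python
import random


def stratified_folds(labels, k, seed=0):
    """Assign sample indices to k folds so each fold keeps the class ratio."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)   # deal indices out round-robin
    return folds


# Modeling Dataset: 0 = benign (726 + 278), 1 = malignant (376 + 142).
labels = [0] * 1004 + [1] * 518

# Outer 5-fold split: one fold for the internal test, four for training.
folds = stratified_folds(labels, k=5)
test_idx = folds[0]
train_val = [i for f in folds[1:] for i in f]

# Inner split of the training part: 10% validation, 90% training.
inner = stratified_folds([labels[i] for i in train_val], k=10, seed=1)
val_idx = [train_val[j] for j in inner[0]]
train_idx = [train_val[j] for f in inner[1:] for j in f]

print(len(test_idx), len(val_idx), len(train_idx))  # 305 123 1094
```

Repeating the outer step with each of the five folds as the internal test set gives the five models whose results are averaged.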

Optimizing ENAS-B Architecture: After ENAS generates the optimal cells, the Bayesian optimization algorithm searches over 30 networks, each of which is trained from scratch on the Modeling dataset for 50 epochs during the search. The primary criterion for selecting the optimal architecture among those generated is the validation accuracy, but the architecture complexity in terms of the number of weight parameters within the model is also considered. As a result, the optimal architecture has the block configuration of (1N, 1R, 1N, 1R, 1N) as depicted in Figure 5. Then, the search for trainable hyperparameters of the architecture is conducted under the following setting. The number of trials (sampled models) is 30. Each model is trained on the unbalanced dataset for 50 epochs with batch size 8. The maximum batch size was constrained by the available computational power and the number of epochs was determined experimentally. The final ENAS-B architecture has the following hyperparameters: Learning Rate = 0.0001; Optimization function = SGD; Loss function = SCCE; Weight initialization = He Normal; Dropout rate = 0.3; Normalization Layer = Group Normalization; and L2 Regularization = 0.00036.

For the optimized architecture using the combined search strategy (ENAS-B Combined), the number of trial network models is set to 40. Each model is trained on the imbalanced dataset for 50 epochs with batch size 8. The final optimal architecture has the block configuration of (5N, 1R, 1N, 1R, 4N) with Learning Rate = 0.0001; Optimization function = Adam; Loss function = BCE; Weight initialization = He Normal; Dropout rate = 0; Normalization Layer = Batch Normalization; and L2 Regularization = 0.00042.

Model performance is measured by Sensitivity, Specificity, Accuracy and F1-score. Sensitivity refers to the proportion of known malignant test examples being classified as malignant, whereas Specificity refers to the proportion of known benign test examples being classified as benign. Accuracy refers to the proportion of correctly predicted test examples out of the total, and F1-score is the harmonic mean of Precision (the proportion of test examples classified as malignant that are known to be malignant) and Sensitivity.
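These measures follow directly from the confusion matrix, as the sketch below shows, with malignant taken as the positive class and F1 computed in its standard form as the harmonic mean of precision and sensitivity. The function name and the counts in the example are hypothetical. Because the ACC values in Table 1 coincide numerically with the mean of Specificity and Sensitivity in each row, the sketch also returns that mean under the hypothetical name `balanced_accuracy`.

```python
def classification_metrics(tp, fn, tn, fp):
    """Metrics from confusion-matrix counts (malignant = positive class)."""
    sensitivity = tp / (tp + fn)            # malignant cases found
    specificity = tn / (tn + fp)            # benign cases found
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    predicted_pos = tp + fp
    precision = tp / predicted_pos if predicted_pos else None
    if precision is None or precision + sensitivity == 0:
        f1 = None                           # undefined, e.g. nothing predicted malignant
    else:
        f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "f1": f1,
    }


# Hypothetical test fold: 100 malignant and 200 benign images.
m = classification_metrics(tp=80, fn=20, tn=180, fp=20)
print(m)
```

A model that labels every image as benign has no predicted positives, so its F1-score is undefined; the sketch returns `None` in that case.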

Results

Breast Lesion Classification

The optimized ENAS-B is then trained from scratch on the unbalanced dataset. All the data augmentation methods mentioned above are used to expand the training set. The number of epochs for training the ENAS-B models is set to 50. Figure 6 shows the training and validation losses of ENAS-B.

Figure 6.

Training and Validation losses of ENAS-B model.

Comparison With State-of-the-Art Purposely Built CNNs

We first compared ENAS-B with three existing state-of-the-art networks manually designed specifically for classifying breast lesions in US images, that is, CNN3,⁶ CNN4,⁷ and Fus2Net.⁸ Each CNN was trained and tested on the Modeling Dataset under the same cross validation protocol with the same folds as used for ENAS-B. As shown in Table 1, the ENAS-B model outperforms all three network models on the internal test, with an overall accuracy higher by 4.5, 18.8 and 13.3 percentage points respectively. ENAS-B also outperforms CNN3, CNN4 and Fus2Net by at least 6.6 percentage points in overall accuracy averaged over the external sets A, B and C. With the new dataset (External D),¹⁶ ENAS-B still achieved the highest overall accuracy of the four models, 67.4% (specificity 45.3% and sensitivity 89.5%), while CNN3, CNN4 and Fus2Net achieved overall accuracies of 61.5% (specificity 87.4% and sensitivity 35.7%), 53.7% (specificity 56.4% and sensitivity 50.9%), and 60.5% (specificity 40.0% and sensitivity 81.1%) respectively.

Table 1.

ENAS-B Performance and Comparison Against Other State-of-the-Art Breast Lesion CNNs.

Models	Test sets	Specificity	Sensitivity	ACC	F1	# Parameters
CNN3 Xiao et al.⁶	Internal	88.7 ± 6	61.1 ± 13	74.9 ± 5	66.0 ± 8	619,202
	External A	78.8 ± 4	68.1 ± 12	73.5 ± 4	66.2 ± 7
	External B	78.9 ± 11	86.4 ± 7	82.6 ± 3	79.4 ± 3
	External C	73.6 ± 10	71.0 ± 12	72.3 ± 1	73.9 ± 5
	External D	87.4 ± 6	35.7 ± 15	61.5 ± 5	46.9 ± 14
CNN4 Zeimarani et al.⁷	Internal	91.3 ± 13	29.9 ± 34	60.6 ± 11	30.8 ± 26	628,418
	External A	88.2 ± 11	39.8 ± 33	64.0 ± 11	40.9 ± 28
	External B	89.3 ± 18	39.4 ± 34	64.4 ± 9	42.8 ± 25
	External C	79.7 ± 29	36.9 ± 31	58.3 ± 6	41.4 ± 25
	External D	56.4 ± 45	50.9 ± 44	53.7 ± 6	41.7 ± 31
Fus2Net Ma et al.⁸	Internal	83.0 ± 15	49.2 ± 38	66.1 ± 12	44.0 ± 29	889,714
	External A	63.1 ± 34	56.2 ± 40	59.7 ± 10	46.5 ± 28
	External B	84.9 ± 14	64.3 ± 42	74.6 ± 15	59.1 ± 33
	External C	68.9 ± 27	52.7 ± 41	60.8 ± 9	48.1 ± 35
	External D	40.0 ± 19	81.1 ± 22	60.5 ± 5	68.3 ± 11
ENAS17 Pham et al.¹¹	Internal	86.4 ± 1.1	81.1 ± 3.4	83.8 ± 1.4	78.2 ± 1.7	3,927,636
	External A	90.0 ± 2.8	63.0 ± 3.5	76.5 ± 0.5	70.0 ± 1
	External B	84.9 ± 4.3	89.3 ± 3.4	87.1 ± 1.4	84.3 ± 1.7
	External C	74.4 ± 4	73.8 ± 4.7	74.1 ± 1.7	76.4 ± 2.4
	External D	55.2 ± 9	81.9 ± 6	68.6 ± 2	74.8 ± 1
ENAS-B	Internal	88.2 ± 2	70.5 ± 4	79.4 ± 1	73.0 ± 2	1,053,398
	External A	88.8 ± 4	72.0 ± 8	80.4 ± 2	75.2 ± 2
	External B	84.3 ± 4	95.0 ± 3	89.7 ± 1	87.0 ± 1
	External C	75.6 ± 8	80.4 ± 3	78.0 ± 3	81.0 ± 1
	External D	45.3 ± 19	89.5 ± 6	67.4 ± 6	76.0 ± 2

The classification performance of the ENAS-B models is presented in Table 1. With a 5-fold cross validation, ENAS-B achieved an average overall accuracy of 79.4% (88.2% specificity and 70.5% sensitivity). Further, we tested all five ENAS-B models on the four external datasets (External A, B, C, and D). ENAS-B generalizes well on the unseen data and achieved average accuracies (average of five models) of 80.4%, 89.7%, 78.0%, 67.4% on External A, B, C, and D respectively.

We then compared ENAS-B with the original ENAS.¹¹ Based on our earlier findings as reported in Ahmed et al.,¹³ we chose ENAS17 for the comparison. Using the optimal cells as shown in Figure 3, the ENAS17 architecture consists of 15 Normal cells (N) and two Reduction cells (R) in a configuration of (5N, R, 5N, R, 5N), and was trained on the Modeling dataset under the same 5-fold cross validation protocol. Although the ENAS17 models achieved higher accuracy in the internal test, ENAS-B generalized better and achieved higher overall accuracy than ENAS17 on the external datasets, except External D where ENAS17 has a marginally better overall accuracy. Moreover, ENAS-B is much lighter: ENAS17 has about 3.73 times as many weight parameters as ENAS-B (3,927,636 versus 1,053,398).

The performance of ENAS-B demonstrates the effectiveness of our approach in optimizing the number of layers and trainable hyperparameters for accurate and robust networks. To confirm whether the differences in the model accuracies on external datasets are statistically significant, a paired sample t-test between the ENAS-B model and each of CNN3, CNN4 and Fus2Net was separately conducted, and the t-statistics and p-values were calculated. The p-values for ENAS-B versus CNN3, ENAS-B versus CNN4 and ENAS-B versus Fus2Net are respectively 0.000487, 0.001484 and 0.016456, all below the general threshold of p = .05. Therefore, the ENAS-B model significantly outperforms the other manually designed CNN models.

To further explore the predictive power of ENAS-B on external datasets, the Receiver Operating Characteristic (ROC) curves and Area Under the Curve (AUC) were calculated on the External datasets A and B because both datasets have at least 500 images and were collected from two different sources (see Section on Data Collection and Preparation). For calculating the ROC curve, different thresholds between 0 and 10 were used. Figure 7 shows the ROC curves and AUC scores of ENAS-B (0.89 on External A and 0.96 on External B). The AUC score in general demonstrates how well a classifier can discriminate between classes. The AUC scores demonstrate that ENAS-B distinguishes malignant lesions better than CNN3, CNN4 and Fus2Net on both External A and External B. Moreover, we applied the DeLong test to ENAS-B against CNN3, CNN4 and Fus2Net on both External A and B. The DeLong values of ENAS-B versus CNN3, ENAS-B versus CNN4 and ENAS-B versus Fus2Net are respectively 0.3, 2.4 and 1.2 on External A, and 2.9, 4.9, and 3.4 on External B. The DeLong test values, all greater than zero, indicate that ENAS-B differs from those purposely designed CNN models.²³ The larger the DeLong test statistic is, the stronger the evidence supporting the difference in AUC values between the two models. Since the DeLong test statistics of ENAS-B against the other models are mostly greater than 1, this implies that the ENAS-B model has a significantly higher AUC than the other models. Although the DeLong value of 0.3 for ENAS-B versus CNN3 on External A is less than 1, the p-value of the paired t-test for ENAS-B versus CNN3 is 0.000487. All the results indicate that ENAS-B performs significantly better than the other purposely designed CNN models.
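For readers unfamiliar with these quantities, the sketch below computes ROC points and the AUC from model scores. It is a generic illustration, not the evaluation code used in this study: the scores are hypothetical malignancy probabilities in [0, 1] (an assumption about the score range), the function names are hypothetical, and the AUC is computed through its rank interpretation, that is, the probability that a randomly chosen malignant case receives a higher score than a randomly chosen benign case.

```python
def roc_points(scores, labels, thresholds):
    """(false positive rate, true positive rate) at each decision threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t)
        points.append((fp / n_neg, tp / n_pos))
    return points


def roc_auc(scores, labels):
    """AUC as the fraction of (malignant, benign) pairs ranked correctly;
    ties count as half a correct pair."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical scores for five lesions (1 = malignant, 0 = benign).
scores = [0.9, 0.8, 0.4, 0.35, 0.1]
labels = [1, 1, 0, 1, 0]

points = roc_points(scores, labels, thresholds=[0.0, 0.5, 1.0])
auc = roc_auc(scores, labels)
print(points)
print(round(auc, 3))  # 0.833
```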

Figure 7.

ROC curves and AUC scores of ENAS-B, CNN3, CNN4 and Fus2Net on External A and External B.

Comparison With State-of-the-Art Generic CNNs

It is also interesting to know how ENAS-B models compare with known CNN architectures originally designed for ImageNet. We selected some well-known architectures (VGG16,⁴ ResNet50,²⁴ InceptionV3,²⁵ MobileNet V2,²⁶ DenseNet,²⁷ EfficientNetB0,²⁸ NasNet Mobile²⁹ and XceptionNet³⁰) and then customized them for breast lesion recognition from US images. Both training the architectures from scratch and training them with transfer learning (TL) have been attempted. The number of epochs was set to 50 for the former, and 25 for the latter. The batch size was set as 16 for all the models in both situations. For fairness of the comparison, all the network models were trained on the Modeling Dataset under the same setting as for the ENAS-B models. Table 2 shows that ENAS-B achieved the highest overall accuracy on the internal tests except XceptionNet TL (with a small margin), and the highest average overall accuracy on the external tests.

Table 2.

Comparison Results of State-of-the-Art and ENAS-B to Classify US Images of Breast Lesions (SP: Specificity; ST: Sensitivity; ACC: Overall Accuracy; F1: F1-Score): Internal Average Versus External Average.

Network Models	Test sets	CNN models from scratch				CNN models with TL
Network Models	Test sets	SP	ST	ACC	F1	SP	ST	ACC	F1
VGG16	Internal	100	0	50	N/A	100	0	50	N/A
VGG16	External	100	0	50	N/A	100	0	50	N/A
Resnet50	Internal	74.5	50.4	62.5	57.4	87.1	48	67.6	54.5
Resnet50	External	65.4	57.6	61.5	60.1	80.2	52.9	66.6	57.6
InceptionV3	Internal	84.2	48.9	66.6	53.6	94.7	32.2	63.4	40.2
InceptionV3	External	76.3	60.7	68.5	63.0	90.2	47.4	68.8	55.6
MobileNet V2	Internal	19.9	80.2	50	41	77.9	77.5	77.7	72.2
MobileNet V2	External	20.0	80.1	50.0	49.3	67.4	87.7	77.6	78.1
DenseNet	Internal	83.5	52.4	68	58	94.2	42.8	68.5	51.1
DenseNet	External	80.6	67.9	74.2	70.7	87.1	50.0	68.5	53.4
EfficientNetB0	Internal	82.9	58.4	70.7	63.1	87.3	65.5	76.4	68.3
EfficientNetB0	External	73.5	68.9	71.2	69.8	80.6	78.9	79.7	77.6
NasNetMobile	Internal	50.2	91.1	70.7	64.9	73.3	36.9	55.1	24.7
NasNetMobile	External	34.1	95.9	65.0	69.9	70.1	37.7	53.9	27.2
XceptionNet	Internal	88.3	57.3	72.8	63.7	87.3	73.8	80.6	74.4
XceptionNet	External	82.3	67.4	74.9	71.2	77.5	84.8	81.2	79.8
ENAS-B	Internal	88.2	70.5	79.4	73	-	-	-	-
ENAS-B	External	82.9	82.5	82.7	81.1	-	-	-	-
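For readers who wish to reproduce the columns of Table 2, the following minimal sketch computes SP, ST, ACC and F1 from the four cells of a binary confusion matrix. The function name `binary_metrics` and the example counts are hypothetical and do not come from the study. The ACC entries in Table 2 are numerically close to the mean of SP and ST (for example, (88.2 + 70.5) / 2 ≈ 79.4 for ENAS-B on the internal tests), which is what plain accuracy reduces to on a class-balanced test set, so the sketch reports both forms. F1 is returned as `None` when no image is predicted malignant, which corresponds to the N/A entries of the VGG16 rows.

```python
def binary_metrics(tp, fn, tn, fp):
    """Return SP, ST, ACC, balanced ACC and F1 as percentages.

    tp/fn count malignant images classified correctly/incorrectly;
    tn/fp count benign images classified correctly/incorrectly.
    """
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    accuracy = 100.0 * (tp + tn) / (tp + fn + tn + fp)
    balanced_accuracy = (sensitivity + specificity) / 2.0

    # Precision is undefined when nothing is predicted malignant.
    if tp + fp == 0:
        f1 = None
    else:
        f1 = 100.0 * 2 * tp / (2 * tp + fp + fn)

    return {
        "SP": specificity,
        "ST": sensitivity,
        "ACC": accuracy,
        "BalancedACC": balanced_accuracy,
        "F1": f1,
    }


# Hypothetical counts: a model that labels every image as benign.
print(binary_metrics(tp=0, fn=50, tn=50, fp=0))
# Hypothetical counts: a model with mixed errors on a balanced test set.
print(binary_metrics(tp=40, fn=10, tn=45, fp=5))
```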

Comparison With ENAS-B Combined

We further compare ENAS-B with ENAS-B Combined. Figure 8 summarizes the performance of the 12-layer ENAS-B Combined models on the internal test data and three external test datasets (without External D). Although the ENAS-B Combined models still perform better than all other purposely built CNNs, the performance is worse than that of ENAS-B for both internal and external tests.

Figure 8.

Performance of ENAS-B Combined search on internal and external test sets.

ENAS-B for Thyroid Cancer Classification

A pilot study was conducted by searching for an optimal ENAS-B architecture using breast lesion US images and then training ENAS-B models for thyroid nodule classification in two different scenarios. In both scenarios, the same data augmentation methods were used to enlarge the training sets, and the models were evaluated under a 5-fold cross-validation framework. In the first scenario, a balanced dataset of 500 ultrasound images (250 benign and 250 malignant) was used, and the ENAS-B architecture achieved an average overall accuracy of 73.6% (specificity: 54% and sensitivity: 93.2%) in classifying thyroid nodules. In the second scenario, following our approach of using unbalanced classes, the ENAS-B models were trained on an unbalanced thyroid dataset (480 benign and 250 malignant, a ratio of 1.92:1) and achieved an average overall accuracy of 67.9% (specificity: 67.8% and sensitivity: 68%). In both scenarios, the specificity is close to random guess whereas sensitivity has substantial lifts. However, the potential of ENAS-B for transfer learning still requires further research.
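The 5-fold evaluation on the unbalanced thyroid dataset can be sketched as follows. This is a minimal illustration rather than the protocol actually used: it assumes that the folds are stratified by class, which the text does not state, and the function name `stratified_kfold` and the random seed are hypothetical. With 480 benign and 250 malignant images, each test fold then holds 96 benign and 50 malignant images, and augmentation would be applied to the training indices only.

```python
import random


def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_indices, test_indices) with class ratios kept per fold."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin across folds.
        for position, i in enumerate(idx):
            folds[position % k].append(i)
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test


# 0 = benign, 1 = malignant; sizes taken from the second scenario above.
thyroid_labels = [0] * 480 + [1] * 250
for fold, (train, test) in enumerate(stratified_kfold(thyroid_labels)):
    benign = sum(1 for i in test if thyroid_labels[i] == 0)
    malignant = len(test) - benign
    print(fold, len(train), benign, malignant)
```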

Discussions

The comparisons have revealed several advantages of ENAS-B over the existing approaches. First, the ENAS-B models outperform all existing handcrafted networks purposely built for breast lesion classification from US images (Table 1). Second, the ENAS-B models in general maintain a smaller difference between sensitivity and specificity, giving more balanced performance on both classes (Tables 1 and 2). Furthermore, the ENAS-B models have a much smaller number of weight parameters than ENAS17 (Table 1) and other known generic architectures as reported in the literature. Although the purposely built networks tend to be slimmer, they underperform on both internal and external tests (Table 1). ENAS-B also has its own limitations. Like all automatic search methods, ENAS-B requires resources to conduct searching and then training. It is worth noting that our two-stage approach to optimization has already reduced the demand on resources compared with the combined search. Due to resource constraints, ENAS-B purposely controls the size of the search space by defining a backbone architecture framework, which might limit how close the outcome is to the true optimum.

The comparison between the two search strategies (two-stage versus combined) shows that the two-stage ENAS-B outperforms ENAS-B Combined by a margin of 3% in overall accuracy in the internal tests and by nearly 5.5% in average sensitivity in the external tests, while the average specificity remains nearly the same (Table 1 and Figure 8). Although the ENAS-B search principle is consistent with the two-step principle of ENAS,¹¹ this finding is still surprising, because a combined search space offers more hyperparameter combinations and hence should increase the possibility of finding the global optimum. It is possible that reaching an optimal CNN in the combined space requires more iterations and hence a longer overall search time.

Radiomics refers to the high-throughput analysis of quantitative image features for improving diagnostic accuracy in a clinical decision support system.³¹ To support the rigor and clinical relevance of studies, a radiomics quality score (RQS) system was recently introduced in that landmark article.³¹ Although our paper is not a direct clinic-based study, it is desirable to evaluate the quality of our study using the RQS. After discarding six key components irrelevant to our study (key elements 4, 6, 7, 11, 14 and 15), our study scored 16 out of 23 points on the remaining 10 key components. Although we have no direct control over the image acquisition and collection due to the data collection protocol agreed with the sponsor, US images from scanners of different makes and models were purposely collected, and all lesions were cropped by experienced radiologists from their medical centers (see Section on Data Collection and Preparation). Our deep learning approach follows an end-to-end workflow instead of examining each stage of image processing separately. The embedded convolutional operators optimally placed in a CNN architecture extract features at different levels of data abstraction. The ENAS reduction cells and the GAP layer are used for feature reduction. The performance of ENAS-B has been evaluated through internal and external tests, and various discrimination and calibration statistics have been used (see the Results Section). Although not all of our datasets are publicly accessible, one external test dataset (BUSI) is available from open sources. The ENAS methods for cell search are based on the Python code of Pham et al.,³² and the Bayesian Optimization adapts the program code in KerasTuner.³³ The radiomics analysis has also revealed the need to bring our study closer to clinical practice. We therefore plan to conduct prospective tests in a clinical setting in the next phase of our investigation.

Conclusion and Future Work

This paper presented a novel framework for automatically searching CNN architectures for breast lesion classification from US images. We combined ENAS cell search with Bayesian Optimization of network layers and trainable hyperparameters. The proposed framework yields efficient, shallow and robust CNN models that outperformed the state-of-the-art CNN models developed for the same purpose. The results show that cell structures, network depth and trainable hyperparameters are all important parameters to be optimized. Another finding is the importance of the search strategy. The evidence shows that the two-stage approach (ENAS-B) allows the Bayesian optimizer to narrow the search and provide a robust CNN model. In the future, we plan to expand the search space by including other hyperparameters such as the number of filters, the RoI margin size and the connectivity between cells. In addition, we plan to automatically optimize the depth and trainable hyperparameters of existing CNNs such as ResNets, GoogLeNet and MobileNet by using their blocks as the search spaces. We will further compare the classification accuracy of the ENAS-B model with that of expert radiologists on datasets from different sources. Finally, we plan to evaluate the performance of ENAS-B with pre-processed images to ensure that they have consistent characteristics.

Footnotes

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research work is sponsored by TenD AI Medical Technology.

ORCID iD

Alaa AlZoubi

References

1. Sung, Ferlay, Siegel, Laversanne, Soerjomataram, Jemal, et al. Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin. 2021;71(3):209-49.

2. Siegel, Miller, Wagle, Jemal. Cancer statistics, 2023. CA Cancer J Clin. 2023;73(1):17-48.

3. Liu, Wang, Yang, Lei, Liu, et al. Deep learning in medical ultrasound analysis: a review. Engineering. 2019;5:261-75.

4. Simonyan, Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition; 2014, pp. 1-14, http://arxiv.org/abs/1409.1556

5. Szegedy, Liu, Jia, Sermanet, Reed, Anguelov, et al. Going deeper with convolutions. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 07-12-June, pp. 1-9, 2014.

6. Xiao, Liu, Qin. Comparison of transferred deep neural networks in ultrasonic breast masses discrimination. Biomed Res Int. 2018;2018:4605191-9.

7. Zeimarani, Costa MGF, Nurani, Bianco, De Albuquerque Pereira, Filho CFFC. Breast lesion classification in ultrasound images using deep convolutional neural network. IEEE Access. 2020;8:133349-59.

8. Tian, Sun, Liu, et al. Fus2Net: a novel convolutional neural network for classification of benign and malignant breast tumor in ultrasound images. Biomed Eng Online. 2021;20(1):112-15.

9. Zoph, Le QV. Neural Architecture Search with Reinforcement Learning; 2017, pp. 1-16, http://arxiv.org/abs/1611.01578

10. Radhakrishnan, Alzoubi. Vehicle Pair Activity Classification using QTC and Long Short Term Memory Neural Network. In: VISIGRAPP (5: VISAPP), 2022, pp. 236-247.

11. Pham, Guan, Zoph, Dean. Efficient Neural Architecture Search via Parameter Sharing; 2018, http://arxiv.org/abs/1802.03268

12. Qian, Yang, Huang, Luo, Lin, et al. HASA: Hybrid Architecture Search with Aggregation Strategy for Echinococcosis Classification and Ovary Segmentation in Ultrasound Images, vol. 00; 2022, pp. 1-17.

13. Ahmed, AlZoubi. An ENAS Based Approach for Constructing Deep Learning Models for Breast Cancer Recognition from Ultrasound Images, ArXiv; 2020.

14. Ahmed, AlZoubi. Improving generalization of ENAS-based CNN models for breast lesion classification from ultrasound images. In: Annual Conference on Medical Image Understanding and Analysis, 2021.

15. Al-Dhabyani, Gomaa, Khaled, Fahmy. Dataset of breast ultrasound images. Data Brief. 2020;28, pp. 1–5.

16. Abbasian Ardakani, Mohammadi, Mirza-Aghazadeh-Attari, Acharya. An open-access breast lesion ultrasound image database: Applicable in artificial intelligence studies. Comput Biol Med. 2023;152, pp. 1–3.

17. Zhu, AlZoubi, Jassim, Jiang, Zhang, Wang, et al. A generic deep learning framework to classify thyroid and breast lesions in ultrasound images. Ultrasonics. 2021;110.

18. Khalili, Kazerooni, Familiar, Haldar, Kraya, Foster, et al. Radiomics for characterization of the glioma immune microenvironment. NPJ Precis Oncol. 2023;7(1):59.

19. Rich, Seshadri. Photoacoustic imaging of vascular hemodynamics: Validation with blood oxygenation level-dependent MR imaging. Radiology. 2015;275(1):110-8.

20. Han, Cao, Liang. Radiomics assessment of the tumor immune microenvironment to predict outcomes in breast cancer. Front Immunol. 2021;12, pp. 1–9.

21. Mohammadi, Faeghi, Homayoun, Abolghasemi, Vogl, Bureau, et al. Tumor microenvironment, radiology, and artificial intelligence: should we consider tumor periphery? J Ultrasound Med. 2022;41(12):3079-90.

22. Hassan, Alzoubi, Jassim. Towards optimal cropping: breast and liver tumor classification using ultrasound images. In: Agaian, Jassim, DelMarco, Asari, eds. Multimodal Image Exploitation and Learning 2021. UAS-Florida, pp. 111–122, SPIE; 2021.

23. DeLong, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: A nonparametric approach. Biometrics. 1988;44(3):837-45.

24. Zhang, Ren, Sun. Deep residual learning for image recognition. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2016-Decem, 2015, pp. 770-778.

25. Szegedy, Vanhoucke, Ioffe, Shlens, Wojna. Rethinking the Inception Architecture for Computer Vision. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2016-Decem, 2016, pp. 2818–2826.

26. Sandler, Howard, Zhu, Zhmoginov, Chen LC. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2018, pp. 4510-4520.

27. Huang, Liu, Van Der Maaten, Weinberger KQ. Densely connected convolutional networks. In: Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, vol. 2017-January, 2017, pp. 2261-2269.

28. Tan, Le QV. EfficientNet: Rethinking model scaling for convolutional neural networks. In: 36th International Conference on Machine Learning, ICML 2019, 2019, vol. 2019-June, pp. 10691-10700.

29. Zoph. Learning transferable architectures for scalable image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 8697-8710.

30. Chollet. Xception: Deep learning with depthwise separable convolutions. In: Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, vol. 2017-January, pp. 1800-1807.

31. Leijenaar, Deist, RT P J TM, De Jong, Van Tim, Even, et al. Radiomics: the bridge between medical imaging and personalized medicine. Nat Rev Clin Oncol. 2017;14(12):749-62.

32. Pham, Guan, Zoph, Dean. Efficient Neural Architecture Search via Parameter Sharing-github, Github, 2019.

33. O'Malley, Elie, James, Chollet, Haifeng, Invernizzi. KerasTuner, 2019, https://github.com/keras-team/keras-tuner.