Region search based on hybrid convolutional neural network in optical remote sensing images

Abstract

Currently, big data is a new and hot issue. Particularly, the rapid growth of the Internet of Things causes a sharp growth of data. Enormous amounts of networking sensors are continuously collecting and transmitting data to be stored and processed in the cloud, including remote sensing data, environmental data, and geographical data. And region is regarded as the very important object in remote sensing data, which is mainly researched in this article. Region search is a crucial task in remote sensing process, especially for military area and civilian fields. It is difficult to fast search region accurately and achieve generalizability of the regions’ features due to the complex background information, as well as the smaller size. Especially, when processing region search in large-scale remote sensing image, detailed information as the feature can be extracted in inner region. To overcome the above difficulty region search task, we propose an accurate and fast region search in optical remote sensing images under cloud computing environment, which is based on hybrid convolutional neural network. The proposed region search method partitioned into four processes. First, fully convolutional network is adopted to produce all the candidate regions that contain the possible object regions. This process avoids exhaustive search for input images. Then, the features of all candidate regions are extracted by a fast region-based convolutional neural network structure. Third, we design a new difficult sample mining method for the training process. At the end, in order to improve the region search precision, we use an iterative bounding box regression algorithm to normalize the detected bounding boxes, in which the regions contain candidate objects. The proposed algorithm is evaluated on optical remote sensing images acquired from Google Earth. Finally, we conduct the experiments, and the obtained results show that the proposed region search method constantly achieves better results regardless of the type of images tested. Compared with traditional region search methods, such as region-based convolutional neural network and newest feature extraction frameworks, our proposed methods show better robustness with complex context semantic information and backgrounds.

Keywords

Big data region search hybrid convolutional neural network difficult sample mining improved iterative bounding box regression

Introduction

Recently, with the continuous progress of Internet of Things (IoT) big data, mobile Internet, cloud computing, grid computing, and other new technologies, system integration becomes more complex. IoT is with the purpose of improving the quality of human life by handling the massive amount of data, collected by the sensing devices connected to the Internet.^1–3 When processing information and data, it encounters lots of challenges, for instance, data storage and management, efficient processing of massive data, structured and unstructured data fusion and analysis, and multi-type data visualization. Especially, remote sensing technologies are promoted quickly; the spatial resolution, temporal resolution, spectral resolution, and radiative resolution of remote sensing data are becoming higher and higher, which contain abundant data. Remote sensing data have the obvious big data characteristics such as large capacity, high efficiency, multi-type, and high value, which indicate that remote sensing has entered into big data era. Based on the aerospace science and technologies, an integrated space–air information network has been formed, which provides ultra-high dimensional and frequency earth observation data. Remote sensing images have significant applications in both military and civilian areas. For the civil area, remote sensing images can be used for meteorological forecasting, land planning, environmental monitoring, and other aspects to make a significant contribution to the development of national economy.^4–6 In terms of military, remote sensing images can be used for strategic reconnaissance, military surveying and mapping, marine monitoring, and so on, which can acquire various military targets intelligence information without affected by national boundaries and geographical restrictions.^7–10 Detecting all kinds of important military targets accurately and realizing the rapid transformation of remote sensing images to intelligence information, this can not only save human resources but also improve the efficiency of information acquisition.

Region/object detection based on image technology refers to that it uses relevant technologies such as computer vision to detect the targets from the image; meanwhile, judge the categories of objects, location, size, and confidence degree automatically. At present, the technology has been widely used in marine surveillance, precision guidance, video monitoring, and so on. Among all the military targets, region search has attracted considerable research attention.

However, due to the complex background, the size of region is very dissimilar, which greatly increases the difficulty of the region search task.¹¹ At present, the hot region search is the main research point; thus, we carry out the automatic detection for region search, which has the vital significance for enhancing the intelligence information level of military battlefield.

In recent years, the region search technology has attracted more attention from researchers in deep learning and computer vision. Traditional region search tasks such as histogram of oriented gradients (HOG),¹² local binary pattern (LBP) feature,¹³ and scale invariant feature transformation (SIFT)¹⁴ with classifier (boosting, support vector machine (SVM), etc.) cannot satisfy the requirements for next image process like object detection or localization accurately. The rule of feature extraction by using artificial design method is that it extracts feature information in the original image, and inputs feature into the classifier to learn classification rules. Finally, the trained classifier is used for object detection. This artificial feature modeling methods have achieved good results in face recognition, pedestrian detection, and other fields, which greatly promote the development of image region search technology. However, artificial feature modeling method only contains the original pixel image feature and texture gradient information; it does not have high-level semantic abstract ability, which makes this method have low detection effeteness under the complicated scene. Convolutional neural network (CNN) was proposed for ImageNet image segmentation by Krizhevsky et al.¹⁵ A breakthrough was made in image classification, target detection, image segmentation, and other image processing tasks with CNN. Compared with the feature description of the traditional manual design, the deep convolution feature has a subversive enhancement in the semantic abstraction ability.

Actually, region search and region localization have a delicate difference. The aim of region search is to find out the location of all region objects in one image and they need to be detected. Nevertheless, region localization should locate the region objects in image accurately. Obviously, region localization has a higher precision than many region search algorithms. Thus, a region search framework is proposed for the specific region object, that is, airport and harbor, in this article. Therefore, we lay more emphasis on fast region search to realize the core work of this article. Feature extraction problem for region search is addressed by CNN, because there is a deeper layer structure in CNN, which has better learning abilities than other classifiers. We design a new accurate and fast region search framework based on hybrid convolutional neural network (HCNN) in optical remote sensing images under cloud computing environment.

This article is organized as follows. First, related works on region search and applications of CNN are reviewed in remote sensing images in the “Related work” section. Then, the “Proposed region search method” section shows the detailed proposed region search framework and its main implementation issues are discussed. The “Experimental results” section discusses the region search experiments and gives the analysis. Finally, the “Conclusion” section summarizes this article.

Related work

Significant research has been devoted to region search through multi-scale gradient histograms and CNN-based methods. The classical region object detection network contains two parts: candidate region extraction and object region recognition. Girshick et al.¹⁶ proposed the region-based convolutional neural network (R-CNN) for object detection. Its main idea is that it uses selective search method¹⁷ to extract the proposal regions containing possible objects. Then, the CNN is adopted to extract convolution features for proposal regions. SVM is used to make a discrimination for the proposal regions. Finally, bounding box algorithm is used for making correction. However, the method needs to calculate the convolution feature in each proposal region; the calculation efficiency is very low. In addition, all the proposal regions are normalized to the same scale resulting in the distortion of the image, which affects the final detection result. Li et al.¹⁸ presented a framework that integrated R-CNN and deformable part-based model (DPM) for detecting multiple objects. Liu et al.¹⁹ focused on the last step of the system to aggregate thousands of scored box proposals into final object prediction, which was called proposal decimation. Girshick²⁰ proposed a fast R-CNN for object detection, which addressed the issues of R-CNN. Meanwhile, the discriminant of the proposal area and the bounding box regression were integrated into a framework, which effectively improved the accuracy and efficiency of the object detection. Wang et al.²¹ proposed a multi-scale mask-based fast R-CNN framework, which generated saliency score of each region. Since the regions were segmented using edge-preserved methods, the results were naturally with sharp boundaries. Lou et al.²² proposed an object detection system based on standard fast R-CNN object detection branch and DeepLap semantic segmentation branch. This method was divided into three processes. Ren et al.²³ introduced a region proposal network (RPN) that shared full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. Chen et al.²⁴ improved the cross-domain robustness of object detection. He tackled the domain shift on two levels: (1) the image-level shift, such as image style and illumination, and (2) the instance-level shift, such as object appearance and size. The method built the approach based on the recent state-of-the-art faster R-CNN model, and designed two domain adaptation components, on image level and instance level, to reduce the domain discrepancy. The two domain adaptation components were based on H-divergence theory, and were implemented by learning a domain classifier in adversarial training manner. The domain classifiers on different levels were further reinforced with a consistency regularization to learn a domain-invariant RPN in the faster R-CNN model. Redmon et al.²⁵ presented You Only Look Once (a real-time object detection), a new approach to object detection. It directly made regression for multiple region categories and bounding box on the convolution feature maps, realizing the end-to-end training and testing for input image. The method greatly improved the object detection speed. Liu et al.²⁶ presented a single deep neural network single shot multibox detector (SSD) for detecting objects in images. SSD discretized the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. In addition, the network combined predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes.

However, with the increasing of CNN layers, dimensions of feature map are getting smaller. The features become more abstract, and semantic feature is more obvious, namely, the location information is very fuzzy. Therefore, location is not accurate; although the bounding box can remit the problem, it still exists difference for different input images; sometimes, the correction effect is very poor. Multiple convolution and pooling alternately execute the operation, making smaller objects correspond to small unit size. Inadequate feature expression leads to poor performance for small object detection.

In this article, we propose a fast region search based on HCNN in optical remote sensing images. Fully convolutional network (FCN) generates all the candidate regions avoiding the influence of different sizes of input images. Then, the features of all candidate regions are extracted by fast R-CNN structure. A new difficult sample mining method is designed for the training process. To improve the accuracy of region search, we use iterative bounding box regression (IBBR) algorithm to make the bounding boxes of searched regions perfect. Experiments show that the proposed region search method shows better robustness with complex context semantic information and backgrounds than other state-of-the-art object region detection methods.

Proposed region search method

The proposed region search method consists of three stages: region proposal, feature extraction, and output result, as shown in Figure 1.

Figure 1.

Total framework of proposed method: (a) original image, (b) FCN produces the candidate regions, (c) fast R-CNN is used for extracting the features from these candidate regions, (d) Softmax is used for classifying the objects, (e) false alarms are removed, (f) IBBR is applied to optimize the bounding boxes, and (g) final object region results.

First, potential candidate boxes are generated by using FCN method for the input remote sensing images. Second, fast R-CNN is adopted for extracted feature of all the potential candidate boxes. Finally, IBBR is designed to address the problem of box redundancy and optimize the region locations, and then get the accurate searched regions. The following is the detailed illustration for each process.

Proposal region extraction with FCN

FCN was used for segmenting semantic content.²⁷ In this article, we use FCN to generate the proposal regions with possible regions. As we all know, exhaustive search method such as sliding window technique for object detection is computationally expensive. FCN can effectively address this problem. More importantly, only a few training examples are required when using FCN. FCN defines many classes in convolutional network, where the output of the final parameterized layer is a grid rather than a vector, which has parameters that are learned from the data. That is beneficial for semantic segmentation, where each location is the window corresponds to the predicted region class of a pixel.

FCN has two obvious advantages: (1) it can accept any size of input images without requiring the same size of all training images and test images; and (2) after one convolution, it can get the features of multiple regions, which avoids the repeated calculation problems, so it is more efficient.

First, FCN is constructed. Any size of the images is input to this convolution network, and the eigenmatrix representation of image is acquired; each location in feature representation corresponds to an object region in the original image. Given an image, it gets the feature representation by using FCN. To detect different sizes of regions, the images need to be made multi-scale transformation, and it obtains feature matrix. Finally, it makes similarity comparison between detected object feature and feature matrix of image in database. The optimal matching position and similar values are obtained. Framework of FCN is shown in Figure 2.

Figure 2.

Framework of FCN: (a) original image, (b) FCN network, and (c) matrix representation.

After training FCN, it can get the feature matrix representation of input image. For the object image, scaling is executed to keep the same size with that of FCN. Region search will be directly affected by the quality of extracted region proposals. Therefore, FCN improves the quality of region proposals for region search to some extent.

Feature extraction with fast R-CNN

Above all, the fast R-CNN deals with all the detected candidate regions with several convolutional (conv) and maximum pooling layers to produce a conv feature map. Then, it adds the ROI (region of interest) pooling layer, to extract the feature representation of a fixed dimension for each region, and then conducts type identification through normal Softmax.

ROI pooling layer can transform any valid candidate regions into a small feature map with a fixed spatial extent of H × W. Here, H and W are layer hyper-parameters that are independent of any particular ROI. Each ROI is defined by a four-tuple (r, c, h, w) that specifies its top-left corner (r, c) and its height and width (h, w).

Feature matrix is obtained by FCN. This process means that it selects fixed sliding window based on certain step length. After a forward convolution, it can get the feature of the area. This fixed window size may result in some areas without covering the objects, which can influence the region search result. In this article, we perform multi-scale transformation to detect the different size objects.

Deep CNN usually has millions of model parameters that need to be trained, and the training samples and hardware conditions are very necessary. At present, the general approach is to use pre-trained network models on the ImageNet. Then, transfer learning method is adopted to modify and fine-tune the parameters for specific images. Generally, feature map of region extraction and detection image should contain more rich information, including semantic information and location information. In addition, feature map should have appropriate size; smaller size will lead to insufficient expression of feature and bigger size will affect the computational efficiency. Deep CNN model contains LeNet, AlexNet, VGGNet, GoogLeNet, and so on.^28–30 There are five convolutional layers and three full connection layers in the final network structure of AlexNet, as shown in Figure 3. It uses dropout and activation function Rectified Linear Units (ReLU) to improve the accuracy and reduce training time. In addition, due to the limitation of experiment condition, in this article, we use AlexNet model for FCN. When training the network, size of input image is 227 × 227.

Figure 3.

Architecture of AlexNet CNN. This network is composed of five successive convolutional layers (C1, C2, C3, C4, and C5) followed by three fully connected layers (FC6, FC7, and FC8).

Fast R-CNN can avoid the repeated overlap computation, although with the bounding box calculation to realize the end-to-end training. Fast R-CNN does not need to run all of the images to get FC7 feature. Therefore, fast R-CNN is used for feature extraction in this work.

Difficult sample mining method

In order to train the extracted candidate regions, we need to label the samples of the original proposal regions corresponding to each sliding window. If the intersection-of-union (IoU) between original proposal region and ground truth bounding box is the biggest, or $IoU > 0.5$ , the proposal region is labeled as positive sample. If $IoU < 0.3$ , the proposal region is labeled as negative sample. The unmarked initial proposal regions do not affect the target function in the training process. Hence, M labeled samples form the training sample set $S = {(X_{i}, Y_{i})}_{i = 1}^{M}$ . Loss function can be denoted as

l (X_{i}, Y_{i} | W) = L_{c} (p (X_{i}), y_{i}) + λ [y_{i} = 1] L_{1} (b_{i}, {\hat{b}}_{i})

(1)

where $L_{c} (p (X_{i}), y_{i})$ and $L_{1} (b_{i}, {\hat{b}}_{i})$ represent the classification loss function and the regression loss function of the sample set, respectively; $λ$ is compensation factor to balance the classification and regression layers; $X_{i}$ denotes any sample image; $Y_{i} (y_{i}, b_{i})$ denotes the corresponding label $y_{i} = {0, 1}$ and bounding box regression vector $b_{i}$ , $b_{i} = (b_{ix}, b_{iy}, b_{iw}, b_{ih})$ . The subscripts (x, y, w, h) indicate the corresponding components of central coordinates, length and width of the bounding box.

Classification layer adopts a cross-entropy classification function based on dichotomies and a loss function, $L_{c} (p (X_{i}), y_{i}) = - \lg p_{yi} (X_{i})$ , to output the probability distribution of samples. $p_{yi} (X_{i})$ denotes the probability of $X_{i}$ belonging to $y_{i}$ .

After sample labeling, the object is limited in the image. Negative samples are usually greater than that of positive samples. The imbalance distribution of samples may lead to the unstable training, so it needs to make an adjustment between positive sample number and negative sample number in the training process. In this article, we set the proportion of positive and negative samples as $| P^{+} | / | P^{-} | = α$ . Due to the large negative samples and uneven distribution, this has a great influence on the final region search accuracy. Therefore, we use bootstrapping sampling strategy³¹ to select the negative training set. All negative samples are sorted according to the confidence value, and samples with the highest scores are added into the training set. Since the feature map of each layer corresponds to a different size of perception region, each of convolution layers in the extracted region is responsible for extracting the proposal regions with different scales. This region extraction method may lead to a shortage of positive samples in a convolution layer, namely, $| P^{+} | > > α | P^{-} |$ . So, we modify the weight of classification loss function

\begin{matrix} L_{c} = \frac{1}{1 + α} \frac{1}{| P^{-} |} \sum_{i \in | P^{-} |} - \lg p_{1} (X_{i}) \\ + \frac{α}{1 + α} \frac{1}{| P^{+} |} \sum_{i \in | P^{+} |} - \lg p_{0} (X_{i}) \end{matrix}

(2)

By adding the corresponding weight coefficient for the loss function of positive and negative samples, it ensures that if the positive samples are less than the setting number, it can increase the positive proportion in the loss function through controlling the weight coefficient, which makes the weight of the positive and negative samples remain balanced in loss function.

Improved bounding box regression IBBR

Bounding box regression³² plays a significant role in detection models. We then derive a novel bounding box regression loss, which better matches the goal of maximizing IoU while maintaining good convergence properties (smoothness, robustness, etc.). We define the regression loss function as

L_{1} (b_{i}, {\hat{b}}_{i}) = \frac{1}{4} \sum_{j \in {x, y, w, h}} s m_{L_{1}} (b_{i, j} - {\hat{b}}_{i, j})

(3)

where sm function is defined as

s m_{L_{1}} (x) = {\begin{matrix} 0.5 x^{2}, | x | < 1 \\ | x | - 0.5, others \end{matrix}

(4)

${\hat{b}}_{i} = ({\hat{b}}_{ix}, {\hat{b}}_{iy}, {\hat{b}}_{iw}, {\hat{b}}_{ih})$ denotes the four-dimensional vector that is parameterized by the ground truth coordinates associating with the predicted sample. Then, we execute the parametrization for bounding box coordinates

{\begin{matrix} b_{ix} = (x - x_{a}) / w_{a} \\ b_{iy} = (y - y_{a}) / h_{a} \\ b_{iw} = \lg (w / w_{a}) \\ b_{ih} = \lg (h / h_{a}) \end{matrix}

(5)

{\begin{matrix} {\hat{b}}_{ix} = (\hat{x} - x_{a}) / w_{a} \\ {\hat{b}}_{iy} = (\hat{y} - y_{a}) / h_{a} \\ {\hat{b}}_{iw} = \lg (\hat{w} / w_{a}) \\ {\hat{b}}_{ih} = \lg (\hat{h} / h_{a}) \end{matrix}

(6)

where x, $x_{a}$ , and $\hat{x}$ denote the predicted bounding box, candidate bounding box, and ground truth bounding box, respectively. Therefore, in the training stage, the algorithm takes the convolution feature of the image in the predicted bounding box as input and optimizes the regression parameters by the gradient descent method. In the test stage, the output is obtained according to the convolution feature of the input image, and the bounding box is fine-tuned after the inverse parameterization. So, the total loss function is obtained as

L (W) = \sum_{n = 1}^{N} \sum_{i \in S_{n}} ϖ_{n} l_{n} (X_{i}, Y_{i} | W)

(7)

where N is the convolutional layer number, $ϖ_{n}$ is the sample weight corresponding to each convolutional layer, and $S_{n}$ is the extracted sample set.

Final results

The whole object region search network carries out end-to-end training by reverse propagation and stochastic gradient descent. Reverse propagation calculates the error between the actual output $O_{act}$ and the corresponding ideal output $O_{ide}$ . According to the principle of error minimization, it adjusts the weight matrix of reverse propagation error. We use base sensitivity (BS) to present reverse propagation error. So the ith layer BS can be calculated as

B S^{l} = {(W^{l + 1})}^{T} B S^{l + 1} \otimes f' (u^{l})

(8)

where ⊗ denotes multiplication of each element. $f (\cdot)$ and W are output activation function and weight matrix, respectively, $u^{l} = W^{l} x^{l - 1} + b^{l}$ and $x^{l} = f (u^{l})$ . b is a bias term. BS of output layer neuron is

B S^{L} = f' (u^{l}) \otimes (y^{n} - t^{n})

(9)

The weight is scaled by BS

\frac{\partial E}{\partial W^{l}} = x^{l - 1} {(B S^{l})}^{T}

(10)

Δ W^{l} = - η \frac{\partial E}{\partial W^{l}}

(11)

The above is the process of reverse propagation. Then, we use IBBR to optimize the detected bounding boxes. The regression pattern setting is applied,³³,³⁴ where patterns are pre-defined as a set of three-dimensional scale changes and offset vectors OE

OE = {[s_{n}, x_{n}, y_{n}]}_{n = 1}^{N}

(12)

Given a detection window $(x, y, w, h)$ , the top-left corner x, y, and its size w, h, the pattern adjusts the window to be

(x - \frac{x_{n} w}{s_{n}}, y - \frac{y_{n} h}{s_{n}}, \frac{w}{s_{n}}, \frac{h}{s_{n}})

(13)

In each pattern, $s_{n} \in {0.83, 0.91, 1.0, 1.10, 1.21}$ and $x_{n}, y_{n} \in {- 0.17, 0.17}$ . Because the patterns are not orthogonal to each other, we must integrate the output of AlexNet to obtain a better adjustment. Thus, a weight for each pattern is considered as

[s, x, y] = \sum_{n = 1}^{N} [s_{n}, x_{n}, y_{n}] I (c_{n} > t)

(14)

I (c_{n} > t) = {\begin{matrix} c_{n}, c_{n} > t \\ 0, otherwise \end{matrix}

(15)

where $s_{n}$ , $x_{n}$ , $y_{n}$ represent the three parameters in the nth airport-displacement pattern; s, x, y represent the integrated airport-displacement pattern; and I represents the weight of the patterns whose confidence exceeds a certain threshold.

More importantly, redundant regression can be generated. To avoid this phenomenon, a function as a terminate condition is defined. It is used to evaluate the performance of detection after every bounding box regression

{\begin{matrix} | s - 1.0 | < 0.0001 \\ | x - 0.0 | < 0.0001 \\ | y - 0.0 | < 0.0001 \end{matrix}

(16)

where s, x, y represent the integrated result of the airport-displacement pattern. If it cannot meet the condition in formula (16), it implies that the detected bounding box at current iteration is not better to locate a candidate region. Thus, the bounding box would be fed into AlexNet again and the operation described above executed. So, we will execute this process several times until a better bounding box generated.

Experimental results

Data set

We collect the regions (mainly airports and harbors), images from Google Earth with 2-m resolution, and label them intended for airport/harbor search in optical remote sensing images. The size of the images is 2048 × 2048 pixels, which is resized from large-scale images before processing. The ground truth of the airport and harbor images is determined by manual annotation. In the test images, the object regions have different sizes and shapes.

When using fast R-CNN for training process, ample training samples are necessary for acquiring better learning abilities. The size and quality of training data set are very critical. Therefore, when selecting the training data set, high-quality training data set is chosen. All the data set contains positive samples including 806 images with each image containing at least one target (in this article, there are 1021 airports) to be detected and negative samples including 100 images that do not contain any targets of the given objects which is at a ratio of 4:1. Figure 4 shows positive and negative training samples.

Figure 4.

Examples from the training data set: (a) positive samples and (b) negative samples.

Evaluation metrics

By following the work,³⁵ when we judge a region detection whether it is correct or incorrect depending on the bounding box, the process overlaps more than 70% with the ground truth bounding box. It the overlap is greater than 70%, it is better search result; otherwise, the detected object region is regarded as a false positive. Nevertheless, if a single ground truth bounding box has several same bounding boxes overlap, there is only one considered as true positive and the others are regarded as false positives in this article. We adopted the standard precision-recall curve to quantitatively evaluate the developed framework. The precision measures the fraction of detections that are true positive and the recall measures the fraction of positive examples that are correctly detected. Let TP, FP, and FN denote the number of correctly detected objects, the number of false alarms, and the number of undetected objects in test images. The precision and recall can be formulated as

Precision = \frac{TP}{TP + FP}

(17)

Recall = \frac{TP}{TP + FN}

(18)

We also use F-score to calculate weighting-harmonic-mean for precision and recall

F - score = \frac{2 \times Precision \times Recall}{Precision + Recall}

(19)

A high F-score indicates that the accuracy of algorithm is better.

Experimental results and comparison

Based on the successful application of AlexNet that has been pre-trained on the ImageNet data set, we train our HCNN model. The experiment is partitioned into three procedures: training process, testing, and the search process. In training process, we use graphics processing unit (GPU) and NVIDIA (NVIDIA: This is a comprehensive software that can implement graphics, audio, video, communications, storage and security functions) to enhance the training speed.^36,37 Meanwhile, we can obtain the trained CNN models. Then, testing images are input to validate the algorithm. The search process is used to detect regions in test images. We use IBBR to improve the search accuracy as described in the preceding section. The learning rate is also an important factor. Figure 5 is the comparison of learning rate with HCNN method (from left to right in x-coordinate, learning rate = 0.01, 0.005, 1e–6 in turn).

Figure 5.

F-score measurement in terms of different learning rates.

So we set learning rate to 1e–6. Minibatch size is constructed as 32. We use small patches cropped from raw image to train our CNN models through the backpropagation algorithm, by employing the GPU and compute unified device architecture (CUDA) for AlexNet model to improve the speed.

In order to evaluate the better performance of the presented region search method, we compare to other two state-of-the-art detection methods in this article, namely, cascade region-based convolutional neural network (CR-CNN)³⁸ and DLU (DLU is the highly summarized abbreviation from deep learning U-net).³⁹

We select four optical remote sensing images to test the performance. To train our HCNN model, if the candidate proposals are greater than 0.3 IoU overlap with a ground truth box, then they are deemed to be positives, and the others are as negatives. And Figures 6 –11 show a portion of the comparison of the search results of the CR-CNN, DLU, and HCNN methods.

Figure 6.

Airport search results with the CR-CNN method.

Figure 7.

Airport search results with the DLU method.

Figure 8.

Airport search results with the HCNN method.

Figure 9.

Harbor search results with the CR-CNN method.

Figure 10.

Harbor search results with the DLU method.

Figure 11.

Harbor search results with the HCNN method.

In Figures 6 and 9, the CR-CNN method suffers from many false airport/harbor regions and many missed detections. The DLU method achieves a better search performance than CR-CNN shown in Figures 7 and 10, but it is worse than the proposed method. Although the object regions have different orientations and different sizes, the proposed method has successfully searched and detected most of them. Due to the low target and the background, there are missing alarms.

Tables 1 –3 show the quantitative comparison results of the three different methods, measured by F-score values, average running time per image, respectively.

Table 1.

Airport performance comparison of CR-CNN, DLU, and HCNN.

Method	Precision	Recall	F-score
CR-CNN	68.32%	92.64%	78.64%
DLU	72.68%	94.37%	82.12%
HCNN	87.98%	96.54%	92.06%

CR-CNN: cascade region-based convolutional neural network; HCNN: hybrid convolutional neural network.

The bold-faced values are better results.

Table 2.

Harbor performance comparison of CR-CNN, DLU, and HCNN.

Method	Precision	Recall	F-score
CR-CNN	65.74%	89.61%	75.84%
DLU	72.43%	91.38%	80.81%
HCNN	89.63%	94.58%	92.04%

CR-CNN: cascade region-based convolutional neural network; HCNN: hybrid convolutional neural network.

The bold-faced values are better results.

Table 3.

Computation time comparison of CR-CNN, DLU, and HCNN.

Method	Average running time per image (s)
Method	Airport	Harbor
CR-CNN	15.2	16.5
DLU	16.8	14.7
HCNN	10.6	9.5

CR-CNN: cascade region-based convolutional neural network; HCNN: hybrid convolutional neural network.

The bold-faced values are better results.

As can be seen from them, we derive the following conclusions. (1) The proposed HCNN model method with IBBR outperforms all other comparison approaches for airports and harbors in terms of F-score. Specifically, our HCNN method obtains 92.06%, compared with the CR-CNN, and DLU for airport. This demonstrates the high superiority of the proposed method compared with the existing state-of-the-art methods. (2) For running time, HCNN takes nearly 10 s lower than other two methods. (3) With the other same computation cost, our HCNN model improves the newly trained CNN-model-based method significantly for airports and other regions. This adequately shows the effectiveness of the proposed HCNN model learning method.

Conclusion

In this work, an effective region search framework based on HCNN has been developed. This proposed region search method contains three main procedures: generating candidate regions, feature extraction, and final search. FCN is used to generate a coarse-object-sensitive map, thereby effectively rejecting many regions that do not contain objects in the process of first stage. Then, we use the newest fast R-CNN to extract region features. Finally, IBBR is used for optimizing bounding box to produce more accuracy region search. At last, the performance of the proposed framework has been evaluated qualitatively and quantitatively using airport and harbor optimal remote sensing images. Comprehensive comparisons with state-of-the-art approaches have demonstrated the effectiveness of the developed region search framework. In terms of precision, recall, and F-score, the results with proposed HCNN method are superior to other methods.

Although our new proposed method has been shown to be effective to obtain promising region search performance, some problems still exist: (1) some geospatial objects, having similar appearances to targets in structure and shape, may result in the false alarms, and (2) cluttered background and low contrast between the object and the background may lead to miss alarm.

Therefore, our future work may include the following issues: (1) integrate contextual background cues with the developed framework to improve region search performance; (2) develop new CNN-based model to address the mis-classification problems; (3) use this method to locate more objects, such as bridge, oil tank, and helicopter, and enhance the scene understanding, and especially has a significant military area; and (4) adopt airplane as the feature to detect airport in large-scale remote sensing image.

Footnotes

Handling Editor: Qingchen Zhang

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Defense Industrial Technology Development Program under Grant JCKY2016603C004.

ORCID iD

Shoulin Yin

References

Zhang

Yang

Chen

et al . An adaptive dropout deep computation model for industrial IoT big data learning with crowdsourcing to cloud computing. IEEE Trans Indus Inform 2019; 15: 2330–2337.

Chen

Yang

et al . Deep convolutional computation model for feature learning on big data in Internet of Things. IEEE Trans Indus Inform 2018; 14(2): 790–798.

Chen

Yang

et al . An incremental deep convolutional computation model for feature learning on industrial big data. IEEE Trans Indus Inform 2019; 15: 1341–1349.

Pan

Tai

Zheng

et al . Cascade convolutional neural network based on transfer-learning for aircraft detection on high-resolution remote sensing images. J Sensors 2017; 2017: 1796728.

Rodriguez

Carriello

Fernandes

PJF

et al . Assessment of flooded areas projections and floods potential impacts applying remote sensing imagery and demographic data. ISPRS 2016; XLI-B8: 159–161.

Bao

Zhou

Liu

et al . A MRF-based multi-agent system for remote sensing image segmentation. In: Proceedings of the international conference on computing intelligence and information system, Nanjing, China, 21–23 April 2017, pp.246–249. New York: IEEE.

Chang

Kim

et al . Canopy-cover thematic-map generation for Military Map products using remote sensing data in inaccessible areas. Landscape Ecol Eng 2011; 7(2): 263–274.

Zhang

Yang

Yan

et al . An efficient deep learning model to predict cloud workload for industry informatics. IEEE Trans Indus Inform 2018; 14(7): 3170–3178.

Chen

Yang

et al . An improved stacked auto-encoder for network traffic flow classification. IEEE Netw 2018; 32(6): 22–27.

10.

Yin

Zhang

Karim

Large scale remote sensing image segmentation based on fuzzy region competition and Gaussian mixture model. IEEE Access 2018; 6: 26069–26080.

11.

Long

Gong

Xiao

et al . Accurate object localization in remote sensing images based on convolutional neural networks. IEEE Trans Geosci Remote Sens 2017; 55(5): 2486–2498.

12.

Kobayashi

Hidaka

Kurita

Feature selection for object recognition by histograms of oriented gradients. IEICE Techn Rep 2007; 106(606): 119–124.

13.

Zhang

et al . SDBD: a hierarchical region-of-interest detection approach in large-scale remote sensing image. IEEE Geosci Remote Sens Lett 2017; 14(5): 699–703.

14.

Maestas

Lumia

Starr

et al . Scale invariant feature transform (SIFT) parametric optimization using Taguchi design of experiments. Intell Robot Appl 2010; 6424(6): 630–641.

15.

Krizhevsky

Sutskever

Hinton

GE.

ImageNet classification with deep convolutional neural networks. In: Proceedings of the international conference on neural information processing systems, Lake Tahoe, NV, December 2012, pp.1097–1105. Advances in Neural Information Processing Systems.

16.

Girshick

Donahue

Darrell

et al . Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, Columbus, OH, 23–28 June 2014. New York: IEEE.

17.

Uijlings

Sande

Gevers

et al . Selective search for object recognition. Int J Comput Vis 2013; 104(2): 154–171.

18.

Wong

et al . Multiple object detection by deformable part-based model and R-CNN. IEEE Sig Process Lett 2018; 25: 288–292.

19.

Liu

Jia

. Box aggregation for proposal decimation: last mile of object detection. In: Proceedings of the IEEE international conference on computer vision, Santiago, Chile, 7–13 December 2015, pp.2569–2577. New York: IEEE.

20.

Girshick

. Fast R-CNN. In: Proceedings of the IEEE international conference on computer vision, Santiago, Chile, 7–13 December 2015, pp.1440–1448. New York: IEEE.

21.

Wang

Chen

Salient object detection via fast R-CNN and low-level cues. In: Proceedings of the IEEE international conference on image processing, Phoenix, AZ, 25–28 September 2016. New York: IEEE.

22.

Lou

Jiang

et al . Improve object detection via a multi-feature and multi-task CNN model. In: Proceedings of the visual communications and image processing, St. Petersburg, FL, 10–13 December 2017, pp.1–4. New York: IEEE.

23.

Ren

Girshick

et al . Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell 2015; 39(6): 1137–1149.

24.

Chen

Sakaridis

et al . Domain adaptive faster R-CNN for object detection in the wild, 2018, https://arxiv.org/abs/1803.03243

25.

Redmon

Divvala

Girshick

et al . You only look once: unified, real-time object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp.779–788, https://arxiv.org/abs/1506.02640

26.

Liu

Anguelov

Erhan

et al . SSD: single shot multibox detector. In: Proceedings of the European conference on computer vision, Amsterdam, 8–16 October 2016, pp.21–37. New York: Springer.

27.

Nie

Wang

Adeli

et al . 3-D fully convolutional networks for multimodal isointense infant brain image segmentation. IEEE Trans Cybernet 2018; 49: 1123–1136.

28.

Zhang

Lin

Yang

et al . A double deep Q-learning model for energy-efficient edge scheduling. IEEE Trans Serv Comput 2018, https://www.computer.org/csdl/journal/sc/5555/01/08449105/13rRUxjQy9m

29.

Cheng

Zhou

Han

Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images. IEEE Trans Geosci Remote Sens 2016; 54(12): 7405–7415.

30.

Simonyan

Zisserman

Very deep convolutional networks for large-scale image recognition. Comput Sci 2014, https://arxiv.org/abs/1409.1556

31.

Kopacz

Kryzia

Assessment of sustainable development of hard coal mining industry in Poland with use of bootstrap sampling and copula-based Monte Carlo simulation. J Clean Product 2017; 159: 359–373.

32.

Hajič

Pecina

Detecting noteheads in handwritten scores with ConvNets and bounding box regression, 2017, https://arxiv.org/abs/1708.01806

33.

Luo

Wen

et al . Deep-learning-based face detection using iterative bounding-box regression. Multimed Tools Appl 2018; 77(6): 1–18.

34.

Lin

Shen

et al . A convolutional neural network cascade for face detection. In: Proceedings of the computer vision and pattern recognition, Boston, MA, 7–12 June 2015, pp.5325–5334. New York: IEEE.

35.

Cheng

Han

Guo

et al . Object detection in remote sensing imagery using a discriminatively trained mixture model. ISPRS J Photogramm Remote Sens 2013; 85(9): 32–43.

36.

Jing

Peng

Zhikui

. A canonical polyadic deep convolutional computation model for big data feature learning in Internet of Things. Future Gener Comput Syst 2019; 99: 508–516.

37.

Jing

Jianzhong

Yingshu

. Approximate event detection over multi-modal sensing data. J Comb Optim 2016; 32: 1002–1016.

38.

Cai

Vasconcelos

Cascade R-CNN: delving into high quality object detection, 2017, http://openaccess.thecvf.com/content_cvpr_2018/papers/Cai_Cascade_R-CNN_Delving_CVPR_2018_paper.pdf

39.

Ling

Qiu

Jin

et al . An accurate and real-time self-blast glass insulator location method based on faster R-CNN and U-net with aerial images. Comput Vis Pattern Recognit 2018, https://arxiv.org/abs/1801.05143