Abstract
Precise pixel-level classification of brain tumor tissue is crucial in brain tumor image segmentation. However, because the spatial location of brain tumors within the brain is random, traditional segmentation methods are limited and their accuracy cannot meet practical requirements. To address these problems, this article applies a convolutional neural network model, a deep learning approach, to the classification and labeling of brain tumor images. The main contributions are as follows: the principle and operation of convolutional neural networks in image processing are first introduced; a 12-layer convolutional stack is then designed for the local pathway of a two-way convolutional neural network architecture; to account for the inter-label dependency between pixel areas, an input series connection structure is designed that simulates a conditional random field; a multi-pooling input series connection model is designed to overcome the limited size of the input pixel area; finally, the classification accuracy in experiments reached 83%, verifying the effectiveness of the improved model.
Introduction
There are about 250,000 brain tumor patients worldwide; in the United States alone, about 23,000 cases were diagnosed in 2015. Segmenting brain tumors from magnetic resonance (MR) images has great value for improving diagnosis, predicting growth rate, and developing treatment plans. However, gliomas and glioblastomas have extended, tentacle-like structures with poor contrast. Moreover, brain tumors may appear anywhere in the brain, and their shapes and sizes are random. This makes brain tumor segmentation exceptionally difficult.
Researchers have carried out a series of studies to address these difficulties. Prastawa et al. 1 and Schmidt et al. 2 used image registration for segmentation; however, this abnormal-edge-detection-based approach was susceptible to interference from diseased tissue and therefore produced incorrect segmentations. Gooya et al. 3 and Parisot et al. 4 tried to solve this problem by combining brain scans with joint segmentation and registration. Liu et al. 5 showed that anomalous structures could be isolated in the sparse components obtained through registration and low-rank decomposition, and abnormality detection has also been proposed in image synthesis. Typical examples include the dictionary-based learning approach of Weiss et al. 6 and the patch-based approach of Ye et al., 7 which synthesizes a pseudo-healthy image so that abnormal areas can be highlighted by comparison with the patient's scans. Along these lines, Cardoso et al. 8 proposed a generative image model that produces probabilistic segmentations of abnormalities. Erihov et al. 9 proposed another unsupervised technique that exploits the asymmetry of brain tissue structures in diseased patients. A common advantage of these approaches is that they do not require a manually annotated training data set; under normal circumstances, however, they are better suited to detecting lesions than to accurate segmentation, classification, and labeling. Geremia et al. 10 used intensity features to capture the appearance of multiple sclerosis (MS) lesions around each voxel region; Gooya et al. 3 combined them with a generative Gaussian mixture model (GMM) to obtain tissue-specific prior probabilities; and Tustison et al. 11 additionally used a Markov random field (MRF) to incorporate spatial regularization.
These approaches have been very successful, but their modeling capabilities still have significant limitations.
This article introduces the function and principle of convolutional neural networks (CNNs)12,13 as applied to deep learning-based classification and labeling of brain tumor images, and discusses how CNNs handle image problems in terms of their layer structure. Several improvements are then made on the basis of the two-way CNN: a multi-layer convolutional improvement of the local pathway, an input series connection improvement, and a multi-pooling improvement. Finally, experiments show that the improved model raises the classification accuracy for different types of brain tumors.
CNN
CNNs evolved from ordinary neural networks, so the two are very similar. They are composed of neurons that simulate biological neurons: each has weights and biases and the ability to learn. Each neuron receives an input, multiplies it by its weights, and produces an output through an activation function. The whole network can thus be expressed as a single piecewise differentiable function. CNNs also allow images to be pre-processed in ways that make the forward pass more efficient and greatly reduce the number of parameters in the network.
In the CNN architecture, a convolutional layer of neurons can be described by three dimensions: width, height, and depth (where depth refers to the number of channels of the input images, not the number of layers of the network), as shown in Figure 1. As described later, the neurons in a layer are connected only to a small region of the previous layer, rather than to all of its neurons in a fully connected manner. Full connection is generally used only in the final output layer, where a column vector typically represents the scores of the different classes. Figure 1 14 shows a collection of neurons arranged in a three-dimensional (3D; width, height, and depth) structure: each convolutional layer converts its input into an output map of neuron activations via these 3D neurons.

Visibility graph of 3D neuron in CNN.
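As a concrete illustration of this local connectivity, the following NumPy sketch (the patch and kernel values are made up for illustration) computes one channel of a convolutional feature map; each output value depends only on a small window of the input, not on the whole image:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 'valid' 2D convolution: each output neuron sees only a small
    local region of the input, unlike a fully connected layer."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

patch = np.random.rand(33, 33)   # one channel of an input patch
kernel = np.random.rand(3, 3)    # a 3x3 receptive field
fmap = conv2d_valid(patch, kernel)
print(fmap.shape)                # (31, 31)
```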
Medical image annotation based on series connection multi-pooling CNN
Model training
This study trains the model with the stochastic gradient descent (SGD) method. The output of the convolutional network is interpreted as a distribution over the segmentation labels. The natural training criterion is to maximize the probability of all labels in the training set or, equivalently, to minimize the negative log probability

−log p(Y|X) = −Σ_{i,j} log p(Y_{i,j}|X)

for each brain image X with label map Y.
Medical images differ greatly from ordinary images. First, there are very complicated relationships among their internal tissues, accompanied by randomness, fuzzy uncertainty, and imperfection; for these reasons, our prior knowledge of medical images is limited. During the classification and annotation of medical images, it also becomes intuitively clear that the relationship between adjacent pixels is very important.
Pixels convey only low-level information such as color and brightness, so the information expressed by individual pixels is very limited. In their experiments, Schroff et al.15,16 segmented the entire image into small patches centered on individual pixels, so that each pixel-centered area could be assigned a semantic label. Hoiem et al. 17 processed the resulting "areas" (sometimes called superpixels) produced by image segmentation algorithms and used them as the basic processing unit to achieve effective semantic segmentation.
To this end, we used the SGD algorithm: we repeatedly selected the labels Y_{i,j} of a random subset of patches in each brain, computed the average negative log probability of the mini-batch, and took gradient descent steps on the CNN parameters θ

θ ← θ − η ∇_θ ( −(1/|S|) Σ_{(i,j)∈S} log p(Y_{i,j}|X; θ) )

where S is the sampled patch subset and η is the learning rate.
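The training procedure above can be sketched as follows. This is a minimal, hypothetical stand-in that replaces the CNN with a linear softmax classifier over flattened patches; the synthetic data, learning rate, and batch size are illustrative, not the paper's settings, but the random mini-batch sampling, average negative log probability, and gradient step mirror the description:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerically stable softmax
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Toy stand-in for the CNN: a linear classifier over flattened 33x33 patches.
n_classes, n_features = 5, 33 * 33
X = rng.normal(size=(200, n_features)) / np.sqrt(n_features)  # 200 fake patches
y = (X @ rng.normal(size=(n_features, n_classes))).argmax(axis=1)  # fake labels

W = np.zeros((n_features, n_classes))
lr, batch = 0.5, 32
for step in range(300):
    idx = rng.choice(len(X), batch, replace=False)  # random patch subset
    p = softmax(X[idx] @ W)
    grad_logits = p.copy()
    grad_logits[np.arange(batch), y[idx]] -= 1.0    # d(avg NLL)/d(logits)
    W -= lr * X[idx].T @ grad_logits / batch        # gradient descent step

nll = -np.log(softmax(X @ W)[np.arange(len(X)), y]).mean()
print(f"final average NLL: {nll:.3f}  (chance level log 5 = {np.log(5):.3f})")
```

With W initialized to zero, the average negative log probability starts at exactly log 5 and decreases as the gradient steps are taken.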
Series connection multi-pooling improvement methods
Two-way series connection
One shortcoming of CNNs is that they predict each segmentation label independently of the others. Conditional random fields, on the other hand, classify pixels by modeling the posterior probability, which requires computing unary and pairwise potential functions and is therefore computationally expensive and time-consuming. To retain the efficiency of CNNs while directly modeling the dependency between adjacent labels in the segmentation, so that the final prediction is influenced by nearby labels, the output of the first CNN was used as an additional input to a layer of the second CNN. As before, this was done through the connection of convolutional layers, and the same two-way structure was used for the second CNN, effectively placing the two CNNs in series. We call such a model a series connection architecture.
The details are shown in Figure 2. We call this model an input series connection CNN.

Input series connection model.
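One way to picture the series connection is that the second CNN's input simply gains extra channels holding the first CNN's per-pixel class probabilities. The following NumPy sketch shows only the shape bookkeeping; random arrays stand in for the real modalities and predictions:

```python
import numpy as np

rng = np.random.default_rng(1)

patch = rng.random((33, 33, 4))              # 4 MRI modalities for one patch
first_stage_probs = rng.random((33, 33, 5))  # first CNN's 5-class output map
first_stage_probs /= first_stage_probs.sum(axis=2, keepdims=True)  # per-pixel probs

# The second CNN sees the raw modalities plus the first CNN's predictions,
# so its output can depend on the labels predicted for neighbouring pixels.
second_stage_input = np.concatenate([patch, first_stage_probs], axis=2)
print(second_stage_input.shape)              # (33, 33, 9)
```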
The design in the previous section shows that increasing the number of convolutional layers in the local path of the two-way architecture, together with smaller convolution kernels, helps the network focus on texture structures at finer pixel resolution. Accordingly, we set up 8-layer, 12-layer, and 20-layer convolution stacks in the local pathway of the input series connection architecture, and found that the 12-layer convolution structure gave the best results.
Series connection multi-pooling
In the previous section, we selected a patch (33 × 33, 65 × 65) centered on each pixel for training, prediction, classification, and segmentation. There were, however, some practical problems to overcome in this scheme. Since tumor size differs between patients, this method limits the size of the input images; yet if the input window is too large, a certain redundancy is introduced, which degrades the accuracy of the pixel-centered semantic segmentation and classification. For this reason, the multi-pooling improvement was proposed.
Multi-pooling is an operation similar to MaxPooling that generates multi-scale features. Compared with traditional MaxPooling, multi-pooling is easy to apply and flexibly expands the range of the patch area, which ensures that richer feature information is extracted while avoiding an increase in redundant information. Multi-pooling is very helpful for overcoming the input-size limitation of the local series structure. In the convolutional process, the multi-pooling operation can replace the original MaxPooling operation as the application requires.
MaxPooling performs a single reduction of a feature map. However, because lesions differ in size between patients (Figure 3), the patch range must be expanded to a certain extent.

Lesions of different sizes.
The multi-pooling operation was designed to capture the features of the areas surrounding the central pixel. As shown in Figure 4, the features centered on the central pixel can be expressed as
where the sizes of two central areas corresponding to

Multi-pooling schematic diagram.
The multi-pooling operation architecture was inspired by the spatial pyramid pooling network (SPPNet). 18 Compared to SPPNet, the multi-pooling operation has the following characteristics:
The number of MaxPooling operations performed within a multi-pooling operation depends on the location of the selected feature map.
Multi-pooling can be connected after any convolutional layer, whereas the output of SPPNet can only be connected to the top of the CNN.
In this section, the multi-pooling operation was connected to the top of the input series connection architecture, which expands the range of the input image areas while avoiding the introduction of redundant information. The combination of the multi-pooling operation with the local series connection architecture is shown in Figure 5.

Multi-pooling input series connection model.
The specific process is as follows:
Select a 128 × 128 × 4 area and apply the multi-pooling operation to generate a small 32 × 32 × 12 patch block; after adding padding = 1, this yields the green 33 × 33 × 12 patch block.
Select a 65 × 65 × 4 block and convolve it to generate a red 33 × 33 × 4 block; connect this in parallel with the green multi-pooled patch block to form the input of the convolutional network, and then feed the combined red+green patch, 33 × 33 × 16, into the flow path of the series connection structure as shown in the flow chart.
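One plausible reading of the shapes in the first step above, sketched in NumPy (the exact crop positions, pooling factors, and one-sided padding are our interpretation of the figure, not taken verbatim from it): the 128 × 128 × 4 context is reduced to a 32 × 32 map at each of three scales, the scales are concatenated into 32 × 32 × 12, and padding brings this to 33 × 33 × 12.

```python
import numpy as np

def block_max_pool(x, k):
    """Max-pool each channel of x over non-overlapping k x k blocks."""
    h, w, c = x.shape
    return x[:h - h % k, :w - w % k].reshape(h // k, k, w // k, k, c).max(axis=(1, 3))

rng = np.random.default_rng(2)
region = rng.random((128, 128, 4))                  # large context, 4 modalities

scales = [block_max_pool(region[48:80, 48:80], 1),  # 32x32 core, untouched
          block_max_pool(region[32:96, 32:96], 2),  # 64x64 -> 32x32
          block_max_pool(region, 4)]                # 128x128 -> 32x32
multi = np.concatenate(scales, axis=2)              # 32x32x12 (3 scales x 4 channels)
padded = np.pad(multi, ((0, 1), (0, 1), (0, 0)))    # pad one row/column -> 33x33x12
print(padded.shape)
```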
Experimental results and analysis
Experimental environment
The data sets were a subset provided by BRATS2015.19,20 They were acquired at four different centers—Bern University, Debrecen University, Heidelberg University, and Massachusetts General Hospital—over several years, using MR scanners from different vendors, with different field strengths (1.5 and 3 T) and different implementations of the imaging sequences (e.g. two-dimensional (2D) or 3D). Since the brains in the BRATS data set lack resolution in the third dimension, we perform the segmentation slice by slice in the axial view. The training data set contained 30 patients (20 with advanced tumors and 10 with less advanced tumors), and the testing data set included 25 patients (21 with advanced tumors and 4 with less advanced tumors). There were 200 2D slices for each brain, giving our model approximately 6000 2D images to train on. Four modalities were available for each brain:
T1: T1-weighted native image; sagittal or axial 2D acquisition with 1- to 6-mm slice thickness.
T1c: T1-weighted contrast-enhanced (gadolinium) image; 3D acquisition with 1-mm isotropic voxel size for most patients.
T2: T2-weighted image; axial 2D acquisition with 2- to 6-mm slice thickness.
FLAIR: T2-weighted FLAIR image; axial, coronal, or sagittal 2D acquisition with 2- to 6-mm slice thickness.
Five classification labels were provided for the brains in the training data set: non-tumor, necrosis, edema, non-enhanced tumor, and enhanced tumor. Caffe, installed on a desktop computer, was used for the experiments. Our method predicts the class of a pixel by processing the M × M patch centered on that pixel; the input X of our CNN model is thus an M × M 2D patch with several modalities. The configuration of the computer was as follows: 8 GB of memory, Ubuntu 16.04, CUDA 8.0, and an NVIDIA GTX 1070 GPU with 8 GB of GDDR5 memory.
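The pixel-centered patch extraction described above can be sketched as follows; the 240 × 240 slice size and the zero-padding at image borders are assumptions for illustration, not specifics from the paper:

```python
import numpy as np

def extract_patch(volume_slice, i, j, m=33):
    """Return the m x m patch centred on pixel (i, j) of a 2D multi-modal
    slice of shape (H, W, 4); zero-pads at the image border."""
    r = m // 2
    padded = np.pad(volume_slice, ((r, r), (r, r), (0, 0)))
    return padded[i:i + m, j:j + m]

slice_4mod = np.random.rand(240, 240, 4)  # T1, T1c, T2, FLAIR channels
patch = extract_patch(slice_4mod, 120, 100)
print(patch.shape)                         # (33, 33, 4)
```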
Analysis of experimental results
The experiments were performed in Caffe with four models: the two-way structure, the 12-layer convolutional two-way architecture, the input series connection two-way 12-layer convolution, and the multi-pooling input series connection 12-layer convolution. Small patch images centered on each pixel were classified and annotated into five categories—non-tumor, necrosis, edema, non-enhanced tumor, and enhanced tumor—represented in the implementation by the labels 0, 1, 2, 3, and 4, respectively. Because the pixel patch areas of the different categories are unbalanced, the integrated accuracy is not simply the mean of the per-category accuracies. Brain tumor segmentation is a highly data-imbalanced problem: healthy voxels (label 0) comprise 98% of all voxels, and of the remaining 2% pathological voxels, 0.18% belong to necrosis (label 1), 1.1% to edema (label 2), 0.12% to non-enhanced tumor (label 3), and 0.38% to enhanced tumor (label 4). The integrated accuracy is therefore the average of the per-category accuracies weighted by these voxel proportions. This imbalance stems from the distribution of training samples across the categories and directly affects classification accuracy:
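The weighted integrated accuracy can be computed directly from the voxel proportions quoted above; note that the per-class accuracies in this sketch are hypothetical placeholders, not the paper's experimental results:

```python
import numpy as np

# Voxel proportions per class, as reported for the BRATS training data:
# non-tumor, necrosis, edema, non-enhanced tumor, enhanced tumor.
proportions = np.array([0.98, 0.0018, 0.011, 0.0012, 0.0038])
proportions /= proportions.sum()  # renormalise so the weights sum to 1

# Hypothetical per-class accuracies, for illustration only.
per_class_acc = np.array([0.97, 0.80, 0.82, 0.78, 0.83])

# Weighted average, not plain mean: dominated by the 98% healthy voxels.
integrated = float(proportions @ per_class_acc)
print(round(integrated, 4))
```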
1. The comparison of experimental results of the two-way structure with the 12-layer two-way convolutional structure is detailed in Figure 6; the main comparison is the classification accuracy and integrated accuracy over the five pixel categories of brain tumor images. Figure 6 shows that the classification accuracy improved for all tumor types once the local path of the two-way structure used smaller, multi-layer convolution kernels. This is because smaller convolution kernels allow more accurate fine-grained features to be learned in the local path.

Comparison of experimental results of two-way structure and 12-layer two-way convolutional structure.
W1 represents the size of the input image before the convolutional layer, and W2 the size of the input image before the pooling layer; F1 is the size of the convolution window and F2 the size of the pooling window; P is the size of the zero-padding area; and S is the stride of the sliding window. Generally speaking, images shrink after MaxPooling. However, as the calculation in Table 1 shows, the output and input sizes remained unchanged after each convolution + pooling operation when a 2 × 2 convolution window was combined with a 3 × 3 pooling window (padding = 1), which facilitates stacking multiple convolutional layers and learning more fine-grained features of the patch images. The pixel classification accuracy was therefore significantly improved relative to the two-way structure. Compared with a larger convolution kernel, the smaller kernel was also faster to compute because it contains fewer weights. However, deeper convolutional networks are harder to train: values can grow explosively during forward propagation, or gradients can vanish during backpropagation. Since we want the variance of the values computed by the convolution kernels to remain stable, we initialized the weights with Xavier's approach 21 and applied batch normalization (BN), 22 usually before the activation and after the fully connected layer. This smooths the gradients, allows a somewhat higher learning rate during training, and reduces the dependence on the initial values.
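The size bookkeeping above follows the standard formula W2 = (W1 − F + 2P)/S + 1. A small sketch of the arithmetic (stride 1 for the pooling is our assumption, consistent with the description of Table 1):

```python
def out_size(w, f, p, s=1):
    """Spatial output size of a convolution or pooling layer:
    floor((W - F + 2P) / S) + 1."""
    return (w - f + 2 * p) // s + 1

w = 33
w = out_size(w, f=2, p=0)  # 2x2 convolution, no padding: 33 -> 32
w = out_size(w, f=3, p=1)  # 3x3 pooling, padding=1, stride 1: 32 -> 32
print(w)
```

An explicit extra padding of 1 (as in the multi-pooling pipeline) then restores the 33 × 33 patch size, so the blocks can be stacked without the feature maps shrinking away.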
2. The comparison of experimental results based on 12-layer two-way convolutional structure and input series connection 12-layer two-way convolutional structure is detailed in Figure 7, and the main comparison contents are the classification accuracy and integrated accuracy of five categories of the pixels in brain tumor images.
Setting method of 12-layer convolutional structure.
BN: batch normalization.

Comparison of experimental results of 12-layer two-way structure and input series connection 12-layer two-way structure.
The data in Figure 7 show the effect of the input series connection 12-layer two-way convolutional structure. The classification accuracy of enhanced tumors improved most significantly, by up to 17%, which is sufficient to demonstrate the effectiveness of this model. Because diseased human tissue is spatially continuous, the categories of adjacent pixels in the image are dependent, and it is important to take this dependence into account when classifying pixels. During training, as shown in Figure 2, the balance of the categories must be considered when the green 65 × 65 input patch block surrounding the 33 × 33 core patch is added to the network. The pixel data of brain tumor images are highly unbalanced: healthy voxels (label 0) account for 98%, and the remaining 2% are pathological voxels (necrosis, label 1; edema, label 2; non-enhanced tumor, label 3; enhanced tumor, label 4). The sampling probability of healthy voxels was therefore limited when selecting pixel patches at random during training; even so, healthy voxels were still present in large numbers. The CNN learns the features of healthy voxels most accurately, so their classification accuracy remained around 97%. We also convolved the green 65 × 65 patch and used it, connected in series with the 33 × 33 core patch, as the input of the 12-layer two-way convolutional structure. Intuitively, the larger 65 × 65 patch introduces richer pixel information, which helps the prediction and classification of the core pixels (the 33 × 33 patch block).
However, the input series connection model's reliance on the labels of the surrounding pixels affected the discrimination between enhanced and non-enhanced tumor labels: tumor boundaries are usually diffuse, with no sharp demarcation between enhanced and non-enhanced tumor, so the pixel patch areas near the boundary are similar. Because the input series connection 12-layer convolutional structure depends too heavily on the surrounding patch labels, its classification accuracy for non-enhanced tumors (label 3) decreased compared with the 12-layer two-way convolutional structure. Nevertheless, the accuracy for enhanced tumor areas increased significantly, up to 83%, and the integrated accuracy rose to 81%. This indicates that the input series connection model effectively improves classification accuracy when the dependence of labels is taken into account.
3. The comparison of experimental results based on input series connection 12-layer two-way structure and multi-pooling input series connection 12-layer two-way structure is detailed in Figure 8, and the main comparison contents are the classification accuracy and integrated accuracy of five categories of the pixels in brain tumor images.

Comparison of experimental results of input series connection 12-layer two-way structure and multi-pooling input series connection 12-layer two-way structure.
The data in Figure 8 show that the multi-pooling input series connection 12-layer convolution structure achieved the best results among the three groups of experiments: the classification accuracy for the four pathological categories (labels 1-4) essentially reached 80%. Compared with the input series connection 12-layer two-way convolutional structure, the accuracy for non-enhanced tumors (label 3) also improved, by over 5%. Pixel areas at the boundary of enhanced and non-enhanced tumors are highly interdependent; however, the multi-pooling operation allowed more features to be learned during training, so pixels in these diffuse boundary areas could be classified more accurately, which to a certain extent balanced the label dependence of the simulated conditional random field (the input series connection model). Processing the input patches directly with the multi-pooling operation lets the input contain not only the information of the core area (33 × 33) but also the main information of the surrounding 64 × 64 and 128 × 128 areas, which is very valuable for tumors occupying a large region. Intuitively, the larger the input area, the richer the information and texture it contains, which is more conducive to training the neural network. Moreover, the model built with the multi-pooling operation captures features more distinctively, and its filter weights learn the main features more accurately. Furthermore, at test time the multi-pooling model was faster than the 12-layer two-way convolutional structure model.
4. Comparison with other methods.
The comparison of integrated accuracy among the support vector machine (SVM), random forest (RF), two-way structure, 12-layer two-way structure, input series connection 12-layer structure, and multi-pooling input series connection 12-layer structure is detailed in Table 2. The results show that the multi-pooling input series connection 12-layer convolutional structure achieved the highest integrated accuracy on these training data sets. The commonly used SVM, however, did not perform well in this experiment; this is because SVM is highly dependent on data preprocessing, and the data sets we used were not heavily pre-processed. One advantage of RF over SVM is that the data need not be pre-processed; accordingly, RF achieved a very good integrated classification accuracy of up to 81%, superior to the two-way structure and the 12-layer two-way convolutional structure, since neither of those structures considers the dependence between pixel labels. When this dependence was considered, as in the input series connection structure, the integrated accuracy reached around 83%. However, CNN models are computationally expensive and require large data sets for training, so hardware constraints must also be considered alongside classification accuracy. With respect to generalization, SVM is the weakest of the three method families (SVM, RF, and CNN); RF and CNN perform significantly better and are more robust. Unlike SVM, which simply maps its input to a high-dimensional space, CNN can extract abstract spatial features.
When data imbalance must be dealt with, the deep convolutional network's stronger feature representation ability is beneficial for coping with the imbalance among multiple categories.
Comparison of different methods.
SVM: support vector machine; RF: random forest.
Conclusion
In this article, we studied the classification and labeling of brain tumor images and proposed a series connection multi-pooling CNN model to address fine feature extraction and the limited size of brain tumor image areas. Experiments show that this model not only introduces more important pixel-area feature information into the CNN but also avoids adding redundant information, while significantly improving the integrated classification accuracy. As a next step, we will visualize the proposed classification model for further analysis and, in combination with medical knowledge, comprehensively study how different factors affect performance. Furthermore, we will increase the amount of training data to give the network stronger generalization ability.
Footnotes
Handling Editor: James Brusey
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was partly supported by the Research Project Supported by Shanxi Scholarship Council of China (grant no. 2017-049).
