Nonlocal spatial attention module for image classification

Abstract

To enhance the capability of neural networks, research on attention mechanism have been deepened. In this area, attention modules make forward inference along channel dimension and spatial dimension sequentially, parallelly, or simultaneously. However, we have found that spatial attention modules mainly apply convolution layers to generate attention maps, which aggregate feature responses only based on local receptive fields. In this article, we take advantage of this finding to create a nonlocal spatial attention module (NL-SAM), which collects context information from all pixels to adaptively recalibrate spatial responses in a convolutional feature map. NL-SAM overcomes the limitations of repeating local operations and exports a 2D spatial attention map to emphasize or suppress responses in different locations. Experiments on three benchmark datasets show at least 0.58% improvements on variant ResNets. Furthermore, this module is simple and can be easily integrated with existing channel attention modules, such as squeeze-and-excitation and gather-excite, to exceed these significant models at a minimal additional computational cost (0.196%).

Keywords

Convolutional neural network nonlocal attention module image classification computer vision object recognition and classification vision systems

Introduction

By interleaving a series of convolutional layers with nonlinear activation functions and downsample operators, convolutional neural networks (CNNs)¹ are able to produce robust representations that capture hierarchical patterns and attain global theoretical receptive field. Thus, CNNs become the paradigm of choice in many computer vision applications, such as image classification,^2

–5 object detection,⁶ semantic segmentation,⁷ and regression.^8,9 In recent years, attention mechanisms have been a new remedy for feature recalibration by capturing contextual long-range interactions. The attention mechanism started from the introduction of an attention module to draw global dependencies of inputs in neural machine translation,¹⁰ and then, the landmark works about self-attention modules¹¹ sets a new standard in this field.

Self-attention mechanism is to measure the compatibility of the query and key content pairwise relations. In this field, one approach is that nonlocal network (NLNet)¹² presents a self-attention map to model the correspondence from all positions to each query position. Meanwhile, simplified attention models use a query-independent attention map for all query positions. In the recently proposed convolutional block attention module (CBAM),¹³ a single spatial attention map is multiplied back to the channel-attention tuned feature maps for adaptive feature refinement. Our nonlocal spatial attention module (NL-SAM) builds on the benefits of NLNet with effective modeling on global contextual information and CBAM with an efficient attention map generation. NL-SAM incorporates three interdependent operations: context collecting, transformation, and distribution. Context collecting performs feature aggregation to obtain highly compressed global information in spatial dimension. Transformation provides a nonlinear way to recalibrate feature responses. Then, attention map is distributed to each location in the convoluted feature maps.

Our NL-SAM is general and efficient in terms of added parameters, it can be integrated into any CNN architectures individually while maintaining end-to-end trainable or existing channel attention modules with negligible overhead to serve as the complementary attention. We incorporate NL-SAM within ResNet⁴ to validate through experiments on CIFAR-10, CIFAR-100, and ImageNet-1K classification datasets. In particular, as shown in Figure 1, deep CNNs embedded with our NL-SAM introduce very few additional computations while bringing notable performance gain. For example, for ResNet50 with 25.53M parameters and 6.995G floating point operations (FLOPs), the additional parameters and computations of NL-SAM are only 0.03M and 0.007G FLOPs, but the accuracy has been improved by 0.58%. And gather-excite network (GENet)¹⁴ combined with NL-SAM reaches the highest accuracy (75.6%) with only 0.085% additional FLOPs (more details are given in Table 7).

Figure 1.

Comparison of various attention modules embedded in ResNet50 in terms of accuracy and FLOPs on ImageNet, and the size of circles indicates parameter numbers. FLOP: floating point operation.

The rest of this article is organized as follows. The second section analyzes related works about attention mechanisms. The third section describes the architecture of the proposed NL-SAM and analyzes the relationship between NL-SAM and other attention modules. The fourth section verifies the effectiveness of NL-SAM throughout extensive experiments with various baseline models on multiple benchmarks. The fifth section concludes the article.

Related works

Convolution network architecture

The convolution layer has been the dominant visual feature extractor in computer vision. Recent advances of convolution networks focus on capturing long-range dependencies.^3,4 ResNet⁴ solves the degradation problem caused by the increased depth of neural networks, thus, it is able to deliver information between distant positions by simply increasing the network depth. After that, another important direction is to modify the spatial scope for aggregation by enlarging the receptive field, such as atrous/dilated convolution.⁷

Attention models

In vision, the key and query refer to visual elements, such as image pixels. Regular convolution can be deemed as a special instantiation of spatial attention mechanism. Given a query element, predetermined positional offsets of key elements are sampled. In the recently proposed attention mechanisms, there are two major patterns as follows.

Query-independent attention models

These models are irrelevant to the query content, and they only capture salient key contents, which should be focused on for the task. By computing a global attention map and sharing this map for all query positions, these models are very efficient. For example, squeeze-and-excitation network (SENet)¹⁵ and GENet¹⁴ perform rescaling to different channels to recalibrate the channel dependency. However, they miss the spatial axis, which is also important for inferring accurate attention maps. Bottleneck attention module (BAM)¹⁶ and CBAM¹³ introduce spatial attention using convolution in a similar way with a channel attention mechanism. However, in these spatial attention models, feature transformation is performed by convolution, which results in suboptimal effect for global context exploitation in CNNs.

Query-specific attention models

Motivated by their success in natural language processing (NLP) tasks, self-attention mechanisms have also been employed in computer vision applications, such as image recognition,^17

–21 relational reasoning among objects,^22,23 image segmentation,^17,24 scene parsing,²⁵ and video recognition.²⁶ Wang et al.²⁷ have proposed residual attention network, which uses an hourglass module to generate 3D attention maps for intermediate features. As attention maps are computed for each query position, the time and space complexity are both quadratic to the number of positions. To decrease the large computational overheads caused by the heavy map generation process, dual attention network²⁰ appends two parallel 2D and 1D attention modules on top of dilated fully convolutional networks (FCN).²⁸ However, they all have an obvious shortcoming—self-attention models contain a large number of matrix multiplication operations, which increases the computational burden.

Proposed method

In this section, we first overview two major attention modules NLNet¹² and CBAM,¹³ and conclude a concise formula that summarizes these attention models. Then, we introduce the NL-SAM and describe its key components. In addition, we analyze the relationship between our NL-SAM and other attention modules.

Overview of attention modules

A representative query-independent attention model is CBAM,¹³ which is formulated as

Z^{'} = M_{c} (X) \otimes X

Z = M_{s} (Z^{'}) \otimes Z^{'}

where $X \in ℝ^{W \times H \times C}$ is the input feature block (W, H, and C are width, height, and channel number, respectively), and ⊗ denotes element-wise multiplication, Z is the final refined output. CBAM first computes the channel attention

M_{c} (X) = σ (W_{1} (W_{0} \sum_{\forall j} \frac{1}{N_{p}} X_{j}) + W_{1} (W_{0} max_{\forall j} (X_{j})))

In this function, W ₁ and W ₀ are multilayer perceptron weights, ∀j enumerates all positions in the j’th feature map, and N_p is the number of positions in this feature map. $max (\cdot)$ calculates the largest value of the inputs, and σ refers to the sigmoid function. Based on the channel attention output $Z^{'}$ , the spatial attention model is calculated as

M_{s} (Z^{'}) = σ (W_{7 \times 7} ([\sum_{\forall j} \frac{1}{C} {Z^{'}}_{j}, max_{\forall j} ({Z^{'}}_{j})])

where $\forall j$ denotes $1 \leq j \leq C$ and $W_{7 \times 7}$ denotes convolution with kernel size 7.

NLNet,¹² a typical query-specific attention module, can be expressed as

Z_{i} = X_{i} + W \sum_{\forall j} \frac{f (X_{i}, X_{j})}{C (X)} (W_{1 \times 1} X_{j})

where i is the index of query positions and $\forall j$ enumerates all possible positions. $f (X_{i}, X_{j})$ denotes the relationship between positions i and j, and $C (X)$ is a normalization factor.

After simplifying these attention models, a general attention modeling framework can be summarized as the following formula

Z = F (X, δ (\underset{\forall j}{θ} (X_{j})))

$\underset{\forall j}{θ} (X_{j})$ represents a context collecting module that aggregates features at all locations to obtain global context features through weighted calculations and $\forall j$ represents the traversal value of the channel dimension or spatial dimension;

$δ (\cdot)$ denotes the nonlinear feature conversion method;

$F (\cdot, \cdot)$ denotes the attention distribution function that broadcasts the global context features to each position.

Nonlocal spatial attention module

The structure of NL-SAM is shown in Figure 2, which can be roughly divided into three parts: context collecting, transformation, and distribution. According to equation (6), our NL-SAM is formulated as

Z_{i j} = F (X_{i j}, δ_{r} (θ_{k} (\sum_{l = 1}^{C} \frac{1}{C} X_{l})))

Figure 2.

The nonlocal spatial attention module.

In detail, our NL-SAM consists of

channel reduction for information aggregation and local average pooling on feature map for context collecting, as $θ_{k} (\cdot)$ defined in equation (9);

bottleneck transformation capture pixelwise dependencies, as $δ_{r} (\cdot)$ defined in equation (10);

matrix multiplication for feature fusion, as $F (\cdot, \cdot)$ defined in equation (11).

The first part is context collecting for global information aggregation. For simplicity, channel average pooling is used here. For a given feature block $X \in ℝ^{W \times H \times C}$ , the output $X_{c p}$ is computed as

X_{c p} (i, j) = \frac{1}{C} \sum_{m = 1}^{C} X (i, j, m)

On the basis of channel compression, we further utilize average pooling to aggregate the channel-pooled feature maps. Experimental results in Table 2 demonstrate that pooling of adjacent pixels in early stages where NL-SAM inserted provides translation invariance needed for classification. The final aggregated feature map $X_{θ}$ is calculated by

X_{θ} = \frac{1}{k \times k} \sum_{m = i^{'} k}^{(i^{'} + 1) k} X_{c p} (m, n)

where k is a hyperparameter of the pooling operation with different values at different stages, $i^{'}$ and $j^{'}$ are the traversal values of the context collecting map, where $0 \leq i^{'} \leq \frac{W}{k}$ and $0 \leq j^{'} \leq \frac{H}{k}$ .

The second part—feature transformation—has the largest number of parameters. Thus, a bottleneck transformation module is applied, which significantly reduces the number of parameters from $W H \cdot W H$ to $2 \cdot W H \cdot \frac{W H}{r}$ , where r is the bottleneck ratio and $\frac{W H}{r}$ denotes the hidden neuron number of the bottleneck. With default reduction ratio set to $r = 8$ , the number of parameters and computational overheads for transformation module can be reduced to $\frac{1}{4}$ . More results on different values of bottleneck ratio r are given in Table 1.

Table 1.

Accuracy (%) on CIFAR-100 test set by ResNet110 with different reduction ratio r.

Ratio r	Top-1 accuracy (%)	Top-5 accuracy (%)	Params (M)	FLOPs (G)
4	74.37	92.32	11.830	0.552
8	74.62	92.36	6.795	0.53
16	74.41	92.68	4.278	0.521
ResNet110	73.86	92.02	1.737	0.511

FLOP: floating point operation.

The bold values indicate the highest accuracy in the comparison experiment.

Then, the 2D pooled feature map $X_{θ} \in ℝ^{\frac{W}{k} \times \frac{H}{k}}$ is transformed into one dimension ${X^{'}}_{θ} \in ℝ^{\frac{W H}{k^{2}}}$ . Each point in the feature map is treated as an independent node, the trainable weight matrix learns the “importance” or saliency of each point corresponds to the rest in the feature map by backpropagation. The weight matrix is query-sensitive and can learn exclusive relationships since every query has its own attention map. This is important because of the pooling operation in the original feature map and the bottleneck reduction operation in this part. Inspired by SENet, full connection acts as an instantiation here

X_{δ} = sigmoid (fc (relu (fc ({X^{'}}_{θ}))))

where “fc” means full connection. Finally, the 1D results are restored to 2D, namely, the transformation results $X_{δ} \in ℝ^{\frac{W}{k} \times \frac{H}{k}}$ .

The main calculation cost of NL-SAM is introduced by this part. The FLOPs are $\frac{W H}{r k^{2}}$ , which are determined by the width W and height H of the input feature map, and pooling kernel size k and bottleneck reduction ratio r.

In the last part, fusion function distributes the global context to the features of each position. While the transformed feature map has a width/height that is 1/k of the original size, experimental results in Table 3 draw a conclusion that nearest neighbor interpolation is the best choice to restore the size. The finally refined feature map is obtained by

{X^{'}}_{δ} = resize (X_{δ})

Z = X \otimes {X^{'}}_{δ}

where ⊗ denotes elementwise multiplication, and the same sized 2D spatial attention feature map ${X^{'}}_{δ} \in ℝ^{W \times H}$ is to emphasize or suppress responses in different locations.

Relationship to other attention modules

We make a theory comparison for deeper analysis on the relationship to other attention modules. To aggregate global context from all positions, SENet¹⁵ utilizes global average pooling, which assumes that every point in a feature map contributes the same to the whole, there is no spatial attention in this module, so it can serve as the channel aggregation block of our NL-SAM. GENet¹⁴ and BAM¹⁶ utilize depthwise global convolution or dilated convolution to obtain a large receptive field, since convolution finally sums up as a single result, it is only one point on our 2D attention map. In the fusion stage, SENet and GENet utilize rescaling to recalibrate a channel as a whole unit by one single parameter; BAM¹⁶ utilizes broadcast elementwise addition; CBAM¹³ is almost the same as us, the most sophisticated way is elementwise multiplication.

Our NL-SAM follows the NLNet¹² by acquiring a query-sensitive attention map, but we need to specify that there are two prominent features in this block. Firstly, we use dimension reduction to decompose channel attention and spatial attention. Secondly, we use the combination of $k \times k$ pooling and upsampling to reduce the calculation in global information collection and to maintain translation invariance. And, moreover, NL-SAM is a simplified version and not referred to self-attention but salience-attention.

Experiment and analysis

In this section, we evaluate the performance of the proposed NL-SAM on a series of benchmark datasets, including CIFAR-10, CIFAR-100, and ImageNet.²⁹ Our experiments contain three parts. Firstly, ablation experiments are conducted to determine the optimal value of hyperparameters. Given limited computation resources, CIFAR-10 and CIFAR-100 are selected as experimental datasets. Our module surpasses ResNet on CIFAR datasets both in efficiency and accuracy. Secondly, we analyze the reciprocal relation of NL-SAM with other attention modules. Finally, to be more convincing, we compare classification results with ResNet50 variants on ImageNet. And evaluate original ResNets based on the official code of TensorFlow-Slim image classification model library.³⁰ Some parameters are modified due to limited computing resources, especially the batch size. For fair comparison, all the relevant attention models are reimplemented by TensorFlow-Slim.

CIFAR and analysis

Implementation details

We follow the simple data augmentation strategy in the literature⁴ for training: images are randomly flipped horizontally and zero-padded on each side with four pixels before taking a random $32 \times 32$ crop, mean and standard deviation normalization are also applied. ResNet110 and ResNet164 preactivation residual models⁵ are trained on one RTX2080ti, batch size is set to 128, momentum and weight decay are 0.9 and 0.0003 and the same weight initialization as the reference.⁵ The initial learning rate is set to 0.1 with warming up at the first 400 iterations, divided by 10 at 120, 190 of 250 epochs in total.

Reduction ratio

The bottleneck compression rate r is intended to reduce redundancy in parameters. The value of average pooling kernel size k is fixed to 0, different values are set to determine the optimal value of r. As given in Table 1, r = 8 achieves a good balance between accuracy and complexity. So, this value is used in the rest of the experiments.

Effect on different inserting stages and pooling kernel size

The influence of NL-SAM is investigated on different stages (here the term “stage” is the same as defined in the literature⁴) based on different inserting stages and different values of k, the accuracy, computational cost, and parameter numbers, as given in Table 2.

Table 2.

Effect of inserting NL-SAM at different stages with enumerated values of k on ResNet110 with r = 8.

Network	Kernel k	Top-1 accuracy (%)	Top-5 accuracy (%)	Params (M)	FLOPs (G)
ResNet110	0	73.86	92.02	1.737	0.511
ResNet110	0	74.10	92.11	6.475	0.530
ResNet110 + NL-SAM stage1	2	74.04	92.17	2.036	0.512
ResNet110 + NL-SAM stage1	4	74.19	92.20	1.756	0.511
ResNet110 + NL-SAM stage2	0	74.29	92.10	2.036	0.512
ResNet110 + NL-SAM stage2	2	74.64	92.32	1.756	0.511
ResNet110 + NL-SAM stage3	0	74.42	92.33	1.756	0.511
ResNet110 + NL-SAM stage3	2	74.73	92.60	1.738	0.511
ResNet110 + NL-SAM all	0	74.62	92.36	6.795	0.531
ResNet110 + NL-SAM all	2	74.74	92.71	2.057	0.513

NL-SAM: nonlocal spatial attention module; FLOP: floating point operation.

The bold values indicate the highest accuracy in the comparison experiment.

Based on the results listed in Table 2 and the corresponding Figure 3, we have the following observations: (1) deeper stage which has more semantic information benefits more from NL-SAM; (2) insertion at all stages are not mutually redundant, they can further bolster the accuracy; (3) pooling is not only to reduce the calculation cost but also to keep translation invariant in classification, since our module is position-sensitive, weight sharing within the pooling kernel is necessary to counteract this.

Figure 3.

Accuracy of inserting NL-SAM at different stages with enumerate values of k on ResNet110 (r = 8). NL-SAM: nonlocal spatial attention module.

At the same time, the interpolation experiment is conducted in stage 2 with pooling kernel size of 2. The results of Table 3 verify that nearest-neighbor interpolation is the best choice. In fact, due to the previous average pooling of the feature map, the same weights are naturally shared within the $k \times k$ block in ${X^{'}}_{δ}$ (equation (11)).

Table 3.

Accuracy(%) of interpolation on stage 2 of ResNet110 + NL-SAM with k = 2.

ResNet110 + NL-SAM (stage 2, k = 2)	Top-1 accuracy (%)	Top-5 accuracy (%)	Params (M)	FLOPs (G)
Nearest neighbor	74.64	92.32	1.756	0.511
Bilinear	74.43	92.24	1.756	0.511

NL-SAM: nonlocal spatial attention module; FLOP: floating point operation.

The bold values indicate the highest accuracy in the comparison experiment.

Integration with different architectures

We evaluate our NL-SAM on the popular ResNet backbones, plain ResNet110⁵ and ResNet164⁵ with bottleneck. According to the above experiments, the structure and hyperparameters are determined, as given in Table 4. NL-SAM embedded in stage 1 of the network consumes more resources, but the effect is inferior to the later stages. Thus, it is the best choice to insert NL-SAM into stage 2 and stage 3 with k set to 2, and the results are reported in Table 5. It can be found that NL-SAM consistently improves performance across different depths with a small increase in FLOPs.

Table 4.

Architecture of ResNets and parameters of NL-SAM.

Output size	ResNet110	ResNet164	NL-SAM (r, k)
$32 \times 32$ stage 1	$[\begin{matrix} conv,3 \times 3, 16 \\ conv,3 \times 3, 16 \end{matrix}] \times 18$	$[\begin{matrix} conv,1 \times 1, 16 \\ conv,3 \times 3, 16 \\ conv,1 \times 1, 64 \end{matrix}] \times 18$	—
$16 \times 16$ stage 2	$[\begin{matrix} conv,3 \times 3, 32 \\ conv,3 \times 3, 32 \end{matrix}] \times 18$	$[\begin{matrix} conv,1 \times 1, 32 \\ conv,3 \times 3, 32 \\ conv,1 \times 1, 128 \end{matrix}] \times 18$	cr = 8
k = 2 $8 \times 8$ stage 3	$[\begin{matrix} conv,3 \times 3, 64 \\ conv,3 \times 3, 64 \end{matrix}] \times 18$	$[\begin{matrix} conv,1 \times 1, 64 \\ conv,3 \times 3, 64 \\ conv,1 \times 1, 256 \end{matrix}] \times 18$	cr = 8
k = 2 $1 \times 1$	Global average pooling, 100-d, softmax		—

NL-SAM: nonlocal spatial attention module.

Table 5.

Classification accuracy (%) on CIFAR-10/100 test set with NL-SAM embedded in ResNet110 and ResNet164.

Network	Top-1 accuracy (%)	Top-5 accuracy (%)	Params (M)	FLOPs (G)
ResNet110	93.64/73.86	−/92.02	1.732/1.737	0.510/0.511
ResNet110 + NL-SAM	94.62/74.87	−/92.66	1.751/1.757	0.511/0.511
ResNet164	94.54/75.79	−/93.33	1.703/1.726	0.501/0.501
ResNet164 + NL-SAM	95.38/77.85	−/93.94	1.724/1.747	0.503/0.503

NL-SAM: nonlocal spatial attention module; FLOP: floating point operation.

The bold values indicate the highest accuracy in the comparison experiment.

Combination with other attention modules

Comparative experiments are conducted on CIFAR-100 by integrating NL-SAM with SENet and GENet to serve as complementary spatial attention. Average pooling is used in SENet, global depthwise convolution is used in GENet, and bottleneck reduction ratio is set to 4 with sigmoid activation in both SENet and GENet. In Figure 4, the comparison of parameter numbers and accuracy of different models are explicit. It shows that NL-SAM can further improve the accuracy of SENet and GENet by about 0.2% with less than 0.002M parameters added. Furthermore, we replace spatial attention module in CBAM with NL-SAM. In the third column of the figure, we can see that this replacement variant outperforms the original CBAM by 0.32%. It suggests that the designed query-sensitive spatial attention in NL-SAM is superior to query-independent spatial attention in CBAM.

Figure 4.

(a, b) Experiments on CIFAR-100 for NL-SAM combined with SENet, GENet, and CBAM. NL-SAM: nonlocal spatial attention module; SENet: squeeze-and-excitation network; GENet: gather-excite network; CBAM: convolutional block attention module.

ImageNet and analysis

Implementation details

This dataset contains 1.2 million training images and 50k validation images, each is cropped to $224 \times 224$ pixels. ResNet50 is chosen as a base model and trained from random initialization.⁵ SGD is used with momentum as 0.9 to optimize, batch size is set to 128. The initial learning rate is set to 0.1 and is reduced by a factor of 10 for every 30 epochs, all models are trained for 100 epochs.

Classification results on ImageNet

Based on the ablation experiments on CIFAR, the hyperparameters of ResNet50 for ImageNet are defined in Table 6. Due to the large feature map size and insertion stage position study on CIFAR, we add NL-SAM in stages 3, 4, and 5.

Table 6.

Architecture of ResNet50 and the hyperparameters r and k of NL-SAM.

Output size	ResNet50	NL-SAM (r, k)
$112 \times 112$ stage 1	Conv, $7 \times 7$ , stride 2	—
$56 \times 56$ stage 2	Maxpooling, $3 \times 3$ , stride 2	—
	$[\begin{matrix} conv,1 \times 1, 64 \\ conv,3 \times 3, 64 \\ conv,1 \times 1, 256 \end{matrix}] \times 3$	—
$28 \times 28$ stage 3	$[\begin{matrix} conv,1 \times 1, 128 \\ conv,3 \times 3, 128 \\ conv,1 \times 1, 512 \end{matrix}] \times 4$	cr = 8
k = 2 $14 \times 14$ stage 4	$[\begin{matrix} conv,1 \times 1, 256 \\ conv,3 \times 3, 256 \\ conv,1 \times 1, 1024 \end{matrix}] \times 6$	cr = 8
k = 2 $7 \times 7$ stage 5	$[\begin{matrix} conv,1 \times 1, 512 \\ conv,3 \times 3, 512 \\ conv,1 \times 1, 2048 \end{matrix}] \times 3$	cr = 8
k = 0 $1 \times 1$	cGlobal avgpooling 1000-d, softmax	—

NL-SAM: nonlocal spatial attention module.

We further analyze the effectiveness of NL-SAM by the splitting and combination experiments. The results of the final models are given in Table 7. It can be found that our module is able to effectively improve the accuracy even on large datasets. Compared with SENet and GENet, NL-SAM is slightly inferior but still has a notable improvement on ResNet50 baseline. The best result is produced when NL-SAM is integrated with GENet. CBAM seems to be unsteady, which does not reach the baseline result, but when integrated with NL-SAM, the accuracy is improved by 0.22%. These experiments verify the consistent improvement of NL-SAM when integrated with channel attention modules. NL-SAM exhibits the ability to use without bells and whistles.

Table 7.

Classification accuracy (%) on ImageNet validation set with embedded NL-SAM, CBAM in ResNet50, SENet50, and GENet50.

Network	Top-1 accuracy (%)	Top-5 accuracy (%)	Params (M)	FLOPs (G)
ResNet50	74.50	91.9	25.530	6.995
ResNet50 + NL-SAM	75.08	92.11	25.580	7.002
ResNet50 + SE	75.13	92.38	28.061	7.015
ResNet50 + SE + NL-SAM	75.25	92.38	28.106	7.020
ResNet50 + GE	75.15	92.45	32.500	7.027
ResNet50 + GE + NL-SAM	75.55	92.49	32.545	7.033
ResNet50 + CBAM	74.08	92.04	28.063	7.030
ResNet50 + CBAM-ch + NL-SAM	74.72	92.04	28.106	7.026

NL-SAM: nonlocal spatial attention module; FLOP: floating point operation.

The bold values indicate the highest accuracy in the comparison experiment.

Analysis and discussion

We compare the visualization results of NL-SAM inserted to ResNet50 or combined with other modules. The grad-CAM³¹ visualization is calculated for the last convolution layer outputs, ground-truth label is shown on the top of each input image and P denotes the softmax score of each network for the predicted class. As shown in Figure 5, according to P’s score, ResNet50 misclassifies siamese cat and obtains the lowest score. ResNet50 + GE + NL-SAM always gets the highest score, ResNet50 + NL-SAM and ResNet50+ GENet both have their own advantages. We can clearly see that NL-SAM integrated networks have a good incentive or suppression in some regions. For example, they focus more precisely on the head of the merganser and pay less attention to the water, and attention on the whole body of siamese cat shows the benefit of global visual field.

Figure 5.

Grad-CAM visualization results. Grad-CAM: gradient - class activation map.

Conclusion

We propose the NL-SAM, which is a new method to enhance the representation ability of the network. Through modeling on global contextual information, our module learns where to focus or suppress, and refines intermediate features. To verify its effectiveness, extensive experiments were conducted on three benchmark datasets. The results show that NL-SAM improves classification accuracy by 0.98% on CIFAR-10, 1.01% on CIFAR-100 when embedded in ResNet110, and 0.58% on ImageNet when embedded in ResNet50. What is more, the combination of NL-SAM and GENet improves accuracy on ImageNet by 1.05% with only 0.038G FLOPs added. For future works, we plan to devise a flexible way to switch between NL-SAM and convolution in a layer.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was partially supported by the National Natural Science Foundation of China [Grant Nos. 61701351] and the Fundamental Research Funds for the Central Universities of CCNU [No. CCNU16A05028].

ORCID iDs

Bingling Chen

Qinglin Zhang

References

LeCun

Bottou

Bengio

, et al. Gradient-based learning applied to document recognition. Proc IEEE 1998; 86(11): 2278–2324.

Krizhevsky

Sutskever

Hinton

. ImageNet classification with deep convolutional neural networks. In: Advances in neural information processing systems (NIPS), Lake Tahoe, Nevada, USA, 3–6 December 2012, pp. 1097–1105. Red Hook, NY: Curran Associates Inc.

Simonyan

Zisserman

. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:14091556 2014.

Zhang

Ren

, et al. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016, pp. 770–778. Piscataway, NJ: IEEE.

Zhang

Ren

, et al. Identity mappings in deep residual networks. In: European conference on computer vision, Amsterdam, The Netherlands, October 11–14, 2016, pp. 630–645, Berlin: Springer.

Ren

Girshick

, et al. Faster r-CNN: towards real-time object detection with region proposal networks. In: Advances in neural information processing systems, Montreal, Quebec, Canada, 7–10 December 2015, pp. 91–99. Cambridge, MA: MIT Press.

Chen

Papandreou

Kokkinos

, et al. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans Pattern Anal Mach Intell 2017; 40(4): 834–848.

Trujillohernandez

Rodriguezquinonez

Ramirezhernandez

, et al. Accuracy improvement by artificial neural networks in technical vision system. In: Annual conference of IEEE industrial electronics society, Lisbon, Portugal, Portugal, 14–17 October 2019, pp. 5572–5577. IEEE.

Rodriguezquinonez

Sergiyenko

Floresfuentes

, et al. Improve a 3D distance measurement accuracy in stereo vision systems using optimization methods’ approach. Opto-electron Rev 2017; 25(1): 24–32.

10.

Luong

Pham

Manning

. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:150804025 2015.

11.

Vaswani

Shazeer

Parmar

, et al. Attention is all you need. In: Advances in neural information processing systems, Long Beach, CA, USA, 4–9 December 2017, pp. 5998–6008. Red Hook, NY: Curran Associates Inc.

12.

Wang

Girshick

Gupta

, et al. Non-local neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018, pp. 7794–7803. Piscataway, NJ: IEEE.

13.

Woo

Park

Lee

, et al. CBAM: convolutional block attention module. In: Proceedings of the european conference on computer vision (ECCV), Munich, Germany, September 8–14, 2018, pp. 3–19. Cham: Springer.

14.

Shen

Albanie

, et al. Gather-excite: exploiting feature context in convolutional neural networks. In: Advances in neural information processing systems, Montreal, Quebec, Canada, 3–8 December 2018, pp. 9401–9411. Red Hook, NY: Curran Associates Inc.

15.

Shen

Sun

. Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Wellington, New Zealand, June 2018, pp. 7132–7141. Piscataway, NJ: IEEE.

16.

Park

Woo

Lee

, et al. BAM: bottleneck attention module. arXiv preprint arXiv:180706514 2018.

17.

Huang

Wang

Huang

, et al. CCNet: criss-cross attention for semantic segmentation. In: Proceedings of the IEEE international conference on computer vision, 2019, pp. 603–612. Washington, DC: IEEE.

18.

Wang

, et al. Learning region features for object detection. In: Proceedings of the european conference on computer vision (ECCV), January 2018, pp. 381–395. Cham: Springer.

19.

Zhao

Zhang

Liu

, et al. PSANet: point-wise spatial attention network for scene parsing. In: Proceedings of the european conference on computer vision (ECCV), Munich, Germany, September 8–14, 2018, pp. 267–283. Cham: Springer.

20.

Liu

Tian

, et al. Dual attention network for scene segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, Long Beach, CA, USA, 15–20 June 2019, pp. 3146–3154. Piscataway, NJ: IEEE.

21.

Cao

Lin

, et al. GCNet: non-local networks meet squeeze-excitation networks and beyond. In: Proceedings of the IEEE international conference on computer vision workshops, Seoul, Korea, 27–28 October 2019. Piscataway, NJ: IEEE.

22.

Battaglia

Pascanu

Lai

, et al. Interaction networks for learning about objects, relations and physics. In: Advances in neural information processing systems, Barcelona, Spain, 5–10 December 2016, pp. 4502–4510. Red Hook, NY: Curran Associates Inc.

23.

Santoro

Raposo

Barrett

, et al. A simple neural network module for relational reasoning. In: Advances in neural information processing systems, Long Beach, CA, USA, 4–9 December 2017, pp. 4967–4976. Red Hook, NY: Curran Associates Inc.

24.

Chen

Yang

Wang

, et al. Attention to scale: scale-aware semantic image segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition(CVPR), Las Vegas, NV, USA, 27–30 June 2016, pp. 3640–3649. Piscataway, NJ: IEEE.

25.

Yuan

Wang

. OCNet: object context network for scene parsing. arXiv preprint arXiv:180900916 2018.

26.

Yan

, et al. An attention mechanism based convolutional LSTM network for video action recognition. Multimedi Tools Appl 2019; 78(14): 20533–20556.

27.

Wang

Jiang

Qian

, et al. Residual attention network for image classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition(CVPR), Honolulu, HI, USA, 21–26 July 2017, pp. 3156–3164. Piscataway, NJ: IEEE.

28.

Long

Shelhamer

Darrell

. Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition(CVPR), Boston, MA, USA, 7–12 June 2015, pp. 3431–3440. Piscataway, NJ: IEEE.

29.

Deng

Dong

Socher

, et al. ImageNet: a large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition(CVPR), Miami, FL, USA, 20–25 June 2009, pp. 248–255. Piscataway, NJ: IEEE.

30.

N Silberman SG. TensorFlow-slim image classification model library. https://github.com/tensorflow/models/tree/master/research/slim, 2016 (accessed April 2018).

31.

Selvaraju

Cogswell

Das

, et al. Grad-CAM: visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE international conference on computer vision, Venice, Italy, 22–29 October 2017, pp. 618–626. Washington, DC: IEEE.