Abstract
Recently, a DCNet consisting of a dense relation distillation module and a context-aware aggregation module has achieved remarkable performance on the few-shot object detection task. In this article, we aim to improve the DCNet in the following two respects. First, we design an adaptive attention module, which is placed in front of the dense relation distillation module and can be trained together with the remaining parts of the DCNet. After training, the adaptive attention module helps to enhance foreground features and to suppress background features. Second, we introduce a large-margin Softmax into the dense relation distillation module. The large-margin Softmax, which carries an adjustable hyperparameter, normalizes features without reducing the discriminability between different classes. We conduct extensive experiments on the PASCAL visual object classes and the Microsoft common objects in context data sets. The experimental results show that the proposed method works under the few-shot scenario and achieves a mean average precision of 50.8% on the PASCAL visual object classes data set and 13.1% on the Microsoft common objects in context data set, both of which outperform the existing baselines. Moreover, ablation studies and visualizations validate the usefulness of the adaptive attention module and the large-margin Softmax. The proposed method can be applied to recognize rare patterns in fabric images or to detect clothes with new styles in natural scene images.
Introduction
With the rapid development of machine learning technology, databases with numerous annotated samples have promoted the research progress of object detection. Methods based on CNNs (Convolutional Neural Networks) have made remarkable breakthroughs. However, the detection performance of the existing methods relies heavily on the number of annotations. In some special applications, such as lesion detection, 1 rare disease recognition, 2 or rare pattern recognition in cultural relics, 3 there are few available samples and thus a lack of sufficient annotations. Therefore, when facing these challenging circumstances, generic CNN-based methods4–8 may fail to achieve acceptable performance and risk overfitting. To meet these application requirements,1–3 few-shot object detection9–17 has become a research hotspot in recent years.
In the few-shot object detection community, meta-learning-based methods9–11,18,19 aim to extract images’ meta-features that can memorize prediction gradients. Yan et al. 18 extended Faster/Mask region-based convolutional neural network (R-CNN) via a meta-learner over region-of-interest features. Kang et al. 19 extracted meta-features from base classes and generalized them to detect novel classes based on an end-to-end episodic few-shot learning scheme. Data augmentation methods12–15 resort to increasing the amount of data by fetching video frames or transforming images. However, since most data augmentation operations inevitably introduce additional noise, these methods usually achieve suboptimal detection performance. Misra et al. 12 assume the availability of abundant unannotated samples for semi-supervised training, while Ren et al. 15 need to set processing parameters manually. Obviously, these requirements violate the spirit of few-shot learning. Transfer-learning-based methods16,17 transfer meta-features from the source domain (support set) to the target domain (query set), and synthesize representative query features by fusing the knowledge learned from the support domain. Chen et al. 16 designed a low-shot transfer module consisting of transfer knowledge and background suppression regularizations. The transfer module can be seamlessly integrated into generic object detection models such as Faster R-CNN. 4 Hu et al. 17 proposed a novel DCNet, which mainly consists of a DRD (Dense Relation Distillation) module and a CFA (Context-aware Feature Aggregation) module. The former establishes dense matching relationships between support and query features over the spatial dimension, while the latter adaptively fuses multiple features over the scale dimension.
Despite these successes, there is still room for improvement. In the existing works,16,17 raw features directly serve as inputs to the meta-learner or the feature-transfer module. It would be helpful to enhance the foreground features and to suppress the background features in advance. Moreover, the existing works16,17 usually use the Softmax to normalize features. Unfortunately, the traditional Softmax may smooth object features, thereby reducing the discriminability between different classes.
In this article, we aim to improve the recently proposed DCNet 17 in the following two respects. First, we design an adaptive attention module, which combines the support features and then dynamically modulates the query features. The adaptive attention module is configured in front of the DRD module, and can be trained together with the other parts of the DCNet. After training, the adaptive attention module can extract fine-grained features for each query object and suppress interference at the same time. Second, we introduce a Large-Margin (LM) Softmax 20 to prevent feature over-smoothing. We conduct extensive experiments on the PASCAL visual object classes (VOC) 2007/201221,22 and Microsoft (MS) common objects in context (COCO) 23 data sets. Experimental results show that the improved method reaches a higher detection accuracy (mean Average Precision (mAP)) than the original DCNet. 17 Moreover, ablation studies demonstrate that the adaptive attention module and the LM Softmax indeed enhance the query features in terms of in-class representativeness and between-class separability.
The rest of this article is organized as follows. The section “Revisit DCNet” gives a brief introduction of the DCNet. 17 In the section “Our Proposal,” we describe the adaptive attention module and the LM Softmax. In the section “Experiments,” we exhibit our experimental results. The final section, “Conclusion”, concludes this article.
Revisit DCNet
In this section, we briefly introduce the DCNet, 17 which is the baseline of our proposal. The flow diagram of the DCNet 17 is shown in Figure 1. The DCNet 17 is mainly composed of a feature extractor, a DRD module, a CFA (Context-aware Feature Aggregation) module, and an RPN (Region Proposal Network) module. The feature extractor is based on a pretrained CNN, such as ResNet-101, 24 and is used to extract the raw features of the support and query images. The DRD module, which follows the framework of transfer learning, establishes dense matching relationships between the support and query features for each pair of spatial positions. In the DRD module, a Softmax is used to normalize the support features. After training, the support features can be transferred to the query feature in a single forward pass. The CFA module captures the deep features and fuses them in a multi-scale manner. The RPN module with region-of-interest alignment produces a fine-grained feature for the query image, and a detection head on top of the DCNet performs the object detection.

The flow diagram of the DCNet.
As analyzed above, raw features are directly input to the DRD module without refinement. In addition, the traditional Softmax used in the DRD module may smooth the object features, thereby reducing the discriminability between different classes.
Our Proposal
In this article, we design an adaptive attention module, which is configured in front of the DRD module for refining features. Moreover, we introduce an LM Softmax into the DRD module to replace the traditional Softmax. The structure of the improved DCNet is shown in Figure 2. In our proposal, the raw features, namely the outputs of the feature extractor, are first processed by the adaptive attention module. The adaptive attention module combines support features and allocates an attention score to each pixel of the query image. The adaptive attention module can be trained together with the other parts of the DCNet. 17 After training, a higher (lower) attention score is adaptively allocated to each object (background) pixel. As such, the adaptive attention module helps the DRD module to enhance the foreground features and to suppress the background features. In addition, an LM Softmax is incorporated into the DRD module. Compared with the traditional Softmax, the LM Softmax helps the DRD module to produce more representative in-class features and more separable between-class features.

The structure of the improved DCNet. An adaptive attention module and an LM Softmax are added to the original DCNet.
Adaptive Attention Module
The adaptive attention module is designed to process the outputs of the feature extractor. In this article, a pretrained ResNet-101 24 is used as the feature extractor. In the training phase, a query image and N support images together with their masks are input to the feature extractor. Note that the N support images come from N object classes, with one image corresponding to one class. All the extracted features serve as the inputs to the adaptive attention module. The structure diagram of the adaptive attention module is shown in Figure 3.

The structure diagram of the adaptive attention module. 25
The adaptive attention module follows the meta-feature re-weighting strategy. 25 It combines the support features and allocates attention scores to the query features. As shown in Figure 3, the adaptive attention module is mainly composed of a meta-weight generator and a spatial attention generator. The former takes the support features as input and is trained to generate a class-specific meta-weight vector. The meta-weight vector is used to modulate the query feature. Then, the modulated feature is input to the spatial attention generator for score allocation. After training, the spatial attention generator outputs a reliable attention map, in which foreground (background) features are assigned higher (lower) attention scores. We give a detailed description of the adaptive attention module in the following.
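To make the data flow concrete, the following is a minimal PyTorch sketch of such a module. It assumes that the meta-weight generator reduces each masked support feature to a channel-reweighting vector through a small fully connected network, and that the spatial attention generator is a two-layer convolutional head; the layer sizes and activations are illustrative assumptions, not the exact configuration of our implementation.

```python
import torch
import torch.nn as nn

class AdaptiveAttention(nn.Module):
    """Illustrative sketch: meta-weight generator + spatial attention generator.
    Layer sizes are assumptions, not the published configuration."""

    def __init__(self, channels=1024):
        super().__init__()
        # Meta-weight generator: pools a masked support feature into a
        # class-specific channel-reweighting vector.
        self.meta_fc = nn.Sequential(
            nn.Linear(channels, channels), nn.ReLU(inplace=True),
            nn.Linear(channels, channels), nn.Sigmoid())
        # Spatial attention generator: scores every spatial position of the
        # modulated query feature.
        self.spatial_head = nn.Sequential(
            nn.Conv2d(channels, channels // 4, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, 1, 1))

    def forward(self, query_feat, support_feat):
        # query_feat:   (B, C, H, W) raw query feature from the extractor
        # support_feat: (N, C, H, W) masked support features, one per class
        meta_w = self.meta_fc(support_feat.mean(dim=(2, 3)))      # (N, C) meta-weight vectors
        outputs = []
        for w in meta_w:                                          # one support class at a time
            modulated = query_feat * w.view(1, -1, 1, 1)          # channel-wise modulation
            attn = torch.sigmoid(self.spatial_head(modulated))    # (B, 1, H, W) attention map
            outputs.append(query_feat * attn)                     # enhanced query feature
        return outputs  # one refined query feature per support class
```

Because the module only rescales features already produced by the extractor, it can in principle be attached to other detectors as well, which is the plug-and-play property examined later in the experiments.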
The meta-weight generator, denoted by
where
Furthermore, the spatial attention generator processes the modulated query feature
To train the adaptive attention module, we shall further process the attention maps. First, we apply a global average pooling to each attention map, which takes the form
to normalize the confidences. Third, cross-entropy loss function, which measures the consistency between the normalized confidence and the label of
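Under the description above, a hedged sketch of this auxiliary objective is given below; it assumes that the pooled confidences are normalized with a Softmax before the cross-entropy term is computed.

```python
import torch
import torch.nn.functional as F

def attention_loss(attn_maps, target_class):
    # attn_maps: (N, 1, H, W) attention maps, one per support class, for a single query image
    # target_class: index of the support class that matches the query object
    confidences = attn_maps.mean(dim=(1, 2, 3))        # global average pooling -> (N,)
    log_probs = F.log_softmax(confidences, dim=0)      # normalize the confidences (assumed Softmax)
    return F.nll_loss(log_probs.unsqueeze(0),          # cross-entropy between confidences and label
                      torch.tensor([target_class]))
```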
Improved DRD Module with LM Softmax
The DRD module, which is the key part of the DCNet, first transforms the support and query features into a pair of key and value maps in a learnable manner. Then, the traditional Softmax normalizes the key maps and produces a weight matrix. Through training, the weight matrix establishes dense matching relationships between support and query features for each pair of spatial positions. However, the traditional Softmax may smooth object features and reduce the discriminability between different classes. In this article, we replace the traditional Softmax with an advanced one, called the LM Softmax. 20 Compared with the traditional Softmax, the LM Softmax has an adjustable hyperparameter, paving the way for enhancing the feature separability.
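For reference, the original formulation of the LM Softmax in Liu et al. 20 enlarges the angular margin of the target class by an integer hyperparameter m. A sketch of that loss, which is not necessarily the exact form in which the margin enters the DRD weight matrix, is:

```latex
% Large-margin Softmax loss of ref. 20 (sketch); m >= 1 is the margin hyperparameter.
L_i = -\log\frac{\exp\!\big(\lVert W_{y_i}\rVert\,\lVert x_i\rVert\,\psi(\theta_{y_i})\big)}
               {\exp\!\big(\lVert W_{y_i}\rVert\,\lVert x_i\rVert\,\psi(\theta_{y_i})\big)
                + \sum_{j\neq y_i}\exp\!\big(\lVert W_j\rVert\,\lVert x_i\rVert\,\cos\theta_j\big)},
\qquad
\psi(\theta) = (-1)^{l}\cos(m\theta) - 2l,\quad
\theta\in\Big[\tfrac{l\pi}{m},\tfrac{(l+1)\pi}{m}\Big],\ l\in\{0,\dots,m-1\}.
```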
Suppose that the input features to the DRD module are of size

The detailed structure of the DRD module with the LM Softmax.
With these preparations, a weight matrix
where
in which
where l is an integer belonging to
The weight matrix is used to activate the value maps
where
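To illustrate the dense matching described above, the following PyTorch sketch assumes that 1 × 1 convolutions produce the key and value maps, that the weight matrix is obtained by normalizing the pairwise query–support similarities (the step in which the LM Softmax would replace the plain Softmax), and that the distilled support information is concatenated back onto the query value map. The channel sizes are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseRelationAttention(nn.Module):
    """Sketch of the key/value cross-attention inside the DRD module (assumed layout)."""

    def __init__(self, channels=1024, key_dim=128):
        super().__init__()
        self.q_key = nn.Conv2d(channels, key_dim, 1)
        self.q_val = nn.Conv2d(channels, channels // 2, 1)
        self.s_key = nn.Conv2d(channels, key_dim, 1)
        self.s_val = nn.Conv2d(channels, channels // 2, 1)

    def forward(self, query_feat, support_feat):
        # query_feat: (B, C, H, W); support_feat: (B, C, H, W) (one support class shown)
        B, _, H, W = query_feat.shape
        qk = self.q_key(query_feat).flatten(2)            # (B, D, HW_q)
        sk = self.s_key(support_feat).flatten(2)          # (B, D, HW_s)
        sv = self.s_val(support_feat).flatten(2)          # (B, C/2, HW_s)
        # Dense pairwise similarities between every query and support position.
        sim = torch.einsum('bdq,bds->bqs', qk, sk)        # (B, HW_q, HW_s)
        weights = F.softmax(sim, dim=-1)                  # the LM Softmax would replace this step
        attended = torch.einsum('bqs,bcs->bcq', weights, sv).view(B, -1, H, W)
        # Concatenate the distilled support information with the query value map.
        return torch.cat([self.q_val(query_feat), attended], dim=1)
```

The sketch only illustrates the position-to-position matching; how the margin is injected into the normalization and how multiple support classes are aggregated follow the original DRD design. 17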
Other Parts of the DCNet
The RPN module and the CFA module are two necessary parts in our proposal to achieve reliable detection performance.
The RPN module takes the feature map y as input and produces a set of rectangular object proposals, each of which is assigned an objectness score. Then, the input feature encompassed by each rectangular box is mapped to a lower-dimensional version. Finally, a detection head consisting of two sibling fully connected layers takes the lower-dimensional feature as input and outputs a box-regression value and a box-classification value. Note that a traditional Softmax with no modifications is used in the detection head for fair comparisons. More details can be found in Ren et al. 4
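The detection head itself follows the standard two-branch design of Faster R-CNN; 4 a minimal sketch with assumed layer widths is:

```python
import torch.nn as nn

class DetectionHead(nn.Module):
    """Two sibling fully connected layers: classification and box regression."""

    def __init__(self, in_dim=1024, num_classes=21):        # 20 VOC classes + background (assumed)
        super().__init__()
        self.cls_score = nn.Linear(in_dim, num_classes)      # class confidences (plain Softmax at inference)
        self.bbox_pred = nn.Linear(in_dim, num_classes * 4)  # per-class box refinement (dx, dy, dw, dh)

    def forward(self, roi_feat):
        # roi_feat: (num_rois, in_dim) lower-dimensional region-of-interest feature
        return self.cls_score(roi_feat), self.bbox_pred(roi_feat)
```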
In the DCNet, 17 the CFA module is inserted on top of the RPN module to mine the scale-awareness of features. In our proposal, we inherit the configuration in Hu et al. 17 for fair comparisons. As shown in Figure 5, the CFA module comprises three parallel branches. Each branch contains the same operators but operates at a different resolution (4, 8, or 12). The first operator, “Linear,” consists of two consecutive fully connected layers. The second one, “GAP,” is the global average pooling. A larger resolution focuses on the contextual semantic information of smaller objects, while a smaller resolution captures the overall semantic information of larger objects. In this way, the CFA module can extract scale-aware features (see the sketch after Figure 5). More details can be found in Hu et al. 17

The structure of the CFA module. 17
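A rough PyTorch sketch of this three-branch fusion is given below. It assumes that each branch pools the region-of-interest feature to one of the three resolutions, that the GAP followed by the two fully connected layers yields a scalar branch weight, and that the branches are fused by a Softmax-weighted sum after being resized to a common resolution; the precise order of operators in Hu et al. 17 may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextAwareAggregation(nn.Module):
    """Sketch of the three-branch, scale-aware fusion (resolutions 4, 8, 12)."""

    def __init__(self, channels=1024, resolutions=(4, 8, 12), out_size=8):
        super().__init__()
        self.resolutions = resolutions
        self.out_size = out_size
        # "Linear": two consecutive fully connected layers per branch, producing a
        # scalar branch weight from the globally pooled feature ("GAP").
        self.branch_fc = nn.ModuleList([
            nn.Sequential(nn.Linear(channels, channels // 16),
                          nn.ReLU(inplace=True),
                          nn.Linear(channels // 16, 1))
            for _ in resolutions])

    def forward(self, roi_feat):
        # roi_feat: (B, C, h, w) region-of-interest feature
        pooled, scores = [], []
        for res, fc in zip(self.resolutions, self.branch_fc):
            p = F.adaptive_avg_pool2d(roi_feat, res)          # branch-specific resolution
            scores.append(fc(p.mean(dim=(2, 3))))             # GAP -> two FC layers -> scalar
            pooled.append(F.interpolate(p, size=self.out_size,
                                        mode='bilinear', align_corners=False))
        weights = F.softmax(torch.cat(scores, dim=1), dim=1)  # (B, 3) branch attention
        fused = sum(w.view(-1, 1, 1, 1) * p
                    for w, p in zip(weights.unbind(dim=1), pooled))
        return fused
```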
Experiments
Setups and Settings
In this section, we conduct extensive experiments, including performance comparisons, ablation studies, and visualizations. All the experiments are performed on the data sets of PASCAL VOC 2007/201221,22 and MS COCO. 23 The data sets of the PASCAL VOC series21,22 contain 20 classes. They are airplane, bicycle, bird, boat, bottle, car, bus, cat, dog, cow, sofa, horse, person, dining table, motorbike, potted plant, chair, train, TV monitor, and sheep. The MS COCO data set 23 contains 80 classes. During training, several classes (5 for PASCAL VOC and 20 for MS COCO) are randomly selected as the novel classes, each of which forms a query set. The remaining classes serve as the base classes, corresponding to the support sets. In the few-shot scenario, each novel class only contains k image samples. In our experiments, k is set to 1, 2, 3, 5, and 10 for the PASCAL VOC data set, and to 10 and 30 for the MS COCO data set.
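As an illustration of this protocol, the helper below sketches how a k-shot split could be assembled from image-level annotations. The function name, data layout, and the example class names are hypothetical; they do not correspond to a released split file.

```python
import random
from collections import defaultdict

def build_few_shot_split(annotations, novel_classes, k, seed=0):
    """annotations: iterable of (image_id, class_name) pairs.
    Returns the base-class pool plus a k-shot sample per novel class."""
    random.seed(seed)
    per_class = defaultdict(list)
    for image_id, cls in annotations:
        per_class[cls].append(image_id)
    base = {c: ids for c, ids in per_class.items() if c not in novel_classes}
    novel = {c: random.sample(per_class[c], k) for c in novel_classes}
    return base, novel

# Example (class names illustrative): a 10-shot PASCAL VOC split with 5 novel classes.
# base, novel = build_few_shot_split(voc_annotations,
#                                    {"bird", "bus", "cow", "motorbike", "sofa"}, k=10)
```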
We train the proposed model in the following two stages. In the first stage, the base classes are divided into support and query sets. Note that the feature extractor is pretrained on ImageNet 26 in advance, while the remaining modules are trained together in an end-to-end manner. In the second stage, one novel class corresponds to one query set, and only the k annotated samples of each novel class are used to fine-tune the model.
For the model training, the batch size is set to 4 and the initial learning rate to 0.005. We apply an SGD (Stochastic Gradient Descent) optimizer 27 to update the parameters. The learning rate is reduced as training progresses. We train the models on the set of base classes for 20 epochs. Throughout our experiments, we use the Top-1 mAP value as the evaluation metric. The higher the mAP value, the better the detection performance.
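In PyTorch, this optimization setup could look roughly as follows. The momentum, weight decay, and step schedule are assumptions, since the article only states that the learning rate decreases over the epochs; `model` and `loader` stand for the detector and the base-class data loader defined elsewhere.

```python
import torch

# model: the detector; loader: base-class data loader (defined elsewhere)
optimizer = torch.optim.SGD(model.parameters(), lr=0.005,
                            momentum=0.9, weight_decay=1e-4)   # momentum/decay values assumed
# The learning rate is reduced as training progresses; a step schedule is one option.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.1)

for epoch in range(20):                 # 20 epochs on the base classes
    for images, targets in loader:      # batch size 4
        loss = model(images, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```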
Our model is implemented based on the PyTorch framework. All the experiments are performed on a workstation with the Ubuntu 18.04.5 LTS operating system, an Intel Xeon(R) 2150B CPU at 3.00 GHz (×20), dual GeForce RTX 2080Ti GPUs, and 32 GB of memory.
Performance Comparisons
To validate the superiority of our proposal, we conduct performance comparisons on the PASCAL VOC21,22 and MS COCO 23 data sets. In our experiments, the baselines are Meta R-CNN, 18 Reweight-YOLOv3, 19 and the original DCNet. 17 Moreover, the suffix “ATT” (short for adaptive attention mechanism) indicates that Meta R-CNN 18 and Reweight-YOLOv3 19 are additionally equipped with the proposed adaptive attention mechanism. In total, six models (including ours) are prepared for the performance comparisons. To verify the generalization ability and the robustness of the proposed model, we test the detection performance on three novel sets, which consist of different combinations of the novel classes. The detailed results are listed in Table 1.
Detection performance (mAP) on the PASCAL VOC data set.
mAP: mean average precision; R-CNN: region-based convolutional neural network; ATT: adaptive attention mechanism.
The best mAP values are printed in bold, while the second best ones are highlighted by underlines.
We find that, in most cases, our proposal achieves the best mAP values, which demonstrates the effectiveness of the proposed method. Comparing the last two rows of Table 1, we can see that our proposal outperforms the original DCNet. 17 Specifically, when k = 10, our model reaches an mAP value of 50.9%, which is about 6 percentage points higher than that of the original DCNet. 17 These observations validate that the designed adaptive attention module and the LM Softmax help the DCNet boost the detection performance. This is because the adaptive attention module helps to suppress background interference, while the LM Softmax enhances the between-class separability of object features. Comparing Meta R-CNN (Reweight-YOLOv3) with Meta R-CNN/ATT (Reweight-YOLOv3/ATT), we find that equipping the adaptive attention module indeed improves the detection performance. This suggests that the designed adaptive attention module is a plug-and-play tool that can be attached to generic object detection models.4–8 Moreover, we find that the mAP values generally increase with k. This observation accords with common sense, since more image samples provide more object-related information. All these observations demonstrate that our proposal has better robustness and generalization ability.
For the MS COCO data set, we also organize three novel sets, which contain different combinations of the novel classes. The corresponding mAP values are listed in Table 2.
Detection performance (mAP) on the MS COCO data set.
mAP: mean average precision; R-CNN: region-based convolutional neural network; ATT: adaptive attention mechanism.
The best mAP values are printed in bold, while the second best ones are highlighted by underlines.
As we can see, the mAP values are lower than those in Table 1. This is because the MS COCO data set contains more object classes than the PASCAL VOC data set. Despite the complex scenes of the MS COCO data set, our proposal still achieves the best detection performance over all three novel sets. When k = 30, the highest mAP value reaches 13.4%. These numerical results, which are consistent with Table 1, corroborate the superiority of our proposal.
Ablation Studies
In the ablation studies, all the experiments are conducted on the novel set 2 of the PASCAL VOC data set with k = 10. We test the contributions of the adaptive attention module and the LM Softmax to the final detection performance. The results of the ablation studies are given in Table 3.
Results of the ablation studies.
LM: large-margin; mAP: mean average precision.
The best mAP value is printed in bold.
Without the adaptive attention module and the LM Softmax, the original model achieves an mAP value of 45.1%. When equipped with the adaptive attention module or the LM Softmax alone, the detection performance improves by 4.7% or 3.2%, respectively. This indicates that each of the adaptive attention module and the LM Softmax is useful for the original DCNet. 17 Moreover, when we switch on the two modules simultaneously, the mAP value reaches 51.2% (see the last row of Table 3). This result demonstrates that the two modules cooperate with each other to further boost the detection performance. These ablation studies provide solid evidence for the soundness of our proposal.
Visualizations
In this section, we visualize the experimental results that can reveal the working mechanisms of the adaptive attention module and the LM Softmax.
In Figure 6, we exhibit the attention maps generated by the adaptive attention module. Hot and cool colors represent high and low attention scores, respectively. The first row/column of Figure 6 is arranged for query/support images. In Figure 6, the support images are potted plant, sheep, car, and airplane, and the query images are sheep, cow, car, potted plant, and boat. The visualization results in Figure 6 show that, for a query image, different attention maps are generated when different support images are given. When the query and support images come from the same class, for example, sheep, our adaptive attention module allocates higher attention scores to the targeted objects. This phenomenon can be found in the attention maps located at (row = 3, column = 2), (2, 5), (4, 4), and (5, 6). On the contrary, when the query and support images come from different classes, the attention scores are low and scattered over the attention map, as observed in the last column of Figure 6. More interestingly, we note that the attention map at (3, 3) has a continuous salient region, even though the query and support images come from different classes. In this example, although the support image belongs to sheep, our adaptive attention module assigns higher scores to the region of the cow. This is reasonable because both sheep and cattle belong to the family Bovidae. It demonstrates that the adaptive attention module not only concentrates on same-class objects but also generalizes well to similar classes.

Visualizations of the attention maps.
In this article, the LM Softmax is introduced into the original DRD module of the DCNet. 17 The LM Softmax helps the DRD module to extract features with better in-class representativeness and between-class separability. In this experiment, we visualize the support features of 15 base classes of the PASCAL VOC data set by t-distributed stochastic neighbor embedding (t-SNE). 28 The feature distributions are shown in Figure 7, where different support classes are plotted in different colors. As we can see, the feature points belonging to the same class basically gather into a cluster. Remarkably, when using the LM Softmax, the feature points of the same class are gathered more compactly. Moreover, we find that feature points of similar classes have smaller between-class distances; for instance, the feature points of sheep and cow lie close to each other.

Feature distributions of 15 base classes (PASCAL VOC data set). Feature normalization using the Softmax (the left one) and the LM Softmax (the right one).
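Plots of this kind can be reproduced with scikit-learn and matplotlib along the following lines; the perplexity and the other t-SNE settings are assumptions rather than the exact values used for Figure 7.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_feature_distribution(features, labels, title):
    # features: (n_samples, feature_dim) support features; labels: NumPy array of class indices
    embedded = TSNE(n_components=2, perplexity=30, init='pca',
                    random_state=0).fit_transform(features)
    for c in np.unique(labels):
        pts = embedded[labels == c]
        plt.scatter(pts[:, 0], pts[:, 1], s=8, label=str(c))
    plt.title(title)
    plt.legend(fontsize=6)
    plt.show()

# plot_feature_distribution(softmax_feats, base_labels, "Softmax")
# plot_feature_distribution(lm_softmax_feats, base_labels, "LM Softmax")
```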
Moreover, we calculate the standard deviations of the support features in a class-wise manner and exhibit them in Figure 8. We find that, when using the LM Softmax, the standard deviations are effectively reduced. Specifically, the average standard deviation for the LM Softmax is 1.05, which is about 15% lower than that for the traditional Softmax. This demonstrates that the LM Softmax produces more compact in-class features, and thus validates the rationale for introducing it.

The standard deviation for each class. Feature normalization is based on the Softmax and the LM Softmax, respectively.
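The class-wise standard deviation reported in Figure 8 can be computed as sketched below, assuming the per-class deviation is averaged over the feature dimensions.

```python
import numpy as np

def classwise_std(features, labels):
    """Mean per-dimension standard deviation of the support features of each class."""
    return {c: float(features[labels == c].std(axis=0).mean())
            for c in np.unique(labels)}

# stds = classwise_std(support_feats, support_labels)
# print(sum(stds.values()) / len(stds))   # average standard deviation over classes
```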
Conclusion
In this article, we aim to improve the DCNet 17 by designing an adaptive attention module and introducing an LM Softmax for feature normalization. The adaptive attention module is configured in front of the DRD module, and can be trained together with the other parts of the DCNet. After training, the adaptive attention module can enhance the object features and suppress background interference at the same time. In addition, we introduce an LM Softmax 20 into the DRD module of the DCNet, 17 which normalizes features without reducing the discriminability between different classes. Experimental results on the PASCAL VOC 2007/201221,22 and MS COCO 23 data sets show that the improved method reaches a higher detection accuracy (mAP) than the original DCNet. 17 The ablation studies and visualizations also demonstrate that the adaptive attention module and the LM Softmax indeed enhance the query features in terms of in-class representativeness and between-class separability. The application potential includes recognizing rare patterns in cultural relics or adapting to new styles of clothes in a human parsing system.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was partially supported by the National Key Research and Development Program of China (2019YFC1521300), the National Natural Science Foundation of China (62001099), and the Fundamental Research Funds for the Central Universities of China (17D110408).
