Abstract
Recent studies have shown that pillar-based detectors achieve a favorable balance between accuracy and speed, but they perform poorly on small objects such as pedestrians and cyclists. To address this problem, we propose a highly efficient pillar-based model named MASNet, which mainly consists of a pillar mix attention (PMA) module, an attention-pooling operation, and a focal sparse network (FSN) module. The PMA module encodes pillar features by fusing a point-wise attention module and a channel-wise attention module. The attention-pooling operation aggregates the attention-encoded pillar features more comprehensively to obtain the most expressive pillar features. In addition, the FSN module exploits the intrinsic sparsity of the data by introducing focal sparse convolution, which enriches the learned pillar features in the foreground without adding redundant pillars in other regions. On the KITTI three-dimensional (3D) Object Detection Benchmark, MASNet achieves 3D average precisions of 77.81%, 60.30%, and 53.92% at the easy, moderate, and hard difficulty levels, outperforming other pillar-based methods for cyclist detection. Additionally, our method is only 0.52% below the top-ranked method, the pillar feature network (PIFENet), on the KITTI Bird's Eye View pedestrian leaderboard, while our inference speed reaches 41 frames per second, 57.69% faster than PIFENet.
Introduction
Three-dimensional (3D) object detection involves recognizing and localizing objects in a 3D scene, and its applications extend across various domains, including robot vision and autonomous driving (Guo et al., 2020). Point cloud data are advantageous for localization because they offer detailed geometric and positional information about object surfaces. Consequently, light detection and ranging (LiDAR) has become the prevalent sensor in autonomous driving. In contrast to the two-dimensional (2D) images captured by cameras, point cloud data are sparsely and irregularly distributed in continuous space, which poses challenges for accurate 3D object detection.
To handle these challenges and utilize the advantages of point cloud data, many point cloud-based methods have been proposed recently, which can be grouped into two categories: point-based methods (Chen et al., 2019; Huang et al., 2020; Qi et al., 2018a; Shi et al., 2019; Wang & Jia, 2019; Yang et al., 2020; Zhang et al., 2022) and voxel-based methods (Chen et al., 2020, 2023; Lang et al., 2019; Noh et al., 2021; Shi et al., 2020; Yan et al., 2018; Ye et al., 2020; Zhou et al., 2023; Zhou & Tuzel, 2018). Depending on how the point cloud is partitioned, voxel-based methods can be further divided into 3D voxel-based and 2D pillar-based approaches.
Voxel-based methods typically adopt the classical "voxel feature encoder (VFE) layer-backbone-head" detection architecture, dividing the input point cloud into regular 3D voxel grids or 2D pillars. In pillar-based approaches, pillar features are extracted by the VFE layer after voxelization, the backbone further encodes the pillar features and performs multi-scale feature fusion, and the results are then sent to the detection head. The pillar size also affects the detection performance of the network: a smaller pillar size minimizes the loss of fine-grained details and improves detection accuracy, but it generates higher-resolution feature maps and thus increases the computational cost, while a larger pillar size leads to faster inference but worse detection performance, especially for small objects. Zhang and Zhang (2022) propose a mask attention interaction and scale enhancement (SE) network to address the issue of detection accuracy for small objects: a content-aware reassembly of features block generates an extra pyramid bottom level to boost small-ship performance, a feature balance operation improves scale feature description, and a global context block refines features. The same authors also propose a polarization fusion (PF) network with geometric feature embedding (PFGFE-Net) to address the insufficient utilization of polarization and the abandonment of traditional features; PFGFE-Net (Zhang & Zhang, 2022) achieves PF at the input-data, feature, and decision levels, and geometric feature embedding (GFE) incorporates expert experience. In addition, Xu et al. (2022a) propose a lightweight shipborne synthetic aperture radar (SAR) ship detector called Lite-YOLOv5, based on the You Only Look Once 5th edition (YOLOv5) algorithm, to address the huge computational complexity of complex models. These works have played an important role in the field of ship classification and detection.
Larger voxels reduce the pillar features of small objects that the network can learn. Small objects such as pedestrians are often located in complex backgrounds, which leads to false and missed detections. To address this limitation, TANet (Liu et al., 2020) proposes a triple attention module to enhance pillar feature extraction, but this approach only uses max pooling to select the most representative pillar features, suppressing a lot of effective information that is important for detection. To address the loss of fine-grained information during point cloud encoding, we present a more effective attention mechanism, the pillar mix attention (PMA) module. In addition, we improve the pooling method of the VFE layer so that it captures the contextual information of all points and channels in a pillar. The PMA module is divided into two parts: channel-wise attention (CA) and point-wise attention (PA). Specifically, CA uses the non-local operation (Wang et al., 2018) to increase the receptive field and suppress unimportant information in each channel, while PA aggregates point information within pillars through various pooling methods. Furthermore, we use attention pooling to extract the strongest pillar feature of each pillar.
To further strengthen pillar feature encoding, we adopt a 2D sparse network and introduce focal sparse convolution (Chen et al., 2022). Focal sparse convolution exploits the sparsity of the point cloud: it allows features to be expanded at appropriate locations, filling in pillar features without destroying the sparsity of the data, which is very effective for small object detection. When processing the point cloud with sparse convolution, an attention-style mechanism predicts the importance of the data at each pillar, and features predicted to be important are expanded into deformable output shapes, thereby separating foreground and background in the scene. Importance is learned by an additional convolutional layer and depends dynamically on the input features. This module adds valuable information, compensating to some extent for the scarcity of data on small objects. Combining the aforementioned modules, we propose a novel 3D object detector: MASNet. The key contributions are as follows:
1. In point clouds, LiDAR captures few valid points on smaller targets such as pedestrians, and those points may be irregular and incomplete, making such targets difficult to detect and identify accurately. We propose a stackable PMA module, which enhances pillar feature extraction by integrating PA and CA. PMA modules can be stacked to obtain multi-level pillar feature attention, effectively improving small object detection.
2. Voxel-based methods have typically used max pooling to reduce feature dimensions and aggregate point features. Max pooling selects the maximum value of each channel as the feature representation, but these maxima cannot fully represent all features within a pillar and suppress a large amount of effective information. To improve feature utilization, we present an attention-style pooling operation (attention pooling), which encodes geometric information more effectively and aggregates features according to their importance.
3. We propose a new sparse multi-scale-fusion module, the FSN module, that powerfully encodes pillar features via focal sparse convolution and facilitates multi-level fusion of sparse and dense features. By generating dynamic outputs, the FSN module enhances information exchange between features and fuses sparse and dense features without incurring excessive computational costs, which benefits further feature fusion learning.
4. We evaluate the performance of our model on the KITTI (Geiger et al., 2012) benchmark.
Related Work
Current 3D object detection frameworks can be broadly categorized into point-based and voxel-based methods.
Point-Based Methods
PointNet (Qi et al., 2017) consists of two main modules: a feature extraction network and a global feature pooling network. The feature extraction network is a multi-layer perceptron (MLP) that extracts local features at each point, and the global feature pooling network aggregates the features of all points into a global feature vector through a max pooling operation. Unlike PointNet, PointNet++ (Qi et al., 2017) employs a hierarchical architecture that gradually focuses on local features layer by layer; this design allows PointNet++ to better capture local structures and relationships in point cloud data. Influenced by Fast R-CNN (Girshick, 2015), Shi et al. (2019) proposed a two-stage network named PointRCNN, which uses PointNet++ to distinguish foreground from background points and generate a small number of high-quality initial boxes in the first stage, and then refines and fine-tunes these initial boxes in the second stage. To further improve inference speed, Yang et al. (2020) proposed a pioneering one-stage network for 3D object detection, which greatly improves detection efficiency and the sampling strategy of PointNet. Since previous point cloud down-sampling methods are task-independent, Zhang et al. (2022) proposed the instance-aware single-stage detector (IA-SSD), which adopts a task-oriented down-sampling method. IA-SSD greatly increases the proportion of foreground points retained during down-sampling and generates more reliable bounding boxes (BBoxes). Although point-based methods require no preprocessing such as voxelization, they still suffer from expensive computational and memory costs that cannot meet real-time requirements.
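For readers unfamiliar with this aggregation scheme, the following is a minimal PyTorch sketch of PointNet's shared-MLP-plus-max-pooling design; the channel sizes are illustrative, not the original architecture.

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    """Minimal PointNet-style encoder: a shared MLP applied per point,
    followed by max pooling to obtain one global feature vector."""
    def __init__(self, in_ch=3, feat_ch=64):
        super().__init__()
        # 1x1 convolutions act as an MLP shared across all points.
        self.mlp = nn.Sequential(
            nn.Conv1d(in_ch, feat_ch, 1), nn.BatchNorm1d(feat_ch), nn.ReLU(),
            nn.Conv1d(feat_ch, feat_ch, 1), nn.BatchNorm1d(feat_ch), nn.ReLU(),
        )

    def forward(self, pts):            # pts: (B, C_in, N) points
        feat = self.mlp(pts)           # (B, C_feat, N) per-point features
        return feat.max(dim=2).values  # (B, C_feat) global feature

x = torch.randn(2, 3, 1024)            # 2 clouds, 1024 points of (x, y, z)
print(TinyPointNet()(x).shape)         # torch.Size([2, 64])
```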
Voxel-Based Methods
Unlike point-based methods, voxel-based methods typically transform unstructured point clouds into structured 3D voxels or 2D pillars.
Pillar-based methods are a 3D object detection technique that converts point cloud data into a columnar (pillar) representation and processes it with 2D convolutional networks to improve computational efficiency and reduce resource consumption.
The core idea of pillar-based methods is to map point cloud data in 3D space to a 2D columnar structure, where each column represents a vertical voxel and contains all the points that fall within it. The advantage of this method is that it can utilize mature 2D CNNs for feature extraction while avoiding the high computational complexity of 3D convolution. In addition, since autonomous vehicles are typically equipped with sensors such as LiDAR, the point cloud data generated by these sensors are naturally suited to a columnar representation.
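As an illustration of this mapping, here is a minimal sketch of assigning points to bird's-eye-view pillars; the detection range and pillar size mirror common KITTI settings and are assumptions, not necessarily this paper's configuration.

```python
import torch

def pillarize(points, x_range=(0., 69.12), y_range=(-39.68, 39.68), pillar=0.16):
    """Assign each point (x, y, z, r) to a 2D BEV pillar index.
    Points sharing a pillar id form one pillar (one vertical column)."""
    mask = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    pts = points[mask]
    ix = ((pts[:, 0] - x_range[0]) / pillar).long()   # column index
    iy = ((pts[:, 1] - y_range[0]) / pillar).long()   # row index
    n_x = int((x_range[1] - x_range[0]) / pillar)     # grid width
    pillar_id = iy * n_x + ix                         # unique id per pillar
    return pts, pillar_id

pts, pid = pillarize(torch.randn(10000, 4) * 20)
print(pts.shape, pid.unique().numel(), "non-empty pillars")
```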
Recent advancements in pillar-based methods have focused on enhancing the feature learning capabilities of the network to improve detection accuracy. One notable development is the introduction of the PillarNet architecture, which employs a powerful encoder network for effective pillar feature learning, a neck network for spatial-semantic feature fusion, and a detection head for box regression, classification, and intersection-over-union (IoU) prediction. PillarNet is designed to be flexible with respect to pillar size and compatible with classical 2D CNN backbones such as VGGNet and ResNet. It also incorporates an orientation-decoupled IoU regression loss and an IoU-aware prediction branch to further refine the detection performance.
Zhou and Tuzel (2018) proposed VoxelNet, which extracts voxel features with a 3D CNN and projects them into a 2D pseudo-image on the bird's eye view (BEV) plane; a region proposal network is then built to further filter and classify the candidate BBoxes. SECOND (Yan et al., 2018) introduced submanifold sparse convolution (Graham et al., 2018) to replace ordinary 3D convolution, efficiently improving inference speed. Based on SECOND, Lang et al. (2019) first convert the point cloud into 2D pillars, imposing no voxel partition along the height axis.
Attention Mechanism
In recent years, the attention mechanism has proven effective in various computer vision applications such as image segmentation, classification, and object detection. It has been widely adopted due to its light weight and small computational cost, especially for the detection of small objects: it makes the network pay more attention to important information and ignore the irrelevant background in the scene.
In a group-wise feature enhancement-and-fusion network, Xu et al. (2022b) proposed hybrid-pooling channel attention for channel modeling to balance the contribution of each polarization feature. Its main contributions include rich dual-polarization features, an enriched feature library, suppression of clutter interference, and ease of feature extraction; GFE enhances each polarized semantic feature to highlight each polarized feature region, and group feature fusion combines multi-scale polarization features to achieve group-level information exchange among polarization features.
Shadow-background-noise 3D spatial decomposition (SBN-3D-SD; Xu et al., 2022c) identifies the sparse, low-rank, and Gaussian properties of shadows, backgrounds, and noise in Video-SAR data and exploits these properties to enhance shadows. The alternating direction method of multipliers is used to solve SBN-3D-SD, separating shadows from backgrounds and noise. This was the first such attempt in the Video-SAR moving-target detection-and-tracking community.
SENet (Hu et al., 2018) uses a global average pooling operation and a fully connected (FC) layer to compute CA in its squeeze-and-excitation block for image-related tasks. However, SENet ignores spatial attention, which plays a crucial role in determining "where." To address this problem, the convolutional block attention module of Woo et al. (2018) sequentially generates attention maps along the channel and spatial dimensions, performing better than methods that use only one kind of attention. In recent years, some approaches have applied the attention mechanism to point cloud-based networks. Qiu et al. (2021) verified experimentally that introducing a 2D or 3D attention module into current 3D object detection frameworks can improve detection performance. TANet (Liu et al., 2020) combines CA, PA, and voxel-wise attention, which enhances the critical information of the object, suppresses unstable points, and improves the robustness of the network. To retain important information while generalizing to various pedestrian representations, a network needs more comprehensive attention mechanisms when learning pillar features. Le et al. (2022) proposed the pillar feature network (PIFENet) and designed a pillar-aware attention module to adaptively focus on important pillar features.
Method
In this section, we detail the structure of MASNet, an advanced one-stage end-to-end model for 3D object detection. As illustrated in Figure 1, it consists of three main components: (a) the PMA module, (b) the attention-pooling operation, and (c) the FSN module.

The structure of MASNet. First, the input point clouds are transformed into 2D pillars. Pillar features are extracted via a stackable PMA module, and an attention-pooling operation then aggregates the information in each pillar to form a pseudo-image. Finally, the FSN module fuses the multi-scale feature maps, and the result serves as input for generating 3D bounding boxes.
Note. 2D = two-dimensional; PMA = pillar mix attention; FSN = focal sparse network; 3D = three-dimensional.
We take point cloud data as input that are defined as
PMA Module
In a point cloud, pedestrians are small, so the LiDAR scan yields few valid points on them. Furthermore, the shape of a pedestrian may be irregular in 3D space, leading to incomplete representations in the point cloud that are difficult to detect and recognize accurately. To improve the network's detection accuracy on small objects, we introduce the PMA module, which increases the expressive capability of pillar features for small objects by adopting a more comprehensive attention mechanism. The PMA module consists of two parts, PA and CA, which aggregate information at the feature-map level in parallel; the structure of the PMA module is shown in Figure 2. Unlike other attention mechanisms, we run both branches simultaneously, letting each extract the features it is most sensitive to, and then aggregate their outputs. CA captures long-range dependencies by aggregating remote information to enhance global capability, while PA obtains independent point features from the data through a series of pooling operations.

The architecture of pillar mix attention (PMA) module.
Every channel map of high-level features can be viewed as a class-specific response, where different semantic responses are interrelated. To capture long-range dependencies and aggregate remote contextual information adaptively, we design a CA module as shown in the channel-wise attention (CA) section of Figure 2. We introduce non-local operations to establish associations between channel features to explore global contextual information and thus obtain a sufficient receptive field. Given a pillar set
Consider a pillar set
The final output
PMA modules are stacked to exploit multiple levels of feature attention: the first PMA module acts directly on the original pillar features, and its output serves as input to the second PMA module. Finally, we obtain a high-dimensional representation of the pillar features.
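The following PyTorch sketch illustrates one plausible reading of the PMA design described above, with a non-local-style CA branch, a pooling-based PA branch, and parallel fusion. The projection layers, pooling choices, and fusion operator are our assumptions; the paper's exact formulation is not reproduced here.

```python
import torch
import torch.nn as nn

class ChannelWiseAttention(nn.Module):
    """Non-local-style attention over the channel dimension of per-pillar
    point features (N points x C channels); a sketch of the CA branch."""
    def __init__(self, c):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(c, c, bias=False) for _ in range(3))

    def forward(self, f):                       # f: (P, N, C) pillars
        q, k, v = self.q(f), self.k(f), self.v(f)
        # Affinity between channels: (P, C, C)
        attn = torch.softmax(q.transpose(1, 2) @ k / f.shape[1] ** 0.5, dim=-1)
        return f + (v @ attn.transpose(1, 2))   # residual channel re-mixing

class PointWiseAttention(nn.Module):
    """PA branch: score each point from pooled summaries (max + mean)."""
    def __init__(self, c):
        super().__init__()
        self.score = nn.Linear(2 * c, 1)

    def forward(self, f):                       # f: (P, N, C)
        pooled = torch.cat([f.max(1, keepdim=True).values.expand_as(f),
                            f.mean(1, keepdim=True).expand_as(f)], dim=-1)
        w = torch.sigmoid(self.score(pooled))   # (P, N, 1) per-point weight
        return f * w

class PMABlock(nn.Module):
    """Run CA and PA in parallel and fuse; blocks can be stacked."""
    def __init__(self, c):
        super().__init__()
        self.ca, self.pa = ChannelWiseAttention(c), PointWiseAttention(c)
        self.fuse = nn.Linear(2 * c, c)

    def forward(self, f):
        return self.fuse(torch.cat([self.ca(f), self.pa(f)], dim=-1))

f = torch.randn(128, 32, 64)        # 128 pillars, 32 points, 64 channels
print(PMABlock(64)(f).shape)        # torch.Size([128, 32, 64])
```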
Attention Pooling
The attention mechanism is an important concept in deep learning that mimics human visual attention, allowing models to focus on the most important parts of the input data. It has been widely applied in fields such as natural language processing, computer vision, and audio processing, and has achieved significant performance improvements across tasks.
Research on attention mechanisms is very active, with new models and variants constantly being proposed. For example, the self-attention mechanism in the transformer is an important breakthrough that improves computational efficiency through parallel processing and achieves excellent performance in tasks such as machine translation and text classification. In addition, multi-head attention allows the model to learn information in parallel in different representation subspaces, further enhancing its expressive power.
Beyond the standard attention mechanism, there are many variants and extensions, such as hard attention and soft attention, which introduce different probability distributions to guide the focus of information. Attention-based models have also been applied in unsupervised and semi-supervised learning, where they can exploit unlabeled data to learn useful feature representations.
Previous voxel-based methods usually adopt a simplified version of PointNet to extract pillar features. PointNet uses a max pooling operation to reduce the dimension of the feature maps and aggregate point features in the pillar, selecting the maximum value of each channel as the encoded representation. However, these maxima cannot capture the rich features of all points within a pillar and suppress a large amount of effective information. To comprehensively utilize all the information in the pillars and extract the most representative pillar features, we propose an attention-style pooling operation, attention pooling, as shown in Figure 3.

The framework of attention pooling.
For each non-empty pillar feature
The result of attention pooling serves as input to the backbone. Experiments show that this simple operation helps with the detection and localization of pedestrians.
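A minimal sketch of such an attention-style pooling follows, assuming RandLA-Net-style attentive pooling in which per-point, per-channel scores are softmax-normalized over the points of a pillar; the exact scoring network of our module may differ.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Replace max pooling over the points in a pillar with a learned,
    softmax-normalized weighted sum, so every point contributes according
    to its predicted importance instead of only the per-channel maximum."""
    def __init__(self, c):
        super().__init__()
        self.score = nn.Linear(c, c, bias=False)  # per-channel scores

    def forward(self, f):                         # f: (P, N, C) point features
        w = torch.softmax(self.score(f), dim=1)   # normalize over the N points
        return (w * f).sum(dim=1)                 # (P, C): one vector per pillar

f = torch.randn(128, 32, 64)
print(AttentionPooling(64)(f).shape)              # torch.Size([128, 64])
```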
FSN Module
Similar to PillarNet (Shi et al., 2022), we use a 2D sparse encoder network for effective pillar feature encoding. Within this network, focal sparse convolution is introduced to help extract deep sparse pillar features from the projected sparse 2D pillar features, enhancing the network's 3D sensing capability. We propose the FSN module, shown in Figure 4.

The framework of the focal sparse network (FSN) module. The green box represents the regular sparse convolution layer, the purple boxes represent the submanifold sparse convolution layer, and the gold box represents the focal sparse convolution layer.
We construct a 2D sparse encoder network by using the focal sparse convolution (Chen et al., 2022). The entire computational process can be divided into three steps: (a) calculate the importance map, (b) obtain the location of the important inputs, and (c) generate the dynamic output. The framework of focal sparse convolution is shown in Figure 5.

Framework of focal sparse convolution. An additional branch predicts an importance map for each input sparse feature, which determines the output feature positions.
To obtain the importance map, we first process the pseudo-image with submanifold sparse convolution; the number of output channels is determined by
In contrast to regular sparse convolution, focal sparse convolution preserves data sparsity more effectively and significantly reduces the computational load. Compared with submanifold sparse convolution, it generates a dynamic output shape based on the input positions, facilitating information exchange between disconnected features. Additionally, the focal loss function supervises the prediction of the importance maps during training, and multiplying the importance maps with the output features to encode sparse features allows the output shape to inflate at appropriate locations, enriching the learnable features of the object. This approach is particularly advantageous for small object detection.
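The gating idea can be sketched on dense tensors as follows. A real implementation operates on sparse tensors (e.g., with the spconv library) and dilates output positions only where importance is high, so this is a conceptual illustration rather than the actual sparse kernel.

```python
import torch
import torch.nn as nn

class FocalGate(nn.Module):
    """Dense-tensor sketch of focal sparse convolution's importance gating:
    (a) predict an importance map with an extra conv branch, (b) threshold it
    to choose positions allowed to expand, (c) gate the convolved features
    with the predicted importance."""
    def __init__(self, c, tau=0.5):
        super().__init__()
        self.feat = nn.Conv2d(c, c, 3, padding=1)   # main feature conv
        self.imp = nn.Conv2d(c, 1, 3, padding=1)    # importance branch
        self.tau = tau

    def forward(self, x):                           # x: (B, C, H, W) pseudo-image
        imp = torch.sigmoid(self.imp(x))            # (B, 1, H, W) in [0, 1]
        keep = (imp > self.tau).float()             # positions allowed to expand
        # During training, `imp` would be supervised with focal loss
        # against a foreground mask, as described in the text.
        return self.feat(x) * imp * keep, imp

x = torch.randn(1, 64, 128, 128)
y, imp = FocalGate(64)(x)
print(y.shape, imp.shape)
```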
The entire backbone comprises four stages, Stages 1 to 3 can be characterized by a series of blocks (
Loss Function
Similar to SECOND, focal loss (Lin et al., 2017) and smooth L1 loss are used for BBox classification and regression, respectively (Le et al., 2022). We use the loss function of PointPillars. The PointPillars algorithm divides the point cloud into individual pillars from a top-down perspective, performing no discretization along the height axis and retaining only the 2D grid over the (x, y) plane.
The residual of localization regression between truth and anchor is defined as follows:
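Since the PointPillars loss is used, the residuals follow the standard SECOND/PointPillars box encoding:

$$
\Delta x=\frac{x^{gt}-x^{a}}{d^{a}},\quad \Delta y=\frac{y^{gt}-y^{a}}{d^{a}},\quad \Delta z=\frac{z^{gt}-z^{a}}{h^{a}},
$$
$$
\Delta w=\log\frac{w^{gt}}{w^{a}},\quad \Delta l=\log\frac{l^{gt}}{l^{a}},\quad \Delta h=\log\frac{h^{gt}}{h^{a}},\quad \Delta\theta=\sin\!\left(\theta^{gt}-\theta^{a}\right),
$$

where $d^{a}=\sqrt{(w^{a})^{2}+(l^{a})^{2}}$ is the diagonal of the anchor box.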
Smooth L1 loss function was used for training:
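For a residual $e$, the smooth L1 loss takes the usual piecewise form (shown with the transition point at 1; implementations often rescale it), and the localization loss sums this over all residuals:

$$
\mathrm{SmoothL1}(e)=\begin{cases}0.5\,e^{2}, & |e|<1\\ |e|-0.5, & \text{otherwise,}\end{cases}
\qquad
\mathcal{L}_{loc}=\sum_{b\in(x,y,z,w,l,h,\theta)}\mathrm{SmoothL1}(\Delta b).
$$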
A softmax loss is introduced to learn the direction of objects and avoid direction discrimination errors. This loss is recorded as $\mathcal{L}_{dir}$.
The overall loss function is as follows:
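Under the PointPillars formulation, the overall objective combines the three terms, normalized by the number of positive anchors $N_{pos}$; the weights shown are the PointPillars defaults:

$$
\mathcal{L}=\frac{1}{N_{pos}}\left(\beta_{loc}\,\mathcal{L}_{loc}+\beta_{cls}\,\mathcal{L}_{cls}+\beta_{dir}\,\mathcal{L}_{dir}\right),\qquad
\beta_{loc}=2,\ \beta_{cls}=1,\ \beta_{dir}=0.2.
$$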
Setup and Implementation
All the experiments are conducted on the KITTI object detection benchmark dataset, which consists of 7,481 training samples and 7,518 test samples. It contains three classes: Car, Cyclist, and Pedestrian. Each class has three difficulty levels, easy, moderate, and hard, depending on the size of the object and its level of occlusion and truncation. Because the ground truth of the official test set is unavailable, we further segment the training set into a training subset of 6,358 samples and a validation subset of 1,123 samples, a ratio of approximately 85:15.
Implementation Details
To enrich the training data, we adopt a ground-truth sampling augmentation strategy: the point clouds contained in the ground truth 3D BBoxes of other frames, together with their BBoxes, are placed into empty positions of the frame being trained. We then use minimum-point filtering to select the points in each BBox. The points in the ground truth 3D BBoxes are transformed along the
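For reference, the global augmentations commonly paired with ground-truth sampling on KITTI can be sketched as follows; the ranges shown are typical defaults and are assumptions, not the paper's exact settings.

```python
import math
import random
import torch

def global_augment(points, boxes):
    """Common global LiDAR augmentations, shown for illustration.
    points: (N, 4) as (x, y, z, r); boxes: (M, 7) as (x, y, z, w, l, h, yaw)."""
    theta = random.uniform(-math.pi / 4, math.pi / 4)       # random yaw rotation
    c, s = math.cos(theta), math.sin(theta)
    rot = torch.tensor([[c, -s], [s, c]], dtype=points.dtype)
    points[:, :2] = points[:, :2] @ rot.T
    boxes[:, :2] = boxes[:, :2] @ rot.T
    boxes[:, 6] += theta
    if random.random() < 0.5:                               # random flip over x-axis
        points[:, 1] *= -1
        boxes[:, 1] *= -1
        boxes[:, 6] *= -1
    scale = random.uniform(0.95, 1.05)                      # global scaling
    points[:, :3] *= scale
    boxes[:, :6] *= scale
    return points, boxes

pts, bxs = global_augment(torch.randn(1000, 4), torch.randn(5, 7))
print(pts.shape, bxs.shape)
```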
We use the PyTorch deep learning framework to build our model following the structure described in the previous section. We set the batch size to 2 and train the model for 80 epochs. We use the adaptive moment estimation (Adam) optimizer for end-to-end training with an initial learning rate of 0.003, and we attenuate the learning rate with a cosine annealing schedule with a weight of 0.01.
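These stated settings map onto standard PyTorch utilities roughly as follows; note that reading the "weight of 0.01" as Adam weight decay is our assumption.

```python
import torch

model = torch.nn.Linear(8, 8)   # stand-in for MASNet
opt = torch.optim.Adam(model.parameters(), lr=0.003, weight_decay=0.01)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=80)

for epoch in range(80):
    # ... one training epoch over batches of size 2 ...
    sched.step()                # cosine-anneal the learning rate each epoch
```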
In all our experiments, the number of points within each pillar is set to
The performance indicators of object detection are key to evaluating a model's effectiveness on specific tasks. In autonomous driving, 3D object detection is particularly important because it directly determines the vehicle's understanding of, and responsiveness to, its surroundings. The main performance indicators are:
Precision: the proportion of samples predicted as positive that are truly positive, computed as TP/(TP + FP), where TP is the number of true positives and FP the number of false positives.
Recall: the proportion of actually positive samples that the model correctly predicts as positive, computed as TP/(TP + FN), where FN is the number of false negatives.
F1 score: the harmonic mean of precision and recall, a metric that considers both, computed as F1 = 2 × Precision × Recall/(Precision + Recall).
AP: in object detection, average precision measures the model's performance in each category by averaging the precision at different recall levels.
mAP: mean average precision measures performance across all classes, usually as the average of the per-category APs.
In this article, we use AP and mAP as performance metrics.
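As a reference for how AP is computed, the following sketch implements interpolated AP over equally spaced recall positions (KITTI-style R40); the official KITTI evaluation additionally applies per-class IoU thresholds and difficulty filtering.

```python
import numpy as np

def average_precision(scores, matched, n_gt, n_points=40):
    """Interpolated AP over equally spaced recall positions.
    scores: detection confidences; matched: 1 if the detection hits an
    unclaimed ground-truth box at the required IoU, else 0."""
    order = np.argsort(-scores)           # rank detections by confidence
    tp = np.cumsum(matched[order])
    fp = np.cumsum(1 - matched[order])
    recall = tp / n_gt
    precision = tp / (tp + fp)
    ap = 0.0
    for r in np.linspace(0, 1, n_points + 1)[1:]:   # skip recall = 0
        p = precision[recall >= r]
        ap += (p.max() if p.size else 0.0) / n_points
    return ap

scores = np.array([0.9, 0.8, 0.7, 0.6])
matched = np.array([1, 0, 1, 1])
print(average_precision(scores, matched, n_gt=4))
```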
As shown in Table 1, we compare our model with state-of-the-art models on the 3D detection benchmark of the KITTI test server for three object classes: cars, pedestrians, and cyclists. The proposed method achieves the best cyclist detection of all pillar-based methods across the three difficulty levels, leading the second-best methods, SMS-Net and TANet, by 2.46%, 0.07%, and 0.55% and by 2.11%, 0.86%, and 1.39% at the easy, moderate, and hard levels, respectively. For the pedestrian category, our model outperforms all other methods at the easy and hard difficulty levels, and ranks second at the moderate level, behind only PIFENet. For car detection, we compare our model with sparse-to-dense (STD) and the graph neural network for 3D object detection in a point cloud (Point-GNN): performance is comparable at the easy level, but our method performs relatively poorly at the moderate and hard levels, trailing STD by 3.82% and 5.43% and Point-GNN by 3.58% and 2.63%, respectively. However, the inference speed of STD is only 12.5 FPS, and that of Point-GNN only 1.6 FPS, which cannot guarantee real-time performance. Our proposed method achieves an inference speed of 41 FPS, faster than most one-stage methods.
Performance Comparison of 3D Detection on KITTI Test Set. We Report Results for Point-Based (Point), Voxel-Based Methods Using 3D Voxel Representations (Voxel-3D), and Pillar-Based (Voxel-PI) Methods.
Note. 3D = three-dimensional; FPS = frames per second; AP = average precision; Mod. = moderate; Ped. = pedestrians; Cyc. = cyclists; Point-GNN = graph neural network for 3D object detection in a point cloud; STD = sparse-to-dense; IoU-Net = intersection-over-union network; SA-SSD = structure-aware single-stage detector; PIFENet = pillar feature network. The best performance among Voxel-PI methods is shown in bold, and blue entries are the second best.
Table 2 presents the performance of our method in BEV detection. In pedestrian detection, MASNet attains a BEV mAP of 53.38%, exceeding all methods except PIFENet, while our inference speed surpasses PIFENet's by 57.69%. This suggests that our method offers a useful balance of speed and accuracy. These improvements demonstrate that our network enhances the expressiveness of pillars, better utilizes the sparsity of point cloud data, and lets sparse and dense information communicate fully, producing more discriminative features that benefit small object detection.
Performance Comparison of Bird's-Eye View Detection on the KITTI Test Dataset for the Pedestrian Class. Results in Bold Denote the Best of All Methods, and the Second-Ranked Performance Is Shown in Blue.
Note. 3D = three-dimensional; Mod. = moderate; mAP = mean average precision; Point-GNN = graph neural network for 3D object detection in a point cloud; STD = sparse-to-dense; PIFENet = pillar feature network.
Several qualitative results for 3D detection on KITTI are shown in Figure 6. We observe that PointPillars has serious false detection problems, especially for small objects. Because pedestrians are smaller than cars, LiDAR scans fewer valid points on them, and pedestrians often appear against cluttered backgrounds where poles, trees, and other objects are easily confused with them, causing PointPillars to produce false positives. Our method solves this problem to some extent, demonstrating the accuracy and robustness of our model.

Qualitative results for pedestrian detection on the KITTI validation set: the two-dimensional (2D) scene (top), detection results for PointPillars (bottom left), and detection results for our method (bottom right). Ground-truth boxes are dark blue; predicted boxes are shown in other colors.
Some of the model hyperparameters and optimizer settings used during model construction and training were adopted from values that performed well in prior work. We did not tune them extensively, although they also affect model performance; intelligent optimization algorithms could be used to search these hyperparameters for the best performance.
The experimental results show that our model can generate high-quality 3D BBoxes for all target categories, especially pedestrians and cyclists, but its performance on the car category does not match that on the other two, which may be due to our model structure. In addition, our experiments were conducted on a single dataset, so the generalization ability of the model remains a challenge for future work.
The FSN module adopts sparse convolution, which performs convolution directly on active data points while ignoring blank or irrelevant regions. Compared with regular convolution, which must operate on the entire data region including large zero or blank areas, this significantly reduces the computational load and improves processing speed. Because sparse convolution only stores and processes information from active data points, it also reduces memory usage and the number of activations that must be computed, which lowers the effective complexity of the model and helps prevent overfitting. Structurally, submanifold sparse convolution runs strictly on the submanifold of active sites, maintaining the sparsity of the data without dilating it across the whole region; this design makes sparse convolutional networks particularly efficient on sparse data.
3D object detection plays a crucial role in autonomous driving, providing vehicles with a precise spatial understanding of their surroundings. The technology identifies and locates objects from sensor data, recovering 3D information such as position, size, and shape, which is crucial for navigation, obstacle avoidance, path planning, and motion prediction. The experimental results of our algorithm show superior performance on the pedestrian and cyclist categories, especially for small targets, so our research can further improve recognition and localization in autonomous driving.
The challenges that 3D object detection faces in autonomous driving, such as occlusion, changing lighting conditions, and complex traffic scenes, are active research topics. To handle occlusion, we consider using Repulsion Loss to optimize the prediction box so that it stays close to its target while moving away from other objects, reducing the impact of occlusion. Lighting variation significantly affects detection accuracy; to address it, we consider data augmentation techniques that simulate different lighting conditions, creating diverse data and improving robustness through training. In complex traffic scenarios, detection models must handle highly dynamic environments; multi-scale feature fusion can support multi-scale prediction, and relational networks can establish connections between objects and their context.
In autonomous driving, performance metrics for 3D object detection are crucial for evaluating models. These indicators measure not only accuracy but also inference speed, model complexity, and resource requirements in practical applications, which are key to balancing performance and computational cost. Fast and accurate detection is essential: the model must process and respond to sensor data in a very short time to support real-time decisions, so inference speed directly affects the real-time performance of the system. More complex models may provide higher detection accuracy but require more computational resources, which can reduce inference speed. Autonomous driving systems usually run on limited hardware, so a 3D object detection model must maintain high accuracy while controlling its resource consumption.
Ablation Studies
In this section, we analyze the effect of each component of MASNet through ablation experiments on the KITTI validation dataset. We use PointPillars as the baseline, which achieves a 3D mAP of 78.33%, 61.40%, and 66.11% for car, pedestrian, and cyclist detection, respectively.
PMA Module Analysis
We evaluate the contribution of the PMA module by adding the CA and PA modules to the baseline in different ways. As shown in Table 3, using the PA module or the CA module alone improves the detection performance of the network, but the improvement is relatively limited. We also evaluate several variants: PMA, CA–PA, and PA–CA. PMA concatenates the outputs of PA and CA along the channel dimension; PA–CA cascades PA into CA, and CA–PA is the reverse. Both CA–PA and PA–CA achieve further performance improvements, showing the importance of using channel-wise and point-wise information together. PMA achieves a significant 2.03% mAP increase, the best among the combinations of the two attention mechanisms, which further verifies its superiority.
Ablation Experiments on Effects of Proposed PMA Module. The Bold Data Represent the Optimal Performance Data of all Comparison Methods Under the Same Indicator.
Note. PMA = pillar mix attention; PA = point-wise attention; CA = channel-wise attention; PA–CA = PA cascaded with CA; CA–PA = CA cascaded with PA.
Table 4 shows the role of attention pooling. To reduce the computational load, we only replace max pooling in the simple PointNet structure of PointPillars. Different pooling methods are tested with and without the PMA module. As shown in Table 4, replacing max pooling (Max.) with average pooling (Avg.) under otherwise unchanged conditions does not significantly enhance performance. Using attention pooling (Att.) alone yields a 3D mAP of 69.06%, only 0.45% higher than the baseline; however, with the PMA module, performance rises to 71.64%, with an additional 0.89% 3D mAP improvement in the pedestrian category. This shows that the PMA module and attention pooling enhance each other. The results also indirectly confirm that aggregating pillar information with max pooling is flawed, and that our attention pooling can fully exploit the global information in the pillar and generate a high-quality feature pseudo-image, improving detection accuracy and proving the effectiveness of the method.
Ablation Experiments on Effects of Proposed Attention Pooling Operation. The Bold Data Represent the Optimal Performance Data of all Comparison Methods Under the Same Indicator.
Note. 3D = three-dimensional; mAP = mean average precision; PMA = pillar mix attention; Avg. = average; Att. = attention.
We further validate the effectiveness of the FSN module; the results are shown in Table 5. In the FSN module, we introduce a sparse convolutional network with focal sparse convolution to obtain more learnable features and stronger sparse pillar feature encoding, while a simple network fuses sparse features with dense features. As shown in Table 5, introducing the FSN module boosts performance to 70.08% compared with the baseline, with 3D mAP improvements of 2.27%, 3.06%, and 1.33% for car, pedestrian, and cyclist, respectively. Using PMA, attention pooling, and FSN simultaneously yields a compelling performance improvement, proving that all three are indispensable parts of the network.
Ablation Experiments on Effects of Proposed FSN Module. The Bold Data Represent the Optimal Performance Data of all Comparison Methods Under the Same Indicator.
Note. FSN = focal sparse network; mAP = mean average precision; PMA = pillar mix attention; Att. = attention.
Conclusion
In this paper, we propose MASNet, a novel one-stage network for 3D object detection. The PMA module, the attention-pooling operation, and the FSN module are its core components.
The stackable PMA module enhances pillar feature extraction by integrating PA and CA, and stacking it yields multi-level pillar feature attention that further improves small object detection. The attention-pooling operation encodes geometric information effectively and aggregates features according to their importance, making full use of all points in a pillar to reduce the loss of fine-grained information. The FSN module, a new sparse multi-scale-fusion module, powerfully encodes pillar features via focal sparse convolution and facilitates multi-level fusion of sparse and dense features: by generating dynamic outputs, it enhances information exchange between disconnected features and fuses sparse and dense features without excessive computational cost. Experiments show that our method generates high-quality 3D BBoxes for all categories, especially pedestrians and cyclists, although the car category does not show a consistent improvement. Some hyperparameters and optimizer settings were adopted from values that performed well in prior work; intelligent optimization algorithms could be used to search for better settings. Our experiments were conducted on a single dataset, so we plan to extend the model to other commonly used datasets, such as nuScenes and Waymo Open, whose rich scenes and annotations can characterize our model's generalization ability. In the future, we also hope to adopt a multi-modal fusion approach, introducing red–green–blue images to provide prior information and fusing the features of the two modalities to further improve detection performance.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
