Abstract
A multifeature fusion small-target detection network (MF-Net) based on PointRCNN is proposed to enhance the detection accuracy of small targets in vehicle-mounted LiDAR systems. A semantically controlled farthest point sampling algorithm and a multisampling strategy are presented to achieve uniform sampling while retaining a greater number of small-target points. Additionally, a local feature aggregation module learns the intensity features of small-target point clouds through spatial intensity encoding. Furthermore, PointPillars-style processing converts the three-dimensional point cloud into a pseudo-image, allowing features at various scales to be extracted with a feature pyramid network. Experimental results demonstrate that MF-Net improves the mean average precision for pedestrian and cyclist detection by 2.49% and 2.88%, respectively, compared to the baseline network PointRCNN. The false detection rate is significantly reduced, and detection accuracy is enhanced across diverse scenarios.
Introduction
In automatic driving systems, environmental perception is intricately linked to a diverse array of sensors. The environmental perception data acquired by vehicle sensors serve as the cornerstone for subsequent processes, including positioning, path planning, decision-making, and control. Fully capturing and leveraging data acquired from existing sensors for environmental assessment is a pivotal aspect of autonomous driving technology (Song et al., 2024). Among the numerous onboard sensors, LiDAR has garnered significant attention due to its ability to provide extensive three-dimensional (3D) data and its immunity to disruptions caused by lighting conditions (Lee et al., 2024). The LiDAR sensor acquires point-cloud data by emitting a laser beam and capturing the echo signal reflected from 3D objects in space. In recent years, as autonomous driving technology has continued to advance, deep learning techniques have become widely used for 3D target detection with vehicle-mounted LiDAR, enabling the effective extraction and utilization of fine point-cloud features from small targets. Classic studies, such as VoxelNet (Zhou & Tuzel, 2018), PointPillars (Lang et al., 2019), and PointRCNN (Shi et al., 2019), have paved the way for subsequent enhancements and optimizations of point-cloud-based 3D target detection. Depending on how the point cloud is processed, current deep learning approaches to 3D target detection in LiDAR point clouds can be categorized into three groups: multiview, voxel-based, and pure point-cloud methods.
In the multiview approach, the VeloFCN (Li et al., 2016) network projects a 3D point cloud into two-dimensional space for processing. However, this projection causes information loss and limits accuracy, primarily because the constrained unit coding ability cannot address the challenges arising from point-cloud discretization. BirdNet (Beltrán et al., 2018) is a 3D detection framework tailored to bird's-eye-view projections of LiDAR point clouds; by eliminating 3D bounding-box post-processing, it improves detection rates. BirdNet+ (Barrera et al., 2020) further refines the detection of 3D objects. Despite these advancements, the approach still generates substantial ambiguities, and feature extraction is hindered by the sparsity of the point cloud. MV3D (Chen et al., 2017) integrates data from multiple viewpoints for feature extraction, which is effective but demands considerable computational resources and lacks robustness; it cannot transcend the visual constraints of its own sensors and falls short in small-target feature extraction and detection accuracy. Additionally, MVMM (Li et al., 2023) merges data from different sensors and viewpoints through point-cloud coloring, achieving high detection accuracy but with complex computations and challenging hyperparameter tuning.
The voxel method discretizes a sparse 3D point cloud into a voxel grid, using voxels as representations of the point cloud. Compared to the multiview method, the voxel method processes less data and attains higher detection accuracy. Vote3Deep (Engelcke et al., 2017) incorporates a convolution layer grounded in centrosymmetric voting and a modified linear unit to manage sparse point clouds, yet it encounters challenges in effectively extracting local features. Subsequently, Voxel-RCNN (Jiang et al., 2024) improves feature extraction and refinement through a two-stage detection strategy, mitigating the information loss stemming from voxelization and enhancing detection accuracy, albeit at the expense of detection speed. Additionally, MVTR (Ai et al., 2024) integrates point-cloud semantics, sparsity, and non-empty voxel features, facilitating easier access to global feature information, yet its performance in extracting features from small-target point clouds remains limited. PV-SSD (Shao et al., 2024) proposes a multimodal feature fusion network that combines projection features and voxel features, but its ability to extract features from sparse point clouds still requires further enhancement. VoxelNeXt (Chen et al., 2023) achieves efficient 3D object detection through fully sparse convolutional networks, avoiding reliance on anchors, center points, and dense heads; however, its performance is relatively limited when handling small objects and low-density point clouds.
The development of pure point-cloud methods can be largely attributed to the introduction of PointNet (Qi et al., 2016) and PointNet++ (Qi et al., 2017), which enable 3D object detection networks to process point clouds directly, thus avoiding information loss. The DF-SSD network (Zhai et al., 2020) enhances detection accuracy and reduces model parameters through feature fusion and DenseNet; however, it lacks low-level deep feature information, which leads to slower detection speeds. In contrast, TSKPD (Feng et al., 2024) employs multiframe point-cloud image fusion to handle missing point clouds, significantly improving the robustness of keypoint detection, but it has high computational demands and its small-target performance still requires improvement. The PointPillars network (Lang et al., 2019) converts point clouds into pillars, significantly improving runtime speed, but its detection accuracy needs further enhancement. CenterPoint (Yin et al., 2021) simplifies the detection pipeline by using center-point regression to predict 3D size, orientation, and velocity, which improves accuracy; nevertheless, its performance on small targets and objects with extreme aspect ratios remains limited. The PointRCNN network (Shi et al., 2019) excels in 3D object detection but struggles with small targets. PillarNeXt (Li et al., 2023) focuses on a pillar-based feature extraction model, expanding the receptive field to capture richer features from the point cloud and thereby improving detection performance; however, its feature fusion between points and pillars still requires refinement. To address these limitations, this paper proposes a multifeature fusion small-target detection network (MF-Net) that is based on PointRCNN and integrates both point and pillar features. This approach improves the detection accuracy of small targets such as pedestrians and cyclists, reduces false detection rates, and enhances target feature extraction, thereby improving performance across varying levels of task difficulty.
Currently, deep-learning-based algorithms for 3D small-target detection face several challenges, including a restricted feature range, scarcity of available features, and complexity of feature extraction. To address these issues, common approaches include multiscale learning, optimized downsampling strategies, and data augmentation. Subsequent developments introduced local search, sparse convolution networks (Ke et al., 2025), and unsupervised training strategies (Cai et al., 2025), further enhancing the feature extraction capability of 3D small-object detection algorithms. In 2D detection, although downsampling can effectively reduce feature-map size and computational complexity, excessive downsampling may discard small-target feature information. Although two-stage detection algorithms generally achieve higher accuracy, they tend to be slower in real-time scenarios (Li et al., 2021). The channel enhancement feature pyramid network (CE-FPN) (Luo et al., 2022) improves feature representation by exploiting rich channel information through sub-pixel convolution and channel attention mechanisms; however, its speed remains limited when processing large-scale point-cloud data.
Building upon traditional feature pyramids, Lin et al. developed the feature pyramid network (FPN), a top-down architecture with lateral connections (depicted in Figure 1). The FPN integrates features at the same level during the upsampling phase, thereby improving the detection accuracy of small targets through multiscale learning (Lin et al., 2017).

Figure 1. Feature pyramid network (FPN) structure.
Despite the capability of farthest point sampling based on Euclidean distance (D-FPS) to achieve uniform sampling, this process inevitably leads to the loss of substantial background point information (Zhu et al., 2024). Consequently, some researchers have refined downsampling strategies to better preserve small target point clouds. For instance, the 3DSSD network (Yang et al., 2020) integrates a feature distance-based sampling method with the D-FPS algorithm, effectively retaining small target point clouds; however, its operational speed remains low in large-scale point cloud scenarios. The IA-SSD network (Zhang et al., 2022) employs a downsampling strategy that combines category-aware and centroid-aware sampling to select foreground points, utilizing a single-stage detection approach to enhance detection speed. The DPA-RCNN network (Jiang et al., 2024) incorporates a center-aware feature extraction module to selectively retain points near the object center, while applying an edge segmentation-aware module for bounding box regression; however, its accuracy diminishes when detecting occluded objects. Additionally, the WS-SSD network (Li et al., 2024) improves pedestrian detection accuracy through weighted sampling and a unique network architecture, yet its performance in detecting small targets in complex scenes remains inadequate.
The advantages and disadvantages of different point cloud processing methods are presented in Table 1. Overall, deep-learning-based small-target detection algorithms can significantly enhance performance through multiscale learning and optimized downsampling strategies, particularly for small targets that lack distinct appearance features. However, in complex scenes, small targets are often affected by sparse and partially missing point clouds. Therefore, designing an efficient network to extract and utilize small-target feature information remains an important research topic in the current field.
Table 1. Comparison of the Advantages and Disadvantages of Different Point Cloud Detection Methods.
3D = three-dimensional; CE-FPN = channel enhancement feature pyramid network; D-FPS = farthest point sampling based on Euclidean distance.
Figure 2. PointNet++ point-set abstraction layer structure.
The main contributions and innovations of this paper are as follows:
(1) A novel semantically controlled farthest-point sampling algorithm is proposed to mitigate the loss of point-cloud information for small targets during downsampling in the PointNet++ backbone of PointRCNN. A multisampling strategy is combined with it to ensure uniform sampling of the point set while effectively retaining small-target points.
(2) A novel local feature aggregation (LFA) module is proposed to correct and encode the point-cloud echo intensity in the PointNet++ backbone network. By utilizing positional intensity encoding, the intensity features of small-target point clouds are learned to reduce the misdetection of similar obstacles.
(3) A column feature branch is proposed to enhance the feature representation of small targets. 3D point-cloud data are converted into pseudo-images, and an FPN is adopted to enhance multiscale feature learning for small targets. A multifeature concatenation module then fuses the point and column features to improve detection accuracy for small targets.
The structure of this paper is organized as follows: Section 2 provides a detailed analysis of the PointRCNN network structure and proposes MF-Net based on PointRCNN. Section 3 presents the experimental validation and discussion on the KITTI dataset. Finally, Section 4 presents the conclusions.
PointRCNN Network Structure Analysis
In this study, PointRCNN is optimized as the foundational network framework, with PointNet++ serving as the backbone network. This combination integrates bottom-up encoding branches and top-down decoding branches to acquire highly discriminative features, and the multiscale learning approach allows PointRCNN to capture richer and more comprehensive multiscale information. The PointNet++ backbone comprises several set abstraction layers, each including sampling, grouping, and PointNet layers, as illustrated in Figure 2. Each set abstraction layer takes an N × (d + C) matrix of N points with d-dimensional coordinates and C-dimensional features as input, and outputs an N′ × (d + C′) matrix of N′ subsampled points with aggregated local features.
Sampling layer: The farthest-point sampling algorithm, which relies on Euclidean distance, is employed to uniformly select points from the point-cloud data, characterizing the original point set while effectively reducing the data volume. Grouping layer: We define a local region around each sampled centroid by gathering its neighboring points, for example through a ball query within a fixed radius. PointNet layer: Receives the grouped collection of points in each local region and encodes it into a feature vector with a shared mini-PointNet.
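To make the set abstraction pipeline concrete, the following is a minimal NumPy sketch of one such layer, with a simple max-pool standing in for the learned PointNet MLP; the point counts, radius, and group size are illustrative values, not the network's actual settings.

```python
import numpy as np

def farthest_point_sampling(points, n_samples):
    """Iteratively pick the point farthest from the already-chosen set (D-FPS)."""
    n = points.shape[0]
    chosen = np.zeros(n_samples, dtype=int)
    dist = np.full(n, np.inf)
    chosen[0] = 0                          # start from an arbitrary point
    for i in range(1, n_samples):
        d = np.linalg.norm(points - points[chosen[i - 1]], axis=1)
        dist = np.minimum(dist, d)         # distance to the current sample set
        chosen[i] = np.argmax(dist)
    return chosen

def ball_query(points, centers, radius, k):
    """For each center, gather up to k neighbor indices within `radius`."""
    groups = []
    for c in centers:
        idx = np.where(np.linalg.norm(points - c, axis=1) < radius)[0][:k]
        if idx.size == 0:
            idx = np.array([0])
        groups.append(idx)
    return groups

def set_abstraction(points, feats, n_samples, radius, k):
    """One SA layer: sample centroids, group neighbors, max-pool per group."""
    centers_idx = farthest_point_sampling(points, n_samples)
    centers = points[centers_idx]
    groups = ball_query(points, centers, radius, k)
    # Stand-in for the learned PointNet MLP: relative coordinates plus raw
    # features, aggregated by a symmetric max-pool over each local region.
    out = np.stack([
        np.max(np.concatenate([points[g] - c, feats[g]], axis=1), axis=0)
        for g, c in zip(groups, centers)
    ])
    return centers, out

pts = np.random.rand(1024, 3)              # toy point cloud
fts = np.random.rand(1024, 4)              # e.g., coordinates + echo intensity
centers, region_feats = set_abstraction(pts, fts, 256, 0.2, 16)
print(centers.shape, region_feats.shape)   # (256, 3) (256, 7)
```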
PointRCNN encounters the following challenges in extracting small-target features:
When using PointNet++ to extract point features, it overlooks multivariate attributes such as voxels and columns, leading to a limited feature set that hinders the detection performance for small targets.
The point-set abstraction layer relies primarily on farthest-point sampling, which prioritizes distant points to cover the space and can therefore yield incomplete feature extraction for smaller targets.
Within the local neighborhood, PointNet relies solely on point coordinates to approximate information, which may not fully capture local details and fine-grained features. This limitation can lead to misdetections, especially for small targets with similar shapes. As illustrated in Figure 3, the echo intensity of obstacles varies significantly, and this additional information could be leveraged to effectively distinguish between similar obstacles.

Figure 3. Histogram of echo intensity of different objects.
The MF-Net architecture is a two-stage detection network, as shown in Figure 4. The first stage consists of three key modules: point feature extraction, column feature extraction, and feature fusion. In the point feature extraction branch, an enhanced PointNet++ backbone is used for encoding and decoding. This branch combines farthest-point sampling with semantically controlled farthest-point sampling and utilizes an LFA module to efficiently capture local features. Afterward, a series of inverse interpolation operations are applied to generate feature vectors for the points. The column feature extraction branch then refines the target's positional and semantic information by using an FPN. In the second stage, the features obtained from the first stage undergo local pooling to further extract local features. Meanwhile, the bounding-box features predicted in the first stage are processed using segmentation masks and passed through a multilayer perceptron (MLP) for dimensionality expansion. These features are then combined with the point-cloud features and pillar features from the first stage to obtain both global semantic features and local features. Finally, the merged features are processed through PointNet for regression and classification, generating the final confidence scores and refined detection boxes.

Figure 4. Network structure of the multifeature fusion small-target detection network (MF-Net).
The point-feature extraction branch primarily uses an enhanced version of PointNet++ as the cornerstone of its network architecture. This enhanced PointNet++ adopts an encoding-decoding framework in which a point-set sampling (PS) module and an LFA module serve as the encoder, paired with a feature propagation (FP) layer functioning as the decoder. The overall layout is depicted in Figure 5.

Figure 5. Branch structure of point-feature extraction.
The original point cloud is input into the enhanced PointNet++ backbone network, where feature extraction is carried out by an encoder composed of the PS module and the LFA module. The decoding process uses the stacked inverse interpolation operation of PointNet++ to obtain the point features within each region. In this study, our primary focus is improving the encoding layer of the PointNet++ network; we provide a comprehensive explanation of its two modules: the PS module and the LFA module.
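For reference, the inverse interpolation used by the PointNet++ FP layer propagates features from a sparse level back to a denser one by inverse-distance weighting over the k = 3 nearest sampled neighbors of each point x:

```latex
f(x) = \frac{\sum_{i=1}^{k} w_i(x)\, f_i}{\sum_{i=1}^{k} w_i(x)},
\qquad
w_i(x) = \frac{1}{d(x, x_i)^2}, \quad k = 3
```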
(1) Point set sampling module.
The original point cloud may contain a disproportionately high number of background points and larger target points compared to smaller target points. When using a single farthest-point sampling method, this disparity can result in the loss of critical small target points. To broaden the global sensing scope of the point cloud and ensure the retention of more small target points, this paper introduces a semantically controlled farthest-point sampling algorithm that integrates multiple sampling techniques, akin to a heuristic feature search optimizer. In the hierarchical downsampling process, the initial layer applies the traditional farthest-point sampling algorithm (D-FPS), whereas the subsequent three layers utilize the proposed semantically controlled farthest-point sampling algorithm (SC-FPS). This methodology not only broadens the global sensing range of the point cloud but also successfully retains a higher proportion of small target points, as demonstrated in Figure 6.

Figure 6. Layer-level downsampling structure in the encoder.
The SC-FPS algorithm introduced in this study incorporates semantic weights into the farthest-point sampling process, minimizing redundancy within large target point clouds while ensuring that small target point clouds, despite their smaller quantity, are not overlooked. This strategy helps provide more small target points for subsequent feature extraction tasks. The detailed process of the SC-FPS algorithm is outlined as follows:
First, conventional farthest-point sampling (FPS) distances are computed. Subsequently, point-wise features are fed into a multilayer perceptron to predict a semantic score for each point, and these scores are used to weight the Euclidean distances in the sampling metric.
In the process of determining the sampled point set, the initial point with the highest semantic score is first selected based on the semantically weighted distance and added to the sampling point set. Subsequently, the system iteratively calculates the points that are the farthest from the current sampling point set and adds them to the final point set until the preset number of points is reached. A parameter controls the relative contribution of the semantic score to the weighted distance, balancing semantic importance against geometric coverage.
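The paper's exact weighting formula is not reproduced in this excerpt; the sketch below assumes the semantically weighted distance is the Euclidean farthest-point distance scaled by a per-point semantic score, with a hypothetical exponent `gamma` trading semantic importance against geometric coverage.

```python
import numpy as np

def sc_fps(points, sem_scores, n_samples, gamma=1.0):
    """Semantically controlled FPS (sketch): scale the farthest-point distance
    by a per-point semantic (foreground) score so that high-scoring
    small-target points are retained preferentially. `gamma` is a
    hypothetical parameter, not from the paper."""
    n = points.shape[0]
    weights = sem_scores ** gamma                   # per-point semantic weight
    chosen = np.zeros(n_samples, dtype=int)
    chosen[0] = int(np.argmax(sem_scores))          # start at the highest score
    dist = np.full(n, np.inf)
    for i in range(1, n_samples):
        d = np.linalg.norm(points - points[chosen[i - 1]], axis=1)
        dist = np.minimum(dist, d)                  # distance to sampled set
        chosen[i] = int(np.argmax(weights * dist))  # semantically weighted
    return chosen

pts = np.random.rand(4096, 3)
scores = np.random.rand(4096)                       # e.g., MLP foreground scores
keep = sc_fps(pts, scores, 1024)
print(keep.shape)                                   # (1024,)
```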
(2) Local feature enhancement module.
In this study, we propose a local feature enhancement method that integrates echo intensity and positional data to capture the geometric configuration and intensity variations within a small-target point cloud. This is achieved by introducing spatial position intensity coding. The architecture of the LFA module is depicted in Figure 7.

Figure 7. Local feature enhancement module.
First, after the sampling stage is concluded, each center point resulting from the downsampling process is designated as the center of a circle. Subsequently, the neighboring points falling within this neighborhood are gathered to form a local region for encoding.
Secondly, before encoding the positional intensity of the neighboring points, the raw echo intensity of each point is corrected so that intensity values measured under different conditions, such as different ranges, are comparable across objects.
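The specific correction applied here is not given in this excerpt; one common range normalization from the LiDAR literature, which such a correction might resemble, rescales the raw return by the squared range relative to an assumed reference range R_ref:

```latex
\tilde{I} = I_{\mathrm{raw}} \cdot \left( \frac{R}{R_{\mathrm{ref}}} \right)^{2}
```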
Once the corrected echo intensity values are obtained, the 3D coordinates of each point in the point cloud, along with their corresponding corrected echo intensities, are introduced. To effectively encode the spatial coordinates while preserving the geometric structure of the point cloud, this study employs a method of spatial position coding, as detailed in the equation below.
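The original equation is not reproduced in this excerpt; a plausible form, patterned on RandLA-Net-style relative position encoding with the corrected intensity appended (all symbols here are illustrative), is:

```latex
r_i^{k} = \mathrm{MLP}\!\left( p_i \,\oplus\, p_i^{k} \,\oplus\, \left(p_i - p_i^{k}\right) \,\oplus\, \left\lVert p_i - p_i^{k} \right\rVert \,\oplus\, \tilde{I}_i^{k} \right)
```

where p_i is the neighborhood center, p_i^k its k-th neighbor, ⊕ denotes concatenation, and Ĩ_i^k is the corrected echo intensity of the neighbor.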
After the positional intensity encoding module, the encoded features and the remaining point features are concatenated and aggregated, for example through a shared MLP followed by max pooling, to produce the local feature vector of each region.
To enrich the feature expression while achieving both good accuracy and fast inference, this paper introduces the column feature extraction branch, which mainly comprises the pillar feature network based on PointPillars and a 2D convolutional neural network (CNN) built on the FPN. The structure of the column feature extraction branch is shown in Figure 8.

Figure 8. Structure of the pillar feature extraction branch.

Figure 9. Schematic diagram of multifeature splicing.
(1) Pillar feature network.
The point-cloud space is first divided into column (pillar) cells; each cell can contain up to a fixed number of points, with denser pillars randomly subsampled and sparser ones zero-padded. The points in each pillar are augmented with their offsets from the pillar center and encoded by a simplified PointNet, and the resulting pillar features are scattered back to their grid positions to form a pseudo-image.
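As a concrete illustration, the following is a minimal sketch of this pillarization step, with an elementwise max standing in for the learned per-pillar PointNet; the grid size, cell size, and point cap are illustrative values, not the paper's settings.

```python
import numpy as np

def pillarize(points, grid=(432, 496), cell=0.16, max_pts=32):
    """Sketch of PointPillars-style pillarization: bucket points into x-y
    cells, cap each pillar at `max_pts` points, and scatter a simple
    pillar feature back into a dense pseudo-image."""
    ix = (points[:, 0] / cell).astype(int)
    iy = (points[:, 1] / cell).astype(int)
    valid = (ix >= 0) & (ix < grid[0]) & (iy >= 0) & (iy < grid[1])
    pseudo = np.zeros((4, grid[0], grid[1]))       # C x W x H pseudo-image
    counts = {}
    for p, x, y in zip(points[valid], ix[valid], iy[valid]):
        if counts.get((x, y), 0) >= max_pts:       # cap pillar occupancy
            continue
        counts[(x, y)] = counts.get((x, y), 0) + 1
        # Stand-in for the learned per-pillar PointNet: elementwise max
        # over (x, y, z, intensity) of the points assigned to the pillar.
        pseudo[:, x, y] = np.maximum(pseudo[:, x, y], p[:4])
    return pseudo

pts = np.random.rand(20000, 4) * [69.12, 79.36, 4.0, 1.0]  # x, y, z, intensity
img = pillarize(pts)
print(img.shape)                                   # (4, 432, 496)
```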
(2) 2D CNN backbone network based on feature pyramid.
In this study, a feature-pyramid-based network is used to capture semantic and location information at different scales to improve detection accuracy. The spatial resolution of the pseudo-image is gradually reduced using three sets of convolutions in the bottom-up branch, which outputs deep features with strong semantics but weak spatial detail; the top-down branch then restores resolution by upsampling. In the horizontal (lateral) connections, the dimensionality is first reduced by a 1 × 1 convolution, after which the upsampled deep features are fused with the corresponding shallow features by elementwise addition, so that the merged maps combine strong semantics with precise localization.
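A minimal PyTorch sketch of such an FPN head over the pseudo-image follows; the channel widths and input size are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFPN(nn.Module):
    """Sketch of the FPN-style 2D backbone: three stride-2 stages going down,
    1x1 lateral convolutions to unify channel widths, and top-down
    upsampling with elementwise addition."""
    def __init__(self, c_in=64, c_mid=(64, 128, 256), c_out=128):
        super().__init__()
        self.down = nn.ModuleList()
        c_prev = c_in
        for c in c_mid:                            # bottom-up: halve resolution
            self.down.append(nn.Sequential(
                nn.Conv2d(c_prev, c, 3, stride=2, padding=1),
                nn.BatchNorm2d(c), nn.ReLU(inplace=True)))
            c_prev = c
        # 1x1 lateral convolutions reduce each stage to a common width
        self.lateral = nn.ModuleList(nn.Conv2d(c, c_out, 1) for c in c_mid)

    def forward(self, x):
        feats = []
        for stage in self.down:
            x = stage(x)
            feats.append(x)
        # top-down: upsample the deeper map and add the lateral projection
        p = self.lateral[-1](feats[-1])
        for lat, f in zip(reversed(self.lateral[:-1]), reversed(feats[:-1])):
            p = F.interpolate(p, size=f.shape[-2:], mode="nearest") + lat(f)
        return p                                   # fused multiscale map

fpn = TinyFPN()
out = fpn(torch.randn(1, 64, 248, 216))
print(out.shape)                                   # torch.Size([1, 128, 124, 108])
```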
After the point-cloud data have traversed both the column and point-feature extraction branches, the two branches must be integrated. The point features are derived through the enhanced PointNet++ encoding and decoding process, which encapsulates the coordinate information of each point. Meanwhile, in the column feature extraction branch, the column feature grid records the point coordinates associated with each column. This correspondence is used to fuse the features, as illustrated in Figure 9.

Figure 10. Partial image data from the KITTI dataset and the nuScenes dataset.
Furthermore, it is crucial to adjust the number of channels in the subsequent classification and regression branches to match the number of channels after concatenation. This ensures compatibility and allows for seamless reuse within the subsequent detection network of the PointRCNN framework.
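A sketch of this lookup-and-concatenate fusion follows, assuming the pillar grid index of each point is recovered from its x-y coordinates; the function and parameter names here are hypothetical.

```python
import torch

def fuse_point_pillar(point_feats, point_xyz, pillar_map, cell=0.16):
    """For every point, look up the pillar feature at the pseudo-image cell
    containing that point, then concatenate it with the point-branch
    feature. `cell` is the pillar size used when building the pseudo-image
    (illustrative value)."""
    C, W, H = pillar_map.shape
    ix = (point_xyz[:, 0] / cell).long().clamp(0, W - 1)
    iy = (point_xyz[:, 1] / cell).long().clamp(0, H - 1)
    pillar_feats = pillar_map[:, ix, iy].t()       # (N, C) gathered per point
    return torch.cat([point_feats, pillar_feats], dim=1)

pf = torch.randn(1024, 128)                        # point-branch features
xyz = torch.rand(1024, 3) * torch.tensor([69.12, 79.36, 4.0])
pm = torch.randn(128, 432, 496)                    # pseudo-image features
fused = fuse_point_pillar(pf, xyz, pm)
print(fused.shape)                                 # torch.Size([1024, 256])
```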
Experimental Setup
The performance of deep learning models typically depends on the scale and diversity of the training dataset. Commonly used datasets include the ONCE dataset (Mao et al., 2021), nuScenes (Caesar et al., 2020), DAIR-V2X (Yu et al., 2022), and KITTI (Geiger et al., 2012). To evaluate small-object detection capabilities, this study selects the KITTI 3D detection dataset and the nuScenes dataset for experimental validation and analysis. The KITTI dataset covers a variety of road environments and target categories, making it suitable for training and testing small-object detection algorithms. The nuScenes dataset provides high-quality sensor data, making it particularly suitable for evaluating small-object detection in complex urban environments. Figure 10 demonstrates the diversity of image data from these two datasets, which allows the small-object detection capability of MF-Net to be rigorously assessed.
This study uses the KITTI dataset, which consists of 7,481 training samples and 7,518 test samples. Detection tasks are classified into easy, moderate, and hard categories based on the extent of target occlusion and truncation. To bolster the robustness of the model and mitigate the risk of overfitting, various data augmentation techniques are employed, including random flipping, scaling, rotation, and the addition of non-overlapping small-target ground-truth boxes. These operations emulate targets from diverse viewpoints, scales, and rotational states, thereby enhancing sample diversity and improving the model's generalization capability. The precise environmental setup for the experiment is outlined in Table 2.
In the experimental setup, the OpenPCDet object detection framework was utilized, and training was conducted in a PyTorch environment, as shown in Table 2. The training parameters were configured with a batch size of 8, a learning rate of 0.02, a learning rate decay factor of 0.1, and a weight decay coefficient of 0.01 to reduce the risk of overfitting. The momentum parameter was varied within the range 0.85–0.95, with the specific value of 0.9 being chosen for this study. The Adam optimizer was used to minimize the loss function, and training lasted for a total of 80 epochs. During training, an intersection over union (IoU) threshold of 0.7 was established for cars, whereas for pedestrians and cyclists the IoU threshold was set at 0.5. The default 3D bounding-box dimensions, given as [length, width, height] in meters, were as follows: car [3.9, 1.6, 1.56], pedestrian [0.8, 0.6, 1.73], and cyclist [1.76, 0.6, 1.73]. Figure 11 depicts the trends of the various loss functions throughout the training process, encompassing the total loss, classification loss, corner loss, and regression loss.
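As a rough sketch, these hyperparameters compose in plain PyTorch as follows; pairing the one-cycle learning-rate schedule with the 0.85–0.95 momentum range is an assumption based on OpenPCDet's default adam_onecycle policy, not a detail stated in the paper.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import OneCycleLR

model = torch.nn.Linear(8, 8)                 # placeholder for MF-Net
optimizer = Adam(model.parameters(), lr=0.02, weight_decay=0.01)
steps_per_epoch, epochs = 936, 80             # ~7,481 samples / batch size 8
scheduler = OneCycleLR(
    optimizer, max_lr=0.02,
    total_steps=steps_per_epoch * epochs,
    base_momentum=0.85, max_momentum=0.95,    # cycles Adam's beta1 in 0.85-0.95
)
for _ in range(3):                            # per-iteration usage
    optimizer.step()
    scheduler.step()
```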

Figure 11. Trends of the various loss function values during training with respect to the number of iterations.
Table 2. Experimental Environment Configuration.
In this study, we utilize average precision (AP) and mean average precision (mAP) in 3D space as evaluation metrics to quantify single-class detection accuracy and overall detection performance, respectively. The IoU is computed from the overlap volume between the detected target's bounding box and the ground-truth bounding box in 3D space, and is compared against a threshold to decide whether a detection counts as correct.
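Concretely, for a detected box B_d and a ground-truth box B_g, with |·| denoting 3D volume:

```latex
\mathrm{IoU}(B_d, B_g) = \frac{\lvert B_d \cap B_g \rvert}{\lvert B_d \cup B_g \rvert},
\qquad
\mathrm{mAP} = \frac{1}{K} \sum_{k=1}^{K} \mathrm{AP}_k
```

where the mean is taken over the K evaluation settings; in Table 3, for example, each class's mAP is the average of its AP over the easy, moderate, and hard difficulty levels.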
In this study, we examined PointPillars (Lang et al., 2019), PointRCNN (Shi et al., 2019), STD (Yang et al., 2019), SECOND (Yan et al., 2018), and IA-SSD (Zhang et al., 2022), which are renowned for their high accuracy and speed in detecting small targets. Comparison experiments were conducted using the MF-Net proposed in this study on the KITTI test set. To ensure a fair comparison of each network’s performance, we adhered to common evaluation criteria and set the IoU thresholds for pedestrians and cyclists to 0.5 for calculating the AP and mAP metrics of the detection results. The 3D target detection results of MF-Net and the other networks are presented in Table 3.
Table 3. Three-Dimensional (3D) Mode Detection Results for Small Targets on the KITTI Test Set (%).
MF-Net = multifeature fusion small-target detection network; mAP = mean average precision; IoU = intersection over union.
As shown in Table 3 and Figure 12, MF-Net generally demonstrated higher detection accuracy for small pedestrian and cyclist targets compared to other networks. Specifically, MF-Net achieved a 2.49% improvement in the mAP value for pedestrian detection and a 2.88% improvement for cyclist detection relative to PointRCNN, leveraging its multifeature fusion strategy. This performance enhancement was evident across all three difficulty levels, particularly notable in the easy (52.01%) and moderate (44.86%) categories, while improvements for severely occluded targets were more limited (39.87%). In terms of real-time performance, MF-Net operated at a speed of 20.22 Hz, slightly slower than PointRCNN (21.35 Hz) but still outperforming other networks, notably those like STD, which operated at only 12.28 Hz. Overall, MF-Net exhibits significant advantages in both the accuracy and speed of small-target detection, highlighting its potential value in practical applications.

Figure 12. Comparison of the detection accuracy of different networks for pedestrians and cyclists under different tasks. (a) Class-average detection accuracy for pedestrians; (b) class-average detection accuracy for cyclists.
As illustrated in Figure 12, the detection accuracy of MF-Net for pedestrians and cyclists diminishes as the difficulty level increases. However, the rate of decline is notably slower compared to other models, indicating that MF-Net possesses a robust detection capability and strong scale adaptability for small targets. This advantage stems from MF-Net’s innovative approach of integrating point and column features, which enhances the feature representation of small targets.
To evaluate the effectiveness of the PS module, LFA module, and column feature extraction branch in the MF-Net network for small-target detection, we conducted ablation experiments on the KITTI test set, with results detailed in Table 4. The baseline model achieved an mAP of 43.09% for pedestrian detection and 62.28% for cyclist detection. After introducing only the PS module, the mAP increased to 43.95% for pedestrians and 63.23% for cyclists. Adding the LFA module further improved the mAP to 45.10% for pedestrians and 64.50% for cyclists. Finally, the model that combined the column feature extraction branch achieved the highest mAP of 45.58% for pedestrians and 65.16% for cyclists. These results clearly indicate that each proposed module significantly contributes to improving detection accuracy, confirming their effectiveness in enhancing small-target detection capabilities.
Table 4. Results of Ablation Experiments (%).
mAP = mean average precision; PS = point-set sampling; LFA = local feature aggregation.
The experimental results demonstrate that each module enhancement in MF-Net contributes to improving the detection performance for small targets. Notably, the LFA module significantly boosts the detection accuracy of pedestrians and cyclists while reducing false detections by encoding echo strength feature information. The PS module, which combines D-FPS and SC-FPS, effectively retains small-target points and alleviates data imbalance. Additionally, the FPN extracts multiscale features of small targets to further enhance detection accuracy. To validate the effectiveness of MF-Net in detecting small targets, we utilize Open3D to visualize the KITTI test set data. As shown in Figure 13, MF-Net’s detection accuracy for pedestrians and cyclists decreases as detection difficulty increases, but this decrease is less pronounced than in other models. This suggests that MF-Net maintains stable detection capabilities and demonstrates strong adaptability to varying scales, especially for small targets. This robustness likely comes from MF-Net’s unique approach of combining point and column features, which enriches the feature representation for smaller objects. Figure 14 illustrates MF-Net’s detection results, with 3D bounding boxes marking detections: green for vehicles, blue for pedestrians, and yellow for cyclists, each labeled to show the target’s orientation. The side of each frame marked with crossed lines indicates the direction the target is facing.

Figure 13. Detection accuracy of different networks under varying task difficulties. The red line represents the average detection accuracy for cyclists, while the blue line depicts the average detection accuracy for pedestrians.

Figure 14. MF-Net detection results. (a) 2D image data for the current frame; (b) corresponding point-cloud data. MF-Net = multifeature fusion small-target detection network; 2D = two-dimensional.
In this paper, the optimal detection models of MF-Net and the baseline network are used to detect the test set, and the detection results of the two models are visualized. Samples from three different scenes, shown in Figure 15 as Sample 1, Sample 2, and Sample 3, are selected. Sub-figure (a) shows the scene of the current frame, sub-figure (b) shows the detection results of the PointRCNN network model, and sub-figure (c) shows the detection results of the MF-Net network model. Because pure point-cloud data cannot convey the relevant context visually, the current-frame scene captured by the front-view camera is used to compare the LiDAR detection results.

Figure 15. Comparison of visualization results for different samples. (a) Scene graph of the current frame; (b) detection results of the PointRCNN network model; (c) detection results of the MF-Net network model. MF-Net = multifeature fusion small-target detection network.
In Sample 1, both models struggled to detect a heavily occluded and distant pedestrian within the orange circle. However, MF-Net demonstrated its superiority by successfully detecting two other pedestrians at long distances and accurately determining their motion directions. In contrast, PointRCNN missed these pedestrians due to the sparsity of the point cloud. Furthermore, PointRCNN incorrectly identified walls as pedestrians, while MF-Net avoided such misdetections.
In Sample 2, MF-Net successfully detected the partially occluded pedestrian target within the red elliptical box and accurately determined its direction of motion, whereas PointRCNN failed to detect the target. Furthermore, PointRCNN exhibited an over-reliance on geometric information and neglected echo strength information, leading to misidentifications. Specifically, it incorrectly classified the cylindrical iron obstacle and railing within the red elliptical box as pedestrians and vehicles, respectively, and overlooked the detection of two vehicles. In contrast, MF-Net avoided these errors.
In Sample 3, PointRCNN encounters significant challenges in detecting small targets, particularly pedestrians at long distances or under occlusion. The two pedestrians within the orange elliptical box have sparse point-cloud data due to their distance or severe occlusion, resulting in PointRCNN missing the detection and falsely identifying walls as pedestrians. Additionally, in the red elliptical box, the significant movement of a pedestrian causes its point cloud to merge with obstacles, deforming the contour and leading PointRCNN to mistakenly classify it as a cyclist. In contrast, MF-Net demonstrates its capability to accurately detect small targets and correctly determine their direction of movement.
To further validate the model’s effectiveness, a visualization analysis was conducted on the nuScenes dataset, as shown in Figure 16. In sample 1, the model successfully identified a pedestrian with occlusion; in sample 2, a pedestrian in front of a car on the left side was accurately detected; and in sample 3, despite the cyclist being partially occluded by a traffic light on the right side, the model was still able to effectively recognize the individual. These results further confirm the superiority of MF-Net in small object detection.

Figure 16. Visualization results on the nuScenes dataset. (a) Scene graph of the current frame; (b) detection results of the multifeature fusion small-target detection network (MF-Net) model.
MF-Net integrates a multisampling strategy that combines SC-FPS and D-FPS, successfully retaining more small-target point clouds while effectively reducing false detections through LFA. Its multifeature fusion strategy enhances feature extraction capabilities and demonstrates excellent performance in detecting occluded targets. Although the model shows promise, further optimization is needed to detect small targets in sparse point clouds within complex scenes, and future work could optimize the feature extraction process using metaheuristic optimization algorithms. Overall, compared to PointRCNN, MF-Net shows higher accuracy and better real-time performance in small-target detection, exhibiting greater robustness and adaptability.
Conclusion
A small-target detection network, MF-Net, is proposed to effectively reduce false and missed detection rates of small targets in deep learning by fusing point and column features. A semantically controlled farthest-point sampling algorithm, combined with a multisampling strategy, is introduced to uniformly sample the point set, thereby retaining a higher proportion of small-target points and significantly enhancing detection capabilities. The LFA module further refines the echo intensity of the point cloud and encodes the spatial position intensity, leading to a substantial reduction in false detections of similar obstacles. Furthermore, MF-Net incorporates a column extraction branch that transforms 3D point clouds into pseudo-images, employing a 2D CNN to capture small-target features across multiple scales. This enables the integration of point and column features via a multifeature splicing module, thereby enhancing the overall feature representation. Experimental results demonstrate that MF-Net achieves mAP values that are 2.49% and 2.88% higher than the baseline network PointRCNN for the pedestrian and cyclist categories, respectively, signifying a notable enhancement in detection accuracy.
In conclusion, MF-Net proposes a novel framework for small-target detection by integrating point cloud and pillar features, combined with innovative sampling strategies and the LFA module, effectively enhancing detection rates. However, although the model improves point feature extraction through an enhanced backbone network, its feature extraction performance is still less significant compared to pillar features. Additionally, in the feature fusion stage, the model simply combines point cloud and pillar features without fully considering their relative contributions, which limits its ability to detect small targets in sparse point clouds within complex scenes. Future work could incorporate attention mechanisms to assign adaptive weights to different features during the fusion process, thereby optimizing feature integration and further improving detection rates and robustness. Ultimately, we expect MF-Net to make a significant contribution to future vehicle-mounted LiDAR detection applications.
Footnotes
Funding
The author(s) received the following financial support for the research, authorship, and/or publication of this article: This study was funded by the Shandong Province Major Science and Technology Innovation Project (grant no. 2023CXGC010111) and the Small and Medium-sized Enterprise Innovation Capability Improvement Project (grant no. 2022TSGC2277).
Conflicts of Interest
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
