Abstract
In many practical applications, particularly autonomous driving, acquiring datasets annotated with both images and LiDAR point clouds is challenging and costly. To overcome the limitations of scarce annotations, we propose a weakly supervised learning method built on reciprocal knowledge transfer between an image detection model and a 3D point cloud detection model. To the best of our knowledge, this setting has not been explored in prior research. Our approach addresses the challenge of aligning features of different modalities in a bird's-eye view, and uses heatmap prediction to transfer knowledge between the image detection and 3D point cloud detection models. Additionally, we conduct extensive experiments on model performance under different parameters of the domain adaptation process, employing Exponential Moving Average (EMA) progressive learning, and we explore the benefit of fused regression and prediction heads for weakly supervised learning. Experiments on the publicly available KITTI dataset demonstrate that our approach achieves strong 3D object detection performance under weak supervision, surpassing the baseline of the original 3D point cloud detection model.
Introduction
In autonomous driving perception, existing models frequently need to be transferred to new scenarios. This is especially pertinent for 3D detection, where detection capabilities often must be extended to new categories depending on the scene. The conventional approach is to gather additional 3D annotations for the new categories and fine-tune the original model accordingly. However, annotating a large volume of 3D point cloud samples in new scenarios is a formidable challenge, whereas 2D image annotations are far easier to obtain. Furthermore, integrating texture information from images into a 3D point cloud detection model can further improve its accuracy. Effectively assisting new-category 3D detection through 2D image annotations is therefore a meaningful and challenging task.
We aim to integrate models trained on 2D annotated image datasets with detection models trained on 3D point clouds, ultimately constructing a multimodal detection model. We achieve this by training image detection models on 2D images to acquire “knowledge” about new classes. Then, employing various weakly supervised training strategies, we progressively fuse this valuable information with existing 3D detection data. Ultimately, by combining the 2D image detection model with the 3D point cloud detection model, we attain a more generalized multimodal model.
However, achieving the effective fusion of the two single-modal models (2D image detection model and 3D point cloud detection model) requires addressing the following aspects:
(1) elevating the image feature information obtained by the 2D image detection model on 2D anchor points to the 3D space and aligning it with the features acquired by the 3D point cloud detection model to avoid domain mismatch issues.
(2) facilitating the mutual “migration” of information for non-overlapping categories between the two single-modal models and employing effective supervised training with cross-category relevance.
(3) ensuring the robust detection of 3D information for new targets in new categories by the fusion model while maintaining the detection effectiveness of existing categories.
The contributions of this article include: (1) proposing a spatial-position-based feature alignment method for aligning 2D image features with 3D point cloud features, (2) proposing a weakly supervised training paradigm for training multimodal models, and (3) demonstrating the importance of feature alignment and the effectiveness of our weakly supervised training method through experiments on the public KITTI dataset.
Related work
To the best of our knowledge, there is currently no directly applicable framework for achieving our objectives. However, several lines of research from recent years are worth considering. Socher et al. 1 were among the first to introduce zero-shot learning into the field of artificial intelligence. Frome et al. 2 proposed the DeViSE model in 2013, which integrates visual features with semantic representations and achieves zero-shot learning through the embedding of semantic features. Sung et al. 3 significantly improved zero-shot learning using relation networks. In 2014, Yosinski et al. 4 experimentally studied the transferability of features in deep neural networks, exploring the effects of sharing features between different tasks. Raina et al. 5 discussed knowledge transfer from unlabeled data. Gopalan et al. 6 explored general methods for domain adaptation in visual recognition tasks. Gong et al. 7 employed manifold learning to enhance performance, while Hoffman et al. 8 employed adversarial training and conditional constraints for pixel-level domain adaptation.
In the field of 3D object detection, significant progress has been made in recent years. Methods such as Mono3D and M3D-RPN, which rely solely on monocular images with the camera as the only sensor, have been introduced; however, because images lack depth information, these methods suffer from lower detection accuracy. On the other hand, 3D object detection methods based on LiDAR point clouds, such as VoxelNet, 9 CenterPoint, 10 SA-SSD, 11 and 3DSSD, 12 utilize structural information from point clouds to achieve more accurate 3D object localization. Nevertheless, they exhibit lower classification accuracy for small targets with poor laser reflectivity, such as traffic cones. To harness the advantages of both images and point clouds, several outstanding multimodal 3D object detection models have been proposed, including BEVFormer, 13 BEVFusion, 14 and PMFNet. 15 For instance, Li et al. 16 proposed a Local-to-Global multimodal feature fusion method, while Zhang et al. 17 utilized a 2D auxiliary branch to learn local spatial-aware features from images. However, obtaining multimodal labels, especially 3D annotations on point clouds, is challenging and costly. Training multimodal models solely from 2D image annotations to detect unknown categories therefore remains a significant and meaningful task. To address this challenge, Wei et al. 18 introduced a non-learning technique that uses 2D bounding boxes to segment frustum sub-point clouds, followed by heuristics that compute the most precise 3D bounding box from the segmented points. Liu et al. 19,20 leverage comprehensive image data to tackle the inherent sparsity of 3D point clouds. However, these methods rely excessively on the camera's intrinsic and extrinsic parameters and do not account for the differences between multimodal feature extraction branches. To address these issues, we propose a multi-stage weakly supervised training paradigm that gradually integrates the 2D detection model and the 3D point cloud detection model, enabling the final fused multimodal model to detect unknown categories effectively.
Weakly supervised training of the multimodal model
Overview
Although many of the studies mentioned above do not directly address our problem, they still hold important reference value. To facilitate the construction of a fused multimodal model, we adopt a dual-branch structure similar to BEVFusion. 14 Before combining the image detection model and the 3D point cloud detection model into this dual-branch structure, we separately pre-trained detection models for the two feature extraction branches: CenterNet 21 on images and CenterPoint 10 on 3D point clouds. During the weakly supervised training phase, both branches were initialized with these pre-trained weights. Additionally, when lifting 2D features into 3D space, we employed the same strategy as BEVFusion, obtaining depth estimates on images using LSS (lift, splat, shoot). 22 We then mapped 2D pixel features forward using the camera intrinsic and extrinsic parameters and filled them into a feature tensor in 3D space.
During the fusion stage of the two models, our weakly supervised training consists of three steps:
(1) elevating the features extracted by the 2D image detection model into 3D space. Additionally, guiding the “alignment” of 3D features extracted by two branches through predictions of heatmaps strongly correlated with anchor positions.
(2) gradually merging the features extracted by the two branches, predicting heatmaps for new categories, and updating the parameters of the heatmap prediction module through Exponential Moving Average (EMA) during the training process.
(3) leveraging the shared categories between the pre-trained models of the two branches, applying weakly supervised methods to train the parameters of the 3D bounding box regression module.
Feature alignment
When using anchor-based detectors for feature fusion, the spatial alignment of 2D and 3D features must be considered: 2D image features lack depth information, and noise in the camera's intrinsic and extrinsic parameters affects the projection of image points into 3D space. As a result, there is a significant deviation between the target center positions in 2D and 3D annotations, which harms feature fusion and hinders the network's regression of the 3D bounding box center. In the initial phase of model fusion, we therefore align the different features extracted respectively by CenterPoint 10 and CenterNet. 21
As shown in Figure 1, we devise a training scheme based on predicting heatmaps to align features from a top-down perspective. First, we utilize the pre-trained backbone networks of CenterNet and CenterPoint to extract the corresponding features. Then, we map the features extracted by the 2D image branch into 3D space, obtaining voxel features that share a common spatial layout with the point cloud branch.

Figure 1. Alignment of heterogeneous features.
In the 2D image branch, the extracted features are indexed by 2D pixel positions, and we need to project them into 3D space. Initially, we must establish correspondences between 2D pixel coordinates and points in 3D space. However, because depth information is lost in imaging, this mapping is ill-posed: a single pixel corresponds to an entire ray of possible 3D points. To resolve this ambiguity, semantic information from the single image must be used to infer depth. Several studies exist in this regard; for instance, Laina et al. 23 proposed depth prediction with Fully Convolutional Residual Networks (FCRN), while Wang et al. 24 showed that jointly predicting depth and semantics from a single image provides a more comprehensive understanding of the scene. Additionally, Godard et al. 25 exploited left-right consistency for unsupervised monocular depth estimation.
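To make this ambiguity concrete, the short sketch below (our illustration, not the authors' code; the KITTI-like intrinsics and identity extrinsics are placeholder values) back-projects a single pixel at two candidate depths, landing at two different 3D points — exactly the ambiguity the depth estimator must resolve.

# Minimal sketch: back-projecting a pixel to 3D. Without depth, a pixel
# (u, v) only constrains the point to a ray; a depth estimate d picks one
# point on that ray.
import numpy as np

def pixel_to_world(u, v, d, K, R, t):
    """Lift pixel (u, v) at estimated depth d to world coordinates.

    K: 3x3 camera intrinsics; R, t: camera-to-world rotation/translation.
    """
    ray_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])  # ray direction in camera frame
    p_cam = d * ray_cam                                 # point on the ray at depth d
    return R @ p_cam + t                                # transform to world frame

# The same pixel at two candidate depths lands at two different 3D positions.
K = np.array([[721.5, 0.0, 609.6], [0.0, 721.5, 172.9], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
print(pixel_to_world(500, 180, 10.0, K, R, t))
print(pixel_to_world(500, 180, 30.0, K, R, t))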
It is crucial to strike a balance between depth prediction accuracy and the module's computational efficiency. Moreover, depth must be quantized during the feature transformation, so high-precision depth estimation is not mandatory. Hence, drawing inspiration from Liang et al.'s multimodal fusion detection model BEVFusion, 14 which maps 2D features to 3D space, we adopt the LSS model proposed by Philion and Fidler. 22 This model obtains depth probability maps and maps positions in the camera coordinate system to 3D space through the camera extrinsics and intrinsics, as illustrated in Figure 2(a). Finally, the lifted image features are filled into the feature tensor in 3D space.
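As an illustration of the lift step, the following minimal PyTorch sketch (shapes and names are ours, not from the paper) forms the outer product of per-pixel image features and a predicted categorical depth distribution, producing the frustum of depth-weighted features that is subsequently splatted into the 3D tensor.

# Illustrative sketch of the LSS-style lift step, assuming per-pixel features
# and a predicted categorical depth distribution over D bins.
import torch

def lift_features(feat_2d, depth_logits):
    """feat_2d: (C, H, W) image features; depth_logits: (D, H, W) depth scores.

    Returns a (C, D, H, W) frustum of features weighted by depth probability.
    """
    depth_prob = depth_logits.softmax(dim=0)  # (D, H, W) depth distribution
    # Outer product: every pixel's feature is spread across its D depth bins,
    # weighted by how likely each depth is.
    return feat_2d.unsqueeze(1) * depth_prob.unsqueeze(0)

C, D, H, W = 64, 41, 32, 88
frustum = lift_features(torch.randn(C, H, W), torch.randn(D, H, W))
print(frustum.shape)  # torch.Size([64, 41, 32, 88])
# Each (d, h, w) cell is then mapped to a 3D voxel via the camera intrinsics
# and extrinsics and accumulated into the 3D feature tensor ("splat").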

Figure 2. LSS and FAM module: (a) lifting 2D features to 3D space; (b) the Feature Alignment Module.
To address feature alignment under the bird's-eye view, we follow Lang et al. 26 and, under this perspective, collapse the height dimension by computing the average of the voxel features in each vertical column, obtaining pillar features.
For the design of the FAM layer, we utilize the improved convolutional layer proposed by Liu et al., 27 as illustrated in Figure 2(b): we incorporate the coordinate information of the pillar features into the convolution input so that the module is aware of each pillar's spatial position.
We then take the channel-wise maximum of the aligned features.
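The following is a minimal sketch of such a coordinate-aware alignment layer under our reading of the text (the channel sizes, the height-averaging step, and the normalized coordinate channels are our assumptions):

# Sketch of a CoordConv-style Feature Alignment Module: pillar features are
# obtained by averaging voxel features over height, and normalized (x, y)
# coordinate channels are concatenated before the convolution.
import torch
import torch.nn as nn

class FAM(nn.Module):
    def __init__(self, in_ch=64, out_ch=64):
        super().__init__()
        # +2 input channels for the normalized y/x coordinate maps
        self.conv = nn.Conv2d(in_ch + 2, out_ch, kernel_size=3, padding=1)

    def forward(self, voxel_feat):                 # (B, C, Z, Y, X)
        pillar = voxel_feat.mean(dim=2)            # average over height -> (B, C, Y, X)
        b, _, ny, nx = pillar.shape
        ys = torch.linspace(-1, 1, ny, device=pillar.device)
        xs = torch.linspace(-1, 1, nx, device=pillar.device)
        yy, xx = torch.meshgrid(ys, xs, indexing="ij")
        coords = torch.stack([yy, xx]).expand(b, -1, -1, -1)  # (B, 2, Y, X)
        return self.conv(torch.cat([pillar, coords], dim=1))

out = FAM()(torch.randn(2, 64, 10, 200, 176))
print(out.shape)  # torch.Size([2, 64, 200, 176])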
In the first stage of training, we freeze all parameters except the FAM module and the heatmap prediction module K. We train these parameters by comparing the predicted heatmaps against the corresponding reference heatmaps.

Figure 3. Alignment results on the KITTI dataset.
Generating heatmaps for new classes
After the preceding steps, we obtain aligned features from the two branches.
During the second training stage, we solely focus on predicting heatmaps for all categories. This is because the regression of target bounding boxes is inherently more complex than the task of heatmap prediction and relies on anchor points generated through heatmaps. To enhance the accuracy of anchor point prediction while avoiding the introduction of excessive noise from the regression module, we exclusively perform category heatmap prediction in this stage.
Even though the features of the two branches are now spatially aligned, the fused model must still learn to predict heatmaps for the categories seen only by the image branch. As shown in Figure 4, the features obtained from the two branches are fused and passed to the heatmap prediction module.

Figure 4. Training heatmaps for new categories.
After passing through a fully convolutional module K1, the fused features yield heatmap predictions for all categories, including the new ones.
When learning the multimodal feature fusion module K1 through the EMA strategy, the module K obtained in the previous stage is used to predict a reference heatmap from the aligned image features.
Furthermore, throughout the entire training process, the parameters of the K module obtained in the first stage are frozen. For training the feature extraction backbones, we adopt the approach described in previous work, 30,31 where, after training reaches a certain stage, the backbones are gradually unfrozen and trained with a smaller learning rate. This avoids instability in the training of K1 caused by early adjustments to the backbone parameters, while still allowing the backbones to be fine-tuned to assist K1 once the final loss has converged. During the contrastive training process, since heatmaps serve as the supervision labels, we employ focal loss 32 to mitigate the imbalance between positive and negative samples. When regressing the final 3D box, we use smooth L1 loss. 33 To balance the regression and classification losses, we weight the classification prediction loss by 0.2, which stabilizes training in the final stage.
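A minimal sketch of the EMA update and the loss weighting as we understand them (the mean-teacher-style update rule is our reading of the paper's EMA strategy; beta is the smoothing coefficient studied in the ablation below, and its value here is a placeholder):

# Sketch of the EMA progressive update and the weighted total loss.
import torch

@torch.no_grad()
def ema_update(student, teacher, beta):
    """Teacher parameters follow the student as an exponential moving average."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(beta).add_(p_s, alpha=1.0 - beta)
    # e.g. call ema_update(k1_online, k1_ema, beta=0.99) after each iteration

def total_loss(box_loss, cls_loss, cls_weight=0.2):
    # Classification (heatmap/focal) loss is down-weighted for stability
    # in the final stage, per the text above.
    return box_loss + cls_weight * cls_loss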
Training of the bounding box regression branch
The backbone network we employ for extracting features from 3D point clouds shares a similar structure with detection models such as PointPillars 26 and VoxelNet. 9 The difference lies in the detection head: CenterPoint regresses target positions and bounding boxes at anchor points. The earlier training stages yield fairly accurate heatmaps, which determine the anchor point positions, so in the final training step we only need to regress the corresponding 3D boxes at those heatmap positions.
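The decoding step can be sketched as follows (a CenterNet/CenterPoint-style local-maximum test; the threshold and tensor shapes are illustrative, assuming the heatmap already holds per-class scores):

# Minimal sketch: heatmap peaks give the anchor positions, and the
# regression output is simply read out at those positions.
import torch
import torch.nn.functional as F

def decode(heatmap, reg, score_thresh=0.3):
    """heatmap: (num_classes, H, W) scores; reg: (box_dims, H, W)."""
    # A cell is a peak if it survives 3x3 max-pooling (local-maximum test).
    pooled = F.max_pool2d(heatmap[None], 3, stride=1, padding=1)[0]
    peaks = (heatmap == pooled) & (heatmap > score_thresh)
    cls_idx, ys, xs = peaks.nonzero(as_tuple=True)
    boxes = reg[:, ys, xs].T              # (num_peaks, box_dims) read at peak cells
    scores = heatmap[cls_idx, ys, xs]
    return cls_idx, boxes, scores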
As shown in Figure 5, at this stage we directly fuse the aligned feature maps of the two branches and feed them into the regression module.

Figure 5. Training the 3D box regression module. The K1 module was trained in the previous stage; the K2 module can be designed in two different ways.
Generally, CenterPoint divides the prediction of the heatmap and the regression of the 3D box into two parallel branches, and many studies have noted that this design is beneficial for detection accuracy. 32,34,35 However, during the third stage of training, after comparing the detection results of the two schemes, we found that for small targets, such as distant pedestrians, the parallel mode of Scheme A yielded lower detection performance than Scheme B. We analyze the reasons for this phenomenon in the ablation experiments, and partial comparative results of the two schemes are presented in the Supplemental Materials.
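For concreteness, the two head designs can be sketched as follows (our schematic with illustrative channel sizes; the paper does not give the exact layer configuration):

# Scheme A: separate parallel heads. Scheme B: one shared head whose output
# is split into heatmap and box channels (the "merge" option).
import torch.nn as nn

def scheme_a(in_ch, num_classes, box_dims):
    heatmap_head = nn.Conv2d(in_ch, num_classes, 1)   # parallel branch 1
    box_head = nn.Conv2d(in_ch, box_dims, 1)          # parallel branch 2
    return heatmap_head, box_head

def scheme_b(in_ch, num_classes, box_dims):
    # A shared trunk; the output channels are later split into
    # num_classes heatmap channels and box_dims regression channels.
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, num_classes + box_dims, 1),
    )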
Experiments
Experiments on the KITTI dataset
We assess the performance of our model on the KITTI 36 3D object detection benchmark, which comprises 7481 training images/point clouds and 7518 test images/point clouds and covers three categories: Car, Pedestrian, and Cyclist. Detection results for each class are evaluated at three difficulty levels (easy, moderate, and hard), determined by factors such as object size, occlusion state, and truncation level. Following the experimental setup of CenterNet3D, 37 we trained a 3D detection model specifically for the car category and achieved results on the validation set similar to those reported in that paper. Additionally, we trained CenterNet on the COCO dataset 38 with the following modifications to the category labels: (1) selecting data for the three categories of pedestrians, cars, and bicycles, and (2) merging bicycles and motorcycles into a single bicycle category. Since the KITTI dataset contains these three categories, we use the KITTI validation set to validate the effectiveness of our training.
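A hypothetical preprocessing sketch for these label changes (COCO 2017 category ids: person=1, bicycle=2, car=3, motorcycle=4; the file names are placeholders):

# Keep person/car/bicycle/motorcycle annotations and fold motorcycle
# into bicycle before training CenterNet.
import json

KEEP = {1, 2, 3, 4}
REMAP = {4: 2}  # motorcycle -> bicycle

with open("instances_train2017.json") as f:
    coco = json.load(f)

coco["annotations"] = [
    {**a, "category_id": REMAP.get(a["category_id"], a["category_id"])}
    for a in coco["annotations"] if a["category_id"] in KEEP
]

with open("instances_train2017_merged.json", "w") as f:
    json.dump(coco, f)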
We used the 2017 release of the COCO dataset, which consists of 118,000 training images and 5000 validation images. For training CenterNet, we chose ResNet-50 as the backbone network to balance speed and accuracy, and we employed multi-scale augmentation so that the 2D object detection model can accurately localize objects of different scales.
We employed Adam as the optimizer and adopted common image augmentation techniques, including random flipping, random scale resizing, and cropping. Training ran for 140 epochs on 8 NVIDIA RTX 3090 Ti GPUs, with an initial learning rate of 5e-4 decayed by a factor of 0.1 at the 90th and 120th epochs. We set the batch size to 128 and initialized with ResNet-50 weights pretrained on ImageNet. Our results on the COCO validation set are shown in Table 1.
Table 1. The performance of CenterNet on the validation set of the COCO dataset.
The prediction accuracy of CenterNet when requiring the intersection over union (IoU) between the predicted 2D bounding box and the ground truth bounding box to be greater than 0.5.
The prediction accuracy of CenterNet when requiring the intersection over union (IoU) between the predicted 2D bounding box and the ground truth bounding box to be greater than 0.7.
Following the experimental settings outlined in reference 37, we evaluated the detection accuracy of cars on the validation set using an intersection over union (IoU) threshold of 0.7. The results are presented in Table 2.
Table 2. The accuracy of CenterPoint on the validation set of KITTI.
On the KITTI dataset, the three difficulty levels (Easy, Moderate, and Hard) are defined by object size, occlusion level, and truncation ratio.
After pre-training both the 2D image detection model and the 3D point cloud detection model, we move on to the stage of model fusion and training.
Training the fused model is divided into three parts:
PART I:
Drawing inspiration from the lift step for 2D-to-3D feature transformation in BEVFusion, 14 we utilize the LSS model 22 to generate depth probability maps for the 2D images. As mentioned earlier, using the camera intrinsic and extrinsic parameters, we fill the lifted image features into the 3D feature tensors.
PART II:
In the prediction stage of the heatmaps for new categories, as shown in Figure 4, we fuse the feature maps of the two branches and predict heatmaps for all categories.

Figure 6. The curve of loss variation. The left plot depicts the loss during the first training stage of the FAM (Feature Alignment Module) and the heatmap prediction module K, while the right plot shows the loss across epochs during the second training stage.
We train on the KITTI dataset with a batch size of 8 for a total of 80 epochs. For the first 60 epochs, we freeze all layers except K1; in the last 20 epochs, we also update the parameters of the feature extraction branches with a very small learning rate. We use AdamW as the optimizer with an initial learning rate of 2e-6. The learning rate follows a step decay strategy with a decay rate of 0.9, applied at epochs 20, 30, 50, and 70. We set the exponential moving average (EMA) smoothing coefficient according to the ablation study described below.
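The schedule can be sketched as follows (the placeholder modules, the training-step stub, and the 2e-7 backbone learning rate are our assumptions; the paper only specifies "a very small learning rate"):

# Stage-two schedule: AdamW at 2e-6, step decay of 0.9 at epochs 20/30/50/70,
# backbones frozen for the first 60 of 80 epochs.
import torch
import torch.nn as nn

k1 = nn.Conv2d(64, 3, 1)                                   # placeholder for module K1
img_backbone, pc_backbone = nn.Conv2d(3, 64, 3), nn.Conv2d(4, 64, 3)  # placeholders
for m in (img_backbone, pc_backbone):
    for p in m.parameters():
        p.requires_grad = False                            # frozen at first

opt = torch.optim.AdamW(k1.parameters(), lr=2e-6)
sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[20, 30, 50, 70], gamma=0.9)

for epoch in range(80):
    if epoch == 60:  # unfreeze the feature extractors late in training
        for m in (img_backbone, pc_backbone):
            for p in m.parameters():
                p.requires_grad = True
            opt.add_param_group({"params": m.parameters(), "lr": 2e-7})
    # ... one training epoch over KITTI with batch size 8 goes here ...
    sched.step()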
PART III:
When retraining the regression branch of 3D bounding boxes, we face the challenge of lacking effective supervision for regressing 3D boxes of two classes (pedestrians and cyclists). To enhance the accuracy of bounding box regression, we use supervised learning with labels from the shared categories on one hand; on the other hand, we adopt Scheme B as illustrated in Figure 5, where the K2 module also predicts category heatmaps. By comparing the category heatmaps generated by K1 (obtained from the previous training) and K2 and minimizing the error between them, we train the parameters of the K2 module.
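Our reading of this objective, as a sketch (the MSE consistency term and its weight lambda_consist are assumptions; the paper specifies smooth L1 for box regression but not the exact form of the heatmap comparison):

# Stage-three objective: supervised box regression on shared categories plus
# a consistency term pulling K2's heatmaps toward the frozen K1's.
import torch.nn.functional as F

def stage3_loss(pred_boxes, gt_boxes, k2_heatmap, k1_heatmap, lambda_consist=1.0):
    box_loss = F.smooth_l1_loss(pred_boxes, gt_boxes)      # shared-category labels
    consist = F.mse_loss(k2_heatmap, k1_heatmap.detach())  # K1 is the frozen reference
    return box_loss + lambda_consist * consist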
In this stage, we conducted a total of 20 epochs of training. During the first 10 epochs, the learning rate was set to 1e-4, and the feature extraction branches for 2D images and 3D point clouds were frozen, only training the K2 module separately. Additionally, we used the output results of the K1 module as a reference for comparative training of K2. Simultaneously, by supervised training across different categories, we enhanced the ability of module K2 for 3D target regression. In the final 10 epochs, the K2 module was frozen, and the learning rate was reduced to 1e-5 to fine-tune the feature extraction branches of images and point clouds for optimizing the regression results. We selected AdamW as the optimizer with a weight decay of 0.03. Evaluating our detection performance on the entire KITTI validation set, as shown in Table 3, our detection accuracy ultimately surpassed that of the directly trained CenterNet3D model. Particularly noteworthy is that for the pedestrian and cyclist categories, due to the model’s ability to utilize texture information from images, the detection performance for small objects was better than that of the original single-modal 3D point cloud detection model. As shown in Figure 7, we provide some typical detection result comparisons.
Table 3. Accuracy comparison on the validation set of KITTI.
Originally, the IoU threshold for calculating detection accuracy for the Car category was set to greater than 0.7. The result in this row is computed with the threshold set to 0.5.

Figure 7. Comparison of prediction results on KITTI. Green boxes represent detections from our pre-trained CenterPoint model, blue boxes represent detections from the fusion model, and red boxes represent the ground truth.
Comparison with SOTA method
To compare with the state-of-the-art (SOTA) methods in this field, we designed a new experiment. In the KITTI dataset, we treated "Pedestrian" and "Cyclist" as cross-category classes for the two single-modal models. During the training of our 2D image detection model, we provided only 2D bounding box annotations for the category "Car." The 2D image detection model was pretrained on the COCO dataset, while the 3D detection model, CenterPoint, was trained only on samples annotated with the categories "Pedestrian" and "Cyclist." Training parameters were consistent with the previous experiment. We compared against FGR, 18 MTrans, 19 and MAP-Gen 20 on the KITTI validation set, as shown in Table 4. During FGR training, 2D annotations and the corresponding sparse point clouds of the Car category are first used to obtain pseudo-labels through 3D bounding box estimation, and these pseudo-labels then train the 3D point cloud detection model. MTrans and MAP-Gen first train an autolabeler on a subset of 3D-annotated Car samples; this autolabeler re-annotates all Car samples, and the new labels are used to train the 3D point cloud detection model. Comparing the 3D point cloud models trained with these different methods, our approach achieves higher detection accuracy in the "Car" category than the SOTA methods.
Table 4. Detection performance of the Car category on the KITTI validation set.
MTrans needs to be pre-trained on frames that already carry 3D box annotations of cars. "f" stands for frames; for example, "125f" indicates that the model requires 125 frames of 3D "Car" annotations for pre-training. A blank entry indicates that 3D annotations for the Car category were not used during training.
Our model employs different backbones to extract features in the 2D image detection branch.
Ablation experiments
Ablation experiment on the step size η
As shown in Figure 1, aligning the features of the two branches depends on the step size η used in the alignment stage.
When comparing the range limits, we examined symmetric ranges for η.

Figure 8. The chart in the top left corner compares the variation of loss during training across different step sizes within the examined range.
Therefore, a larger range for
Ablation experiments on the smoothing coefficient β
After fixing the fusion parameter η, we compared detection performance under different values of the smoothing coefficient β.
Comparing the detection accuracy between combined prediction heads and separate prediction heads
To compare the difference between using separate prediction heads and combined prediction heads in the third stage of fusion training, we trained models under both schemes. As depicted in Figure 8, we comprehensively compared the stability of loss and the average accuracy on the KITTI validation set’s Moderate difficulty level every two epochs. Finally, Table 5 presents a comparison of our results on the KITTI validation set. According to the comparison, we found that using combined prediction heads leads to higher accuracy.
Table 5. Accuracy comparison of different prediction heads on KITTI.
As shown in Figure 5, "split" means that under Scheme A the heatmap prediction and bounding box regression are handled by separate heads, while "merge" means that under Scheme B both the heatmap and the bounding box are predicted within the same network head.
Conclusion
In autonomous driving, 3D object localization serves as a critical means of environmental perception. Utilizing images for 3D object detection in complex scenarios often results in inaccurate target positioning due to the lack of depth information. Combining texture information from images with positional data from 3D point clouds is a preferred approach in many multimodal 3D detection models. However, compared to image-based 2D annotations, point cloud-based 3D annotations are more complex and costly. To leverage readily available 2D image annotations to guide 3D detection models in detecting new classes and ultimately detecting 3D objects using fused information, we propose a weakly supervised training approach for multimodal models. This method comprises the following steps: (1) predicting positional heatmaps, aligning features extracted from 2D images and 3D point clouds; (2) adding new classes and retraining the heatmaps; (3) retraining the regression heads through cross-category training. Ultimately, we successfully fuse the 2D image detection model with the 3D point cloud detection model, resulting in a novel detection model that surpasses the performance of using the 3D point cloud detection model alone.
Our proposed method guides multimodal models to achieve 3D object detection on unknown categories through 2D annotations alone. However, the method relies on heatmaps as the intermediary for weakly supervised training, which restricts its applicability. In the future, we will extend this training method to query-based 3D object detection frameworks.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Data availability statement
All relevant data are within the paper.
Supplemental material
Supplemental material for this article is available online.
References
