Sage Journals: Discover world-class research

Abstract

Since performing simultaneous localization and mapping in dynamic environments is a challenging problem, conventional approaches have used preprocessing to detect and then remove movable objects from images. However, those methods create many holes in the places, where the movable objects are located, reducing the reliability of the estimated pose. In this paper, we propose a model with detailed classification criteria for moving objects and point cloud restoration to handle hole generation and pose errors. Our model includes a moving object segmentation network and an inpainting network with a light detection and ranging sensor. By providing residual images to the segmentation network, the model can classify idle and moving objects. Moreover, we propose a smoothness loss to ensure that the inpainting result of the model naturally connects to the existing background. Our proposed model uses the movable object’s information in an idle state and the inpainted background to accurately estimate the sensor’s pose. To use a ground truth dataset for inpainting, we created a new dataset using the CARLA simulation environment. We use our virtual datasets and the KITTI dataset to verify our model’s performance. In a dynamic environment, our proposed model demonstrates a notable enhancement of approximately 24.7% in pose estimation performance compared to the previous method.

Keywords

Background inpainting dynamic environment moving object segmentation point cloud restoration

Introduction

Light detection and ranging (LiDAR)-derived point cloud data is extensively utilized in robots for simultaneous localization and mapping (SLAM). The method of creating maps and estimating locations through point cloud feature matching, as presented in Zhang and Singh¹ and Shan and Englot,² has demonstrated excellent performance and rapid computation. However, the majority of SLAM algorithms have been developed for static environments. Using extracted features from moving objects for feature matching in dynamic environments can result in pose estimation errors. To address this challenge, a technique was introduced in Sun et al.³ and Rashed et al.⁴ to identify and remove moving objects’ corresponding features. To cope with dynamic environments, additional sensors such as radar and camera^5,6 can be used to materialize information, or the object’s characteristics can be learned through deep learning.⁷ As the number of moving objects increases, removing a larger number of points can lead to the generation of multiple holes. These holes can subsequently result in pose estimation errors.

SAM-Net⁸ was initially introduced to address the issue of errors caused by holes. It is a segmentation model that uses a binary mask to isolate movable objects such as vehicles and people, and then employs an inpainting network to restore the masked regions. However, completely removing movable objects through the segmentation process may lead to the loss of valuable information required for feature matching. Moreover, the restored information may not reflect the characteristics of the point cloud, resulting in distortions.^9,10

To maximize the number of features utilized for achieving the objective illustrated in Figure 1, we propose an inpainting learning model that focuses on classifying only moving objects, along with a loss function that reflects the point cloud’s characteristics. The proposed model requires a residual image that is obtained by utilizing a continuous point cloud, enabling the classification of moving objects by learning object motion information between continuous point clouds. Additionally, we emphasize the importance of maintaining consistency between the proposed model’s results and the actual background to generate valid feature points. To achieve this, we introduce a loss function that utilizes one of the point cloud’s features, that is, smoothness.^1,2 By utilizing the association between multiple points, the proposed loss function generates points that are similar to their surroundings, which results in inpainting and feature points that are more similar to the ground truth compared to the previous model.

Figure 1.

Purpose of this paper. After determining and removing moving objects from the point cloud, the proposed model restores the hidden points through the inpainting process.

To train the proposed model, we created a new dataset using CARLA simulation because the existing dataset lacked ground truth for the regions obscured by moving objects, despite containing labels for the moving objects themselves. The moving objects were classified using CARLA’s semantic sensors, and a moving object label was assigned based on their speed.

The proposed model has the following contributions:

Our proposed inpainting model uses residual images to generate labels for moving objects, thereby reducing the loss of points and errors in pose estimation caused by the identification of movable objects.

The smoothness loss function relies on planar and edge feature points obtained through point association to produce results that seamlessly integrate the static background with the inpainted regions.

To improve the inpainting model’s performance and overcome the limitations of artificial data generation, we created a realistic dataset using the CARLA simulator. The model learns from various situations in this dataset, which aids in accurately classifying moving objects.

The remainder of this paper is organized as follows. Section “Related works” describes the use of LiDAR sensors for feature and SLAM, including an explanation of the reason for SLAM performance degradation in a dynamic environment and the proposed method for solving it. In addition, the prior inpainting model is described and a method to improve it is proposed. Section “Proposed model” Section explains the data generation methods, input data changes, and new loss functions of the proposed model. In Section “Experiments,” Section the segmentation, inpainting, and SLAM performance of the proposed model and the prior model are compared using the generated CARLA dataset. The suitability of the proposed model is then verified not only in the simulator, but also in the real environment using the KITTI¹¹ benchmark dataset. Finally, Section “Conclusion” Section provides the conclusion and outlines the future directions of this work.

Related works

Feature extraction with LiDAR

A LiDAR sensor retrieves three-dimensional (3D) points indicating the position of point clouds on surfaces detected by the sensor. This output can be utilized for feature extraction of important surrounding objects. However, complete processing of LiDAR sensor data can be costly. To enhance processing speed, research has focused on selecting feature points to represent point clouds, rather than utilizing all points. For instance, the normal aligned radial feature¹² can reliably extract normal vectors in environments with distinct changes between point clouds, while the fast point feature histogram¹³ employs multiple histograms to expedite calculations. However, these methods are primarily applicable in environments with sensors that yield dense point clouds, such as red–green–blue-D (RGB-D), and may pose challenges when applied to external environments using LiDAR. Recently, a prominent approach for feature extraction involves the utilization of the PointPillars^14,15 model. This method, however, has limitations, including difficulties in effectively handling multiple nearby objects and reduced reliability in regions with sparse features. To address this challenge, a method¹ was employed to extract surfaces and edges representing point clouds by expressing point associations. In this paper, we incorporate the corresponding feature points into a loss function to ensure continuity between the existing background and the inpainting result.

LiDAR implementation in SLAM

Generating feature maps from LiDAR output is challenging due to its limitations. To produce precise LiDAR maps, traditional methods often involve applying filters.^16,17 However, these methods are restricted to two-dimensional (2D) points and are inefficient for generating 3D maps. To address these challenges, several studies have explored leveraging feature or semantic information to enhance SLAM performance while reducing computational costs. For instance, LiDAR odometry and mapping (LOAM)¹ extracts surfaces and edges using the smoothness feature of points and creates a map in real-time through feature matching. Its extension, LeGO-LOAM,² improves SLAM performance by removing redundant ground point information and optimizing with the Levenberg–Marquardt algorithm. However, in complex environments, such as urban areas, SLAM performance deteriorates due to the assumption of static surroundings.

Recent studies^3,18,19 have proposed approaches to remove dynamic objects and improve SLAM performance in dynamic environments. For instance, SUMA++²⁰ uses deep learning to label objects and generates surfel-based maps from continuous scans. By identifying changes in the surfels of dynamic objects in the generated map, moving objects can be removed. However, this method may create holes where dynamic objects were removed, leading to errors in pose estimation.^9,10,8

Inpainting method

In recent years, deep learning-based inpainting methods using generative adversarial networks (GANs) have been proposed for filling holes in images.^21–23 In the context of SLAM, an inpainting model based on image features has also been introduced.^24,25 This model utilizes an Oriented FAST and Rotated BRIEF-based loss to minimize feature distortion in the inpainted results, resulting in improved performance in place recognition and visual odometry. Several SLAM algorithms exist that incorporate inpainting results, similar to the approach described in Ai et al.²⁶ and Bescos et al.²⁷

With the inpainting model, LiDAR and RGB images can be utilized to enhance the point cloud density via deep learning.^28–30 However, rather than creating points directly in obscured areas, this method involves filling in empty spaces due to the low density of the point cloud. In the field of LiDAR superresolution, a learning-based model^31–33 is employed to transform low-resolution LiDAR point cloud data into high-resolution data. This approach aims to increase the resolution by directly generating points, prioritizing resolution enhancement rather than filling in occluded areas. SAM-Net⁸ has introduced a novel approach to restoring parts covered by objects using LiDAR point clouds through learning. Nonetheless, this model has some limitations as it removes points without taking the object’s motion into account and the restoration process may not fully reflect the point cloud’s characteristics.

Proposed model

LiDAR point preprocessing

In order to utilize 3D point cloud data in deep learning applications, it is necessary to convert it into 2D range images. This conversion not only simplifies the data interpretation process, but also enables the use of various 2D convolutional neural networks designed for image processing, which can yield valuable insights. Range images are a popular method used in LiDAR-based learning models,^7,34 which we will briefly discuss in this paper.

To define a single LiDAR point $p$ in a Cartesian coordinate system as $p = {x, y, z}$ and its spherical coordinate in a range image as $(u, v)$ , we use a spherical coordinate conversion $Π : R^{3} \to R^{2}$ in Milioto et al.:¹⁸

The range image has vertical and horizontal dimensions denoted by $h$ and $w$ , respectively. The LiDAR sensor can measure a vertical field of view (FOV) given by $f = f_{up} + f_{down}$ , where $f_{up}$ and $f_{down}$ represent the upward FOV and downward FOV, respectively. The distance of each LiDAR point is represented by $r = \sqrt{x^{2} + y^{2} + z^{2}}$ . By matching the depth $r$ of each point to its corresponding $(u, v)$ index, 3D data can be converted into 2D images.

The previous network only relied on this method as input throughout the model’s process. However, it is challenging to identify crucial feature points for pose estimation, especially those associated with stationary vehicles, when removing all movable objects. Furthermore, using only the gradient loss function proved ineffective in restoring feature points in range images, as 3D point clouds were compressed into 2D depth images. Therefore, our proposed model employs a residual image as additional input, denoted as $\tilde{m}$ . The residual image contains motion information between two frames, allowing us to identify moving objects during training. The use of a range image sequence enables the generation of residual images, which can then be utilized as input.

Residual images are created to capture the differences in distances between two point clouds in 2D, facilitating the network’s ability to learn the motion of objects by using only LiDAR data. To generate a residual image, the point cloud $P_{j}$ from the previous frame $j$ is first transformed into the coordinate system of the current frame $i$ , resulting in a transformed point cloud $P_{j \to i}$ . A spherical coordinate transformation is then applied to produce a range image where each point’s depth is suitable for its corresponding index. Finally, by subtracting the depth of each range image index of the current frame $i$ from the converted previous frame $j$ , a residual image is obtained.

The residual information of a single LiDAR point $p_{k}$ in the current frame $i$ is defined as $d_{j \to i, k}$ and can be calculated using the following formula:

d_{j \to i, k} = {\begin{matrix} \frac{| r_{k} - r_{k}^{j \to i} |}{r_{k}} & k \in valid points \\ 0 & otherwise \end{matrix}

(2)

The pixel value

(u_{k}, v_{k})

p_{k}

in the current frame

i

is denoted by

r_{k}

d_{j \to i, k}

, which has the same index as

p_{k}

, represents the absolute value of the normalized depth difference between the two range images. This paper generates a residual image using the difference between the current frame

i

and the previous frames

i - 1

i - 3

. The input data of the seven-channel is composed of the three residual images,

x, y, z,

and range images.

Model

In this section, we present the proposed model, which is an extension of a previous network⁸ with input data and loss function modification. Our model aims to distinguish and remove moving objects using segmentation while restoring the background using inpainting. To achieve this, we employ a single encoder and two decoder modules to detect moving objects and generate new points through inpainting. Figure 2 illustrates the proposed model’s framework and output.

Figure 2.

Framework of the proposed model. The proposed model learns the movement of moving objects through inputs including residual images and generates a binary mask using a segmentation module. The results of the encoder and segmentation module are used to remove moving objects and restore the points of the hidden static part through the inpainting module.

The encoder network comprises five convolution layers. The first convolution layer extensively scans a large area to increase receptive fields. The following convolution layers generate a feature map containing the characteristics of moving objects from the residual images as the input. This whole process of the encoder modules produces a feature map with 128 depth channels, serving as the input for the segmentation and inpainting network. Additionally, each convolution layer that generates a feature map is connected to the inpainting network as a skip connection for reconstructing the object-segmented area.

The segmentation process resembles the precursor model, which acts as a decoder of the model. The model consists of four upsample groups and a single refine group. Each upsample group contains two convolutional layers followed by a nearest neighbor upsampling layer. The single refine group consists of two convolutional layers, serving as the refinement process for segmentation. It is worth noting that every convolutional layer in the segmentation utilizes LeakyReLU as the activation function, as the range of images retrieved ranges from $-$ 1 to 1, and this action helps prevent interruption of the training process of the inpainting network. The role of the segmentation process is to provide background information, which is extensively used by the inpainting network. Therefore, simultaneous training of the segmentation and inpainting networks is crucial.

After the segmentation process, the output will result in a binary mask containing background information (as shown by the black region in Figure 2). The inpainting network will use this as one of the inputs along with the result from the encoder. Specifically, the inpainting network consists of three fusion blocks. Each fusion block receives inputs from three sources: the feature map from each encoder stage, the binary mask from segmentation, and the previous fusion block. This process enables reconstruction of the missing regions.⁸

By maintaining these modules, the proposed approach effectively reduces pose estimation errors and improves the overall accuracy of the system. Finally, to restore feature points, we use a loss function that reflects the correlation of distances between multiple points, as described in section “Smoothness loss $L_{s}$ .” This loss function generates an inpainted image $\tilde{I}$ that smoothly connects the background and newly generated points.

Smoothness loss $L_{s}$

Smoothness feature points are crucial in 3D LiDAR SLAM for accurate pose estimation by identifying edges and planes of corresponding points. These feature points have demonstrated their effectiveness in LOAM, LeGO-LOAM, and other studies. In this section, we explain how to obtain smoothness and how to apply a new loss function to these feature points.

After converting LiDAR points to a range image, we can define the query point $p_{i}$ as the point of interest to determine the smoothness $c$ of a bundle $S$ containing $N$ consecutive points in the horizontal direction, which corresponds to the row direction of the range image. The distance value of the query point is represented by $r_{i}$ , and the distances of other points included in $S$ are represented by $r_{j}$ . Smoothness can be calculated using the following formula:

c = \frac{1}{| S | \cdot | r_{i} |} \cdot ‖ \sum_{j \in S, j \neq i}^{(} r_{j} - r_{i}) ‖

(3)

If the value of

c

is greater than a certain threshold value, it is classified as edge points, and if it is smaller than the threshold value, it is classified as planar points. The threshold for smoothness is determined by considering the characteristics of the point cloud environment and employing empirical settings. The odometry of LiDAR can be estimated through the matching of edges and planar points obtained from two frames.

To formulate (3) as a loss function, as shown in Figure 3, we employ max pooling and convolutional layers. However, in the case of LiDAR points, some pixels in the range image lack depth information due to the low density of the sensor data. To address this issue, the approach of obtaining the maximum depth value using max pooling is applied and the range image is compressed to exclude these pixels. Following that, a novel kernel inspired by Bescos et al.²⁵ is utilized to calculate the smoothness of all points in the row direction. Finally, the comparison of smoothness generated from the inpainted image $\tilde{I}$ and the ground truth image $I$ is performed using the L1 loss ( $L_{1}$ ).

L_{s} = L_{1} (F (\tilde{I}), F (I))

(4)

Figure 3.

Structure of smoothness loss. By max pooling, $I$ and $\tilde{I}$ are compressed to avoid using empty pixels. After calculating the smoothness through the convolution layer and the custom kernel, the result of the loss function is obtained using $L_{1}$ . The gray of the kernel is $-$ 1/9, and the black is 8/9.

The use of max pooling can result in data compression and loss. To mitigate this issue, a $2 \times 2$ filter was used for max pooling, and invalid pixels were ignored while minimizing the reduction in information. Next, a convolutional layer kernel with $1 \times | S |$ was created. In this study, we set the size of the bundle $S$ to 9, which is a commonly used value in smoothness-based SLAM.^1,2 The bundle consists of the query point along with four points positioned on each side of the query point. Smoothness information is extracted using the provided kernel, which is then utilized to generate a loss function. This loss function helps in generating inpainted results that effectively capture the unique characteristics of the points and seamlessly blend with the surrounding background.

Total loss

To perform inpainting of the region where the deleted moving objects are located, GANs are employed. The generator and discriminator are trained using the same hinge loss, as in SAM-Net, which is a model that utilizes the structural information of images based on structural similarity index measure.³⁵ The goal of SAM-Net is to remove static objects and restore empty regions in images. The training process involves utilizing the hinge loss, denoted as $L_{G}$ for the generator and $L_{D}$ for the discriminator. Since the generator performs both segmentation and inpainting concurrently, a separate loss function is employed to train each decoder. The segmentation loss function $L_{m}$ is computed by taking the L1 difference between the ground truth $m$ that classifies the moving objects and the $\tilde{m}$ generated through the segmentation decoder.

L_{m} = L_{1} (\tilde{m}, m)

(5)

Furthermore, to improve the performance of the inpainting decoder, depth information is incorporated into the loss function through

L_{d}

. This loss function is calculated by computing the L1 difference between the generated depth information,

\tilde{d}

, and the depth information

d

of the ground truth image

I

L_{d} = L_{1} (\tilde{d}, d)

(6)

The total loss function is the sum of all individual loss functions, as shown in the following equation:

L_{total} = L_{G} + λ_{m} \cdot L_{m} + λ_{s} \cdot L_{s} + L_{d}

(7)

Weights for each loss function are represented by

λ_{m}

and

λ_{s}

. In the proposed model, learning the moving object mask

\tilde{m}

is critical in the early stages of training. In the later stages, the focus is on the inpainted image

\tilde{I}

created using

\tilde{m}

. Therefore, we gradually shift the importance from

\tilde{m}

\tilde{I}

during training by multiplying hyperparameters

λ_{m}

and

λ_{s}

by the experimentally determined optimal values of 0.998 and 1.03, respectively, every epoch.

Generating the data

The data required to specify a dynamic object and proceed with inpainting includes ground truth data without dynamic objects, raw LiDAR data with dynamic objects, and labels that distinguish dynamic objects. However, existing datasets such as KITTI and Oxford Car³⁶ do not provide semantic information for dynamic objects, and it is challenging to use them to train an inpainting model since obtaining ground truth data for the parts occluded by dynamic objects is difficult. Furthermore, Semantic KITTI³⁷ does not meet our requirements as it provides labels for dynamic objects but does not supply static LiDAR point information for the occluded regions. To tackle this issue, a previous study⁸ proposed a dataset generation method using KITTI. However, it is challenging to create a dataset that closely resembles an actual driving environment because this method relies on an artificial arrangement of the object’s point cloud in an empty space.

To overcome the limitations of previous datasets and data generation methods, and to train the proposed model to handle moving objects and the occluded parts caused by them, we present a new dataset created using CARLA.³⁸ Initially, a raw dataset, including moving objects such as various vehicles and pedestrians, is generated through LiDAR and LiDAR semantics. A point cloud can be obtained through LiDAR, and the ground truth labeling that classifies moving objects can be obtained using LiDAR semantics. Then, to obtain the ground truth data for the parts covered by the dynamic object, a LiDAR point cloud is created through simulation, excluding the dynamic object in the same environment. In two separate simulations, the location of the vehicle with LiDAR attached is recorded in frame units. By comparing the poses recorded and matching the frames with the same pose, the input and ground truth data required for learning can be generated. The dataset was created by driving in two different environments. The LiDAR settings³⁹ were the same as those used in KITTI, the most commonly used dataset for evaluating the performance of algorithms using LiDAR. This allowed us to generate data, including natural situations in the real world, such as overlapping or overtaking vehicles, as shown in Figure 4, which influenced the classification of dynamic objects by learning models.

Figure 4.

Dataset created through CARLA. Acquire a natural point cloud as a dataset for inpainting. (a) Overlapping objects move with each other according to frame conversion. (b) The object overtakes, changing its speed according to the environment.

Experiments

Setting

To verify the performance of the proposed model, we check the performance of segmentation, inpainting, and location estimation. Segmentation and inpainting performance are compared with SAM-Net and pose estimation performance is compared with LeGO-LOAM and the method adding LeGO-LOAM and the proposed model. Finally, to evaluate the performance of the proposed model in the real environment, the performance of location estimation is verified using KITTI’s road data.

The dataset was created using CARLA, simulating a scenario in which a hundred vehicles are moving in Town 1 and Town 4. This implementation was chosen due to the diversity of dynamic object appearances provided by this dataset. The dataset has the capability to produce random natural behavior and movement for vehicles and pedestrians. For instance, vehicles are programmed to follow lanes, respect traffic lights, adhere to speed limits, make decisions at intersections, and avoid pedestrians. Meanwhile, pedestrians are set to wander around the town map, along sidewalks, marked road crossings, and avoid each other and vehicles. Based on these scenes, LiDAR and LiDAR semantic data were used to generate the dataset, producing range images and residual images. The range images are reshaped into $64 \times 512$ with four channels (i.e., $x, y, z$ , and depth). The residual images are generated with three additional channels. Consequently, after concatenation, the input consists of seven channels. The output comprises two sources: binary moving object masks from the segmentation decoder and their extension, the inpainted background information (i.e. $x, y, z$ ). The training was performed 100 times using Pytorch on an NVIDIA RTX 3090, with the Adam optimizer. The hyperparameters $λ_{m}$ and $λ_{s}$ are set to 5 and 0.01, respectively.

The proposed model is labeled “Proposed” and the comparison algorithm is labeled “SAM-Net.” Segmentation results are compared through precision $(%)$ , recall $(%)$ , and F1-score $(%)$ . Inpainting results are compared using root mean square error (RMSE) (mm), mean absolute error (MAE) $(m m)$ , root mean squared error of the inverse depth (iRMSE) (1/km), and mean absolute error of the inverse depth (iMAE) (1/km). Finally, pose estimation error is evaluated by comparing the estimated position of the current frame with the initial position as a reference. The relative position error is calculated using RMSE (m, degree). This measure allows for a comprehensive assessment of the accuracy and error of the relative position estimation with respect to the initial position. For all comparison results in Table 1, 2, 3, and 4, preferred values for each metric are highlighted in bold.

Segmentation performance

The output of both models consists of binary masks, which are compared at the pixel level with the ground truth for the purpose of detecting moving objects. A higher precision implies that the model exhibits fewer instances of erroneously labeling non-moving regions as objects. On the other hand, a higher recall indicates that the model successfully identifies a significant portion of the actual moving object areas.

Figure 5 and Table 1 show the performance of the proposed and SAM-Net models in a dynamic environment. The ground truth labels were trained to classify moving objects, and the difference between residual image and single image learning methods is demonstrated. The proposed model achieves high performance by generating a binary mask that extracts the motion of an object through residual images. In contrast, SAM-Net, which uses a single image, struggles to learn the motion of objects and cannot effectively extract features for moving objects or classify them. In broader impact, this improved segmentation performance has implications for obstacle avoidance. Therefore, enhancing segmentation is crucial as it directly benefits scenarios requiring obstacle avoidance.

Figure 5.

Segmentation results on several dynamic cases. The figure on the left (a) describes a pedestrian in the distance with a queue of cars. The figure on the middle (b) shows the dynamic scene of a solely car in the front. The figure on the right shows (c) shows more crowded cars, which are coming along two directions, aligned and perpendicular with the view moving direction.

Table 1.

Comparison of the precision $(%)$ , recall $(%)$ , and F1-score $(%)$ of the proposed model against SAM-Net.

	Precision	Recall	F1-score
SAM-Net	0.8564	0.8676	0.8619
Proposed	0.8626	0.9146	0.8878

Inpainting performance

The output generated by the inpainting network includes $x$ , $y$ , and $z$ coordinates of the point clouds result. For comparison, three-channel images are used to convert the depth of each point. One general example is the inpainting of a removed car in a scene, as shown in Figure 6. In this figure, the analysis compares the difference between the inpainted output and the ground truth in terms of depth and 3D points. The depth comparison reveals no significant difference between the inpainted output and the ground truth for both models, indicating that both can produce good inpainting results based on depth analysis. This performance is attributed to the effective implementation of the 2D general adversarial method. However, the proposed model performs better than SAM-Net in quantitative analysis, as shown in Table 2.

Figure 6.

The result of inpainting. The figure on the left shows raw including moving objects, ground truth with moving objects removed, inpainted image obtained through the model used, and diff, which is the difference between ground truth and inpainted image. The right side is a three-dimensional (3D) point cloud representation. Blue points are static points, orange points are ground truth, and red points are the result of inpainting.

Table 2.

Quantitative analysis of RMSE (mm), MAE (mm), iRMSE (1/km), iMAE (1/km) upon the depth of the proposed model against SAM-Net.

	RMSE	MAE	iRMSE	iMAE
SAM-Net	3.3318	0.3612	5.8459	1.7117
Proposed	2.1930	0.2832	2.9843	0.6470

RMSE: root mean square error; MAE: mean absolute error; iRMSE: root mean squared error of the inverse depth; iMAE: mean absolute error of the inverse depth.

For further analysis, the inpainting output of 3D points is also examined. In this aspect, the proposed model outperforms the SAM-Net model. The proposed model’s 3D points output maintains surface and edge characteristics more effectively than SAM-Net. This is demonstrated by the alignment of the inpainted 3D points (red points) with the ground truth (blue points), where the proposed model shows superior performance compared to SAM-Net. This improvement is primarily due to the enhanced smoothness loss function, which prevents abrupt changes in the gap region, and the implementation of the Steganalysis Rich Models kernel²⁵ in the inpainting process. Consequently, the proposed model performs better at interpolation. For a clearer comparison of the 3D points, refer to Figure 7.

Figure 7.

Three-dimensional (3D) result of inpainting. Blue points are static points, orange points are the ground truth, and red points are the result of inpainting.

SLAM performance

Table 3 shows a comparison between LeGO-LOAM and the proposed when used together with LeGO-LOAM. When the proposed is used together with LeGO-LOAM, it shows better performance in environments with many moving objects. In particular, when surrounded by multiple moving objects, a better pose is estimated by removing the moving objects and creating new points for feature matching. Figure 8 show the estimated route for Town_1 and Town_4. In the case of Town_4, the overall performance of position estimation deteriorates as moving objects rotate at high speeds. However, while LeGO-LOAM judges it to be a straight path due to being surrounded by moving objects in a rotating situation, the proposed correctly rotates by removing the moving object and restoring new points.

Figure 8.

Trajectory of Town1 (a) and Town4 (b).

Table 3.

Comparison of root mean square error (RMSE) [m, degree] of the proposed method against LeGO-LOAM.

Town	LeGO-LOAM		Proposed
Town	Trans. Error	Rot. Error	Trans. Error	Rot. Error
Town_1	0.569	0.861	0.365	0.857
Town_4	394.0	–	92.9	102.6

Performance on KITTI

To validate the performance of the proposed method in the real world, we experimented on 2011_09_26_drive_0015(seq_30), 2011_09_26_drive_0028(seq_32), and 2011_09_26_drive_0029(seq_33).²⁰ Figure 9 shows the estimated trajectory of the LeGO-LOAM and the proposed method, and Table 4 is the comparison result. The LeGO-LOAM accumulated errors due to the position estimation based on features extracted from moving objects. In contrast, the proposed method for seq_30 and seq_32 can achieve better results by removing invalid feature points and restoring the background to recover valid points. On the other hand, for seq_33, not only the moving object but also the surrounding background is erased in the segmentation result initially, resulting in an error in pose estimation due to the loss of feature points. To address this issue, it is necessary to improve the method’s generalization performance by training it on a larger and more diverse dataset.

Figure 9.

Trajectory of seq $_$ 30 (a), seq $_$ 32 (b), and seq $_$ 33 (c).

Table 4.

Comparison of root mean square error (RMSE) [m, degree] of the proposed method against LeGO-LOAM in KITTI.

Seq.	LeGO-LOAM		Proposed
Seq.	Trans. Error	Rot. Error	Trans. Error	Rot. Error
30	2.604	0.456	2.456	0.547
32	19.174	0.381	14.411	0.314
33	2.922	1.317	2.820	1.287

Conclusion

This paper presents a novel LiDAR point cloud inpainting model that effectively removes moving objects in a dynamic environment and restores valid points. Our proposed model differs from existing models by selectively removing only the moving objects from the potentially mobile ones, thereby preserving valid information to the maximum extent. Additionally, to better distinguish moving objects, we utilize residual images to generate a binary mask through the segmentation decoder. This mask is then employed to remove the moving objects and restore the empty areas using the inpainting decoder. To produce more natural results than previous models, we incorporate a smoothness loss into our model. Our proposed model demonstrates superior performance in moving object segmentation and inpainting on two datasets generated via the CARLA simulator, and it also improves the performance of the SLAM algorithm. We validate the effectiveness of our model on the KITTI dataset and show a promising result as well.

As future work, the model requires an improvement toward the ability to align the correct points in the further distance in order to obtain a more robust mapping process in terms of point cloud usage. For broader implications might also be found to use the model in moving obstacle avoidance, since it has good performance in detecting any moving object based on residual images. All of these improvements will be a great thing for the implementation of this model.

Footnotes

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

ORCID iDs

I Made Putra Arya Winata

Junghyun Oh

References

Zhang

Singh

. LOAM: LiDAR odometry and mapping in real-time. In: Robotics: Science and systems, Edited by Dieter Fox, Lydia E. Kavraki and Hanna Kurniawati, Berkeley, CA, 2014, vol. 2, pp.1–9.

Shan

Englot

. LeGO-LOAM: Lightweight and ground-optimized LiDAR odometry and mapping on variable terrain. In: 2018 IEEE/RSJ international conference on intelligent robots and systems (IROS), Madrid, Spain: IEEE, 1-5 October 2018, pp.4758–4765.

Sun

Dai

Zhang

, et al. Efficient spatial-temporal information fusion for LiDAR-based 3D moving object segmentation. arXiv preprint arXiv:220702201, 2022.

Rashed

Ramzy

Vaquero

, et al. FuseMODNet: Real-time camera and LiDAR based moving object detection for robust low-light autonomous driving. In: Proceedings of the IEEE/CVF international conference on computer vision workshops, Seoul, Korea (South): IEEE, 27–28 October 2019, pp.2393–2402.

Dang

Liang

, et al. Moving objects elimination towards enhanced dynamic SLAM fusing LiDAR and mmW-radar. In: 2020 IEEE MTT-S international conference on microwaves for intelligent mobility (ICMIM), Linz, Austria: IEEE, 23 November 2020, pp.1–4.

Wang

Niu

. Multi-sensor fusion in automated driving: A survey. IEEE Access 2019; 8: 2847–2868.

Chen

Mersch

, et al. Moving object segmentation in 3D LiDAR data: A learning-based approach exploiting sequential data. IEEE Robot Autom Lett 2021; 6: 6529–6536.

Lee

Hwang

Kim

, et al. SAM-Net: LiDAR depth inpainting for 3D static map generation. IEEE Trans Intell Transp Syst 2021; 23(8): 12213–12228.

Sun

Liu

Meng

MQH

. Improving RGB-D SLAM in dynamic environments: A motion removal approach. Rob Auton Syst 2017; 89: 110–122.

10.

Sun

Liu

Meng

MQH

. Motion removal for reliable RGB-D SLAM in dynamic environments. Rob Auton Syst 2018; 108: 115–128.

11.

Geiger

Lenz

Stiller

, et al. Vision meets robotics: The KITTI dataset. Int J Rob Res 2013; 32: 1231–1237.

12.

Steder

Rusu

Konolige

, et al. NARF: 3D range image features for object recognition. In: Workshop on defining and solving realistic perception problems in personal robotics at the IEEE/RSJ international conference on intelligent robots and systems (IROS), Taipei, Taiwan: IEEE/RSJ, October 2010, vol. 44, p.2.

13.

Rusu

Blodow

Beetz

. Fast point feature histograms (FPFH) for 3D registration. In: 2009 IEEE international conference on robotics and automation, Kobe, Japan: IEEE, 12–17 May 2009, pp.3212–3217.

14.

Sun

Luo

, et al. P2V-RCNN: Point to voxel feature learning for 3D object detection from point clouds. IEEE Access 2021; 9: 98249–98260.

15.

Lang

Vora

Caesar

, et al. Pointpillars: Fast encoders for object detection from point clouds. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, Long Beach, CA, USA: IEEE, 15-20 June 2019, pp.12697–12705.

16.

Paz

Jensfelt

Tardos

, et al. EKF SLAM updates in O(n) with divide and conquer SLAM. In: Proceedings 2007 IEEE international conference on robotics and automation, Rome, Italy: IEEE, 10-14 April 2007, pp.1657–1663.

17.

Chen

Dong

, et al. An improved particle filter SLAM algorithm for AGVs. In: 2020 IEEE 6th international conference on control science and systems engineering (ICCSSE), Beijing, China: IEEE, 17-19 July 2020, pp.27–31.

18.

Milioto

Vizzo

Behley

, et al. RangeNet++: Fast and accurate LiDAR semantic segmentation. In: 2019 IEEE/RSJ international conference on intelligent robots and systems (IROS), Macau, China: IEEE, 03-08 November 2019, pp.4213–4220.

19.

Ruchti

Burgard

. Mapping with dynamic-object probabilities calculated from single 3D range scans. In: 2018 IEEE international conference on robotics and automation (ICRA), Brisbane, QLD, Australia: IEEE, 21-25 May 2018, pp.6331–6336.

20.

Chen

Milioto

Palazzolo

, et al. SuMa++: Efficient LiDAR-based semantic SLAM. In: 2019 IEEE/RSJ international conference on intelligent robots and systems (IROS), Macau, China: IEEE, 03-08 November 2019, pp.4530–4537.

21.

Liu

Wan

Huang

, et al. PD-GAN: Probabilistic diverse GAN for image inpainting. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), Nashville, TN, USA: IEEE, 20-25 June 2021, pp.9371–9381.

22.

Lahiri

Jain

Agrawal

, et al. Prior guided GAN based semantic inpainting. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), Seattle, WA, USA: IEEE, 13-19 June 2020, pp. 13693-13702.

23.

Yuan

Ruan

, et al. Image inpainting based on Patch-GANS. IEEE Access 2019; 7: 46411–46421.

24.

Bescos

Neira

Siegwart

, et al. Empty cities: Image inpainting for a dynamic-object-invariant space. In: 2019 international conference on robotics and automation (ICRA), Montreal, Canada: IEEE, 20-24 May 2019, pp.5460–5466.

25.

Bescos

Cadena

Neira

. Empty cities: A dynamic-object-invariant space for visual SLAM. IEEE Trans Robot 2020; 37: 433–451.

26.

Rui

, et al. DDL-SLAM: A robust RGB-D SLAM in dynamic environments combined with deep learning. IEEE Access 2020; 8: 162335.

27.

Bescos

Fácil

Civera

, et al. DynaSLAM: Tracking, mapping, and inpainting in dynamic scenes. IEEE Robot Autom Lett 2018; 3: 4076–4083.

28.

Qiu

Cui

Zhang

, et al. DeepLiDAR: Deep surface normal guided depth prediction for outdoor scene from sparse LiDAR data and single color image. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), Long Beach, CA, USA: IEEE, 15-20 June 2019. pp.3308-3317.

29.

Zhu

Shi

, et al. Depth completion from sparse LiDAR data with depth-normal constraints. In: Proceedings of the IEEE/CVF international conference on computer vision, Seoul, Korea (South): IEEE, 27 October 2019 - 02 November 2019, pp.2811–2820.

30.

Cavalheiro

Karaman

. Self-supervised sparse-to-dense: Self-supervised depth completion from LiDAR and monocular camera. In: 2019 International conference on robotics and automation (ICRA), Montreal, QC, Canada: IEEE, 20-24 May 2019, pp.3288–3295.

31.

Eskandar

Palaniswamy

Guirguis

, et al. GLPU: A geometric approach for LiDAR pointcloud upsampling. arXiv preprint arXiv:220203901, 2022.

32.

Bai

Zhao

Elhousni

, et al. DepthNet: Real-time LiDAR point cloud depth completion for autonomous vehicles. IEEE Access 2020; 8: 227825.

33.

Triess

Peter

Rist

, et al. CNN-based synthesis of realistic high-resolution LiDAR data. In: 2019 IEEE intelligent vehicles symposium (IV), Paris, France: IEEE, 09-12 June 2019, pp.1512–1519.

34.

Zhong

, et al. Deep learning for LiDAR point clouds in autonomous driving: A review. IEEE Trans Neural Netw Learn Syst 2020; 32: 3412–3432.

35.

Wang

Bovik

Sheikh

, et al. Image quality assessment: From error visibility to structural similarity. IEEE Trans Image Process 2004; 13: 600–612.

36.

Maddern

Pascoe

Linegar

, et al. 1 year, 1000 km: The Oxford robotcar dataset. Int J Rob Res 2017; 36: 3–15.

37.

Behley

Garbade

Milioto

, et al. SemanticKITTI: A dataset for semantic scene understanding of LiDAR sequences. In: Proceedings of the IEEE/CVF international conference on computer vision, Seoul, Korea (South): IEEE, 27 October 2019 - 02 November 2019, pp.9297–9307.

38.

Dosovitskiy

Ros

Codevilla

, et al. CARLA: An open urban driving simulator. In: Conference on robot learning, Mountain View, California: PMLR, 13-15 November 2017, pp.1–16.

39.

Deschaud

. KITTI-CARLA: A KITTI-like dataset generated by CARLA simulator. arXiv preprint arXiv:210900892, 2021.

LiDAR Point Inpainting Model Using Smoothness Loss for SLAM in Dynamic Environments

Abstract

Keywords

Introduction

Related works

Feature extraction with LiDAR

LiDAR implementation in SLAM

Inpainting method

Proposed model

LiDAR point preprocessing

Model

Smoothness loss L s

Total loss

Generating the data

Experiments

Setting

Segmentation performance

Inpainting performance

SLAM performance

Performance on KITTI

Conclusion

Footnotes

Declaration of conflicting interests

Funding

ORCID iDs

References

Smoothness loss $L_{s}$