Abstract
Light Detection and Ranging (LiDAR)-visual-inertial odometry can provide accurate poses for the localization of unmanned vehicles working in unknown environments in the absence of the Global Positioning System (GPS). Since the quality of the poses estimated by different sensors fluctuates greatly across environments with different structures, existing pose fusion models cannot guarantee stable pose estimation performance in such environments, which poses a great challenge to the pose fusion of LiDAR-visual-inertial odometry. This article proposes a novel environmental structure perception-based adaptive pose fusion method, which achieves online optimization of the parameters in the pose fusion model of LiDAR-visual-inertial odometry by analyzing the complexity of the environmental structure. Firstly, a novel quantitative perception method of environmental structure is proposed: a visual bag-of-words vector and a point cloud feature histogram are constructed to calculate quantitative indicators describing the structural complexity of the visual image and LiDAR point cloud of the surroundings, which can be used to predict and evaluate the quality of the poses produced by the LiDAR/visual measurement models. Then, based on the complexity of the environmental structure, two pose fusion strategies for the two mainstream pose fusion models (Kalman filter and factor graph optimization) are proposed, which adaptively fuse the poses estimated by LiDAR and vision online. Two state-of-the-art LiDAR-visual-inertial odometry systems are selected to deploy the proposed method, and extensive experiments are carried out on both open-source data sets and self-gathered data sets. The experimental results show that the proposed method can effectively perceive changes in environmental structure and execute adaptive pose fusion, improving the accuracy of pose estimation of LiDAR-visual-inertial odometry in environments with changing structures.
Introduction
Autonomous navigation of unmanned vehicles such as Unmanned Ground Vehicles (UGVs) and Unmanned Aerial Vehicles (UAVs) in unknown environments has always been a major challenge in robotics. 1 Localization is the foundation for unmanned vehicles to achieve autonomous navigation, and the accuracy of the vehicles' poses determines how well they complete the tasks assigned to them. 2 Due to the difficulty of obtaining stable and accurate satellite positioning signals in most unknown environments, sensor-fusion odometry technology based on LiDAR, vision, and the Inertial Measurement Unit (IMU) has gained the attention of researchers in recent years. Utilizing the pose estimations from the measurement models of different sensors, such as LiDAR point cloud registration, 3 visual bundle adjustment, 4 and IMU pre-integration, 5 LiDAR-visual-inertial odometry can provide stable relative pose transformations for unmanned vehicles under limited satellite positioning signals. In addition, since all the sensors of LiDAR-visual-inertial odometry are installed on the vehicle's body, it enables rapid deployment for real-time pose estimation in unknown environments.
Although LiDAR-visual-inertial odometry combines the advantages of LiDAR odometry, visual odometry, and inertial information for pose estimation in various scenarios, it also introduces noise from multiple sensors' measurements, resulting in significant uncertainty in the output fused pose. 6 Therefore, how to eliminate the sensors' measurement noise to minimize the uncertainty of the fused pose has always been the primary consideration for pose fusion in LiDAR-visual-inertial odometry. Existing pose fusion methods are mostly based on the Kalman filter 7 –10 and factor graph optimization. 11 –13 R2LIVE 10 is a LiDAR-visual-inertial odometry built upon the Kalman filter. It models the poses estimated by visual measurements 14 and LiDAR measurements 15 as normal distributions with different uncertainties, and covariance matrices are used to represent the uncertainties of the poses. The fusion process of the poses can be seen as the fusion of multiple normal distributions, and the optimization goal is to maximize the posterior probability of the normal distribution of the final fused pose. LVI-SAM 13 is a typical factor-graph-based LiDAR-visual-inertial odometry. The poses from visual-inertial odometry 14 and LiDAR-inertial odometry 16 are abstracted as items in a factor graph, represented by nodes or edges with different weights: the nodes represent the fused poses at different instants, and the edges represent the pose constraints introduced into the odometry system by the sensors' measurements. The factor graph is solved by nonlinear optimization to obtain a globally optimal estimate of the fused pose.
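To make this filtering-style fusion concrete, consider a simplified sketch (not the exact formulation of R2LIVE): given two independent pose estimates modeled as normal distributions, $\mathcal{N}(x_L, \Sigma_L)$ from LiDAR and $\mathcal{N}(x_V, \Sigma_V)$ from vision, the maximum a posteriori fused estimate is

$$\hat{x} = \left(\Sigma_L^{-1} + \Sigma_V^{-1}\right)^{-1}\left(\Sigma_L^{-1} x_L + \Sigma_V^{-1} x_V\right), \qquad \hat{\Sigma} = \left(\Sigma_L^{-1} + \Sigma_V^{-1}\right)^{-1}$$

so the source with the smaller covariance dominates the fused pose, which is why the uncertainty (or weight) assigned to each source matters so much in both fusion frameworks.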
Whether based on filtering or on graph optimization, existing pose fusion relies on a priori knowledge of the environmental structure to determine some parameters of the pose fusion model. Since these parameters often remain unchanged during the execution of the odometry system, the performance of the pose fusion model is limited in variable environmental structures. The Kalman filter needs to predict and update the vehicle's state vector according to the covariance matrix of the pose uncertainty, and the covariance matrix is derived from fixed sensor noise items calibrated in advance. Factor graph optimization needs to determine the weight of each node or edge, and these weights also need to be determined in advance according to engineering experience. However, in unknown and unstructured environments, the environmental information is difficult to perceive a priori, and due to odometry degeneration and other factors caused by changing environmental characteristics, 17,18 the quality of the vehicle's poses estimated from the LiDAR point cloud and from vision varies across different environmental structures. 19 Therefore, the performance of a sensor-fusion odometry system is more sensitive to the mathematical modeling and parameter configuration of the pose fusion model in unstructured environments than in structured environments.
To solve the above problems, existing methods use various optimization strategies to correct the degraded poses in LiDAR or visual odometry: for example, the method in the study 20 automatically adjusts the sensor fusion mode of visual-inertial odometry according to the pose uncertainty, 19 and the degeneration perception strategy in the study 21 shields the pose output of the degraded module of LiDAR-visual-inertial odometry. However, all these methods work only after the corresponding pose has been estimated, so pre-perception of environmental information and pose optimization in advance cannot be realized.
To reduce the pose errors induced by unexpected noise and inaccurate noise propagation models of different sensors, smooth variable structure filter (SVSF)-based estimation methods 22 –24 are leveraged to guarantee the robustness of vehicles' poses from various sources. Since the key by which SVSF ensures robustness lies in the mandatory switching of the vehicle's pose state between an upper and a lower bound, the optimal accuracy of the estimated poses cannot be reached. Although SVSF can be used in conjunction with the Kalman filter 25 to improve accuracy, an inappropriate smoothing boundary may lead to the chattering effect. 26
Based on the above analysis, to quantitatively perceive the environmental structure and optimize the pose fusion module of LiDAR-visual-inertial odometry in advance, thereby enhancing the accuracy and robustness of pose estimation in environments with changing structures, this article proposes ESP-APF, a novel adaptive pose fusion optimization method based on environmental structure perception, which comprises environmental structure perception methods and adaptive pose fusion strategies. Figure 1 shows the schematic diagram of ESP-APF including its key steps. The main contributions of this article are summarized as follows. (1) Since existing pose fusion models in LiDAR-visual-inertial odometry cannot respond to changes in pose quality induced by unexpected changes in environmental structure, the capability of perceiving the structure of the environment in real time needs to be developed for the odometry system. In ESP-APF, a novel quantitative perception method of environmental structural complexity is proposed; it can accurately perceive the LiDAR point cloud/visual environmental structure by constructing and analyzing a specially designed LiDAR point cloud histogram and a visual bag-of-words vector. (2) The current mainstream pose fusion approaches rely on fixed noise propagation models or fixed pose weights to estimate the fused pose, which restricts the accuracy and robustness of the poses estimated by LiDAR-visual-inertial odometry in environments with changing structures. In ESP-APF, two adaptive pose fusion strategies for the two mainstream pose fusion models (Kalman filter and factor graph optimization) are proposed based on the perceived environmental structural complexity; they adaptively fuse the poses estimated by LiDAR and vision online through dynamic configuration of the noise items in the Kalman filter and the pose weights in the factor graph. (3) ESP-APF is deployed on two state-of-the-art LiDAR-visual-inertial odometry systems with different pose fusion models, and experiments are conducted on open-source and self-gathered data sets to validate its effectiveness. The experimental results verify the improvement brought by ESP-APF to the accuracy of the poses estimated by LiDAR-visual-inertial odometry: compared with the original odometry without ESP-APF, the pose translational errors are reduced by around 15%–20% on the data sets with changing environmental structures.

The schematic of the proposed ESP-APF. ESP-APF: environmental structure perception-based adaptive pose fusion method.
Overview of the proposed method
The framework of the proposed ESP-APF is shown in Figure 2, and the basic working process of the proposed pose fusion method ESP-APF and the organization of this article are briefly introduced in the following paragraphs.

The framework of the proposed ESP-APF. ESP-APF: environmental structure perception-based adaptive pose fusion method.
Firstly, the visual images and the LiDAR point clouds are taken as the input for the perception of environmental structure, and two quantitative perception methods are designed for vision and the LiDAR point cloud in the third section. (1) For the visual environmental structure, a complexity calculation method based on the visual bag-of-words vector 27 is proposed; the basic visual bag-of-words vector is improved so that it can represent the complexity of the visual environmental structure. (2) For the LiDAR point cloud environmental structure, a complexity calculation method based on a 2D point cloud feature histogram 28,29 is proposed; a 2D histogram is constructed from the normal distributions of the local point cloud map to represent the environmental structure of the point cloud, and the corresponding complexity is calculated from the point cloud feature histogram.
Secondly, based on the quantitative perception of the two environmental structures, two adaptive pose fusion strategies are proposed in the fourth section to optimize the two mainstream pose fusion methods (Kalman filter and factor graph optimization). (1) For Kalman-filter-based pose fusion, dynamic configuration of the pose uncertainty is applied at the update stage of the Kalman filter. (2) For graph-optimization-based pose fusion, dynamic configuration of the weights of the pose constraints is applied before the nonlinear optimization of the factor graph is conducted.
Finally, the proposed ESP-APF is deployed on two different state-of-the-art LiDAR-visual-inertial odometry systems (R2LIVE 10 and LVI-SAM 13 ), and the improvements of ESP-APF in the accuracy of the pose estimation of LiDAR-visual-inertial odometry are verified on the open-source data sets and the self-gathered data set through a series of experiments in the fifth section.
Environmental structure perception
In this section, the method of environmental structure perception is introduced in detail, including the perception for visual environmental structure and LiDAR point cloud environmental structure.
Visual environmental structure perception
Construction of visual bag-of-words vector
Since a visual bag-of-words can represent the type and quantity of visual features, it is often used to compare the similarity of two image frames for place recognition. Here the traditional visual bag-of-words 27 is utilized, with a series of improvements, to construct the vector used to perceive the environmental structure.
Firstly, the training images are used to construct the visual vocabulary, and the detected visual features (ORB or SURF features) are transformed into
Then, the current image
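Although the construction equations and symbols are given in the original formulation, a minimal sketch of one plausible realization is shown below, assuming ORB features and a k-means vocabulary. Real systems typically use a binary-descriptor vocabulary tree such as DBoW, so the helper names and parameters here are illustrative only, not the paper's implementation.

```python
import cv2
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_vocabulary(training_images, num_words=256):
    """Cluster ORB descriptors from training images into a visual vocabulary
    (k-means over descriptors is a simplification of a DBoW-style tree)."""
    orb = cv2.ORB_create()
    descriptors = []
    for img in training_images:
        _, des = orb.detectAndCompute(img, None)
        if des is not None:
            descriptors.append(des.astype(np.float32))
    kmeans = MiniBatchKMeans(n_clusters=num_words, random_state=0)
    kmeans.fit(np.vstack(descriptors))
    return kmeans  # the cluster centers act as the visual words

def bow_vector(image, vocabulary):
    """Histogram of visual-word occurrences for one image."""
    orb = cv2.ORB_create()
    _, des = orb.detectAndCompute(image, None)
    vec = np.zeros(vocabulary.n_clusters)
    if des is not None:
        for w in vocabulary.predict(des.astype(np.float32)):
            vec[w] += 1  # count how often each word appears
    return vec
```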
Visual environmental structural complexity calculation
As shown in Figure 3, for two visual images received at a fixed time interval

The schematic of visual environmental structural complexity calculation.
For the sake of quantifying the changes in environmental structure accurately in the time period
Then the average similarity
If
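A minimal sketch of this similarity-and-complexity logic is given below, assuming cosine similarity between bag-of-words vectors and the standard-deviation-based complexity indicator of equation (4) described in the experiments; the threshold value is illustrative, not the paper's configuration.

```python
import numpy as np

def bow_similarity(v1, v2):
    """Cosine similarity between two visual bag-of-words vectors."""
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    return float(v1 @ v2 / denom) if denom > 0 else 0.0

def visual_complexity(bow_vec):
    """Complexity indicator derived from the standard deviation of the
    word counts, per the description of equation (4)."""
    return float(np.std(bow_vec))

# If the average similarity over the time window drops below a threshold,
# the visual structure is considered to have changed and the complexity
# indicator is refreshed (threshold value is illustrative).
SIM_THRESHOLD = 0.8
```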
LiDAR point cloud environmental structure perception
Construction of local point cloud map and 2D point cloud feature histogram
For
For each fitted normal distribution
Then the rotational matrix
Given the plane normal vectors
Supposing the size of the histogram
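As a hedged illustration of the construction described above (the voxel size, minimum point count, and bin counts are assumptions, not the paper's configuration), the following sketch fits a normal distribution to each voxel of the local map, extracts the plane normal as the eigenvector of the smallest covariance eigenvalue, and bins the normals by azimuth and elevation into a 2D histogram.

```python
import numpy as np

def plane_normals_from_voxels(points, voxel_size=1.0, min_pts=10):
    """Fit a normal distribution to the points in each voxel and take the
    eigenvector of the smallest covariance eigenvalue as the plane normal."""
    keys = np.floor(points / voxel_size).astype(np.int64)
    normals = []
    for key in np.unique(keys, axis=0):
        pts = points[np.all(keys == key, axis=1)]
        if len(pts) < min_pts:
            continue
        cov = np.cov(pts.T)
        _, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
        normals.append(eigvecs[:, 0])      # smallest-eigenvalue eigenvector
    return np.array(normals)

def normal_histogram_2d(normals, az_bins=18, el_bins=9):
    """Bin unit plane normals by azimuth/elevation into a 2D histogram."""
    az = np.arctan2(normals[:, 1], normals[:, 0])      # [-pi, pi]
    el = np.arcsin(np.clip(normals[:, 2], -1.0, 1.0))  # [-pi/2, pi/2]
    hist, _, _ = np.histogram2d(
        az, el, bins=[az_bins, el_bins],
        range=[[-np.pi, np.pi], [-np.pi / 2, np.pi / 2]])
    return hist
```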
LiDAR point cloud environmental structural complexity calculation
As shown in Figure 4, to quantify the changes in LiDAR point cloud structure accurately within the time period

The schematic of LiDAR point cloud environmental structural complexity calculation.
Like the visual environmental structure, if the similarity
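A minimal sketch of the complexity and similarity computations consistent with the description above follows; the cosine similarity metric is an assumption, and equation (11) is not reproduced verbatim here.

```python
import numpy as np

def histogram_similarity(h1, h2):
    """Cosine similarity between two flattened 2D feature histograms."""
    a, b = h1.ravel(), h2.ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def pointcloud_complexity(hist):
    """Complexity indicator derived from the standard deviation of the
    histogram bins (cf. equation (11)): it reflects how the quantities of
    planes are spread over the different normal directions."""
    return float(np.std(hist))
```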
Adaptive pose fusion strategy
The pose fusion strategies for LiDAR-visual-inertial odometry can be divided into two main categories: Kalman filter and factor graph optimization. For these different types of pose fusion models, this section introduces the corresponding pose fusion strategies of ESP-APF.
Adaptive pose fusion strategy for Kalman filter
Figure 5 shows the dataflow of the pose fusion process of LiDAR-visual-inertial odometry based on the Kalman filter. High-frequency inertial information acquired from the IMU is used for pose propagation at the prediction stage of the Kalman filter, and pose fusion with low-frequency LiDAR/visual measurements is performed at the update stage. In the update stage, the Kalman gain is calculated to obtain the final fused pose; since the calculation of the Kalman gain depends on the covariance of the pose uncertainty derived from the LiDAR/visual measurement models, it can be seen as a pose weight configuration based on pose uncertainty. Therefore, the adaptive pose fusion of ESP-APF is achieved by adjusting the pose uncertainty derived from the LiDAR/visual measurements.

The dataflow of LiDAR-visual-inertial pose fusion based on Kalman filter.
Since the noise items from the LiDAR/visual measurements are added to the current pose uncertainty each time a pose update is performed, and they are always fixed values in existing Kalman filters, these noise items are configured dynamically to adjust the pose uncertainty indirectly according to the complexity of the environmental structure. We argue that the noise items in a well-structured environment should have smaller values than those in an unstructured environment.
The noise from visual measurement model is from the 2D pixel positions
In the proposed adaptive pose fusion strategy of ESP-APF for Kalman filter, the noise covariances
The noise
The noise covariance is configured dynamically based on
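The exact noise-configuration equations are given in the original formulation; the following sketch only illustrates the stated argument, mapping higher structural complexity to a smaller measurement-noise covariance before a standard Kalman update. The mapping function `adaptive_noise_cov` and its parameters are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

def adaptive_noise_cov(base_cov, complexity, c_ref, s_min=0.5, s_max=2.0):
    """Scale a fixed measurement-noise covariance by the perceived structural
    complexity: higher complexity (well-structured scene) -> smaller noise,
    lower complexity -> larger noise. c_ref, s_min, s_max are tuning values."""
    scale = np.clip(c_ref / max(complexity, 1e-6), s_min, s_max)
    return scale * base_cov

def kalman_update(x, P, z, H, R):
    """Standard Kalman update; R comes from adaptive_noise_cov()."""
    S = H @ P @ H.T + R                    # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)         # Kalman gain
    x = x + K @ (z - H @ x)                # corrected state
    P = (np.eye(len(x)) - K @ H) @ P       # corrected covariance
    return x, P
```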
Adaptive pose fusion strategy for factor graph optimization
Figure 6 shows the dataflow of the pose fusion process of LiDAR-visual-inertial odometry based on factor graph optimization. The poses estimated by IMU pre-integration, visual odometry, and LiDAR odometry are transformed into constraints represented by edges with different weights in the factor graph, and the optimized fused poses are given as nodes after the nonlinear optimization of the factor graph is solved. In existing factor graph solvers (such as GTSAM 31 and G2O 11 ), the weight matrix

The dataflow of LiDAR-visual-inertial pose fusion based on factor graph optimization.
In the existing LiDAR-visual-inertial odometry with factor graph optimization, such as the study by Shan et al., 13 the weights of pose constraints from different sources are manually preconfigured based on engineering experience and remain static throughout the whole working process of the odometry system. For the purpose of adaptive pose fusion, we apply dynamic configuration of the weights of the constraints from LiDAR/visual odometry based on the complexity of the environmental structure calculated in the third section. We argue that the weights of pose constraints in a well-structured environment should have larger values than those in an unstructured environment. The adaptive pose fusion strategy of ESP-APF for factor graph optimization is designed as follows.
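As a hedged illustration of this weighting idea (the exact weighting equations follow the paper's design, which is not reproduced here), the sketch below uses GTSAM's Python bindings and maps the perceived complexity to the sigmas of a between-factor noise model, so that higher complexity yields smaller sigmas, that is, a larger constraint weight. The mapping function `constraint_noise` and its parameters are illustrative assumptions.

```python
import numpy as np
import gtsam

def constraint_noise(complexity, c_ref, base_sigmas):
    """Shrink the noise sigmas (i.e., raise the constraint weight) when the
    perceived structural complexity is high; c_ref and the clipping range
    are illustrative tuning parameters."""
    scale = np.clip(c_ref / max(complexity, 1e-6), 0.5, 2.0)
    return gtsam.noiseModel.Diagonal.Sigmas(scale * base_sigmas)

# Adding one LiDAR odometry constraint between consecutive pose nodes:
graph = gtsam.NonlinearFactorGraph()
base_sigmas = np.array([0.1, 0.1, 0.1, 0.05, 0.05, 0.05])  # rotation, translation
noise = constraint_noise(complexity=1.2, c_ref=1.0, base_sigmas=base_sigmas)
graph.add(gtsam.BetweenFactorPose3(0, 1, gtsam.Pose3(), noise))
```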
Experiment
Data set and experimental platform
To verify the proposed adaptive pose fusion method (ESP-APF) for LiDAR-visual-inertial odometry, the data set should include the raw LiDAR point cloud, visual image, and high-frequency inertial information synchronized by time stamp, and the covered environment of the data set should include enough environmental structural changes to reflect the effectiveness of this method. Therefore, the self-gathered data set containing scene-switching with different environmental structures is used for the following experiments; at the same time, the open-source KITTI data set 32 and the data set used in LVI-SAM 13 are also selected to expand the experimental scenarios. The basic information of these data sets is shown in Table 1.
Basic information of the data set used in the experiment.
The proposed ESP-APF is lightweight and does not require high-performance computing units, so low-cost computing platforms can be used for the experiments. The hardware computing platform is an Intel Core i5-9300 CPU with 16 GB of RAM, running the Ubuntu 18.04 operating system and Robot Operating System (ROS) Melodic.
Effectiveness evaluation of the environmental structure perception of ESP-APF
To verify that the proposed environmental structure perception method in the third section (the LiDAR point cloud environmental structural complexity calculated from the point cloud feature histogram, and the visual environmental structural complexity calculated from the visual bag-of-words vector) can accurately quantify the complexity of the surrounding environmental structure, experiments are carried out on the self-gathered data set. The self-gathered data set covers the vehicle's crossing from an unstructured environment (field) to a structured environment (sidewalk), during which the environmental structure changes greatly. Figure 7 shows the satellite map of the self-gathered data set and the vehicle's trajectory evaluated by RTK-GPS, and the specific details of the different environmental structures (LiDAR point cloud information and visual information) are also listed to show the change of environmental structure.

LiDAR point cloud and visual details of the self-gathered data set. The upper right pictures show the field scenario which is an unstructured environment, while the lower right pictures show the sidewalk scenario which is a structured environment.
Figure 8 visualizes the point cloud feature histogram and visual bag-of-words vector constructed in the corresponding scenes of Figure 7, and the related parameters for the construction are listed in Table 2. In the structured region, the features of the surrounding environment are diverse, and the normal directions of the planes fitted from the normal distributions in the local point cloud map are also diverse, which is conducive to the pose estimation of point cloud registration algorithms such as point-to-plane Iterative Closest Point (ICP). In contrast, in the unstructured region, the features of the surrounding environment are less diverse, and the normal directions of the planes are concentrated in a limited range, which may lead to degradation of pose estimation. Through the intuitive representation of the point cloud feature histogram, it can be seen that the histogram represents the structural characteristics of the surrounding environment well. Similarly, the visual bag-of-words vector constructed in the structured environment contains more types and larger quantities of visual words than that in the unstructured region, presenting a more diverse and complex visual environmental structure, which is also conducive to pose estimation based on visual feature matching and bundle adjustment.

Point cloud feature histogram and visual bag-of-words vector constructed in the corresponding scenarios of Figure 7.
Parameter configuration for environmental structural complexity perception.
Figure 9 shows the changing process of the point cloud/visual environmental structural complexity along the vehicle's trajectory, and the specific values of the complexity calculated in different scenarios are also included. The complexity of the point cloud environmental structure (calculated in equation (11)) is derived from the standard deviation of the point cloud feature histogram; it represents the average difference of the quantities of planes with different normal directions in a local point cloud map, reflecting the diversity of the planes' normal directions. The complexity of the visual environmental structure is calculated in equation (4) using the standard deviation of the visual bag-of-words vector; it represents the average difference of the quantities of the different types of visual words contained in a visual image, reflecting the diversity of visual features. It can be seen from Figure 9 that in the fields with open planar areas and few geometric features, both the complexities of the point cloud environmental structure and the visual environmental structure are lower than those calculated on the sidewalk with rich features, indicating that the environmental structure perception method of ESP-APF can accurately provide continuous indicators abstracting the complexity of the environmental structure. This method is not only in line with humans' intuitive understanding of the environmental structure but also indirectly indicates the quality of the poses estimated by LiDAR/visual odometry, providing a priori basis for the pose fusion of LiDAR-visual-inertial odometry.
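Equations (4) and (11) are not reproduced in this text, but given the description above, both complexity indicators plausibly take the standard-deviation form (a reconstruction consistent with the text, not the paper's verbatim equations):

$$C = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(h_i - \bar{h}\right)^2}, \qquad \bar{h} = \frac{1}{N}\sum_{i=1}^{N} h_i$$

where $h_i$ is the count in the $i$-th histogram bin (or of the $i$-th visual word) and $N$ is the number of bins (or words).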

The environmental structural complexity calculated on the self-gathered data set. The color of the trajectory in (a) and (b) represents the complexity of the environmental structure; the four points A, B, C, and D show the types of the scenarios in the data set; the trajectory between A and B is in the field (unstructured area, duration 172 s, distance 164 m), while the trajectory between C and D is on the sidewalk (structured area, duration 167 s, distance 186 m); (c) and (d) show the distributions of the values of the environmental structural complexity calculated in different scenarios in the form of boxplots.
Effectiveness evaluation of the adaptive pose fusion strategy of ESP-APF
Based on the complexity of the point cloud/visual environmental structure derived from the point cloud feature histogram and visual bag-of-words vector, different pose fusion strategies are proposed in ESP-APF to adapt to the mainstream pose fusion frameworks, including the Kalman filter and factor graph optimization. To verify the effectiveness of the pose fusion strategy proposed in the “Adaptive pose fusion strategy for Kalman filter” section in improving the pose estimation performance of LiDAR-visual-inertial odometry based on the Kalman filter, the strategy is applied to the R2LIVE 10 system for experimental evaluation. To validate the pose fusion strategy proposed in the “Adaptive pose fusion strategy for factor graph optimization” section in improving the pose estimation performance of LiDAR-visual-inertial odometry based on factor graph optimization, the strategy is applied to the LVI-SAM 13 system for experimental evaluation. To better verify the pure effectiveness of the proposed method, loop-closure modules are disabled in the following experiments.
Ablation study of ESP-APF for the pose fusion strategy for Kalman filter
In this experiment, the pose fusion pipeline of R2LIVE is utilized, and the pose estimations from the LiDAR measurement (point-to-plane error), visual measurement (visual feature reprojection error), and inertial information (IMU pose propagation) are fused through the iterative error-state Kalman filter framework. The pose fusion strategy of ESP-APF for the Kalman filter introduced in the “Adaptive pose fusion strategy for Kalman filter” section is applied as an extra functional module of R2LIVE. The noise items are dynamically configured according to the complexity of the environmental structure to adjust the pose uncertainties adaptively, so as to realize adaptive pose fusion. The ablation study of the adaptive pose fusion strategy is performed on the data sets mentioned in the “Data set and experimental platform” section. We select some representative data sets and divide them into two parts. One part contains obvious changes of environmental structure, including KITTI-00, KITTI-09, LVI-SAM-Jackal, and the self-gathered data set: KITTI-00 and KITTI-09 contain scene-switching from urban roads to rural roads, LVI-SAM-Jackal contains a variety of unstructured scenes (forests, fields, lawns, and asphalt roads), and the self-gathered data set includes the change from an unstructured environment (field) to a structured environment (sidewalk). The other part contains relatively simple environmental structures, including KITTI-01, KITTI-10, and LVI-SAM-Handheld, where KITTI-01 is a highway scene, KITTI-10 is a residential road, and LVI-SAM-Handheld is a field. The detailed configuration of the related parameters of the adaptive pose fusion strategy is shown in Table 3.
Parameter configuration of adaptive pose fusion strategy for Kalman filter.
The average sequence translational and rotational errors of the vehicles' poses obtained in the ablation study are shown in Table 4. These indicators are calculated using the KITTI odometry evaluation metric, 26 and they are represented as the root mean square error of the pose errors compared with the GPS-RTK ground truth. It can be found from the experimental results that on the data sets with relatively simple environmental structures, the improvement in pose accuracy brought by the proposed adaptive pose fusion strategy of ESP-APF is relatively limited. Taking the average translational errors as an example, in KITTI-01, KITTI-10, and LVI-SAM-Handheld, compared with the original R2LIVE, the average translational errors achieved by R2LIVE with ESP-APF are reduced by 7.50%, 6.85%, and 8.30%, respectively. However, in the areas with large changes in environmental structure, the proposed adaptive pose fusion strategy of ESP-APF can significantly improve the pose estimation accuracy of R2LIVE. In KITTI-00, KITTI-09, LVI-SAM-Jackal, and the self-gathered data set, the average translational errors are reduced by 16.05%, 14.89%, 16.31%, and 15.10%, respectively, and the pose trajectories estimated on these data sets are shown in Figure 10. The pose trajectory after the optimization of ESP-APF is more consistently aligned with the GPS-RTK ground truth. The average rotational error also shows greater improvement on the data sets with changing environmental structures than on those with simple environmental structures. To better demonstrate the overall improvement of ESP-APF on the Kalman-filter-based pose fusion model, Figure 11(a) and (b) shows the changing process of the adaptive parameters (the items of the noise covariance matrices
Accuracy evaluation of poses estimated by different methods.
ESP-APF: environmental structure perception-based adaptive pose fusion method.

The pose trajectories estimated by the ablation study on the adaptive pose fusion strategy for the Kalman filter on parts of the data sets. The enlarged pictures on the right show the details of the trajectories in the regions within the red circles of the trajectories on the left. The fluctuation of the ground truth of LVI-SAM-Jackal is due to blockage of the GPS signal.

The changing process of the adaptive parameters in R2LIVE with ESP-APF and the APEs of the poses estimated by R2LIVE and R2LIVE with ESP-APF on the self-gathered data set. The four time instants A, B, C, and D in (a), (b), and (e) indicate the types of the scenarios, the same as in Figure 9 (the duration between A and B is in the unstructured area, the duration between C and D is in the structured area). The continuous APEs are mapped onto the pose trajectory using the color bar in (c) and (d), the APEs aligned with duration are shown in (e), and the unit of APE is m. APE: absolute pose error; ESP-APF: environmental structure perception-based adaptive pose fusion.
Ablation study of ESP-APF for the pose fusion strategy for factor graph optimization
The pipeline of LVI-SAM's pose fusion framework is leveraged to validate the effectiveness of the pose fusion strategy of ESP-APF for factor graph optimization, and the pose constraints from LiDAR odometry, visual odometry, and IMU pre-integration are fused in the factor graph for optimal pose estimation. The pose fusion strategy introduced in the “Adaptive pose fusion strategy for factor graph optimization” section is applied as an extra functional module of LVI-SAM, and the weights of the LiDAR/visual pose constraints are dynamically configured to realize adaptive pose fusion. The same data sets as in the previous section are selected for the ablation study of the pose fusion strategy for factor graph optimization. The detailed configuration of the related parameters of the adaptive pose fusion strategy is shown in Table 5.
Parameter configuration of adaptive pose fusion strategy for factor graph optimization.
The related pose translational and rotational errors are shown in Table 4. On the data sets with simple environmental structures, compared with the original LVI-SAM, the adaptive pose fusion strategy using dynamic weights of pose constraints reduces the average translational errors of the poses by 4.93%, 9.09%, and 8.39% in KITTI-01, KITTI-10, and LVI-SAM-Handheld, respectively. On the data sets with changing environmental structures (KITTI-00, KITTI-09, LVI-SAM-Jackal, and the self-gathered data set), the average translational errors are reduced by 15.38%, 16.67%, 18.77%, and 18.41%, respectively, a more significant reduction in pose translational errors. The pose trajectories of the ablation study on parts of the data sets are shown in Figure 12, and the trajectories with ESP-APF are better aligned with the GPS-RTK ground truth. Figure 13(a) and (b) shows the changing process of the adaptive parameters (the items of weight matrices

The pose trajectories estimated by the ablation study on the adaptive pose fusion strategy for factor graph optimization on parts of the data sets. The enlarged pictures on the right show the details of the trajectories in the regions within the red circles of the trajectories on the left. The fluctuation of the ground truth of LVI-SAM-Jackal is due to blockage of the GPS signal.

The changing process of the adaptive parameters in LVI-SAM with ESP-APF and the APEs of the poses estimated by LVI-SAM and LVI-SAM with ESP-APF on the self-gathered data set. The four time instants A, B, C, and D in (a), (b), and (e) indicate the types of the scenarios, the same as in Figure 9 (the duration between A and B is in the unstructured area, the duration between C and D is in the structured area). The continuous APEs are mapped onto the pose trajectory using the color bar in (c) and (d), the APEs aligned with duration are shown in (e), and the unit of APE is m. APE: absolute pose error; ESP-APF: environmental structure perception-based adaptive pose fusion.
Processing time evaluation
Since ESP-APF works as an extra functional module (see Figure 2) in LiDAR-visual-inertial odometry, the computing cost of ESP-APF needs to be as low as possible to meet the real-time requirement of pose estimation. In this experiment, the time cost of the key phases of ESP-APF is evaluated and analyzed to validate that it is a lightweight module assisting pose fusion for LiDAR-visual-inertial odometry.
Utilizing the self-gathered data set and the experimental platform in the “Data set and experimental platform” section, the detailed average processing time of the key modules of ESP-APF and the two selected odometry systems (R2LIVE and LVI-SAM) is listed in Table 6. From the table, we can find that the average total time cost of ESP-APF for processing the dataflow from LiDAR measurements is around 17 ms and that for visual measurements is around 20 ms; both are clearly shorter than the time costs of R2LIVE and LVI-SAM for estimating poses from LiDAR scans (around 36 ms for R2LIVE and 41 ms for LVI-SAM) and visual images (around 56 ms for R2LIVE and 58 ms for LVI-SAM), which validates that ESP-APF is a lightweight module assisting the original odometry systems. In addition, since ESP-APF works as an extra module of LiDAR-visual-inertial odometry, it runs in two separate threads to process the LiDAR and visual dataflows in parallel at low frequency, so the extra time added to the original odometry system's pose estimation is small. For R2LIVE with ESP-APF, the average total processing time for LiDAR and visual measurements is 1.37 ms and 1.29 ms longer than the original R2LIVE, respectively. For LVI-SAM with ESP-APF, the average extra time consumption for LiDAR and visual measurements compared with the original LVI-SAM is 1.45 ms and 1.33 ms, respectively. Besides, to compare the total time consumption of the proposed method with other state-of-the-art odometry systems, the average duration between a sensor's measurement and the corresponding pose output, regardless of the sensor's type, is shown in Table 7, where VINS, 14 FAST-LIO, 15 and LIO-SAM 16 are the comparative odometry systems. Even though the processing time of R2LIVE with ESP-APF and LVI-SAM with ESP-APF is slightly longer than that of the original R2LIVE and LVI-SAM, the time consumption still keeps up with the pace of the other comparative odometry systems. According to the above experimental analysis, the extra time consumption brought by ESP-APF is small enough to maintain the real-time performance of the odometry systems.
Processing time evaluation of R2LIVE and LVI-SAM with ESP-APF.
ESP-APF: environmental structure perception-based adaptive pose fusion.
Average processing time of sensor's measurement in different state-of-the-art odometry systems.
ESP-APF: environmental structure perception-based adaptive pose fusion.
Conclusion and future work
Based on the quantitative analysis of the LiDAR point cloud/visual environmental structure and the characteristics of the two mainstream pose fusion models, this article proposes ESP-APF, an adaptive pose fusion method for mainstream LiDAR-visual-inertial odometry. Experimental analysis shows that the proposed environmental structure perception method of ESP-APF can accurately quantify the visual structure and point cloud structure of the surrounding environment, providing a priori environmental structure perception that allows the subsequent pose fusion to predict the quality of the poses estimated from LiDAR and vision. Under the effect of the adaptive pose fusion strategies of ESP-APF, enhanced pose estimation accuracy is achieved in environments with changing structures, endowing the odometry with the ability of adaptive pre-optimization of pose fusion according to the environmental structure. In addition, at the system level of LiDAR-visual-inertial odometry, the proposed ESP-APF is universal and decoupled, which implies that it can be easily migrated and adapted to various LiDAR-visual-inertial odometry systems as a separate functional module, providing accurate and robust pose estimation for mobile vehicles working in unknown and challenging environments with changing structures.
To further enhance the pose estimation performance of sensor-fusion odometry systems and expand their application scenarios, future work may consider adding new types of sensors, such as radar and thermal-infrared cameras, and developing more sophisticated pose fusion techniques. Since filter-based pose fusion models such as the Kalman filter, particle filter, and SVSF have their own advantages in improving the accuracy, robustness, and stability of the estimated vehicle's pose, combinations of these models also need to be developed to fuse the multisource heterogeneous data from different sensors and trade off the overall performance of sensor-fusion odometry. Moreover, our future work will also explore adaptive techniques in sensor-fusion odometry by extracting semantic environmental information using machine-learning methods for complicated and changing environments.
Footnotes
Authors’ contributions
Conceptualization: ZZ; methodology: ZZ; software programming: ZZ and CL; data curation: CL and WY; validation: ZZ, CL, and WY; writing – original draft: ZZ, CL, and WY; writing – review and editing: CL and JS; supervision: JS and DZ.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research and/or authorship of this article: This research was supported by the National Key R&D Program of China, No. 2022YFC3320800 and Zhejiang Provincial Key R&D Plan of China, No. 2021C01040.
