Sage Journals: Discover world-class research

Abstract

This paper proposes a novel multisensor fusion ground SLAM system that integrates LiDAR, visual, IMU, and wheel encoder data to enhance localization and mapping accuracy in complex environments. The proposed system consists of a LiDAR subsystem and a visual subsystem. To fully leverage the complementary advantages of LiDAR and visual sensors, the two subsystems provide each other with initial values for optimization. LiDAR supplies high-precision depth information to the visual subsystem, while the visual subsystem assists the LiDAR module in motion distortion correction, thereby enhancing the overall accuracy of state estimation. In addition, a novel plane extraction method is proposed based on adaptive dynamic distance thresholds and normal vector consistency verification. And the planar factor is introduced into the LiDAR subsystem for joint optimization. Constraining continuous planar structures (such as the ground and walls) reduces matching ambiguity, which further decreases local map noise and improves mapping accuracy. The proposed method has been extensively evaluated in complex urban campus scenarios, and the results show the effectiveness and accuracy of the algorithm.

Keywords

multisensor fusion SLAM planar factor LiDAR subsystem visual subsystem

Introduction

In recent years, simultaneous localization and mapping (SLAM) has gained widespread attention and experienced rapid advancement. SLAM has become indispensable for various robot navigation tasks due to its ability to estimate poses and reconstruct maps in real time.¹ The mainstream methods primarily consist of LiDAR-based, vision-based, and LiDAR-vision fusion approaches.

Visual SLAM

In vision-based methods, purely visual SLAM^2–4 algorithms, which rely solely on cameras as sensors, offer advantages like being lightweight and cost-effective, and have been widely applied over the past decade. Meanwhile, due to the inability of monocular cameras to recover metric scale, a common solution is to incorporate IMU sensors. An IMU supplies high-frequency inertial measurements, while a camera captures information-rich images. These two sensors complement each other to form a visual-inertial odometry (VIO) system. According to the estimation strategy, VIO can be categorized into two categories: filter-based methods and optimization-based methods.

Filter-based methods^5–7 typically use a Kalman Filter⁸ and its extended forms to propagate system states. State propagation is accomplished using the IMU kinematic model, while visual updates offer multi-frame constraints. Optimization-based methods^9,10 employ a sliding window optimization framework that jointly estimates system states by constructing a multi-modal constrained factor graph. Specifically, a nonlinear least squares optimization problem incorporating visual reprojection constraints, IMU pre-integration constraints is formulated within the sliding window. To address the challenge of precise picking point localization in cluttered lychee clusters, Fcaf3d-lychee¹¹ enhances the 3D detection backbone with a visual attention mechanism, achieving exceptional precision.

However, the sensitivity of visual SLAM algorithms to illumination changes, as well as the difficulty in extracting sufficient feature points in texture-sparse environments, can still easily result in divergence or failure of localization.

LiDAR SLAM

Owing to its capability of accurately measuring point-wise distances and spatial coordinates, LiDAR offers distinct advantages in point cloud registration and high-accuracy motion estimation. Compared with visual SLAM, LiDAR SLAM directly acquires accurate depth information via LiDAR sensors. However, purely LiDAR SLAM^12,13 relies on sufficient environmental characteristics during the initialization phase, otherwise, it is prone to failure. Therefore, integrating IMU sensors to assist with distortion correction and provide initial values is a common solution. This sensor fusion approach forms a LiDAR-inertial odometry (LIO) system.

Similarly, LIO can also be categorized into filter-based methods and optimization-based methods. Although state-of-the-art filter-based SLAM methods^14–16 can achieve highly accurate and stable localization over short durations, their inability to rectify accumulated pose errors makes them unsuitable for applications requiring globally consistent mapping. While optimization-based LiDAR SLAM methods^17,18 achieve high-precision global consistency through factor graph optimization, they are highly sensitive to initialization and prone to getting trapped in local minima under aggressive motion or in feature-scarce environments.

In addition, the performance of LiDAR-based SLAM often degrades significantly in environments with insufficient geometric constraints, such as narrow corridors, tunnels, or large open flat areas.

LiDAR-visual-inertial SLAM

Recently, LiDAR-visual-inertial systems have attracted growing attention owing to their robustness when one sensor fails or is partially degenerated.¹⁹ VIL-SLAM²⁰ employs a loosely-coupled method without leveraging the joint optimization of LiDAR, camera, and IMU measurements. The LIMO framework²¹ establishes cross-modal associations by leveraging LiDAR point clouds to geometrically constrain visual features, subsequently performing keyframe-based motion estimation through bundle adjustment optimization with geometric consistency constraints. In,²⁰ a cascaded architecture is employed to integrate tightly-coupled stereo visual-inertial odometry, LiDAR odometry, and a LiDAR-based loop-closure module. LIC-Fusion²² implements a tightly-coupled fusion strategy within the MSCKF⁵ framework, which synchronously integrates heterogeneous sensor modalities including IMU kinematic constraints, sparse visual feature tracking, and LiDAR geometric features. The enhanced LIC-Fusion2.0²³ framework refines LiDAR-based pose estimation through sliding-window constrained planar feature association, where consecutive LiDAR scans are processed using plane parameterization and geometric consistency verification within a factor graph optimization architecture. R2LIVE²⁴ establishes a unified probabilistic framework combining LiDAR, camera, and IMU measurements via manifold-based iterated Kalman filtering.⁸ Specifically, its VIO component leverages keyframe-based sliding window optimization with geometric verification to achieve robust visual feature triangulation through maximum a posteriori estimation. R3LIVE²⁵ employs a hierarchical architecture where LIO establishes the geometric representation of the global map, while VIO concurrently populates photometric attributes through texture rendering. These complementary subsystems maintain joint state estimation via tightly-coupled fusion of heterogeneous sensor streams. R3LIVE++²⁶ implements a self-adaptive radiometric compensation framework with real-time exposure parameter estimation and proactive photometric calibration,²⁷ enabling radiometrically consistent reconstruction of environmental radiation properties across varying illumination conditions. LVI-SAM²⁸ implements a tightly-integrated multi-modal fusion architecture within a factor graph-based optimization framework, where LiDAR geometric constraints, visual feature observations, and IMU kinematic models are jointly optimized through smoothing-based state estimation. Specifically, its visual-inertial subsystem executes sparse feature tracking while establishing cross-modal depth associations through LiDAR-derived geometric verification. Fast-LIVO²⁹ and Fast-LIVO2¹ implements a tightly-coupled framework through singular error-state iterative Kalman filtering (ESIKF), which can be updated by both LiDAR and visual observations. SR-LIVO³⁰ enhances LiDAR scan segmentation by reconstructing point clouds to align with camera timestamps. Once a relatively accurate pose estimation is obtained using LIO, it integrates visual updates, enhancing both pose accuracy and processing efficiency. The study³¹ addressed the limitation of locating 3D cracks by introducing an automated detection system that integrates LiDAR and camera data, achieving sub-millimeter measurement precision for 3D structural health monitoring.

In addition, wheel encoder, as a low-cost high-frequency motion sensor, provides stable measurements in GNSS-denied environments. Current LiDAR-visual-inertial fusion methodologies predominantly neglect systematic integration of wheel encoders. Besides, there are a large number of planar features in actual environments, and using continuous planar structures in the scene (such as floors and walls) to construct planar constraints can significantly reduce matching ambiguity in degraded environments. To address these situation, we propose a novel optimization framework for multimodal sensors (LiDAR-Camera-IMU-Wheel) with adaptive planar constraints. The specific contributions are as follows:

A novel multimodal sensor (LiDAR-Camera-IMU-Wheel) fusion optimization framework is proposed, which jointly optimizes visual reprojection residuals, IMU preintegration residuals, wheel encoder preintegration residuals, and LiDAR odometry residuals through factor graphs.

A novel plane extraction method based on adaptive dynamic distance thresholds and normal vector consistency verification is proposed. Additionally, planar factors are integrated into the backend optimization process of the LiDAR subsystem. By imposing constraints derived from planar structures, local map noise is effectively suppressed, and the geometric consistency of the reconstructed environment is rigorously maintained, thereby improving overall mapping accuracy.

Extensive validation in complex urban environments, demonstrating superior localization accuracy and system robustness compared to state-of-the-art SLAM methods.

Methodology

System overview

The overview of the proposed system is shown in Figure 1, with input sensors including LiDARs, RGBD cameras, IMUs, and wheel encoders, while GNSS is optional. The system is primarily composed of two subsystems: the visual subsystem and the LiDAR subsystem. The visual subsystem takes input camera images, IMU measurements, and wheel encoder measurements, with LiDAR data being optionally input. This subsystem achieves state estimation by jointly minimizing visual reprojection residuals, IMU preintegration³² residuals, and wheel encoder preintegration residuals. The LiDAR subsystem extracts geometric features from point clouds and constructs a LiDAR odometry by matching the features of the current frame to a local map, thereby providing high-precision pose estimation. The back-end employs a factor graph optimization framework, utilizing iSAM2³³ to optimize the system states jointly. Constraints integrated into the optimization include IMU preintegration factors, LiDAR odometry factors, and loop closure factors.

Figure 1.

System overview.

Two subsystems solve the state estimation problem using optimization-based methods, where the problem is formulated as a Maximum A Posteriori (MAP) problem. During this formulation, it is assumed that the measurements from different sensors are independent and that their noise follows zero-mean Gaussian distributions. Based on these assumptions, the MAP problem is further converted into a cost minimization problem, where each cost corresponds to the residual of a specific sensor measurement:

\begin{matrix} X^{★} = \underset{X}{\arg \max} p (X ∣ z) \\ = \underset{X}{\arg \max} p (X) p (z ∣ X) \\ = \underset{X}{\arg \max} p (X) Π_{i = 1}^{n} p (z_{i} ∣ X) \\ = \underset{X}{\arg \min} {∥ r_{p} - H_{p} X ∥^{2} + \sum_{i = 1}^{n} ∥ r (z_{i}, X) ∥_{P_{i}}^{2}} . \end{matrix}

(1)

Where $z_{i}$ stands for the measurements of the $i$ th sensor. ${r_{p}, H_{p}}$ is the prior information from marginalization. $r (\cdot)$ denotes the residuals corresponding to each sensor measurement. $∥ \cdot ∥_{P}$ represents the Mahalanobis norm.

Visual subsystem

The visual subsystem performs an optimization process within a sliding window framework, where the robot state $x$ can be defined as:

x = [R, p, v, b] .

(2)

Where $R \in SO (3)$ is the rotation matrix, $p \in R^{3}$ is the position vector, $v$ is the speed, and $b$ is the IMU bias. The transformation $T \in SE (3)$ from $B$ to $W$ is represented as $T = [R ∣ p]$ .

Improvements in depth estimation and initialization

The processing pipeline of the visual subsystem is improved upon the Ground-Fusion algorithm,³⁴ which is a low-cost SLAM framework that tightly couples RGBD-IMU-Wheel-GNSS sensors. The more detailed improvements are as follows:

1) Image Depth Estimation: In our visual subsystem, LiDAR measurements are optional input. During the state estimation process of visual odometry, accurate depth information is critical for recovering the camera’s 3D motion trajectory from 2D images. To enhance the accuracy of depth estimation, we employ a strategy that incorporates three different depth sources: firstly, when the LiDAR subsystem is operating normally, LiDAR-provided depth information is prioritized due to its superior measurement accuracy compared to other sensors; secondly, if the LiDAR becomes unavailable or fails, the system switches to using depth data obtained from RGBD cameras. That is, the 2D features tracked in the image are combined with the pixel values at the same locations in the depth image and the two are aligned; finally, when neither of the aforementioned sources is available, the system utilizes IMU sensors for depth estimation. Within this framework, the system dynamically selects the depth source based on the operational status of the LiDAR subsystem, ensuring reliable depth estimation under varying conditions.

2) System Initialization: Optimization-based methods are highly dependent on accurate system initialization. The optimization process relies on an initial state to commence iterative solving. If the initial values are inaccurate or deviate significantly from the true values, the back-end optimization may become trapped in local minima, leading to substantial errors in state estimation or even failure to converge to the correct solution. Furthermore, poor initialization can significantly slow the convergence rate of the optimization process. This issue is particularly pronounced in complex environments or large-scale scenarios, ultimately compromising the accuracy and robustness of the system. In the original system, the initialization process often failed under conditions of low acceleration excitation. To enhance the accuracy and robustness of the visual subsystem’s initialization, we utilize the system state $x$ and IMU biases $b$ , obtained from the back-end optimization of the LiDAR subsystem, as the initial estimates for the visual subsystem when the LiDAR subsystem is operational. Specifically, the optimized states are mapped to the corresponding keyframes of the visual subsystem based on the image timestamps, serving as the initial guesses for the visual state estimation.

Constraint factors

1) IMU Preintegration Factor: In the modeling of the IMU preintegration factor, the measurements involved include platform biases, noisy linear acceleration and angular velocity. Additionally, the acceleration needs to contain the gravity component. Therefore, the IMU’s acceleration and angular velocity measurements³² can be modeled as:

\begin{matrix} {\tilde{a}}_{t} = a_{t} + b_{a_{t}} + R_{w}^{b_{t}} g^{w} + n_{a} \\ {\tilde{ω}}_{t} = ω_{t} + b_{w_{t}} + n_{w} . \end{matrix}

(3)

Where ${a_{t}, ω_{t}}$ is the real linear acceleration and angular velocity in IMU sensor at time $t$ . ${{\tilde{a}}_{t}, {\tilde{ω}}_{t}}$ is the raw measurement of accelerometer and gyroscope at time $t$ . Assumed that $n_{a}$ and $n_{ω}$ to be zero-mean Gaussian distributed, $n_{b_{a}} ~ N (0, Σ_{b_{a}}), n_{b_{ω}} ~ N (0, Σ_{b_{ω}})$ . The biases of the accelerometer and gyroscope are modeled as a random walk as follows:

{\overset{\cdot}{b}}_{a_{t}} = n_{b_{a}}, {\overset{\cdot}{b}}_{w_{t}} = n_{b_{w}},

(4)

In the process of nonlinear optimization, the state variables are iteratively updated, which requires re-integrating the IMU measurements after each update. To improve the efficiency of the overall optimization process, we employ the preintegration method, where the integration results between time instants $k$ and $k + 1$ are precomputed and used as constraint factors in the back-end. The preintegration is defined as follows:

\begin{matrix} α_{b_{t_{k + 1}}}^{b_{t_{k}}} = \int \int_{t \in [t_{k}, t_{k + 1}]} R_{b_{t}}^{b_{t_{k}}} ({\tilde{a}}_{t} - b_{a_{t}}) d t^{2} \\ β_{b_{t_{k + 1}}}^{b_{t_{k}}} = \int_{t \in [t_{k}, t_{k + 1}]} R_{b_{t}}^{b_{t_{k}}} ({\tilde{a}}_{t} - b_{a_{t}}) dt \\ γ_{b_{t_{k + 1}}}^{b_{t_{k}}} = \int_{t \in [t_{k}, t_{k + 1}]} \frac{1}{2} Ω ({\tilde{ω}}_{t} - b_{w_{t}}) γ_{b_{t}}^{b_{t_{k}}} dt . \end{matrix}

(5)

Where $b_{k}$ stands for the body frame at time $t_{k}$ . ${α, β, γ}$ represents the relative change in position, velocity, and rotation between the $k$ and $k + 1$ moments. If the estimated bias undergoes significant changes, the preintegration process is repropagated using the updated bias value. $Ω (ω)$ is defined as follows:

Ω (ω) = [\begin{matrix} - {⌊ ω ⌋}_{\times} & ω \\ - ω^{T} & 0 \end{matrix}], {⌊ ω ⌋}_{\times} = [\begin{matrix} 0 & - ω_{z} & ω_{y} \\ ω_{z} & 0 & - ω_{x} \\ - ω_{y} & ω_{x} & 0 \end{matrix}] .

(6)

Finally, the IMU preintegration residuals⁹ can be expressed as:

\begin{matrix} r_{B} ({\tilde{z}}_{b_{t_{k + 1}}}^{b_{t_{k}}}, X) = [\begin{matrix} δ α_{b_{t_{k + 1}}}^{b_{t_{k}}} \\ δ β_{b_{t_{k}}}^{b_{t_{k + 1}}} \\ δ θ_{b_{t_{k + 1}}}^{b_{t_{k + 1}}} \\ δ b_{a} \\ δ b_{g} \end{matrix}] \\ = [\begin{matrix} R_{w}^{b_{t_{k}}} (p_{b_{t_{k + 1}}}^{w} - p_{b_{t_{k}}}^{w} + \frac{1}{2} g^{w} Δ t_{k}^{2} - v_{b_{t_{k}}}^{w} Δ t_{k}) - {\hat{α}}_{b_{t_{k + 1}}}^{b_{t_{k}}} \\ R_{w}^{b_{t_{k}}} (v_{b_{t_{k + 1}}}^{w} + g^{w} Δ t_{k} - v_{b_{t_{k}}}^{w}) - {\hat{β}}_{b_{t_{k + 1}}}^{b_{t_{k}}} \\ 2 {[q_{b_{t_{k}}}^{w^{- 1}} \otimes q_{b_{t_{k + 1}}}^{w} \otimes {({\hat{γ}}_{b_{t_{k + 1}}}^{b_{t_{k}}})}^{- 1}]}_{xyz} \\ b_{a b_{t_{k + 1}}} - b_{a b_{t_{k}}} \\ b_{w b_{t_{k + 1}}} - b_{w b_{t_{k}}} \end{matrix}] . \end{matrix}

(7)

2) Visual Reprojection Residual Factor: In our system, visual measurements are represented as a set of feature points extracted from image frames. Specifically, prominent corners within the image are detected as feature points using the Harris corner detector,³⁵ and their temporal correspondence is established through tracking based on the iterative Lucas and Kanade optical flow method.³⁶ The process of projection can be modeled as:

\tilde{P} = π_{c} (R_{b}^{c} (R_{w}^{b} x^{w} + p_{w}^{b}) + p_{b}^{c}) + n_{c} .

(8)

Where $\tilde{P} = [u, v]^{T}$ is the pixel coordinates of the image. $π_{c} (\cdot)$ represents the camera projection function and $n_{c}$ is the sensor measurement noise. For the $l$ th feature point that is first detected in the $i$ th image frame, its corresponding visual reprojection residuals in the $j$ th image frame are as follows:

r_{C} ({\tilde{z}}_{l}, X) = {\tilde{P}}_{l}^{c_{t_{j}}} - π_{c} ({\hat{x}}_{l}^{c_{t_{j}}}) .

(9)

\begin{matrix} \begin{matrix} {\hat{x}}_{l}^{c_{t_{j}}} = R_{b}^{c} (R_{w}^{b_{t_{j}}} (R_{b_{t_{i}}}^{w} (R_{c}^{b} \frac{1}{ρ_{l}} π_{c}^{- 1} ({\tilde{P}}_{l}^{c_{t_{i}}}) + p_{c}^{b}) \\ + p_{b_{t_{i}}}^{w}) + p_{w}^{b_{t_{j}}}) + p_{b}^{c} . \end{matrix} \end{matrix}

(10)

Where ${R_{b}^{c}, t_{c}^{b}}$ represents the transformation matrix between camera and IMU and $ρ_{l}$ represents the inverse depth in frame $i$ .

3) Wheel Preintegration Factor: For ground-based mobile robots, the motion dynamics often fail to meet the requirements for robust visual-inertial fusion. The weak observability of scale and pose during operation can significantly degrade localization accuracy. This limitation is particularly evident in scenarios where the robot’s motion lacks sufficient excitation, such as in low-speed or constrained environments.³⁷ To address this challenge, ground robots are typically equipped with wheel encoders. These sensors provide reliable motion measurements with an absolute scale, independent of environmental conditions. Unlike vision-based measurements that depend on feature-rich scenes or inertial measurements that rely on sufficient acceleration and angular motion, wheel encoder offers consistent and accurate motion estimation under most operating conditions.

By incorporating wheel encoders into the SLAM framework, the system benefits from enhanced scale observability and more robust pose estimation, particularly in scenarios where visual-inertial methods alone struggle. The integration of wheel encoders as an additional constraint within the optimization process allows for improved localization accuracy and system robustness, even in environments with low texture or minimal motion excitation.

To estimate the motion of the robot from wheel encoder measurements, the following kinematic formulation³⁸ is developed.

\begin{matrix} {\overset{\cdot}{R}}_{O}^{W} = R_{O}^{W} ω^{O^{\land}} \\ {\overset{\cdot}{p}}_{O}^{W} = v_{O}^{W} \end{matrix}

(11)

The equation (11) describes the pose transformation of the wheel encoder coordinate system. By integrating (11), the pose at time $t + Δ t$ can be obtained as follows.

\begin{matrix} R_{O}^{W} (t + Δ t) = R_{O}^{W} (t) Exp (\int_{t}^{t + Δ t} ω^{O} (τ) d τ) \\ p_{O}^{W} (t + Δ t) = p_{O}^{W} (t) + \int_{t}^{t + Δ t} v_{O}^{W} (τ) d τ \end{matrix}

(12)

The wheel encoder can provide motion measurement between frames. Based on the midpoint integration, we can derive the wheel encoder measurement model as follows:

\begin{matrix} R_{O}^{W} (t + Δ t) = R_{O}^{W} (t) Exp ({\bar{ω}}^{O} Δ t) \\ p_{O}^{W} (t + Δ t) = p_{O}^{W} (t) + {\bar{v}}_{O}^{W} Δ t . \end{matrix}

(13)

Where $O$ represents the coordinate of the wheel odometer. ${\bar{ω}}^{O}$ and ${\bar{v}}_{O}^{W}$ represent the averages of the measurements at time $t$ and $t + Δ t$ :

\begin{matrix} {\bar{ω}}^{O} = \frac{1}{2} (ω^{O} (t) + ω^{O} (t + Δ t)) \\ {\bar{v}}_{O}^{W} = \frac{1}{2} (v_{O}^{W} (t) + v_{O}^{W} (t + Δ t)) \\ = \frac{1}{2} (R_{O}^{W} (t) v^{O} (t) + R_{O}^{W} (t + Δ t) v^{O} (t + Δ t)) . \end{matrix}

(14)

Similar to IMU preintegration, the wheel encoder measurements between keyframe $k = i$ and keyframe $k = j$ can be integrated into a composite measurement, which imposes a motion constraint between the two consecutive keyframes. The final wheel encoder preintegration measurement model is as follows:

\begin{matrix} Δ {\tilde{p}}_{O_{j}}^{O_{i}} = R_{Oi}^{W^{T}} (p_{Oj}^{W} - p_{Oi}^{W}) + δ p_{ij} \\ Δ {\tilde{R}}_{O_{j}}^{O_{i}} = R_{Oi}^{W^{T}} R_{Oj}^{W} Exp (δ θ_{ij}) . \end{matrix}

(15)

Where $[δ p_{ij}, δ θ_{ij}]$ is the random noise.

Based on the wheel encoder preintegration measurement model (15), the wheel preintegration residual can be given by:

\begin{matrix} r_{Δ R_{j}^{o_{i}}} = Log (Δ {\tilde{R}}_{O_{j}}^{O_{i}^{T}} R_{O_{i}}^{W^{T}} R_{O_{j}}^{W}) \\ r_{p_{O_{j}}} = R_{Oi}^{W^{T}} (p_{Oj}^{W} - p_{Oi}^{W}) - Δ {\tilde{p}}_{O_{j}}^{O_{i}} . \end{matrix}

(16)

Where $p_{O}^{W} = R_{B}^{W} p_{O}^{B}, R_{O}^{W} = R_{B}^{W} R_{O}^{B}$ .

Therefore, the final wheel preintegration residual, expressed as a function of the system state, is given by:

\begin{matrix} r_{Δ R_{o_{j}}^{o_{i}}} = Log (Δ {\tilde{R}}_{O_{j}}^{O_{i}^{T}} {(R_{Bi}^{W} R_{O}^{B})}^{T} R_{Bj}^{W} R_{O}^{B}) \\ r_{Δ p_{o_{j}}^{o_{i}}} = {(R_{Bi}^{W} R_{O}^{B})}^{T} (R_{Bj}^{W} p_{O}^{B} + p_{Bj}^{W} \\ - R_{Bi}^{W} p_{O}^{B} - p_{Bi}^{W}) - Δ {\tilde{p}}_{O_{j}}^{O_{i}} . \end{matrix}

(17)

Where $[R_{O}^{B}, p_{O}^{B}]$ is the known extrinsic parameter between IMU and wheel odometer.

LiDAR subsystem

The LiDAR subsystem is an enhancement based on the LIO-SAM algorithm.¹⁷ The optimization variables in the system are same to those used in the visual subsystem. The system retains the factor graph used for global pose optimization. In this factor graph, three types of constraints are incorporated: IMU preintegration factor, LiDAR odometry factor, and loop closure factor. These constraints are then jointly optimized to refine the system’s state estimate. The IMU preintegration constraints account for the motion between consecutive time steps, while the LiDAR odometry constraints provide relative pose estimates between keyframes or consecutive frames. Additionally, loop closure constraints, which are derived from detecting revisited locations, help to correct drift and improve the global consistency of the trajectory.

By formulating the constraints in a factor graph framework, the system can simultaneously optimize the states associated with the IMU and LiDAR sensors, while also minimizing the error in the loop closure detection. This approach ensures that all available sensor information is efficiently utilized, leading to a more accurate and robust global trajectory estimation.

LiDAR odometry

Upon the arrival of a new LiDAR scan, we first perform feature extraction to identify key elements within the point cloud. The extraction process involves evaluating the roughness of points within a local neighborhood to distinguish between edge and planar features. By classifying points into edge and planar features based on local roughness thresholds, we can effectively segment the LiDAR scan into distinct feature types that are more suitable for matching and alignment tasks.

After extracting features from the current LiDAR scan, a local map is constructed by identifying keyframes within the sliding window that are spatially and temporally close to the current frame. This local map provides a reference for scan-to-map matching, which utilizes both edge and planar features for precise alignment. Following the design principles of the LIO-SAM framework, we adopt its foundational structure while implementing a series of enhancements tailored to improve system performance and robustness. The more details are as follows:

1) Wheel Preintegration Factor: Due to the presence of the visual subsystem, the initial values required for optimizing the backend of the LiDAR subsystem can be obtained not only from the IMU, but also through results derived from the backend optimization of the visual subsystem. These initial values are then propagated to the LiDAR subsystem to enhance the accuracy and efficiency of its optimization process. Because the initial system state provided by the visual subsystem, optimized through the factor graph containing various constraints, is more accurate compared to the state estimated by the IMU.

2) Plane Factor: To enhance the robustness of point cloud registration, we propose a plane extraction method incorporating dynamic threshold adaptation and normal vector constraints. After extracting the initial planar features through curvature, an improved RANSAC algorithm is used for multi plane segmentation based on this. For the $k$ -th iteration, the plane parameters $π_{k} = (a_{k}, b_{k}, c_{k}, d_{k})$ are estimated by minimizing the sum of absolute distances for inliers:

\min_{π_{k}} \sum_{i \in I_{k}} \frac{| a_{k} x_{i} + b_{k} y_{i} + c_{k} z_{i} + d_{k} |}{\sqrt{a_{k}^{2} + b_{k}^{2} + c_{k}^{2}}} .

(18)

Where $I_{k}$ denotes the inlier index set.

To handle varying scene scales, we dynamically adjust the distance threshold based on the point cloud centroid:

τ_{d} = α \cdot ‖ μ ‖_{2}, μ = \frac{1}{N} \sum_{i = 1}^{N} {(x_{i}, y_{i}, z_{i})}^{T} .

(19)

Where $α$ is the scaling factor.

For the ground, we enforce directional constraints to select physically meaningful planes:

n_{k} \cdot n_{ref} > γ .

(20)

Where $n_{k} = {(a_{k}, b_{k}, c_{k})}^{T} / \sqrt{a_{k}^{2} + b_{k}^{2} + c_{k}^{2}}$ is the normalized normal vector, $n_{ref}$ denotes reference direction, and $γ$ is the angular threshold.

Finally, the plane quality metric is defined as the mean absolute distance error of inliers:

E_{k} = \frac{1}{| I_{k} |} \sum_{i \in I_{k}} | a_{k} x_{i} + b_{k} y_{i} + c_{k} z_{i} + d_{k} | .

(21)

A plane is accepted as a valid constraint factor only when $E_{k} < 0.05$ is satisfied.

We have conducted tests on multiple sequences by adjusting $α$ (within 0.1–0.3) and $γ$ (within 0.75–0.95) while keeping other settings consistent. The experimental results show that the localization accuracy (measured by ATE) varies by less than 5% across different parameter combinations.

Experiment

To comprehensively evaluate the accuracy and efficiency of the proposed algorithm, APGC-LVIW was tested on the M2DGR-PLUS³⁴ and M2DGR.³⁹ The M2DGR datasets contain various challenging urban campus scenarios, such as weakly-textured, open areas, and moving obstacles. The M2DGR-PLUS datasets are the Extension and update of M2DGR, which contain the scenarios of degraded bridge decks, indoor-outdoor transitions, complex street environments, etc. More details of the specifications of all sensors are summarized in Table 1.

Table 1.

Details of benchmark datasets.

Sensor		M2DGR-PLUS	M2DGR
LiDAR	Type	Robosense RS-16	Velodyne VLP-32C
	Frequency	5 Hz	10 Hz
IMU	Type	Built in IMU of Intel RealSense D435i	Handsfree
	Frequency	200 Hz	150 Hz
Camera	Type	Realsense d435i	Realsense d435i
	Frequency	15 Hz	15 Hz
Wheel	Type	WHEELTEC	*
	Frequency	20 Hz	*

indicates that there is no Wheel sensor in the M2DGR dataset.

On these two datasets, APGC-LVIW was compared with several state-of-the-art SLAM methods: LIO-SAM,¹⁷ VINS-Mono,⁹ LINS,⁴⁰ Odom (The wheel odometry in Ground-Fusion.³⁴), FAST-LIO2,¹⁴ R3LIVE,²⁵ FAST-LIVO,²⁹ Ground-Fusion,³⁴ LVI-SAM.²⁸ And the computation platform used for tests is a laptop equipped with an Intel i7-12,700H 2.50 GHz CPU and 32 GB RAM.

Accuracy results

Table 2 shows the Root Mean Square Errors (RMSEs) of absolute trajectory error (ATE) for several algorithms on the M2DGR-PLUS datasets. APGC-LVIW performs best in seven out of nine sequences. Benefiting from the proposed multi-modal sensor fusion framework (LiDAR-Camera-IMU-Wheel Odometry), the system effectively integrates heterogeneous sensor measurements, achieving superior localization accuracy in seven out of nine test sequences.

Table 2.

RMSEs of ATE [ $m$ ] on M2DGR-PLUS datasets.

Sequence	LIO-SAM	VINS-Mono	LINS	Odom	FAST-LIO2	R3LIVE	FAST-LIVO	Ground-Fusion	LVI-SAM	APGC-LVIW (Ours)
Tree	0.226	1.452	0.207	2.789	0.772	*	*	0.594	0.320	0.151
Bridge1	*	1.647	*	0.822	*	*	*	0.491	0.248	0.161
Street1	0.153	0.657	0.191	0.264	0.297	*	1.016	0.198	0.093	0.081
Street2	0.106	0.835	0.106	0.974	0.186	3.135	2.428	0.271	0.095	0.089
Parking1	0.056	0.091	0.052	0.238	0.071	0.090	1.039	0.189	0.057	0.057
Parking2	0.037	0.143	0.038	0.263	0.050	0.468	0.771	0.197	0.038	0.035
Building1	0.278	0.547	0.280	1.607	0.284	1.894	2.444	0.628	0.254	0.241
Building2	0.222	0.491	0.193	0.971	0.374	1.752	1.863	0.252	0.233	0.216
Switch	0.217	1.446	0.451	4.820	0.889	*	7.169	0.785	0.216	0.214

The optimal results are marked in red and the suboptimal results are marked in blue.

indicates that the algorithm has degraded in these sequences, resulting in significant accuracy errors.

Notably, in challenging scenarios such as LiDAR-degraded environments (e.g. Bridge1 sequence) and dense urban campus scenes with heavy tree occlusions (e.g. Tree sequence), conventional LiDAR SLAM and existing multi-sensor fusion methods suffer from performance degradation or even localization failure due to sensor limitations or environmental interference. In contrast, our system, enhanced by wheel odometry integration and the optimization strategy, demonstrates significantly improved robustness and precision in degenerate conditions. Experimental results validate that the proposed multi-modal fusion framework exhibits higher reliability and localization accuracy in complex or degraded environments.

To more intuitively and accurately present the accuracy performance of APGC-LVIW, Figures 2 and 3 show the map construction effects and trajectory estimation results on M2DGR-PLUS datasets. As demonstrated by the mapping comparison in Figure 2 for the Tree and Bridge sequences, our system significantly enhances geometric consistency in multi-sensor state estimation by integrating wheel encoder constraints into the backend factor graph optimization framework. Compared with the LVI-SAM, our approach exhibits superior point cloud registration quality at structural boundaries, such as building walls and vehicles. While LVI-SAM shows noticeable point cloud stratification. And Figure 3 demonstrates that the proposed algorithm achieves higher localization accuracy in local regions compared to the other SOTA algorithms.

Figure 2.

Mapping result of APGC-LVIW on the tree sequence and bridge sequence presented from a bird’s eye view. The partial map highlighted by the red boxes is zoomed in on and compared with the LVI-SAM results.

Figure 3.

Comparison of trajectories on tree and switch sequence.

In addition, to validate the effectiveness of the proposed plane extraction method based on adaptive dynamic distance thresholds and normal vector consistency verification, along with the integration of planar factors into LiDAR backend optimization, ablation studies on the M2DGR datasets were conducted by incorporating the methodology into the LVI-SAM framework. As shown in Table 3, the proposed method achieves the highest localization accuracy. And the experimental results fully validate the effectiveness of the planar factor proposed in this paper.

Table 3.

RMSEs of ATE [ $m$ ] on M2DGR datasets.

Sequence	LIO-SAM¹⁷	VINS-Mono⁹	FAST-LIVO³⁰	LVI-SAM²⁹	APGC-LVIW(Ours)
Gate01	0.202	1.426	9.497	0.148	0.145
Gate02	0.409	*	2.927	0.317	0.315
Gate03	0.112	7.381	2.907	0.112	0.107
Street02	4.063	*	8.521	3.852	3.144
Street03	0.219	7.470	5.358	0.142	0.142
Street04	1.022	*	3.834	0.868	0.775
Street05	0.399	*	4.500	0.383	0.268
Street06	0.417	*	3.405	0.414	0.394

The optimal results are marked in red and the suboptimal results are marked in blue.

indicates that the algorithm has degraded in these sequences, resulting in significant accuracy errors.

Ablation study

To quantify the independent contributions of the two core innovations, adaptive planar factor and wheel encoder fusion, we designed three groups of experiments on the M2DGR-PLUS Tree (LiDAR-degraded, heavy occlusion) and Bridge1 (planar-dominant, degraded deck) sequences. The results are shown in Table 4.

Table 4.

Ablation results on M2DGR-PLUS sequences.

Sequence	Group	ATE (RMSE)	Increase
Tree	A (full system)	0.151	-
	B (without planar)	0.166	9.93%
	C (without wheel)	0.221	46.3%
Bridge1	A (full system)	0.161	-
	B (without planar)	0.180	11.8%
	C (without wheel)	0.251	55.9%

Group B (no planar factor) shows significant ATE degradation: In Sequence Tree and Bridge1, the loss of planar constraints increases local map noise (e.g. deck edge misalignment), leading to 9.93% and 46.3% higher ATE. This confirms the planar factor’s role in suppressing geometric ambiguity. Group C (no wheel fusion) exhibits 11.8% and 55.9% ATE increase: Wheel encoders provide absolute scale to mitigate IMU drift in low-excitation motion, so their removal reduces scale observability. Synergy of the two innovations. The full system (Group A) achieves the lowest error, proving that planar constraints and wheel fusion complement each other.

Runtime analysis

In this section, we evaluate the computation efficiency of our proposed method. Table 5 shows the runtime of each module. When the system receives camera images and LiDAR scans, our algorithm executes each module. Experimental results demonstrate that the total computational time remains at 97.937 ms per frame (including 48.65 ms for LiDAR subsystem and 49.287 ms for visual subsystem), which satisfies the real-time requirement for SLAM systems.

Table 5.

Runtime of each module (unit: ms).

Subsystem	Module	Average time
LiDAR	Feature extraction	0.947
LiDAR	Backend optimization	47.703
Camera	Feature extraction	11.253
Camera	Backend optimization	38.034
Total		97.937

Conclusion

This paper proposed APGC-LVIW, a novel multisensor fusion ground SLAM system that integrates LiDAR, visual, IMU, and wheel encoder data to address the challenges of localization and mapping in complex urban environments. The proposed system achieves high accuracy and robustness by leveraging the complementary strengths of LiDAR and visual sensors through bidirectional initialization and joint optimization. Furthermore, a planar factor based on scene geometric consistency is introduced into the LiDAR subsystem for joint optimization, which enforces geometric consistency on continuous planar structures (e.g. ground and walls), effectively reduces matching ambiguity and suppresses local map noise. This structural constraint not only improves the geometric integrity of the map but also enhances the system’s robustness to dynamic environmental changes. In future research, we plan to conduct more optimization studies on the real-time performance of the system.

Footnotes

ORCID iDs

Yan Sun

Lu Zhou

Lei Sun

Ethical considerations

This article does not contain any studies with human or animal participants.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Author contributions

Yan Sun conceived and designed the study, developed the codes, and wrote the article. Wanbiao Lin, Liangbo Hu, Bowen Peng and Jinlin Xiong aided in the theoretical formulation. Junjie Zhang assists me in conducting experiments. Lu Zhou and Lei Sun guided the study and provided critical revisions to the article.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research is funded by the National Natural Science Foundation of China 62173192 and the Shenzhen Science and Technology Program Foundation JCYJ20220530162202005.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data availability statement

All used data is contained within the article.

References

Zheng

Zou

, et al. FAST-LIVO2: fast, direct LiDAR–inertial–visual odometry. IEEE Trans Robot 2025; 41: 326–346.

Mur-Artal

Tardos

JD.

ORB-SLAM2: an open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Trans Robot 2017; 33(5): 1255–1262.

Forster

Pizzoli

Scaramuzza

SVO: Ffast semi-direct monocular visual odometry. In: 2014 IEEE international conference on robotics and automation (ICRA), 2014, pp. 15–22. Hong Kong, China.

Engel

Koltun

Cremers

Direct sparse odometry. IEEE Trans Pattern Anal Mach Intell 2018; 40(3): 611–625.

Mourikis

Roumeliotis

. A multi-state constraint Kalman filter for vision-aided inertial navigation. In: IEEE international conference on robotics and automation (ICRA), Rome, Italy, 2007, pp. 3565–3572.

Seiskari

Rantalankila

Kannala

, et al. HybVIO: pushing the limits of real-time visual-inertial odometry. In: 2022 IEEE/CVF Winter conference on applications of computer vision (WACV), 2022, pp. 287–296. Waikoloa, HI, USA.

Geneva

Eckenhoff

. OpenVINS: a research platform for visual-inertial estimation. In: 2020 IEEE international conference on robotics and automation (ICRA), Paris, France, 2020, pp. 4666–4672.

Bell

Cathey

FW.

The iterated Kalman filter update as a gauss-newton method. IEEE Trans Automat Contr 1993; 38(2): 294–297.

Qin

Shen

VINS-Mono: a robust and versatile monocular visual-inertial state estimator. IEEE Trans Robot 2018; 34(4): 1004–1020.

10.

Leutenegger

Lynen

Bosse

, et al. Keyframe-based visual–inertial odometry using nonlinear optimization. Int J Rob Res 2015; 34(3): 314–334.

11.

Wang

Zhang

Cao

, et al. Geometry-aware 3D point cloud learning for precise cutting-point detection in unstructured field environments. J Field Robot 2025; 42: 3063–3076.

12.

Zhang

Singh

Loam: lidar odometry and mapping in real-time. Robotics 2014; 2(9): 1–9.

13.

Shan

Englot

. Lego-loam: lightweight and ground-optimized lidar odometry and mapping on variable terrain. In: 2018 IEEE/RSJ international conference on intelligent robots and systems (IROS), Madrid, Spain, 2018, pp. 4758–4765. IEEE.

14.

Cai

, et al. Fast-lio2: fast direct lidarinertial odometry. IEEE Trans Robot 2022; 38(4): 2053–2073.

15.

Chen

, et al. Point-LIO: robust high-bandwidth light detection and ranging inertial odometry. Adv Intell Syst 2023; 5(7): 2200459.

16.

Chen

Yuan

, et al. Ig-lio: an incremental GICP-based tightly-coupled lidar-inertial odometry. IEEE Robot Autom Lett 2024; 9: 1883–1890.

17.

Shan

Englot

Meyers

, et al. LIOSAM: tightly-coupled lidar inertial odometry via smoothing and mapping. In: IEEE/RSJ international conference on intelligent robots and systems (IROS), Las Vegas, NV, 2020, pp. 4758–4765.

18.

Pan

Xiao

YZ.

Mulls: versatile lidar slam via multi-metric linear least square. In: 2021 IEEE international conference on robotics and automation (ICRA), Xi'an, China, 2021, pp. 11 633–11 640. IEEE.

19.

Debeunne

Vivet

A review of visual-lidar fusion based simultaneous localization and mapping. Sensors 2020; 20(7): 2068.

20.

Shao

Vijayarangan

, et al. Stereo visual inertial lidar simultaneous localization and mapping. In: 2019 IEEE/RSJ international conference on intelligent robots and systems (IROS), Macau, China, 2019, pp. 370–377. IEEE.

21.

Graeter

Wilczynski

Lauer

Limo: lidar-monocular visual odometry. In: 2018 IEEE/RSJ international conference on intelligent robots and systems (IROS), 2018, pp. 7872–7879. IEEE. Madrid, Spain.

22.

Zuo

Geneva

Lee

, et al. Lic-fusion: lidarinertial-camera odometry. In: 2019 IEEE/RSJ international conference on intelligent robots and systems (IROS), 2019, pp. 5848–5854. Macau, China.

23.

Zuo

Yang

Geneva

, et al. Lic-fusion 2.0: lidar-inertial-camera odometry with sliding-window plane-feature tracking. In: 2020 IEEE/RSJ international conference on intelligent robots and systems (IROS), 2020, pp. 5112–5119. IEEE. Las Vegas, NV, USA.

24.

Lin

Zheng

, et al. R2 LIVE: a robust, real-time, lidar-inertial-visual tightly-coupled state estimator and mapping. IEEE Robot Autom Lett 2021; 6(4): 7469–7476.

25.

Lin

Zhang

. R 3 live: a robust, real-time, RGB-colored, lidar-inertial-visual tightly-coupled state estimation and mapping package. In: 2022 international conference on robotics and automation (ICRA), Philadelphia, PA, 2022, pp. 10 672–10 678. IEEE.

26.

Lin

Zhang

R (3) LIVE++: a robust, real-time, radiance reconstruction package with a tightly-coupled LiDAR-inertial-visual state estimator. IEEE Trans Pattern Anal Mach Intell 2024; 46(12): 11168–11185.

27.

Engel

Usenko

Cremers

A photometrically calibrated benchmark for monocular visual odometry. arXiv preprint arXiv:1607.02555, 2016.

28.

Shan

Englot

Ratti

, et al. Lvi-sam: tightly-coupled lidar-visual-inertial odometry via smoothing and mapping. In: 2021 IEEE international conference on robotics and automation (ICRA), Xi’an, China, 2021, pp. 5692–5698. IEEE.

29.

Zheng

Zhu

, et al. FAST-LIVO: fast and tightly-coupled sparse-direct lidar-inertial-visual odometry. In: 2022 IEEE/RSJ international conference on intelligent robots and systems (IROS), Kyoto, Japan, 2022, pp. 4003–4009.

30.

Yuan

Deng

Ming

, et al. SR-LIVO: LiDAR-inertial-visual odometry and mapping with sweep reconstruction. IEEE Robot Autom Lett 2024; 9(6): 5110–5117.

31.

Chen

Kang

, et al. 3D vision technologies for a self-developed structural external crack damage recognition robot. Autom Constr 2024; 159: 105262.

32.

Cao

Shen

GVINS: tightly coupled GNSS–visual–inertial fusion for smooth and consistent state estimation. IEEE Trans Robot 2022; 38(4): 2004–2021.

33.

Kaess

Johannsson

Roberts

, et al. iSAM2: incremental smoothing and mapping using the Bayes tree. Int J Rob Res 2012; 31(2): 216–235.

34.

Yin

, et al. Ground-fusion: A low-cost ground SLAM system robust to corner cases. In: Proc. IEEE Int. Conf. Robot. Autom., 2024, pp. 8603–8609. Yokohama, Japan.

35.

Shi

, et al. Good features to track. In: Proceedings of the IEEE international conference on pattern recognition, 1994, pp. 593–600. Seattle, WA, USA.

36.

Lucas

Kanade

An iterative image registration technique with an application to stereo vision. In: Proceedings of the international joint conference on artificial intelligence, Vancouver, Canada, August–1981, pp. 24–28.

37.

Zhuang

. VIW-Fusion: visual-inertial-wheel fusion odometry. 2021. [online] https://github.com/TouchDeeper/VIW-Fusion

38.

Murray

Sastry

, et al. A mathematical introduction to robotic manipulation. CRC Press, 1994.

39.

Yin

, et al. M2DGR: a multi-sensor and multi-scenario slam dataset for ground robots. IEEE Robot Autom Lett 2022; 7(2): 2266–2273.

40.

Qin

Pranata

, et al. LINS: a lidar-inertial state estimator for robust and efficient navigation. In: 2020 IEEE international conference on robotics and automation (ICRA), Paris, France, 2020, pp. 8899–8906.

APGC-LVIW: LiDAR-visual-inertial-wheel SLAM with adaptive planar geometric constraints

Abstract

Keywords

Introduction

Visual SLAM

LiDAR SLAM

LiDAR-visual-inertial SLAM

Methodology

System overview

Visual subsystem

Improvements in depth estimation and initialization

Constraint factors

LiDAR subsystem

LiDAR odometry

Experiment

Accuracy results

Ablation study

Runtime analysis

Conclusion

Footnotes

ORCID iDs

Ethical considerations

Consent to participate

Consent for publication

Author contributions

Funding

Declaration of conflicting interests

Data availability statement

References