Abstract
This paper introduces a neural nonlinear model predictive control (NMPC) framework for mapless, collision-free navigation in unknown environments with Aerial Robots, using onboard range sensing. We leverage deep neural networks to encode a single range image, capturing all the available information about the environment, into a signed distance function (SDF). The proposed neural architecture consists of two cascaded networks: a convolutional encoder that compresses the input image into a low-dimensional latent vector, and a Multi-Layer Perceptron that approximates the corresponding spatial SDF. This latter network parametrizes an explicit position constraint used for collision avoidance, which is embedded in a velocity-tracking NMPC that outputs thrust and attitude commands to the robot. First, a theoretical analysis of the contributed NMPC is conducted, verifying recursive feasibility and stability properties under fixed observations. Subsequently, we evaluate the open-loop performance of the learning-based components as well as the closed-loop performance of the controller in simulations and experiments. The simulation study includes an ablation study, comparisons with two state-of-the-art local navigation methods, and an assessment of the resilience to drifting odometry. The real-world experiments are conducted in forest environments, demonstrating that the neural NMPC effectively performs collision avoidance in cluttered settings against an adversarial reference velocity input and drifting position estimates.
Introduction
Recent advances in autonomous navigation and aerial robotics have enabled the large-scale deployment of robots in several challenging applications, such as subterranean or forest exploration and ship inspection (Fang et al., 2017; Tian et al., 2020; Tranzatto et al., 2022). However, navigation in unknown and unstructured environments while relying only on onboard sensing remains a strenuous challenge. This is particularly true in perceptually-degraded environments and conditions, where mapping systems are failure-prone (Ebadi et al., 2024).
Conventional state-of-the-art navigation stacks often involve the construction of a reliable volumetric map of the scene, followed by a planning step (Sucan et al., 2012; Tranzatto et al., 2022). Although this approach has proven effective in many cases, errors at the mapping level (such as a sudden localization drift) propagate downstream, adversely affecting planning and control. Furthermore, the sequential update of dense maps before motion planning induces latency, which limits the responsiveness to unforeseen events. Enhanced resilience w.r.t. collisions can be achieved by integrating map-based methodologies with local reactive strategies that rely on instantaneous exteroceptive observations.
Accordingly, the robotics community has investigated a set of approaches for the so-called mapless navigation problem, that is, the problem of achieving collision-free navigation without relying on explicit mapping. Beyond early heuristic approaches to reactive collision avoidance (Siegwart et al., 2011), more advanced strategies have emerged in recent years. While initial efforts focused on model-based methods (Florence et al., 2018; Gao et al., 2019; Florence et al., 2020; Yadav and Tanner, 2020; Zhou et al., 2021a), deep learning-based techniques have taken the field by storm, primarily due to their scalability to high-dimensional exteroceptive inputs such as depth images. Relevant learning strategies include imitation learning (IL) (Loquercio et al., 2021; Lu et al., 2023), and reinforcement learning (RL) (Hoeller et al., 2021; Ugu et al., 2022; Kulkarni and Alexis, 2024). In addition to these end-to-end methods, recent research has investigated the use of deep learning for implicit environment representations, combined with principled control strategies (Harms et al., 2024; Jacquet and Alexis, 2024).
This work falls into this latter category, and its main contribution is threefold. First, we introduce a neural architecture for encoding a single range observation as a signed distance function (SDF) (Park et al., 2019; Sitzmann et al., 2020). Unlike prior work on (neural or non-neural) scene-scale implicit SDF encoding—which primarily targets mapping and planning applications (Ortiz et al., 2022; Wu et al., 2023)—we explicitly avoid aggregating multiple observations. Second, we integrate this representation into a neural nonlinear model predictive control (NMPC) framework for achieving collision-free, mapless navigation in unknown environments. We verify that the NMPC satisfies recursive feasibility (under fixed observations) and stability conditions. An overview block diagram of the contributed method, referred to as SDF-NMPC, is shown in Figure 1. Third, we conduct a detailed, component-wise evaluation, including a quantitative assessment of the neural environment encoding, ablation studies, comparisons with existing methods, and real-world experiments involving drifting odometry and adversarial reference inputs.
Figure 1. An overview of the proposed SDF-NMPC method. The left side depicts the neural architecture used to approximate the mapping between the input depth images and the corresponding SDF, through sampling-based training in the 3D sensor frustum. The convolutional encoder-decoder and MLP networks are trained sequentially. The right side presents the proposed control scheme, highlighting the contributed neural-constrained NMPC for velocity tracking, and the color-coded learning-based components.
The proposed NMPC relies solely on instantaneous range measurements from an onboard sensor and on odometry, which may suffer from position drift. The neural SDF encoding of the current range observation parametrizes a position constraint for the controller. To address the limited sensor field of view (FoV), the predicted motion is also constrained to remain within the visible frustum. The controller generates thrust and attitude commands that satisfy both collision and FoV constraints, and is integrated in a velocity-tracking framework for real-time control of an aerial robot (AR). The implementation of the contributed NMPC and the associated neural networks is released open-source. 1
Accordingly, this work addresses several limitations of our previous contribution in Jacquet and Alexis (2024). First, the proposed control scheme generalizes to the nonlinear dynamics of the multirotor, moving beyond the previously considered unicycle-like motions. Second, we make use of a richer 3D representation. Indeed, the SDF is well-suited for collision avoidance, being a continuous, almost everywhere differentiable 3D field that encodes both the distance and direction to the nearest obstacle. Defined purely at the position level, it can be constructed directly from sensor measurements, whereas defining safe sets for derivative states (e.g., velocity or acceleration) is often challenging for nonlinear, higher-order systems (Harms et al., 2024). Moreover, the neural encoding provides a closed-form differentiable expression, enabling straightforward integration with gradient-based NMPC. The SDF formalism also facilitates the definition of a forward-invariant safe set for the receding-horizon problem, allowing us to establish recursive feasibility of the control law without introducing excessively restrictive terminal constraints. Lastly, the evaluation is substantially more comprehensive, including an explicit evaluation of the resilience of the 0-memory approach against drifting odometry.
The remainder of this article is structured as follows. First, we provide an overview of the related literature, followed by the problem statement. Then, the proposed method is presented, covering the neural environment encoding and the collision-avoidance NMPC framework. Finally, we present a comprehensive evaluation of the method, before discussions and final remarks.
Related work
This section presents related work and especially (a) recent advances in neural NMPC, (b) methodologies for mapless navigation, and (c) neural distance fields and their applications in robotics.
Neural MPC
Nonlinear model predictive control (NMPC) has become a widely adopted control strategy for constrained systems, showing strong performance for ARs (Sun et al., 2022).
With advances in onboard computing and machine learning, learning-based MPC emerged as a successful alternative to robust or stochastic MPC, offering online adaptability to model mismatches and external disturbances (Hewing et al., 2020). Gaussian Processes, for instance, have been used to model residual dynamics in Kabzan et al. (2019) and Torrente et al. (2021). Neural networks (NNs), capable of capturing complex data patterns, have also been used to model complex dynamics, such as aerodynamic effects for quadrotor flight (Bauersfeld et al., 2021). Following a similar approach, neural NMPC has been proposed, embedding small-scale NNs to learn unmodeled dynamics (Williams et al., 2017; Chee et al., 2022; Gao et al., 2024). Going further, Syntakas and Vlachos (2024) introduce an ensemble approach based on the Monte-Carlo dropout technique, addressing the epistemic uncertainty in the neural models. RL has also been used to optimize the NMPC model, cost, and constraints to enhance closed-loop performance (Gros and Zanon, 2020), further extended by including a neural model for unknown dynamics (Adhau et al., 2024).
An open-source toolbox for integrating large NNs into NMPC is proposed in Salzmann et al. (2023), enabling tasks like navigation in turbulent flow (Salzmann et al., 2024). This paved the way for exploiting complex neural representations in NMPC. In Alhaddad et al. (2024), a transformer-based NN has been used to represent obstacle fields as repulsive cost, while Jacquet and Alexis (2024) embed depth camera observations into a differentiable occupancy map, explicitly used as a position constraint. This work builds upon this architecture.
A set of novel architectures for neural MPC has been explored. In Song and Scaramuzza (2022), the NMPC is formulated as a parametrized controller learned via policy search for agile flight. Another avenue relies on NNs to approximate the infinite-horizon cost. In Romero et al. (2024), a differentiable MPC implementation (East et al., 2020) is integrated into the RL agent, such that the MPC drives the short-term actions while the critic network manages the long-term ones. RL-based warm-starting strategies have been proposed in Reiter et al. (2024) to initialize the NMPC via policy roll-out. In Alsmeier et al. (2024), the long-horizon cost of the MPC is approximated by an NN and used as the terminal cost of a short-horizon problem, for computational efficiency. Additionally, in Celestini et al. (2024), a Transformer is used to generate candidate trajectories for warm-starting the MPC solver.
Mapless navigation
Mapless navigation has gained attention in recent years. A first class of approaches constructs local volumetric representations from recent observations using structures like k-d trees (Florence et al., 2018; Gao et al., 2019), enabling fast collision checking. Ray-casting (Yadav and Tanner, 2020) or depth map back-projection (Matthies et al., 2014) can also be used for fast collision checks. Leveraging such efficient checks, motion primitives have been used for planning (Lopez and How, 2017; Bucki et al., 2020). In Zhou et al. (2021a), a spline-based planner using differentiable repulsive forces from visible obstacles is introduced. Contrary to these works, which rely on a single depth measurement to perform checks, Florence et al. (2018) introduce a data representation for querying points in a sliding window of consecutive observations, explicitly accounting for the transform uncertainty induced by the uncertain state estimation. In Zhang et al. (2025), a NMPC is proposed leveraging a k-d tree to query the N closest obstacles for each shooting node along a pre-sampled path, that are in turn used as 1D position constraints for this shooting node. The approach struggles in cluttered settings where the path must significantly diverge from the straight-line reference.
Deep learning offers an alternative to collision checking, and exhibits good results for mapless navigation. Nguyen et al. (2022, 2024) propose a neural architecture to correlate depth images to near-future collisions for a given action. This method was extended with a variational auto-encoder (VAE) to allow more control over the latent representation, enabling semantic augmentation for task-specific navigation (Kulkarni et al., 2023). A similar approach is proposed in Lu et al. (2023) where a Q-value-inspired model predicts action values using expert policy rollouts. In Liang et al. (2024), a generative NN is used to generate candidate trajectories that satisfy traversability and coverage learned from A*-generated data.
Imitation learning (IL) has also been employed to learn how to generate sensor-based trajectories by imitating a map-informed expert policy, such as an NMPC (Tolani et al., 2021), a sampling-based expert (Loquercio et al., 2021), or a privileged RL agent (Song et al., 2023). However, IL often lacks generalization of the learned policy. Thus, numerous works tackled this problem with RL (Kahn et al., 2021a, b). For instance, Ugu et al. (2022) train a policy from depth data but neglect AR dynamics during training, while Kulkarni and Alexis (2024) incorporate dynamics and achieve real-world deployment. Some RL-based approaches also learn implicit representations of time-consistent environments and robot state (Hoeller et al., 2021).
A hybrid strategy leverages NNs to encode a compressed, interpretable environment representation. Several works use neural networks to learn control barrier functions (CBFs) for a given system, describing safety w.r.t. the obstacles captured in range measurements. A relevant instance is provided by Dawson et al. (2022), where an observation-based CBF is learned for a first-order system. The method relies on approximating the LiDAR observation dynamics, which is difficult for higher-order systems and error-prone in complex environments, because the true observation dynamics are, in general, discontinuous. More accurate prediction of the next observation can be achieved with a NeRF (Tong et al., 2023) and used in the CBF, but this remains limited to linear (first- or second-order) dynamics and struggles to extend to nonlinear motions.
Instead of approximating the observation dynamics, Harms et al. (2024) propose treating observations as parameters of safe sets in a switching approach, relying on the principle that a state safe under one observation remains safe over time, and provide a constructive approach to the LiDAR-based CBF (for an input-constrained second-order linear system). Other works approximate a CBF from obstacle-centered SDFs (Long et al., 2021) or via a GP (Keyumarsi et al., 2024), though both are restricted to first-order dynamics.
In general, these techniques struggle to faithfully encode observation-based safe sets for systems with complex (i.e., nonlinear or high-order) dynamics. Our previous work (Jacquet and Alexis, 2024) proposes an alternative that consists of using the NN to encode a simpler, position-only representation in the form of an occupancy map, while the dynamics are handled by an NMPC. However, the occupancy map gradient is nearly zero everywhere except at obstacle boundaries, limiting expressivity; our proposed method overcomes this by encoding observations as an SDF.
Implicit distance fields
In computer vision, NNs are commonly used for implicit representation of surfaces or volumes, that is, as the 0-levelset of a scalar field. Indeed, coordinate-based networks (Tancik et al., 2020) take spatial coordinates as input and output the corresponding Boolean occupancy (Chen and Zhang, 2019; Mescheder et al., 2019) or SDF values (Park et al., 2019), offering continuous, compact representations encoded in network weights or latent spaces. These methods have been applied to shape rendering (Wang et al., 2021; Azinović et al., 2022; Wang et al., 2022) and completion (Dai et al., 2017; Stutz and Geiger, 2018).
In robotics, SDFs are often organized as Euclidean signed distance functions (ESDFs), assigning voxel-wise distances to the nearest obstacle. However, obtaining an SDF from sensor data, online, can be a challenging task. Classical approaches (Newcombe et al., 2011; Oleynikova et al., 2017) often construct these incrementally from truncated signed distance functions (TSDFs), which store truncated distances along sensor rays. In Han et al. (2019), an efficient alternative using occupancy grids with fast indexing is proposed.
Consequently, neural SDFs have gained traction for mapping and localization. Neural SLAM frameworks have been introduced in Sucar et al. (2021), Zhu et al. (2022), and Wang et al. (2023), where an SDF is encoded in a multi-layer perceptron (MLP) and learned online from RGBD images. Other approaches build SDFs online using TSDF-inspired schemes (Yan et al., 2021; Ortiz et al., 2022), with extensions for object structure (Jang et al., 2024) and hierarchical representation (Vasilopoulos et al., 2024). A comprehensive mapping framework with global consistency is proposed in Pan et al. (2024). Separately, a relevant non-neural framework uses a GP-based SDF for mapping, odometry, and planning (Wu et al., 2023).
Aside from mapping and navigation, implicit SDFs are also employed for manipulation, for example, as observations to an RL agent for grasping in Mosbach and Behnke (2022). Differently, in Lee and Liu (2024), a network is trained to predict the SDF of the swept volume of a manipulator arm, to ensure operator safety. A non-neural example is provided in Marić et al. (2024), where the SDF is approximated using piecewise polynomial SDFs with enforced continuity.
Problem statement
We consider the problem of collision-free navigation in an unknown, static environment by exploiting the immediate high-resolution exteroceptive range observations. We assume a reference velocity to be available, which is followed when possible while remaining collision-free. The AR is assumed to be a rigid body, whose state vector is denoted by x.
The system state evolves according to nonlinear dynamics of the form ẋ = f(x, u) (1), where u denotes the system input vector.
We specifically tailor the scope of this work to 0-memory navigation, that is, at a given time t, the set of available exteroceptive information reduces to the most recent observation, without aggregation of past measurements.
This mapping h defines the position constraint that must be satisfied to ensure collision avoidance. However, the time derivative of h cannot be obtained, as ξ is unknown and in general discontinuous. Instead, we resort to a switching update approach, as in Harms et al. (2024), that avoids an explicit approximation of the observation dynamics. At each step k, the discrete
Our proposed method for tackling this problem is presented hereafter. In Section 4, we present an approach for approximating h.
Learning the environment representation
Given the most recent observation
The core of the proposed approach is to leverage deep learning to encode the sensor observation into a single volumetric constraint, using a neural implicit SDF constructed from a single depth image.
SDF from observations
We define the distance field as the spatial distance function to the surface of visible obstacles. Specifically, it must have negative values for non-visible space, and positive values for visible space, as shown in Figure 2. Such a function is called a distance transform (Maurer et al., 2003). As commonly done for neural SDFs (Park et al., 2019), whose representation capabilities are inherently limited, the SDF is truncated beyond some user-defined distance, denoted TSDF. For precise surface encoding, TSDF is typically a few centimeters. For navigation, however, the obstacles are represented at a larger scale and we set TSDF = 1 m.
Figure 2. A 2D visualization of the distance transform (left) with two visible obstacles (gray). The blue line marks the 0-level set of the SDF, and its dashed part illustrates its heuristic extension beyond the FoV.
However, this distance transform is only defined within the sensor frustum, and is heuristically extended beyond the FoV (cf. Figure 2).
Figure 3. Architecture of the neural networks. Convolution and deconvolution layers are pictured in orange, linear layers in green, activations in purple, and pooling layers in blue. The encoder computes a Gaussian latent representation of the image, parametrized by its mean and std. The latent is sampled and decoded to reconstruct the input image. The decoder (blue rectangle) is used for training the VAE but is disabled for inference. The Residual block for a given size N is pictured in the gray rectangle and instantiated in the network structure. Residual Deconvolution blocks follow the exact same structure.
Although there exist efficient parallelized algorithms for computing such a transform (Criminisi et al., 2008; Cao et al., 2010), those are tailored for computing the full distance transform over a given grid. This is typically not well suited for sampling-based training. Instead, we utilize an approximation algorithm to generate the target values SDF(p) at sampled points p.
We note that the precision of the approximation algorithm is therefore limited by the finest resolution of the grid. This algorithm can be tensorized for online training-data generation. It relies on a routine for evaluating the occupancy of arbitrary points given the input image, which can be efficiently implemented on GPU kernels via backprojection onto the pixel grid, for example, using Warp (Macklin, 2022).
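To make this concrete, a minimal sketch of such an occupancy routine is given below, assuming a pinhole depth camera with intrinsics K. The function name and tensor layout are illustrative, and the paper's implementation instead relies on Warp GPU kernels; this PyTorch version only conveys the backprojection logic.

```python
import torch

def occupancy(points, depth, K):
    """Backproject 3D query points (sensor frame) onto the pixel grid and
    mark as occupied (non-visible) any point lying behind the observed
    surface or outside the image. points: (N, 3), depth: (H, W), K: (3, 3)."""
    H, W = depth.shape
    uvw = points @ K.T                                # pinhole projection
    u = (uvw[:, 0] / uvw[:, 2]).long()
    v = (uvw[:, 1] / uvw[:, 2]).long()
    inside = (uvw[:, 2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    occ = torch.ones(len(points), dtype=torch.bool)   # outside frustum: non-visible
    # A point is non-visible if its depth exceeds the measured depth at its pixel
    occ[inside] = points[inside, 2] > depth[v[inside], u[inside]]
    return occ
```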
NN architecture
We use a two-step process to handle observations (Figure 3). First, a convolutional neural network (CNN) encodes the observation into a compact representation, exploiting the spatial correlation properties of convolutions. This latent vector is then passed through an MLP for further processing.
The two-step approach allows us to (a) avoid redundant computation when evaluating the SDF at multiple points given the same observation, (b) leverage spatial patterns in the input via the CNN while keeping the MLP (which parametrizes the collision constraint) lightweight, and (c) improve robustness to noise and systematic errors in the observations (Kulkarni et al., 2023) (e.g., stereo shadow and invalid LiDAR rays).
We use a β-VAE (Higgins et al., 2017) architecture with a ResNet-10 network as the encoder. It incorporates ReLU activations, batch normalizations, and dropout regularization for improved generalizability. The final convolution volume is passed through an average pooling, before flattening and processing through a fully connected (FC) layer for computing the mean (μ) and standard deviation (σ) of the Gaussian latent distribution.
The latent encoding is then combined with the spatial query as follows. Similar to Ortiz et al. (2022), a positional embedding is first applied to the 3D query point. This embedding is concatenated to the latent vector before being processed by the MLP.
We note that the VAE operates directly on pixels, and is therefore agnostic of the sensor FoV. However, because of the convolution operations, the network is trained for a specific input image ratio, which tends to differ between depth cameras (e.g., 16:9) and LiDAR range images (e.g., 4:1).
The SDF MLP operates on spatial data and implicitly encodes the intrinsic projection matrix, and is therefore trained for a specific sensor.
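For illustration, a sketch of this coordinate-based architecture is shown below. The sinusoidal embedding scheme, the number of frequencies, and the intermediate layer sizes are assumptions; the layer-size naming follows the SDF256−64 convention used later in the evaluation.

```python
import torch
import torch.nn as nn

class SDFMLP(nn.Module):
    """Sketch of the coordinate-based SDF network: a positional embedding of
    the 3D query point is concatenated with the (frozen) image latent and
    mapped to a scalar SDF value by a four-layer MLP."""
    def __init__(self, latent_dim=128, num_freqs=4, sizes=(256, 192, 128, 64)):
        super().__init__()
        embed_dim = 3 * 2 * num_freqs            # sin/cos per axis and frequency
        dims = [latent_dim + embed_dim, *sizes]
        layers = []
        for i in range(len(dims) - 1):
            layers += [nn.Linear(dims[i], dims[i + 1]), nn.ReLU()]
        self.net = nn.Sequential(*layers, nn.Linear(dims[-1], 1))
        self.register_buffer("freqs", 2.0 ** torch.arange(num_freqs))

    def forward(self, p, z):
        # Sinusoidal positional embedding of p: (N, 3) -> (N, 6 * num_freqs)
        angles = (p.unsqueeze(-1) * self.freqs).flatten(1)
        gamma = torch.cat([angles.sin(), angles.cos()], dim=-1)
        # Concatenate image latent and point embedding, regress the SDF value
        return self.net(torch.cat([z, gamma], dim=-1))
```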
Training procedure
The VAE and MLP are trained sequentially.
VAE training
Number of images used in the training and testing datasets. The sensors for each modality are an Intel D455 depth camera and an Ouster OS1 LiDAR.
We follow the training procedure in Kulkarni et al. (2023), that is, no reconstruction loss is computed for the invalid pixels caused by stereo shadow or obstructions.
Because the encoder is used to prevent collisions, all obstacles must be reconstructed despite the high compression rate, in particular those close to the robot. We therefore propose to bias the VAE to reconstruct close obstacles with higher accuracy. This is achieved by scaling the loss function according to the distance value of the target pixel. The scaling factor is computed such that it equals 1 for 0-valued pixels and w for pixels at the maximum distance dmax.
The pixel-wise MSE reconstruction loss of the VAE for an observation is weighted by this distance-dependent scaling factor. Additionally, a β-scaled Kullback-Leibler divergence (KLD) term regularizes the latent distribution toward a standard Gaussian, following the β-VAE formulation.
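As a sketch, the distance-weighted reconstruction loss could be implemented as follows. The linear decay of the scaling factor from 1 (at zero depth) to w (at dmax) is an assumed interpolation, and the masking of invalid pixels follows Kulkarni et al. (2023).

```python
import torch

def biased_recon_loss(recon, target, valid, d_max=5.0, w=0.01):
    """Pixel-wise MSE weighted by target depth: weight 1 for 0-valued pixels,
    decaying to w at d_max. Invalid pixels (stereo shadow, dropped LiDAR rays)
    carry no reconstruction loss. All tensors share shape (B, H, W)."""
    scale = 1.0 - (1.0 - w) * target.clamp(0.0, d_max) / d_max
    return (scale * (recon - target) ** 2 * valid).sum() / valid.sum()
```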
SDF training
Then, the weights of the encoder are frozen, and the MLP is trained to approximate the SDF. The training is also fully supervised, using regression of the target SDF value and gradient at sampled 3D points. Target values for sampled points are computed following the procedure described in Section 4.1. The training is performed using the same dataset as the VAE. However, because the SDF ground-truth approximation relies on backprojection onto the image plane, the image must contain only valid pixels. The images from the real sensors are therefore not used for training the MLP. The sampling process is such that points are sampled (a) uniformly within the sensor frustum and (b) in the vicinity of the observed surfaces.
The loss function is defined as the weighted sum of two MSE losses, for both the target SDF value and its gradient.
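Writing fθ(p, z) for the network's SDF prediction at point p given the latent z, one plausible instantiation of this loss over a batch of sampled points P, with assumed weights λv and λg, is

```latex
\mathcal{L}_{\mathrm{SDF}} = \frac{1}{|\mathcal{P}|} \sum_{\mathbf{p} \in \mathcal{P}}
\lambda_v \left( f_\theta(\mathbf{p}, \mathbf{z}) - \mathrm{SDF}(\mathbf{p}) \right)^{2}
+ \lambda_g \left\| \nabla_{\mathbf{p}} f_\theta(\mathbf{p}, \mathbf{z}) - \nabla_{\mathbf{p}} \mathrm{SDF}(\mathbf{p}) \right\|^{2}
```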
We note that the MSE loss on the unit norm gradient includes the satisfaction of the eikonal property (3). We report, in Appendix A, the training losses and metrics for both the VAE and SDF networks, which are used in the remainder of the paper.
Neural NMPC
Mathematical notations and coordinate frames
We denote a coordinate frame • as F•, defined by its origin O• and its orthonormal axes (x•, y•, z•).
Lastly, we consider the inertial frames depicted in Figure 4.
Figure 4. Planar representation of the relevant 3D frames used in the paper, including the inertial frames at time t0 where the sensor observation is captured (left), and the sensor frustum.
The relative position of
Lower and upper bounds on any variable are denoted with an under-bar and an over-bar, respectively.
System modeling
The AR considered hereafter is a standard co-planar multirotor. It is modeled as a rigid body of mass m centered at OB, and actuated by four or more co-planar propellers. We assume that the robot is fully enclosed in a sphere of radius r.
We denote t0 as the time at which the most recent observation is acquired. We choose to represent the system position and orientation w.r.t. the inertial, switching frame FV0.
Therefore, the system state is described, dropping the subscript •B for legibility, by the vector
Accordingly, we select the system input variables in order to control the 3D thrust (magnitude and orientation through roll and pitch) and yaw rate, as typically done for co-planar multirotors (Furrer et al., 2016). The system input vector is therefore
The system dynamics (1) considered in the problem formulation are therefore instantiated as
where
We further motivate this choice of dynamics representation in Appendix B.
The onboard range sensor, whose principal axis is
Local navigation neural MPC
Obstacle avoidance constraint
We assume that the observation
which is parametrized by
Then, we impose on the NMPC the constraint SDF(pk) ≥ r + ϵ at every shooting node k, where ϵ is a safety margin, so that the sphere enclosing the robot remains within visible free space.
Field of view constraints
Because the collision-free set is defined in the sensor frustum, the predicted positions must also be constrained to remain within it.
We denote the transformations from Euclidean coordinates to the azimuth and elevation angles as az(·) and el(·), respectively.
The resulting constraints are then written as |az(pk)| ≤ αH and |el(pk)| ≤ αV, where αH and αV are the halved horizontal and vertical FoV angles.
We note that an additional maximum range constraint would be redundant, as it is already encoded in (12) that every position past dmax is unsafe.
Objective function for velocity tracking
We assume that a reference velocity and a reference heading, both expressed w.r.t. the inertial frame FV0, are available.
Following Brescianini and D’Andrea (2020), we express the heading errors in quaternions. Let qe denote the error quaternion between the current and reference headings; its z-component qe,z captures the yaw misalignment, and minimizing the heading error is therefore equivalent to minimizing qe,z.
Additionally, a control input objective penalizes the magnitude of the roll, pitch, and yaw rate, as well as the deviation of the vertical projection of the thrust from mg.
The stage cost ℓ(x, u) is then defined as the weighted sum of these tracking and input objectives.
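A plausible form of this stage cost, with tuning weights Qv, qz, and R (an illustrative reconstruction consistent with the objectives above, not necessarily the paper's exact expression), is

```latex
\ell(\mathbf{x}, \mathbf{u}) =
\left\| \mathbf{v} - \mathbf{v}_{\mathrm{ref}} \right\|_{Q_v}^{2}
+ q_z\, q_{e,z}^{2}
+ \left\| \mathbf{u} - \mathbf{u}_{\mathrm{hover}} \right\|_{R}^{2},
\qquad
\mathbf{u}_{\mathrm{hover}} = \begin{bmatrix} mg & 0 & 0 & 0 \end{bmatrix}^{\top}
```

where u_hover is the hovering input (vertical thrust mg, and zero roll, pitch, and yaw rate).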
Theoretical analysis
In this study, we construct a terminal constraint to ensure recursive feasibility of the control law. Specifically, we enforce the terminal state to be such that there exists a sub-optimal terminal control policy that is recursively feasible under all constraints.
We then derive sufficient conditions on an appropriate terminal cost to ensure that the optimal cost becomes a non-increasing Lyapunov function under the sub-optimal terminal control policy. This step is then used to quantify a region of local stability that scales with the choice of the free parameters in the terminal cost.
For the remainder of the analysis, we consider the evolution of the system dynamics under a fixed observation. This means that we effectively neglect inaccuracies of the SDF approximation and the observation-dependent nature of the collision constraint.
The yaw dynamics are disregarded since they are decoupled from position dynamics for the considered AR.
Recursive feasibility
For all terminal states
The maximum braking policy, denoted πb(
The condition (18a) is thus enforced by the terminal constraint
Because
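As a simplified illustration of such a construction (given here for intuition only, not as the paper's exact terminal constraint): since any SDF is 1-Lipschitz, if the braking policy decelerates the AR along a straight line with a guaranteed deceleration of at least a̱, a sufficient terminal condition is

```latex
\mathrm{SDF}(\mathbf{p}_N) \;\geq\; r + \epsilon + \frac{\|\mathbf{v}_N\|^{2}}{2\,\underline{a}}
```

since every point q along the braking path then satisfies SDF(q) ≥ SDF(pN) − ‖q − pN‖ ≥ r + ϵ.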
Stability
The current problem setting does not allow the establishment of asymptotic stability under arbitrary reference velocities due to the conflicting objectives in the stage cost (reference velocity tracking) and terminal cost and policy (coming to a stop). Further, the collision constraint may prevent the AR from converging to the desired reference velocity. An intuitive example of this is when a forward reference velocity is given while a wall is blocking the way; in this case, the constraint brings the system to a full stop. As a result, asymptotic stability (to a zero-cost equilibrium) cannot be established in general.
We instead derive sufficient conditions such that the optimal cost of the controller remains non-increasing. This is achieved by selecting an appropriate terminal cost V(xN).
We emphasize that while stability is established under certain assumptions on the reference velocity
The Lyapunov stability condition on J* is written as the decrease condition J*(xk+1) − J*(xk) ≤ −ℓ(xk, u*k).
Expanding the two terms, this condition can be equivalently written as a condition on the terminal cost V(xN).
We define the terminal cost as
Nonlinear programming
The discrete-time nonlinear program (NLP) over the receding horizon T, sampled at N shooting points, at a given instant t, given a range image captured at t0 ≤ t and compressed into a latent vector, is formulated in (27).
Implementation details
The above NLP is implemented in Python using Acados (Verschueren et al., 2021) and CasADi (Andersson et al., 2019). The neural network is implemented with PyTorch, and it is interfaced with the NMPC using L4CasADi (Salzmann et al., 2023). The NLP is solved via sequential quadratic programming (SQP), with the underlying QPs solved by an interior-point method (Frison and Diehl, 2020), using a real-time iteration (RTI) scheme.
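The following sketch illustrates how a PyTorch SDF network can be exposed to the NLP as a differentiable CasADi expression via L4CasADi. The stand-in network, the embedding, and all variable names are illustrative assumptions; the released implementation may differ.

```python
import casadi as cs
import l4casadi as l4c
import torch

r, eps = 0.25, 0.10                       # robot radius and safety margin [m]
latent_dim, num_freqs = 128, 4

def embed(p):
    """Assumed sinusoidal positional embedding, mirrored from training."""
    feats = [cs.sin(2.0**k * p) for k in range(num_freqs)]
    feats += [cs.cos(2.0**k * p) for k in range(num_freqs)]
    return cs.vertcat(*feats)             # dimension 3 * 2 * num_freqs = 24

# Stand-in for the trained SDF MLP (normally loaded from a checkpoint).
mlp = torch.nn.Sequential(torch.nn.Linear(latent_dim + 24, 64),
                          torch.nn.ReLU(), torch.nn.Linear(64, 1))
sdf = l4c.L4CasADi(mlp, device="cpu")     # differentiable CasADi wrapper

p = cs.MX.sym("p", 3)                     # predicted position (decision variable)
z = cs.MX.sym("z", latent_dim)            # image latent (online NLP parameter)
g = sdf(cs.vertcat(z, embed(p))) - (r + eps)   # imposed as g >= 0 per node
```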
We use Levenberg–Marquardt (LM) regularization to improve the stability of computing sensitivities. We remark that this has a strong impact on the stability of the SQP solution under switching observations, and thus switching constraints. This is a consequence of both using a neural constraint parametrized with a large number of neurons, and of using the RTI suboptimal solving strategy, which relies on a reliable warm start from the previous solution. We empirically set the LM regularization factor to 10.
The FoV and neural constraints (27f), (27g), and (27h) are implemented as slackened constraints. This allows both to retain feasibility w.r.t. noisy feedback on the initial state
SDF-NMPC parameters used in the experimental sections.
Validation
In this section, we evaluate the components of the proposed method. First, the encoding capabilities of the VAE and SDF networks are evaluated. Then, the neural controller is evaluated through randomized simulations to assess its sensitivity to the various parameters. The proposed method is then compared against two state-of-the-art mapless navigation methods. We perform an ablation study to quantify the resilience to odometry drift. Finally, the controller is integrated into a real system and validated in hardware experiments.
In order to properly study the properties of the proposed NMPC method, we avoid relying on a map-informed high-level planner generating the velocity reference provided to the NMPC. Instead, we consider a worst-case setting and implement a naive, obstacle-agnostic goal-seeking planner. It provides a reference velocity directed toward the goal, expressed w.r.t. the inertial frame FV0, regardless of obstacles.
VAE evaluation
This subsection presents a qualitative and quantitative evaluation of the VAE reconstruction. We chose a latent space dimension of 128, which offers a good trade-off between quality and compression for the same input image size (480 × 270 pixels) (Kulkarni et al., 2023). We compare the proposed distance-weighted (biased) VAE, introduced in Section 4.3, with a standard (unbiased) VAE. The tuning parameter is fixed at w = 0.01. We also include a baseline compression method based on the fast Fourier transform (FFT), which retains the 64 complex frequencies (i.e., 128 real values) with the largest magnitudes.
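A minimal sketch of this FFT baseline, under the assumption that reconstruction simply zeroes the discarded coefficients, is:

```python
import numpy as np

def fft_compress(img, n_freqs=64):
    """Keep the n_freqs complex FFT coefficients with the largest magnitude
    (i.e., 2 * n_freqs real values) and reconstruct the image from them."""
    spec = np.fft.fft2(img)
    keep = np.argsort(np.abs(spec).ravel())[-n_freqs:]   # dominant frequencies
    mask = np.zeros(spec.size, dtype=bool)
    mask[keep] = True
    return np.fft.ifft2(np.where(mask.reshape(spec.shape), spec, 0)).real
```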
Table 4. Pixel-wise RMSE [m] using an FFT, a standard VAE, and the proposed biased VAE, on the LiDAR and Depth Camera datasets, both on the full image and on the non-background pixels. Bold values indicate the best result per row.
The biased VAE tends to reconstruct pixels closer to the camera than they actually are, as illustrated in selected instances in Figure 5. This results in a conservative approximation of the obstacles, in particular w.r.t. thin or small obstacles. Such (even partial) reconstruction of an obstacle indicates that it is encoded in the latent representation, and therefore can be captured by the SDF network.
Figure 5. Reconstruction error using the proposed biased VAE and a vanilla VAE. The blue pixels correspond to reconstructions “closer” than the actual pixel value. The green circles highlight instances of thin obstacles whose reconstructions are improved.
To assess generalization, we compute the reconstruction error on depth images from the TartanAir dataset (Wang et al., 2020). A total of 30,000 randomly sampled images are evaluated. These are cropped at dmax = 5 m, and images with little to no nearby content (within dmax) are excluded to ensure a representative sample.
The resulting pixel-wise RMSE is 0.225 m, which is consistent with the error magnitudes reported in Table 4 for our testing depth images. Specifically, this value is slightly lower as the TartanAir environments typically feature lower spatial frequency than the randomized training scenes.
The average inference time of the VAE encoder, respectively, on CPU (Intel Core i7-12800H) and GPU (NVIDIA GeForce RTX 3080 Ti Laptop), is 84.03 ms and 1.27 ms. This highlights the importance of GPU acceleration for real-time deployment.
SDF reconstruction
We now evaluate the SDF-approximating MLP.
Neural network size
We assess how the reconstruction errors vary with the number of neurons. Network size is indeed a critical parameter, as it directly affects both the solving time of the NMPC, and the quality of the obstacle avoidance (through the accuracy of the environment encoding).
Specifically, we evaluate six sets of layer sizes, and report the corresponding SDF reconstruction errors (and gradient direction errors) in Figure 6. The evaluated network sizes, detailed in Table 5, are named after the following convention: SDFN denotes a network where all four layers have N neurons, and SDFN−M denotes a network with layer sizes decreasing from N in the first layer to M in the last. We chose as the smallest architecture SDF64 the one used in Jacquet and Alexis (2024), which has been shown successful in encoding an occupancy map, that is, a different (simpler) spatial representation. The network size SDF256 is similar to the one used in Ortiz et al. (2022) for online learning of a weight-encoded SDF. Metrics are computed on a 3D grid with a resolution of 10 cm within the sensor frustum. The evaluation is performed solely on simulated data, as the ground truth cannot be obtained for images that contain invalid pixels. It can be observed that the RMSE is higher in the LiDAR case, which is consistent with the VAE results in Table 4.
Figure 6. RMSE of the SDF estimation (blue) and of its gradient orientation (orange), for the depth (circles) and LiDAR (triangles) images, as a function of the neural network size.
Table 5. Details of the six evaluated network sizes, with their naming abbreviations. The average query time [ms] for the SDF value and the corresponding gradient on a single query point is evaluated on an Intel Core i7-12800H CPU. The two networks highlighted in bold are further evaluated for closed-loop performances in Section 6.3.
In order to provide visual insights on the neural 3D field, some instances of 2D SDF reconstruction are depicted in Figure 7 for the slices zB = 0. They illustrate the convergence of the network-predicted SDF (r + ϵ)-level set (blue) toward the ground truth (cyan) as network size increases. Notably, the SDF128 architecture exhibits significant penetration into obstacles, rendering it unsuitable for navigation tasks. Indeed, Figure 6 shows a strong error decrease between SDF128 and SDF256−64. We select SDF256−64 as the default architecture for the remainder of the paper, as it provides a practical trade-off between computational efficiency and reconstruction accuracy. For comparison, we also evaluate SDF512 to explore how further accuracy improvements translate to closed-loop navigation performance.
Figure 7. 2D slice of the estimated SDF at z = 0 (shown as the green line). The green cones depict the FoV. The color gradient shows the neural SDF, and the white curve marks the visible obstacle surface. The blue and cyan curves are the neural (r + ϵ)-level sets, used as constraint, and its ground truth, respectively.
Comparison with other SDF approximations
Computation times [ms] for the construction and query (of the value and gradient) for different SDF approximation methods applied to depth images, with varying input dimensions.
While these methods compute a different SDF representation, preventing a direct one-to-one comparison, we include the reconstruction errors reported in Wu et al. (2023) and Ortiz et al. (2022) for reference alongside the values reported in Figure 6. The former reports an instance of scene RMSE of 7.7 cm, and the latter reports errors between 3 and 7 cm across different scenes. These values are comparable in scale to the RMSE obtained with our neural SDF (albeit generally lower). The gradient error reported in Ortiz et al. (2022) is also comparable, lying between 25° and 30°.
Generalizability
We further evaluate the SDF reconstruction error on depth images from the TartanAir dataset (Wang et al., 2020). Using the same 30,000 images as in Section 6.1, we compute the reconstruction error with the SDF256−64 network, on a 3D grid with a resolution of 10 cm. The recorded RMSE is 9.1 cm (+8%) for the SDF estimation, and the gradient direction error is 31° (+10%), both remaining within the same range as those reported in Figure 6 and in related works.
Ablation study
In this section, we evaluate the sensitivity of the closed-loop performance w.r.t. the various hyperparameters. Although NMPC is typically paired with a low-level feedback controller for added robustness against model mismatches, we isolate our analysis by modifying the NMPC to output body wrench commands directly. We perform the simulations in the Aerial Gym simulator (Kulkarni et al., 2025), which supports customizable physics stepping frequencies, randomized obstacle generation, and parallel rollouts. Control is executed at 50 Hz, while the physics engine runs at 500 Hz.
The simulated AR weighs 1.25 kg and has a radius r = 25 cm. We fix the safety margin ϵ = 10 cm. Each rollout samples a new obstacle configuration within a 10 × 10 × 5 m environment, as well as the start and goal locations on opposite sides. Rollouts terminate upon collision with an obstacle, or upon timing out after 30 s if the goal is not reached, indicating that the robot is stuck in a dead-end. This behavior is expected in highly cluttered scenarios, as the planner only has access to local information.
Performance metrics include success, failure, and timeout rates, as well as other metrics indicating the navigation performance: average speed, the average minimum neural SDF value encountered during the rollout, and the corresponding true SDF values computed from the range observations. The robot consistently reaches its target speed for a significant portion of time; specifically, we verify that the average 90th-percentile speed remains above 0.95 vref.
We consider two classes of environments, pictured in Figure 8. First, the obstacles are vertical pillars of circular or square sections, with diameters (or diagonals) ranging from 0.2 to 0.4 m. These are sampled using Poisson discs, with a tunable minimum inter-obstacle distance dmin. This setup creates an effectively 2D navigation problem, enabling easier qualitative interpretation. Second, we use 3D randomly sampled cuboids, spheres, pillars, and rods, with smallest dimensions as low as 0.2 m. We also assess the impact of thinner obstacles (with smallest dimensions < 0.05 m), such as small cuboids and rods.
Figure 8. Instances of the two classes of environments used in the ablation study.
Table 7. Evaluation metrics in the two classes of environments in various simulation settings. Each line corresponds to metrics gathered over 200 randomized rollouts for a single set of parameters, enumerated and described in the first column.
Pillar environments
First, we observe (lines 1–3) that the framework exhibits resilience to reduced sensor frequency and moderate state noise—maintaining failure rates under 4% even with noisy feedback. The parameters that critically affect the success metrics are the reference velocity, the obstacle density, and the SDF network size. At higher velocities (lines 4–5, 7–8, and 10–11), reduced reaction times increase sensitivity to SDF approximation errors. This is mitigated by using the SDF512 network, which enables sharper reconstructions (see Figure 7). In the simple baseline environment (lines 1–11), the FoV only marginally impacts the performance. However, increasing the obstacle density (lines 12–20) causes a drop in success (down to 20% failure rate, line 13). This is attributed to the slackened constraint (27g), which allows slight lateral drift and results in collisions in denser layouts. A larger FoV mitigates this issue. It can be noted that the baseline SDF network struggles to properly reconstruct the narrow gaps in dense environments, especially for large FoV (line 18). This leads the neural constraint to prevent motion entirely, causing timeouts despite no actual dead-ends being present (since we ensure dmin > 2(r + ϵ) = 0.7 m).
3D random clutter
Simulations in 3D environments show similar trends, though baseline failure rates are non-zero (5.5%), reflecting the known difficulty of mapless methods in cluttered settings with limited FoV. Slack FoV constraints also contribute to lateral collisions in dense areas. In such unstructured environments, increasing the FoV dramatically improves the success rate (lines 23–27), approaching 100% success in the 360° FoV case (lines 26–27). The framework also scales favorably with increasing obstacle density (line 28), but performance degrades in the presence of thin obstacles (lines 29–33), with failure rates from 8% to 13%. Thin obstacles are challenging to encode and reconstruct, especially with the smaller SDF256−64 network. The larger SDF512 improves results (line 33), reducing the failure rate to 5%.
Across all simulations, the final columns of Table 7 indicate that the SDF network is generally conservative in its distance estimates. This suggests that most SDF constraint violations stem from discontinuities in the observations and poor estimations in the SDF network.
The average NMPC solving time on an Intel Core i7-12800H CPU is 8.51 ms using SDF256−64, and 15.18 ms using SDF512.
Comparison with existing methods
In this section, the proposed SDF-NMPC method is evaluated against three state-of-the-art collision avoidance approaches based on local depth observations, both neural and non-neural:
• the spline-based EGO-Planner from Zhou et al. (2021a) (more specifically, the improved implementation EGO-Swarm (Zhou et al., 2021b, 2022));
• the neural, motion-primitive-based collision predictor ORACLE from Nguyen et al. (2024);
• the k-d tree-based NMPC from Zhang et al. (2025).
A key distinction of our method is its formal consideration of safety as an explicit constraint in the optimization, rather than the heuristic collision score in ORACLE or the optimization co-objective in EGO and KDtree-NMPC.
The comparison is conducted using the Flightmare simulator (Song et al., 2021) and benchmarking metrics from Yu et al. (2023): (1) path optimality, that is, the ratio of excess traveled distance relative to an A* baseline; (2) average speed, normalized by the optimal path length; and (3) energy efficiency (i.e., the integral of the jerk).
All methods are evaluated in the same 32 randomly sampled environments, both indoor and outdoor (Yu et al., 2023). The obstacles are sampled through Poisson disc distributions with radii chosen to cover a large range of traversability (≈ 4 to 16). The target navigation speed is 2 m/s.
Baseline parameters are drawn from publicly available implementations, and fine-tuned for performance in the evaluation environments. Simulations are performed on a workstation equipped with an NVIDIA GeForce RTX 3090 GPU and an AMD Ryzen Threadripper 3970X CPU. A visualization of the simulations is included in Extension 1, and results are reported in Figure 9.
Figure 9. Comparative metrics for the four evaluated methods. The boxplots are overlayed with the individual samples for each of the successful rollouts, colored by traversability, as defined in Yu et al. (2023).
The proposed NMPC method (using SDF256−64) outperforms the other methods in terms of success rate, achieving zero collisions. While being more computationally demanding than EGO and ORACLE, it maintains a control frequency above 100 Hz and is more efficient than the KDtree-NMPC. Notably, its computation time is more consistent around the average, whereas the baselines show increased computational load as traversability decreases.
In terms of path optimality, both our NMPC and KDtree-NMPC outperform EGO and ORACLE, benefiting from objectives of minimizing deviation from a straight path. Although KDtree-NMPC achieves higher speeds, our method still surpasses EGO and ORACLE in this regard, demonstrating strong tracking performance without compromising safety.
Finally, energy efficiency is comparable across most methods, except for ORACLE, whose switching acceleration and yawing rate outputs do not account for dynamic feasibility and smoothness.
Resilience to drifting odometry
Because the collision avoidance is formulated in the local frame, our framework is resilient to drifting odometry, which would hinder the construction of a reliable map. In this section, we present an ablation study evaluating the method’s resilience to increasing odometry drift and decreasing sensor frequency. The latter is relevant since, without new measurements, the method relies on the forward-propagated last received measurement; thus, a lower frequency implies a higher sensitivity to drift.
In this study, we perform simulations in Gazebo, and deteriorate the state feedback provided to the NMPC. Specifically, velocity and yaw rate are perturbed using Gaussian noise and a random-walk bias. Position and heading are then obtained through integration. The tilt angles are not altered, as those are non-drifting quantities directly observable using an IMU.
The parameters of these noises are empirically selected to achieve the desired relative position error (RPE) (Geiger et al., 2012) (computed for a delta of 10 m). This metric computes the position RMSE per delta of traveled distance. It offers a more meaningful measure of drift than absolute position error. For reference, the visual-inertial odometry (VIO) methods ROVIO (Bloesch et al., 2017) and VINS-Mono (Qin et al., 2018) both report RPE of 1 to 2% for the same delta.
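As a reference for how this metric can be computed, a simplified sketch follows; it neglects the rotational alignment used in the full KITTI formulation, and the function name and trajectory representation are illustrative.

```python
import numpy as np

def relative_position_error(est, gt, delta=10.0):
    """Position RMSE of relative displacements over all sub-segments of
    `delta` meters of traveled (ground-truth) distance, as a percentage.
    est, gt: (N, 3) time-aligned trajectories."""
    steps = np.linalg.norm(np.diff(gt, axis=0), axis=1)
    dist = np.concatenate([[0.0], np.cumsum(steps)])      # traveled distance
    sq_errs = []
    for i in range(len(gt)):
        j = np.searchsorted(dist, dist[i] + delta)
        if j >= len(gt):
            break
        err = (est[j] - est[i]) - (gt[j] - gt[i])         # relative drift
        sq_errs.append(err @ err)
    return 100.0 * np.sqrt(np.mean(sq_errs)) / delta      # percent per delta
```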
We perform these simulations in random, cluttered environments where spheres of 1 m radius are sampled using 3D Poisson discs with a radius of 1.5 m. The environment is 50 m long, enclosed by walls, and the robot must reach a waypoint on the opposite end (with a reference speed of 2 m/s). This setup is intentionally challenging due to high clutter and a constrained FoV, making it a rigorous test of the method’s resilience to drift. An illustrative, 2D example of such an environment is provided in Figure 10. We evaluate the success rate of the NMPC by performing rollouts in each scenario.
Figure 10. Top-view of a 2D trajectory where the NMPC receives drifting odometry (red). Despite the drift, the real system (green trajectory) avoids collisions, as the controller operates in a local observation frame independent of map-based positioning. The orange circle depicts the robot size.
Number of successes out of 10 rollouts with increasing RPE and decreasing sensor frequency. The reference velocity is 2 m/s.
These results highlight the local nature of the method: collision avoidance only requires short-term, local consistency of the state estimate (namely, reliable forward propagation between two successive range measurements), without resorting to explicitly robust formulations.
Real world experiments
System setup
We finally present hardware experiments for both the depth camera and 360° LiDAR cases. Experiments are conducted with a custom-built RMF-class quadrotor (De Petris et al., 2022), with dimensions 0.52 × 0.52 × 0.3 m and mass 2.58 kg. The AR integrates PX4-based autopilot avionics for low-level control, together with an NVIDIA Orin NX Single-Board Computer (SBC) running ROS Noetic. The exteroceptive sensor used for navigation is either an Ouster OS0-64 LiDAR or a Luxonis OAK-D Pro Wide depth camera (whose halved horizontal FoV αH is 63°). The sensor frequencies are set to 20 Hz. Further, the system employs a Texas Instruments IWR6843AOP FMCW radar sensor. Odometry is obtained by fusing a VectorNav VN-100 IMU with the LiDAR observations using CompSLAM (Khattak et al., 2020), or with the radar measurements (Nissov et al., 2024) in the drifting case. A block diagram of the system is depicted in Figure 11.
Figure 11. Block diagram of the SDF-NMPC framework. The contributed controller, which integrates the neural SDF network as a constraint, is highlighted in purple.
The VAE encoding is executed on the Orin NX GPU, with an average inference time of 12.4 ms. The neural NMPC solver (using SDF256−64) runs on the CPU, with an average solving time of 15.4 ms. The resulting control frequency, including the reference velocity generation and the overhead induced by the Python implementation, is ≈ 40 Hz.
Recordings of each of the three following experiments are presented, respectively, in Extension 2, Extension 3, and Extension 4, along with relevant visualizations to appreciate the action of the NMPC.
Experiment with depth camera
This first experiment is conducted using the depth camera for navigation, with LiDAR-inertial odometry. Figure 12 presents an overview of the experiment. The AR is tasked with reaching a waypoint 15 m ahead, through a 12 m-long tree-filled section. The reference speed is set to vref = 1.5 m/s, and the observed 90th-percentile velocity is 1.25 m/s. The resulting motion properly avoids the trees, despite the naive environment-agnostic reference velocity. The systematic noise in the input image (black pixels) is handled by the VAE. The encoded SDF shows a decrease toward the obstacles, and the constraint embedded in the NMPC (pictured as the blue line) becomes active, deflecting the trajectory to the side.
Figure 12. Third-person view of a trajectory (white path) among trees, showing the aggregated pointclouds from the depth camera. On the left is the input depth image at a given time instant, along with its VAE reconstruction, and a top-view of the zB = 0 slice of the neural SDF, its 0- and (r + ϵ)-levelsets (respectively, white and blue), the reference velocity (black arrow), and the predicted trajectory (green line).
Experiment with LiDAR with adversarial reference velocity
This second experiment is conducted using the LiDAR for navigation and LiDAR-inertial odometry. Unlike the previous setup, the reference policy is not provided by the goal-seeking planner. Instead, an adversarial velocity input is provided by a human operator through a remote joystick, actively trying to collide the AR with the trees. Figure 13 pictures three time instances along the experiment, which highlight that the NMPC effectively prevents collision by either deflecting the trajectory through some free space (instances B and C), or by bringing the robot to a full stop (instance A) when the trees form a wall-like obstacle, blocking the half-space where the velocity projection would yield a cost decrease. Additionally, Figure 13 also illustrates that the VAE is capable of successfully reconstructing the range image despite the significant amount of invalid pixels in the input image (both in the long-range regions and within four vertical clusters obscured by the propellers and the outer cage of the AR).
Figure 13. Bird’s eye visualization of a LiDAR-based experiment among trees. The top-most row depicts three zoomed-in time instances A, B, and C, along with the zB = 0 slice of the neural SDF, following the same nomenclature as Figure 12. The next two rows show the corresponding LiDAR input range image and its VAE reconstruction.
Experiment with drifting odometry
This last experiment implements the drifting odometry case discussed in Section 6.5. The experimental settings are similar to the previous one, that is, the LiDAR range image is used for maintaining collision avoidance in the presence of an adversarial velocity reference. However, instead of relying on the accurate LiDAR-based odometry, we implement a frequency modulated continuous wave (FMCW) radar-based velocity estimator which provides, through integration, a drifting position and heading odometry. Indeed, while the LiDAR-based position estimate is accurate in well-structured environments (including forests), the radar measurements have several orders of magnitude fewer points with generally greater noise, thus rendering scan-to-scan matching approaches impractical. However, the availability of Doppler measurements enables reliable velocity estimation, though without position corrections, inevitably leading to drift over time.
Specifically, we implement the radar-inertial part of the factor-graph estimator presented in Nissov et al. (2024), whose details are reported in Appendix D. The odometry has an RPE of 3% and a heading RPE of 2° against the LiDAR odometry from Khattak et al. (2020), which we use as ground truth.
Figure 14 shows a bird’s eye view of the trajectory. Both odometries (red for radar, green for ground truth) are reported, showing the progressive drift between the two. This experiment further verifies the associated results on drifting odometry previously presented in simulation within Section 6.5.
Figure 14. Bird’s eye view of the trajectory with both odometries (drifted in red, ground truth in green). The white pointcloud depicts a single observation given to the NMPC. The controller is agnostic of the map, and operates in the local frame, and therefore maintains collision avoidance despite the drift.
Conclusion
This work contributes an NMPC framework for collision avoidance using a neural SDF as a differentiable, volumetric approximation of exteroceptive data. This representation enables the optimal controller to impose position constraints, thus enforcing collision avoidance. The neural architecture is divided into two parts, trained sequentially: a VAE which encodes the image in a low-dimensional latent space, and a four-layer coordinate-based MLP which approximates the corresponding SDF. We further perform a theoretical analysis of the recursive feasibility and stability properties of the proposed NMPC, and derive corresponding terminal conditions. The neural SDF encoding is first evaluated, then the NMPC controller is evaluated through several ablation studies and comparisons. Furthermore, it is validated through outdoor experiments at 1.5 m/s, demonstrating the avoidance of trees against adversarial inputs and drifting odometry.
This method results in a tightly integrated control and collision-avoidance policy that achieves competitive navigation performance with low computational cost. Additionally, the approach exhibits strong resilience—within identified thresholds—in scenarios where odometry estimation drifts. This addresses the primary motivation for adopting mapless navigation policies. Finally, our results demonstrate the effectiveness of neural SDF encoding as a foundation for mapless navigation. Future work includes seven key directions. First, improving the framework’s implementation would allow for a reduced sensing-to-action delay. This could be achieved by optimizing the neural network architecture and weights, and improving CPU-GPU memory transfer. Second, investigating energy-based approaches to ensure that the NMPC retains a dissipative behavior when the active collision constraints prevent matching the reference velocity. Third, extending the neural representation to consider the orientation would relax the spherical robot assumption and enhance maneuverability. Fourth, incorporating visual sensing alongside or instead of range data could mitigate sensor degradation and better capture fine features, with a bimodal VAE fusing sensor inputs. Fifth, relaxing the 0-memory assumption through temporal observation windows, key-frame selection, and uncertainty-aware filtering. Another avenue for this task is the use of neural world models, encoding a temporal sequence of image data into a latent representation. Sixth, relaxing the static world assumption by predicting obstacle motion over time would further benefit from world modeling. Finally, future work could focus on uncertainty awareness, w.r.t. both the noisy state information and the coarsely approximated SDF from the (also noisy) observations, for example, by extending the proposed method to an uncertainty-informed stochastic MPC, or using a robust formulation that incorporates error bounds.
Acknowledgments
We thank Morten Nissov for his help in the setup of the radar and the corresponding estimation software.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was partially supported by the European Commission Horizon projects DIGIFOREST (EC 101070405) and SPEAR (EC 101119774).
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Supplemental Material
Supplemental material for this article is available online.
Appendix
References
