Abstract
High-speed online trajectory planning for UAVs poses a significant challenge due to the need for precise modeling of complex dynamics while also being constrained by computational limitations. This paper presents a multi-fidelity reinforcement learning method (MFRL) that aims to effectively create a realistic dynamics model and simultaneously train a planning policy that can be readily deployed in real-time applications. The proposed method involves the co-training of a planning policy and a reward estimator; the latter predicts the performance of the policy’s output and is trained efficiently through multi-fidelity Bayesian optimization. This optimization approach models the correlation between different fidelity levels, thereby constructing a high-fidelity model based on a low-fidelity foundation, which enables the accurate development of the reward model with limited high-fidelity experiments. The framework is further extended to include real-world flight experiments in reinforcement learning training, allowing the reward model to precisely reflect real-world constraints and broadening the policy’s applicability to real-world scenarios. We present rigorous evaluations by training and testing the planning policy in both simulated and real-world environments. The resulting trained policy not only generates faster and more reliable trajectories compared to the baseline snap minimization method, but it also achieves trajectory updates in 2 ms on average, while the baseline method takes several minutes.
Keywords
1. Introduction
Traditional high-performance motion-planning algorithms typically rely on mathematical models that capture the system’s geometry, dynamics, and computation. Model predictive control, for example, utilizes these models for online optimization to achieve desired outcomes. However, due to computational constraints and the complexity of the inherent nonlinear optimization, these approaches often depend on simplified models, leading to conservative solutions. Alternatively, data-driven approaches like Bayesian optimization (BO) and reinforcement learning (RL) offer more complex modeling for better planning solutions. BO is particularly effective at identifying optimal hyperparameters from a limited dataset, beneficial for modeling real-world constraints. Yet, its real-time application is limited by the need to estimate model uncertainty. RL, on the other hand, constructs high-dimensional decision-making models representing intricate behaviors and is conducive to GPU acceleration for real-time use. However, given its substantial need for training data, it is often trained in simulated environments and later adapted to real-world scenarios through Sim2Real techniques, a process that might compromise performance for adaptability.
This paper addresses the challenge of time-optimal online trajectory planning for agile vehicles with dynamic waypoint changes, as illustrated in Figure 1. This capability represents a vital component in various unmanned aerial vehicle (UAV) applications. However, crafting high-speed quadrotor trajectories is particularly demanding due to the complex aerodynamics, turning trajectory generation into a time-consuming nonlinear optimization. Moreover, real-world constraints, including the effects of control delay and state estimation error, can make the generated trajectory unreliable even if it satisfies the ideal dynamics constraints. Online planning adds yet another layer of complexity by incorporating computational time constraints to update the trajectory in real-time. As a consequence, many existing online planning methods rely on simplified constraints such as bounding velocity and acceleration (Gao et al., 2018a; Tordesillas et al., 2019; Wu et al., 2023), or on hierarchical planning structures that combine the approximated global planner with the underlying control schemes (Romero et al., 2022a), which may lead to conservative solutions.
The main contribution of this paper is a novel multi-fidelity reinforcement learning (MFRL) framework. MFRL integrates RL and BO to develop a planning policy optimized for real-world and real-time scenarios. First, this method utilizes BO to directly model feasibility boundaries from high-fidelity evaluations—accurate but expensive methods, for example, real-world experiments. It improves policy performance by accurately modeling constraint boundaries where time-optimal trajectories are typically found. Second, we implement multi-fidelity Bayesian optimization (MFBO) to reduce required high-fidelity evaluations by using low-fidelity evaluations—quick but less accurate methods, for example, analytical checks with ideal dynamics—to rapidly screen out excessively fast or slow trajectories with obvious tracking results. Third, RL uses this constraint modeling in the reward signal for policy training, resulting in a computationally efficient policy that enables real-time generation of time-optimal trajectories for online planning scenarios. Lastly, the trained policy is validated through comprehensive numerical simulations and real-world flight experiments. It outperforms the baseline minimum-snap method by achieving up to a 25% reduction in flight time, averaging a 4.7% decrease, with just 2 ms of computational time—markedly less than the several minutes required by the baseline.
2. Related work
2.1. Quadrotor trajectory planning
Quadrotor planning often employs snap minimization, leveraging the differential flatness of quadrotor dynamics for smooth trajectory generation (Mellinger and Kumar, 2011; Richter et al., 2016). Minimizing the fourth-order derivative of position, that is, snap, and the yaw acceleration produces a smooth trajectory that is less likely to violate feasibility constraints. Several methods that outperform the snap minimization method have been presented in recent years. Foehn et al. (2021) employ a general time-discretized trajectory representation, using waypoint proximity constraints to find better, time-optimal solutions. Gao et al. (2018b) streamline minimum-snap computation by partitioning the optimization problem, separating speed profile and trajectory shape optimizations. Sun et al. (2021) and Burke et al. (2020) employ spatial optimization gradients to solve nonlinear optimization for time allocation, which offers numerical stability advantages over naive nonlinear optimization. Qin et al. (2023) improve the time optimization by incorporating linear collision avoidance constraints into the objective function through projection methods, transforming the problem into unconstrained optimization. Wu et al. (2023) utilize supervised learning to learn time allocation, minimizing the computation time for time optimization. These approaches optimize trajectory time while adhering to the snap minimization objective. Mao et al. (2023), however, directly minimize time using reachability analysis within speed and acceleration bounds. Instead of using predefined velocity and acceleration constraints, Ryou et al. (2021) utilize MFBO to model the feasibility constraints from data and use the model to directly minimize trajectory time.
Due to limited computation time, most offline trajectory generation methods require approximations to be extended to the online replanning problem. Gao et al. (2018a) formulate the online planning problem as quadratic programming that can be solved in real-time by directly optimizing the velocity and acceleration that conform to the specified velocity and acceleration constraints. Tordesillas et al. (2019) use the time allocations from the previous trajectory to warm-start the optimization for the updated trajectory, thereby reducing the computation time. Romero et al. (2022a) build a velocity search graph over the prescribed waypoint sequence and find the optimal velocity profile with a Dijkstra search. As the trajectory is generated in discrete state space, the resulting trajectory may not be feasible, so feasible control inputs are generated using underlying model predictive control (Romero et al., 2022b) or a deep neural network policy (Molchanov et al., 2019). On the other hand, Kaufmann et al. (2023) train a neural network policy to directly generate thrust and body rate commands based on the vehicle’s position using RL, achieving superior performance even surpassing human racers. Nonetheless, further research is needed to expand the applicability of this learning-based approach to deploy the trained policy effectively in unseen, diverse environments.
These online planning methods often employ local planning with discrete state space optimization, allowing for greater expressiveness and easier implementation of constraints. In contrast, our paper opts for a polynomial trajectory representation. We choose this representation for its reduced optimization dimension, facilitating lighter neural networks and easier dataset creation for machine learning. Additionally, this approach allows for longer planning horizons, incorporating more future information for better overall performance. While the polynomial representation has trade-offs compared to discrete approaches, it has potential to improve expressiveness through additional variables like knot points or higher-degree polynomials.
2.2. Efficient reinforcement learning
Sampling efficiency must be improved to use RL in practical settings where computation time must be kept manageable. Model-based RL utilizes a predefined or a partially learned transition model to guide the policy search toward the feasible action space. Sutton (1990) trains a model of the transition function and then uses this model to generate extra training samples. Deisenroth and Rasmussen (2011) employ Gaussian processes to approximate the transition function, which allows model-based RL to be applied to more complex problems. Similarly, Gal et al. (2016) extend the same framework by utilizing a Bayesian neural network as the transition function model. Model-based RL is widely used in robotics applications to train controllers or design model-predictive control schemes (Nguyen-Tuong and Peters, 2010; Williams et al., 2017; Xie et al., 2018). Among these robotics applications, Cutler et al. (2015) use model-based learning with multi-fidelity optimization by merging data from simulations and experiments, which is relevant to the proposed method.
Uncertainty-aware RL is an additional strategy to boost sample efficiency, which uses uncertainty estimation to streamline exploration throughout the RL training phase. Osband et al. (2016) use a bootstrapping method to estimate policy output variance from dataset subsets, enhancing sample efficiency by selecting outputs that maximize rewards from the policy outputs’ posterior distribution. Kumar et al. (2019) use the same method of uncertainty estimation to evaluate out-of-distribution samples, aiming to stabilize the training process. Wu et al. (2021) utilize MC-dropout, introduced in Gal and Ghahramani (2016), to predict the variance of predictions and regularize the training error with the inverse of this variance. Similarly, Clements et al. (2019) and Lee et al. (2021) apply an ensemble method, which involves using multiple policy models to estimate the uncertainty, facilitating safer exploration during RL. Reviews of these methodologies are included in Lockwood and Si (2022) and Hao et al. (2023). To handle high-fidelity samples with very limited data, we adopt a Gaussian process framework with a multi-fidelity kernel, moving away from neural-network-based uncertainty estimation. We instead compensate for the limited model capacity of the Gaussian process using a variational approach, elaborated on in the subsequent section.
Using a separately learned reward model can also improve the sample efficiency of RL. For instance, preference-based RL is frequently used when dealing with an unknown reward function that is difficult to design even with expert knowledge (Abbeel et al., 2010; Biyik et al., 2020; Wirth et al., 2017). These methods often utilize a simple model, such as linear feature-based regression or a Gaussian process (Biyik et al., 2020), to learn the reward model from a small number of expert demonstrations. Konyushkova et al. (2020) utilize the idea of efficient reward learning in the offline learning setting and train the policy in a semi-supervised manner. Similar approaches are used in safety-critical systems where the number of real-world experiments must be kept to a minimum. For example, Srinivasan et al. (2020) train a model of the feasibility function and use it to estimate the expected reward in order to safely apply RL. Zhou et al. (2022) utilize a generative adversarial network (GAN) loss to train the model in a self-supervised manner. Christiano et al. (2017) update the reward model by incorporating human feedback, which is provided in the form of binary preferences for policy outputs. Likewise, Glaese et al. (2022) employ a human feedback-based approach to train large language models using RL.
Several methods have been proposed to deploy trained models in real-world robotic systems, addressing the discrepancies between simulated and real-world environments. One approach involves introducing random perturbations to the input of policy (Sadeghi et al., 2018; Akkaya et al., 2019) or dynamics model (Peng et al., 2018; Mordatch et al., 2015) during RL training. This method produces robust policies capable of handling errors related to the simulation-to-reality gap. However, this may compromise the policy’s performance, making it unsuitable for tasks that demand precise modeling of the system’s limits, like time-optimal trajectory generation. Alternatively, the real-world deployment can be enhanced by increasing the accuracy of the simulation. For instance, using system identification techniques, studies such as Tan et al. (2018), Kaspar et al. (2020), and Hwangbo et al. (2019) incorporate real-world constraints like actuation delay and friction into simulations. Meanwhile, Chang and Padif (2020), Lim et al. (2022), and Chebotar et al. (2019) iteratively refine simulations based on real-world data using learning-based methods. These approaches improve simulation accuracy and computational efficiency, making them more suitable for RL training. However, some components, like battery dynamics, remain challenging to model accurately. Moreover, complex systems such as autonomous vehicles often can’t be comprehensively simulated. These limitations necessitate real-world experiments on robot deployment, which our paper aims to make more efficient.
2.3. Bayesian optimization
In this work, we utilize Bayesian optimization (BO) to improve the efficiency of real-world experiments. These methods are particularly useful in fields where the efficient use of limited experimental resources is crucial, such as in pharmacology (Lyu et al., 2019), analog circuit design (Zhang et al., 2019), aircraft wing design (Rajnarayan, 2009), physics (Dushenko et al., 2020), and psychology (Myung et al., 2013). Using data from prior experiments, BO continuously fine-tunes a surrogate model that predicts the information gain of candidate experiments, and it then selects the experiments that best reduce uncertainty in the surrogate model (Takeno et al., 2019; Costabal et al., 2019) or most precisely locate the optimum of the objective function (Mockus et al., 1978; Hernández-Lobato et al., 2014; Wang and Jegelka, 2017; Srinivas et al., 2012). For the surrogate model, the Gaussian process (GP) is often chosen due to its efficiency in estimating uncertainty from a limited number of data points (Williams and Rasmussen, 2006). To extend GP’s applicability to large, high-dimensional datasets, several scalable methods have been developed. The inducing points method is particularly notable for big datasets, employing pseudo-data points for uncertainty quantification (Snelson and Ghahramani, 2006; Hensman et al., 2013) instead of the full data set. Additionally, the deep kernel method (Wilson et al., 2016; Calandra et al., 2016) is utilized for high-dimensional data, compressing it into low-dimensional feature vectors via deep neural networks before applying GP.
When multiple information sources are available, multi-fidelity Bayesian optimization is employed to merge information from different fidelity levels. This approach leverages cost-effective low-fidelity evaluations, like basic simulations or expert opinions, to refine the design of more expensive high-fidelity measurements such as complex simulations or real-world experiments. To incorporate information from multiple sources, the surrogate model must be modified to combine multi-fidelity evaluations. Kennedy and O’Hagan (2000) and Le Gratiet and Garnier (2014) apply a linear transformation between Gaussian process (GP) kernels to model the relationship between different fidelity levels. Perdikaris et al. (2017) and Cutajar et al. (2019) extend this method by including a more sophisticated nonlinear space-dependent transformation. Additionally, Dribusch et al. (2010) utilize the decision boundary of a support vector machine (SVM) to reduce the search space of high-fidelity data points, and Xu et al. (2017) utilize the pairwise comparison of low-fidelity evaluations to determine the adversarial boundary of the high-fidelity model.
3. Preliminaries
3.1. Minimum-snap trajectory planning
The snap minimization method utilizes the smooth polynomial that is obtained by minimizing the fourth-order derivative of position and the second-order derivative of yaw. For a sequence of waypoints
The coefficients of the polynomial trajectory are determined by solving the following optimization problem:
This method consists of three components to efficiently solve the optimization problem: 1) Inner-loop spatial optimization that uses quadratic programming to derive polynomial coefficients based on the time allocation between waypoints:
2) Outer-loop temporal optimization via nonlinear programming that minimizes the output of the inner-loop quadratic programming and determines the time allocation ratio:
3) Line search to generate a feasible trajectory by scaling the time allocation ratio acquired from the outer-loop optimization:
When a vehicle starts and ends in a stationary state, with zero velocity and acceleration, uniformly scaling the time allocations does not alter the trajectory’s shape but shifts control commands away from the default stationary control commands. Consequently, the snap minimization method can determine feasible trajectory timings by scaling the time allocations after optimizing the time allocation ratio using quadratic programming and nonlinear optimization. Figure 2 shows how the line search procedure changes the speed profile and motor speed commands. However, this approach is not effective for re-planning from non-stationary states, as scaling then alters the trajectory shape and control command topology.

Figure 1. Time-optimal re-planning problem: while a quadrotor vehicle passes through the waypoints, the remaining waypoints are randomly shifted, necessitating real-time trajectory adaptation.
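To make the line-search step concrete, the following is a minimal sketch assuming hypothetical solve_min_snap and is_feasible callables in place of the inner-loop quadratic program and the feasibility check; it illustrates the uniform time-scaling idea rather than the paper's implementation.

```python
import numpy as np

def line_search_scale(waypoints, time_ratio, solve_min_snap, is_feasible,
                      scales=np.linspace(2.0, 0.5, 31)):
    """Sketch of the line search: scale a fixed time-allocation *ratio* from slow
    to fast and keep the fastest scaling that remains feasible. For rest-to-rest
    trajectories, uniform scaling preserves the spatial shape, so only feasibility
    has to be rechecked at each candidate scale.

    solve_min_snap(waypoints, times) -> polynomial coefficients  (inner-loop QP)
    is_feasible(coeffs, times)       -> bool                     (e.g., motor bounds)
    Both callables are placeholders, not the paper's routines.
    """
    best = None
    for s in scales:                      # scales ordered from largest (slowest) down
        times = s * np.asarray(time_ratio)
        coeffs = solve_min_snap(waypoints, times)
        if is_feasible(coeffs, times):
            best = (s, coeffs)            # still feasible: remember the faster candidate
        else:
            break                         # assume feasibility is monotone in the scale
    return best
```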
3.2. Multi-fidelity Gaussian process
In this paper, we employ the multi-fidelity Gaussian process classifier (MFGPC) to efficiently model the feasibility boundary based on sparse multi-fidelity evaluations. Given a collection of data points
The covariance kernel is used to model a joint Gaussian distribution of the evaluations and the latent variables
Figure 3 illustrates the training and inference procedure of the multi-fidelity Gaussian process classifier.

Figure 2. Comparison of the speed profile and reference motor commands over trajectory time. The leftmost panel visualizes a minimum snap trajectory (MinSnap Traj.), which maintains its shape with uniform scaling of time allocation. The second panel illustrates how reducing time allocations uniformly increases the speed profile. The panels on the right compare the reference commands of the quadrotor’s four motors from minimum snap trajectories with the same waypoints but different time allocation scaling. As time allocations are reduced, the trajectory’s maximum and minimum motor speeds increasingly diverge from the hovering motor speed (Stationary).
MFGPC is further accelerated with the inducing points method (Snelson and Ghahramani, 2006; Hensman et al., 2013, 2015), which approximates the distribution P(
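As a rough sketch of the cross-fidelity covariance structure, the following builds the joint kernel of the linear autoregressive (AR(1)) construction of Kennedy and O'Hagan (2000) for two fidelity levels; the kernel choice and hyperparameters are illustrative, not the learned MFGPC.

```python
import numpy as np

def rbf(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel between row-wise data points."""
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def two_fidelity_covariance(X_low, X_high, rho=0.8, ls_low=1.0, ls_diff=0.5):
    """Joint covariance of the latent values at low- and high-fidelity inputs under
    the AR(1) model f_high(x) = rho * f_low(x) + delta(x), with delta an independent
    GP. The off-diagonal blocks carry the cross-fidelity correlation that lets sparse
    high-fidelity data borrow structure from the low-fidelity set."""
    K_ll = rbf(X_low, X_low, ls_low)
    K_lh = rho * rbf(X_low, X_high, ls_low)
    K_hh = rho**2 * rbf(X_high, X_high, ls_low) + rbf(X_high, X_high, ls_diff)
    return np.block([[K_ll, K_lh], [K_lh.T, K_hh]])

# Toy usage: 30 cheap evaluations and 5 expensive ones in a 2-D input space.
X_low, X_high = np.random.rand(30, 2), np.random.rand(5, 2)
K = two_fidelity_covariance(X_low, X_high)   # (35, 35) positive semi-definite matrix
```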
3.3. Comparison with prior work
The proposed method is built upon the MFBO framework used in Ryou et al. (2021, 2022). In Ryou et al. (2021), MFBO is employed to create trajectories for specific waypoint sequences, while in Ryou et al. (2022), they expand this to general waypoint sequences by pretraining with an MFBO-labeled subdataset and fine-tuning through RL. The current work introduces significant advancements, particularly in training efficiency and real-time re-planning. We integrate MFBO with the RL training process, effectively shortening training time and enabling real-world experiment inclusion, in contrast to the previous simulation-only approach. The model proposed in this study uses the current trajectory state as an input to the planning policy, enabling trajectory planning from non-stationary states, which is crucial for mid-course re-planning. Additionally, our planning model directly determines the absolute time for waypoint traversal, unlike the previous method’s use of proportional time distributions. This eliminates the need for extra line-search procedures to maintain trajectory feasibility, thereby enhancing the computational efficiency of trajectory generation.
4. Algorithm
4.1. Problem definition
Our objective is to develop an online planning policy model capable of generating a time-optimal quadrotor trajectory that connects updated waypoints. As shown in Figure 1, we consider the scenario in which the vehicle follows a trajectory

Figure 3. Multi-fidelity Gaussian process classifier training and inferencing. Training: the lower-fidelity kernel captures the overall structure using ample data to optimize more hyperparameters; the higher fidelity learns a linear transformation on top. Covariance between dataset pairs, z and z′ (cyan line), is estimated using previous-level guidance. Inferencing: covariances estimated between existing data and a new data point z* (red line) are used for prediction via marginalization of the training variables.
The snap minimization method can be used to convert the quadrotor trajectory generation problem into a finite-dimensional optimization formulation (Mellinger and Kumar, 2011; Richter et al., 2016; Ryou et al., 2022). In our work, we employ the extended minimum snap trajectory formulation (Ryou et al., 2022), which includes smoothness weights between polynomial pieces in the time allocation parameterization. The smoothness weights selectively relax the snap minimization objective between trajectory segments, increasing the representability of the trajectory and enabling more aggressive maneuvers. We define
In this paper, we present a multi-fidelity reinforcement learning method that solves the following optimization problem:
The algorithm integrates two complementary components: BO and RL. The BO component cost-effectively creates the training dataset for modeling feasibility constraints by identifying trajectories near decision boundaries and evaluating feasibility through real-world flight tests. To further improve efficiency, we employ MFGPC as a surrogate model, which incorporates low-fidelity evaluation results to accurately predict constraints with minimal real-world evaluations.
The RL component trains the planning policy to generate time-optimal parameters while satisfying the modeled feasibility constraints. It uses the expected trajectory time reduction as the reward function, calculated using the feasibility constraints model developed through BO. Rather than functioning as separate stages, these components function synergistically: BO leverages planning policy outputs instead of random trajectories to efficiently generate the training dataset, while RL continuously refines the policy based on the evolving constraint model. Figure 4 illustrates an overview of the proposed multi-fidelity reinforcement learning procedure.

Figure 4. Overview of the proposed multi-fidelity reinforcement learning (MFRL) method: (1) The planning policy generates actions (time allocation and smoothness weights) for waypoint sequences. (2) The reward estimator predicts trajectory feasibility and estimates rewards, where the reward is the product of the feasibility prediction and the time reduction achieved. (3) After updating the planning policy, portions of the training batch with high uncertainty in the feasibility prediction are selected for further multi-fidelity evaluation. (4) Evaluation samples vary by the computational cost of each fidelity level; results update the feasibility prediction model. (5) Iterative updates train a policy maximizing time reduction while maintaining feasibility. The method incorporates real-world experiments for direct deployment in real-world online planning applications.
4.2. Multi-fidelity evaluations
We utilize three different fidelity levels of feasibility evaluation across the paper, and our optimization goal is to satisfy the feasibility constraints at the highest fidelity level. The trajectory feasibility of each fidelity level is defined as inclusion in a corresponding feasible trajectory set, defined as
For slow trajectories, the differences between the three fidelity levels are minimal, but as speed increases, distinct behaviors emerge at each fidelity level. Figure 5 illustrates the variation in maximum and minimum motor speeds for the trajectory shown in Figure 1 across different fidelity levels. The maximum and minimum motor speeds are critical determinants of a trajectory’s feasibility because tracking errors increase when motor speed commands reach their boundaries and begin to saturate. As the trajectory time is reduced using the line search procedure in (5), the maximum and minimum motor speeds increasingly diverge from the stationary motor speed, which is the speed when the vehicle maintains hovering. At lower speeds, motor speeds are similar across all fidelity levels, but as speed increases, the motor speed at the lowest fidelity level—the reference motor speed commands—quickly diverges, whereas the motor speeds in simulations and flight experiments do so more gradually due to controllers filtering out speed command spikes. Consequently, in the trajectory depicted in Figure 1, the vehicle is capable of achieving higher speeds in the simulated environment compared to that estimated using the ideal dynamics model. It is noteworthy, however, that there may be opposite scenarios where the trajectory is slower in the simulated evaluation due to the control logic. Similarly, the discrepancies between simulated and real-world dynamics models cause the motor speeds to diverge differently between simulation and real-world scenarios. As trajectories’ speed increases, the disparity between fidelity levels becomes more pronounced. Optimal solutions for time-optimal trajectories are typically identified in areas where this fidelity gap is significant, highlighting the need for adaptation across different fidelity levels.

Figure 5. Variations in maximum and minimum motor speeds with time allocation scaling across different fidelity levels: ideal dynamics, simulation, and real-world flight. Lowering time allocations results in the trajectory’s maximum and minimum motor speeds diverging from the stationary hovering motor speed (Stationary). At low speeds, all fidelity levels exhibit similar motor speeds. However, as time allocations are reduced, the fidelity gap widens, leading to noticeable differences in the feasibility boundaries across the fidelity levels. Simulation and real-world flight cases are plotted until infeasibility or excessive tracking error occurs. The ideal-dynamics model is plotted further, becoming infeasible around 7.8 seconds when motor commands saturate.
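As an illustration of the lowest fidelity level, the following is a minimal feasibility check on the reference motor-speed commands; the admissible range below is an assumed placeholder, not the vehicle's actual limits.

```python
import numpy as np

# Assumed admissible motor-speed range; the real limits are vehicle-specific.
OMEGA_MIN, OMEGA_MAX = 150.0, 2100.0   # rad/s

def ideal_dynamics_feasible(motor_cmds: np.ndarray, margin: float = 0.0) -> bool:
    """First fidelity level: the reference motor-speed commands obtained from the
    differential-flatness transform must stay inside the admissible range.
    motor_cmds has shape (T, 4), one column per motor."""
    return bool(np.all(motor_cmds >= OMEGA_MIN + margin) and
                np.all(motor_cmds <= OMEGA_MAX - margin))
```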
4.3. Dataset generation
The training data comprises waypoint sequences commonly used in quadrotor trajectory planning, obtained by randomly sampling waypoints within topological constraints. Initially, a dataset of waypoint sequences
Once the time allocation ratios are determined, a line search, as defined in (5), is performed to find the minimal trajectory time

Figure 6. Line search across fidelity levels. (Left) First level (l = 1): reduce the trajectory time to the motor command boundary.

Figure 7. Waypoints dataset generation and usage in training and testing. Initial dataset: randomly generated waypoints scaled to room size Lspace, connected using snap minimization. Training: the planning policy reconstructs trajectories from random midpoints of waypoint sequences; the reward estimator predicts trajectory feasibility. Testing: position and yaw deviations are added to waypoint sequences to evaluate policy robustness.
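The line search across fidelity levels described above can be read roughly as the following sketch; the refinement scheme and the callables are assumptions rather than the paper's evaluators.

```python
def refine_time_label(traj_factory, feasible, t_init, step=0.05, t_max=30.0):
    """One fidelity level: starting from the previous level's label, shrink the
    trajectory time while it stays feasible, or grow it until it becomes feasible."""
    t = t_init
    if feasible(traj_factory(t)):
        while t - step > 0 and feasible(traj_factory(t - step)):
            t -= step
    else:
        while t < t_max and not feasible(traj_factory(t)):
            t += step
    return t

def multi_fidelity_time_label(traj_factory, evaluators, t0):
    """Chain the refinement through evaluators ordered from lowest to highest
    fidelity, so that expensive levels only search near the boundary already
    found by the cheap levels."""
    t = t0
    for feasible in evaluators:
        t = refine_time_label(traj_factory, feasible, t)
    return t
```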

4.4. Training reward estimator
The reward estimator calculates the expected time reduction by predicting the feasibility of trajectories generated from given time allocations and smoothness weights
To efficiently capture the correlation between different fidelity levels from a sparse dataset, the feasibility probability is modeled using a gated recurrent unit (GRU) and a multi-fidelity Gaussian process classifier (MFGPC), as illustrated in Figure 8. The GRU is used to extract a feature vector from the given waypoint sequence

Figure 8. Overview of the planning policy and reward estimator architecture. (Left) The planning policy, based on a sequence-to-sequence model, determines time allocation and smoothness weights for input waypoints. The model comprises an encoder (bi-directional GRU), decoder (GRU), VAE, and attention modules. (Right) The reward estimator predicts feasibility of the policy outputs using a multi-fidelity Gaussian process classifier. The reward signal is calculated as the product of the feasibility prediction and the time reduction achieved by the policy output.
The feasibility prediction model is pretrained using a dataset with fictitious labels. We randomly select half of the l-th fidelity dataset

Figure 9. Trajectory labeling for reward estimator pretraining. (Left) Trajectories from snap minimization (dark blue region) are sped up to their feasibility limit; further acceleration makes them infeasible (gray region). The yellow shaded region represents the already traversed portion of the trajectory. (Right) After scaling, most trajectories become infeasible, but some remain feasible when reconstructed from random midpoints in the training data.
During the RL training, we employ BO to train the reward estimator. Specifically, the reward estimator chooses a subset of the dataset along with the corresponding policy outputs and annotates them with multi-fidelity evaluations. To be specific, policy outputs with a high level of uncertainty are selected, where the uncertainty is quantified using variation ratios (Freeman, 1965):
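The variation-ratio expression itself is not reproduced above; for a binary feasibility prediction it reduces to one minus the frequency of the modal class, which gives the following selection rule (a sketch with illustrative interfaces and batch sizes):

```python
import numpy as np

def variation_ratio(p_feasible: np.ndarray) -> np.ndarray:
    """Variation ratio (Freeman, 1965) for a binary feasibility prediction:
    1 - frequency of the modal outcome. Equals 0 when the model is certain and
    0.5 when feasible/infeasible are equally likely."""
    p = np.clip(p_feasible, 0.0, 1.0)
    return 1.0 - np.maximum(p, 1.0 - p)

def select_for_evaluation(p_feasible: np.ndarray, budget: int) -> np.ndarray:
    """Pick the `budget` policy outputs whose feasibility prediction is most
    uncertain for the next batch of multi-fidelity evaluations."""
    scores = variation_ratio(p_feasible)
    return np.argsort(scores)[::-1][:budget]
```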
Finally, the expected time reduction is estimated by combining the relative time reduction with the feasibility probability:
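A hedged reading of this combination in code; the exact normalization of the time reduction is our assumption rather than the paper's formula.

```python
def expected_time_reduction(t_policy: float, t_baseline: float,
                            p_feasible: float) -> float:
    """Reward sketch: relative time saved with respect to the baseline trajectory,
    weighted by the predicted probability that the faster trajectory is feasible."""
    relative_reduction = (t_baseline - t_policy) / t_baseline
    return p_feasible * relative_reduction
```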
4.5. Training planning policy
The planning policy determines time allocations and smoothness weights from a sequence of remaining waypoints and a current trajectory state. Figure 8 illustrates the planning policy, a sequence-to-sequence model comprising an encoder (bi-directional GRU), decoder (GRU), VAE, and attention modules. To handle variable sequence lengths, we use the sequence-to-sequence language model from Cho et al. (2014) as the main module. To prevent memorization, a variational autoencoder (VAE) connects the encoder and decoder hidden states, densifying the hidden state into a low-dimensional feature vector as described in Bowman et al. (2015). We employ content-based attention (Bahdanau et al., 2015) to capture the global shape of waypoint sequences, calculating similarity between encoder and decoder hidden states,
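A minimal PyTorch sketch of this sequence-to-sequence policy follows; layer sizes, the attention scoring, and the output head are illustrative assumptions rather than the trained model's exact configuration.

```python
import torch
import torch.nn as nn

class PlanningPolicy(nn.Module):
    """Sketch of the policy described in the text: a bi-directional GRU encoder,
    a VAE bottleneck on the hidden state, a GRU decoder with content-based
    attention, and a head emitting a time allocation and smoothness weight per
    waypoint."""

    def __init__(self, in_dim=4, hid=128, z_dim=32, out_dim=2):
        super().__init__()
        self.encoder = nn.GRU(in_dim, hid, batch_first=True, bidirectional=True)
        self.to_mu = nn.Linear(2 * hid, z_dim)
        self.to_logvar = nn.Linear(2 * hid, z_dim)
        self.z_to_h = nn.Linear(z_dim, hid)
        self.decoder = nn.GRU(in_dim + 2 * hid, hid, batch_first=True)
        self.attn = nn.Linear(hid, 2 * hid)      # content-based attention query
        self.head = nn.Linear(hid, out_dim)      # (time allocation, smoothness weight)

    def forward(self, waypoints):                # waypoints: (B, T, in_dim)
        enc_out, h_n = self.encoder(waypoints)   # enc_out: (B, T, 2*hid)
        h = torch.cat([h_n[0], h_n[1]], dim=-1)  # concat fwd/bwd final states
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        dec_h = torch.tanh(self.z_to_h(z)).unsqueeze(0)          # (1, B, hid)

        outputs = []
        for t in range(waypoints.size(1)):       # decode one waypoint at a time
            query = self.attn(dec_h[-1]).unsqueeze(1)                 # (B, 1, 2*hid)
            scores = torch.softmax((query * enc_out).sum(-1), dim=-1) # (B, T)
            context = (scores.unsqueeze(-1) * enc_out).sum(1)         # (B, 2*hid)
            step_in = torch.cat([waypoints[:, t], context], dim=-1).unsqueeze(1)
            dec_out, dec_h = self.decoder(step_in, dec_h)
            outputs.append(self.head(dec_out.squeeze(1)))
        return torch.stack(outputs, dim=1), mu, logvar   # actions: (B, T, out_dim)
```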
Before the RL training, the planning policy is pretrained to reconstruct minimum-snap time allocations starting from a randomly selected midpoint in the sequence by minimizing

Figure 10. Structure of an RL episode. The policy outputs the time allocation and smoothness weight between consecutive waypoints. The agent receives a reward at the episode end, based on the estimated time reduction from all actions. The l-th fidelity policy uses the corresponding fidelity reward estimator.
The planning policy is trained to maximize the expected time reduction, which is obtained from the reward estimator model. As illustrated in Figure 10, we formulate a Markov decision process using the remaining waypoints as a state variable, s_i, and the unnormalized output of the planning policy
The proximal policy optimization (PPO) algorithm (Schulman et al., 2017) is used to update the planning policy. At each training iteration, the action
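For reference, the clipped surrogate that such an update minimizes (standard form from Schulman et al. (2017)); the advantage here would come from the reward estimator's expected time reduction minus a learned baseline, which is our reading rather than the paper's exact estimator.

```python
import torch

def ppo_loss(log_prob_new: torch.Tensor, log_prob_old: torch.Tensor,
             advantage: torch.Tensor, clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped PPO surrogate: limit how far the updated policy can move from the
    policy that collected the batch."""
    ratio = torch.exp(log_prob_new - log_prob_old)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
    return -torch.min(unclipped, clipped).mean()
```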
4.6. Balancing time reduction and tracking error
In addition to the training procedure of planning policy, we propose a method to effectively balance the time reduction and the tracking error of the output trajectory. The trained trajectory aims to operate within the tracking error bounds established during training, but it may be necessary to readjust the trajectory’s flight time and tracking error balance according to the target environment. For example, in obstacle-rich environments, precise tracking is critical and the trajectory’s speed must be reduced, whereas in opposite scenarios, the trajectory’s speed can be increased while accepting a certain degree of tracking error.
We leverage the property that the trajectory’s shape obtained from the quadratic programming in (13) remains invariant when both time allocation and the current trajectory state are scaled. To be specific, the polynomial
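A small sketch of this scaling follows; the specific derivative scaling is derived from reparameterizing time by α and is our reading of the invariance property, not necessarily the paper's exact formulation.

```python
import numpy as np

def scale_replanning_inputs(time_alloc, vel, acc, jerk, alpha):
    """Scale the time allocation by alpha while scaling the current trajectory
    state derivatives accordingly (velocity by 1/alpha, acceleration by 1/alpha**2,
    jerk by 1/alpha**3), so that the spatial shape returned by the quadratic
    program is unchanged and only the speed profile slows down (alpha > 1) or
    speeds up (alpha < 1)."""
    return (np.asarray(time_alloc) * alpha,
            np.asarray(vel) / alpha,
            np.asarray(acc) / alpha**2,
            np.asarray(jerk) / alpha**3)
```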
5. Experimental results
The proposed algorithm is evaluated by training the planning policy and testing the output of the planning policy model with flight experiments. Three different fidelity levels of evaluations are used to create the training dataset for the reward estimator. For the first fidelity level, the reference motor commands are obtained using the idealized quadrotor dynamics-based differential flatness transform presented in Mellinger and Kumar (2011). The feasible set
5.1. Implementation details
The input of the planning policy is normalized as
In the reward estimator, the input includes time allocation and smoothness weights in addition to the waypoint sequences:
Average distance and curvature between consecutive waypoints in the training dataset, with pairwise traversal times obtained by the snap minimization method.
Comparison of evaluation times and evaluation counts across fidelity levels.
To enhance training efficiency, evaluations for the first and second fidelity levels are processed on an external CPU server with 28 Intel Xeon nodes and 72 cores, parallel to policy updates on an Nvidia RTX 6000 ADA GPU. Multiple trajectory evaluations are executed concurrently across available cores. The second fidelity level, including full controllers and dynamics simulation, demands significantly more memory compared to the first fidelity level, which limits the number of concurrent evaluations possible. The majority of training time, approximately 28 hours per epoch, is consumed by these evaluations, particularly those at the second fidelity level. In real-world scenarios, flight experiments are carried out every 100 epochs, evaluating 100 trajectories over 4 hours. The total training procedures across both scenarios are completed over 3 weeks, mostly due to the multi-fidelity evaluation to refine the reward estimator, while RL updates are conducted in parallel without becoming a bottleneck. Training dataset generation takes a few days as it requires a much smaller dataset, estimating the same trajectory multiple times at varying velocities.
5.2. Evaluation in simulated environment
To evaluate our policy, we reconstruct trajectories starting from the midpoint of the waypoint sequences. Subsequently, we introduce random shifts and rotations to the remaining waypoints to assess whether our policy could generate feasible trajectories. The testing dataset is derived from 4,000 randomly generated waypoint sequences, with trajectory times (TMS) determined through numerical simulations. The testing trajectories consist of an average speed of 3.9 m/s, with a maximum of 11.2 m/s, and an average acceleration of 0.7 g, with a maximum of 2.7 g, where g denotes standard gravity acceleration (g = 9.81 m/s2).
Comparison of tracking error and trajectory time between the minimum-snap trajectories (MinSnap) and trajectories generated by the trained policy (MFRL). The policy is trained with the first and second fidelity levels in the space size Lspace =[9 m, 9 m, 3 m]. Waypoints undergo random shifts within the range of 0 to 3 m (Deviation—Pos) and rotations of 0 to 45° (Deviation—Yaw). Pos error and Yaw error denote mean and standard deviation of position (

Comparison between minimum snap and adapted trajectories in response to waypoint deviations, where waypoints are shifted by 2 m and rotated by 30°. (a) The initial and updated waypoints are shown alongside baseline and trained policy trajectories. (b) Reference motor speed commands equally sampled from each trajectory segment. (c) Simulated motor speeds, demonstrating that the adapted trajectory is 9.22% faster than the baseline while maintaining motor speeds within the admissible range. (d) and (e) Comparison of tracking errors, where the adapted trajectory remains below the threshold, unlike the minimum snap trajectory. The last row, (f) and (g) presents speed profiles before and after waypoint deviations. Background feasibility colormap is generated by random speed profile sampling and averaging of reference motor speed feasibility. Both trajectories are initially within the feasible region (f), but as waypoints change, the MFRL policy keeps the trajectory feasible, while the MinSnap baseline ventures into infeasible regions (g).
The incorporation of the first fidelity level evaluations proves essential to our methodology. Their absence would necessitate a prohibitively large amount of the second fidelity evaluations to replace the current data volume, making collection impractically time-consuming. Alternatively, maintaining only the current limited quantity of the second fidelity evaluations without the first fidelity data would result in highly inaccurate feasibility constraint predictions, causing the MFRL training process to diverge.
Tracking error and trajectory time of the trajectory adapted with the heuristic method (MinSnap Heuristic) and the trajectory obtained from the model only trained with supervised learning (Supervised-only). Both the heuristic method and the supervised-only trained model yield slower trajectories and higher tracking errors compared to the MFRL policy.
Comparison of tracking error and trajectory time between the minimum-snap trajectories (MinSnap) and trajectories generated by the trained policy (MFRL). The policy is trained with the first and second fidelity levels in the large target space Lspace =[20 m, 20 m, 4 m]. The trained policy generates trajectories that are 6.23% faster than the baseline when waypoints are unchanged.
5.3. Analysis of MFRL policy
We conduct further analysis of our training results by examining the correlation between trajectory features and time reduction. As depicted in the histogram of Figure 12(a), the trained policy consistently yields faster trajectories, denoted by positive time reduction, for the majority of waypoint sequences when compared to the baseline method. Figure 12 further illustrates the correlation between the number of remaining waypoints, the distance between these waypoints, initial speed, and time reduction. Intuitively, trajectories with more degrees of freedom to optimize, such as those with a larger number of remaining waypoints or greater remaining distances, lead the trained policy to generate faster trajectories. Likewise, slower initial speeds give the vehicle the flexibility to accelerate more in the remaining trajectory, reducing overall flight time.

Figure 12. (a) Histogram depicting the time reduction achieved by the trained policy model in comparison to the minimum-snap trajectories. (b) Time reduction regarding the number of remaining waypoints. (c) Time reduction regarding the distance between the remaining waypoints. (d) Time reduction regarding the initial speed. For (b–d), the average time reduction is plotted along with the standard deviation.
A key correlation factor we’ve identified is motor utilization. We define the motor utilization factor Umotor as the average gap between motor speed commands and stationary motor commands, divided by the maximum gap, as described in the following equation:
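Since the equation itself is not reproduced above, the following is a hedged reading of the utilization factor in code; the normalization constant is an assumption.

```python
import numpy as np

def motor_utilization(motor_cmds: np.ndarray, omega_hover: float,
                      omega_min: float, omega_max: float) -> float:
    """Hedged reading of U_motor: the average deviation of the motor-speed commands
    from the hovering (stationary) speed, normalized by the largest deviation the
    admissible range allows. motor_cmds has shape (T, 4)."""
    gap = np.abs(motor_cmds - omega_hover)
    max_gap = max(omega_max - omega_hover, omega_hover - omega_min)
    return float(gap.mean() / max_gap)
```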
Comparing the utilization factor between baseline trajectories and those from the trained model in Figure 13 reveals that the trained policy reduces the variance of the utilization factor and increases its mean. This indicates that MFRL trains the policy to maintain motor utilization at an optimal level, efficiently utilizing the vehicle’s capacity while ensuring feasibility. As shown in the right plot of Figure 13, in cases where the initial minimum snap trajectory has low motor utilization, the trained model achieves significantly greater reductions (about 12% faster) in flight time. Furthermore, the time reduction’s variation associated with motor utilization is smaller compared to other trajectory features, highlighting their strong correlation. This reduced variation demonstrates how the trained policy consistently improves performance based on the remaining optimization margin, as quantified by motor usage. Table 6, presenting data from the default space size, and Table 7, from a larger space, compare parameters such as speed, acceleration, and motor utilization between minimum-snap and MFRL trajectories. Both tables reveal that, while average speeds and accelerations are comparable, MFRL trajectories show higher maximums and a significantly greater average motor utilization with lower variance.

Figure 13. (Left) Histogram comparing motor utilization factors between minimum-snap trajectories (MinSnap) and MFRL policy trajectories (MFRL). (Right) Time reduction achieved by the trained policy regarding the initial utilization factor of the minimum-snap trajectory. The average time reduction is plotted along with the standard deviation.

Table 6. Comparison of parameters including speed (v), acceleration (a), and motor utilization factor (Umotor) between minimum-snap trajectories (MinSnap) and trajectories generated by the trained policy (MFRL) (Lspace = [9 m, 9 m, 3 m]).

Table 7. Comparison of parameters including speed (v), acceleration (a), and motor utilization factor (Umotor) between minimum-snap trajectories (MinSnap) and trajectories generated by the trained policy (MFRL) (Lspace = [20 m, 20 m, 4 m]).
Figures 14 and 15, respectively, demonstrate scenarios where the initial trajectory has a low and a high motor utilization factor. Although the baseline minimum snap trajectory typically generates smoothly changing control commands, the controller may face difficulties in tracking it at high speeds, which is due to factors such as approximations in the dynamics model or the balance of internal control outputs. The MFRL policy adjusts the trajectory by decreasing motor speeds in segments with high tracking errors and increasing motor speeds in the remaining segments, ultimately reducing the overall flight time. A high motor utilization factor indicates limited space for trajectory optimization, often due to proximity to the final waypoints or excessively high initial speeds, necessitating immediate deceleration. In such cases, the capacity of MFRL to further reduce the flight time is constrained.

Figure 14. Trajectory with a low initial motor utilization factor of 6.00%. This figure compares reference motor commands, simulated motor speed, and tracking error between the initial minimum snap trajectory and the trajectory generated by the trained model. The trained policy reduces the trajectory time by 15.31%, concurrently increasing the utilization factor to 16.32%.

Figure 15. Trajectory with a high initial motor utilization factor of 20.28%. This figure compares reference motor commands, simulated motor speed, and tracking error between the initial minimum snap trajectory and the trajectory generated by the trained model. The trained policy reduces the trajectory time by only 0.66%, while also decreasing the utilization factor to 16.38%.

The tracking error and time reduction of the trained policy can be fine-tuned using the approach outlined in Section 4.6. Figure 16 illustrates the trade-off between tracking error and time reduction based on the scale factor α. When we scale down the time allocation using α, the trajectory accelerates, but tracking error increases. When waypoints deviate, both the average tracking error and time reduction may adapt to these deviations, while the overall relationship between tracking error and time reduction remains consistent.

Figure 16. Changes of time reduction and tracking error regarding the scale factor α. As α increases, the speed profile decreases, resulting in reduced time reduction but improved tracking accuracy. Similar trends are observed when waypoints deviate randomly.
Utilizing this transformation, we analyze the performance improvements achieved through multi-fidelity evaluations and RL. By scaling the time allocation with the scale factor, we can observe trends between the scaled trajectory time and tracking error. Figure 17 compares these trends between the pretrained policy and the MFRL policy. Additionally, we train the same policy model using only low-fidelity evaluations and compare these results with those from the pretrained and RL policies developed under low-fidelity conditions. Fine-tuning the model with RL shifts this trend leftward, indicating that the planner can generate faster trajectories with similar tracking errors. Employing multi-fidelity evaluations further enhances these trends, offering improvements over training exclusively with low-fidelity models.

Figure 17. Comparison of policies trained with a multi-fidelity model: the pretrained policy (MF-MS) and the MFRL policy (MF-RL). Additionally, comparisons include the pretrained policy (LF-MS) and RL policy (LF-RL) that are trained only with the low-fidelity model. The y-axis represents the relative time reduction compared to the unscaled trajectory output from the pretrained model, which is trained using multi-fidelity evaluation.
5.4. Real-world experiment
Comparison of tracking error and trajectory time among minimum-snap trajectories (MinSnap), trajectories generated by the policy trained in simulation (MFRL (simulation)), and trajectories generated by the policy trained with real-world experiments (MFRL (real-world)). Δ Time represents the trajectory time reduction relative to the minimum-snap trajectories. The policy trained in simulation generates trajectories that are 1.69% faster than the baseline when waypoints are unchanged. Incorporating real-world experiment data into the training further improves the performance, resulting in a
Comparison of parameters including speed (v), acceleration (a), and motor utilization factor (Umotor) among minimum-snap trajectories (MinSnap), trajectories generated by the policy trained in simulation (MFRL (sim)), and trajectories generated by the policy trained with real-world experiments (MFRL (real)) (Lspace = [9 m, 9 m, 3 m]).
Additionally, we assess the trained model’s performance in an online planning scenario, where dynamically shifting waypoints necessitate real-time trajectory adaptation within the environment shown in Figure 18. We place the waypoints midway between two gates and continuously move the gates to update the waypoint positions. We commit a portion of the first trajectory segment, taking into account inference and communication delays, while adjusting the time allocation for the remainder. The time allocation between waypoints is determined by the trained policy, and we obtain the actual polynomial trajectory through quadratic programming. The trained policy is executed on a Titan Xp GPU in the host computer and communicates with the vehicle’s microcontroller to update the trajectory. By compiling the model with TensorRT, we significantly enhance the inference speed. This enables our trained model to adjust the trajectory in under 2 milliseconds, allowing real-time trajectory adaptation, in marked contrast to the baseline minimum-snap method, which takes several minutes for trajectory generation. Since the planning policy comprises simple MLPs, the inference time remains consistent even on embedded GPUs, averaging 3 milliseconds when executed on a Jetson Orin. Table 10 presents a comparison of inference times between the baseline method and our trained model. The baseline method’s line search procedure significantly increases computation time, especially when using real-world flight experiment data, where evaluating multiple speed profiles for a trajectory can take several minutes. Generally, the minimum snap method is used with the lowest-fidelity line search, but even this approach requires around 400 ms, substantially longer than our method. Our proposed method streamlines this by substituting the line search with policy inference, making it suitable for real-time applications. All computation times are measured on an Intel Xeon CPU with 28 cores in the host computer, except for the policy model, which is evaluated on the aforementioned Titan Xp GPU in the same machine. Figure 19 compares the inference times based on the number of waypoints. Even with a small number of waypoints, the policy inference requires far less computation time. The supplementary video includes demonstrations of the real-world experimental results.

Figure 18. Image of the quadrotor vehicle and the experimental environment utilized in the flight experiment. The trained model is evaluated by deploying it in real-world applications in which the waypoints change during flight and the trajectory must be adapted accordingly.

Table 10. Comparison of computation time between the minimum-snap method (MinSnap) and the trained policy (MFRL). NLP represents the computation time of nonlinear programming, which determines the ratio between time allocations. Line search (

Figure 19. Comparison of computation time with respect to the number of waypoints. The computation time of MinSnap increases as the number of waypoints increases, which also raises the dimension of optimization variables. Similarly, the computation time of MFRL rises with the increase in the number of GRU inferences, depending on the number of waypoints. However, the use of the MFRL policy improves computation time by an order of magnitude.
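To make the online re-planning procedure described above concrete, the following is a rough sketch of the loop; all callables, the commit horizon, and the timing are placeholders standing in for the experimental infrastructure rather than the actual deployment code.

```python
import time

def replanning_loop(policy, solve_qp, get_state, get_waypoints, send_trajectory,
                    commit_horizon=0.05):
    """Sketch of online re-planning: the first `commit_horizon` seconds of the
    current trajectory are committed to absorb inference and communication delay,
    the policy proposes time allocations and smoothness weights for the updated
    waypoints, and a quadratic program turns them into the polynomial streamed
    to the vehicle."""
    while True:
        state = get_state(t_ahead=commit_horizon)   # predicted state at commit time
        waypoints = get_waypoints()                 # latest (possibly shifted) waypoints
        times, weights = policy(state, waypoints)   # policy inference (~2 ms on GPU)
        trajectory = solve_qp(state, waypoints, times, weights)
        send_trajectory(trajectory, start_delay=commit_horizon)
        time.sleep(commit_horizon)
```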

6. Conclusion
6.1. Contributions
We introduce a novel sequence-to-sequence policy for generating an optimal trajectory in online planning scenarios, as well as a reinforcement learning algorithm that efficiently trains this policy by combining the evaluation from multiple sources. Our approach models the feasibility boundary entirely from data by training a reward estimator, enabling full utilization of the vehicle’s capacity. Furthermore, we employ multi-fidelity Bayesian optimization to train the reward estimation module, efficiently incorporating real-world experiments. This approach enables extensive policy optimization in regions where simulation fails to capture real-world phenomena. The trained policy model generates faster trajectories with decent reliability compared to the baseline minimum snap method, even when the waypoints are randomly deviated and the baseline method fails.
6.2. Limitations and future work
The main drawback of the proposed method is the absence of a performance guarantee. While the trained model generates faster and more feasible trajectories for most waypoint sequences, it occasionally fails. We expect this limitation could be addressed by leveraging the reward estimator beyond its primary role in training. The trained reward estimator can potentially predict output performance before physical testing, offering crucial guarantees for high-stakes dynamic systems. This predictive capability could ensure robustness and safety in planners and facilitate reachability analysis for safety-critical fields.
The long training time can be improved by redesigning the reward estimator’s architecture. We have observed that reinforcement learning convergence is influenced by the estimator’s accuracy. In this work, we use Gaussian processes for feasibility prediction due to their effectiveness with limited samples. As data accumulates, however, a basic neural network becomes more suitable, offering better accuracy with larger datasets. A challenge arises when low-fidelity datasets reach this transition threshold earlier than high-fidelity ones, forcing continued use of Gaussian processes and limiting low-fidelity sample inclusion. Developing a model that can transition between architectures and account for varying dataset sizes across fidelity levels could lead to more efficient large-scale multi-fidelity optimization, ultimately reducing training time.
Building upon the idea of adaptive architectures, increasing the reward model’s complexity and capacity could further extend the proposed method’s applicability. Since our approach treats the full system as a black box and optimizes tests based on specific goals, a more sophisticated model could incorporate additional system parameters. This enhanced capacity would allow the method to handle more complex elements, such as internal controller states, or integrate with local planners like MPC. Currently, this work only uses IMU and motion capture inputs, which could limit sources of perception uncertainty. A larger model capacity can ultimately extend this approach to full autonomy pipelines with various sensory inputs including camera and LiDAR, handling unique and high-dimensional uncertainties from the full system.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported in part by the Army Research Office through grant W911NF1910322 and the Hyundai Motor Company.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Supplemental Material
Supplemental material for this article is available online.
References
