Abstract
In this article, we introduce a learning-based vision dynamics approach to nonlinear model predictive control (NMPC) for autonomous vehicles, coined learning-based vision dynamics NMPC (LVD-NMPC). LVD-NMPC combines an a-priori process model with a learned vision dynamics model, which calculates the dynamics of the driving scene, the controlled system's desired state trajectory, and the weighting gains of the quadratic cost function optimized by a constrained predictive controller. The vision dynamics model is defined as a deep neural network designed to estimate the dynamics of the image scene. Its input consists of historic sequences of sensory observations and vehicle states, integrated by an augmented memory component. Deep Q-learning is used to train the deep network, which, once trained, can also be used to calculate the desired trajectory of the vehicle. We evaluate LVD-NMPC against a baseline dynamic window approach (DWA) path planner executed with a standard NMPC, as well as against the PilotNet neural network. Performance is measured in our simulation environment GridSim, on a real-world 1:8 scaled model car, as well as on a full-size autonomous test vehicle and the nuScenes computer vision dataset.
Introduction
Research in the area of autonomous driving has accelerated over the last decade, both in academia and industry. Autonomous vehicles are intelligent agents equipped with driving functions designed to understand their surroundings and derive control actions. As shown in the deep learning for autonomous driving survey of Grigorescu et al.,1 the driving functions are traditionally implemented as perception-planning-action pipelines. Recently, approaches based on End2End learning from Bojarski et al.2 and Pan et al.,3 or the deep reinforcement learning (DRL) shown by Kendall et al.,4 have also been proposed, although mostly as research prototypes.
In a modular perception-planning-action system, visual perception is most of the time decoupled from low-level control. A tighter coupling of perception and control was researched in the field of robotic manipulation with the concept of visual servoing, as in the case of the manipulation fault detector of Gu et al.5 However, this is not the case in autonomous vehicles, where the intrinsic dependencies between the different modules of the driving functions are not taken into account.
This work is a contribution in the area of vision dynamics and control, where the proposed learning-based vision dynamics nonlinear model predictive control (LVD-NMPC) algorithm is used for controlling autonomous vehicles. An introduction to the vision dynamics concept in learning control can be found in the work of Grigorescu.6 The block diagram of LVD-NMPC is shown in Figure 1, where the main components are a vision dynamics model, defined as a deep neural network (DNN), and a constrained nonlinear model predictive controller, which receives input from the vision model. The model, trained using the Q-learning algorithm, calculates the desired state trajectories and the tuning gains of the NMPC.

LVD-NMPC: learning-based vision dynamics nonlinear model predictive control for autonomous vehicles. The dotted lines illustrate the data flow used during training.
Synergies between data driven and classical control methods have been considered for imitation learning, where steering and acceleration control signals have been calculated in an End2End manner, as proposed by Pan et al. 3 Their approach is designed for driving environments with predefined boundaries, without any obstacles present on the driving track.
As shown by Grigorescu et al.,1 traditional decoupled visual perception systems use visual localization to estimate the pose of the ego-vehicle relative to a reference trajectory, together with obstacle detection. The information is further used by a path and behavioral planner to determine a safe driving trajectory, which is executed by the motion controller. In our work, we improve the traditional visual approach by replacing the classical perception-planning pipeline with a learned vision dynamics model. The model is used to calculate a safe driving trajectory and to estimate the optimal tuning gains of the NMPC's quadratic cost function. Our formulation combines the advantages of model-based control with the prediction capabilities of deep learning methods, enabling us to encapsulate the vision dynamics within the layers of a DNN. The key contributions of the article are as follows: the autonomous vehicle LVD nonlinear model predictive controller, based on an a-priori process model and a vision dynamics model; a DNN architecture acting as a nonlinear vision dynamics approximator, used to estimate optimal desired state trajectories and the NMPC's tuning gains; and a method for training the LVD nonlinear model predictive controller, based on imitation learning and the Q-learning training approach.
The rest of the article is organized as follows. The related work is covered in the next section. The methodology of LVD-NMPC is given in “Vision dynamics model learning and control system” section, followed by the experimental results. Finally, the article is summarized in the “Conclusions” section.
Related work
Recent years have witnessed a growing trend of applying deep learning techniques to autonomous driving, especially in the area of End2End learning, as in the methods proposed by Pan et al.,3 Fan et al.,7 and Bojarski et al.,2 as well as in DRL. Relevant algorithms for self-driving based on DRL can be found in the works of Kiran et al.,8 Kendall et al.,4 and Wulfmeier et al.9 Machine learning techniques have also found their way into more traditional control approaches, such as the uncertainty-aware NMPC of Lucia and Karg10 and the learning controllers of Ostafew et al.11 and McKinnon and Schoellig.12
End2End learning, as described by Amini et al.,13 directly maps raw input data to control signals. The approach in LVD-NMPC is similar to the one considered by Pan et al.3 In contrast to our method, however, their DNN policy is trained for agile driving on a predefined obstacle-free track. This limits the applicability of their system to autonomous driving, since a self-driving car has to navigate roads with dynamic obstacles and undefined lane boundaries. An End2End neural motion planner has been proposed by Zeng et al.,14 while Fan et al.7 designed an End2End learning system to predict the drivable surface. The latter work considers no obstacle detection and avoidance, improving solely the perception system without tackling the intrinsic dependencies between perception and low-level vehicle control.
DRL is a class of machine learning algorithms in which agents learn actions by interacting with their environment. An extensive review of DRL for autonomous driving has been published by Kiran et al.8 The main challenge with DRL for real-world physical systems is that the agent, in our case a self-driving car, learns by exploring its environment. A solution here is provided by inverse reinforcement learning (IRL), an imitation learning method for solving Markov decision processes. Wulfmeier et al.9 have extended maximum entropy IRL with a convolutional DNN for learning a navigation cost map in urban environments. However, such methods usually do not take into consideration the vehicle's state and the feedback loop required for low-level control.
NMPC, as presented by Garcia et al.,15 is a control strategy that computes control actions by solving an optimization problem built around a nonlinear dynamic model of the system. Over the last decades, it has been successfully applied to autonomous driving, both in research and in the automotive industry. Nonlinear model predictive controllers have been proposed by do Nascimento et al.16 and Nascimento et al.17 for trajectory tracking in nonholonomic mobile robots. To deal with uncertainties, learning-based approaches to model predictive control have been used by Lucia and Karg,10 as well as by Gango et al.,18 for approximating an explicit NMPC system.
Traditional feedback controllers, such as NMPC, make use of an a-priori model with fixed parameters. Unlike controllers with fixed parameters, learning controllers make use of training information to learn their models over time. In previous works, learning controllers have been built on simple function approximators, such as the Gaussian process models in the work of Ostafew et al.11 or the Bayesian regression algorithm of McKinnon and Schoellig.12
In light of the current approaches and their limitations, we propose the LVD-NMPC method for encapsulating the driving scene's dynamics within a vision dynamics model, which adapts a constrained NMPC for executing the desired vehicle state trajectory.
Vision dynamics model learning and control system
Problem definition
Figure 2 shows a simple illustration of the autonomous driving problem. Given past vehicle states, the global reference route, and a set of sensory measurements, the goal is to calculate a safe path that the vehicle can track over the control horizon.

Autonomous driving problem definition. Given the vehicle's state, the global route (red line) and a set of sensory measurements, the goal is to calculate a safe path for tracking (blue line) over the control horizon.
The reference trajectory is given by the global route that the vehicle has to follow between its start and destination points. The vehicle is modeled based on the kinematic bicycle model of a robot described by Paden et al.,19 with a position state given by the vehicle's planar coordinates and heading angle. The driving scene is modeled as the vision dynamic state, composed of the road's curvature and its traversable width. When acquiring training samples, the following quantities are stored as sequence data: the historic position states of the vehicle, the corresponding sensory observations, and the reference state trajectories.
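To make the structure of such a training sample concrete, the sketch below shows one possible way to organize the stored sequences; the field names and the fixed sequence length are illustrative assumptions, not taken from the article.

import numpy as np
from dataclasses import dataclass

@dataclass
class TrainingSample:
    """One recorded sequence used to train the vision dynamics model."""
    observations: np.ndarray          # (T, H, W, C) camera/occupancy frames
    position_states: np.ndarray       # (T, 3) historic x, y, heading of the vehicle
    reference_trajectory: np.ndarray  # (T, 2) global-route points to follow

    def is_consistent(self) -> bool:
        # All three streams must cover the same time window T.
        return (len(self.observations)
                == len(self.position_states)
                == len(self.reference_trajectory))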
Control system design
The block diagram of LVD-NMPC is shown in Figure 1. Consider a nonlinear, state-space system in which the state transition is governed by an a-priori process model, corrected by the estimates of the learned vision dynamics model. These dependencies are composed of the system's and the environment's states at time t, that is, the vehicle's state and the vision dynamic state of the driving scene. The a-priori process model is given by the kinematic bicycle model, where L is the length between the front and the rear wheels.
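As a point of reference, the snippet below implements the standard discrete-time kinematic bicycle model that the text refers to; the state layout, time step, and wheelbase value are placeholders for illustration and are not taken from the article.

import numpy as np

def bicycle_step(state, control, L=0.36, dt=0.05):
    """One Euler step of the standard kinematic bicycle model.

    state   = [x, y, theta]  position (m) and heading (rad)
    control = [v, delta]     longitudinal velocity (m/s) and steering angle (rad)
    L       = distance between the front and rear wheels (m); placeholder value
    """
    x, y, theta = state
    v, delta = control
    x_next = x + v * np.cos(theta) * dt
    y_next = y + v * np.sin(theta) * dt
    theta_next = theta + (v / L) * np.tan(delta) * dt
    return np.array([x_next, y_next, theta_next])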
We distinguish between the given reference trajectory, represented by the global route, and the desired trajectory, which is the local path computed by the vision dynamics model and tracked by the constrained NMPC. The vision dynamics model learns to predict the driving scene's dynamics, namely the road's curvature and its traversable width, from sequences of observations and past vehicle states. The scene's dynamics are used to calculate the desired trajectory of the vehicle, as well as the weighting gains of the controller's quadratic cost function. As detailed in the "Learning a vision dynamics model" section, a DNN is utilized to encode the vision dynamics model, with the curvature and the traversable width estimated by two separate recurrent branches of the network.
The curvature defines the shape of the desired path in front of the vehicle. The traversable width w quantifies the collision-free space available around that path and is used to adapt the weighting gains of the NMPC's quadratic cost function. In this way, more aggressive control actions can be chosen when the road has high traversability, indicated by a high value of w.
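A minimal sketch of this adaptation step is given below, assuming the network outputs a scalar curvature c and a traversable width w; generating the desired path as a constant-curvature arc and scaling a single tracking weight linearly with w are illustrative simplifications, not the authors' exact formulas.

import numpy as np

def desired_path_from_curvature(c, horizon_m=10.0, n_points=20):
    """Sample a constant-curvature arc ahead of the vehicle (vehicle frame)."""
    s = np.linspace(0.0, horizon_m, n_points)                 # arc-length samples
    if abs(c) < 1e-6:                                         # straight road
        return np.stack([s, np.zeros_like(s)], axis=1)
    return np.stack([np.sin(c * s) / c,                       # longitudinal offset
                     (1.0 - np.cos(c * s)) / c], axis=1)      # lateral offset

def tracking_weight_from_width(w, q_min=1.0, q_max=10.0, w_max=4.0):
    """Scale the state-tracking weight with the traversable width w (m)."""
    return q_min + (q_max - q_min) * np.clip(w / w_max, 0.0, 1.0)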
The constrained NMPC objective is to find the set of control actions that optimizes the vehicle's motion over a given prediction horizon, such that a quadratic cost penalizing the deviation of the predicted states from the desired trajectory is minimized, subject to the process model and to the actuator constraints on velocity and steering, at each iteration t.
We rely on the quadratic cost function from equation (7) and solve the nonlinear optimization problem described above using the Broyden–Fletcher–Goldfarb–Shanno algorithm of Fletcher.21 The optimization problem from equation (13a) has been solved in real time using the C++ version of the open-source Automatic Control and Dynamic Optimization (ACADO) toolkit of Houska et al.22,23
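To make the receding-horizon step concrete, the following self-contained sketch rolls out the kinematic bicycle model and minimizes a quadratic tracking cost with SciPy's L-BFGS-B solver, a bound-constrained relative of the BFGS method cited above; horizon length, weights, and actuator bounds are placeholders, and a production implementation would rely on the ACADO toolkit as the authors do.

import numpy as np
from scipy.optimize import minimize

def nmpc_step(state, desired_path, q_weight, L=0.36, horizon=10, dt=0.05):
    """One receding-horizon step: optimize [v, delta] pairs over the horizon."""
    def rollout_cost(u_flat):
        u = u_flat.reshape(horizon, 2)
        x, y, theta = state
        total = 0.0
        for k in range(horizon):
            v, delta = u[k]
            x += v * np.cos(theta) * dt                    # kinematic bicycle model
            y += v * np.sin(theta) * dt
            theta += (v / L) * np.tan(delta) * dt
            ref = desired_path[min(k, len(desired_path) - 1)]
            total += q_weight * ((x - ref[0]) ** 2 + (y - ref[1]) ** 2)  # tracking
            total += 0.1 * (v ** 2 + delta ** 2)                         # effort
        return total

    u0 = np.zeros(horizon * 2)
    bounds = [(0.0, 2.0), (-0.35, 0.35)] * horizon         # velocity / steering limits
    result = minimize(rollout_cost, u0, method="L-BFGS-B", bounds=bounds)
    return result.x[:2]                                    # apply only the first action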
Learning a vision dynamics model
The role of the vision dynamics model is to estimate the scene's dynamics, namely the road's curvature and its traversable width, from which the desired state trajectory and the NMPC's tuning gains are derived. Although the model could learn local state sequences directly, as in the previous NeuroTrajectory work of Grigorescu et al.,24 we have chosen to learn the vision dynamics model of the scene, from which the desired trajectory is subsequently computed. Given a sequence of temporal observations and historic vehicle states, provided by the augmented memory component, the model estimates the dynamics of the road ahead over the prediction horizon.
In reinforcement learning terminology, the autonomous driving problem can be described as a partially observable Markov decision process (POMDP), where I represents the sensory measurements, S is a finite set of states, and Zd is a set of trajectory sequences used by the vehicle to navigate the driving environment.
The training objective is to find the desired trajectory that maximizes the associated cumulative future reward. We define the optimal action-value function as the maximal expected return obtained when starting from a given state, selecting a desired trajectory, and following the optimal driving policy thereafter; the deep network is trained to approximate this function using the Q-learning update rule.
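For orientation, the generic one-step target that such a deep Q-learning scheme bootstraps on is sketched below; the reward definition and the way candidate trajectories are scored are not reproduced here, so the discount factor and the value of the next state are placeholders.

def q_learning_target(reward, next_state_value, done, gamma=0.99):
    """Standard one-step temporal-difference target for deep Q-learning.

    reward           : immediate reward associated with the chosen trajectory
    next_state_value : maximum predicted action-value over the candidate
                       trajectories in the next state (from the target network)
    done             : 1.0 if the episode terminated, otherwise 0.0
    """
    return reward + gamma * (1.0 - done) * next_state_value

In practice, the network's prediction is regressed toward this target, for example with a mean-squared error loss.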

Vision dynamics model implemented as a deep neural network. The training data consist of observation sequences, historic system states, and reference state trajectories. A convolutional neural network first processes the observation data stream; separate LSTM branches then calculate the road's curvature and width, which are used to obtain the desired path. LSTM: long short-term memory.
Our deep network processes sequences of continuous temporal observations coming from the augmented memory component. The augmented memory acts as a buffer in which the observations and the historic vehicle states are stored over a fixed time window and replayed as input sequences to the network.
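One way to realize such a buffer is a fixed-length deque, as in the sketch below; this is an illustrative structure rather than the authors' implementation, and the window length tau is a placeholder.

from collections import deque

class AugmentedMemory:
    """Fixed-length buffer of (observation, vehicle_state) pairs."""

    def __init__(self, tau=8):
        self.buffer = deque(maxlen=tau)   # oldest entries are dropped automatically

    def push(self, observation, vehicle_state):
        self.buffer.append((observation, vehicle_state))

    def sequence(self):
        """Return the stored history, oldest first, as two aligned lists."""
        if not self.buffer:
            return [], []
        observations, states = zip(*self.buffer)
        return list(observations), list(states)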
The architecture of our DNN is mainly based on convolutional, recurrent, and fully connected layers, shown in Figure 3 in gray, orange, and blue, respectively. The sequence of four convolutional layers is used for encoding the visual input into a latent one-dimensional intermediate representation that can be fed to the subsequent recurrent layers. In particular, the visual input is first passed through the convolutional encoder, after which the resulting latent sequence is processed by the separate LSTM branches that estimate the road's curvature and traversable width, followed by fully connected output layers.
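A compact PyTorch sketch of an architecture of this kind is given below; the number of filters, kernel sizes, and hidden dimensions are placeholders, since the exact layer parameters are not reproduced in this section.

import torch.nn as nn

class VisionDynamicsNet(nn.Module):
    """Illustrative CNN encoder followed by two LSTM heads (curvature, width)."""

    def __init__(self, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(                    # four convolutional layers
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),       # latent 1D representation
        )
        self.curvature_lstm = nn.LSTM(64, hidden, batch_first=True)
        self.width_lstm = nn.LSTM(64, hidden, batch_first=True)
        self.curvature_head = nn.Linear(hidden, 1)
        self.width_head = nn.Linear(hidden, 1)

    def forward(self, frames):                           # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        latent = self.encoder(frames.flatten(0, 1)).view(b, t, -1)
        c_seq, _ = self.curvature_lstm(latent)
        w_seq, _ = self.width_lstm(latent)
        curvature = self.curvature_head(c_seq[:, -1])    # use the last time step
        width = self.width_head(w_seq[:, -1])
        return curvature, width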
One of the biggest challenges of using data-driven techniques for control is the so-called "DAgger effect" described by Pan et al.,3 which is a pronounced drop in performance when the training and testing trajectories differ significantly. One measure to cope with this phenomenon is to ensure that sufficient data are provided at training time, thus increasing the generalization capabilities of the neural network. The "DAgger effect" is also the main reason why a deep Q-learning approach, which uses the reward function to explore different trajectories, is preferred over standard supervised imitation learning. In the next section, it is shown that a high degree of generalization can be achieved, demonstrating that LVD-NMPC can safely navigate the driving environment even if the encountered obstacles were not given at training time.
Experiments
The performance of LVD-NMPC was benchmarked against a baseline nonlearning approach, coined DWA-NMPC, as well as against the PilotNet of Bojarski et al.2 DWA-NMPC uses the DWA proposed by Fox et al.25 and Chang et al.26 for path planning and a constrained NMPC for motion control, relying on the YoloV3 algorithm of Redmon and Farhadi27 for perception.
LVD-NMPC has been tested in three different environments: (I) in the GridSim simulator, (II) for indoor navigation using the 1:8 scaled model car from Figure 4(a), and (III) on real-world driving with the full-scale autonomous test vehicle from Figure 4(b), as well as on the nuScenes computer vision dataset.

Test vehicles used for data acquisition and testing. (a) Audi 1:8 scaled model car and (b) real-sized VW Passat autonomous test vehicle.
Competing algorithms and performance metrics
DWA has been implemented based on the Robot Operating System (ROS) DWA local planner, taking into account obstacles provided by the YoloV3 object detector of Redmon and Farhadi.27 In the case of the PilotNet algorithm proposed by Bojarski et al.,2 the input images are mapped directly to the steering command of the vehicle. The steering commands are executed with an incremental value of 0.01°, dependent on PilotNet's output, while the velocity is controlled using a proportional feedback law with a fixed gain.
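For reference, the baseline's low-level commands can be summarized by the sketch below; the proportional gain, target speed, and the interpretation of "dependent on PilotNet's output" as stepping toward the predicted angle are assumptions, as the article does not list these values here.

def pilotnet_commands(predicted_steering, current_steering, current_speed,
                      target_speed=1.0, kp=0.5, increment=0.01):
    """Incremental steering update plus a proportional velocity feedback law."""
    if predicted_steering > current_steering:
        steering = current_steering + increment        # step toward the prediction
    elif predicted_steering < current_steering:
        steering = current_steering - increment
    else:
        steering = current_steering
    acceleration = kp * (target_speed - current_speed)  # proportional feedback
    return steering, acceleration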
To assess the success rate of each algorithm, the ground truth is considered to be the path driven by a human driver. The ground truth of the curvature and road width is calculated as for the trajectory sequences Zd in the POMDP setup from the "Learning a vision dynamics model" section. The curvature is given by the polynomial interpolation of the human-driven path, while the road width's ground truth is correlated with the longitudinal velocity of the vehicle.
Ideally, each method should navigate the environment collision-free, at maximum speed, and as close as possible to the ground truth. As pointed out by Codevilla et al.,28 there are limits to the offline policy evaluation employed in experiment III, which can be partially overcome by choosing an appropriate evaluation metric. The cumulative speed-weighted absolute error of Codevilla et al.28 has been chosen as the performance metric. This metric is intended to quantify experiments I and II, which are pure closed-loop experiments, on an equal footing with the offline evaluation performed in experiment III. Additionally, the average speed, the curvature error ec, and the processing time have been evaluated. ec represents the difference between the estimated and actual path curvature, calculated using polynomial interpolation, while the position error measures the deviation of the driven path from the ground-truth trajectory.
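A sketch of how these two quantities could be computed is given below, assuming synchronized arrays of predicted and ground-truth commands, speeds, and path points; the exact weighting of Codevilla et al. and the polynomial order are assumptions.

import numpy as np

def speed_weighted_absolute_error(pred_cmd, gt_cmd, speed):
    """Cumulative absolute command error, weighted by the vehicle speed."""
    return float(np.sum(np.abs(pred_cmd - gt_cmd) * speed))

def curvature_error(pred_xy, gt_xy, order=3):
    """Difference between mean path curvatures, each fitted by a polynomial y = p(x)."""
    def mean_curvature(xy):
        p = np.polyfit(xy[:, 0], xy[:, 1], order)
        d1, d2 = np.polyder(p, 1), np.polyder(p, 2)
        kappa = np.polyval(d2, xy[:, 0]) / (1.0 + np.polyval(d1, xy[:, 0]) ** 2) ** 1.5
        return float(np.mean(np.abs(kappa)))
    return abs(mean_curvature(pred_xy) - mean_curvature(gt_xy))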
The percentage of times an algorithm crashed the vehicle and the number of times the destination goals were reached have also been measured for experiments I and II. In the following, we discuss the obtained values of the computed metrics for the three competing algorithms, as summarized in Table 1.
Results for experiments I, II, and III.
LVD-NMPC: learning-based vision dynamics nonlinear model predictive control; DWA-NMPC: dynamic window approach nonlinear model predictive control; STD: standard deviation.
Experiment I: Simulation algorithm comparison
The first set of experiments consists of simulations over 10 goal-navigation trials performed in GridSim. GridSim, proposed by Trasnea et al.,29 is our autonomous driving simulation engine, which uses kinematic models to generate synthetic occupancy grids from simulated sensors. It allows multiple driving scenarios to be easily represented and loaded into the simulator. The simulation parameters are the same as in the NeuroTrajectory state trajectory planning approach of Grigorescu et al.24
For training PilotNet and LVD-NMPC, the goal-navigation task was executed repeatedly in GridSim to collect training episodes.
Overall, as indicated by the percentage of crashes and the number of reached destination goals summarized in Table 1, LVD-NMPC navigated the simulated scenarios more reliably than the two competing methods.
Experiment II: Indoor algorithm comparison
In this experiment, we have tested the algorithms on different indoor navigation tasks using the 1:8 scaled Audi model car from Figure 4(a). The reference routes which the car had to follow were defined as straight lines, sinusoids, circles, and an additional composite route.
LVD-NMPC provided the highest quantitative results, apart from the processing time, which was better for PilotNet. The main reason for DWA-NMPC's increased computation time comes from uncertainties in environment perception and localization. This is a common phenomenon in decoupled processing pipelines, where a decrease in perception accuracy produces a decrease in control performance and vice versa. This is not the case with LVD-NMPC, since perception is tightly coupled to motion control through our vision dynamics model. The model-based nature of our algorithm allowed us to outperform a model-free method such as PilotNet, which tends to exhibit a jittering effect in its control output.
A snapshot from the control loop of LVD-NMPC is shown in Figure 5, where the desired trajectory (depicted in green) is calculated using the output of the proposed DNN from Figure 3, together with a set of candidate trajectories (shown in blue) calculated using the dynamic model of the vehicle from equation (4). We have observed that the advantage of LVD-NMPC lies in the combination of the analytical dynamic model of the car with the deep network's estimations, which adapt it to unseen situations, as specified in the state estimation equation (2).
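One way to read Figure 5 is as a sample-and-select step: the sketch below rolls out constant-control candidates with the kinematic bicycle model and keeps the rollout closest to the DNN's desired path. This is an illustrative simplification rather than the authors' exact selection rule; the wheelbase, control grids, and sample counts are placeholders.

import numpy as np

def candidate_trajectories(state, speeds, steerings, L=0.36, dt=0.05, steps=20):
    """Roll out the kinematic bicycle model for a grid of constant controls."""
    candidates = []
    for v in speeds:
        for delta in steerings:
            x, y, theta = state
            path = []
            for _ in range(steps):
                x += v * np.cos(theta) * dt
                y += v * np.sin(theta) * dt
                theta += (v / L) * np.tan(delta) * dt
                path.append((x, y))
            candidates.append(np.array(path))
    return candidates

def best_candidate(candidates, desired_path):
    """Pick the rollout closest to the desired path (equal sample counts assumed)."""
    costs = [np.mean(np.linalg.norm(c - desired_path, axis=1)) for c in candidates]
    return candidates[int(np.argmin(costs))]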

Desired trajectory estimation using LVD-NMPC. The vehicle selects the best desired trajectory (green) from the set of possible candidates (blue) (best viewed in color). LVD-NMPC: learning-based vision dynamics nonlinear model predictive control.
Figure 6 shows the velocity, steering, position errors, and heading errors over a trial distance of 10 m, driven with the model car from Figure 4(a).

(a–d) Velocity, steering, position errors, and heading errors versus a 10-m travel distance using the model car from Figure 4(a). LVD-NMPC provides a smoother vehicle trajectory, leveraging the obstacles and environmental landmarks learned within the vision dynamics model. Actuator constraints are shown with black lines. LVD-NMPC: learning-based vision dynamics nonlinear model predictive control.
To assess the behavior of the methods with respect to the DAgger effect, we have placed obstacles on the reference route which were not present at training time. The DNNs embedded in LVD-NMPC and PilotNet managed to bypass the obstacles, leveraging environment landmarks present in the training data. This points to the fact that, although the learning-based approaches were able to correctly adjust the vehicle's trajectory, they still require enough training data to recognize environment landmarks.
Experiment III: On-road algorithm comparison
Finally, the third experiment evaluated LVD-NMPC on real-world driving data recorded with the full-scale autonomous test vehicle from Figure 4(b).
In addition to our own real-world driving data, we have evaluated the competing methods on the nuScenes dataset (https://www.nuscenes.org/). Among different benchmarking datasets, we have chosen nuScenes due to its sensor setup and odometry information. The data collection contains urban driving scenes recorded in Boston and Singapore.
We have encountered results similar to the ones in experiment II, with LVD-NMPC providing a more accurate estimate of the control outputs. Due to the fact that the test vehicle is a real-sized car (as opposed to the 1:8 Audi model car), the values of the evaluation metrics differ in scale from those of experiment II, as reported in Table 1.
As shown by the results from the real-world experiments (II and III), given in Table 1, our model delivers an inference time of slightly above 60 ms on an embedded NVIDIA AGX Xavier development board, equipped with an integrated Volta GPU with 512 CUDA cores. This inference time can be sufficient if the vehicle is traveling at a relatively low speed. However, a lower inference time is required for high-speed driving, where the environment also changes more rapidly; at 50 km/h, for example, the vehicle travels roughly 0.8 m during a single 60 ms inference cycle.
The metrics used in Table 1 could be aggregated into a single metric, where each element, that is, the percentages of crashes and reached goals, the average speed, the position and curvature errors, and the processing time, would be combined in a single weighted function. Nevertheless, in this case, the intrinsic values of the individual measurements would be lost. As an example, a model yielding an optimal aggregate value due to its low processing time could crash more often than a model that is slower in terms of computation time.
Conclusions
This article introduces the LVD-NMPC approach for controlling autonomous vehicles. The method uses a DNN as a vision dynamics model, which estimates both the desired state trajectory of the vehicle, given as input to a constrained nonlinear model predictive controller, and the weighting gains of that controller. One of the advantages of LVD-NMPC is that the Q-learning training is self-supervised, without requiring manual annotation of the training data. The experimental results show the robustness of the approach with respect to state-of-the-art competing algorithms, both classical and learning-based.
As future work, we plan to investigate the stability of LVD-NMPC, especially in relation to the functional safety requirements needed for automotive-grade deployment. Being already implemented on an embedded device, that is, the NVIDIA AGX Xavier, we believe that the controller can be used on real-world cars, provided that the safety requirements are met. Setting safety aside, the performance of the current implementation is directly linked to the computational power of the vehicle's computer: the faster its deep learning accelerator, the more dynamic the situations LVD-NMPC can cope with.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed the receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the European Union’s Horizon 2020 research and innovation program under grant agreement no. 800928, European Processor Initiative EPI.
Supplemental material
Supplemental material for this article is available online.
References