Sage Journals: Discover world-class research

Abstract

Over the last decade, there has been rising interest in automated driving systems and adaptive cruise control (ACC). Controllers based on reinforcement learning (RL) are particularly promising for autonomous driving, being able to optimize a combination of criteria such as efficiency, stability, and comfort. However, RL-based controllers typically offer no safety guarantees. In this paper, we propose SECRM (the Safe, Efficient, and Comfortable RL-based car-following Model) for autonomous car-following that balances traffic efficiency maximization and jerk minimization, subject to a hard analytic safety constraint on acceleration. The acceleration constraint is derived from the criterion that the follower vehicle must have sufficient headway to be able to avoid a crash if the leader vehicle brakes suddenly. We critique safety criteria based on the time-to-collision (TTC) threshold (commonly used for RL controllers), and confirm in simulator experiments that a representative previous TTC-threshold-based RL autonomous-vehicle controller may crash (in both training and testing). In contrast, we verify that our controller SECRM is safe, in training scenarios with a wide range of leader behaviors, and in both regular-driving and emergency-braking test scenarios. We find that SECRM compares favorably in efficiency, comfort, and speed-following to both classical (non-learned) car-following controllers (intelligent driver model, Shladover, Gipps) and a representative RL-based car-following controller.

Keywords

automated/autonomous vehicles intelligent transportation systems safety reinforcement learning artificial intelligence and advanced computing applications

Autonomous driving started to come to reality with the development of sensors and artificial intelligence (AI). One of the main advantages of autonomous vehicles (AVs) is their ability to overcome the inherent system randomness in human driving behavior that creates instability in the traffic system ( 1 ) resulting in traffic jams ( 2 ). Furthermore, AVs could potentially learn to outperform human driving in safety, efficiency (tight headways), and comfort (low jerk) ( 3 ).

A car-following controller is the component of an AV system that sets the longitudinal (within-lane) acceleration of a vehicle. Achieving safe, efficient, and comfortable car-following is crucial in autonomous driving. In traffic flow theory, classic car-following models (CFMs) are based on physical knowledge and human driving behaviors. Several standard CFMs have been developed to mimic human driving behavior. For example, the Gipps model ( 4 ) imitates human driving by considering both speed-following mode (without leading vehicle) and leader-following mode (with the leading vehicle) and takes the smaller of the two velocities as the target to decide whether to apply acceleration or deceleration. The target speed is also affected by some safety constraints ( 4 ). Another example is the intelligent driver model (IDM) ( 5 ), in which the applied acceleration depends on the desired velocity, desired headway, relative velocity, and true headway.

Recently, different applications that depend on deep learning (DL)/deep neural networks (DNNs) have outperformed human experts in different fields, motivating many researchers to adopt these methods in the area of AVs ( 3 , 6 –8). The deep reinforcement learning (DRL) technique is the use of reinforcement learning (RL) with DNNs to learn the optimization of certain metrics such as safety, efficiency, and comfort in autonomous driving. The model interacts with the controlled environment and learns from experience to optimize the given set of metrics (formalized as a reward signal). Isele et al. ( 9 ) utilized DRL to optimize lane-changing maneuvers. In Isele et al. ( 9 ), Gong et al. ( 10 ), and Zhou et al. ( 11 ), DRL is applied to optimize safety and efficiency. Only a few research papers tried to design a safe, efficient, and comfortable car-following model using DRL ( 3 , 12 –14).

There are some limitations that have not been considered by the previously mentioned DRL-based CFMs. First, all the existing DRL-based CFMs design their optimal behavior (e.g., desired headway) using real-life data sets such as the HighD data set ( 15 ), NGSIM data ( 16 ), and data from Shanghai Naturalistic Driving Study ( 17 ). That results in a model that tries to mimic human driver behavior which is not the optimal driving behavior; that is, these models have no potential to produce better-than-human performance. Second, all the existing DRL-based CFMs neglect to train and test on some common but safety-critical driving scenarios where the leader suddenly decelerates to a complete stop, and which may result in a collision. Third, DRL CFMs often focus on car-following mode, ignore speed-following mode, or do not offer a seamless switch between car-following mode and speed-following mode when the leader is no longer present ( 3 , 14 ). According to Treiber and Kesting ( 1 ), a complete car-following model must be able to seamlessly deal with such different situations as driving in free traffic, following the leader in both stationary and non-stationary situations, emergency situations when full braking is required, and approaching slow traffic caused by congestion or red traffic lights. Fourth, most of the existing DRL-based CFMs depend on time-to-collision (TTC) as a metric for safety. However, according to Vogel ( 18 ), following TTC-based safety criteria cannot guarantee safety and can lead to very dangerous situations and accidents in some cases. Fifth, generalization is missing in most of the existing DRL-based CFMs. In Packer et al. ( 19 ), generalization is defined as the ability of the model to preserve a good performance in different environments even if these environments were not seen before. Training and testing of RL models are often done in the same environment with the same parameters, which can lead to overfitting. The work ( 20 ) conducted a performance comparison between DRL and model predictive control for adaptive cruise control (ACC); DRL showed very good performance until the researchers conducted an out-of-distribution validation, where it was found that a substantial degradation in performance happened.

To overcome the limitations and fill in the mentioned gaps in literature, in this paper we propose a complete autonomous driving DRL-based car-following model that:

- Optimizes efficiency (unlike some previous RL CFMs partly based on human driving data), while preserving safe and comfortable driving behavior;

- Can handle all driving scenarios, such as speed-following scenarios (with different speed limits) as well as leader-following driving scenarios (normal driving with different speed limits and leader emergency-braking scenarios);

- Uses a newly designed reward function that depends on the proximity of the vehicle’s speed to the maximal safe speed for safety, efficiency, and speed-following, and the vehicle’s jerk for comfort;

- Uses a randomized environment during training to help improve generalizability to various car-following scenarios, such as regular driving with different speed limits, sudden speed change in emergency braking, and speed-following with different speed limits.

This paper is structured as follows. In the “Methods” section, we begin by briefly defining the RL problem and its formalization in finding an optimal policy for a Markov decision process (MDP). Then, we discuss adding safety constraints to an RL agent and provide a brief description of the area of safe RL. We then formulate a hard safety constraint that will be used for our agent and justify using a worst-case-based safety criterion instead of a TTC-threshold-based safety criterion for the constraint. Following this, we formally introduce the observations, actions, and rewards of SECRM (the Safe, Efficient, and Comfortable RL-based car-following Model), the training algorithm (deep deterministic policy gradient [DDPG]), and our training and evaluation scenarios. In the “Results” section, we describe experimental results obtained in the five evaluation scenarios (two regular-driving scenarios, two emergency-braking scenarios, and one speed-following scenario). We conclude by discussing several aspects of our agent.

Methods

Notation and Conventions

In this paper, we propose a controller for the longitudinal (within-lane) acceleration of AVs. We call the controlled vehicle the follower vehicle F, and the vehicle immediately in front of the follower vehicle (if such a vehicle exists) the leader vehicle L. The velocity of the follower is denoted by $v_{F}$ , and when the leader exists the velocity of the leader is denoted by $v_{L}$ . The distance gap $g_{d}$ between the follower and the leader is defined as the distance between the front of the follower and the back of the leader. The length of the leader vehicle is not included in the distance gap, in distinction to the headway distance $h_{d}$ , which is the distance from front of follower to front of leader and does include the length of the leader vehicle (Figure 1). In case there is no leader vehicle, by convention the distance gap is infinite.

Figure 1.

The distance gap $g_{d}$ and headway distance $h_{d}$ between the follower F and the leader L.

The time gap between the follower and the leader is defined as $g_{t} = \frac{g_{d}}{v_{F}}$ . The time gap is equal to the time that it would take the follower to drive through the distance gap if the follower kept driving at its current speed. The conversion between distance gap and time gap is immediate, and when the distinction between distance gap and time gap is not important we simply speak of the gap.

We denote the speed limits of the road section that the follower and (if it exists) the leader is driving on by $s_{F}$ and $s_{L}$ , respectively. We denote the maximal acceleration of the follower and leader vehicles by $a_{F}$ and $a_{L}$ , respectively, and the maximal deceleration (which by convention is a positive number) of the follower and leader by $b_{F}$ and $b_{L}$ , respectively. We denote the follower’s reaction time by $r$ . The reaction time includes the time taken by the controller (whether human or automated) to decide on an action, as well as the time it takes the vehicle system to apply the action. It is simply the time lag during which the follower is not responding to stimuli. The acceleration controller of the follower vehicle is assumed to apply an acceleration action every time step chosen to be the same as r in seconds.

By a follower–leader configuration (with respect to fixed parameters $a_{F}$ , $a_{L}$ , $b_{F}$ , $b_{L}$ , $r$ ), we mean the tuple $C = (g_{d}, v_{F}, v_{L}, s_{F}, s_{L})$ .

We use $g_{d} [t]$ , $v_{F} [t]$ , $v_{L} [t]$ , $s_{F} [t]$ and $s_{L} [t]$ to denote the distance gap, velocities of the follower and leader, and speed limits of follower and leader at time t, respectively. We let $v_{rel} [t] = v_{F} [t] - v_{L} [t]$ denote the relative velocity at time t, $a_{F} [t] = \frac{v_{F} [t] - v_{F} [t - 1]}{r}$ denote the follower’s acceleration at time t, and similarly $a_{L} [t]$ denote the leader’s acceleration at time t. (Please note that $a_{F}$ denotes the maximal acceleration, while $a_{F} [t] \in [- b_{F}, a_{F}]$ denotes the actual acceleration at time t, and similarly for $L$ .)

Reinforcement Learning and Markov Decision Processes

RL is a subfield of machine learning that studies methods for training intelligent controllers (agents) using reward signals obtained by the agent’s interaction with its environment ( 21 ). The agent’s decision-making process is frequently formalized in the concept of an MDP (or a variant, for example partially observable MDP [ 22 ] and constrained MDP [CMDP] [ 23 ]).

An (infinite-horizon) MDP is a five-tuple $(S, A, T, R, γ)$ . The set $S$ is the state space; it is the set of all possible agent–environment configurations. The set $A$ is the action space; it is the set of possible agent actions. The function $T : S \times A \times S \to [0, 1]$ is the transition function; $T (s^{'}, a, s)$ is the probability that the system passes to state $s'$ given initial state $s$ and agent action $a$ . The function $R : S \times A \to R$ is the reward function ( $R$ denotes the real numbers); $R (s, a)$ is the reward obtained after taking action $a$ in state $s$ . Finally, $γ \in [0, 1)$ is the discount factor. That $T$ and $R$ are functions of the present state $s$ only, and not the previous state history, is referred to as the Markov assumption.

The agent iteratively interacts with the environment, at time t starting at state $s_{t} \in S$ , taking action $a_{t} \in A$ , and receiving reward $r_{t} = R (s_{t}, a_{t}) .$

A policy $π$ is a mapping $S \to P (A)$ from the state space to the set of probability distributions over the action space. The probability of taking action $a$ in state $s$ is denoted $π (a | s)$ . Assuming an initial probability distribution $Π_{t_{0}}$ over $S$ at time $t_{0},$ the goal of the RL agent is to find a policy $π^{*}$ that maximizes the expected discounted cumulative return

J (π) = E_{\begin{matrix} s_{t_{0} ~ Π_{t_{0}}} \\ \begin{matrix} s_{t} ~ T (\cdot, s_{t - 1}, a_{t - 1}) \\ a_{t} ~ π (\cdot, s_{t}) \end{matrix} \end{matrix}} [\sum_{t = t_{0}}^{\infty} γ^{t - t_{0}} r_{t}] .

Multiple algorithms are known for approaching the RL problem. For example, classical algorithms include dynamic programming, Monte Carlo methods and Q-learning, and more recent deep RL algorithms include deep Q-learning (DQN) ( 24 ), DDPG ( 25 ), proximal policy optimization (PPO) ( 26 ), and soft actor-critic ( 27 ). Classical algorithms are theoretically relatively well understood (with proven convergence properties) but can have difficulties in scaling to larger problems (with a richer state representation and more complex actions). In recent years, there has been success in using neural networks as function approximators for representing the policy function and other auxiliary objects involved in RL algorithms (such as the state value function, and the state-action value function); collectively, the resulting new family of algorithms is known as deep RL (DRL). DRL algorithms are less well understood theoretically but often offer gains in performance over RL. In this paper, we use the DDPG algorithm, which is a deep RL algorithm. We choose DDPG because of its sample-efficiency (being an off-policy algorithm), and for easier comparison with previous RL CFMs, many of which (for example, Zhu et al. [ 3 ], Shi et al. [ 14 ], Lin et al. [ 28 ]) use DDPG. We provide more description of DDPG below, in the “Training” section.

Safe RL and the Worst-Case Action Bound

Safety of Previous RL Car-Following Controllers

In general, RL car-following controllers rely on reward alone for safety. Typically, the reward is a linear combination of several terms including safety, efficiency, comfort, speed-following, energy consumption, and so forth, with one of the terms in the reward function being a safety reward. The safety term is often either a large penalty (negative reward) for a crash (or a very small gap) in training ( 28 ), or a large penalty whenever the follower has a low TTC with respect to the leader ( 3 , 14 ). In either case, for agents trained using reward alone, the satisfaction of safety constraints is not guaranteed. One reason for this is that RL agents see only a finite part of the observation space in training; even a well-trained agent may find itself in a part of the observation space in testing that was not sufficiently well explored in training. Despite having some capacity for generalization, agents can fail in such situations. In support of the claim that reward alone may not be sufficient for satisfying safety constraints, as described in the “Experiments” section, we found that RL CFMs whose safety relies on reward alone (and that learn not to crash in training) may collide when the leader vehicle starts decelerating suddenly (i.e., in an emergency-braking scenario).

Because safety is paramount for autonomous driving systems, we find it necessary to place additional restrictions on an RL car-following controller to guarantee safety.

Safe RL

The question of how to impose safety criteria on RL agents gives rise to a subfield of reinforcement learning called safe RL. A wide variety of approaches to safe RL have been proposed. Please see for example Gu et al. ( 29 ) or Brunke et al. ( 30 ) for surveys of the field.

We find that we can formulate our safety constraint in the relatively simple form of an explicit analytic state-dependent acceleration upper bound $a_{safe} (s)$ that, if satisfied, guarantees that the controlled vehicle stays within a safe configuration in the next time step. Which configurations are safe is determined by the worst-case criterion described below, and the formula for $a_{safe} (s)$ is derived below.

Therefore, we can avoid the complications of passing to a framework such as CMDPs and algorithms appropriate to it, as is frequently required in safe RL, and instead directly modify the formulation of our basic MDP, placing an upper bound on the acceleration of the controlled vehicle, so that the set of actions at state $s$ is $[- b_{F}, a_{safe} (s)]$ instead of $[- b_{F}, a_{F}]$ . We can then apply unconstrained MDP methods to the problem.

Worst-Case Safety Criterion

In this paragraph, we formulate the hard constraint on our controller’s actions.

We adopt the following criterion to distinguish between safe and unsafe follower–leader configurations:

(Worst-case criterion) A follower–leader configuration is safe if and only if, in the event the leader brakes with maximal deceleration $b_{L}$ until coming to a complete stop, the initial gap $g_{d}$ is sufficiently large for the follower to be able to react and stop without crashing.

Based on the above criterion, we define the unsafe region as the set of gaps that are unsafe (the gap is not large enough for the follower to be able to stop), and the safe region as the set of gaps that are safe. The maximal safe speed is the highest follower speed in the following time step such that the follower does not cross into the unsafe region.

The worst-case criterion for safe driving is not new, appearing in multiple prior works, such as Gipps ( 4 ) and the General Motors (GM) model ( 31 ). It is the safety criterion adopted in the Vienna Convention on Road Traffic ( 32 ). We provide a justification for our preference for the worst-case criterion over another common safety criterion, based on a TTC threshold, later in the text. Note that although our model uses worst-case scenario for safety like the above-mentioned models, it is not an RL replica of the prior models, as our model includes other criteria such as concurrently balancing traffic efficiency (minimizing headways) and comfort (minimizing jerk), as will be discussed later in the text as well.

Derivation of the Maximal Safe Speed

Although our derivation of the maximal safe speed is based on similar principles to the well-known Gipps and GM models ( 4 , 31 ), for completeness and the convenience of the reader, we include the derivation details here.

Our goal is to find an upper bound for $v_{F} [t + 1]$ so that the follower can avoid a crash if the leader begins decelerating at maximal rate $b_{L}$ at time t and continues until a complete stop.

We begin by deriving a criterion for a safe gap, assuming that $v_{F} [t + 1]$ is known. From the established laws of motion, the braking distance of the leader is equal to $\frac{v_{L} {[t]}^{2}}{2 b_{L}}$ . The follower begins by accelerating from $v_{F} [t]$ to $v_{F} [t + 1]$ during the initial reaction time r, driving a distance of $\frac{v_{F} [t] + v_{F} [t + 1]}{2} r$ (note: we assume that the acceleration is uniform during the reaction time), and then (assuming that the follower applies maximal deceleration) drives an additional braking distance of $\frac{v_{F} {[t + 1]}^{2}}{2 b_{F}}$ . To avoid the vehicles stopping bumper-to-bumper, an additional small extra distance $ϵ$ in the gap is added to the initial distance gap. Therefore, the distance gap $g_{d} [t]$ at time t is safe if and only if the following inequality holds:

g_{d} [t] \geq \frac{v_{F} [t] + v_{F} [t + 1]}{2} r + \frac{v_{F} {[t + 1]}^{2}}{2 b_{F}} - \frac{v_{L} {[t]}^{2}}{2 b_{L}} + ϵ .

(1)

Next, assuming all quantities at time t (including the gap $g_{d} [t]$ ) are known, we can use Inequality 1 to obtain an upper bound on $v_{F} [t + 1]$ that makes the current gap $g_{d} [t]$ safe. Inequality 1 is still valid and becomes a quadratic inequality in the unknown $v_{F} [t + 1]$ , with the remaining variables fixed. The set of speeds $v_{F} [t + 1]$ that satisfy the inequality are those for which the gap $g_{d} [t]$ is safe. The coefficient of the quadratic term $\frac{1}{2 b_{F}}$ is a positive number, so the parabola opens toward the positive y axis, and the largest non-positive solution of Inequality 1 is found at the larger of the two (possibly equal) roots of the associated quadratic polynomial $p$ defined by

\begin{matrix} p (v_{F} [t + 1]) \\ = \frac{v_{F} {[t + 1]}^{2}}{2 b_{F}} + \frac{r}{2} v_{F} [t + 1] + \frac{r}{2} v_{F} [t] - \frac{v_{L} {[t]}^{2}}{2 b_{L}} - g_{d} [t] + ϵ \end{matrix}

(please see Figure 2). Using the quadratic formula, we find that the maximal safe speed is given by

\begin{matrix} v_{F, safe} [t + 1] \\ = - \frac{r b_{F}}{2} + \sqrt{{(\frac{r b_{F}}{2})}^{2} - 2 b_{F} (\frac{r v_{F} [t]}{2} - \frac{v_{L} {[t]}^{2}}{2 b_{L}} - g_{d} [t] + ϵ)} \end{matrix}

(2)

Please see Figures 3 and 4 for two heatmaps of the value of $v_{F, safe} [t + 1]$ . In these plots, $r = 0.5$ s, and the maximal decelerations are $b_{F} = b_{L} = 3 m / s^{2}$ . The tiles in which $v_{F, safe} [t + 1]$ cannot be reached from the initial follower–leader configuration because of the deceleration constraint have been hidden. By Gipps ( 4 ), a follower that always obeys the maximal safe speed bound will not enter such configurations. On the left-hand heatmap, $v_{F, safe} [t + 1]$ varies more along rows than columns, indicating a stronger dependence of $v_{F, saf e} [t + 1]$ on the leader speed $v_{L} [t]$ than the follower speed $v_{F} [t]$ . Because the speed $v_{F} [t]$ only affects the distance driven during the initial reaction time, the dependence of $v_{F, safe} [t + 1]$ on $v_{F} [t]$ grows stronger with larger r and weaker with smaller r.

Figure 2.

Deriving the maximal safe next speed $v_{F, safe} [t + 1]$ .

Figures 3 (left) and 4 (right).

Heatmaps of $v_{F, safe} [t + 1]$ . On the left, the initial distance gap is fixed at 5 m, and the safe velocity is displayed as a function of leader and follower speeds. On the right, the leader speed is fixed at 20 m/s, and the safe velocity is displayed as a function of initial distance gap and follower speed.

Critique of Safety Criteria That Are Based on a TTC Threshold

We recall that the TTC of a follower–leader configuration is given by

TTC = \frac{g_{d}}{v_{F} - v_{L}} .

A safety criterion that is commonly used for RL approaches to longitudinal car-following takes the form

(TTC-threshold criterion) A follower–leader configuration is safe if and only if $TTC > c$ for a choice of constant c.

For example, $c = 4$ is used in Zhu et al. ( 3 ). The paper ( 18 ) surveys the literature and gives the range 1.5 ≤ c ≤ 5. The choice of $c$ is ad hoc, based on opinion and experiments. In addition to the ad hoc nature of the choice of threshold, we point out two disadvantages of TTC-threshold-based safety criteria:

There exist follower–leader configurations that are safe according to any TTC-threshold criterion (i.e., any choice of constant c), yet unsafe according to the worst-case criterion.For example, consider the case when $v_{F} = v_{L}$ . In this case, the TTC is infinite, and the configuration is considered safe according to the TTC-threshold criterion, no matter what threshold c is chosen and no matter how close the follower is to the leader vehicle. Yet if $v_{F} [t] = 20$ m/s, $v_{F} [t + 1] = 20$ m/s, $v_{L} [t] = 20$ m/s, $r = 0.1$ s, $g_{d} [t] = 1.5$ m for example, Inequality 1 fails, meaning that the follower does not have a sufficient gap to stop in case the leader applies a maximal deceleration.

TTC-threshold safety criteria do not depend on the follower’s reaction time $r$ , the follower’s acceleration action at time t, nor the maximal decelerations $b_{F}$ and $b_{L}$ . These parameters can be decisive in determining whether the follower has a sufficient gap to stop in case of a sudden deceleration of the leader. Thus, of two follower–leader configurations with equal TTC, one may be safe and the other unsafe according to the worst-case criterion.Differences in maximal decelerations arise often in practice. For example, each of the following vehicle types can be expected to have a different maximal deceleration from the others: sedans, sports cars, buses, freight trucks, and others.

The article ( 18 ) is devoted to analyzing the relative advantages and disadvantages of distance gap and TTC as safety indicators. The author’s thesis is that small gaps represent “potential or actual danger” whereas small TTC represents “actual danger.” For example, in the situation when the follower is tailgating the leader, with approximately equal speeds, the gap is small, yet the TTC is large (identifying the configuration as safe). If the leader suddenly decelerates, the TTC will become small, but the follower will not be able to avoid a crash. Staying safe according to the worst-case criterion may thus be seen as avoiding potential (and therefore actual) danger in the categories of Vogel ( 18 ). Using a TTC-threshold safety criterion is not sufficient for formulating hard constraints that provide safety guarantees.

Safety in Low-Visibility Conditions

In low-visibility conditions (for example, fog or heavy snowfall), it is necessary to add another (but conceptually similar) speed constraint. We assume that the system can determine its detection range at time t as $d_{vis} [t]$ . Modifying the worst-case safety criterion for the low-visibility setting, we require that the visibility range must not exceed the distance driven during the reaction period, plus the follower’s stopping distance. Thus, following a similar derivation to above, we require that

d_{vis} [t] \geq \frac{v_{F} [t] + v_{F} [t + 1]}{2} r + \frac{v_{F} {[t + 1]}^{2}}{2 b_{F}} + ϵ

and obtain the maximal safe speed in low-visibility conditions,

\begin{matrix} v_{F, vis} [t + 1] = - \frac{r b_{F}}{2} \\ + \sqrt{{(\frac{r b_{F}}{2})}^{2} - 2 b_{F} (\frac{r v_{F} [t]}{2} - d_{vis} [t] + ϵ)} . \end{matrix}

(3)

Alternatively, we could have reduced the derivation to the previous case by imagining a virtual stopped leader vehicle at the edge of the detection range.

Definitions of Efficiency and Comfort

In addition to safety, our controller aims to maximize efficiency and comfort.

Efficiency

We define the target speed of the follower at time $t + 1$ as

v_{F, tgt} [t + 1] = min (v_{F, safe} [t + 1], v_{F, vis} [t + 1], s_{F} [t])

where $v_{F, safe}$ is the maximal safe speed constrained by the leader Equation 2, $v_{F, vis}$ is the maximal safe speed constrained by visibility conditions Equation 3, and $s_{F}$ is the speed limit. Because the minimum of the three terms is taken, the target speed simultaneously satisfies both leader and low-visibility safety constraints, and is less than or equal to the speed limit.

We then define the follower inefficiency over a trajectory $t = 0, \dots, T$ as

Inefficienc y_{F} = \frac{1}{T} \sum_{t = 1}^{T} | v_{F, tgt} [t] - v_{F} [t] |

where $| . | denotes$ the absolute value. That is, inefficiency is measured as the average absolute deviation from the target speed. Our controller seeks to minimize the follower inefficiency.

We discuss three separate cases to justify our definition of efficiency.

In the case where there is a close leader vehicle ( $v_{F, tgt} = v_{F, safe}$ ), the follower that is driving at $v_{F, safe}$ is driving as fast as possible without crossing into the unsafe region. Therefore, driving at velocity $v_{F, safe}$ (i.e., maximizing efficiency according to our definition) greedily minimizes the follower–leader gap, subject to safety constraints.

Minimizing gaps between consequent pairs of vehicles in a system leads to a higher system capacity. Suppose, for example, that the average vehicle length is 5 m; then, in a steady-state stream of vehicles at common speed v and time gap $g_{t}$ , the flow in vehicles per hour is given by $3600 / (g_{t} + (5 / v))$ .

From Figure 5 we can observe that with a smaller time gap, the flow capacity will be larger. This calculation is highly idealized, but it illustrates clearly the effect that decreasing vehicle gaps has on system capacity.

Figure 5.

The motivation for decreasing time gaps between vehicles (maximizing efficiency) is the resulting increase of system capacity.

The case when the speed is constrained by low-visibility conditions ( $v_{F, tgt} = v_{F, vis}$ ) is similar to the first case. Each vehicle greedily minimizes its distance to its detection boundary subject to safety constraints, increasing steady-state system capacity.

Finally, the case in which the speed is constrained by the speed limit ( $v_{F, tgt} = s_{F}$ ) is conceptually distinct from the first two. By our definition, a more efficient follower drives at the speed limit as much as it can. Better efficiency in this sense will lead to a shorter travel time for the vehicle.

Comfort

We define the follower discomfort over a trajectory $t = 0, \dots, T$ as

Discomfor t_{F} = \frac{1}{T} \sum_{t = 0}^{T} j_{F} {[t]}^{2}

where the follower jerk (rate of change of acceleration) at time t is given by $j_{F} [t] = \frac{a_{F} [t] - a_{F} [t - 1]}{r}$ . This is an intuitively appealing measure of discomfort and is commonly used in the literature ( 3 ). Our controller aims to minimize discomfort (sudden changes in acceleration). We also tried to minimize the quantity $\frac{1}{T} \sum_{t = 0}^{T} | j_{F} [t] |$ , where $| . | denotes$ the absolute value, but found that the learned policy was slightly better with the sum-of-squares version defined above.

SECRM

In this section, we introduce our reinforcement-learning-based car-following model, which we call SECRM. The core idea is to constrain the acceleration of the controlled vehicle so that the speed is always below the maximal safe speed. Subject to this constraint, the controller learns to take actions that bring the speed as close to the maximal safe speed as possible, maintaining safety and maximizing efficiency while minimizing jerk.

MDP Formulation

The MDP models the follower’s decision-making. The controller controls the follower’s longitudinal acceleration.

State: The follower receives the following tuple as the observation of the state of the environment at time t (cf. the “Notation” section; $d_{vis} [t]$ denotes the detection range),

s_{t} = (v_{F} [t], v_{rel} [t], g_{d} [t], a_{F} [t], a_{L} [t], d_{vis} [t], s_{F} [t]),

and in cases when there is no leader, or the leader is beyond the detection range, we set $g_{d} [t]$ to $\infty$ .

• Actions: Given the observation at time t, the follower computes $v_{F, safe} [t + 1]$ according to Equation 2 (the terms $r$ , $b_{F}$ and $ϵ$ are controller parameters, whereas an estimate is used for $b_{L}$ ), and $v_{F, vis} [t + 1]$ according to Equation 3. Letting $v_{F, bound} [t + 1] = min (v_{F, safe} [t + 1], v_{F, vis} [t + 1]),$ the maximal safe acceleration is

a_{F, \max} [t + 1] = clip (\frac{v_{F, bound} [t + 1] - v_{F} [t]}{r}, - b_{F}, a_{F})

where $clip (a, b, c) = min (max (a, b), c)$ .

The follower may apply any action in $[- b_{F}, a_{F, \max} [t + 1]]$ . In practice, the closed interval $[- 1, s 1]$ is the action space, and an action $a_{t} \in [- 1, 1]$ is mapped to the agent acceleration $a_{F} [t + 1] = - b_{F} + \frac{(a_{t} + 1) (a_{F, \max} [t + 1] + b_{F})}{2} \in [- b_{F}, a_{F, \max} [t + 1]]$ . This is done to normalize the neural network output.

Rewards: The reward is the linear combination of two separate parts.

Efficiency (and speed-following): We formulate the efficiency reward following the target speed $v_{F, tgt} [t + 1] = min (v_{F, safe} [t + 1], v_{F, vis} [t + 1], s_{F} [t])$ .

This choice allows us to control the cases when the follower’s speed is constrained by (1) its proximity to the leading vehicle (leader-following mode), (2) low-visibility conditions, and (3) the speed limit (speed-following mode), with the same RL model. The minimum function dynamically switches between the three objectives, based on which of the three speeds is lower.

The efficiency/speed-following reward is piecewise-linear, based on how close the actual velocity is to the target (writing $v_{F, tgt} [t + 1] = v_{tgt}$ to reduce notation):

R_{ES} (s_{t}, a_{t}) = {\begin{matrix} \frac{v_{F} [t + 1]}{v_{tgt}}, v_{F} [t + 1] \leq v_{tgt} \\ 2 - (\frac{v_{F} [t + 1]}{v_{tgt}}), v_{F} [t + 1] > v_{tgt} \end{matrix}

Please see Figure 6. Notice that in the car-following and poor-visibility cases, the acceleration constraint ensures that $\frac{v_{F} [t + 1]}{v_{tgt}} \leq 1$ , so that the right-side part of the reward function (past the peak) is not used. In speed-limit-following, we allow the vehicle to exceed the speed limit, but penalize this behavior relative to following the speed limit exactly.

Figures 6 (left) and 7 (right).

Shapes of the reward functions. The efficiency/speed-following reward function is displayed on the left, and the comfort reward function on the right. For the comfort reward example, $a_{F} = b_{F} = 3$ and $r = 0.1$ .

Comfort: The comfort reward is formulated to penalize large jerk. The value is normalized to lie between −1 and 0. Thus,

R_{comfort} (s_{t}, a_{t}) = - {(\frac{a_{F} [t + 1] - a_{F} [t]}{a_{F} + b_{F}})}^{2} = - {(\frac{j_{F} [t + 1]}{\frac{a_{F} + b_{F}}{r}})}^{2}

where $j_{F} [t + 1] = \frac{a_{F} [t + 1] - a_{F} [t]}{r}$ is the follower jerk at time $t + 1$ (Figure 7).

The full reward is then given by

R (s_{t}, a_{t}) = R_{ES} (s_{t}, a_{t}) + w R_{comfort} (s_{t}, a_{t})

(4)

for some parameter $w \geq 0$ .

We experimented with $w \in {0.1, 0.2, \dots, 0.9}$ and concluded that $w = 0.7$ achieved the best efficiency and comfort in our experiments. The results described below are for a controller trained with $w = 0.7$ .

We remark that in safety-critical situations the action of the controller is highly constrained by the bound $a_{F, \max} [t + 1]$ . In particular, in the extreme case when the follower is driving as closely to the leader as permitted by the safety constraint (with equal velocities), and the leader performs an emergency deceleration, the safety constraint will also force the follower to undergo an emergency deceleration as well (the action is forced to be $b$ ). The weight $w$ can be intuitively regarded as balancing between efficiency and comfort, while safety guarantees are relegated to the safety constraint.

Importance of Using a Target Speed Instead of a Target Gap

It is common (for example Zhu et al. [ 3 ], Shi et al. [ 14 ], Lin et al. [ 28 ]) to formulate the efficiency part of the RL car-following reward as following a set target gap. In our work, we instead formulate efficiency as following the dynamic maximal safe next speed. We find that our formulation has the following three advantages.

There is no target gap setting that is optimal for all follower–leader configurations. Usually, a given gap will either be inefficient or unsafe. We use a dynamic target speed, effectively following a dynamic target gap.

As mentioned above, formulating efficiency as speed-following allows us to uniformly treat the cases when the follower’s speed is constrained by the leader (car-following mode), poor visibility conditions, and by the speed limit (no leader present and sufficient visibility).

The follower’s action directly controls the speed, whereas the gap depends additionally on the (uncontrolled) acceleration of the leader. Consequently, we find that learning with a target speed is simpler than learning with a target gap.

Training

Deep Deterministic Policy Gradient

We use the DDPG algorithm ( 25 ) to train our controller. DDPG is a model-free, off-policy actor-critic algorithm. DDPG is an analog of the DQN algorithm that works with continuous action spaces.

To describe more details, we recall that the state-action value function of policy $π$ is given by

Q^{π} (s_{t}, a_{t}) = r_{t} + γ E_{\begin{matrix} s_{k} ~ T (\cdot | s_{k - 1}, a_{k - 1}) \\ a_{k} ~ π (\cdot | s_k) \end{matrix}} [\sum_{k = t + 1}^{\infty} γ^{t + 1 - k} r_{k}]

The state-action value function $Q^{π} (s_{t}, a_{t})$ of policy $π$ is the expected cumulative return of $π$ if the trajectory starts by taking action $a_{t}$ at state $s_{t}$ and follows $π$ afterward. It is well known that the Q-function of an optimal policy $π^{*}$ satisfies the Bellman equation,

Q^{π^{*}} (s, a) = E_{s^{'} ~ T (\cdot, s, a)} [R (s, a) + γ max_{a'} Q^{π^{*}} (s^{'}, a^{'})] .

Motivated by the Bellman equation, the classical Q-learning algorithm creates a sequence $Q_{t}$ of approximations of $Q^{π^{*}}$ , by updating $Q_{t}$ as follows after taking action $a_{t}$ in state $s_{t}$ and observing the new state $s_{t + 1}$ and reward $R (s_{t}, a_{t}) = r_{t}$ ,

\begin{matrix} Q_{t + 1} (s_{t}, a_{t}) \\ = (1 - α) Q (s_{t}, a_{t}) + α (R (s_{t}, a_{t}) + γ max_{a'} Q (s_{t + 1}, a^{'})) . \end{matrix}

In deep RL, the iterative Q-function approximations are replaced by a neural network with parameters $θ$ , denoted $Q_{θ} (s, a)$ (the approach generalizes to other function approximators, but we discuss only neural networks here).

In Q-learning (both tabular and deep), the agent chooses the action that maximizes its current Q-value estimates, during both training and deployment. Because maximizing the Q-value over all possible actions can be a difficult problem in itself when the action space is continuous, DDPG trains a deterministic policy function (the actor) in addition to learning the (estimate of the) Q-value function (the critic). The actor’s decisions are also computed using a neural network with parameters $ϕ$ , and the policy is denoted $π_{ϕ} (s) .$

The DDPG algorithm keeps a replay buffer of recent experience by storing tuples $(s_{t}, a_{t}, r_{t}, s_{t + 1})$ obtained by following the deterministic actions obtained using $π_{ϕ} (s)$ . The critic network parameters are periodically updated (for example, every environment step) using minibatch stochastic gradient descent to minimize the loss function

L (θ) = \frac{1}{| B |} \sum_{(s_{t}, a_{t}, r_{t}, s_{t + 1}) \in B} \frac{1}{2} (Q_{θ} (s_{t}, a_{t}) - (r_{t} + γ Q_{θ} (s_{t + 1}, π_{ϕ} (s_{t + 1})))^{2}

where $B$ denotes a minibatch of samples from the experience replay buffer, and the update target $(r_{t} + γ Q_{θ} (s_{t + 1}, π_{ϕ} (s_{t + 1})))$ is motivated by the Bellman equation as in classical Q-learning.

The critic network is not used for deciding the agent actions, but it is used for updating the actor network by maximizing the current estimates of the cumulative return provided by the critic, using minibatch stochastic gradient ascent with respect to $ϕ$

J_{θ} (ϕ) = \frac{1}{| B |} \sum_{(s_{t}, a_{t}, r_{t}, s_{t + 1}) \in B} Q_{θ} (s, π_{ϕ} (s)) .

To stabilize learning, target copies of the actor and critic are kept, whose weights are updated by taking an exponential moving average of the most recent and previous target weights. To encourage exploration, a noise term in the form of an Ornstein–Uhlenbeck process is added to the actor. For full details of the DDPG algorithm, please see the original paper ( 25 ).

The hyperparameter settings for the DDPG algorithm are listed in Table 1.

Table 1.

DDPG Parameter Settings

DDPG parameter	Value
Reward discount factor $(γ)$	0.99
Target network weights update factor $(τ)$	0.001
Actor learning rate	0.0001
Critic learning rate	0.001
Experience replay buffer size	1e6
Minibatch size	64
Actor Ornstein–Uhlenbeck noise	Mean 0 SD 0.02

Note: DDPG = deep deterministic policy gradient; SD = standard deviation.

Training Details

During training, we use a loop road network (please see Figure 8). We train for 200 episodes with a horizon of 3000 time steps per episode (except that in the event of a crash, an episode is prematurely terminated). Every 10 episodes, we assign new speed limits to each section of the loop. To allow the agent to gather more experience (avoid initial crashes), we use curriculum learning strategy; during the first 20 episodes we sample speed limits uniformly from ${5, 10, 15}$ , and for the rest of training we sample speed limits uniformly from ${5, 10, 15, 20, 23, 28}$ . We use this experiment setting to allow the training to start from easy mode (with small speed-limit change), progressing to hard mode (with larger speed-limit change). Initially, we do not impose the action bound in training, allowing all actions in $[- b_{F}, a_{F}] .$ Later in training, we start imposing the action bound. This is because we find that if we impose the action bound at the beginning, the agent will learn some irrational behavior, such as keeping accelerating or keeping decelerating. In addition, at the start of training we add a safety buffer time gap to the follower reaction time when computing the maximal safe speed, allowing the follower more time to decide on its action. The safety buffer can result in slower target speeds and fewer crashes and it starts with 0.7 and is annealed down to 0 using the expression $0.7 \exp (- \frac{e}{T})$ , where $e$ is the current episode index and $T = 10$ is the temperature.

Figure 8.

Network geometry for the emergency-braking (top) and regular-driving (bottom) test scenarios.

Evaluation Scenarios

Regular Driving and Emergency Braking

In the regular-driving and emergency-braking scenarios, there are two vehicles driving in the loop network with a single lane. Please see Figure 8 for the network geometry. The difference between the two scenarios is that in the emergency-braking scenario, one of the loop sections has a speed limit of 5 m/s, with the immediate upstream section’s speed limit equal to 28 m/s which forces the leader to aggressively decelerate, emulating emergency slowdown.

The follower vehicle is controlled by SECRM in both scenarios. In regular driving, the leader is controlled by IDM (described in the “Baselines” section); in emergency braking, the leader is also controlled by IDM, except that on the emergency-braking section the leader’s action is overridden to the maximal deceleration $b_{L}$ until reaching a speed of $\leq 5$ m/s. This models a sudden high deceleration by the leader.

Speed-Following Test

In the speed-following test, there is a single vehicle on a straight segment with varying speed limits, with no leader. Please see Figure 9 for the geometry and the specific speed limits. We created this straight network to allow the vehicle to drive a longer distance with no leader vehicle and without any curvature that might affect following the target speed.

Figure 9.

Network geometry for the speed-following test scenario.

Baselines

Intelligent Driver Model ( 5 )

The IDM was proposed to study the phase transition between free-flow traffic and stop-and-go traffic on freeways. It is commonly used to model both human drivers and AVs. Translating into the notation of our paper, the action of the IDM is given by

a_{F} [t] = a_{F} [1 - {(\frac{v_{F} [t]}{v_{0}})}^{δ} - {(\frac{g^{*}}{g_{d} [t]})}^{2}]

where $1 \leq δ \leq 5$ ( $δ = 4$ was used for our experiments), $v_{0}$ is the desired velocity (this is often the speed limit), and the effective desired distance gap $g_{d}^{*}$ is given by

g_{d}^{*} (v_{F} [t], v_{rel} [t]) = ϵ + max (v_{F} [t] T + \frac{v_{F} [t] v_{rel} [t]}{2 \sqrt{a_{F} b_{F}^{comf}}}, 0)

where $ϵ$ is the smallest permitted gap to a standing vehicle, $T$ is the desired time gap in congested but moving traffic, and $b_{F}^{comf}$ is the highest comfortable deceleration.

In free-flow traffic ( $g_{d} \to \infty)$ the acceleration simplifies to $a_{F} [1 - {(\frac{v_{F} [t]}{v_{0}})}^{δ}]$ . At $δ = 1$ , the acceleration has an exponential behavior (acceleration decreases in magnitude as $v_{F} [t] \to v_{0}$ ); as $δ \to \infty$ , if $v_{0} > v_{F} [t]$ , then ${(\frac{v_{F} [t]}{v_{0}})}^{δ} \to 0$ , so the acceleration is approximately equal to $a_{F}$ until reaching the speed limit; if $v_{0} = v_{F} [t]$ , then the acceleration is 0; if $v_{0} < v_{F} [t]$ , then $- {(\frac{v_{F} [t]}{v_{0}})}^{δ} \to - \infty$ , resulting in sharp braking. When following the leader with approximately equal speeds, the effective desired distance gap is $ϵ + v_{F} [t] T$ . When approaching much slower or stopped vehicles, the additional term comes into effect. To clarify, we focus on $- a_{F} {(\frac{v_{F} [t] v_{rel} [t]}{2 \sqrt{a_{F} b_{F}^{comf}} g_{d} [t]})}^{2} \approx - {(\frac{v_{F}^{2}}{2 g_{d} [t]})}^{2} 1 / b_{F}^{comf}$ (neglecting the other terms). Note that $b = \frac{v_{F}^{2}}{2 g_{d} [t]}$ is the deceleration needed to stop from initial speed $v_{F} [t]$ over distance $g_{d} [t]$ . So, the resulting deceleration is approximately equal to $- b_{F}^{comf}$ if $b \approx b_{F}^{comf}$ , and is otherwise scaled by the multiplicative factor $b / b_{F}^{comf}$ .

Shladover’s ACC Model ( 33 )

We use the unilateral ACC model proposed in the paper, and not the collaborative ACC, for a fair comparison with the other tested models. The paper proposes a simple model of ACC vehicles that is based and tested on experimental data gathered from commercial ACC vehicles. The model (translating into our notation) is

a_{F} [t] = k_{1} (g_{d} [t] - g_{t}^{0} v_{F} [t]) + k_{2} (v_{L} [t] - v_{F} [t])

where $g_{t}^{0}$ is the target time gap, and $k_{1}$ and $k_{2}$ are hyperparameters chosen based on experimental data. The Shladover model shows a good fit to experimental data and is used for modeling ACC vehicles.

Car-Following Model-RL ( 3 )

The car-following model-RL (CFM-RL) is an RL-based longitudinal car-following model. We use the unilateral (not bilateral) version of the controller, for a fair comparison with the other tested models. The reward is given by (translating to our notation)

R (s, a) = ω_{s} R_{safety} (s, a) + ω_{e} R_{eff} (s, a) + ω_{c} R_{comfort} (s, a)

where

R_{safety} (s, a) = {\begin{matrix} \log (\frac{g_{d} [t]}{4 (v_{F} [t] - v_{L} [t])}), 0 \leq \frac{g_{d} [t]}{v_{F} [t] - v_{L} [t]} \leq 4 \\ 0, otherwise \end{matrix}

R_{comfort} (s, a) = - {(\frac{jerk}{\frac{a_{F} + b_{F}}{r}})}^{2}, and

$R_{eff} (s, a) = \frac{1}{\sqrt{2 π} σ h} \exp (- \frac{{(\ln (h) - u)}^{2}}{2 σ^{2}})$ where $h$ denotes the time gap in state $s$ .

The efficiency reward is given by the probability density function of the log-normal distribution with parameters $u, σ$ . The parameters are chosen so that the peak of the distribution, which occurs at $\exp (u - σ^{2})$ , is equal to the desired time gap. In Zhu et al. ( 3 ), the parameters $u = 0.4226$ , $σ = 0.4365$ are used, giving a target headway of 1.26 s. Please see Figures 10 and 11 for several examples of the shape of the CFM-RL efficiency and safety rewards.

Figures 10 (left) and 11 (right).

Examples of the shape of the CFM-RL efficiency (left) and safety (right) rewards.

The paper ( 3 ) sets the weights to $ω_{s} = ω_{e} = ω_{c} = 1$ . We note that safety relies on a reward function formulated using a TTC-threshold criterion, and efficiency is formulated using a fixed target time gap.

In the CFM-RL training phase, we use the exact same network to train our model as SECRM. As for SECRM, we also tried the curriculum learning framework to gradually increase the learning difficulties, that is, adding smaller speed-limit change in the first few episodes but changing to larger speed-limit change in the following few episodes. However, we found that if we have smaller headway in emergency stop cases, the CFM-RL model cannot converge well.

Gipps Model ( 4 )

When the leader vehicle is sufficiently close to the follower, the Gipps model’s acceleration is based on the worst-case criterion, just like SECRM (as we have discovered after independently formulating the criterion and deriving the action bound). In this case, Gipps follows the maximal safe speed $v_{safe} [t + 1]$ obtained in Equation 2. In the other case, that is, when the leader vehicle is far from the follower (or there is no leader), the speed of the Gipps controller evolves as

\begin{matrix} v_{F, speed - follow} [t + 1] = v_{F} [t] + 2.5 a_{F} r (1 - \frac{v_{F} [t]}{s_{F} [t]}) \\ {(0.025 + \frac{v_{F} [t]}{s_{F} [t]})}^{\frac{1}{2}} . \end{matrix}

According to Gipps ( 4 ), this function was derived by fitting a curve to a plot of instantaneous speeds and accelerations from a sensor-equipped vehicle with a human driver on an arterial road in moderate traffic. The complete Gipps model is

\begin{matrix} v_{F} [t + 1] = min (v_{F} [t] + a_{F} r, \\ v_{F, speed - follow} [t + 1], v_{safe} [t + 1]) \end{matrix}

Advantages of SECRM over the Gipps Model

Because the SECRM maximal safe speed $v_{safe} [t + 1]$ is derived using the same principles as the Gipps model, we may ask what the advantages of SECRM are relative to Gipps.

In leader-following mode: In the presence of a leader, the Gipps model always takes on the maximal safe speed. This means the motion of the vehicle is quite jerky, with large second-to-second variance in accelerations. In Treiber and Kesting ( 1 ), large jerk is said to be one of the main disadvantages of the Gipps model. Because we additionally optimize a comfort term that rewards the controller for minimizing the cumulative (normalized square of) the jerk, SECRM is significantly better than Gipps for comfort, and therefore more practical.

In speed-following mode: To formulate the speed-following model, Gipps relied on experimental data obtained from a sensor-equipped vehicle with a human driver, fitting an ad hoc function to the data. Because of this, the behavior of the Gipps controller in speed-following mode is human-like and inefficient.

In leader-following mode, SECRM can be thought of as trading in a bit of efficiency for smaller jerk, while in speed-following mode, SECRM is both more efficient and less jerky than Gipps. Both advantages are verified by our experiments described below.

Simulator

We perform the experiments in the Simulation Of Urban Mobility (SUMO) microsimulator ( 34 ). To interface between the simulator and our implementation of the DDPG algorithm, we use an augmented version of the middleware Flow ( 35 ) to which we have added features useful for our experiments. In turn, Flow uses SUMO’s TraCI API to interact and control the simulator.

Experimental Results

In the regular-driving and emergency-braking scenarios, we select two desired time-gap configurations as follows. First, since models with a target gap need a gap value as an input, we test each model with a target time gap equal to SECRM’s average time gap in that scenario for fair comparison (except Gipps, which does not have a target time gap). Second, we perform a “smallest safe time gap comparison.” Namely, by incrementing the desired time gap by 0.1 s, we find the smallest target time gap that does not crash in the emergency-braking scenario for each model. Then, we compare the safe models in normal driving.

The smallest safe time-gap setting is the one we would use in practice. On the other hand, the smallest safe time gap is in general quite high across all models, and we found it valuable to also test each model in regular driving with the target time gap equal to SECRM’s average gap because based on our previous proof we can assume this is the most efficient and safe time gap.

For all experiments, we use $r = 0.1$ s, $a_{F} = a_{L} = b_{F} = b_{L} = 3 m / s^{2}$ , and $ϵ = 2$ m. The detection range is infinite in our experiments. The reaction time of 0.1 s (which includes sensor time, controller computation time, and system response to the controller decision) is short but has been used in previous studies as a futuristic value for AV response time ( 36 ). We find that such a short reaction time (which results in higher maximal safe speeds) provides a good stress-test for the safety of our system.

Regular-Driving Scenario

In this section, we want to test the model in regular car-following scenario (no sudden leader accelerations or decelerations).

Regular Driving—SECRM’s Average Time Gap

With the target gap equal to SECRM’s average, CFM-RL and Gipps will have slightly smaller average time gap than SECRM, but higher average jerk by approximately an order of magnitude than SECRM. This makes sense, because SECRM’s reward is formulated to smooth out the high jerk characteristic of the Gipps model, at some expense of efficiency.

Please see Figures 12 and 13 for the time- and distance-gap comparisons, Figure 14 for the jerk comparison, and Table 2 for the average result over the simulation. From the results, we can see that the average time gap of Gipps is the smallest one; on the other hand, SECRM has a very similar average gap to that of Gipps. However, Gipps’s jerk is much higher than SECRM.

Figures 12 (left) and 13 (right).

Time gap (left) and distance gap (right) for the regular-driving scenario. Target time gap = SECRM’s average time gap.

Figure 14.

Jerk comparison for the regular-driving scenario. For non-SECRM models, target gap = SECRM’s average time gap.

Table 2.

Method Comparison for Regular Driving (for Non-SECRM Models, Target Gap = SECRM’s Average Time Gap)

Method	Avg. time gap	Avg. jerk
CFM-RL	0.6082	0.099
IDM	0.7102	0.129
Gipps	0.6040	0.131
Shladover	0.6524	0.309
SECRM	0.6182	0.013

Note: SECRM = safe, efficient, and comfortable reinforcement-leaning-based car-following model; Avg. = Average; CFM-RL = car-following model (reinforcement learning); IDM = intelligent driver model.

Regular Driving—Smallest Safe Time Gap

The time gap of each model (except Gipps and SECRM) is set to the smallest safe time gap (as measured by the emergency test scenario). Unsurprisingly, each human-driving-based model, including CFM-RL, IDM, Shladover will have larger average time gap.

Please see Figures 15 and 16 for the time- and distance-gap comparisons, Figure 17 for the jerk comparison, and Table 3 for the average result over the simulation.

Figures 15 (left) and 16 (right).

Time gap (left) and distance gap (right) for the regular-driving scenario. Target gap = smallest safe gap.

Figure 17.

Jerk comparison for regular driving. For non-SECRM models, target gap = SECRM’s average time gap.

Table 3.

Method Comparison for Regular Driving (for Non-SECRM Models, Target Time Gap = Smallest Safe Time Gap)

Method	Avg. time gap	Avg. jerk
CFM-RL	1.2032	0.1482
IDM	2.2737	0.1482
Gipps	0.6040	0.1311
Shladover	2.1395	0.3536
SECRM	0.6182	0.013

Emergency-Braking Scenario

In this section, we test each model in a scenario in which the leader undergoes a sudden maximal deceleration from 28 m/s to 5 m/s which is the emergency stop network from Figure 8.

Emergency Stop—SECRM’s Average Time Gap

Based on our findings, we observe that the models with fixed target time gap will be more likely to crash given a smaller target time gap. SECRM outdoes Gipps in both the average time gap and average jerk, while the other models crash. Because the CFM-RL crashes in this scenario, while it does not crash in training, we verify our claim that RL models that rely on reward alone for safety may not generalize sufficiently to avoid unsafe situations like crashes.

Please see Figures 18 and 19 for the time- and distance-gap comparisons, Figure 20 for the jerk comparison, and Table 4 for the average result over the simulation.

Figures 18 (left) and 19 (right).

Time gap (left) and distance gap (right) for the emergency-braking scenario. Target gap = SECRM’s average time gap.

Figure 20.

Jerk comparison for the emergency-braking scenario. Target gap = SECRM’s average time gap.

Table 4.

Method Comparison for Emergency Braking (for Non-SECRM Models, Target Time Gap = SECRM’s Average Time Gap)

Method	Avg. time gap	Avg. jerk
CFM-RL	Crash	Crash
IDM	Crash	Crash
Gipps	0.6365	0.2534
Shladover	Crash	Crash
SECRM	0.5798	0.0557

Emergency Braking—Smallest Safe Time Gap

We find that all models except Gipps require a significantly higher safe target time gap to safely pass the emergency-braking scenario. IDM and CFM-RL are comparable to SECRM in jerk, but have significantly higher average time gap, indicating loss of efficiency.

Please see Figures 21 and 22 for the time- and distance-gap comparisons, Figure 23 for the jerk comparison, and Table 5 for the average result over the simulation.

Figures 21 (left) and 22 (right).

Time gap (left) and distance gap (right) for the emergency-braking scenario. Target time gap = smallest safe time gap.

Figure 23.

Jerk comparison for the emergency-braking scenario. Target time gap = smallest safe time gap.

Table 5.

Method Comparison for Emergency Braking (for Non-SECRM Models, Target Time Gap = Smallest Safe Time Gap)

Method	Avg. time gap	Avg. jerk
CFM-RL	1.48005	0.06598
IDM	3.0598	0.0658
Gipps	0.6365	0.2534
Shladover	2.8828	0.3599
SECRM	0.5798	0.0557

Speed-Following Scenario

In the previous section, we analyzed the car-following mode. In this section, we will analyze how the follower vehicle can follow the speed limit in freeway without a leader vehicle. Note that the CFM-RL is not trained with any speed-following reward so we do not include it as a baseline. Meanwhile, because in speed-following scenario, there is no leader, the jerk will be very small, which makes it hard to use for making a comparison, so we use the acceleration for comparison.

First, we use the same baselines as in the previous section.

Please see Figures 24 and 25 for the velocity and acceleration comparisons, and Table 6 for the average result over the simulation. From the results, we find that Gipps cannot follow the target speed very well as a result of the second term of the Gipps equation, which is the safe target speed constraint to avoid sudden acceleration/deceleration incurred by sharp speed change. IDM can better catch up the target speed, but it needs a longer time. The Shladover model will catch up the target speed very fast, but it will end up with the highest jerk. SECRM will catch up soon, but it will not have very high jerk. To summarize, Gipps has two target speeds. One is efficient (car-following), the other is quite inefficient (speed-following). One major advantage of SECRM over Gipps is that it optimizes speed-following too.

Figures 24 (left) and 25 (right).

Velocity (left) and acceleration (right) comparison in the speed-following scenario.

Table 6.

Method Comparison for Speed-Following Scenario

Method	Avg. speed difference	Avg. jerk
IDM	0.45780	0.00163
Gipps	2.81917	0.00139
Shladover	0.20876	0.00254
SECRM	0.23639	0.00206

Note: Avg. = Average; IDM = intelligent driver model. SECRM = safe, efficient, and comfortable reinforcement-leaning-based car-following model.

Discussion

Safety, efficiency, and comfort: In our experiments, we find that SECRM is safe and has an efficiency advantage over the models with a fixed target time gap (IDM, Shladover, CFM-RL); for the latter models, a large target time gap is required for the models to avoid a collision in an emergency-braking scenario. Such a large target makes the models inefficient in regular driving. Because SECRM and Gipps have a dynamic target speed (formulated to be safe according to the worst-case criterion), they can drive with more efficiency, while still avoiding collisions in both regular driving and emergency braking. SECRM optimizes an additional comfort term, which solves a major deficiency of the Gipps model—impractically high jerk.

Unification of speed-following and efficiency: Because efficiency is formulated as following the maximal safe speed, we can unify the speed-limit-following and efficiency reward terms, obtaining a single model that works in both speed-following and leader-following scenarios, shifting between the two dynamically (without requiring an ad hoc threshold choice to switch between the two modes).

Generalization and robustness: To ensure that the RL controllers are not overfitting to the training scenarios (and to obtain models that work well in both regular-driving and emergency-braking scenarios), we train on a network whose sections have randomly assigned speed limits that are regularly reassigned during training. The training scenario is different from all three testing scenarios. Nevertheless, the trained models perform well, showing a capacity for generalization, and providing evidence that the trained model is robust.

Extendable framework: By promoting safety from one of the terms of a reward function to a hard action constraint, we obtain a flexible framework for training safe car-following RL models. In this paper, we have focused on optimizing comfort in addition to efficiency, but by modifying the reward function it is possible to add other optimization criteria (for example, cooperative reward function terms for within-platoon optimization, mixed-autonomy scenarios, string stability). Such enhancements will be the subject of future work.

Comfortable vs efficient driving behavior: From our results, we can see that the Gipps model can have slightly smaller headway in a regular-driving scenario than does SECRM; however, SECRM will have more comfortable performance. Generally, if we want to achieve a higher performance of one criterion then we need to sacrifice another criterion.

Conclusion

CFMs have been investigated for decades and have significantly matured. They are heavily used in microscopic traffic system simulation. Over the last decade, there has been renewed and rising interest in improving CFMs because of the rapid emergence of automated driving and ACC.

Autonomous driving systems based on RL have particular promise, being able to optimize a range of desirable features, such as efficiency and comfort, but have several potential drawbacks. In this paper, we have addressed three such potential drawbacks, improving on past work. First, previous RL controllers typically offer no safety guarantees, and the safety reward component is frequently based on a TTC threshold (which we have observed in this work cannot guarantee safety). We improve the system safety characteristics by formulating a hard safety constraint that offers analytic safety guarantees. Second, RL controllers may overfit to the scenarios seen during training. We improve system robustness by including a wide variety of leader vehicle behaviors in training. Third, previous RL controllers typically pass between leader-following and speed-following (free-flow) modes based on an ad hoc threshold. We improve by combining both leader-following and free-flow modes into a single speed target. The resulting agent performs well in our test scenarios, avoiding crashes even in emergency braking (whereas a representative previous RL controller does not), with excellent efficiency, speed-following, and comfort characteristics.

In future work, we plan to extend the controller by including more optimization targets in the reward, including system stability, as well as adding a lane-changing module.

Footnotes

Author Contributions

The authors confirm contribution to the paper as follows: conceptualization, methodology, investigation, software, writing—original draft: T. Shi; conceptualization, investigation, writing—original draft: O. ElSamadisy; methodology, investigation, writing—original draft: I. Smirnov; supervision, writing—review and editing: B. Abdulhai. All authors reviewed the results and approved the final version of the manuscript.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iDs

Omar ElSamadisy

Baher Abdulhai

References

Treiber

Kesting

Traffic Flow Dynamics: Data, Models and Simulation, 1st ed. Springer, Berlin, Heidelberg, 2013.

Kreidieh

A. R.

Bayen

A. M.

Dissipating Stop-and-Go Waves in Closed and Open Networks via Deep Reinforcement Learning. Proc., 21st International Conference on Intelligent Transportation Systems (ITSC), Maui, HI, IEEE, New York, 2018, pp. 1475–1480.

Zhu

Wang

Safe, Efficient, and Comfortable Velocity Control Based on Reinforcement Learning for Autonomous Driving. Transportation Research Part C: Emerging Technologies, Vol. 117, 2020, p. 102662.

Gipps

P. G.

A Behavioural Car-Following Model for Computer Simulation. Transportation Research Part B: Methodological, Vol. 15, No. 2, 1981, pp. 105–111.

Treiber

Hennecke

Helbing

Congested Traffic States in Empirical Observations and Microscopic Simulations. Physical Review E, Vol. 62, No. 2, 2000, pp. 1805–1824.

Chong

Abbas

M. M.

Medina

Simulation of Driver Behavior with Agent-Based Back-Propagation Neural Network. Transportation Research Record: Journal of the Transportation Research Board, 2011. 2249: 44–51.

Zhou

A Recurrent Neural Network Based Microscopic Car Following Model to Predict Traffic Oscillation. Transportation Research Part C: Emerging Technologies, Vol. 84, 2017, pp. 245–264.

Zhang

Sun

Yin

Lin

Wang

Human-Like Autonomous Vehicle Speed Control by Deep Reinforcement Learning with Double Q-Learning. Proc., IEEE Intelligent Vehicles Symposium (IV), Changshu, China, IEEE, New York, 2018, pp. 1251–1256.

Isele

Rahimi

Cosgun

Subramanian

Fujimura

Navigating Occluded Intersections with Autonomous Vehicles Using Deep Reinforcement Learning. Proc., IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, IEEE, New York, 2018, pp. 2034–2039.

10.

Gong

Abdel-Aty

Yuan

Cai

Multi-Objective Reinforcement Learning Approach for Improving Safety at Intersections with Adaptive Traffic Signal Control. Accident Analysis & Prevention, Vol. 144, 2020, p. 105655.

11.

Zhou

Development of an Efficient Driving Strategy for Connected and Automated Vehicles at Signalized Intersections: A Reinforcement Learning Approach. IEEE Transactions on Intelligent Transportation Systems, Vol. 21, No. 1, 2020, pp. 433–443.

12.

Yen

Y. T.

Chou

J. J.

Shih

C. S.

Chen

C. W.

Tsung

P. K.

Proactive Car-Following Using Deep-Reinforcement Learning. Proc., IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC), Rhodes, Greece, IEEE, New York, 2020, pp. 1–6.

13.

Zhu

Wang

Human-Like Autonomous Car-Following Model with Deep Reinforcement Learning. Transportation Research Part C: Emerging Technologies, Vol. 97, 2018, pp. 348–368.

14.

Shi

ElSamadisy

Abdulhai

Bilateral Deep Reinforcement Learning Approach for Better-Than-Human Car Following Model. arXiv Preprint arXiv:2203.04749, 2022.

15.

Krajewski

Bock

Kloeker

Eckstein

The highD Dataset: A Drone Dataset of Naturalistic Vehicle Trajectories on German Highways for Validation of Highly Automated Driving Systems. Proc., 21st International Conference on Intelligent Transportation Systems (ITSC), Maui, HI, IEEE, New York, 2018, pp. 2118–2125.

16.

Halkias

Colyar

Next Generation SIMulation Fact Sheet. FHWA-HRT-06-135. U.S. Department of Transportation, Federal Highway Administration, Washington, D.C., 2006.

17.

Zhu

Wang

Tarko

Fang

Modeling Car-Following Behavior on Urban Expressways in Shanghai: A Naturalistic Driving Study. Transportation Research Part C: Emerging Technologies, Vol. 93, 2018, pp. 425–445.

18.

Vogel

A Comparison of Headway and Time to Collision as Safety Indicators. Accident Analysis & Prevention, Vol. 35, No. 3, 2003, pp. 427–433.

19.

Packer

Gao

Kos

Krähenbühl

Koltun

Song

Assessing Generalization in Deep Reinforcement Learning. arXiv Preprint arXiv:1810.12282, 2018.

20.

Lin

McPhee

Azad

N. L.

Comparison of Deep Reinforcement Learning and Model Predictive Control for Adaptive Cruise Control. IEEE Transactions on Intelligent Vehicles, Vol. 6, No. 2, 2021, pp. 221–231.

21.

Puterman

M. L.

Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1st ed. Wiley-Interscience, Hoboken, NJ, 2005.

22.

Kaelbling

L. P.

Littman

M. L.

Cassandra

A. R.

Planning and Acting in Partially Observable Stochastic Domains. Artificial Intelligence, Vol. 101, No. 1–2, 1998, pp. 99–134.

23.

Altman

Constrained Markov Decision Processes, Vol. 7. CRC Press, Boca Raton, FL, 1999.

24.

Mnih

Kavukcuoglu

Silver

Rusu

A. A.

Veness

Bellemare

M. G.

Graves

, et al. Human-Level Control through Deep Reinforcement Learning. Nature, Vol. 518, No. 7540, 2015, pp. 529–533.

25.

Lillicrap

T. P.

Hunt

J. J.

Pritzel

Heess

Erez

Tassa

Silver

Wierstra

Continuous Control with Deep Reinforcement Learning. arXiv Preprint arXiv:1509.02971, 2015.

26.

Schulman

Wolski

Dhariwal

Radford

Klimov

Proximal Policy Optimization Algorithms. arXiv Preprint arXiv:1707.06347, 2017.

27.

Haarnoja

Zhou

Abbeel

Levine

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. Proc., 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR80, January4, 2018.

28.

Lin

McPhee

Azad

N. L.

Longitudinal Dynamic Versus Kinematic Models for Car-Following Control Using Deep Reinforcement Learning. Proc., IEEE Intelligent Transportation Systems Conference (ITSC), Auckland, New Zealand, IEEE, New York, 2019, pp. 1504–1510.

29.

Yang

Chen

Walter

Wang

Yang

Knoll

A Review of Safe Reinforcement Learning: Methods, Theory and Applications. arXiv Preprint arXiv:2205.10330, 2022.

30.

Brunke

Greeff

Hall

A. W.

Yuan

Zhou

Panerati

Schoellig

A. P.

Safe Learning in Robotics: From Learning-Based Control to Safe Reinforcement Learning. arXiv Preprint arXiv:2108.06266, 2021.

31.

Chandler

R. E.

Herman

Montroll

E. W.

Traffic Dynamics: Studies in Car Following. Operational Research, Vol. 6, No. 2, 1958, pp. 165–184.

32.

Vanholme

Gruyer

Lusetti

Glaser

Mammar

Highly Automated Driving on Highways Based on Legal Safety. IEEE Transactions on Intelligent Transportation Systems, Vol. 14, No. 1, 2013, pp. 333–347.

33.

Milanés

Shladover

S. E.

Modeling Cooperative and Autonomous Adaptive Cruise Control Dynamic Responses Using Experimental Data. Transportation Research Part C: Emerging Technologies, Vol. 48, 2014, pp. 285–300.

34.

Lopez

Wiessner

A., E.

Behrisch

Bieker-Walz

Erdmann

Flötteröd

Y. P.

Hilbrich

Lücken

Rummel

Wagner

Microscopic Traffic Simulation Using SUMO. Proc., 21st International Conference on Intelligent Transportation Systems (ITSC), Maui, HI, IEEE, New York, 2018, pp. 2575–2582.

35.

Kreidieh

Parvate

Vinitsky

Bayen

A. M.

Flow: A Modular Learning Framework for Mixed Autonomy Traffic. arXiv Preprint arXiv:1710.05465, 2017.

36.

Makridis

Mattas

Ciuffo

Response Time and Time Headway of an Adaptive Cruise Control. An Empirical Characterization and Potential Impacts on Road Capacity. IEEE Transactions on Intelligent Transportation Systems, Vol. 21, No. 4, 2020, pp. 1677–1686.

Safe,Efficient,and Comfortable Reinforcement-Learning-Based Car-Following for AVs with an Analytic Safety Guarantee and Dynamic Target Speed

Abstract

Keywords

Methods

Notation and Conventions

Reinforcement Learning and Markov Decision Processes

Safe RL and the Worst-Case Action Bound

Safety of Previous RL Car-Following Controllers

Safe RL

Worst-Case Safety Criterion

Derivation of the Maximal Safe Speed

Critique of Safety Criteria That Are Based on a TTC Threshold

Safety in Low-Visibility Conditions

Definitions of Efficiency and Comfort

Efficiency

Comfort

SECRM

MDP Formulation

Importance of Using a Target Speed Instead of a Target Gap

Training

Deep Deterministic Policy Gradient

Training Details

Evaluation Scenarios

Regular Driving and Emergency Braking

Speed-Following Test

Baselines

Intelligent Driver Model ( 5 )

Shladover’s ACC Model ( 33 )

Car-Following Model-RL ( 3 )

Gipps Model ( 4 )

Advantages of SECRM over the Gipps Model

Simulator

Experimental Results

Regular-Driving Scenario

Regular Driving—SECRM’s Average Time Gap

Regular Driving—Smallest Safe Time Gap

Emergency-Braking Scenario

Emergency Stop—SECRM’s Average Time Gap

Emergency Braking—Smallest Safe Time Gap

Speed-Following Scenario

Discussion

Conclusion

Footnotes

Author Contributions

Declaration of Conflicting Interests

Funding

ORCID iDs

References