Sage Journals: Discover world-class research

Abstract

To address the problems of underutilization of samples and unstable training for intelligent vehicle training in the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm, a TD3 algorithm based on the Composite Prioritized Experience Replay (CPR-TD3) mechanism is proposed. It considers experience immediate reward value and Temporal Difference error (TD-error) based and respectively to construct priorities to rank the samples. Subsequently composite average ranking of the samples to recalculate the priorities for sampling, uses the collected samples to train the target network. Then introduces the minimum lane change distance and the variable headway time distance to improve the reward function. Finally, the improved algorithm is proved to be effective by comparing it with the traditional TD3 on the highway scenario, and the CPR-TD3 algorithm improves the training efficiency of intelligent vehicles.

Keywords

Intelligent vehicle driving decision making deep reinforcement learning composite priority experience replay

1. Introduction

In recent years, with the rapid development of artificial intelligence, big data and other technologies, automatic driving has become an important part of the future intelligent transport system. The automatic driving system is generally composed of four modules: perception, decision-making, planning and control. The decision-making module is to generate reasonable driving behaviors based on the the information of the surrounding environment and the state of the self-vehicle, and to incorporate the generated driving behaviors into the motion control system, so as to realize the efficient driving of the intelligent vehicle. As the core of intelligent vehicle, how to efficiently make instructions according to the environment is an important prerequisite for the automatic driving [1]. Currently, the algorithms of decision-making module are mainly classified into rule-based algorithms and learning-based algorithms.

1.1 Rule-based algorithms

The rule-based decision-making approach relies on a database of rules constructed which is based on a large number of traffic regulations, driving experience, and driving knowledge [3], and then determines strategies based on different states of the vehicle this category represents methods for Finite State Machine (FSM) [28, 29]. In [2], the authors classify the vehicle’s behavior into four states, such as lane keeping, and use the FSM method to build the relationships between individual driving behaviors and the transitions between states. In [32], the authors use a hierarchical state machine approach for decision making in lane changing for intelligent vehicles, where the decision is made by dividing the lane change into a forced lane change scenario and a free traveling lane change scenario. In [33], the authors consider the difficulty of extracting lane-changing rules in complex environments, and use the rule extraction method of CART decision tree to extract the lane-changing rules, which proves that the extracted rules are effective. The main advantage of the rule-based decision-making algorithm is good interpretability. In the case of abnormal system behaviors, the expert can quickly identify the module that has a malfunction. However, its design still requires a lot of driving experience and repeated adjustments, and the whole system becomes difficult to maintain. Such algorithms are difficult to establish high-quality driving rules and cannot enumerate all possible events when dealing with complex and dynamic road scenarios.

1.2 Learning-based algorithms

Compared with the rule-based algorithm, the decision algorithm based on learning are mainly represented by deep reinforcement learning algorithms, obtains the optimal strategy by continuous trial and error through the continuous interaction between the agent and the environment and gets rid of the limitations of rules [4]. In [5], the authors use a Deep Q Network (DQN) approach for intelligent vehicle highway scenario lane-changing behavior decision-making, and simulation experiments show that the decision-making performance of this approach is better than that of the traditional rule-based approach. In [6] the authors study the lateral lane changing problem by adding rule constraints. In [30], the authors propose an end-to-end reinforcement learning strategy for lane keeping, which is simulated on a racing simulator using a deep Q-network algorithm with good lane keeping results. Intelligent vehicle decision learning based on deep Q-network algorithms avoids the process of discretization of the state space. However, the traditional reinforcement learning method represented by DQN adopts discrete action space during the training of intelligent vehicles, which leads to frequent lane changing of intelligent vehicles on highways. With the development of reinforcement learning, the algorithms based on continuous action space such as DDPG and TD3 have achieved better control results in the field of autonomous driving, where intelligent vehicles interact with the environment to output continuous actions, and complete decision-making tasks such as lane keeping [7] and lane changing [8] by controlling throttle opening and steering wheel angle. In [9], the authors use a stochastic strategy and experience replay to improve the lane keeping ability of an autonomous driving strategy under new road conditions. In [10], the authors construct a decision-making framework to achieve vehicle safety for handling emergency situations based on the DDPG algorithm. In [26], the authors fuse the DDPG algorithm with a mixture of sensor information features to improve vehicle control in terms of speed and lateral displacement. While the DDPG algorithm can lead to the problem of overestimation of the value during training because of the instability of the value function, the TD3 algorithm limits this problem by using a dual estimation Q network and achieves better results In [11], the authors use the TD3 algorithm for training the lane keeping ability of intelligent vehicles. In [12], the authors propose a dynamic delayed policy update to speed up the convergence of the algorithm based on the TD3 algorithm. Deep reinforcement learning obtains the optimal strategy by interacting with the environment with self-adjustment and self-learning ability, which provides a new idea for driving decision-making of intelligent vehicles in complex traffic environment, but the driving decision-making based on deep reinforcement learning still exists poor data utilization ability and low learning efficiency.

1.3 Experience replay

Intelligent vehicles need to interact with the environment a lot during training process, and the experience tuples obtained from the interaction once are stored into the experience pool, which the reply buffer is one of the most important parts of deep reinforcement learning. Initially, the experience pool sampling only learns based on the previous state, which can cause that there will be a deep correlation between neighboring states. In [13], the authors use the experience replay method to uniformly and randomly sample from the experience cache pool, which reduces the correlation between neighboring samples. The TD3 algorithm adopts the experience replay mechanism to reduce the correlation between samples, and updates the network parameters by random sampling. However, at the beginning of training the intelligent vehicle generates a large number of lower quality samples in the exploration phase, which will interfere with the training effect of the intelligent vehicle and increase the training time. To further improve the efficiency of experience replay in reinforcement learning, in [14], the authors proposed Priority Experience Reply (PER) which takes TD-error as a benchmark for the importance of experience, and prioritizes the selection of experiences with larger errors for training. In [15], the authors combine the DDPG algorithm with the PER method to accelerate the convergence speed of the model. In [16], the authors store the experience samples in episodes, calculated the cumulative returns of the episodes separately, and sampled the samples with larger returns to improve the training quality. In [27], the authors propose a TD error-based resampling preference mechanism for updating neural networks to improve algorithm performance by lowering the priority of high-priority experiences. Although the PER mechanism improves the utilization of data, as the experience pool continues to expand, when updating the network Q parameters, only a lesser number of samples are updated in priority, and most of the experiences can no longer satisfy the sampling requirements, which can lead to an increase in the priority difference between the stored and actual samples [17, 18], and ranking experiences by TD-error ignores the effect of immediate returns on the convergence of neural networks, while PER based on immediate returns is more robust but less capable of utilizing good samples [19].

In order to better solve the problems of high randomness of action selection and low training efficiency in the training process of intelligent vehicles under reinforcement learning algorithms, we establish a TD3 intelligent vehicle driving decision model based on composite priority experience replay in highway scenario. Firstly, the priority of immediate return and the priority based on time error are calculated and sorted respectively, and the experience is sorted by composite average. The sampling probability is calculated according to the composite priority for sampling the experience pool and updating the network parameters, and this kind of method ensures the diversity of samples. Secondly, the action space selection of intelligent vehicles is considered, and the reward function is designed by combining the theory of minimum lane change safety and variable headway to guide the intelligent vehicles to learn from high-value strategies and improve driving safety.

2. Modelling framework

2.1 Reinforcement learning model

Reinforcement learning models obtain rewards to feed back to the intelligences by the intelligences interacting with the environment. Through continuous learning, maximizes the reward value to obtain the optimal strategy. The learning process can usually be expressed as Markov Decision Process (MDP), which is defined as a quintuple, denoted as (S, A, R, P, $\gamma$ ), where the state space $S$ is the finite set of states of the environment; the action space $A$ is the set of actions of the intelligent body; the state transfer probability $P$ is the transfer probability of the environment; the reward function $R$ is the immediate reward that will be obtained by taking an action in the current state; the reward attenuation factor $\gamma$ is the weight between the immediate reward and the long-term reward. The specific process is shown in Fig. 1.

Figure 1.

Reinforcement learning model.

2.2 TD3 algorithm

The TD3 algorithm is a reinforcement learning algorithm based on the Actor-Critic framework. The actor network takes the state of the intelligent vehicle as input and is responsible for action generation according to the network parameters. The critic network is responsible for the estimation of the state-action Q value, judging the performance of the actor network in executing the action, and guiding the actor network to be updated [20]. The DDPG algorithm is unstable in training and may fall into local optimum due to the problem of over-estimation of the value function. The TD3 algorithm optimizes the actor network and critic network on this basis, including:

(1)
It adopts a double-Q network for updating, establishes two independent critic network computations, and selects two computational errors with smaller objective values to limit the overestimation problem of the DDPG algorithm as shown [21]:

$\displaystyle y_{t}=r_{t}+\eta\mathop{\min}\limits_{i=1,2}Q_{\omega_{i}}^{% \prime}(s_{t+1},{a}^{\prime}_{t+1})$ (1)

where $r_{t}$ is the agent’s reward at moment $t$ ; $\eta$ is the discount factor; $Q_{\omega_{i}}^{\prime}$ is the $Q$ value of the target $Q$ network obtained from the state value $s_{i+1}$ , the action value $a_{i+1}$ and the network parameter $W_{i}$ at moment $t+1$ .
(2)
To avoid the problem of overfitting, the TD3 algorithm also adds noise $\varepsilon$ based on a normal distribution to the computation of the target action value function to improve the robustness of the algorithm.

$\displaystyle{a}^{\prime}_{t+1}=\mu^{\bar{\theta}}(s_{t+1})+c_{\textit{clip}}(% \varepsilon,-c,c)$ (2)

where $\mu^{\bar{\theta}}$ is the target actor network, $\bar{\theta}$ is its parameter, and $c_{\textit{clip}}(\varepsilon,-c,c)$ is the restriction of the added normal noise $\varepsilon$ to within the interval $[-c,c]$ .
(3)
A delayed strategy is used to update, as opposed to the critic value network parameters which are updated every round, the actor policy network is updated less frequently, typically every 2 rounds to reduce the error in the approximated action value function. The actor network parameters $\mu^{\theta}$ are updated by gradient back propagation, and the loss gradient is computed as shown:

$\displaystyle\left\{{{\begin{array}[]{l}{\nabla J(\omega)=\min_{\omega}E_{\pi}% \left[\frac{1}{2}(Q_{\omega}(s_{t},a_{t})-y_{t})^{2}\right]}\\ {\nabla J(\theta)=\max_{\theta}E_{\pi}[Q_{\omega}(s_{t},a_{t})]}\\ \end{array}}}\right.$ (3)

Subsequently, each target network is updated using soft updates as shown in Eq. (4).

$\displaystyle\left\{{{\begin{array}[]{l}{\bar{\theta}=\tau\theta+(1-\tau)\bar{% \theta}}\\ {\bar{\omega}_{i}=\tau\omega+(1-\tau)\bar{\omega}_{i},\mbox{ }(i=1,2)}\\ \end{array}}}\right.$ (4)

where $\tau$ is the soft update rate, usually taking the value of 0.01 or 0.001.

Figure 2.
Twin delayed deep deterministic policy gradient.

2.3 Priority experience replay

In traditional deep reinforcement learning algorithms, samples from the experience pool are randomly drawn with equal probability, however, such sampling does not take into account the fact that different experience samples have different importance, and randomly drawn samples are under-utilized for samples that play a large role in updating the network model parameters. The principle of priority experience replay is to make the probability of each experience sample being sampled and the absolute value of the respective TD-error monotonically related to each experience sample, in order to improve the probability of poorly performing samples being sampled and the learning efficiency of the model. The larger the value of TD-error, the more valuable the sample is.

In the priority experience replay, TD-error is defined as:

$\displaystyle\sigma_{t}=r_{t+1}+\gamma Q_{\pi}(s_{t+1},a_{t+1})-Q_{\pi}(s_{t},% a_{t})$ (5)

where $\gamma$ is the discount factor, $Q_{\pi}(s_{t},a_{t})$ is the state action value function, and is the expected Q value obtained by the agent following the policy to take action $a_{t}$ in state $s_{t}$ .

Non-uniform sampling of samples in the priority experience replay, its priority has two forms of expression: the first is directly with the absolute value of TD-error $\left|{\sigma_{t}}\right|$ to characterize; the second is based on the value of $\left|{\sigma_{t}}\right|$ on the samples for the descending order of the samples to get the sequence $\textit{rank}\left(i\right)$ of the samples $i$ , and then calculate the priority index, such as Eq. (6). The second expression has a better robustness, so in this paper we use the second form of the calculation of the priority index $T_{i}$ of the samples:

$\displaystyle T_{i}=\frac{1}{\textit{rank}(i)}>0$ (6)

The sampling probability of the sample is:

$\displaystyle{T}^{\prime}(i)=\frac{T_{i}^{\alpha}}{\sum\limits_{i}{T_{i}^{% \alpha}}}$ (7)

where $\alpha\in[0,1]$ , when $\alpha=0$ , the sample is sampled using uniform sampling; when $\alpha=1$ , the sample is sampled using TD-error to calculate the priority.

2.4 Composite priority experience replay

Priority experience replay increases the sampling probability of samples with larger weights by introducing TD-error to speed up the convergence of the training model. However, in the TD3 algorithm network, the parameters of the neural network will be updated with each step of the agent, when a certain set of samples with a larger absolute value of TD-error enters into the experience pool, the sample will be given a larger weight, and at the same time the probability of this sample being sampled is higher But with the continuous updating of the network parameters, the absolute value of the TD-error of this sample is reduced, and the probability of this sample data being sampled remains high due to the previous weight assignment, resulting in the probability of the sample data being sampled is still high, which will lead to the emergence of some lower quality samples, interfering with the learning efficiency of the model [22]. Therefore, this paper combines the TD-error mechanism and the immediate return mechanism, which are able to ensure the sample priority while also considering other important sample information [23].

(1)
Define the priority $R_{t}$ in the immediate return based priority mechanism as:

$\displaystyle R_{j}=r_{t}+\mu$ (8)

where $r_{t}$ is the immediate return of the sample; $\mu$ is a small non-zero positive number that ensures each sample has a non-zero priority.
(2)
Arranging the experiences in order of $R_{j}$ from largest to smallest yields $\textit{rank}\left(j\right)$ , which is obtained by compounding the average ordering of the experiences:

$\displaystyle l(k)=\frac{\textit{rank}\left(i\right)+\textit{rank}\left(j% \right)}{2}$ (9)

Calculate the priority of the composite:

$\displaystyle P_{k}=(1/l(k))^{\beta}$ (10)

where the variable $\beta$ is the degree of adoption priority and takes values in the range [0, 1], when $\beta=0$ represents uniform sampling.
(3)
Define the probability of sampling experience as:

$\displaystyle p_{k}=\frac{P_{k}}{\sum\limits_{n}{P_{n}}}$ (11)

where $n$ is the number of experiences.

3. Intelligent vehicle decision-making model based on CPR-TD3

In order to reduce the meaningless collisions in the training process of intelligent vehicles and accelerate the training convergence efficiency, the experience replay pool and reward function in the TD3 algorithm are improved, and the highway scenario is used as the training environment for simulation verification.

3.1 Overall framework

In order to improve the stability of intelligent vehicles in the training process and increase the utilization of priority samples, this paper introduces a composite priority experience replay mechanism and the design of the reward function. The decision-making framework based on CPR-TD3 is shown in Fig. 3.

Figure 3.

The framework for TD3 algorithm based on composite priority experience replay.

During the driving process, the intelligent vehicle driving on the highway scenario generates $e=\{s_{t},a_{t},s_{t+1},r_{t},d\}$ as experience to be deposited into the experience pool, forming historical experience data When a certain number of rounds are reached, the target network starts to take samples for training and inputs the state values into the Q estimation network and the Q target network respectively. Deep reinforcement learning obtains immediate rewards through the interaction of actions with the environment and computes the loss function, which in turn updates the network parameters until the algorithm completes its iterations [31].

3.2 Design of state space and action space

The state space is the state information of the vehicle during driving, containing whether the vehicle is observed or not and the normalized lateral and longitudinal relative speeds and relative positions. The action space is the action commands executed by the auto-vehicle. In this paper, we use the continuous action space, which is set to [throttle, steering], to control the vehicle travelling by adjusting the throttle and steering wheel turning angle. The steering wheel angle is set to [ $-$ 45, 45] [24] in ${}^{\circ}$ . The longitudinal acceleration range is set to [ $-$ 3, 3] in m/s ${}^{2}$ .

3.3 Minimum lane change distance

The lane change minimum safe distance scenario is shown in Fig. 4, where L ${}_{1}$ is the front vehicle in the current lane, L ${}_{2}$ is the front vehicle in the adjacent lane, L ${}_{3}$ and L ${}_{4}$ is the rear vehicle in the adjacent lane, and M is the lane change vehicle.

Figure 4.

Minimum safe distance scenario for lane change.

The minimum safe distance for changing lanes is [25]:

$\displaystyle\left\{{{\begin{array}[]{*{20}c}{S(L_{1},M)=\max\{S_{L_{1}}-S_{M}% +D_{\min}+W\sin(\varphi)\}}\hfill\\ {\begin{array}[]{l}S(L_{2},M)=\max\{S_{L_{2}}-S_{M}+D_{\min}+W\sin(\varphi)\}% \\ S(M,L_{3})=\max\{S_{M}-S_{L_{3}}+D_{\min}+W\sin(\varphi)\}\\ S(M,L_{4})=\max\{S_{M}-S_{L_{4}}+D_{\min}+W\sin(\varphi)\}\\ \end{array}}\hfill\\ \end{array}}}\right.$ (12)

where $W$ is the width of the lane, $\varphi$ is the angle between the vehicle and the lane line when changing lanes, $D_{\min}$ is the headspace between the two vehicles after the end of the vehicle lane change, $S_{i}$ is the longitudinal displacement of vehicle $i$ in the process of changing lanes.

The distance to safety at variable headways is:

$\displaystyle S_{V}=vT_{h}+d_{0}$ (13)

where $v$ is the speed of the intelligent vehicle; $d_{0}$ is the minimum inter-vehicle distance; $T_{h}$ is the variable headway parameter.

$\displaystyle T_{h}=\left\{{{\begin{array}[]{ll}T_{h\_\max},&t_{0}-k_{r}v% \geqslant T_{h\_\max}\\ t_{0}-k_{r}v_{r},&T_{h\_\min}<t_{0}-k_{r}v<T_{h\_\max}\\ T_{h\_\min},&t_{0}-k_{r}v\leqslant T_{h\_\min}\\ \end{array}}}\right.$ (14)

where $T_{h\_\max}$ and $T_{h\_\min}$ are the maximum and minimum values of the variable headway parameter setting respectively; $k_{r}$ is the coefficient of relative speed; $v_{r}$ is the relative speed between the vehicle and the vehicle in front of it; $t_{0}$ is the headway between the vehicle and the vehicle in front of it.

3.4 Design of the reward function

The reward function plays an important role in guiding whether the intelligent vehicle can obtain the optimal strategy. The original reward function in this paper’s scenario only considers speed reward and collision penalty, which is easy to fall into the dilemma of only pursuing high speed and leading to collision or driving at the lowest speed until the end of the round, and it is not conducive to the training of intelligent vehicles. Aiming at the shortcomings of the original reward function Eqs (15) (16), this paper considers factors such as relative speed and relative distance to improve the importance of distance maintenance by intelligent vehicles.

(1)
Speed reward function $r_{1}$ . In order to motivate the intelligent vehicle to travel at a faster speed, this paper designs the following speed reward function:

$\displaystyle r_{1}=\lambda\frac{v-v_{\min}}{v_{\max}-v_{\min}}$ (15)

where $v$ is the current speed of the intelligent vehicle (m/s); $v_{\max}$ and $v_{\min}$ are the highest and lowest speed limits of the vehicle (m/s); $\lambda$ is the reward coefficient of the corresponding highest speed, which is taken as 0.7.
(2)
Collision reward function $r_{2}$ . In order to ensure driving safety and to penalize intelligent vehicles for colliding with other vehicles, a collision reward function is designed:

$\displaystyle r_{2}=\left\{{\begin{array}[]{ll}1&\quad\omega_{\text{collision}% }=1\\ 0&\quad\omega_{\text{collision}}=0\\ \end{array}}\right.$ (16)

where $\omega_{\text{collision}}$ represents whether or not the vehicle was involved in a collision.
(3)
Distance reward function $r_{3}$ . In order to ensure that a certain driving safety distance is maintained with the front vehicle, and the following distance is adjusted. In this paper, feedback is established by MSD and VTH, and its reward function is:

$\displaystyle r_{3}=D_{f}v_{r}/10$ (17)

where $D_{f}$ is the distance factor

$\displaystyle D_{f}=\left\{{{\begin{array}[]{cl}0,&\quad d\geqslant S_{\textit% {VTH}}\ \textit{or}\ v_{r}>0\\ 0.5,&\quad D_{\textit{MS}}<d<S_{\textit{VTH}}\\ \frac{5(D_{\textit{MS}}-d)}{D_{\textit{MS}}},&\quad D_{\textit{MS}}\geqslant d% \\ \end{array}}}\right.$ (18)

where $v_{r}$ is the speed of the vehicle in front, $d$ is the distance between the intelligent vehicle and the vehicle in front.
(4)
Lane changing reward function $r_{4}$ . In order to avoid frequent lane-changing during the driving process of the intelligent vehicle, which leads to the reduction of passenger comfort, the lane-changing penalty is:

$\displaystyle r_{4}=\left\{{{\begin{array}[]{ll}0.15,&\text{ lane change}\\ 0.35,&\text{continuous lane change}\\ 0,&\text{other cases}\\ \end{array}}}\right.$ (19)
(5)
Lane keeping reward function $r_{5}$ In order to maintain a certain distance between vehicles, this paper focuses on the intelligent vehicle to stay in the lane with the maximum Time To Collision (TTC) with the front vehicle during travelling as a function of:

$\displaystyle r_{5}=0.5$ (20)

In summary, the modified composite reward function is:

$\displaystyle r=w_{1}r_{1}+w_{2}r_{2}+w_{3}r_{3}+w_{4}r_{4}+w_{5}r_{5}$ (21)

where $w_{i}$ is the weighting factor of each item.
4. Simulation analysis

4.1 Simulation parameters and environment settings

In this study, the highway scene is selected to build a simulation environment, and the TD3 algorithm based on composite priority experience replay and improvement of the reward function (CPR-TD3) is applied to intelligent vehicle driving behavior decision-making To verify the validity of the algorithm in the highway scene and the convergence speed, and to compare with the traditional TD3.

The simulation environment is as follows: the CPU is Inter Core i7-12700, the memory is 16 GB, and the deep reinforcement learning compilation framework is Pytorch. According to the applicable scenarios and needs of vehicle decision-making, the highway scene is set as a one-way 4 lanes, and the number of other vehicles in the scenario is 50. The other vehicles are caused by the minimization of total braking caused by the lane change and the intelligent driver model for horizontal and vertical control, the specific training parameters are shown in Table 1, and each parameter of the Highway-env environment is shown in Table 2.

Table 1
Experimental training parameters of CPR-TD3 algorithm

Parameters	Value	Parameters	Value
Maximum episodes	6000	Soft update factor	0.005
Replay buffer size	100000	Discount rate	0.9
Strategy frequency	2	Maximum duration(s)	60
Min-batch size	128	Learning decay rate	0.98
Actor network learning rate	0.0001	Critic networks learning rate	0.0002

Table 2

Environmental parameters of highway

Parameters	Value	Parameters	Value
Number of lanes/lane	4	Vehicle width/m	2
Lane length/m	3000	Vehicle length/m	5

Table 3

Experimental data of TD3 algorithm

	TD3	P-TD3	CP-TD3
Average reward	85.86	117.95	127.75
Average distance travelled/m	801.73	1054.97	1177.76
Average speed/ m $\cdot$ s ${}^{-1}$	28.18	28.80	28.91
Success rate	42%	55%	65%

4.2 Analysis of simulation results

4.2.1 Comparison of different algorithms for improving the experience replay mechanism

In this experiment, the TD3 algorithm is used as the basis for comparison experiments, and the TD3 with the addition of priority experience replay (P-TD3) and the TD3 with the addition of composite priority experience replay (CP-TD3) are implemented to compare their training respectively. In the simulation analysis, the number of training rounds is uniformly set to 2000 rounds due to the time cost. In order to avoid the volatility of the training results from interfering with the reading, and at the same time, to make the results display more intuitive, the average reward, speed, and maximum distance travelled of the output of the three algorithms were smoothed using python’s built-in SavitzkyGolay filter.

Figure 5.

Comparison of average reward function.

Figure 6.

Comparison of maximum distance travelled.

Figure 7.

Comparison of average vehicle speeds.

Figures 5–7 show the comparison of the average reward value, distance travelled, and average speed for the three TD3 algorithms with the original reward function, respectively. The three methods converge at approximately 1420, 1200, and 854 rounds, respectively. From Table 4, it can be seen that the traditional TD3 algorithm converges more slowly, with a success rate of only 42%. By adding the preferred experience replay mechanism, thanks to the improvement of sampling experience, P-TD3 accelerates the convergence speed by about 16% compared with the traditional TD3 algorithm, and obtains higher reward values and driving distances. The CP-TD3 algorithm, by composite sorting of the TD-error and the immediate reward, has a higher experience sample utilization is higher, the convergence speed is accelerated by 42% compared to the traditional TD3 algorithm, and the obtained reward value and driving distance are also significantly improved.

4.2.2 Comparison of the introduction of the reward function and the experience replay

According to the original reward function’s, the modified reward function was trained several times, and after several validations, the weights took the values of 1, $-$ 5.0, 2.0, $-$ 2.0, and 2.0.

Table 4
Comparison of two algorithms for improving the reward function

	CP-TD3	CPR-TD3
Average distance travelled/m	1177.76	1510.11
Average speed/ m $\cdot$ s ${}^{-1}$	28.91	29.26
Success rate	65%	78%

Figure 8.

Comparison of maximum travelling distance.

Figure 9.

Comparison of average vehicle speeds.

Taking the TD3 algorithm which introduces the composite priority experience replay mechanism as the base algorithm, combined with Figs 8, 9 and Table 4, by comparing the original reward function and the improved reward function, the average distance travelled and the success rate are significantly improved, the fully improved TD3 algorithm converges at about 410 rounds, and the success rate is improved by 36% compared with the traditional TD3 algorithm, maintaining a safe distance from the vehicle in front of the vehicle has become the key to the success of the intelligent vehicle, which demonstrates that the composite priority experience replay mechanism and the reward function designed according to the minimum distance have further improved the performance of the intelligent vehicle.

5. Conclusion

Aiming at the problems of low training efficiency and insufficient sample utilization of intelligent vehicles under reinforcement learning algorithm, this paper improves the sampling process of TD3 algorithm and the main work is summarized as follows:

(1)

Introducing a composite priority experience replay mechanism, sorting the experience separately by TD-error priority and immediate return priority, and then calculating the experience sampling probability by composite sorting, which improves the utilization efficiency of the experience, increases the robustness, and speeds up the training speed of the network model.

(2)

Designing a combined reward function based on the minimum safe distance and variable headway spacing to improve the success rate of intelligent vehicles during lane changing.

Meanwhile, the study also has the following shortcomings:

(1)

The reward function designed in this paper does not consider the steering wheel steering angle and other factors and the reward function designed in this paper has room for supplementation.

(2)

The experimental environment of this paper is relatively simple, and the difference between the real application scene is large, the follow-up should be considered in the real car scene for experimentation.

References

Liu

Huang

Deng

, et al. Heuristics-oriented overtaking decision making for autonomous vehicles using reinforcement learning. IET Electrical Systems in Transportation. 2020 Nov 13; 10(4): 417-424.

Huang

, et al. Decision-making analysis of autonomous driving behaviors for intelligent vehicles based on finite state machine. Automotive Technology. 2018 Dec 11; 12: 1-7.

Kurt

Özgüner

. Hierarchical finite state machines for autonomous mobile systems. Control Engineering Practice. 2013 Feb 1; 21(2): 184-194.

Kiran

Sobh

Talpaert

, et al. Deep reinforcement learning for autonomous driving: A Survey. IEEE Transactions on Intelligent Transportation Systems. 2021 Jan 23; 23(6): 4909-4926.

Mirchevska

Pek

Boedecker

. High-level decision making for safe and reasonable autonomous lane changing using reinforcement learning. In: 2018 21st International Conference on Intelligent Transportation Systems (ITSC), IEEE, 2018 Nov 1, pp. 2156-2162.

Wang

Zhang

Zhao

, et al. Lane change decision-making through deep reinforcement learning with rule-based constraints. In 2019 International Joint Conference on Neural Networks (IJCNN), IEEE, 2019 Sep 30, pp. 1-6.

Kendall

Hawke

Janz

, et al. Learning to drive in a day. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, 2019 Aug 12, pp. 8248-8254.

Wang

. Chan

. Continuous control for automated lane change behavior based on deep deterministic policy gradient algorithm. In: 2019 IEEE Intelligent Vehicles Symposium (IV), IEEE, 2019 Aug 29, pp. 1454-1460.

Wang

Luo

Liu

, et al. End-to-end self-driving policy based on the deep deterministic policy gradient algorithm considering the state distribution. Journal of Tsinghua University (Science and Technology). 2020 Nov 11; 61(9): 881-888.

10.

, et al. A decision-making strategy for vehicle autonomous braking in emergency via deep reinforcement learning. IEEE Transactions on Vehicular Technology. 2020 Apr 14; 69(6): 5876-5888.

11.

. Study on driving policy of autonomous unmanned system based on deep reinforcement learning. Guangzhou: Guangdong University of Technology; 2020.

12.

Sun

. Research on intelligent control strategy of autonomous driving based on deep reinforcement learning. Daqing: Northeast Petroleum University; 2020.

13.

Mnih

Kavukcuoglu

Silver

, et al. Human-level control through deep reinforcement learning. Nature. 2015 Feb 25; 518: 529-533.

14.

Schaul

Quan

Antonoglou

, et al. Prioritized experience replay. 2015 Nov 18; ArXiv: 151105952.

15.

Zhang

Xiong

Bai

. Improved deep deterministic policy gradient algorithm based on prioritized sampling. Proceedings of 2018 Chinese Intelligent Systems Conference. 2018 Oct 6; 528: 205-215.

16.

Zhang

Liu

. Deep deterministic policy gradient with episode experience replay. Computer Science. 2021 Jun 29; 48(10): 37-43.

17.

Zhang

, et al. Self-Adaptive priority correction for prioritized experience replay. Applied Sciences. 2020 Oct 10; 10(19): 6925.

18.

Bai

Liu

Zhao

. Active sampling for deep Q learning based on TD-error adaptive correction. Journal of Computer Research and Development. 2019 Feb 1; 56(02): 262-280.

19.

Gao

Liu

, et al. Prioritized experience replay method based on experience reward. In: 2021 International Conference on Machine Learning and Intelligent Systems Engineering, IEEE, 2021 Nov 25, pp. 214-219.

20.

. Time matters in using data augmentation for vision-based deep reinforcement learning. 2022 Oct 19; ArXiv: 210208581.

21.

Zhang

Zhou

Zhang

, et al. Energy management for hybrid tracked vehicles based on TD3-PER. Automotive Engineering, 2022 Sep 25; 44(9): 1400-1409.

22.

Cideron

Seurin

Strub

, et al. HIGhER: Improving instruction following with hindsight generation for experience replay. In: 2020 IEEE Symposium Series on Computational Intelligence (SSCI), IEEE, 2021 Jan 5, pp. 225-232.

23.

Chen

. Improvement and application of deep reinforcement learning based on experience replay mechanism. Nanjing: Southeast University, 2021.

24.

Liu

. Research on Human-Machine interaction strategy of take-over in-autonomous driving. Xiamen: Xiamen University of Technology, 2022.

25.

Wang

. Rule-Based constrained deep reinforcement learning for intelligent vehicle driving decisions in highway scenarios. Automotive Technology. 2023 Apr 10; 1-9.

26.

Yang

Jiang

Liu

. Autonomous driving policy learning based on deep reinforcement learning and multi-type sensor data. Journal of Jilin University (Engineering and Technology Edition). 2019 July 9; 49(4): 1026-1033.

27.

Matthew

Peter

. Deep recurrent Q-learning for partially observable MDPs. 2017 Jan 17; ArXiv: 150706527.

28.

Tan

. Finite State machine and its application. Guangzhou: South China University of Technology; 2013.

29.

Talebpour

Mahmassani

Hamdar

. Modeling lane-changing behavior in a connected environment: A game theory approach. 21st International Symposium on Transportation and Traffic Theory. 2015 July 12; 59: 216-232.

30.

Sallab

Abdou

Yogamani

. End-to-End Deep Reinforcement Learning for Lane Keeping Assist. 2016 Dec 13; ArXiv: 161204340.

31.

Mnih

Kavukcuoglu

Silver

, et al. Playing Atari with Deep Reinforcement Learning. 2013 Dec 19; ArXiv: 13125602.

32.

Xiong

Kang

, et al. Decision-making of lane change behavior based on RCS for automated vehicles in the real environment. 2018 IEEE Intelligent Vehicles Symposium (IV). 2018 Oct 21; 1400-1405.

33.

Zhang

Cao

Zhang

. Decision-making rule extraction and decision-making algorithm for lane change in dense fog environment. Science Technology and Engineering. 2019 Jul 28; 19(21): 301-308.

Research on decision making of intelligent vehicle based on composite priority experience replay

Abstract

Keywords

1. Introduction

1.1 Rule-based algorithms

1.2 Learning-based algorithms

1.3 Experience replay

2. Modelling framework

2.1 Reinforcement learning model

3.1 Overall framework

3.3 Minimum lane change distance

4.1 Simulation parameters and environment settings

Table 1 Experimental training parameters of CPR-TD3 algorithm

4.2.1 Comparison of different algorithms for improving the experience replay mechanism

Table 4 Comparison of two algorithms for improving the reward function

References

Table 1
Experimental training parameters of CPR-TD3 algorithm

Table 4
Comparison of two algorithms for improving the reward function