Sage Journals: Discover world-class research

Abstract

This paper proposes a novel human-centric approach to enhance decision-making for autonomous vehicles in complex urban driving situations by integrating Deep Q-Network (DQN) reinforcement learning and social value orientation. In the proposed method, a deep neural network (DNN) is employed to approximate the optimal Q-values for various states and actions in the space of possible actions and reachable states. In order to improve the optimization convergence, an Adam optimization is proposed by combining the advantages of adaptive learning rates and momentum methods. The proposed framework also incorporates a collision avoidance component that allows vehicles to navigate safely through pedestrian crossings. The proposed method is validated through simulation experiments, which show that the proposed approach outperforms traditional decision-making and RL methods in terms of safety and efficiency. Finally, the results demonstrate that integrating social value orientation and DQN-RL can lead to more human-like and socially compliant decision-making frameworks for automated vehicles. This research contributes to developing a new human-centric cyber-physical approach for automated vehicle decision-making and has significant implications for designing future intelligent transportation systems.

Keywords

Intelligent transportation decision-making reinforcement learning control

Introduction

The emergence of autonomous vehicles (AVs) has presented us with transformative opportunities for future intelligent transportation systems.^1,2 AVs offer the potential to improve driving safety and enhance traffic efficiency.³ Despite many efforts to reduce road accidents and the resulting deaths, the number of casualties worldwide has significantly increased, exceeding 1.35 million deaths, according to the World Health Organization (WHO) report in 2018.⁴ Therefore, developing safe and reliable AV frameworks is essential to ensure all road users’ safety, including cyclists, pedestrians, and other forms of vehicles. Deploying autonomous vehicles at a large scale while ensuring safety requires addressing multifaceted technical, social, and ethical challenges and incorporating them into the decision-making process of AVs.⁵ One of the most critical challenges is rendering safe and ethical decisions when faced with complex traffic scenarios,⁶ as AVs rely on pre-planned decision-making algorithms, unlike human-driven cars. To design human-centric AVs, it is crucial to understand how individuals with unique characteristics behave in various scenarios as drivers, bicyclists, and pedestrians. Once the relative geometric positions of all road users are identified for the ego vehicle, an optimal path must be determined between the AV’s current location and its destination, minimizing travel time while satisfying the constraints imposed by the vehicle’s kinetodynamics and physical limitations.⁷ Recent advancements in vehicle dynamics control, particularly in vehicle slip angle estimation using integrated GNSS and IMU data, as demonstrated in two novel approaches,^8,9 significantly enhance the ability of autonomous vehicles to navigate safely in complex scenarios like slalom and double lane changes. In particular, navigating safely through complex and interactive environments that may include pedestrians poses a substantial challenge for autonomous driving.¹⁰

Despite their usefulness in addressing motion control for autonomous vehicles interacting with another road, many motion control frameworks have two primary limitations.^4,11 The first drawback is their excessive tendency to be overly cautious, resulting in an unpredictable driving style that can potentially cause accidents compared to an average human driver.^12–14 The second issue is that these frameworks face difficulty adapting to unforeseen situations, which is a significant concern due to the countless number of potential road scenarios.^4,15 These limitations underscore the need for advanced motion control algorithms to mitigate these issues and enable safe and reliable autonomous driving. In recent years, a plethora of decision-making methods have been proposed, ranging from rule-based methods to machine learning-based methods.¹⁶ Rule-based methods rely on a set of pre-defined rules and heuristics to make decisions.¹⁷ Although these methods are intelligible and straightforward, they often lack flexibility and adaptability to dynamic traffic situations.

On the other hand, machine learning-based methods, such as reinforcement learning (RL), have shown promising results in decision-making for autonomous driving.¹⁸ RL algorithms learn from environmental interactions and can adapt to changing situations. However, these methods require large amounts of data and time to train, and the learned policies may not always generalize well to unseen situations.¹⁹ Deep learning-based approaches have also been proposed, where neural networks are used to estimate the Q-value function in the context of an RL.²⁰ These methods can handle high-dimensional state and action spaces but may suffer from overfitting and instability during training. In addition to these methods, recent works have explored the use of game theory²¹ and human-centered approaches²² for decision-making in autonomous driving. These methods consider the interactions between multiple agents and the human factors involved in driving. Despite the progress made in decision-making for autonomous driving, there are still challenges and limitations that need to be addressed. These include the trade-off between safety and efficiency, the need for real-time decision-making, and the interpretability and transparency of the decision-making process.²³

RL has been a promising approach for developing AV decision-making algorithms, which involve learning from experience by maximizing a reward function that reflects the performance of the AV in different traffic scenarios.²⁴ Hence, RL is a potentially capable method to address the second issue pointed out with respect to classical motion control frameworks. However, RL algorithms may not necessarily guarantee safety and ethical behavior since they optimize the reward function without considering potential risks and consequences.²⁵ To meet this objective, it is imperative to integrate advanced decision-making algorithms with human-centered risk assessment and social-behavioral cues. Human-centered risk assessment (HCRA) is a well-established methodology for identifying, evaluating, and mitigating risks associated with complex systems such as AVs.²⁶ HCRA entails analyzing human factors, system design, and context of use to identify potential hazards and assess their likelihood and severity. Integrating HCRA with RL can provide a mechanism for ensuring AVs’ safe and ethical behavior by constraining the RL algorithms’ optimization process based on the identified risks and constraints.

Social-behavioral cues are observable signals and gestures that humans use to communicate their intentions, emotions, and expectations.²⁷ Integrating social-behavioral cues into the decision-making process of automated vehicles (AVs) can enable them to anticipate and respond accurately to other road users’ actions, which is particularly important for safe and effective interactions with pedestrians and bicyclists.²⁸ Therefore, this paper proposes a cyber-physical model that combines RL, human cognitive risk assessment (HCRA), and social-behavioral cues to enable AVs to make safe and ethical decisions by learning from experience while considering potential risks and constraints. However, existing RL methods focus solely on the ego vehicle’s goals, neglecting the potential negative impact of its actions on surrounding vehicles. Typically, the reward function considers only the ego vehicle’s goals. To address this limitation, we aim to train an autonomous vehicle (AV) agent to prioritize the comfort and safety of surrounding road users, in addition to its own goals, by shaping the reward function using an advanced SVO framework.

The proposed method in this paper aims to enhance the ethical and socially responsible decision-making of autonomous vehicles by employing an advanced SVO framework. SVO is a well-established concept in social psychology that measures an individual’s prioritization of their own welfare compared to that of others. It characterizes individuals into two primary categories: prosocial and egoistic, with prosocial individuals prioritizing the welfare of others over their own and egoistic individuals prioritizing their own welfare over others.

A significant contribution of this paper is incorporating an advanced SVO framework into the reward function of the Deep Q-network method. By doing so, we can achieve more human-intuitive vehicle actions and ethical and socially intelligent decision-making. The reward function is designed to encourage the car to consider the comfort and safety of surrounding road users, in addition to its own goals, by including a penalty term that discourages the vehicle from taking actions that could potentially harm other road users. The reward function also rewards the car for taking actions that prioritize the safety and well-being of others while also taking the vehicle to its desired spatial location with minimal delay. Our approach can lead to more socially responsible decision-making in autonomous driving, with a focus on both the safety of the ego vehicle and the well-being of surrounding road users.

The existing paper makes the following primary contributions:

A novel HCRA-based, model-free deep Q-network reinforcement learning framework is proposed for decision-making in complex urban driving conditions with mixed traffic of a pedestrian crossing scenario.

By combining SVO with deep Q-network, the reward function architecture is shaped to modify the ego-vehicle strategies. This achieves behaviors that range from egoistic to pro-social, resulting in improved pedestrian safety.

The paper demonstrates how the vehicle’s performance improves over time and how different learning parameters (such as the learning rate and discount factor) affect the speed and quality of learning. This enhances learning and convergence.

Deep Q-network is used to estimate the optimal policy and a balanced exploration and exploitation ratio is achieved by leveraging epsilon-greedy policy to balance the exploration versus exploitation rate.

The organization of the remainder of this paper is as follows. In “Problem formulation” section, the problem formulation is presented. In “Deep Q-networks reinforcement learning” section, the preliminaries of reinforcement learning are given, and the proposed DQN-ADAM-SGD and human-centric SVO-based DQN-ADAM-SGD are formulated and developed. The results are presented and discussed in “Results and discussion” section, and “Conclusions” section concludes the paper.

Problem formulation

For the problem formulation, a scenario is considered where a car approaches a pedestrian crossing the street. The goal is to create a collision avoidance system that ensures the car passes the pedestrian safely. To achieve this, the system must consider the distance between the car and the pedestrian and adjust the car’s speed accordingly (Figure 1).

Figure 1.

Proposed framework for collision avoidance scheme for AVs approaching a pedestrian crossing the street.

To define the system state, we use a tuple $(x, v)$ where $x$ represents the distance between the car and pedestrian, and $v$ represents the car’s velocity. The car’s available actions are binary-it can either accelerate or brake. The car’s motion is determined by the following dynamics:

x_{t + 1} = x_{t} - v_{t} Δ t + \frac{1}{2} a_{t} Δ t^{2},

(1)

v_{t + 1} = v_{t} + a_{t} Δ t,

(2)

where $Δ t$ is the time step, $a_{t}$ is the acceleration/deceleration of the car at time $t$ , and $x_{t}$ and $v_{t}$ are the distance and velocity of the car at time $t$ , respectively. To create a collision risk function, a safety distance $d_{s}$ is determined between the car and the pedestrian. If the distance between them is less than the safety distance, then a collision is likely to occur. The collision risk function, denoted by $r (x)$ , is defined as follows:

r (x) = {\begin{matrix} 0, if x > d_{s} \\ 1, if x d_{s} \end{matrix},

(3)

where $r (x) = 0$ indicates no collision risk and $r (x) = 1$ indicates a collision risk. An optimization problem is formulated to design a system that can ensure the car reaches its destination while minimizing the risk of collision:

min_{a_{0}, a_{1}, . . ., a_{T}} [\sum_{t = 0}^{T} r (x_{t}) + λ \sum_{t = 0}^{T} ϕ (a_{t})],

(4)

where $ϕ (a_{t})$ is a penalty function that discourages excessive acceleration/deceleration. Additionally, $a_{0}, a_{1}, . . ., a_{T}$ represent the sequence of acceleration/deceleration actions the car takes from time $0$ to time $T$ . Additionally, $λ$ is a regularization parameter:

ϕ (a_{t | t = 0, . . ., T}) = \sum_{t = 0}^{T - 1} | a_{t + 1} - a_{t} |,

(5)

where $ϕ$ penalizes sudden changes in acceleration/deceleration. Therefore, the problem is to design a collision avoidance system for a car that encounters a pedestrian crossing the road. Then, the primary goal is to calculate and modify the car’s speed according to the distance between the car and the pedestrian, thereby reducing the chances of a collision and ensuring that the car arrives at its intended destination. To tackle this issue, one can formulate an optimization problem that includes a collision risk function and a penalty function for excessive acceleration or deceleration.

Deep Q-networks reinforcement learning

Preliminaries of reinforcement learning

Reinforcement learning (RL) typically involves an agent that interacts with the environment by taking actions and perceiving their consequences. This interaction can be formalized as a tuple, denoted as $(S, A, P, R, γ)$ , where $S$ represents the set of possible states, and $A$ denotes the set of possible actions. The transition probability kernel (TPK), denoted by $P$ , specifies the probability of transitioning from the current state to the next state upon taking an action. Specifically, $P (s' | s, a)$ indicates the probability of transitioning to state $s'$ from state $s$ upon taking action $a$ . The immediate reward function, denoted by $S$ , specifies the reward received upon taking an action in a given state. The discount factor, denoted by $γ$ , determines the trade-off between immediate and future rewards. The transition probability from the current state $s$ to the next state $s'$ given the action $a$ is denoted by $p_{sa, s'}$ , and the reward for taking action $a$ in state $s$ is denoted by $r (s, a)$ .

By taking into account the longer-term reward policy for the agent, the infinite horizon discounted model is applied. Besides, the subsequent rewards are topologically discounted on account of a discount factor ranged between 0 and 1 $(0 \leq γ < 1)$ such as $E (\sum_{t = 0}^{\infty} γ^{t} r_{t})$ . The control policy $π$ is the distribution over the control actions $a$ , given the current state $s$ . The optimal value function is exhibited as the finite expected discounted sum of the rewards:

V^{*} (s) = \max_{π} E (\sum_{t = 0}^{\infty} γ^{t} r_{t}),

(6)

Based on the uniqueness and existence of the optimal result, the solution to the concurrent equations is determined in terms of a recursion expression²⁹:

V^{*} (s) = max_{α} {r (s, α) + γ \sum_{s' \in S} p_{sa, s'} + V^{*} (s')}, \forall s \in S,

(7)

where $V^{*} (s)$ represents the value of s corresponding to the initial optimal action and the above statement shows that the value of the state is the total sum of the expected instantaneous reward and the discounted value of the subsequent state values based on the current action. According to the optimal design, the desired value function is explained as follows²⁹:

π^{*} (s) = \underset{α}{\arg max} {r (s, α) + γ \sum_{s' \in S} p_{sa, s'} + V^{*} (s')},

(8)

Moreover, the action-value function $Q (s, α)$ is described:

Q (s, α) = r (s, α) + γ \sum_{s' \in S} p_{sa, s'} Q (s', α'),

(9)

Hence, the associated optimal solution $Q^{*} (s, α)$ is defined according to the action-value function²⁹:

Q^{*} (s, α) = r (s, α) + γ \sum_{s' \in S} p_{sa, s'} Q^{*} (s', α'),

(10)

where $Q^{*} (s, α)$ denotes the expected discounted reinforcement associated with the $a$ in state $s$ continuously. In equation (10), $Q^{*} (s, α)$ denotes the optimal action-value function, which is the maximum expected reward that can be achieved from state $s$ by taking action $α$ , following the best policy thereafter. The term $r (s, α)$ represents the immediate reward obtained after executing action $α$ in state $s$ . The factor $γ$ is the discount factor, indicating the importance of future rewards compared to immediate rewards. The summation $\sum_{s' \in S} p_{sa, s'} Q^{*} (s', α')$ calculates the expected future rewards, where $p_{sa, s'}$ is the transition probability from state $s$ to state $s'$ after taking action $α$ , and $Q^{*} (s', α')$ is the optimal future value from state $s'$ . Moreover, the $Q$ -learning algorithm explains the update concerned with the $Q$ value according to the delayed parameter $(Θ \in [0, 1])$ :

Q (s, α) : = Q (s, α) + Θ (\begin{matrix} γ max_{α'} Q (s', α') - \\ Q (s, α) + r (s, α) \end{matrix}),

(11)

where the current Q-value for a state-action pair $(s, α)$ , $Q (s, α)$ , is updated based on the new information gained, $Θ$ is the learning rate, determining how much the new information influences the current Q-value. To incorporate RL-based predictive decision-making and prevent obstacle collisions based on a predictive model, the proposed modification is employed. The term $γ max_{α'} Q (s', α')$ represents the maximum discounted reward that can be achieved from the next state $s'$ , optimizing over all possible actions $α'$ . The difference $γ max_{α'} Q (s', α') - Q (s, α)$ , combined with the immediate reward $r (s, α)$ , forms the temporal difference error. This error quantifies the discrepancy between the estimated Q-value and the observed Q-value, adjusted by the immediate reward, guiding the update of the Q-value for better future decision-making.

DQN-ADAM+SGD

In order to enhance Q-learning’s efficacy, particularly in intricate settings, a technique known as Deep Q-Networks (DQN) is employed. Rather than utilizing a lookup table, DQN utilizes a neural network to estimate Q-values. DQN has demonstrated effectiveness in managing state spaces with high dimensions, which may be challenging to represent with a straightforward lookup table. The DQN approach employs a deep neural network to approximate the Q-value function, resulting in an approximation of the optimal Q-function’s Q-value function represented as follows:

Q (s, a; θ) \approx Q^{*} (s, a),

(12)

The Q-value function, denoted as $Q^{*} (s, a)$ , represents the true value of an action taken in a given state, while $θ$ refers to the weights present in a neural network. To estimate Q-values, the DQN algorithm employs a variant of Q-learning called the Q-learning update rule. This rule is defined as follows:

Q (s, a) = Q (s, a) + α [r + γ ma x_{a'} Q (s', a') - Q (s, a)],

(13)

The objective of the DQN algorithm is to acquire the optimal policy $π^{*} (s)$ , which maximizes the expected cumulative discounted reward as:

J (π) = E [\sum_{t = 0}^{\infty} γ^{t} r_{t}],

(14)

where $γ$ is the discount factor and $r_{t}$ is the reward at time step $t$ . A variation of Q-learning, referred to as the Q-learning update rule, is employed to train the neural network.

Δ θ = α (y - Q (s, a; θ)) \nabla θ Q (s, a; θ),

(15)

where $α$ is the learning rate. In order to ensure stability during the learning process, DQN employs a distinct target network that generates target Q-values for training purposes. This target network is an identical copy of the main network, with its weights fixed, and is utilized to generate target Q-values for training. The target Q-value for a given state (s) and action (a) is calculated as follows:

y = r + γ ma x_{a'} Q (s', a'; θ^{-}),

(16)

where $y$ is the target Q-value, and $θ^{-}$ are the weights of a separate target network that is updated less frequently than the main network. To diminish the dependence between successive state-action pairs during training, DQN adopts a method called experience replay. This technique entails preserving previous experiences $(s, a, r, s')$ in a memory buffer and randomly selecting a subset of them to train the network. Besides the Q-value function, DQN employs a target network to stabilize the learning procedure. This target network is an identical copy of the primary network with constant weights and is employed to produce the target Q-values for training. DQN trains via stochastic gradient descent (SGD) utilizing the mean squared error (MSE) loss function:

L (θ_{i}) = E [(y_{i} - Q (s, a; θ_{i}))^{2}],

(17)

where $θ_{i}$ represents the weights of the network at iteration $i$ , and $y_{i}$ is the target Q-value at iteration $i$ .

Backpropagation is employed by DQN to calculate the gradient of the MSE loss relative to the weights $θ$ , which is then utilized by stochastic gradient descent (SGD) to update the neural network’s weights. The gradient of the MSE loss with respect to the weights may be computed as follows:

\nabla θ L (θ_{i}) = E [(y_{i} - Q (s, a; θ_{i})) \nabla θ Q (s, a; θ_{i})]

(18)

Thus, the weights are updated using the SGD update rule as:

θ_{i + 1} = θ_{i} - α \nabla θ L (θ_{i}),

(19)

In DQN, the deep neural network’s structure usually includes multiple fully connected layers. The input layer takes the state (s) and action (a) as inputs, while the output layer generates the forecasted Q-value $(Q (s, a; θ))$ .

One can express the output of a fully connected layer, which employs weights $W$ and biases $b$ , and operates on the input $x = [s, a]$ using the sigmoid activation function as a mathematical equation:

h (x) = σ (W^{T} x + b),

(20)

where σ is the sigmoid activation function, and the bias term b is added to introduce a shift in the activation function.

The Q-network takes the state-action pair $(s, a)$ as input to the first fully connected layer. This pair is formed by concatenating the state and action vectors. The output of the last fully connected layer is a predicted Q-value $Q (s, a; θ)$ , represented as a single scalar value.

The mathematical equation for the Q-value function approximation using a deep neural network can be written as:

Q (s, a; θ) = f (s, a; θ),

(21)

where $f (s, a; θ)$ is the output of the neural network with weights $θ$ for the state-action pair $(s, a)$ .

The process of backpropagation involves utilizing the chain rule to calculate the gradients of the loss function relative to the weights of each layer. When dealing with a fully connected layer that has biases $b$ and weights $W$ , the derivative of the chain rule can be expressed as:

\begin{matrix} \frac{\partial L}{\partial W} = \frac{\partial L}{\partial h} \frac{\partial h}{\partial z} \frac{\partial z}{\partial W} \\ \frac{\partial L}{\partial b} = \frac{\partial L}{\partial h} \frac{\partial h}{\partial z} \frac{\partial z}{\partial b} \end{matrix}

(22)

where $L$ is the loss function, $h$ is the output of the layer, and $z$ is the weighted sum of the inputs to the layer.

For the sigmoid activation function $σ$ , the derivative can be written as:

\frac{\partial s (z)}{\partial z} = s (z) (1 - s (z)) = \frac{e^{- x}}{{(1 + e^{- x})}^{2}},

(23)

The gradients of the loss function with respect to the weights of the final layer can be computed as:

\frac{\partial L}{\partial θ} = \frac{\partial L}{\partial Q} \frac{\partial Q}{\partial θ},

(24)

The gradient of the Q-value with respect to the weights of the network is denoted as $\partial Q / \partial θ$ , where Q is the predicted Q-value. These gradients are utilized to modify the weights of the network by applying the SGD update rule.

To aid the update of parameters in our proposed approach, we utilize the Adam optimizer. The Adam update formula is expressed as:

θ_{i + 1} = θ_{i} - α m_{t} / \sqrt{} m_{t} + ε,

(25)

where $α$ is the learning rate, $m_{t}$ and $v_{t}$ are the first and second moments of the gradients, and $ε$ is a small constant to prevent division by zero. The moments are updated during each iteration as:

\begin{matrix} m_{t} = β_{1} m_{t - 1} + (1 - β_{1}) g_{t} \\ v_{t} = β_{2} v_{t - 1} + (1 - β_{2}) g_{t}^{2} \end{matrix},

(26)

where $g_{t}$ is the gradient at iteration t, and $β_{1}$ and $β_{2}$ are the decay rates for the first and second moments, respectively. Our proposed hybrid Adam optimizer helps SGD through improving (1) adaptive learning rate, (2) momentum, and (3) bias correction.

To achieve adaptive learning rate, the gradients’ first and second moments are calculated and utilized to update the learning rate for each weight. The learning rate for weight $i$ at time $t$ , also known as $α_{i} (t)$ , is computed as follows:

α_{i} (t) = α \sqrt{(1 - β_{2}^{t})} / (1 - β_{1}^{t}),

(27)

where $α$ is the initial learning rate, $β_{1}$ and $β_{2}$ are the decay rates for the first and second moments, respectively, and $t$ is the time step.

In order to comprehend the rationale behind adjusting the learning rate for each weight in this equation, we need to examine the elements involved. The denominator, represented as $(1 - β_{1}^{t})$ , guarantees that the learning rate is modified by a factor that is determined by the number of time steps that have elapsed since the optimizer began updating the weights. As time progresses, this factor becomes smaller, which in turn decreases the learning rate.

On the other hand, the numerator, denoted as $\sqrt{(1 - β_{2}^{t})}$ , ensures that the learning rate is also influenced by a factor based on the second moment of the gradients. Specifically, as time increases, the accuracy of the second moment estimate $(v_{i} (t))$ improves, and the learning rate is lowered to avoid significant updates to the weights.

In the Adam optimizer, the momentum term serves to average out the gradients over time and smooth out any fluctuations in the updates. The momentum term is calculated using a moving average of the gradients, which is updated at each time step $t$ .

v_{i} (t) = β_{1} v_{i} (t - 1) + (1 - β_{1}) g_{i} (t),

(28)

where $v_{i} (t)$ is the moving average of the gradient for weight $i$ at time $t$ , $g_{i} (t)$ is the gradient of the loss with respect to weight $i$ at time $t$ , and $β_{1}$ is the decay rate for the moving average.

Expanding and simplifying the equation reveals why it calculates the momentum term:

\begin{matrix} v_{i} (t) = β_{1} v_{i} (t - 1) + (1 - β_{1}) g_{i} (t) = \\ = {β_{1}}^{2} v_{i} (t - 2) + β_{1} (1 - β_{1}) g_{i} (t - 1) + (1 - β_{1}) g_{i} (t) \\ = {β_{1}}^{3} v_{i} (t - 3) + {β_{1}}^{2} (1 - β_{1}) g_{i} (t - 2) \\ + β_{1} (1 - β_{1}) g_{i} (t - 1) + (1 - β_{1}) g_{i} (t), \end{matrix}

(29)

It is apparent that the momentum term in the Adam optimizer is calculated by taking a weighted average of past gradients, where the more recent gradients have higher weights. This averaging process helps to eliminate noise in the updates and prevents oscillations during training. Furthermore, the Adam optimizer uses bias correction terms to adjust for the bias introduced by the first and second-moment estimates. To be more precise, these bias correction terms are computed in the following way:

{\hat{m}}_{i} (t) = m_{i} (t) / (1 - β_{1}^{t}) {\hat{v}}_{i} (t) = v_{i} (t) / (1 - β_{2}^{t}),

(30)

where $m_{i} (t)$ and $v_{i} (t)$ are the first and second moment estimates for weight $i$ at time $t$ , and $β_{1}$ and $β_{2}$ are the decay rates for the first and second moments.

The equations presented here aim to compensate for the bias in moment estimates that emerges during the early stages of training. This bias can impede convergence and lead to suboptimal updates, as the estimates tend to be skewed towards zero. To address this issue, the equations incorporate bias correction terms that adjust the estimates to account for this bias, resulting in more precise updates. Thus, the Adam optimizer’s momentum and bias correction terms play a crucial role in enhancing convergence’s stability and acceleration by averaging gradients over time and correcting for the bias in moment estimates.

SVO-based DQN+ADAM

In this paper, the Social Value Orientation (SVO) model has been developed, which relies on the concept of cooperation and competition among individuals in social dilemmas. The SVO framework is selected for its robust capability to model and predict individual decision-making behaviors in social contexts, making it ideal for simulating human-like decision-making in AVs. Its flexibility in accounting for a spectrum of behavioral inclinations, ranging from egoistic to prosocial, is crucial for accurately modeling the diversity in human decision patterns. This approach enhances the ethical aspect of decision-making, allowing AVs to make choices that balance self-interest and societal welfare. Furthermore, the integration of SVO into the Deep Q-network’s reward function equips AVs with enhanced social intelligence, which is crucial for navigating mixed-traffic environments. This leads to AV behavior that is not only safe and efficient but also perceived as ethical and socially considerate by human road users. The inclusion of the SVO framework marks a significant advancement in our research, contributing to the development of AVs that are more aligned with ethical standards and societal expectations. It is formulated as a function that considers both the pedestrian’s heading direction relative to the direction of the car, represented by the angle $ϕ$ , as well as the distance between the pedestrian and the car, represented by $x$ .

SV (ϕ, x) = c_{1} + c_{2} \cos (ϕ) + c_{3} \sin (ϕ) - c_{4} x,

(31)

where $c_{1}$ , $c_{2}$ , $c_{3}$ , and $c_{4}$ are constants. In equation (31), the function $SV (ϕ, x)$ is designed to quantify the social value orientation of a pedestrian relative to the autonomous vehicle. The terms involving $\cos (ϕ)$ and $\sin (ϕ)$ capture the directional aspect of the pedestrian’s orientation with respect to the vehicle, where $ϕ$ is the angle between the pedestrian’s heading direction and the direction of the car. The term $- c_{4} x$ accounts for the influence of the distance $x$ between the pedestrian and the car, with a larger distance generally reducing the interaction’s intensity. The constants $c_{1}$ , $c_{2}$ , $c_{3}$ , and $c_{4}$ are calibrated to ensure that the function’s range is between −1 and 1. A negative value of $SV (ϕ, x)$ , which can occur due to a competitive alignment of the pedestrian’s direction with respect to the car or a greater distance, indicates a competitive orientation. This suggests that the pedestrian’s movement is not facilitating the car’s motion or objectives. Conversely, a positive value signifies a cooperative orientation, implying that the pedestrian’s movement is aligned in a way that supports or does not impede the car’s motion. By integrating this social value orientation into the DQN algorithm’s reward function, the autonomous vehicle can better understand and respond to pedestrians’ intentions and behaviors in its decision-making process. The function ranges between −1 and 1, where negative values indicate a competitive orientation and positive values indicate a cooperative orientation. To incorporate the social value orientation into the DQN algorithm, we adjust the reward function by adding a component that considers the pedestrian’s social value orientation. The new reward function can be expressed as:

r (s, a, s') = R_{col} - α SV (ϕ, x),

(32)

If a collision occurs, and 0 otherwise. Furthermore, $R_{col}$ denotes the collision penalty, $α$ represents a weighting factor, and $SV (ϕ, x)$ is the social value orientation function that has been defined earlier. To determine the Q-values for each state-action combination, a deep neural network is employed. The network’s output layer is altered by incorporating an additional neuron to approximate the social value orientation term. The network generates a collection of Q-values for each possible action, along with an estimate of the social value orientation term. The network’s loss function is defined as follows:

\begin{matrix} L (θ) = (r + γ max_{a'} Q (s', a', θ^{-}) - Q (s, a, θ))^{2} \\ + β SV (ϕ, x)^{2}, \end{matrix}

(33)

where $θ$ are the network weights, $θ^{-}$ are the target network weights, and $β$ is a regularization factor. The standard DQN loss constitutes the initial term of the loss function, while the regularization term for social value orientation represents the second term. To train the network, the Adam optimizer is employed, along with bias correction and gradient clipping techniques. The summary of the proposed SVO-oriented DQN+ADAM decision-making scheme is presented in Algorithm 1, and the overall block diagram of the proposed framework based on SVO oriented DQN+ADAM optimizer for the human-centric decision-making of AVs are illustrated in Figure 2.

Algorithm 1: Pseudo-code of the proposed DQN-based decision-making for autonomous car with SVO pedestrian avoidance and Adam optimizer
1: Initialize replay buffer $D$ with capacity $N$ 2: Initialize Q-network with random weights $θ$ 3: Initialize target Q-network with weights $θ^{-} = θ$ 4: Set hyperparameters: discount factor $γ$ , learning rate $α$ , mini-batch size $B$ , momentum parameter $β_{1}$ , $β_{2}$ , and $ϵ$ for Adam optimizer 5: Initialize time step $t \leftarrow 0$ 6: for episode $e = 1$ to $M$ do 7: Initialize state $s_{1}$ 8: Set cumulative reward $R \leftarrow 0$ 9: for $t = 1$ to $T$ do 10: With probability $ϵ$ select a random action $a_{t}$ , otherwise select $a_{t} = \arg max_{a} Q (s_{t}, a; θ)$ 11: Execute action $a_{t}$ in the environment and observe reward $r_{t}$ and new state $s_{t + 1}$ 12: Store transition $(s_{t}, a_{t}, r_{t}, s_{t + 1})$ in $D$ 13: Sample a mini-batch of transitions $(s_{i}, a_{i}, r_{i}, s_{i + 1})$ from $D$ 14: Set target for Q-learning: 15: if $s_{i + 1}$ is terminal then 16: $y_{i} \leftarrow r_{i}$ 17: else 18: $y_{i} \leftarrow r_{i} + γ max_{a} Q (s_{i + 1}, a; θ^{-})$ 19: end if 20: Compute loss $L (θ) = \frac{1}{B} \sum_{i} {(y_{i} - Q (s_{i}, a_{i}; θ))}^{2}$ 21: Compute gradients $\nabla_{θ} L (θ)$ using backpropagation 22: Compute momentum and bias-corrected first and second moment estimates: $\begin{matrix} m_{t} & \leftarrow β_{1} m_{t - 1} + (1 - β_{1}) \nabla_{θ} L (θ) \\ v_{t} & \leftarrow β_{2} v_{t - 1} + (1 - β_{2}) (\nabla_{θ} L (θ {))}^{2} \end{matrix}$ $\begin{matrix} {\hat{m}}_{t} & \leftarrow \frac{m_{t}}{1 - β_{1}^{t}} \\ {\hat{v}}_{t} & \leftarrow \frac{v_{t}}{1 - β_{2}^{t}} \end{matrix}$ 23: Update weights using Adam optimizer: $θ \leftarrow θ - α \frac{{\hat{m}}_{t}}{\sqrt{\hat{v} t} + ϵ}$ 24: Every $C$ steps, update target Q-network weights: $θ^{-} \leftarrow θ$ 25: Set $s_{t} \leftarrow st + 1$ and $R \leftarrow R + r_{t}$ 26: end for 27: end for

Algorithm 1: Pseudo-code of the proposed DQN-based decision-making for autonomous car with SVO pedestrian avoidance and Adam optimizer

1: Initialize replay buffer

D

with capacity

N

2: Initialize Q-network with random weights

θ

3: Initialize target Q-network with weights

θ^{-} = θ

4: Set hyperparameters: discount factor

γ

, learning rate

α

, mini-batch size

B

, momentum parameter

β_{1}

β_{2}

, and

ϵ

for Adam optimizer
5: Initialize time step

t \leftarrow 0

6: for episode

e = 1

M

do
7: Initialize state

s_{1}

8: Set cumulative reward

R \leftarrow 0

9: for

t = 1

T

do
10: With probability

ϵ

select a random action

a_{t}

, otherwise select

a_{t} = \arg max_{a} Q (s_{t}, a; θ)

11: Execute action

a_{t}

in the environment and observe reward

r_{t}

and new state

s_{t + 1}

12: Store transition

(s_{t}, a_{t}, r_{t}, s_{t + 1})

D

13: Sample a mini-batch of transitions

(s_{i}, a_{i}, r_{i}, s_{i + 1})

from

D

14: Set target for Q-learning:
15: if

s_{i + 1}

is terminal then
16:

y_{i} \leftarrow r_{i}

17: else
18:

y_{i} \leftarrow r_{i} + γ max_{a} Q (s_{i + 1}, a; θ^{-})

19: end if
20: Compute loss

L (θ) = \frac{1}{B} \sum_{i} {(y_{i} - Q (s_{i}, a_{i}; θ))}^{2}

21: Compute gradients

\nabla_{θ} L (θ)

using backpropagation
22: Compute momentum and bias-corrected first and second moment estimates:

\begin{matrix} m_{t} & \leftarrow β_{1} m_{t - 1} + (1 - β_{1}) \nabla_{θ} L (θ) \\ v_{t} & \leftarrow β_{2} v_{t - 1} + (1 - β_{2}) (\nabla_{θ} L (θ {))}^{2} \end{matrix}

\begin{matrix} {\hat{m}}_{t} & \leftarrow \frac{m_{t}}{1 - β_{1}^{t}} \\ {\hat{v}}_{t} & \leftarrow \frac{v_{t}}{1 - β_{2}^{t}} \end{matrix}

23: Update weights using Adam optimizer:

θ \leftarrow θ - α \frac{{\hat{m}}_{t}}{\sqrt{\hat{v} t} + ϵ}

24: Every

C

steps, update target Q-network weights:

θ^{-} \leftarrow θ

25: Set

s_{t} \leftarrow st + 1

and

R \leftarrow R + r_{t}

26: end for
27: end for

Figure 2.

A general block diagram of the proposed framework based on SVO oriented DQN+ADAM optimizer for the human-centric decision-making of AVs.

Results and discussion

In the preceding sections, we constructed a decision-making model for an autonomous vehicle in a scenario where a pedestrian is crossing the road. The model uses the distance between the vehicle and the pedestrian, as well as the pedestrian’s social value orientation and the vehicle’s collision risk function. The model employs deep Q-learning and Adam optimization with SGD to train and obtain the optimal policy for the vehicle. In this section, simulation results demonstrate that the model can avoid collisions with pedestrians while minimizing travel time. We compare the model’s performance with other deep learning and Q-learning models, with and without human-centric social value orientation, and discuss the results regarding safety and efficiency for designing decision-making systems for autonomous cars in pedestrian-dense environments. Table 1 displays the hyperparameters used in the proposed social value orientation-based deep Q-learning scheme, including the learning rate, discount factor, epsilon (exploration rate), momentum, and batch size. The simulation experiment results of the proposed algorithm with different hyperparameter settings are also presented in the table.

Table 1.

Hyperparameters used in SVO-based DQN.

Hyperparameter	Value
Learning rate $(α)$	0.00025
Discount factor $(γ)$	0.99
Initial exploration rate $(ϵ)$	1.0
Final exploration rate $(ϵ_{\min})$	0.1
Exploration rate decay ( $ϵ$ -decay)	1e-6
Replay buffer size	100,000
Batch size	32
Target network update frequency	10,000
Adam optimizer learning rate	0.0001
Adam optimizer $β_{1}$	0.9
Adam optimizer $β_{2}$	0.999

The findings indicate that the algorithm’s performance is sensitive to the selection of hyperparameters. A higher learning rate and discount factor lead to better outcomes. A lower epsilon value results in better performance due to the exploration-exploitation trade-off. The momentum parameter is also significant, with a higher momentum value improving performance. Finally, larger batch sizes result in faster convergence of the algorithm.

The Q-values obtained from learning in the autonomous car-pedestrian collision avoidance problem indicate that as the distance between the car and pedestrian decreases, the Q-value decreases as well for both acceleration and brake actions (as shown in Figure 3). The Q-value for acceleration drops at a faster rate than that for braking, suggesting that in this situation, acceleration is riskier than braking. When the car approaches the pedestrian, the Q-values for both actions reach zero, indicating that the risk of collision is at its highest in this region. Hence, these learned Q-values offer valuable insights into the best course of action to take in different states to minimize the risk of collision in the autonomous car-pedestrian scenario.

Figure 3.

Learned Q-values for two actions of braking and acceleration depending on the states.

Figure 4 demonstrates that the car is able to maintain a safe distance from the pedestrian and successfully avoids collisions. The black dashed line represents the position of the pedestrian, and the brown line represents the position of the car. The simulation shows that the car is able to avoid such collisions and maintain a safe distance throughout the simulation.

Figure 4.

Maintaining safe distance from pedestrian and avoiding collisions during the simulation time.

Figure 5 displays the state-action visit outcome for the SVO-based DQN+ADAM approach. It indicates the frequency with which the agent has explored and exploited different actions in each state throughout the learning process. This frequency is represented by the number of times each state-action pair has been visited. A state-action pair with a high visit count, such as the pair of state 96 and action 7, indicates that the agent has learned and updated its DQN-based Q-values more accurately for that pair.

Figure 5.

State-action visit counts: Frequency of states and actions observed during the learning process in a reinforcement learning algorithm.

In Figure 6, the Q-values for each state-action pair are depicted, where the Q-values for the “brake” action are displayed in blue, and the Q-values for the “accelerate” action are represented by the red line. The brown line represents the optimal policy curve learned by the DQN algorithm. This curve displays the Q-values for the optimal action that the agent should take at each state. One notable observation is that the optimal policy curve changes rapidly across different states, suggesting that the optimal action varies frequently depending on the car’s state and the pedestrian’s position. Moreover, the optimal policy curve has higher values than the Q-values for the “brake” and “accelerate” actions, indicating that the optimal policy outperforms the individual actions in terms of avoiding a collision with the pedestrian.

Figure 6.

Q-values and optimal policy for collision avoidance with a pedestrian: Comparison of “brake” and “accelerate” actions with optimal policy curve.

The results discussed earlier are reinforced by Figure 7. The histogram of rewards illustrates the distribution of rewards that the car obtained during the training process. The plot depicts that the majority of the rewards, approximately 96%, are positive and equivalent to 1, which indicates that the car was successful in avoiding collisions with pedestrians. However, there were a few cases where the car collided with a pedestrian, resulting in a negative reward of −1. The plot indicates that only four such instances occurred, accounting for just 4% of the total frequency. This suggests that the trained model is effective in preventing collisions with pedestrians, as shown by the high frequency of positive rewards.

Figure 7.

The histogram of rewards illustrates the distribution of rewards that the car obtained during the training process.

The relationship between car speed and pedestrian position can provide valuable insights into the safety of autonomous driving systems, as shown in Figure 8. The graph demonstrates that collision risk increases as the car gets closer to the pedestrian, but the rate of increase differs depending on the speed. When driving at lower speeds, the risk increases slowly, with a maximum of 0.04 reached at a distance of 1 m. However, at higher speeds, the risk rises quickly, with a maximum of 0.12 reached at a distance of only 0.5 m. This highlights the significance of maintaining a safe distance from pedestrians, particularly when driving at high speeds. The findings also suggest that reducing the speed as the car approaches a pedestrian, even if they are still at a safe distance, could be beneficial. These results emphasize the importance of advanced collision avoidance strategies that consider both the pedestrian’s position and the car’s speed.

Figure 8.

The relationship between vehicle speed and pedestrian position: Collision risk analysis for the proposed SVO-oriented DQN+ADAM.

In reinforcement learning, balancing exploration and exploitation is crucial, and the exploration strategy employed can have a significant impact on an agent’s performance. Figure 9 illustrates the importance of exploration in the proposed DQN-based SVO approach by comparing exploration versus exploitation over 100 episodes. The plot shows that exploitation remains constant at 0.5, indicating a fair balance between exploration and exploitation in the agent’s policy. However, the decreasing exploration rate from 0.95 to 0.05 suggests that the agent’s policy becomes more deterministic over time, which may limit its ability to find the optimal solution. Therefore, to ensure that the agent explores all possible states and actions and finds the optimal solution, the proposed approach requires a high exploration rate. It is critical to maintain a high exploration rate during the early stages of training for better performance.

Figure 9.

Exploration versus exploitation in DQN-based SVO approach: Impact on agent’s performance.

Table 2 compares the performance of three algorithms: DQN without SVO, DQN with SVO, and Q-learning with SVO. The comparison is based on two metrics, namely, the average collision risk and the average episode length. The findings indicate that DQN with SVO performs significantly better than the other two algorithms, with an average collision risk of 13.2% and an average episode length of 130 s over 100 episodes. On the other hand, DQN without SVO and Q-learning with SVO exhibit average collision risks of 18.5% and 15.0%, respectively, and average episode lengths of 120 and 115 s over 100 episodes, respectively. These results suggest that integrating SVO into DQN decision-making enhances the algorithm’s ability to avoid collisions and enables the car to navigate more complex environments, resulting in longer episodes. As a result, the proposed algorithm with SVO has the potential to be a promising approach for autonomous car navigation in real-world scenarios.

Table 2.

Comparison of DQN performance with and without SVO based on collision risk and average episode length.

Algorithm	Avg. collisionrisk (%)	Avg. episodelength
DQN without SVO	18.5	120 s over 100 episodes
DQN with SVO	13.2	130 s over 100 episodes
Q-learning + SVO	15.0	115 s over 100 episodes
SARSA with SVO³⁰	16.0	134 s over 100 episodes

The three subplots in Figure 10 provide crucial insights into the performance of our algorithm. In the first subplot, we observe that the value function oscillates significantly, starting from an initial value of 1 and continuously decreasing while oscillating across the 90 states. This behavior is expected, as the value function is learning to approximate the expected long-term rewards for each state-action pair. The oscillation can be attributed to the learning process, which involves updating the value estimates based on the observed rewards and transitions.

Figure 10.

Performance evaluation of the proposed algorithm: (a) oscillation of the value function over 90 states, (b) average reward obtained over 100 episodes, and (c) cumulative reward obtained over time steps.

The second subplot shows the average reward obtained over 100 episodes. Interestingly, the average reward remains constant at 1 for the first 25 episodes, after which it varies significantly between 0.9 and 1 for the remaining episodes. This variation in the average reward can be attributed to the exploration-exploitation trade-off that the algorithm has to balance. Initially, the algorithm tends to explore different actions and hence obtains higher rewards, while later, it exploits the learned policy and obtains lower rewards.

The third subplot depicts the cumulative reward obtained over time steps. Starting from an initial value of 0, the reward gradually accumulates and reaches 100 by the end of 100 time steps. This trend is expected as the algorithm learns to maximize the long-term rewards by selecting the optimal actions for each state. Overall, the subplots suggest that our algorithm is learning to approximate the expected rewards and converge to an optimal policy.

Conclusions

In this paper, we proposed a new human-centric cyber-physical approach for automated vehicle decision-making by integrating Deep Q-Network (DQN) reinforcement learning and social value orientation. Our approach enables automated vehicles to make decisions that are not only optimal but also aligned with human social values.

To achieve this, we formulated the decision-making problem as a Markov Decision Process and used DQN to learn the optimal Q-value function. We also integrated the Adam optimization algorithm with stochastic gradient descent to improve the learning process. Furthermore, we introduced a social value orientation model for pedestrians to capture their preferences and behavior in the decision-making process. This model enables the automated vehicle to consider the pedestrian’s social values and preferences when making decisions, thus promoting human-centric decision-making.

Our results demonstrate that our proposed approach outperforms existing approaches in terms of both collision avoidance and human-centric decision-making. We believe that our approach has significant implications for the future of automated vehicle decision-making and can lead to a safer, more efficient, and more socially responsible transportation system.

In conclusion, our proposed approach integrates DQN reinforcement learning, Adam optimization, and social value orientation to enable automated vehicles to make decisions that are optimal, safe, and aligned with human social values. The incorporation of social value orientation in automated vehicle decision-making is a critical step towards achieving a human-centric transportation system, which is crucial for realizing the full potential of automated vehicles.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Hamid Taghavifar

References

Taghavifar

Shojaei

. Adaptive robust control algorithm for enhanced path-tracking performance of automated driving in critical scenarios. Soft Comput 2023; 27: 8841–8854.

Bai

Hao

Shangguan

, et al. Hybrid reinforcement learning-based eco-driving strategy for connected and automated vehicles at signalized intersections. IEEE Trans Intell Transp Syst 2022; 23(9): 15850–15863.

Liu

Wang

Zhou

, et al. Trajectory prediction of preceding target vehicles based on lane crossing and final points generation model considering driving styles. IEEE Trans Veh Technol 2021; 70(9): 8720–8730.

Crosato

Shum

, et al. Interaction-aware decision-making for automated vehicles using social value orientation. IEEE Trans Intell Veh 2022; 8(2): 1339–1349.

Dubljevic

List

Milojevich

, et al. Toward a rational and ethical sociotechnical system of autonomous vehicles: a novel application of multi-criteria decision analysis. PLoS One 2021; 16(8): e0256224.

Zhou

Zhang

, et al. A review on key challenges in intelligent vehicles: safety and driver–oriented features. IET Intell Transp Syst 2021; 15(9): 1093–1105.

Wang

. Trust-based and individualizable adaptive cruise control using control barrier function approach with prescribed performance. IEEE Trans Intell Transp Syst 2021; 23(7): 6974–6984.

Xia

Hashemi

Xiong

, et al. Autonomous vehicle kinematics and dynamics synthesis for sideslip angle estimation based on consensus Kalman filter. IEEE Trans Control Syst Technol 2022; 31(1): 179–192.

Liu

Xia

Xiong

, et al. Automated vehicle sideslip angle estimation considering signal measurement characteristic. IEEE Sens J 2021; 21(19): 21675–21687.

10.

Tian

Markkula

Wei

, et al. Impacts of visual and cognitive distractions and time pressure on pedestrian crossing behaviour: a simulator study. Accid Anal Prev 2022; 174: 106770.

11.

Wang

Taghavifar

, et al. MME-EKF-based path-tracking control of autonomous vehicles considering input saturation. IEEE Trans Veh Technol 2019; 68(6): 5246–5259.

12.

Seth

Cummings

. Traffic efficiency and safety impacts of autonomous vehicle aggressiveness. Simulation 2019; 19: 20.

13.

Trautman

Krause

. Unfreezing the robot: navigation in dense, interacting crowds. In: 2010 IEEE/RSJ international conference on intelligent robots and systems, October 2010, pp.797–803. New York: IEEE.

14.

Taghavifar

. Robust AISMC-neural network observer-based control of high-speed autonomous vehicles with unknown dynamics. Proc IMechE, Part D: J Automobile Engineering 2023; 237: 09544070221145742.

15.

Taghavifar

, et al. Optimal reinforcement learning and probabilistic-risk-based path planning and following of autonomous vehicles with obstacle avoidance. Proc IMechE, Part D: J Automobile Engineering 2023; 09544070221149278https://doi.org/10.1177/09544070221149278.

16.

Wang

Jiang

, et al. Decision-making driven by driver intelligence and environment reasoning for high-level autonomous vehicles: a survey. IEEE Trans on Intell Transp Syst 2023. https://doi.org/10.1109/TITS.2023.3275792

17.

Negash

Yang

. Driver behavior modeling towards autonomous vehicles: comprehensive review. IEEE Access 2023; 11: 22788–22821.

18.

Aradi

. Survey of deep reinforcement learning for motion planning of autonomous vehicles. IEEE Trans Intell Transp Syst 2020; 23(2): 740–759.

19.

Duan

Eben Li

Guan

, et al. Hierarchical reinforcement learning for self-driving decision-making without reliance on labelled driving data. IEEE Intell Transp Syst 2020; 14(5): 297–305.

20.

Mnih

Kavukcuoglu

Silver

, et al. Human-level control through deep reinforcement learning. Nature 2015; 518(7540): 529–533.

21.

Hang

Huang

, et al. Cooperative decision making of connected automated vehicles at multi-lane merging zone: a coalitional game approach. IEEE Trans Intell Transp Syst 2021; 23(4): 3829–3841.

22.

Hang

Xing

, et al. Human-like decision making for autonomous driving: a noncooperative game theoretic approach. IEEE Trans on Intell Transp Syst 2020; 22(4): 2076–2087.

23.

Hang

Huang

, et al. An integrated framework of decision making and motion planning for autonomous vehicles considering social behaviors. IEEE Trans on Veh Technol 2020; 69(12): 14458–14469.

24.

Sallab

Abdou

Perot

, et al. Deep reinforcement learning framework for autonomous driving. arXiv preprint arXiv:1704.02532, 2017.

25.

Kiran

Sobh

Talpaert

, et al. Deep reinforcement learning for autonomous driving: A survey. IEEE Trans Intell Transp Syst 2021; 23(6): 4909–4926.

26.

Shin

Kim

, et al. Human-centered risk assessment of an automated vehicle using vehicular wireless communication. IEEE Trans Intell Transp Syst 2018; 20(2): 667–681.

27.

Hulse

. Pedestrians’ perceived vulnerability and observed behaviours relating to crossing and passing interactions with autonomous vehicles. Transp Res Part F: Traffic Psychol Behav 2023; 93: 34–54.

28.

Hang

Huang

, et al. An integrated framework of decision making and motion planning for autonomous vehicles considering social behaviors. IEEE Trans Veh Technol 2020; 69(12): 14458–14469.

29.

Liu

, et al. Reinforcement learning optimized look-ahead energy management of a parallel hybrid electric vehicle. IEEE/ASME Trans Mechatron 2017; 22(4): 1497–1507.

30.

Taghavifar

Wei

Taghavifar

. Socially intelligent reinforcement learning for optimal decision-making for autonomous vehicle control in traffic scenarios. IEEE Trans Autom Sci Eng 2024https://doi.org/10.1109/TASE.2023.3347264.

Integrating deep reinforcement learning and social-behavioral cues: A new human-centric cyber-physical approach in automated vehicle decision-making

Abstract

Keywords

Introduction

Problem formulation

Deep Q-networks reinforcement learning

Preliminaries of reinforcement learning

DQN-ADAM+SGD

SVO-based DQN+ADAM

Results and discussion

Conclusions

Footnotes

Declaration of conflicting interests

Funding

ORCID iD

References