Abstract
Same-day delivery (SDD) services have become increasingly popular in recent years. Previous studies have usually modeled these services as a class of dynamic vehicle routing problem (DVRP) in which goods must be delivered from a depot to a set of customers on the same day that the orders are placed. Exact solution methods for DVRPs can become intractable even for small problem instances. In this paper, the same-day delivery problem (SDDP) is formulated as a Markov decision process (MDP) and solved using a parameter-sharing Deep Q-Network, which corresponds to a decentralised multi-agent reinforcement learning (MARL) approach. For this, a multi-agent grid-based SDD environment is created, consisting of multiple vehicles, a central depot, and dynamic order generation. In addition, zone-specific order generation and reward probabilities are introduced. The performance of the proposed MARL approach is compared against a mixed-integer programming (MIP) solution. Results show that the proposed MARL framework performs on par with the MIP-based policy when the number of orders is relatively low. For problem instances with higher order arrival rates, computational results show that the MARL approach underperforms MIP by up to 30%. The performance gap between the two methods becomes smaller when zone-specific parameters are employed: the gap is reduced from 30% to 3% for a 5 × 5 grid scenario with 30 orders. Execution time results indicate that the MARL approach is, on average, 65 times faster than the MIP-based policy, and may therefore be more advantageous for real-time control, at least for small-sized instances.
Introduction
Same-day delivery (SDD) services as used by online retailers seek to deliver goods to customers less than 24 h after the order is placed. SDD services play an integral part in e-commerce by allowing online retailers to offer the immediacy of in-person shopping to their consumers. In fact, the SDD market in the U.S. alone is predicted to grow from $4.7 billion to $9.6 billion in 2022 ( 1 ). This increase in demand is likely to have accelerated even further as consumers massively adopted online shopping during the COVID-19 pandemic.
However, this upward trend in SDD services has led to new challenges related to sustainability, costs, and an aging workforce. According to Boysen et al., an increase in delivery vans entering city centers negatively affects the environment, human health, and road safety ( 2 ). Labour costs associated with traditional van deliveries are high, and an aging workforce in many industrialised countries exacerbates the labour shortage for such a physically demanding job ( 3 ). Furthermore, SDD may reduce efficiency by diminishing the opportunity for consolidated deliveries with a full load, which worsens the impact of last-mile delivery on the environment. Therefore, as the SDD market continues to grow, it becomes increasingly important to optimise its operations.
Since SDD services have only recently been widely adopted, the literature in the context of the dynamic vehicle routing problem (DVRP) is relatively scarce. A few of the recent approaches in the literature include a sample-scenario planning approach, a dynamic dispatch waves approach, and mixed-integer programming (MIP). These studies will be further analysed in the Literature Review.
Exact solution methods for DVRPs rarely scale past a few vehicles ( 4 ). Reinforcement learning (RL), on the other hand, has arisen as a powerful method that can potentially scale to thousands of vehicles and orders. Recently, the combination of deep neural networks (DNNs) with RL has been shown to achieve superior performance against MIP solvers in similar fleet management problems ( 6 ). In addition, DNNs allow for offline training and real-time execution, which is crucial for the same-day delivery problem (SDDP).
To the best of the authors’ knowledge, only Chen et al. have implemented RL to solve an SDDP involving heterogeneous fleets of drones and vehicles ( 5 ). However, the RL model developed only solves the assignment part of the problem, that is, whether an order is assigned to a drone or a vehicle. Other decisions, such as route planning and pre-emptive depot return, are solved using separate heuristics and algorithms, which may lead to sub-optimal policies. Outside the SDD literature, RL has been used to solve a similar on-demand pick-up and delivery (ODPD) problem. Balaji et al. showed that RL outperforms a MIP solution approach to the ODPD problem for a single-vehicle scenario ( 6 ).
This paper makes an important contribution to the SDD literature by exploring the implementation of a state-of-the-art decentralised multi-agent deep RL approach. More specifically, a multi-agent version of the well-known Deep Q-Network (DQN) method, called parameter-sharing DQN, is adopted. The main advantage of this method over MIP-based solvers is real-time execution (after offline training), which is particularly important in SDD to meet customers’ expectation of an immediate response. In addition, the MARL formulation gives extra flexibility by allowing agents to learn policies in which vehicles strategically wait at the depot, pre-emptively return to the depot, reject requests, or execute unconventional routes in anticipation of future orders. These flexibilities have been shown to significantly improve performance ( 7 – 9 ). Nevertheless, one of the biggest challenges of RL for vehicle routing problem (VRP)-related problems is the large action space, which can limit algorithm scalability.
In short, this paper aims to further explore and fill in the gaps in the SDD literature with the following step-by-step objectives:
To model SDDP as a Markov decision process (MDP) with an action space that allows for different agent strategies, such as the ones described above.
To develop a virtual environment based on the MDP model as a simulator for the MARL framework.
To implement a state-of-the-art MIP solution approach for SDDP as a benchmark for the MARL approach.
To assess the performance and scalability of the proposed MARL approach.
The rest of the paper is structured as follows. First, the literature related to SDDP and the RL solution approach is discussed. Then, a general description of SDDP is given, along with a general MDP model formulation, followed by a presentation of the RL solution approach, or, more specifically, the parameter-sharing DQN algorithm. The results section describes the set-up of the experiments, the implementation of DQN, the benchmark algorithm, and, lastly, the results of the experiments. Finally, the paper is concluded and the potential directions for future work are discussed.
Same-Day Delivery Problem (SDDP) and Vehicle Routing Problem (VRP)
Same-Day Delivery Problem (SDDP): General Background
SDDP involves the delivery of goods from a depot to customers, with customer requests coming in over the course of a day. The goods must be picked up from a depot before delivery can occur and must be delivered within the same day. SDD-specific literature is relatively scarce as it is very recent, although it is becoming an increasingly popular service in the real world.
Azi et al. considered SDDP for a fleet of vehicles and approached the problem by considering multiple scenarios for future requests to make decisions on whether to accept a request or not ( 7 ). Similarly, Voccia et al. solved the same problem with the same sample-scenario planning approach, but differ in that they allow vehicles to strategically wait at the depot in anticipation of future requests that can fit into the current planned route ( 8 ). This additional flexibility is particularly useful when orders are heterogeneously spaced in time. In contrast to waiting, Ulmer et al. investigated a strategy that allows vehicles to pre-emptively return to the depot before serving all customers ( 9 ). They found that this strategy increases the number of customers served per workday, using a combination of a routing heuristic and approximate dynamic programming methods.
Several studies have approached SDDP by formulating it as a dynamic dispatch waves problem (DDWP) ( 10 – 12 ). The DDWP only allows routes to start at certain dispatch epochs, in this case, every hour. The first two of these papers consider only a single delivery vehicle and implement an a priori policy, which applies the rollout algorithm to decide when to leave the depot and which customers to serve. On the other hand, Heeswijk et al. formulate an MDP model ( 12 ). For large state spaces, they solve the problem using an integer linear program together with linear value function approximation.
Ming et al. considered SDDP with a focus on local small retail stores as depots and crowd-shippers as the delivery vehicles ( 13 ). Assignment decisions are made based on a mixed-integer linear programming (MILP) model with a rolling horizon structure to consider future requests. Liu explored on-demand meal delivery with drones ( 14 ). Similarly, a MILP model is devised to represent the problem, while heuristics are used to solve the dynamic part of the problem.
All the previous papers explore only homogeneous fleets, but there are a few recent papers dedicated to studying SDDP with heterogeneous fleets (SDDPHF). Liu was the first to study the incorporation of drones into a fleet of vehicles in the context of SDD, applying a parametric policy function approximation (PFA) approach ( 14 ). Chen et al. improved on this by using a deep Q-learning approach, which demonstrated superior performance when compared with PFA ( 5 ). They attributed this improvement to DQN’s ability to incorporate more information into its decision-making process, such as the availability of resources and demands. Furthermore, both papers used a technique novel to the SDD order assignment problem: the minimum cost insertion heuristic.
Multi-Agent Reinforcement Learning (MARL) in the Vehicle Routing Problem (VRP)
MARL refers to multiple learnable agents that take actions in, and receive rewards from, the same environment ( 15 ). Agents in the same environment can be trained independently or cooperatively ( 16 ), where the latter allows communication between agents. Busoniu et al. showed that cooperative Q-learning can significantly outperform independent Q-learning in many distinct settings if it can be used efficiently ( 15 ). Cooperative training benefits from the additional observations from other agents and results in faster learning, but at the cost of communication. Tampuu et al. applied DQN in a MARL setting, which showed the potential of DQN as a tool for decentralised learning of multi-agent systems ( 17 ).
In the context of SDD, decentralised MARL has yet to be explored, but in the broader DVRP literature there are more relevant works exploring decentralised MARL. Balaji et al. implemented DQN to optimise the route of a single delivery driver in a stochastic and dynamic environment ( 6 ). Their work has shown that the RL agent can consistently outperform a state-of-the-art MIP. However, the training time was roughly 3 days for a single vehicle in an 8 × 8 grid map with 10 maximum orders, which may lead to unacceptable training times for scenarios with larger fleets. Lin et al. showed that careful design of the simulator can allow scalability to large-scale fleet management systems ( 18 ). Contextual deep Q-learning and contextual multi-agent actor-critic algorithms are used to achieve explicit coordination between agents, and these also outperformed state-of-the-art approaches in empirical studies. In contrast to the former paper, the latter successfully scaled up to a grid of 504 hexagonal cells with thousands of orders. Nevertheless, it is important to note that the problems solved by these two sets of authors are significantly different: the latter studied the fleet repositioning problem, which has a smaller state-action space, whereas the former studied the ODPD problem, which has a much larger state-action space.
In summary, DQN has been shown to be useful in decentralised learning of multi-agent systems. However, scalability to city-sized instances has only been attempted for problems where the RL agents’ actions are limited. For policies in which agents have flexible action choices, only a single-vehicle scenario has been tested ( 6 ). This paper first tests the solution approach of Balaji et al. ( 6 ) on SDDP for a single agent and then attempts to scale it up by utilizing a decentralised multi-agent system, the parameter-sharing DQN.
The solution proposed in this paper also aims to take advantage of many of the strategies employed in the SDD literature discussed earlier, which has yet to be done, possibly because of the complexity of implementing multiple flexibilities using only heuristics. For example, in Voccia et al., vehicles are allowed to strategically wait at the depot, but a pre-emptive depot return is not allowed ( 8 ). RL methods can easily implement these strategies by designing a flexible action space such that agents can anticipate future requests and strategically wait at the depot, as well as pre-emptively return to the depot. However, it is worth noting that these added flexibilities do not guarantee that the RL agent will learn to exploit all of them in an effective manner.
Methodology
Problem Description
The problem involves a fleet of vehicles delivering parcels from a depot to customers whose requests arrive stochastically and dynamically over the course of a day. Although the probability distribution of where and when a customer will appear is known, the actual location and time of each customer request are unknown until revealed.
When a request comes in, each vehicle in the environment is required to accept or reject the request. The vehicle must make this decision within the next time step to simulate real-world SDD service providers, which give immediate feedback to the customers on whether an SDD service is available.
On order acceptance, the parcel must be delivered within a fixed deadline, and missing the deadline results in a penalty. Penalties are given for any accepted but missed orders because such orders mean customer expectations are not being met, resulting in low customer satisfaction. If more than one vehicle accepts the order at the same time step, the order is assigned to the agent with the minimum insertion cost. This assignment is irreversible, since the process of packaging and loading onto a specific vehicle would already have begun.
Among its assigned orders, each vehicle can also decide the route plan, that is, which orders to serve first. In addition, the vehicles can choose to wait strategically at the depot in anticipation of future requests. While en route, agents are also allowed to pre-emptively return to the depot to consolidate the delivery route. If an order is rejected by all vehicles, no penalty or reward is given, as it is assumed that the order is simply assigned to another delivery service such as next-day delivery.
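As an illustration of this tie-breaking rule, a minimal sketch of minimum-insertion-cost assignment on the Manhattan grid is given below; the planned_route attribute and the function names are assumptions made for illustration only.

```python
def manhattan(a, b):
    # Manhattan (grid) distance between two cells.
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def route_length(stops):
    # Total Manhattan length of a route visiting the stops in order.
    return sum(manhattan(p, q) for p, q in zip(stops, stops[1:]))

def insertion_cost(route, order_loc):
    # Cheapest increase in route length obtained by inserting order_loc at any
    # position of the planned route (index 0 is the vehicle's current location,
    # so insertion starts at index 1).
    base = route_length(route)
    return min(route_length(route[:i] + [order_loc] + route[i:]) - base
               for i in range(1, len(route) + 1))

def assign_order(accepting_vehicles, order_loc):
    # Among the vehicles that accepted the order in this time step, assign it
    # to the one with the minimum insertion cost (the rule described above).
    # `planned_route` is an assumed attribute holding a list of grid cells.
    return min(accepting_vehicles,
               key=lambda v: insertion_cost(v.planned_route, order_loc))
```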
Model Preparation
SDDP is modeled as an MDP. The fleet of
Markov Decision Process (MDP) Model Formulation
SDDP was modeled as in Ulmer et al. and Ulmer and Thomas ( 9 , 19 ). There are five main components to an MDP: decision point, state space, action space, rewards, and transition. The decision point is defined as the time at which a decision is made. For this problem, a decision is required at every time step, so each time step corresponds to a decision point.
The state contains all the information needed to make the decision at a particular decision point. The state at time,
The order status has the following values:
The state is mathematically defined as:
At each decision point, an action within the action space is to be selected. Each vehicle in the set
Therefore, the number of available actions is
The reward for a state-action pair is denoted as
Once the decision is made, the environment transitions to the post-decision state,
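Since the formal definitions are given by the expressions above, the following gym-style skeleton is only a minimal sketch of the decision point / state / action / reward / transition loop; all field and method names are illustrative assumptions rather than the exact formulation.

```python
from dataclasses import dataclass

@dataclass
class SDDState:
    t: int                    # current time step (decision point)
    vehicle_positions: list   # (x, y) grid cell of each vehicle
    order_locations: list     # (x, y) grid cell of each open order
    order_deadlines: list     # remaining time steps before each order expires
    order_status: list        # e.g. unassigned / accepted / in transit (assumed labels)

class SDDEnvSketch:
    """Skeleton only; the exact state and action definitions are given above."""

    def reset(self) -> SDDState:
        # Return the initial state at t = 0, with all vehicles at the depot
        # and no open orders.
        ...

    def step(self, joint_action):
        # joint_action holds one action per vehicle (move, wait at the depot,
        # accept or reject the newest request, or pre-emptively return).
        # The environment applies the actions, stochastically generates new
        # orders, and returns the post-decision state, the per-vehicle rewards,
        # and a done flag marking the end of the episode.
        ...
```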
Reinforcement Learning (RL) Solution Approach
DQN is a modified version of the simpler Q-learning algorithm, which is an RL method that learns the value of a state-action pair, known as Q-values. Each Q-value is an estimate of the expected future reward for taking an action in any given state. A table containing the Q-values of all possible state-action pairs is known as a Q-table.
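For reference, the tabular Q-learning update applied after observing a transition (s, a, r, s′) is:

Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]

where \alpha is the learning rate and \gamma the discount factor.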
However, it is virtually impossible to exhaustively explore all the possible Q-values for tabulation in complex environments with multiple agents. Therefore, a DNN is used to approximate the Q-values, using a given set of features from the state space as inputs. This is known as DQN and it is based on the following loss function:

L(\theta) = \mathbb{E}_{(s, a, r, s')} \Big[ \big( r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta) \big)^{2} \Big]

where \theta denotes the parameters of the online network, \theta^{-} the parameters of the periodically updated target network, \gamma the discount factor, and (s, a, r, s') a sampled transition.
A parameter-sharing DQN framework is used, that is, each agent can learn an independent policy, but all agents share the parameters of the network. This makes training more scalable than the fully decentralized approach, because the number of trainable parameters does not depend on the number of agents.
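A minimal sketch of how parameter sharing can be realised during action selection is shown below; shared_q_net stands for any Q-network mapping an agent's observation tensor to per-action values (an agent identifier may additionally be appended to the observation so that agents can still behave differently), and all names are assumptions for illustration.

```python
import torch

def select_joint_actions(shared_q_net, observations, epsilon, n_actions):
    # Parameter sharing: the same network (hence the same trainable weights)
    # is evaluated once per agent on that agent's own observation, so the
    # number of parameters does not grow with the fleet size.
    actions = []
    for obs in observations:                      # one 1-D feature tensor per vehicle
        if torch.rand(1).item() < epsilon:        # epsilon-greedy exploration
            actions.append(int(torch.randint(n_actions, (1,)).item()))
        else:
            with torch.no_grad():
                q_values = shared_q_net(obs.unsqueeze(0))
            actions.append(int(q_values.argmax(dim=-1)))
    return actions
```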
Experiments
The experiments were run on a grid world environment, where agents can only move north, south, east, or west. Each episode consists of 144 time steps (
The default depot location is set at the center of the grid. If the grid size is an even number, there are four central cells, and the depot is set at any one of them; for example, the depot location can be at (5,5) or (6,6) in a 10 × 10 grid map. A representation of the grid world is shown in Figure 1.

Illustration of a 10 × 10 gridworld environment with five vehicles.
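The depot placement rule described above can be sketched as follows; grid cells are 1-indexed here to match the (5,5)/(6,6) example, which is an assumption of this sketch.

```python
import random

def depot_location(grid_size: int) -> tuple:
    # Odd grids have a unique centre cell; for even grids any one of the
    # four central cells is chosen, e.g. (5,5), (5,6), (6,5), or (6,6)
    # in a 10 x 10 grid.
    if grid_size % 2 == 1:
        c = (grid_size + 1) // 2
        return (c, c)
    low, high = grid_size // 2, grid_size // 2 + 1
    return random.choice([(low, low), (low, high), (high, low), (high, high)])
```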
The deadline of the order,
The same DQN architecture is used for all experiments in this paper. The neural network consists of three hidden layers with 256, 256, and 128 nodes, respectively. The input layer receives a set of features from the environment, that is, the state observation described in the MDP formulation. The number of nodes in the output layer corresponds to the number of possible actions in the environment.
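A sketch of this architecture is given below, assuming ReLU activations (the activation function is not stated in the text).

```python
import torch.nn as nn

def build_q_network(n_features: int, n_actions: int) -> nn.Sequential:
    # Three hidden layers of 256, 256, and 128 nodes, as described above;
    # input is the state-observation feature vector, output is one Q-value
    # per possible action.
    return nn.Sequential(
        nn.Linear(n_features, 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, 128), nn.ReLU(),
        nn.Linear(128, n_actions),
    )
```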
A flowchart of each step is presented in Figure 2. The full state is retrieved from the environment and passed to the parameter-sharing DQN, which then outputs the individual Q values for each vehicle. Based on the Q values, the policy outputs the action, which is then passed to the environment to perform vehicle movements.

Flowchart of a step of the proposed solution algorithm.
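Putting these pieces together, a single episode following Figure 2 could be sketched as below, reusing the hypothetical select_joint_actions helper from the earlier sketch and assuming a gym-style environment interface with 144 time steps per episode.

```python
def run_episode(env, shared_q_net, epsilon, n_actions, horizon=144):
    # At every time step, per-vehicle Q-values are computed from the current
    # observations, actions are selected, and the environment applies the
    # resulting vehicle movements (see Figure 2).
    observations = env.reset()
    total_reward = 0.0
    for _ in range(horizon):
        actions = select_joint_actions(shared_q_net, observations, epsilon, n_actions)
        observations, rewards, done = env.step(actions)
        total_reward += sum(rewards)
        if done:
            break
    return total_reward
```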
Baseline Algorithm
The proposed DQN approach is compared against a baseline model consisting of a deterministic MIP solution approach. Balaji et al. also used MIP as a baseline model for the dynamic and stochastic pick-up and delivery problem ( 6 ).
Whenever a new order arrives, a solution from the deterministic MIP is obtained for the orders currently available in the environment. The MIP can either accept or reject the new order and produces a route plan that serves all currently assigned orders in the shortest possible time.
A near-optimal solution, instead of the optimal solution, is accepted if the runtime of the MIP exceeds the set maximum optimisation time, because finding an optimal solution can become computationally intractable for more complex scenarios. However, the maximum optimisation time is set such that this occurs in less than 10% of tested episodes.
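The event-driven use of this baseline can be sketched as follows; solve_deterministic_mip is a hypothetical helper standing in for the model defined below, the state attributes are illustrative, and the 30 s default is an arbitrary placeholder for the maximum optimisation time.

```python
def mip_policy_step(state, current_plan, time_limit_s=30.0):
    # Re-optimise only when a new order has just arrived; otherwise keep
    # executing the previously computed route plan.
    if not state.new_order_arrived:
        return current_plan

    # Solve the deterministic MIP over all currently known orders. If the
    # solver hits the time limit, the best incumbent (near-optimal) solution
    # found so far is accepted instead of the proven optimum.
    return solve_deterministic_mip(          # hypothetical helper
        open_orders=state.open_orders,
        vehicle_location=state.vehicle_location,
        depot=state.depot,
        time_limit=time_limit_s,
    )
```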
It is worth noting that it is difficult to directly compare the performance of algorithms from different papers in the literature review section, even though they all solve seemingly similar SDDPs. This is because each paper sets up and defines SDDP differently, and it is difficult to accurately reproduce the exact problem formulation from another paper, as problem descriptions often contain ambiguities ( 9 ). Therefore, the solution approaches from other papers would need to be reproduced to fairly compare the different methods. This is one of the main reasons the MIP model is used as a benchmark solution instead of directly comparing results with other SDD papers. Comparison of DQN with the anticipatory models described earlier is left for future work.
The mathematical model for MIP is now presented:
Sets:
V: Current vehicle location, V = {0}
P: Pick-up location (depot location, associated with orders that are not in transit)
D: Delivery locations representing all orders that are not in transit
A: Delivery locations representing the orders that are accepted by the driver, but not in transit
T: Delivery locations representing orders that are accepted by the driver, but in transit
R: Return location (depot location, used for final return)
N: Set of all nodes/locations in the graph,
E: Set of all edges,
Decision variables:
Parameters:
n: Number of orders available to pick up, n = |D|
c_ij: Symmetric Manhattan distance matrix between nodes i and j;
l_i: Remaining time to deliver order i,
m: Travel cost per mile
r_i: Reward for orders associated with deliveries that are not in transit, D
M: A sufficiently large real number (big-M)
t: Time to travel 1 mi
d: A constant service time spent on acceptance, pick-up, and drop-off
Model:
Subject to:
Constraints 4 to 7 restrict the flow of vehicles. Constraint 4 ensures the vehicle leaves its current location only once. Constraint 5 ensures the vehicle leaves the depot only once for pick-up. Constraints 6 and 7 ensure the vehicle leaves each delivery destination only once. Lastly, constraint 8 ensures the vehicle returns to the depot. Constraint 9 ties everything together by enforcing a zero net flow through the nodes, therefore ensuring that sets P, D, and T are visited once and only once.
Constraint 10 ensures that all previously accepted orders are included in the route. Constraints 11 to 14 are time constraints. Constraint 11 ensures that the time window is met. Constraint 12 sets the priority that orders not in transit must be picked up at the depot before being delivered. These two constraints were originally non-linear and were both linearised using the big-M method ( 20 ). Constraint 13 ensures that the times required to accept and deliver orders are accounted for. Constraint 14 ensures orders are delivered before expiry. Lastly, constraint 15 ensures decision variables x and y are binary. It should be noted that the capacity of the vehicle in this problem formulation is unconstrained. This is reasonable for small order sizes, which results in vehicles only taking a limited number of orders, but would not be realistic for bigger order sizes.
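As a generic illustration of the big-M linearisation pattern (written with neutral symbols, since the exact constraints are given above), a conditional requirement that u_j ≥ u_i + τ_ij must hold only when a binary arc variable z_ij equals 1 can be rewritten as the linear constraint

u_j \geq u_i + \tau_{ij} - M \, (1 - z_{ij})

which is binding when z_ij = 1 and relaxed when z_ij = 0, provided M is chosen sufficiently large.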
Results
For each set of experiments, DQN was trained over 150,000 episodes and then tested over 100 episodes, while the MIP method was executed over 100 random episodes. Moreover, because of the stochasticity of the environment, three independent training runs were performed for the DQN algorithm. The total episodic reward is used as the performance measure for each method. It is worth noting that both methods were assessed on exactly the same environment, with the same reward function. For training, a workstation with an Intel Core i9-10900X CPU and an NVIDIA RTX 3090 GPU was used.
Single-Agent Scenario
The proposed parameter-sharing DQN method is first compared with the MIP baseline in a single-agent environment. Different environment parameters were experimented with, considering two grid map sizes (5 × 5 and 10 × 10) and two expected numbers of orders (5 and 30). These values were chosen to show how the size of the grid world and the number of orders affect the performance of MIP. The results are shown in Table 1, which has two sections referring to a homogeneous and a heterogeneous scenario. Scenarios marked with an asterisk in Table 1 are trained over a larger number of episodes to achieve more consistent results over the three independent runs. The homogeneous scenario corresponds to an environment where order generation and rewards are the same across the whole map, whereas in the heterogeneous scenario, order generation and rewards are given by a random probability distribution. For the heterogeneous scenario, the grid map was divided into four zones with relative order probabilities of {0.3, 0.4, 0.2, 0.1}, with each zone respectively generating orders with maximum and minimum rewards of {[12,8], [8,6], [5,3], [3,1]}.
Comparison of Reinforcement Learning (RL) Performance versus Mixed-Integer Programming (MIP) for a Single-Agent Scenario
*Trained over 250,000 episodes instead of 150,000 episodes.
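A minimal sketch of the zone-specific order generation described above is given below; how the map is partitioned into the four zones and the uniform draw of the reward between each zone's minimum and maximum are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng()

# Relative order probabilities and [max, min] rewards for the four zones of
# the heterogeneous scenario described above.
ZONE_PROBS = [0.3, 0.4, 0.2, 0.1]
ZONE_REWARDS = [(12, 8), (8, 6), (5, 3), (3, 1)]

def sample_order(zone_cells):
    # zone_cells: one list of (x, y) grid cells per zone (the partitioning of
    # the map into zones is assumed here).
    zone = rng.choice(len(ZONE_PROBS), p=ZONE_PROBS)
    cells = zone_cells[zone]
    location = cells[rng.integers(len(cells))]
    reward_max, reward_min = ZONE_REWARDS[zone]
    # A uniform draw between the zone's minimum and maximum reward is assumed.
    reward = rng.uniform(reward_min, reward_max)
    return location, reward
```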
Table 1 shows that the proposed DQN method can achieve similar final rewards to MIP when the number of orders is small. However, for the problem instances with a high number of orders (30), MIP significantly outperforms DQN. This is likely because the DQN method converges to a policy that takes a sub-optimal route. Furthermore, the lower reward obtained by DQN is also a result of some missed orders and failure to return to the depot by the end of the episode. In the heterogeneous scenario, the DQN approach still underperformed MIP, but came closer to the MIP benchmark. This is especially true for the smaller 5 × 5 map with 30 orders, where DQN only underperforms MIP by 3% in the heterogeneous scenario as compared with 15% in the homogeneous scenario. Note that a higher number of episodes was needed to achieve convergence for the scenarios consisting of a 10 × 10 map and 30 orders. This is mainly because of the higher environment complexity of having more orders.
Figure 3 shows the DQN training curve for the 5 × 5 map with 30 heterogeneously distributed orders shown in Table 1. The graph shows that the DQN method has large inter-episode variance, and some episodes outperform the MIP benchmark. Furthermore, the DQN method occasionally incurs large penalties of up to −150, which negatively affects the average performance. This is likely because of the active exploration of the state-action space by the DQN algorithm.

Training curve for the Deep Q-Network (DQN) algorithm, considering a 5 × 5 grid map with 30 heterogeneously distributed orders.
Multi-Agent Scenario
In the following set of experiments, a 10 × 10 grid map with 30 orders was used, and the number of agents was increased up to five. Results are summarized in Table 2. As shown, the MARL approach achieved positive rewards in all cases after 150 k training episodes. Furthermore, when the number of agents reached four, the rewards plateaued, as the agents can virtually deliver all 30 orders. The DQN method performed worse than the MIP-based method for one, two, and three agents. However, when the number of agents was increased to four and five, DQN achieved better results than MIP. This suggests that the DQN approach may be more advantageous than MIP in environments with a higher number of agents.
Comparison of Reinforcement Learning (RL) Performance for the Multi-Agent Scenario
Trained over 250,000 episodes instead of 150,000 episodes.
Execution Time Evaluation
Experiments were performed on execution time to assess the efficiency of each method and their suitability for real-time control. Tables 3 and 4 show the average execution time per episode for each method. Results show that the execution time of the proposed MARL approach is far shorter than that of the MIP-based approach in all scenarios. This is an important advantage, since real-time decisions are essential for real-world SDD services. MIP tends to take much longer when the ratio between the number of orders and the grid size is higher, because a high order rate in a small grid results in more simultaneously active orders, which means that there are more nodes, edges, and decision variables to be considered by the MIP.
Comparison of Average Execution Time between Reinforcement Learning (RL) and Mixed-Integer Programming (MIP) for the Single-Agent Scenario
Comparison of Average Execution Time between Reinforcement Learning (RL) and Mixed-Integer Programming (MIP) for the Multi-Agent Scenario
Conclusions and Future Work
In this paper, SDDP was formulated as an MDP and it was solved using a parameter-sharing DQN, which corresponds to a decentralised MARL approach. An established MIP algorithm was used as a benchmark comparison for the proposed MARL approach. To compare both methods, an SDD environment consisting of a central depot, multiple delivery vehicles, and dynamic order generation was designed and implemented. Two different scenarios were then experimented on: single-agent and multi-agent.
For the single-agent scenario, it was shown that, when the order rate is small, the DQN approach is at least competitive with the MIP-based method. For problem instances with higher order arrival rates, the computational results showed that the MARL approach underperformed MIP, regardless of the grid size. The reason for this is that the MIP-based solver corresponds to a centralised method where all decisions are made by a single central unit that optimises the global objective of the system. In contrast, MARL agents are based on the decentralised learning paradigm, that is, they optimise local decisions using a partial observation of the environment, which may be more realistic in certain real-world applications. When zone-specific order generation and reward probabilities are introduced, the gap between the two methods is smaller, which may suggest that the DQN approach can be more advantageous in environments with higher complexity.
For the multi-agent case, it was shown that the proposed approach achieves similar performance to the MIP-based method, while being, on average, up to 65 times faster during execution. This is a crucial advantage for SDDP, as decisions must be made in real time. Moreover, real-world problem instances can scale to hundreds of independent drivers and thousands of order requests within a day. It should be noted that the DQN was trained offline, and the trained model was used to derive the real-time control policies. Training times for the DQN approach ranged from a few hours up to a day across all instances.
The main limitation of this study is that the performance of the MARL approach was evaluated for a small number of agents. The authors believe that the method should be assessed on a real-sized instance, where stability issues may arise because of agents learning independently. Thus, the next step would be to implement DQN on a large-scale SDDP environment. Real-world instances will inevitably result in higher environment complexity, and, therefore, new methods for faster training should be evaluated. Recent efforts in this domain include the use of curriculum learning (CL) and the use of policy ensembles ( 21 , 22 ).
It is worth noting that both models can easily incorporate extra constraints, such as vehicle capacity, although adding constraints to the MIP will likely increase the complexity of the problem, and consequently the execution time. In contrast, adding constraints to the MARL formulation will probably not affect execution time, although additional training episodes may be required to achieve convergence, because extra constraints in the MARL formulation translate directly into increased environment complexity.
Another potential direction for future research is to explore different combinations of input features. For example, some of the information, such as distance from orders, is implicitly derived from the state space. It is possible that explicit inclusion of such information as input features can improve the DQN’s solution. The authors believe it is also worth exploring the use of different neural network architectures to account for different features of the state space, for instance, using convolutional neural networks, recurrent neural networks, or both.
Author Contributions
The authors confirm contribution to the paper as follows: study conception and design: P. Angeloudis, L. Parada, J. Escribano; data collection: E. Ngu, L. Parada; analysis and interpretation of results: E. Ngu; draft manuscript preparation: E. Ngu, L. Parada, J. Escribano. All authors reviewed the results and approved the final version of the manuscript.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was partially supported by the Chilean National Agency for Research and Development (ANID) through the “BECAS DOCTORADO EN EXTRANJERO” programme, Grant No. 72210279.
