Multi-period emergency resource allocation problem with a hybrid ant colony optimization and deep Q-network algorithm

Abstract

To mitigate losses caused by emergency resource shortages, this paper investigates a multi-period resource allocation problem. By recognizing the interdependence among affected areas, demand points are modeled as a system. A systemic loss metric model is developed to quantify the cascading impacts arising from these interdependencies. Then, an emergency resource allocation model is constructed to minimize loss and maximize fairness. To solve the proposed model efficiently, a novel hybrid algorithm (ACO-DQN) that integrates ant colony optimization (ACO) with a deep Q-network (DQN) is designed. To enhance the convergence and stability of the algorithm, the pheromone mechanism of ACO is employed to dynamically guide the exploration process and to adjust the Q-value update strategy. Numerical experiments demonstrate that, compared to the DQN, the proposed ACO-DQN shows significant advantages in terms of solution quality, convergence speed, and robustness. Finally, a case study based on the Wenchuan earthquake shows that considering interdependence enables decision-makers to better balance efficiency and fairness when resources are constrained. The findings provide important decision support for improving overall resilience and post-disaster recovery.

Keywords

emergency resource allocation; interdependence; deep reinforcement learning; ant colony optimization

1. Introduction

Frequent global emergencies, including earthquakes, floods, and public health incidents, severely disrupt socioeconomic systems and endanger lives due to their inherent destructiveness and uncertainty. Against this backdrop, humanitarian emergency logistics emerges as the lifeline of relief operations and determines the effectiveness of disaster response.¹ Efficiently and fairly allocating limited resources among dispersed demand points is a central and persistent challenge in humanitarian emergency logistics. Therefore, developing models for optimizing emergency supply allocation is of significant theoretical and practical urgency.

The emergency resource allocation problem is characterized by inherent complexities such as dynamic demand, multi-stakeholder coordination, and resource constraints. Research methodologies in this field have continually evolved to address these challenges. Early research predominantly relied on operations research methods such as stochastic programming² and robust optimization³ to model parameter uncertainties. As problem scales expanded, heuristic and meta-heuristic algorithms (e.g., genetic algorithms, simulated annealing) were applied to obtain feasible solutions for complex instances.⁴ More recently, deep reinforcement learning (DRL) has emerged for dynamic and sequential decision-making, given its capability to learn adaptive policies through interaction with complex environments. In particular, deep Q-networks (DQN) and their variants have gained preliminary attention in resource allocation due to their structural stability.⁵ Building on this, some studies have further integrated DQN with heuristic algorithms (e.g., NSGA-II) to tackle emergency resource distribution problems.^6,7 The integration of DQN with heuristic algorithms therefore presents a promising direction for solving complex emergency resource allocation problems.

The impact of emergency supply shortages is severe, yet existing metrics for quantifying this impact are often oversimplified. A common approach models the penalty for shortages as a linear cost function, calculated simply as the product of the shortage quantity and a fixed unit penalty.^8,9 To achieve greater realism, subsequent research incorporated the time dimension, developing deprivation cost models that account for both the amount and the duration of shortages.^10,11 However, these existing loss metrics fail to capture cascading loss in interdependent regions. In interdependent regions, a resource shortage at a critical node can trigger cascading failures through supply chains and mobility networks, imposing significant secondary losses on interconnected points.^12,13 Consequently, allocation models relying solely on localized loss achieve only local optimality, failing to enhance overall system resilience. Therefore, a comprehensive loss assessment should incorporate the interdependent perspective.

As a powerful sequential decision-making framework, deep reinforcement learning has shown promise in optimizing emergency resource allocation.^5,14 However, its application to large-scale, complex disaster scenarios is often constrained by two interrelated challenges: low learning efficiency and non-adaptive exploration. Standard approaches, which rely on a single learning paradigm and fixed strategies like ε-greedy, struggle to fully utilize past experience and adapt their exploration schemes. This frequently leads to suboptimal convergence and fails to guarantee robust performance in real-world emergency operations.

To bridge these gaps, this paper investigates multi-period emergency resource allocation by explicitly modeling regional interdependencies. First, a loss metric model is developed to quantify the cascading impact of resource shortages arising from these interdependencies. Subsequently, the allocation problem is formulated as a sequential decision-making process. To solve this model efficiently and robustly, a novel hybrid ACO-DQN algorithm is proposed. The main contributions of this work are summarized as follows:

(1) A loss metric model incorporating regional interdependencies is proposed. To address the existing models’ oversight of cascading effects, this work develops a loss function that explicitly accounts for interdependencies among affected areas. This shifts the assessment from evaluating isolated points to analyzing the integrated affected areas. Consequently, the model quantifies the cascade losses, enabling resource allocation strategies that optimize the overall recovery of all disaster affected areas.

(2) A hybrid ACO-DQN algorithm with enhanced learning efficiency and adaptive exploration is proposed. 1) A hybrid learning paradigm integrating value learning and experience accumulation is developed. To address low learning efficiency, we integrate the temporal-difference (TD) error from DQN with the pheromone mechanism from ACO. This integration allows for the assessment and prioritization of high-quality historical experiences. The outputs of these two learning strategies are then adaptively weighted, enabling more efficient and stable policy learning. 2) A pheromone-driven adaptive exploration strategy is proposed. To overcome the limitations of fixed exploration strategies, we design a mechanism that uses pheromone information to dynamically modulate exploration intensity. This strategy progresses through three distinct phases: promoting broad coverage in the initial stage, shifting to targeted exploration guided by accumulated pheromone, and finally converging on refined exploitation. This ensures an optimal balance between exploration and exploitation throughout the learning process.

The remainder of this paper is structured as follows: Section 2 presents a literature review. Section 3 constructs the loss metric model and the emergency resource allocation model. Section 4 details the design of the ACO-DQN algorithm. Section 5 designs simulation experiments to comparatively analyze the performance of the proposed model and algorithm. Section 6 concludes the paper and outlines directions for future research.

2. Literature review

2.1. Modeling the loss of emergency resource shortages

Emergency resource shortages frequently arise from uncertainties inherent in disaster demand, transportation, and supply. Beamon proposed a fundamental distinction between humanitarian relief chains and commercial supply chains, stating that while commercial supply chains focus on profit maximization, humanitarian relief chains aim to minimize loss of life and alleviate human suffering.¹⁵ Therefore, quantifying and incorporating the loss associated with emergency resource shortages is a critical issue in emergency resource allocation.

To measure losses resulting from emergency resource shortages, linear penalty cost functions have been commonly adopted in the literature. In such an approach, the incurred loss is quantified as the product of the shortage quantity and a predetermined unit penalty coefficient. For instance, Balcik et al. employed a linear penalty model in their formulation of last-mile distribution planning.¹⁶ Similarly, Rawls and Turnquist incorporated a linear shortage penalty within the objective function of their emergency supply pre-positioning model.⁹ Extending this line of work, Ahmadi et al. also applied a linear penalty structure to quantify shortage losses under conditions of network disruption.¹⁷

Recognizing that human suffering intensifies with the duration of unmet needs, recent studies have incorporated the time dimension into the measurement of shortage-related losses. A pivotal development in this regard is the concept of deprivation costs, introduced by Holguín-Veras et al.¹⁸ to quantify the physiological and psychological suffering caused by delayed access to relief supplies. This concept was subsequently established as a theoretical cornerstone of humanitarian logistics¹⁸ and has been empirically calibrated using post-disaster data.¹⁹ As a result, deprivation cost functions have been widely integrated into optimization models for emergency logistics, guiding resource allocation and distribution decisions in recent research.^11,20,21

To further mitigate the losses arising from emergency resource shortages, fairness has been increasingly integrated into allocation models as a critical objective. Tzeng et al. measured fairness by maximizing the minimum satisfaction rate across affected areas.²² Balcik and Beamon emphasized the balance between maximizing total demand fulfillment and maintaining distributional fairness.¹⁶ Other studies have modeled the trade-off between efficiency and fairness, either by quantifying unmet demand through penalty costs²³ or by embedding the Gini coefficient within the fairness measure.²⁴

Despite these methodological advances, a fundamental limitation persists across linear penalty, deprivation cost, and fairness-aware models. These approaches typically treat losses at individual demand points as independent and additive, relying on a conventional assumption of localized impact that confines the consequences of a shortage to the directly affected area. However, empirical evidence consistently shows that disruptions are inherently propagative, rendering such simplifying assumptions increasingly untenable. This is particularly critical given the interconnected nature of modern supply chain and socio-economic networks, where risks are prone to cascading effects.²⁵ For instance, research on flood disasters¹² and urban rainstorms¹³ demonstrates significant economic and logistical spillovers propagated through regional linkages. From a systemic risk perspective, studies have drawn analogies to epidemic models, revealing threshold behaviors in risk diffusion.²⁶ Despite this growing recognition of cascade effects, regional interdependencies and the resulting cascading losses remain notably absent from the loss functions used in emergency resource allocation models.

2.2. Deep reinforcement learning for resource allocation problems

Deep Reinforcement Learning (DRL), with its capacity for sequential decision-making and adaptation in complex, uncertain environments, has emerged as a promising paradigm for resource allocation problems.

2.2.1. Applications in emergency management

In emergency management, DRL has been applied to a range of resource allocation and logistics routing problems. Fan et al. designed a DQN-based method for emergency supply allocation, demonstrating superior solution quality and reduced computational time compared to conventional optimization methods.⁵ For public health crises, Zeng et al. established a DRL-based dispatch model for the allocation of medical supplies.²⁷ Lei et al. developed a DRL support system for multi-hazard response, reporting significant gains in operational efficiency and resource utilization.²⁸ Beyond pure allocation, DRL has also been employed for integrated logistics problems. Peng et al. formulated the truck-drone collaborative routing problem in humanitarian logistics as a Markov game and solved it using a multi-agent deep reinforcement learning algorithm enhanced with prioritized experience replay and invalid action masking.²⁹ Gao et al. employed deep reinforcement learning to optimize an anticipatory routing, acceptance, and postponement policy for the multi-period dynamic vehicle routing problem with stochastic requests, demonstrating that DRL effectively enhances emergency supplies distribution under dynamic and uncertain conditions.³⁰ Wang et al. proposed an adversarial deep reinforcement learning framework (RESA) that models multi-period emergency supplies allocation under demand uncertainty as a two-player zero-sum Markov game, and showed that their RESA-PPO algorithm, which combines combinatorial action representation and reward clipping, significantly outperforms heuristic and standard RL methods.³¹

2.2.2. Hybrid DRL-heuristic algorithms

To enhance the solution efficiency and stability of DRL in complex optimization contexts, a growing line of research focuses on integrating DRL with heuristic and metaheuristic algorithms. These hybrid frameworks aim to combine the adaptive learning capability of DRL with the robust search mechanisms of traditional heuristics.

Wu et al. proposed a weight-aware deep reinforcement learning (WADRL) approach to solve the multi-objective vehicle routing problem with time windows. This method uses the DRL model to address the entire multi-objective optimization problem, and then employs the non-dominated sorting genetic algorithm-II (NSGA-II) to further optimize the solutions generated by WADRL, thereby mitigating the limitations of each individual method.¹⁴ In the emergency resource allocation problem, Wu et al. combined DRL with a genetic algorithm for maritime search and rescue resource allocation. Their algorithm was able to provide stable optimal solutions within 300 seconds, meeting the timeliness requirements of emergency response.³²

Beyond genetic algorithms, heuristic algorithms such as Particle Swarm Optimization (PSO), Simulated Annealing (SA), and Differential Evolution (DE) have been successfully integrated with DRL. Pradhan et al. proposed a deep reinforcement learning with particle swarm optimization (DRPO) algorithm, which utilizes PSO to avoid unnecessary searches in the deep deterministic policy gradient (DDPG) method.³³ Kosanoglu et al. designed a hybrid method combining a Double DQN (DDQN) agent with Simulated Annealing (SA). In each episode of their proposed algorithm, the best solution found by DRL is passed to SA as an initial solution, while the best solution from SA is passed back to DRL as an initial state.³⁴ Li et al. developed an adaptive multi-objective differential evolution algorithm based on DRL, where DRL serves as a controller integrated into the multi-objective differential evolution algorithm, enabling adaptive selection of mutation operators and parameters according to different search domains.³⁵

DRL has gained prominence in resource allocation due to its proficiency in sequential decision-making under uncertainty, demonstrating considerable promise in the domain of emergency logistics. To further improve its solution quality and stability, a growing body of research has developed hybrid frameworks that integrate DRL with heuristic or metaheuristic algorithms (e.g., PSO, SA). However, these integrations largely maintain a modular separation, in which DRL functions as a high-level orchestrator for parameter adaptation, while the embedded heuristic executes the core search process. This design overlooks the potential of embedding the learning mechanisms of the heuristic (such as the pheromone-based feedback in ACO) directly into the evolution or value estimation processes of DRL. Consequently, a deeper algorithmic hybridisation remains underexplored.

2.3. Gap analysis

Two key research gaps remain in the existing literature.

First, current loss models fail to capture the cascading losses resulting from regional interdependencies in emergency resource allocation. Most loss metrics, including the deprivation cost model,¹⁸ are designed for independent demand points and do not account for the propagation of shortages through economic, logistical, or social linkages. Consequently, allocation strategies derived from such models may be locally optimal but systemically inefficient. While fairness considerations have been studied,²³ they are typically static and not integrated with a loss function that explicitly embeds network centrality. In contrast, our proposed loss metric directly incorporates interdependency factors into a sigmoid-based loss function. This formulation translates regional systemic importance into a nonlinear, threshold-sensitive loss that penalizes shortages in high-centrality regions more severely, thereby internalizing cascade effects.

Second, while hybrid DRL-heuristic frameworks have emerged, a deep integration remains absent. Existing approaches (e.g., NSGA-II-DQN, PSO-DRL) typically employ heuristics for action filtering, population initialization, or separate policy shaping, but the heuristic principles are not embedded into the core learning mechanics of the DRL agent. In our ACO-DQN algorithm, the pheromone mechanism directly modulates the Q-value update and the pheromone concentration is updated using episodic rewards, creating a closed-loop interaction between the long-term memory of ACO and the temporal-difference learning of DQN. This is fundamentally different from modular or loosely coupled hybrids, as the heuristic feedback becomes an integral part of the value function approximation process.

Therefore, this study makes two distinct contributions. First, a novel loss metric model is developed that considers the cascade effect in interdependent regions. This transforms the allocation problem from optimizing local performance to maximizing global system benefit. Second, the ACO-DQN algorithm is proposed as an efficient solver for the proposed model. This algorithm achieves a deeper integration by embedding the pheromone-based feedback mechanism of ACO directly into the experience learning and exploration process of the DQN. This integration aims to achieve superior convergence and policy robustness in the complex emergency resource allocation problem.

3. Model development

3.1. Problem description

Efficient and equitable multi-period allocation of emergency resources is crucial for effective disaster response. This paper investigates a multi-period emergency resource allocation problem within a system comprising a central distribution center (DC) and multiple interdependent demand points (DPs), as illustrated in Figure 1. The objective is to determine optimal allocation plans from the DC to each DP over a finite planning horizon of $T$ periods. The core challenge lies in determining allocation plans that effectively balance the competing needs across interdependent DPs to maximize overall system performance in disaster relief.

Figure 1.

The multi-period emergency resource allocation problem.

3.1.1. Model assumptions

The proposed model is based on the following assumptions:

(1) The total demand for emergency resources at each demand point over the entire planning horizon is known and deterministic. In reality, post-disaster demand is subject to considerable uncertainty. However, the deterministic assumption is widely adopted in the emergency logistics literature^18,23 for two reasons. First, it provides a tractable baseline model that focuses on the core trade-offs among cascading losses, fairness, and capacity constraints without the additional complexity of stochasticity. Second, in practice, relief agencies typically produce point estimates of total demand based on rapid needs assessments (e.g., affected population multiplied by per-capita consumption rates). The proposed model can be applied directly using such estimates. The deterministic assumption therefore does not undermine the model’s practical relevance.

(2) The total quantity of resources dispatched from the distribution center in any single period cannot exceed its available capacity.

(3) Resources allocated at the beginning of period $t$ are delivered and become available for use at the demand point at the beginning of period $t + 1$ .

(4) Demand points are interdependent. The shortage at one point may propagate and generate cascading effects on other points.

3.2. Loss model considering regional interdependency

Most existing models for emergency resource allocation quantify loss based solely on local resource shortages, largely neglecting the interdependencies among regions formed through economic, social, and logistical ties. Consequently, these models fail to capture the cascading effects of disaster losses across interconnected systems. To address this, this paper introduces the regional interdependency into the loss metric model to characterize the cascade effect of shortage.

3.2.1. Interdependency network

Let $G = (V, E)$ denote an undirected weighted graph, where $V = \{1, 2, \dots, n\}$ is the set of demand points. Each edge $(i, j) \in E$ has a weight $w_{i j} \geq 0$ , representing the strength of economic, logistical, or social linkage between regions $i$ and $j$ . The interdependency factor $θ_{i}$ for region $i$ is defined as its weighted degree centrality, $θ_{i} = \sum_{j \neq i} w_{i j}$ . A larger $θ_{i}$ indicates that a shortage in region $i$ would propagate more severe cascading losses to the whole system. In practice, $w_{i j}$ can be derived from inter-regional freight/passenger flow or supply chain dependency indices. The required inter-regional flow data (e.g., freight and passenger turnover) may not be immediately available for all disaster contexts. However, relief agencies can use proxy indicators such as road network centrality, population migration patterns, or economic input-output tables to construct $θ_{i}$ within hours. For rapid-onset disasters, a simplified version using only population density and major transportation hubs can serve as a reasonable approximation.
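As a minimal sketch of the definition above, the interdependency factors can be computed as row sums of a symmetric weight matrix with the diagonal excluded. The function name `interdependency_factors` and the three-region weights are hypothetical and serve only to illustrate the weighted degree centrality.

```python
def interdependency_factors(w):
    """Weighted degree centrality: theta_i = sum over j != i of w[i][j]."""
    n = len(w)
    for i in range(n):
        for j in range(n):
            if w[i][j] < 0:
                raise ValueError("edge weights must be non-negative")
            if w[i][j] != w[j][i]:
                raise ValueError("an undirected graph needs a symmetric matrix")
    # The diagonal is skipped, matching the j != i condition in the definition.
    return [sum(w[i][j] for j in range(n) if j != i) for i in range(n)]


# Hypothetical linkage weights for three regions (illustrative values only).
W = [
    [0.0, 2.0, 1.0],
    [2.0, 0.0, 0.5],
    [1.0, 0.5, 0.0],
]
theta = interdependency_factors(W)
print(theta)
```

In this illustrative network, the first region has the largest factor because it carries the strongest links to both neighbours.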

3.2.2. Sigmoid-based loss function

To capture the nonlinear escalation of loss once a shortage exceeds a critical threshold, we adopt an S-shaped (sigmoid) function, following established disaster impact studies.^36–38 The interdependency factor $θ_{i}$ is embedded into the sigmoid to reflect that shortages in highly interdependent regions trigger steeper losses. The resulting loss for region $i$ in period $t$ is:

L_{i t} (S_{i t}) = \frac{\frac{1}{1 + e^{- θ_{i} (\frac{S_{i t}}{D_{i}} - a)}} - \frac{1}{1 + e^{θ_{i} a}}}{\frac{1}{1 + e^{- θ_{i} (1 - a)}} - \frac{1}{1 + e^{θ_{i} a}}}

(1)

where $D_{i}$ is the total demand of region $i$ over the planning horizon $T$ , and $S_{i t}$ is the shortage quantity at the beginning of period $t$ . The parameter $a$ determines the shortage ratio at which the loss begins to rise sharply. The denominator normalises the loss to the range $[0, 1]$ , so that $L_{i t} = 0$ when $S_{i t} = 0$ and $L_{i t} \to 1$ as $S_{i t} \to D_{i}$ . The graph of the loss function is shown in Figure 2.

Figure 2.

Relationship of the shortage and the loss.

For a region with high $θ_{i}$ , even a small shortage causes a steep loss escalation, thereby encoding the idea that failures in central nodes rapidly cascade to neighbours. The reduced-form loss function thus preserves the priority ranking induced by network centrality without requiring explicit dynamic propagation models.
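The loss function in Equation (1) can be sketched in a few lines. The helper names `_sigmoid` and `shortage_loss`, as well as the values $a = 0.5$ and $θ \in \{2, 8\}$ , are assumptions chosen for illustration and are not calibrated parameters from this study.

```python
import math


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def shortage_loss(shortage, demand, theta, a):
    """Normalised sigmoid loss of Equation (1) for one region and period."""
    if demand <= 0 or theta <= 0:
        raise ValueError("demand and theta must be positive")
    ratio = shortage / demand
    floor = _sigmoid(-theta * a)            # 1 / (1 + e^{theta * a})
    ceiling = _sigmoid(theta * (1.0 - a))   # 1 / (1 + e^{-theta * (1 - a)})
    return (_sigmoid(theta * (ratio - a)) - floor) / (ceiling - floor)


# Hypothetical parameters: a = 0.5 and two regions with different theta.
for theta in (2.0, 8.0):
    curve = [shortage_loss(s, 100.0, theta, 0.5) for s in (0, 25, 50, 75, 100)]
    print(theta, [round(v, 3) for v in curve])
```

Subtracting the value at zero shortage and dividing by the span pins the endpoints at 0 and 1 for any positive $θ_{i}$ , so regions with different centrality remain comparable on the same scale while a larger $θ_{i}$ gives a steeper transition around the threshold $a$ .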

3.3. Model formulation

3.3.1. Parameter definitions

$D_{i}$ : Total emergency resource demand at demand point $i$ over the planning horizon $T$ .

$C$ : Capacity of the distribution center.

$λ$ : Weighting factor for fairness.

3.3.2. Decision variables

$x_{i t}$ : Quantity of emergency resources allocated to demand point $i$ at the beginning of period $t$ .

$S_{i t}$ : Quantity of shortage at demand point $i$ at the beginning of period $t$ .

The objective of this study is to minimize the total loss of the demand system and maximize the fairness among regions. The loss is calculated by the proposed loss model. Fairness is defined as range-based disparity (the difference between the maximum and minimum cumulative loss), which is a special case of min-max fairness (the Rawlsian criterion³⁹) that prioritizes the welfare of the worst-off group in a distribution. This choice directly penalizes the difference between the region with the highest cumulative loss and the one with the lowest cumulative loss, thereby avoiding extreme deprivation. The proposed model is formulated as follows.

\min F = \sum_{i = 1}^{n} \sum_{t = 1}^{T} L_{i t} (S_{i t}) + λ (\max_{i = 1, 2 \dots n} \sum_{t = 1}^{T} L_{i t} (S_{i t}) - \min_{i = 1, 2 \dots n} \sum_{t = 1}^{T} L_{i t} (S_{i t}))

(2)

When $λ = 0$ , the model minimizes total loss only (efficiency-driven). Increasing $λ$ places more weight on reducing inter-regional disparity.

s.t.

\sum_{i = 1}^{n} x_{i t} \leq C, t = 1, 2, \dots, T

(3)

S_{i 1} = D_{i}, i = 1, 2, \dots, n

(4)

S_{i, t + 1} = S_{i t} - x_{i t}, i = 1, 2, \dots, n, t = 1, 2, \dots, T - 1

(5)

S_{i t} \geq 0, i = 1, 2, \dots, n, t = 1, 2, \dots, T

(6)

x_{i t} \geq 0, i = 1, 2, \dots, n, t = 1, 2, \dots, T

(7)

Equation (2) represents the objective function, indicating that the model aims to minimize the loss while maximizing fairness. The first term represents the total accumulated loss across all demand points over the entire planning horizon. A smaller total loss indicates that resources are more effectively distributed to mitigate the adverse consequences of shortages. The second term quantifies the range of cumulative losses among the demand points. In particular, it computes the difference between the highest total loss experienced by any region and the lowest total loss among all regions. A smaller value implies that the allocation is fairer across regions. Equation (3) ensures that the total quantity of resources allocated from the distribution center to all demand points in each period does not exceed its capacity. Equations (4)–(6) describe the state transition constraints for the emergency resource shortage. Equation (7) imposes the non-negativity constraint on the allocated quantity of emergency resources.
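To make the formulation concrete, the following sketch evaluates the objective of Equation (2) for a given allocation plan while checking Equations (3)–(7). It is an illustration under stated assumptions rather than the solver used in this paper: the function names, the tolerance `tol`, and the two-region instance are hypothetical.

```python
import math


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def shortage_loss(shortage, demand, theta, a):
    """Equation (1): normalised sigmoid loss."""
    floor = _sigmoid(-theta * a)
    ceiling = _sigmoid(theta * (1.0 - a))
    return (_sigmoid(theta * (shortage / demand - a)) - floor) / (ceiling - floor)


def evaluate_plan(x, demand, theta, capacity, a, lam, tol=1e-9):
    """Return F of Equation (2) for a plan x[t][i], or raise if infeasible."""
    n, horizon = len(demand), len(x)
    shortage = list(demand)                      # Eq. (4): S_i1 = D_i
    cumulative = [0.0] * n                       # sum over t of L_it per region
    for t in range(horizon):
        if sum(x[t]) > capacity + tol:           # Eq. (3)
            raise ValueError(f"capacity exceeded in period {t + 1}")
        if any(q < -tol for q in x[t]):          # Eq. (7)
            raise ValueError(f"negative allocation in period {t + 1}")
        for i in range(n):
            cumulative[i] += shortage_loss(shortage[i], demand[i], theta[i], a)
        if t < horizon - 1:                      # Eq. (5), defined for t < T
            shortage = [s - q for s, q in zip(shortage, x[t])]
            if any(s < -tol for s in shortage):  # Eq. (6)
                raise ValueError(f"over-allocation in period {t + 1}")
    return sum(cumulative) + lam * (max(cumulative) - min(cumulative))


# Hypothetical two-region, two-period instance.
D, THETA = [10.0, 10.0], [2.0, 2.0]
even = evaluate_plan([[5.0, 5.0], [0.0, 0.0]], D, THETA, capacity=10.0, a=0.5, lam=1.0)
skewed = evaluate_plan([[10.0, 0.0], [0.0, 0.0]], D, THETA, capacity=10.0, a=0.5, lam=1.0)
print(even, skewed)
```

In this small instance both plans yield the same total loss, so only the range term distinguishes them: the even split is preferred once $λ > 0$ .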

4. Algorithm design

To solve the multi-period interdependent resource allocation problem formulated in Section 3, a hybrid ACO-DQN algorithm is proposed. This section is structured as follows. In Section 4.1, the problem is formally defined as a sequential decision-making process within a Markov decision process (MDP) framework, specifying the state, action, and reward functions. Section 4.2 discusses the limitations of applying a standard DQN directly to this problem and outlines two corresponding improvement strategies, which are motivated by ACO. The complete algorithmic steps and training procedure are provided in the Appendix.

4.1. Formulation as a Markov decision process

This section formulates the multi-period resource allocation model as a Markov Decision Process within a DQN framework. In this framework, the distribution center acts as an agent that interacts with an environment comprising the demand points. At each decision period $t$ , the agent observes the current state $S_{t}$ , takes an action $a_{t}$ (i.e., a resource allocation plan), receives an immediate reward $r_{t}$ , and the environment transitions to a new state $S_{t + 1}$ . The objective of the agent is to learn an optimal allocation policy that maximizes the expected cumulative reward over the planning horizon. The definitions of these core elements are provided below.

Agent: The distribution center serves as the agent, whose goal is to learn an optimal policy for allocating emergency resources across demand points.

Environment: The environment comprises the set of interdependent demand points, which respond to the agent’s allocation actions.

State ( $S_{t}$ ): At the beginning of period $t$ , the agent observes a state vector $S_{t} = \{S_{1 t}, S_{2 t}, \dots, S_{n t}\}$ , where $S_{i t}$ represents the remaining unmet demand at demand point $i$ . The initial state $S_{1}$ is given by the total demand of each point at the start of the planning horizon.

State Transition ( $S_{t} \to S_{t + 1}$ ): Given the current state $S_{t}$ and action $a_{t}$ , the next state $S_{t + 1}$ is determined deterministically as $S_{i, t + 1} = S_{i t} - x_{i t}$ , where $x_{i t}$ denotes the quantity of resources allocated to demand point $i$ in period $t$ . This transition reflects that allocated resources reduce the corresponding shortage, and any unmet demand is carried forward to the subsequent period.

Reward ( $r_{t}$ ): After executing action $a_{t}$ in state $S_{t}$ , the agent receives an immediate reward defined as the negative of the original objective function (Equation (2)): $r_{t} = - F$ .

Action ( $a_{t}$ ): The action at period $t$ represents the resource dispatch plan from the DC to all demand points. The action $a_{t} = \{x_{1 t}, x_{2 t}, \dots, x_{n t}\}$ is an $n$ -dimensional vector, where $x_{i t}$ is the quantity of resources allocated to demand point $i$ in period $t$ . The action is subject to the supply capacity constraint $\sum_{i = 1}^{n} x_{i t} \leq C$ (Equation (3)). The action space is structured around four predefined allocation strategies, which guide the generation of feasible dispatch vectors.

To guide the agent in generating effective actions, we define four distinct resource allocation strategies, each corresponding to a specific weight vector $w_{t} = (w_{1 t}, w_{2 t}, \dots, w_{n t})$ used to proportionally allocate the available supply $C$ . The allocation to region $i$ is $x_{i t} = \min (S_{i t}, w_{i t} \times C / \sum_{j = 1}^{n} w_{j t})$ . The resource allocation strategies are as follows.
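The proportional rule above can be sketched as follows. The function name `proportional_allocation`, the handling of an all-zero weight vector, and the numbers in the example are assumptions made for illustration.

```python
def proportional_allocation(shortage, weights, capacity):
    """x_it = min(S_it, w_it * C / sum_j w_jt) for every demand point i."""
    total_weight = sum(weights)
    if total_weight <= 0:
        # Assumption: with no positive weight, nothing is dispatched.
        return [0.0 for _ in shortage]
    return [min(s, w * capacity / total_weight) for s, w in zip(shortage, weights)]


# Hypothetical numbers: three regions, capacity C = 40.
x = proportional_allocation(
    shortage=[30.0, 5.0, 50.0], weights=[1.0, 1.0, 2.0], capacity=40.0
)
print(x, sum(x))
```

Because each share is capped by the remaining shortage, the dispatched total never exceeds $C$ , so Equation (3) holds by construction. The cap can also leave part of the capacity unused when a region needs less than its share; whether such a surplus is redistributed is not specified by the rule itself.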

Action1 (Urgency Priority): $w_{i t} = θ_{i} \times (\frac{S_{i t}}{D_{i}})^{2}$ .

Action2 (Logarithmic-Balancing): $w_{i t} = θ_{i} \times [1 + \ln (1 + 10 \frac{S_{i t}}{D_{i}})]$ .

Action3 (Mixed-Weight): $w_{i t} = ρ_{1} θ_{i} + ρ_{2} \frac{S_{i t}}{D_{i}} + ρ_{3} \frac{D_{i}}{\sum_{j = 1}^{n} D_{j}}$ , where $ρ_{1}$ , $ρ_{2}$ , $ρ_{3}$ are adjustable coefficients.

Action4 (Fairness-Oriented): a three-stage process consisting of minimum requirement, disparity reduction, and proportional allocation.

The detailed descriptions of the four action strategies are provided in Appendix 1.
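The closed-form weights of the first three strategies can be sketched as below. The shortage in each formula is read here as the current-period shortage $S_{i t}$ , which is an interpretation; the function names and the coefficient values $ρ_{1} = 0.5$ , $ρ_{2} = 0.3$ , $ρ_{3} = 0.2$ are hypothetical. The fairness-oriented strategy is omitted because its three stages are specified only in Appendix 1.

```python
import math


def urgency_weights(shortage, demand, theta):
    """Action1: theta_i * (S_it / D_i)^2."""
    return [th * (s / d) ** 2 for s, d, th in zip(shortage, demand, theta)]


def log_balancing_weights(shortage, demand, theta):
    """Action2: theta_i * [1 + ln(1 + 10 * S_it / D_i)]."""
    return [th * (1.0 + math.log(1.0 + 10.0 * s / d))
            for s, d, th in zip(shortage, demand, theta)]


def mixed_weights(shortage, demand, theta, rho1, rho2, rho3):
    """Action3: rho1 * theta_i + rho2 * S_it / D_i + rho3 * D_i / sum_j D_j."""
    total_demand = sum(demand)
    return [rho1 * th + rho2 * s / d + rho3 * d / total_demand
            for s, d, th in zip(shortage, demand, theta)]


# Hypothetical state: region 0 is half served, region 1 is still unserved.
S, D, THETA = [5.0, 30.0], [10.0, 30.0], [2.0, 1.0]
w_urgency = urgency_weights(S, D, THETA)
w_log = log_balancing_weights(S, D, THETA)
w_mixed = mixed_weights(S, D, THETA, rho1=0.5, rho2=0.3, rho3=0.2)
print(w_urgency, w_log, w_mixed)
```

The quadratic form shrinks the weight of nearly satisfied regions quickly, whereas the logarithmic form keeps a baseline weight of $θ_{i}$ even when the shortage is zero, which may explain the "balancing" label.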

This framework transforms the complex multi-period resource allocation problem into a decision-making task that can be trained through deep reinforcement learning. Although the MDP formulation could theoretically be solved by dynamic programming or mixed-integer programming for small $n, T$ , the state space grows exponentially with the number of demand points and demand levels ( $O (\prod_{i} D_{i})$ ). For realistic scales (e.g., $n \geq 10, T \geq 100$ ), exact methods become computationally prohibitive. Moreover, in emergency response, decision-makers require near-optimal allocation policies within seconds. DRL offers a scalable alternative: it learns a parameterized policy offline and then executes it online without re-optimization, making it suitable for time-critical scenarios. Additionally, the proposed hybrid ACO-DQN framework can naturally accommodate future extensions to stochastic demands, where traditional optimization would require re-solving from scratch. Thus, RL is not merely a black-box optimizer but an essential tool for handling the problem’s sequential nature and large state space.

4.2. Design of the ACO-DQN algorithm

When a standard DQN is applied to the multi-period allocation problem, two main challenges arise: inefficient use of experience and undirected exploration. To address these issues, the pheromone mechanism of Ant Colony Optimization (ACO) is integrated into DQN.

4.2.1. Pheromone-guided experience utilization

In standard DQN, transitions are sampled uniformly from the replay buffer, and the varying learning value of different experiences is ignored. To bias learning towards historically successful actions, the Q-value update is augmented with a pheromone term.

$Q^{'} (s, α) = Q (s, α) + ξ \cdot Φ (s, α) \cdot C_{Φ}$ (8)

where $Φ (s, α)$ denotes the pheromone concentration associated with state-action pair $(s, α)$, $C_{Φ}$ is the pheromone confidence coefficient that adjusts its influence, and $ξ$ is a dynamic weighting factor that balances the contributions of TD-error and pheromone. The pheromone matrix is maintained separately from the Q-network. After each episode, pheromone concentrations are updated based on the cumulative reward: $Φ (s, α) \leftarrow (1 - ρ) Φ (s, α) + Δ Φ$. A closed-loop interaction is thereby created: value estimates are provided by the Q-network, the Q-value is biased by the pheromone, and the resulting policy performance feeds back into pheromone reinforcement. Unlike prioritized experience replay, which focuses on sampling efficiency, the proposed mechanism directly modifies the Q-value to favour actions that have led to high episodic returns, thereby accelerating convergence without violating Bellman consistency.
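
The pheromone bookkeeping described above can be sketched as follows. This is a minimal tabular illustration, not the authors' implementation: the class and function names are hypothetical, the deposit $Δ Φ$ is assumed to be proportional to the non-negative part of the episodic return (the section does not give its exact form), and states are assumed to be hashable keys.

```python
from collections import defaultdict


class PheromoneTable:
    """Pheromone levels kept separately from the Q-network."""

    def __init__(self, rho=0.1):
        self.rho = rho                   # evaporation rate in (1 - rho) * Phi
        self.phi = defaultdict(float)    # (state, action) -> pheromone level

    def update(self, visited_pairs, episode_return, deposit_scale=1.0):
        """Apply Phi <- (1 - rho) * Phi + delta after one episode."""
        visited = set(visited_pairs)
        for key in visited:
            self.phi[key]                # create the entry at 0.0 if it is new
        for key in self.phi:
            self.phi[key] *= (1.0 - self.rho)   # evaporation on every pair
        # ASSUMPTION: the deposit grows with the episodic return and is never negative.
        delta = deposit_scale * max(episode_return, 0.0)
        for key in visited:
            self.phi[key] += delta              # reinforcement on visited pairs


def augmented_q(q_value, phi_value, xi, c_phi):
    """Equation (8): Q'(s, a) = Q(s, a) + xi * Phi(s, a) * C_Phi."""
    return q_value + xi * phi_value * c_phi


table = PheromoneTable(rho=0.5)
table.update([("s0", 1)], episode_return=2.0)
table.update([("s0", 0)], episode_return=1.0)
biased = augmented_q(0.3, table.phi[("s0", 1)], xi=0.5, c_phi=0.2)
```

The evaporation rate $ρ$ here is unrelated to the coefficients $ρ_{1}$, $ρ_{2}$, $ρ_{3}$ of Action 3. With a neural Q-network the state would first need to be discretized or hashed to serve as a table key; how the paper does this is not described in this section.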

4.2.2. Adaptive exploration

The standard ε-greedy exploration strategy linearly decays randomness in a predetermined manner, which lacks adaptability and may lead to inefficient exploration behavior in complex decision spaces. A three-phase adaptive exploration mechanism is introduced, in which the balance between pheromone guidance and Q-value guidance is dynamically adjusted.

1) Early stage: Exploration is strongly guided by accumulated pheromone trails, enabling rapid bootstrap from high-quality historical experience and reducing dependence on purely random initial exploration. 2) Middle stage: A balanced mix of pheromone and Q-value guidance maintains a robust trade-off between exploration and exploitation. 3) Late stage: Exploration becomes strongly Q-value-driven, allowing the policy to finely converge towards the optimum predicted by the mature value network.
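
One way to realize the three phases is a piecewise schedule for the weight placed on pheromone guidance, as in the hedged sketch below. The phase boundaries (30% and 70% of training), the weights (0.8, 0.5, 0.1), the min-max normalization, and all function names are assumptions chosen for illustration; the paper's actual schedule is given in its appendices, not in this section.

```python
import random


def pheromone_weight(progress, early_end=0.3, late_start=0.7,
                     high=0.8, mid=0.5, low=0.1):
    """Weight on pheromone guidance; progress is the training fraction in [0, 1]."""
    if progress < early_end:
        return high      # early stage: pheromone-dominated
    if progress < late_start:
        return mid       # middle stage: balanced mix
    return low           # late stage: Q-value-dominated


def _normalise(values):
    """Min-max scale to [0, 1] so both signals are comparable."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def select_action(q_values, pheromones, progress, epsilon=0.0, rng=random):
    """Pick the action maximizing a phase-dependent blend of both signals."""
    if rng.random() < epsilon:                     # optional residual randomness
        return rng.randrange(len(q_values))
    beta = pheromone_weight(progress)
    q_n, p_n = _normalise(q_values), _normalise(pheromones)
    scores = [beta * p + (1.0 - beta) * q for p, q in zip(p_n, q_n)]
    return max(range(len(scores)), key=scores.__getitem__)


# Four actions: the Q-network prefers action 0, the pheromone prefers action 1.
q = [1.0, 0.0, 0.0, 0.0]
ph = [0.0, 1.0, 0.0, 0.0]
early_choice = select_action(q, ph, progress=0.1)
late_choice = select_action(q, ph, progress=0.9)
```

In this toy setting the early-stage choice follows the pheromone trail and the late-stage choice follows the Q-values, which mirrors the intended shift from heuristic memory to the learned value function.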

The flowchart of the main innovations of the algorithm is shown below (Figure 3).

Figure 3.

Innovations of the algorithm.

The complete training procedure of the ACO-DQN algorithm is outlined in Appendix 2. The complexity analysis of the proposed algorithm is given in Appendix 3.

5. Numerical experiments

This section conducts comprehensive numerical experiments to validate the effectiveness of the proposed ACO-DQN algorithm and the proposed emergency allocation model. The experiments consist of two parts: (1) performance assessment on randomly generated instances of varying scales, and (2) a case study based on actual emergency resource allocation data.

5.1. Algorithm performance evaluation

To evaluate the performance of the proposed ACO-DQN algorithm, we design four groups of randomly generated test instances with increasing complexity. The detailed parameter settings are as follows: Group 1: $n = 12$, $T = 15$, $C = 700$; Group 2: $n = 15$, $T = 20$, $C = 900$; Group 3: $n = 18$, $T = 25$, $C = 1200$; Group 4: $n = 20$, $T = 30$, $C = 1500$. For each group, we independently generate 50 random instances. In each instance, the total demand $D_{i}$ of each demand point is drawn uniformly from $[100, 1500]$ and the interdependency factor $θ_{i}$ from $[0.1, 2.5]$. The remaining parameters are set as $λ = 0.4$, $a = 0.3$. Each instance is solved 50 times by each algorithm to ensure statistical robustness.
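
For readers who wish to reproduce a comparable test bed, the instance settings above can be sketched as follows. The function name, the dictionary layout, and the seeding scheme are hypothetical, and demands are kept as real numbers because the text does not say whether they are rounded to integers.

```python
import random

# (n, T, C) for the four groups listed above.
GROUPS = {
    1: (12, 15, 700),
    2: (15, 20, 900),
    3: (18, 25, 1200),
    4: (20, 30, 1500),
}


def generate_instance(group, seed=None):
    """Draw one random instance for the given group."""
    n, T, C = GROUPS[group]
    rng = random.Random(seed)
    return {
        "n": n,
        "T": T,
        "C": C,
        "D": [rng.uniform(100, 1500) for _ in range(n)],      # total demands
        "theta": [rng.uniform(0.1, 2.5) for _ in range(n)],   # interdependency factors
        "lambda": 0.4,   # fairness weight
        "a": 0.3,        # sigmoid turning point
    }


# 50 instances per group; the seed formula is only there for reproducibility.
instances = {g: [generate_instance(g, seed=1000 * g + k) for k in range(50)]
             for g in GROUPS}
```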

This section evaluates the performance of the proposed ACO-DQN algorithm against DQN and DRL-GA on the four groups described above. The following subsections analyze solution quality, stability, convergence speed, and computational efficiency.

5.1.1. Solution quality and stability analysis

Table 1 reports the average objective value obtained by each algorithm on the four groups. The objective value integrates total cascading loss and fairness, where a lower value indicates better overall performance. The standard deviation reflects the stability of the algorithm across repeated runs.

Table 1.

Average objective value of three algorithms (mean ± standard deviation).

Group	ACO-DQN	DQN	DRL-GA
1	37.8318±0.4795	37.8440±0.4867	40.9056±0.9041
2	57.3948±0.8683	57.4299±0.8694	63.5976±1.3912
3	77.8959±1.0085	77.8518±1.0421	86.3369±1.6219
4	98.6038±1.4720	98.6264±1.5635	109.9494±1.8017

From Table 1, ACO-DQN achieves slightly lower objective values than DQN in three of the four groups (Groups 1, 2, and 4), with improvements ranging from 0.01 to 0.04; in Group 3, DQN is marginally lower (77.8518 versus 77.8959). These differences are small relative to the standard deviations, so the two algorithms are essentially equivalent in solution quality. The standard deviations of ACO-DQN are smaller than those of DQN in all four groups, indicating slightly better stability. In contrast, DRL-GA yields markedly higher objective values and larger standard deviations, demonstrating its inferior performance and robustness.

To gain deeper insight into the distribution of the results, Figure 4 presents boxplots of the objective values for the three algorithms on each group, based on the 50 instances. The boxplots show the median (central mark), interquartile range (box edges), and outliers (points beyond the whiskers).

From the boxplots in Figure 4, ACO-DQN consistently exhibits the lowest median and interquartile range across all four groups, indicating both superior solution quality and higher stability compared to DQN and DRL-GA. DQN yields slightly higher medians and somewhat wider distributions, while DRL-GA shows markedly elevated medians, larger spreads, and several outliers especially in Groups 3 and 4, confirming its inferiority and instability. Overall, the boxplot analysis reinforces the conclusion that ACO-DQN outperforms the other two algorithms in terms of both central tendency and robustness.

Figure 4.

Distribution of objective values for three algorithms.

5.1.2. Convergence speed and computational efficiency analysis

Since DRL-GA performs poorly in solution quality, we focus the convergence analysis on ACO-DQN and DQN. Table 2 reports the average number of episodes required to reach 90% of the maximum reward (convergence) and the average CPU time per instance. The improvement percentage is calculated as $(\text{DQN} - \text{ACO-DQN}) / \text{DQN} \times 100\%$; hence, a positive value indicates that ACO-DQN outperforms DQN (i.e., faster convergence or shorter runtime).

Table 2.

Convergence speed and computational efficiency comparison.

Group	Algorithm	Convergence speed		Computational efficiency
Group	Algorithm	Convergence (episodes)	Improvement	CPU time (s)	Improvement
1	ACO-DQN	258.4264	2.0040%	2.9010	1.1550%
1	DQN	263.7104	-	2.9349	-
2	ACO-DQN	249.5139	4.1500%	3.4657	0.9890%
2	DQN	260.3167	-	3.5003	-
3	ACO-DQN	227.2192	3.1740%	3.7941	0.7640%
3	DQN	234.6669	-	3.8233	-
4	ACO-DQN	222.3576	3.9180%	4.2262	0.7030%
4	DQN	231.4257	-	4.2561	-

As shown in Table 2, ACO-DQN consistently converges faster than DQN across all groups, with improvements ranging from 2.0% to 4.2%. The average reduction in convergence episodes is 3.3%. Moreover, ACO-DQN requires slightly less CPU time per instance. These results demonstrate that the pheromone-guided mechanism effectively accelerates the learning process. Therefore, ACO-DQN is more suitable for time-critical emergency response scenarios where rapid decision-making is essential.
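
The convergence improvements quoted above follow directly from the episode counts in Table 2. The short sketch below recomputes them with the stated formula; the variable and function names are illustrative only.

```python
# Average episodes to convergence from Table 2: group -> (ACO-DQN, DQN).
EPISODES = {
    1: (258.4264, 263.7104),
    2: (249.5139, 260.3167),
    3: (227.2192, 234.6669),
    4: (222.3576, 231.4257),
}


def improvement(aco_dqn, dqn):
    """(DQN - ACO-DQN) / DQN * 100; positive means ACO-DQN is better."""
    return (dqn - aco_dqn) / dqn * 100.0


per_group = {g: improvement(aco, dqn) for g, (aco, dqn) in EPISODES.items()}
mean_improvement = sum(per_group.values()) / len(per_group)
```

The recomputed values agree with the Improvement column to the reported precision, and their mean matches the 3.3% average reduction stated in the text.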

Based on the two parts of analysis, ACO-DQN demonstrates clear superiority. It not only matches DQN in solution quality and stability while outperforming DRL-GA by a large margin, but also converges faster with slightly lower CPU time. Therefore, ACO-DQN is a more efficient, stable, and reliable algorithm for multi-period emergency resource allocation, particularly in time-critical disaster response.

5.2. Sensitivity analysis of key parameters

To examine the impact of key parameters on model performance, we conduct sensitivity analysis using Group 3 instances ( $n = 15, T = 20$ ). The parameters tested include the sigmoid turning point $a$ , the fairness weight $λ$ , and the distribution center capacity $C$ . Each parameter is varied while holding the others at their default values ( $a = 0.3, λ = 0.4, C = 900$ ). The performance metrics are the total loss, the fairness, and the combined objective $F$ .

5.2.1. Effect of the sigmoid turning point $a$

The parameter $a$ determines the shortage ratio at which loss begins to escalate rapidly. A smaller $a$ makes the loss function more sensitive to shortages, leading to aggressive allocation, while a larger $a$ allows more tolerance. The parameter $a$ is varied from 0.1 to 0.5 in increments of 0.1.

In Table 3, as $a$ increases from 0.1 to 0.5, the total loss first rises slightly then declines, reaching its minimum at $a = 0.5$ with a reduction of 0.71% compared to $a = 0.1$. The fairness is lowest at $a = 0.4$ and increases moderately at $a = 0.5$. The combined objective $F$ is minimized at $a = 0.5$. The default value $a = 0.3$ yields a balanced trade-off where total loss is within 1.6% of the optimum and fairness remains moderate. Therefore, $a = 0.3$ is suitable for general scenarios, while a smaller $a$ may be preferred for highly urgent disasters where early shortages must be strictly avoided.

Table 3.

Sensitivity of performance metrics to $a$ .

$a$	Total loss	Fairness	Objective
0.1	77.4707	5.7034	79.7521
0.2	78.1143	5.9310	80.4867
0.3	78.1837	4.6081	80.0270
0.4	77.5643	3.1793	78.8360
0.5	76.9223	4.2120	78.6071

5.2.2. Effect of the fairness weight $λ$

The parameter $λ$ balances the trade-off between minimizing total loss and minimizing the fairness term (a lower value indicates a more equitable allocation). It is varied from 0 to 1.0 in steps of 0.2.

In Table 4, as $λ$ is increased from 0 to 1.0, an overall (though non-monotonic) downward trend is observed in the fairness, which reaches its minimum at $λ = 1.0$, corresponding to a reduction of 37.9% compared to $λ = 0$. Meanwhile, the total loss remains almost unchanged. The combined objective $F$ is minimized at $λ = 0$, but at $λ = 0.6$ it is still as low as 79.64, while the fairness is already reduced by 21.4%. A balanced trade-off is offered by the default value $λ = 0.4$, where the fairness is reduced by 7.1% relative to $λ = 0$, with only a 0.4% increase in total loss. Therefore, $λ = 0.4$ is considered suitable for general scenarios, whereas a larger $λ$ may be preferred when fairness is of primary concern.

Table 4.

Sensitivity of performance metrics to $λ$ .

$λ$	Total loss	Fairness	Objective
0.0	77.9279	4.2494	77.9279
0.2	77.8767	5.9350	79.0637
0.4	78.2647	3.9520	79.8455
0.6	77.6355	3.3424	79.6409
0.8	77.3633	3.9645	80.5349
1.0	77.9473	2.6441	80.5914
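
The rows of Table 4 are numerically consistent with a combined objective of the form $F = \text{Total loss} + λ \times \text{Fairness}$. This relation is inferred here from the tabulated values; the authoritative definition is the objective function given in the model section. A minimal check, with illustrative names, is:

```python
# Rows of Table 4: (lambda, total loss, fairness, reported objective).
TABLE4 = [
    (0.0, 77.9279, 4.2494, 77.9279),
    (0.2, 77.8767, 5.9350, 79.0637),
    (0.4, 78.2647, 3.9520, 79.8455),
    (0.6, 77.6355, 3.3424, 79.6409),
    (0.8, 77.3633, 3.9645, 80.5349),
    (1.0, 77.9473, 2.6441, 80.5914),
]


def combined_objective(total_loss, fairness, lam):
    """Inferred form: F = total loss + lambda * fairness."""
    return total_loss + lam * fairness


# Largest gap between the recomputed and the reported objective values.
max_gap = max(abs(combined_objective(loss, fair, lam) - reported)
              for lam, loss, fair, reported in TABLE4)
```

The same relation with $λ = 0.4$ also reproduces the Objective column of Table 3, for example $77.4707 + 0.4 \times 5.7034 \approx 79.7521$.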

5.2.3. Effect of distribution center capacity $C$

The per-period capacity is scaled by factors of 0.6, 0.8, 1.0, 1.2, and 1.4 relative to the default value $C = 900$ .

As the capacity factor is increased, both total loss and fairness decrease significantly. When the capacity is increased by 40% (factor 1.4), total loss is reduced by 26.1% and fairness is reduced by 51.0% compared to the baseline factor 1.0. Conversely, a 40% reduction in capacity (factor 0.6) leads to a 60.9% increase in total loss and a 46.4% increase in fairness relative to the baseline. The default capacity factor 1.0 represents a moderately constrained scenario, as further capacity increases yield diminishing returns. Hence, the default setting is considered appropriate for general situations, while capacity adjustments may be considered when resource availability is extremely tight or abundant (Table 5).

Table 5.

Sensitivity of performance metrics to capacity factor.

Capacity factor	Total loss	Fairness	Objective
0.6	126.1861	7.8311	129.3185
0.8	96.4987	4.4982	98.2980
1.0 (baseline)	78.4274	5.3534	80.5687
1.2	66.2052	2.7287	67.2966
1.4	57.9583	2.6218	59.0071

5.2.4. Robustness to $θ_{i}$ perturbations

Unlike the global parameters $a$ , $λ$ , and $C$ which are fixed across instances, the interdependency factors $θ_{i}$ vary across regions and are derived from real-world data (e.g., freight and passenger turnover). In practice, these indicators may contain measurement errors. To assess the model’s sensitivity to such errors, the $θ_{i}$ values are perturbed rather than varied uniformly. Specifically, for each Group 3 instance, the region with the largest $θ_{i}$ (the most influential node) is identified, and its $θ_{i}$ is multiplied by factors of 0.6, 0.8, 1.0, 1.2, and 1.4, while all other $θ_{i}$ are kept unchanged. This test examines whether moderate mis-estimation of a critical region’s interdependency would dramatically change the allocation outcome.
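
The perturbation procedure just described can be sketched in a few lines. The function name and the example values are hypothetical; only the rule itself (scale the largest $θ_{i}$ and leave the others unchanged) comes from the text.

```python
FACTORS = (0.6, 0.8, 1.0, 1.2, 1.4)


def perturb_most_influential(theta, factor):
    """Scale only the largest theta_i; all other entries stay unchanged."""
    k = max(range(len(theta)), key=theta.__getitem__)   # most influential node
    perturbed = list(theta)                             # do not mutate the input
    perturbed[k] = theta[k] * factor
    return perturbed


# Hypothetical three-region example (not data from the paper).
theta = [0.5, 2.0, 1.0]
scenarios = {f: perturb_most_influential(theta, f) for f in FACTORS}
```

Each perturbed vector would then be passed to the trained allocation model, and the resulting total loss, fairness, and objective compared with the factor 1.0 baseline.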

In Table 6, as the perturbation factor is varied from 0.6 to 1.4, the objective value changes by less than 1.1% relative to the baseline. For a 40% underestimation with factor 0.6, the objective increases by only 0.33%. For a 40% overestimation with factor 1.4, the objective increases by 0.50%. The fairness is more sensitive to underestimation but remains within an acceptable range. These results indicate that the model is robust to moderate estimation errors in the interdependency factors. Therefore, the default $θ_{i}$ values derived from real-world data are considered reliable for practical deployment.

Table 6.

Sensitivity to $θ_{i}$ perturbations.

Perturbation factor	Total loss	Fairness	Objective
0.6	76.8830	6.4764	79.4735
0.8	77.1070	5.2031	79.1883
1.0 (baseline)	77.2262	4.9627	79.2113
1.2	78.4336	3.9490	80.0133
1.4	77.3358	5.6758	79.6061

The above analysis confirms that the model behaves as expected and that the default parameter values $a = 0.3, λ = 0.4, C = 900$ are reasonable choices that balance the total loss and the fairness. The model is also robust to perturbations in $θ_{i}$ .

5.2. Case study

The proposed model is applied to a real-world scenario, the 2008 Wenchuan earthquake. Four severely affected regions are selected for analysis: Dujiangyan, Wenchuan, Beichuan, and Qingchuan. First, the allocation of prefabricated housing, which is a critical emergency resource for post-disaster shelter, is examined. Then, the generalizability of the model is tested by applying it to a different resource type, i.e., disinfectants for epidemic prevention. Finally, the adaptability of the proposed model to various emergency resources and disaster contexts is discussed.

5.2.1. Prefabricated housing allocation

The demand for prefabricated housing in each region is estimated using the formula: $\text{Demand} = \text{Planned area per household} \times \text{Total affected population} / \text{Average household size coefficient}$. The affected populations in the four regions are 7457, 58454, 18298, and 20272, respectively. Assuming an average household size coefficient of 3.5 and a planned area of 20 m² per household, the emergency resource demands for the four regions are calculated as 42611 m², 334023 m², 104560 m², and 115840 m², respectively. The planning horizon is set as $T = 6$ periods, and the distribution center capacity is set as $C = 90000$ m².
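
The demand figures above can be checked with a direct application of the formula. The names below are illustrative, and rounding to the nearest square metre is an assumption that matches the reported values.

```python
AFFECTED_POPULATION = {
    "Dujiangyan": 7457,
    "Wenchuan": 58454,
    "Beichuan": 18298,
    "Qingchuan": 20272,
}
AREA_PER_HOUSEHOLD_M2 = 20
HOUSEHOLD_SIZE = 3.5


def housing_demand(population):
    """Planned area per household * affected population / household size."""
    return round(AREA_PER_HOUSEHOLD_M2 * population / HOUSEHOLD_SIZE)


demand_m2 = {region: housing_demand(p) for region, p in AFFECTED_POPULATION.items()}
total_demand_m2 = sum(demand_m2.values())
```

With $C = 90000$ m² per period over $T = 6$ periods, the total supply of 540000 m² is below the total demand, so shortages are unavoidable and the allocation order matters.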

To measure regional interdependency, this study adopts road freight turnover and road passenger turnover as indicators. Freight turnover reflects the strength of supply chain networks for raw materials and goods, while passenger turnover captures socioeconomic linkages such as labor and business flows. Together, they form the basis for regional interdependency. According to the Sichuan Statistical Yearbook 2008, the road passenger turnovers for the four regions are: 122256, 60353, 6096, and 10979 (10,000 person·km). The corresponding road freight turnovers are: 5908, 120431, 3859, and 2792 (10,000 ton·km).

To eliminate scale differences and integrate both dimensions, the raw data are normalized, and a weighted sum approach is used to construct a composite interdependency index: $θ_{i} = w^{f} \times \text{Freight}_{i} + w^{p} \times \text{Passenger}_{i}$, where $\text{Freight}_{i}$ and $\text{Passenger}_{i}$ are the normalized freight and passenger turnovers of region $i$, respectively. In this study, the weights are set as $w^{f} = 0.5$, $w^{p} = 0.5$. The calculated interdependency values for the four regions are 0.5245, 0.7469, 0.0410, and 0.0565.
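
The composite index can be reproduced from the turnover figures quoted above. The text does not name the normalization method; the sketch below assumes division by the maximum across the four regions, which reproduces the reported $θ_{i}$ values to within rounding. The names are illustrative.

```python
REGIONS = ("Dujiangyan", "Wenchuan", "Beichuan", "Qingchuan")
PASSENGER = (122256, 60353, 6096, 10979)   # 10,000 person-km
FREIGHT = (5908, 120431, 3859, 2792)       # 10,000 ton-km
W_F, W_P = 0.5, 0.5                        # weights stated in the text


def max_normalise(values):
    """ASSUMPTION: scale by the largest value so the maximum becomes 1."""
    top = max(values)
    return [v / top for v in values]


freight_n = max_normalise(FREIGHT)
passenger_n = max_normalise(PASSENGER)
theta = {region: W_F * f + W_P * p
         for region, f, p in zip(REGIONS, freight_n, passenger_n)}
```

Under this assumption Dujiangyan scores highly because of its passenger turnover and Wenchuan because of its freight turnover, while Beichuan and Qingchuan are low on both indicators.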

The resulting total objective value is 16.7058, comprising a total loss of 15.2007 and a fairness term of 1.5051. The losses attributed to each region are: 1.7553, 5.5179, 3.9173, and 4.0102, respectively. The emergency resource allocation plan across the six periods is presented below.

Table 7 presents the multi-period emergency resource allocation plan generated by the proposed model for the case study. The results illustrate the model’s ability to balance operational efficiency with distributional fairness across interconnected regions throughout the planning horizon.

Table 7.

Emergency resource allocation plan for prefabricated housing.

Period	Dujiangyan	Wenchuan	Beichuan	Qingchuan
1	24698	32399	16236	16665
Satisfaction Rate	57.96%	9.70%	15.53%	14.39%
2	17913	37474	16759	17423
Satisfaction Rate	100.00%	20.92%	31.56%	29.43%
3	0	49353	19709	20937
Satisfaction Rate	100.00%	35.69%	50.41%	47.50%
4	0	53102	17630	19267
Satisfaction Rate	100.00%	51.59%	67.27%	64.13%
5	0	57414	15291	17294
Satisfaction Rate	100.00%	68.78%	81.89%	79.06%
6	0	62360	12674	14965
Satisfaction Rate	100.00%	87.45%	94.01%	91.98%
Allocated Quantity	42611	292102	98299	106551

The model first prioritizes Dujiangyan, whose demand is fully satisfied within the first two periods. This allocation priority stems from the high interdependency coupled with moderate absolute demand of Dujiangyan. Addressing its shortage early effectively mitigates potential cascading losses, thereby accelerating the overall recovery of affected regions.

Subsequently, the model adopts a phased approach to allocate resources to the remaining regions. Although Wenchuan has the highest interdependency, its resource demand is far larger than that of any other region. To prevent overallocation to a single high-priority node at the expense of systemic fairness, the model gradually increases Wenchuan’s allocation share over successive periods. This results in a final satisfaction rate of 87.45% by Period 6, slightly lower than those of Beichuan (94.01%) and Qingchuan (91.98%).

This outcome illustrates the balance between efficiency and fairness of the proposed model. Beichuan and Qingchuan, despite their low interdependency, achieve high satisfaction rates because their smaller absolute demand can be met without severely compromising allocations to the critical, high-demand region Wenchuan. Thus, the model avoids the extremes of purely efficiency-driven or purely fairness-driven allocation.

5.2.2. Disinfectant allocation

To demonstrate the model’s applicability to a different category of emergency resources, we apply the same framework to disinfectants (e.g., 84 disinfectant solution, bleaching powder), which are essential for post-disaster epidemic prevention. The demand for disinfectants is estimated based on the standard practice of large-scale environmental disinfection after earthquakes. The required amount of disinfectant is approximately proportional to the affected population. A commonly used ratio in disaster logistics is 10 liters of disinfectant concentrate per affected person for the initial six-week response period. Thus, the demand for region $i$ is calculated as: $\text{Disinfectant demand}_{i} = \text{Affected population}_{i} \times 10$ (liters). The resulting demands are 74,570 L, 584,540 L, 182,980 L, and 202,720 L for Dujiangyan, Wenchuan, Beichuan, and Qingchuan, respectively. The distribution center capacity is set to $C = 160{,}000$ L per period (scaled to the demand magnitude), and the planning horizon remains $T = 6$ periods.

The sigmoid turning point $a$ is recalibrated to 0.1 for disinfectants. This lower value reflects the higher urgency of epidemic prevention: even a small shortage of disinfectants can lead to rapid disease spread and secondary disasters, so losses escalate more quickly than for prefabricated housing. All other parameters remain unchanged.

The resulting total objective value is 16.0756, comprising a total loss of 14.8235 and a fairness term of 1.2521. The losses attributed to each region are: 1.5463, 4.3005, 4.6765, and 4.3002, respectively. The emergency resource allocation plan across the six periods is presented in Table 8.

Table 8.

Emergency resource allocation plan for disinfectant.

Period	Dujiangyan	Wenchuan	Beichuan	Qingchuan
1	61304	87301	4792	6603
Satisfaction Rate	82.2%	14.9%	2.6%	3.3%
2	13266	132965	5796	7973
Satisfaction Rate	100%	37.7%	5.8%	7.2%
3	0	139592	8602	11806
Satisfaction Rate	100%	61.6%	10.5%	13.0%
4	0	137193	9634	13173
Satisfaction Rate	100%	85.0%	15.8%	19.5%
5	0	87489	12198	60313
Satisfaction Rate	100%	100%	22.4%	49.3%
6	0	0	72106	87894
Satisfaction Rate	100%	100%	61.8%	92.6%
Allocated Quantity	74570	584540	113128	187762

Both the total loss (14.8235) and the fairness term (1.2521) of the plan in Table 8 are slightly lower than those for prefabricated housing (15.2007 and 1.5051, respectively). This reduction is primarily driven by the smaller $a$, which makes the loss function more sensitive and forces more aggressive allocation to high-interdependency regions.

The allocation pattern differs notably from the prefabricated housing case. Dujiangyan (interdependency 0.5245, demand 74,570 L) is fully satisfied by the end of Period 2. Wenchuan (interdependency 0.7469, demand 584,540 L) receives the largest share in each of the first five periods and reaches full satisfaction by Period 5. Beichuan and Qingchuan, which have very low interdependency (0.0410 and 0.0565), are largely deferred: through Period 4 their satisfaction rates remain below 20%, and at Period 6 they reach 61.8% and 92.6%, compared to 94.0% and 92.0% for prefabricated housing. This indicates that when urgency is high, the model prioritizes high-interdependency regions even more aggressively, at the expense of low-interdependency areas, with Beichuan bearing most of the remaining shortfall. Overall, the plan satisfies about 91.9% of total disinfectant demand, confirming that the model adapts effectively to resource criticality by recalibrating $a$.
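
The satisfaction rates in Table 8 are cumulative allocations divided by regional demand. The following sketch recomputes them, together with the overall coverage, from the per-period quantities in the table; the names are illustrative.

```python
from itertools import accumulate

REGIONS = ("Dujiangyan", "Wenchuan", "Beichuan", "Qingchuan")
DEMAND = (74570, 584540, 182980, 202720)        # liters
ALLOCATION = (                                   # one row per period (Table 8)
    (61304, 87301, 4792, 6603),
    (13266, 132965, 5796, 7973),
    (0, 139592, 8602, 11806),
    (0, 137193, 9634, 13173),
    (0, 87489, 12198, 60313),
    (0, 0, 72106, 87894),
)


def cumulative_satisfaction(allocation, demand):
    """rates[i][t] = share of region i's demand met by the end of period t+1."""
    per_region = zip(*allocation)                # transpose to region-major order
    return [[c / d for c in accumulate(series)]
            for series, d in zip(per_region, demand)]


rates = cumulative_satisfaction(ALLOCATION, DEMAND)
coverage = sum(map(sum, ALLOCATION)) / sum(DEMAND)
period_totals = [sum(row) for row in ALLOCATION]
```

Every period total equals the capacity of 160,000 L, so the capacity constraint is binding throughout the horizon and the shortfall comes from limited supply rather than from unused stock.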

5.2.3. Generalizability of the proposed model

The case studies above demonstrate that the proposed model is not limited to a specific resource type. Whether shelter materials (prefabricated housing) or epidemic prevention supplies (disinfectants) are allocated, the same framework, which includes the interdependency network $θ_{i}$, the sigmoid loss function with urgency parameter $a$, and the fairness-efficiency trade-off controlled by $λ$, produces plausible and interpretable allocation plans. This generalizability is attributed to three features. First, the interdependency factor $θ_{i}$ captures resource-independent systemic importance based on economic and logistical networks. Second, the turning point $a$ can be calibrated to reflect the urgency of any given resource type. Third, the fairness term is universally applicable to any resource where equitable distribution is a concern.

Therefore, the proposed model can be readily extended to other emergency resources, such as food, drinking water, medical supplies, and fuel, by recalculating demands based on population or other relevant indicators and by adjusting $a$ according to the criticality of the resource.

5.3. Managerial implications

The proposed model and algorithm offer several actionable insights for emergency managers and policymakers involved in multi-period resource allocation under regional interdependencies.

(1) Priority setting based on interdependency. The interdependency factor $θ_{i}$ quantifies the systemic importance of each region using readily available data (e.g., freight and passenger flows). Decision makers can use $θ_{i}$ to identify critical nodes in the affected area and allocate resources preferentially to regions with higher $θ_{i}$ . This helps mitigate cascading losses and accelerates overall recovery.

(2) Balancing efficiency and fairness. The fairness weight $λ$ provides a transparent mechanism to trade off total system loss against inter-regional equity. Our experiments show that $λ = 0.4$ reduces fairness disparity by 7.1% with only a 0.4% increase in total loss, offering a pragmatic default. In practice, when distributional justice is a primary concern (e.g., in politically sensitive contexts), a larger $λ$ can be adopted; when minimizing overall damage is paramount, a smaller $λ$ may be preferred.

(3) Adjusting urgency via the sigmoid turning point. The parameter $a$ controls how quickly losses escalate with shortage. For life-critical resources such as disinfectants, a small $a$ forces aggressive allocation to high-interdependency regions, as shown in the case study. For less urgent resources like temporary housing, a larger $a$ allows a more balanced distribution. This flexibility enables managers to calibrate the model according to the resource’s criticality and the disaster’s time pressure.

(4) Capacity planning. The sensitivity analysis reveals that increasing capacity beyond a certain point yields diminishing returns. Managers should prioritize ensuring a minimum adequate capacity before investing in further expansion. When capacity is severely constrained, the model automatically favors high-interdependency regions, which may be an acceptable trade-off in extreme scarcity.

(5) Algorithm selection for real-time deployment. The ACO-DQN algorithm converges 2.0% to 4.2% faster than standard DQN while maintaining comparable solution quality, and its inference time is below 0.5 seconds per decision period. This makes it suitable for time-critical emergency response where rapid re-planning is required as new demand information arrives.

6. Conclusion

This paper investigates a multi-period emergency resource allocation problem in post-disaster settings with interdependent regions. First, a loss metric model is developed to capture the cascading effects of resource shortages across interdependent regions. By modeling demand points as a system, the proposed framework provides a more realistic quantification of losses caused by shortages. Unlike deprivation cost models that assume independent regions, our loss function explicitly incorporates interdependency factors to reflect cascading dynamics. Second, based on the loss model, a multi-period optimization model is formulated to minimize loss and maximize fairness. Third, to solve the proposed model efficiently, a novel hybrid algorithm (ACO-DQN) is designed, which integrates the pheromone-guided mechanism of ACO into DQN. This integration enables dynamic guidance of the exploration process and enhances the Q-value update strategy. In contrast to existing heuristic-DRL hybrids that use heuristics only for action filtering or population initialization, our ACO-DQN directly embeds the pheromone signal into the Q-value update, creating a closed-loop interaction between heuristic memory and value learning. Numerical experiments confirm that ACO-DQN outperforms the standard DQN in terms of both computational efficiency and solution stability. Furthermore, a case study based on the 2008 Wenchuan earthquake illustrates that the proposed model enables decision-makers to balance efficiency and fairness under resource constraints.

While the proposed algorithm shows advantages in small-scale problems, its advantage in large-scale and complex scenarios remains less pronounced. Therefore, future research will focus on exploring more efficient optimization algorithms suitable for large-scale emergency resource allocation problems.

Supplemental material

Supplemental material - Multi-period emergency resource allocation problem with a hybrid ant colony optimization and deep Q-network algorithm

Supplemental material for Multi-period emergency resource allocation problem with a hybrid ant colony optimization and deep Q-network algorithm by Jingke Zhou, and Yingzhen Chen in Science Progress.

Footnotes

Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grant No. 72374067.

ORCID iD

Yingzhen Chen

Author contributions

Zhou Jingke (First Author): Conceptualization, Methodology, Software, Validation, Investigation, Data curation, Writing-original draft, Writing-review & editing, Visualization. Chen Yingzhen (Corresponding Author): Conceptualization, Methodology, Supervision, Project administration, Funding acquisition, Writing-original draft, Writing-review & editing.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was funded by National Natural Science Foundation of China (72374067).

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

The data used in this study are derived from publicly available sources (e.g., Sichuan Statistical Yearbook 2008) and simulated instances generated by the authors. The simulation data supporting the findings are available from the corresponding author upon reasonable request.

References

Kundu

Sheu

Kuo

. Emergency logistics management-review and propositions for future research. Transp Res E Logist Transp Rev 2022; 164: 102789. https://doi.org/10.1016/j.tre.2022.102789

Meng

Wang

, et al. A two-stage chance constrained stochastic programming model for emergency supply distribution considering dynamic uncertainty. Transp Res E Logist Transp Rev 2023; 179: 103296. https://doi.org/10.1016/j.tre.2023.103296

Zhang

Dai

, et al. Robust optimization of emergency resource location and coupling allocation considering multiple uncertainties. Socioecon Plann Sci 2026; 103: 102391. https://doi.org/10.1016/j.seps.2025.102391

Nahavandi

Homayounfar

Daneshvar

, et al. Hierarchical structure modelling in uncertain emergency location-routing problem using combined genetic algorithm and simulated annealing. Int J Comput Appl Technol 2022; 68(2): 150–163. https://doi.org/10.1504/ijcat.2022.123466

Fan

Chang

Mišić

, et al. DHL: deep reinforcement learning-based approach for emergency supply distribution in humanitarian logistics. Peer Peer Netw Appl 2022; 15(5): 2376–2389. https://doi.org/10.1007/s12083-022-01353-0

Shi

Zhang

, et al. Dynamic truck-drone cooperative delivery of emergency supplies considering secondary disasters. Transp Res E Logist Transp Rev 2026; 206: 104543. https://doi.org/10.1016/j.tre.2025.104543

Lei

Guo

Shi

, et al. A novel embedded dual-layer multi-objective evolutionary algorithm for multimodal emergency logistics delivery under demand uncertainty and supply shortage. Swarm Evol Comput 2026; 100: 102279. https://doi.org/10.1016/j.swevo.2025.102279

Balcik

Beamon

. Facility location in humanitarian relief. Int J Logist Res Appl 2008; 11(2): 101–121. https://doi.org/10.1080/13675560701561789

Rawls

Turnquist

. Pre-positioning of emergency supplies for disaster response. Transp Res B Methodol 2010; 44(4): 521–534. https://doi.org/10.1016/j.trb.2009.08.003

10. Holguín-Veras, Pérez, Jaller, et al. On the appropriate objective function for post-disaster humanitarian logistics models. J Oper Manag 2013; 31(5): 262–280. https://doi.org/10.1016/j.jom.2013.06.002

11. Wang, de Vries. Quantifying human suffering for humanitarian logistics: deprivation cost versus deprivation level. Transp Res E Logist Transp Rev 2026; 207: 104612. https://doi.org/10.1016/j.tre.2025.104612

12. Chen, et al. Regional economic impact of flood disasters in Yangtze River Economic Zone: a TERM model with a decomposition analysis approach. Int J Disaster Risk Reduct 2025; 120: 105346. https://doi.org/10.1016/j.ijdrr.2025.105346

13. Ding. Interregional economic impacts of an extreme storm flood scenario considering transportation interruption: a case study of Shanghai, China. Sustain Cities Soc 2023; 88: 104296. https://doi.org/10.1016/j.scs.2022.104296

14. Wang, Song, et al. A reinforcement learning-assisted search and rescue resource allocation decision-making approach for maritime emergencies. Comput Ind Eng 2025; 201: 110933. https://doi.org/10.1016/j.cie.2025.110933

15. Beamon. Humanitarian relief chains: issues and challenges. In: Proceedings of the 34th International Conference on Computers and Industrial Engineering. University of Washington, 2004, pp. 77–82.

16. Balcik, Beamon, Smilowitz. Last mile distribution in humanitarian relief. J Intell Transp Syst 2008; 12(2): 51–63. https://doi.org/10.1080/15472450802023329

17. Ahmadi, Seifi, Tootooni. A humanitarian logistics model for disaster relief operation considering network failure and standard relief time: a case study on San Francisco district. Transp Res E Logist Transp Rev 2015; 75: 145–163. https://doi.org/10.1016/j.tre.2015.01.008

18. Holguín-Veras, Jaller, Van Wassenhove, et al. On the unique features of post-disaster humanitarian logistics. J Oper Manag 2014; 32(7–8): 386–395.

19. Holguín-Veras, Pérez, Jaller, et al. Material convergence: important and understudied disaster phenomenon. Nat Hazards Rev 2016; 17(1): 04015015.

20. Shi, Wang, et al. A dynamics model of the emergency medical supply chain in epidemic considering deprivation cost. Socioecon Plann Sci 2024; 94: 101924. https://doi.org/10.1016/j.seps.2024.101924

21. A novel stochastic-robust optimization model for emergency supplies prepositioning under uncertain scenario probabilities. Expert Syst Appl 2026; 297: 129380. https://doi.org/10.1016/j.eswa.2025.129380

22. Tzeng, Cheng, Huang. Multi-objective optimal planning for designing relief delivery systems. Transp Res E Logist Transp Rev 2007; 43(6): 673–686. https://doi.org/10.1016/j.tre.2006.10.012

23. Huang, Jiang, Yuan, et al. Modeling multiple humanitarian objectives in emergency response to large-scale disasters. Transp Res E Logist Transp Rev 2015; 75: 1–17. https://doi.org/10.1016/j.tre.2014.11.007

24. Huang, Zhu, Wang, et al. Balancing the trade-off between efficiency and equity in a stochastic emergency supplies allocation problem. Appl Math Model 2025; 148: 116242. https://doi.org/10.1016/j.apm.2025.116242

25. Helbing. Globally networked risks and how to respond. Nature 2013; 497(7447): 51–59. https://doi.org/10.1038/nature12047

26. Jiang, Liang, et al. Risk propagation and intervention in a complex supply chain network under public emergency. Comput Ind Eng 2026; 213: 111797. https://doi.org/10.1016/j.cie.2025.111797

27. Zeng, Wei, et al. Deep reinforcement learning based medical supplies dispatching model for major infectious diseases: case study of COVID-19. Oper Res Perspect 2023; 11: 100293. https://doi.org/10.1016/j.orp.2023.100293

28. Lei, Liu. Multi-disaster emergency response decision support based on reinforcement learning algorithm. Procedia Comput Sci 2025; 261: 887–895. https://doi.org/10.1016/j.procs.2025.04.418

29. Peng, Wang, Yin, et al. Multi-agent deep reinforcement learning-based truck-drone collaborative routing with dynamic emergency response. Transp Res E Logist Transp Rev 2025; 195: 103974. https://doi.org/10.1016/j.tre.2025.103974

30. Gao, Wang. Post-disaster emergency supplies distribution optimization: a deep reinforcement learning approach. In: Proceedings of the 24th Wuhan International Conference on E-Business (WHICEB 2025). Cham: Springer, 2025. https://doi.org/10.1007/978-3-031-94184-9_11

31. Wang, Fan, Zhu, et al. When demand uncertainty occurs in emergency supplies allocation: a robust DRL approach. Appl Sci 2026; 16(2): 581. https://doi.org/10.3390/app16020581

32. Wang, Hao, et al. Multiobjective vehicle routing optimization with time windows: a hybrid approach using deep reinforcement learning and NSGA-II. IEEE Trans Intell Transp Syst 2025; 26(3): 4032–4047. https://doi.org/10.1109/tits.2024.3515997

33. Pradhan, Bisoy, Kautish, et al. Intelligent decision-making of load balancing using deep reinforcement learning and parallel PSO in cloud environment. IEEE Access 2022; 10: 76939–76952. https://doi.org/10.1109/access.2022.3192628

34. Kosanoglu, Atmis, Turan. A deep reinforcement learning assisted simulated annealing algorithm for a maintenance planning problem. Ann Oper Res 2024; 339(1): 79–110. https://doi.org/10.1007/s10479-022-04612-8

35. Meng, Tang. Scheduling of continuous annealing with a multi-objective differential evolution algorithm based on deep reinforcement learning. IEEE Trans Autom Sci Eng 2023; 21(2): 1767–1780. https://doi.org/10.1109/tase.2023.3244331

36. Buzna, Peters, Helbing. Modelling the dynamics of disaster spreading in networks. Physica A 2006; 363(1): 132–140. https://doi.org/10.1016/j.physa.2006.01.059

37. Chou, Yang, Cheng, et al. Identification and assessment of heavy rainfall-induced disaster potentials in Taipei City. Nat Hazards 2013; 66(2): 167–190. https://doi.org/10.1007/s11069-012-0511-z

38. Chen, Zhao, Huang, et al. A bi-objective optimization model for contract design of humanitarian relief goods procurement considering extreme disasters. Socioecon Plann Sci 2022; 81: 101214. https://doi.org/10.1016/j.seps.2021.101214

39. Rawls. A theory of justice. 1st ed. Belknap Press of Harvard University Press, 1971.

