Sage Journals: Discover world-class research

Abstract

A feasibility-guided deep Q-network (FG-DQN) framework for the collaborative optimization of flexible resources is proposed in this paper to enhance power system resilience under extreme events. Targeting the coordinated siting, sizing, and switching of mobile generators, energy storage systems, reactive compensation devices, and reserve interconnections, the resilience enhancement task is formulated as a constrained sequential decision problem under multiple concurrent contingencies. AC power-flow equations, operational-safety limits, and budget constraints are embedded into the environment, and a normalized incremental-potential reward aligned with a composite resilience score is introduced. A width-limited search mechanism for reserve-interconnection selection is further developed to improve computational tractability and execution quality. Simulation results on the IEEE-30 and IEEE-57 systems confirm that more balanced resilience improvements and better overall solution quality can be achieved under unified feasibility checks.

Keywords

deep Q-networks, deep reinforcement learning, extreme events, power system resilience

Introduction

In recent years, extreme weather events and cyberattacks have become more frequent, exposing power grids to higher uncertainty and risk and creating the need for coordinated planning, operation, and emergency response.¹ Power system resilience is regarded as a key capability that spans the withstand, adapt, and recover stages and supports critical-load protection and rapid restoration through unified topological and physical representations.^2,3

However, under multiple concurrent contingencies, resilience evaluation and operational decision-making are often treated inconsistently in existing studies.^4,5 At the same time, the high dimensionality of deployment actions is not handled effectively, and scalable decision structures remain limited.⁶ As a result, the treatment of the feasible operating region, the design of reward signals, and the mechanisms for action deployment are still inadequate and require further investigation.

One central challenge is that many control actions explored during training violate AC power-flow and operational-safety limits, and these infeasible actions accumulate in the replay buffer and slow down policy improvement. In Zhao and Wang,⁷ distribution-system restoration was framed as a graph-reinforcement-learning sequential decision problem and recovery efficiency was improved, but AC feasibility was not enforced during training and limit-violating samples persisted. In Bedoya et al.,⁸ deep reinforcement learning under asynchronous information was applied to enhance distribution-system resilience and policy effectiveness was demonstrated, but physical feasibility relied on ex-post checks and ineffective exploration during training could not be prevented. In Zhang et al.,⁹ a curriculum-based reinforcement learning method was proposed to improve convergence and stability for critical-load restoration, but over-limit and infeasible actions were handled heuristically rather than precluded by design. In Cao et al.,¹⁰ a two-timescale, physical-model-free voltage-control scheme was shown to be effective, but training stability still depended on external feasibility verification. In Xie et al.,¹¹ robustness analyses showed that infeasible samples introduced value-function bias and amplified variance accumulation. In Butt et al.,¹² deep reinforcement learning for resilient power and energy systems was surveyed and it was argued that embedding physical consistency in the environment remains essential for deployment. In Zimmerman et al.,¹³ environment-side pre-screening and sample cleaning that remove infeasible actions before replay were discussed, but systematic integration with deep reinforcement learning for resilience remained limited. 
In parallel, nonconvex optimal PMU placement formulations and related solution strategies have been studied in the context of wide-area monitoring and state estimation.^14,15 Mathematical models and algorithms for channel-constrained PMU allocation and arrangement under practical operation settings were also reported.^16,17 Cyber-physical security aspects, including complementarity-based reformulations of false-data-injection attacks on PMU-only state estimation, have been investigated as well.¹⁸ Taken together, these studies show that an effective handling mechanism for infeasible actions is urgently needed so that AC power-flow and security limits are enforced within the learning process and physically invalid decisions are filtered before they contaminate the replay buffer.

Furthermore, another critical issue is the misalignment between the reward signal and composite resilience objectives, which leads to policy drift and unstable convergence. In Shi et al.,¹⁹ a comparative analysis of resilience metrics in power and control systems was conducted and it was found that inconsistent normalization and weighting across systems undermined the reliability of cross-case conclusions. In Vijay and De,²⁰ resilience definitions, assessment methods, and enhancement strategies were synthesized and unified metric protocols were called for so that optimization targets and control signals would not become misaligned. In Zhang et al.,²¹ multi-objective learning in deep-reinforcement-learning-driven resilient dispatch was illustrated, but an explicit mapping consistent with a composite resilience score was not provided. The survey in Yang et al.²² on reinforcement learning for energy and electric systems underscored that reward design aligned with task objectives is essential for sample efficiency and bias reduction. Taken together, these studies show that, although various resilience metrics and reinforcement-learning based control frameworks have been proposed, many approaches either optimize each indicator separately or aggregate multiple indicators through simple weighted sums, leading to reward signals that remain misaligned with the intended composite resilience objectives. Hence, a normalized incremental, potential-based reward aligned with the composite resilience score is needed to guide coordinated improvement in connectivity, critical-line survival, load-loss ratio, and overload ratio.

Finally, large-scale selection of reserve interconnections and topology configurations introduces significant combinatorial complexity. In Hedman et al.,²³ optimal transmission switching was coupled with contingency analysis, and discrete topology variables together with tightly coupled security constraints were shown to create heavy computational burdens. Aziz et al.²⁴ and Zhou et al.²⁵ pointed out that as constraints and candidate sets grow, the combinatorial search space expands rapidly, making a single feed-forward decision unlikely to yield near-optimal solutions within acceptable computational effort when the system scale becomes large. In Chen et al.,²⁶ wildfire-driven cascades were analyzed and the importance of maintaining operational feasibility and consistent constraint handling under extreme scenarios was emphasized. In Gholizadeh and Musilek,²⁷ multi-timescale multi-agent graph-reinforcement-learning coordination improved economic efficiency and stability in high-renewable distribution networks, but an execution-stage near-optimal subset search for structural actions was not designed. In Tian et al.,²⁸ explainable reinforcement learning for distribution reconfiguration was investigated and it was argued that controllable combinatorial complexity together with physical consistency was essential for deployable robustness. Taken together, these studies show that the complex topology of large-scale power systems leads to highly complex combinations of reserve interconnections, which in practical systems may also include cross-area emergency support through HVDC links. Execution-level schemes that jointly balance computational cost and solution quality under AC-feasibility constraints remain scarce, so controlling combinatorial complexity in the execution stage remains a key requirement for enhancing power system resilience.

In summary, existing studies have advanced resilience assessment, reinforcement learning based restoration, and topology optimization from different perspectives. However, important gaps still remain in the treatment of the operational safety feasible region, in reward design under multiple concurrent contingencies, and in execution stage procedures that should balance computational efficiency and solution quality.

To address these limitations, a unified framework is needed for the collaborative optimization of flexible resources under multiple concurrent contingencies. Within such a framework, feasibility screening under AC power flow, reward design for resilience improvement, and control of combinatorial complexity during implementation should be handled in an integrated manner. On this basis, a feasibility guided deep Q-network framework is proposed for the collaborative optimization of flexible resources in power systems. The main contributions are summarized as follows:

(1) A decision mechanism with explicit feasibility checking is established by embedding AC power flow equations, operational safety limits, and budget constraints into the environment. As a result, the feasibility of each action can be evaluated before execution and experience replay, thereby mitigating the adverse effects of invalid actions on policy learning.

(2) A normalized reward design based on incremental potential is developed in accordance with the composite resilience score. Accordingly, the evolution of connectivity, critical line survival, load loss ratio, and overload ratio can be jointly guided by a unified learning objective.

(3) A search strategy is developed for reserve interconnection selection during implementation. By combining beam search with AC power flow verification, the computational burden and the variation of training outcomes can be reduced without enlarging the action space.

The remainder of this paper is organized as follows. Section II introduces the physical modeling and resilience-evaluation framework. Section III presents the FG-DQN methodology, including training-execution integration and width-limited search. Section IV reports numerical studies on the IEEE-30 and IEEE-57 systems, with comparative and ablation results. Section V concludes the work and outlines limitations, practical implications, and directions for future research.

Physical modeling and evaluation metrics

Modeling and evaluation framework

A unified end-to-end framework for FG-DQN is shown in Figure 1.

Figure 1.

Flow of the overall methodology.

The workflow in Figure 1 is organized into four lanes that implement the FG-DQN framework. The preprocessing lane uses structural information and N-1 security checking to identify vulnerable components and to generate weighted parallel N-3 contingencies from data and physical priors. The modeling lane defines the feasible domain through AC power-flow equations and operational safety limits and casts resilience-oriented resource deployment as a Markov decision process. The solution lane carries out policy learning with FG-DQN and applies a width-limited, quality-preserving beam search to candidate backup-line combinations, while AC power-flow feasibility checks are embedded to remove infeasible structural actions. The verification and evaluation lane computes resilience metrics such as connectivity, critical-line survival, load-loss ratio, and overload ratio together with a composite resilience score under unified criteria, forming a comparable and reproducible evaluation loop.²⁹ Information flows from the preprocessing lane to the modeling lane to construct state and action spaces and feasible-domain constraints, and trajectories generated in the solution lane are stored in the replay buffer and passed to the verification and evaluation lane for performance assessment. Evaluation results can then be used to tune reward weights and deployment rules and to adjust scenario generation, which jointly improves numerical stability and engineering feasibility.^30,31 To improve clarity and reproducibility, Algorithm 1 summarizes the unified end-to-end workflow illustrated in Figure 1.

Algorithm 1. Overall pseudocode of the FG-DQN end-to-end workflow
Require: Grid model $G$, operational limits $Ω$, AC power-flow solver $ACPowerFlow (\cdot)$, candidate sets of flexible resources $A = \{MG, ES, QC, BK\}$, contingency order $N - k$ with $k = 3$, contingency sampling weights, training episodes $E$, replay buffer $D$ with prioritized replay enabled, DQN network $Q_{θ}$ and target network $Q_{\bar{θ}}$, exploration schedule $ϵ$, beam parameters $W$ and $S_{max}$, exhaustive threshold $T_{exh}$.
Ensure: Trained parameters $θ$, feasible deployment decisions, evaluation statistics and comprehensive resilience score.
1. Identify vulnerable components using structural features and $N - 1$ security checking.
2. Generate a parallel $N - 3$ contingency set $C$ using weighted multi-factor sampling.
3. Define the feasible domain $F$ by AC power-flow feasibility and operational safety constraints in $Ω$, including voltage limits and thermal ratings.
4. Formulate the problem as an MDP with state $s$, action $a \in A$, and transitions screened by $F$.
5. Initialize $Q_{θ}$, set $Q_{\bar{θ}} \leftarrow Q_{θ}$, and initialize replay buffer $D$.
6. For episode $e = 1$ to $E$ do
7.   Sample a contingency $c \in C$ and reset the environment under $c$.
8.   For each decision step do:
9.     Select an action $a$ using $ϵ$-greedy on $Q_{θ}$ and apply feasibility screening by $F$.
10.     If the action is feasible then execute it, compute reward and metrics, and store the transition in $D$.
11.     If the action is infeasible then reject it and continue to the next step without storing it in $D$.
12.     Update $θ$ using prioritized replay minibatches from $D$ and periodically synchronize $\bar{θ}$ with $θ$.
13.   End for.
14. End for.
15. For BK subset execution, if the number of BK candidates is no greater than $T_{exh}$ then perform exhaustive search, otherwise perform width-limited beam search with width $W$ and deployment scale bounded by $S_{max}$.
16. For each evaluated deployment, run $ACPowerFlow (\cdot)$ for feasibility verification and compute CLSR, Conn, LLR, Over, and the comprehensive resilience score for statistics.
Return $θ$ and evaluation statistics.

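Step 15 of Algorithm 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `select_bk` and the callbacks `score` and `feasible` are hypothetical stand-ins for the composite resilience evaluation and the $ACPowerFlow (\cdot)$ verification, and the empty deployment is assumed to be an admissible baseline.

```python
from itertools import combinations

def select_bk(candidates, score, feasible, width, s_max, t_exh):
    """Choose at most s_max reserve links; returns (set_of_links, score).

    score(subset) and feasible(subset) are hypothetical callbacks that take
    a tuple of candidate links; feasible stands in for AC power-flow checks.
    """
    best, best_val = (), score(())  # assumption: empty deployment is the baseline
    if len(candidates) <= t_exh:
        # Small candidate set: enumerate every subset up to size s_max.
        for k in range(1, s_max + 1):
            for subset in combinations(candidates, k):
                if feasible(subset) and score(subset) > best_val:
                    best, best_val = subset, score(subset)
        return set(best), best_val
    beam = [()]
    for _ in range(s_max):
        expanded = {}
        for subset in beam:
            for c in candidates:
                if c in subset:
                    continue
                new = tuple(sorted(subset + (c,)))
                # Infeasible partial deployments are dropped before ranking.
                if new not in expanded and feasible(new):
                    expanded[new] = score(new)
        if not expanded:
            break
        ranked = sorted(expanded.items(), key=lambda kv: kv[1], reverse=True)
        beam = [s for s, _ in ranked[:width]]  # keep only the top-W subsets
        if ranked[0][1] > best_val:
            best, best_val = ranked[0]
    return set(best), best_val
```

With `width` fixed, each of the `s_max` rounds evaluates at most `width * len(candidates)` subsets, which is the source of the tractability gain over full enumeration.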

AC power-flow and operational feasibility region

The following full AC power-flow formulation is employed, which is a set of nonlinear algebraic equations and thus defines a nonconvex feasibility region. In this paper, it is used as an environment-side feasibility check, where an action is rejected if the AC power-flow does not converge or any operational limits are violated.

θ_{ij} \overset{Δ}{=} θ_{i} - θ_{j}

(1)

P_{i} = V_{i} \sum_{j \in N} V_{j} (G_{ij} \cos θ_{ij} + B_{ij} \sin θ_{ij})

(2)

Q_{i} = V_{i} \sum_{j \in N} V_{j} (G_{ij} \sin θ_{ij} - B_{ij} \cos θ_{ij})

(3)

Y_{ij} = G_{ij} + j B_{ij}

(4)

S_{ij} = P_{ij} + j Q_{ij}

(5)

\left\{ \begin{matrix} {\underline{V}}_{i} \leq V_{i} \leq {\bar{V}}_{i} \\ | S_{ij} | \leq {\bar{S}}_{ij} \\ {\underline{P}}_{g}^{G} \leq P_{g}^{G} \leq {\bar{P}}_{g}^{G} \\ {\underline{Q}}_{g}^{G} \leq Q_{g}^{G} \leq {\bar{Q}}_{g}^{G} \end{matrix} \right.

(6)

where $N$ is the set of buses and $i, j \in N$ are bus indices. The symbol $\overset{Δ}{=}$ denotes 'is defined as'. $V_{i}$ is the voltage magnitude at bus $i$ (p.u.), $θ_{i}$ is the phase angle at bus $i$, $Y_{ij}$ is the $(i, j)$ element of the admittance matrix with real part $G_{ij}$ and imaginary part $B_{ij}$, $P_{i}$ and $Q_{i}$ denote the net active and reactive power injections at bus $i$, $S_{ij}$ is the complex power on branch $i \to j$ with active and reactive components $P_{ij}$ and $Q_{ij}$, and $| S_{ij} |$ is its apparent power. ${\underline{V}}_{i}$ and ${\bar{V}}_{i}$ are the lower and upper voltage limits, ${\bar{S}}_{ij}$ is the MVA limit of branch $i \to j$, $G$ is the set of generators, with $P_{g}^{G}$ and $Q_{g}^{G}$ the active and reactive outputs of unit $g$, and ${\underline{P}}_{g}^{G}$, ${\bar{P}}_{g}^{G}$, ${\underline{Q}}_{g}^{G}$, ${\bar{Q}}_{g}^{G}$ their capability limits. The feasible region $F$ is the set of states that satisfy AC-power-flow convergence and all operational limits. If an action yields a state $(V, θ, P, Q) \notin F$, the action is rejected so that training samples remain aligned with the evaluation criteria. Equation (1) defines the voltage-angle difference between buses $i$ and $j$. Equations (2) and (3) are the active and reactive injections at bus $i$. Equation (4) decomposes the admittance element $Y_{ij}$ into its conductance $G_{ij}$ and susceptance $B_{ij}$, and equation (5) expresses the complex branch power flow in terms of its active and reactive components. Equation (6) collects the operational safety constraints.
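As an illustration of how equations (1)–(3) and the limits in equation (6) might be evaluated inside an environment-side check, the following Python sketch computes bus injections from a given voltage profile and tests voltage and branch limits. The function names and the list-based data layout are assumptions made for this example; a real environment would obtain $V$ and $θ$ from a converged AC power-flow solve.

```python
import math

def bus_injections(V, theta, G, B):
    """Return (P, Q) per bus using equations (2) and (3).

    V, theta: lists of voltage magnitudes (p.u.) and angles (rad).
    G, B: real and imaginary parts of the admittance matrix (nested lists).
    """
    n = len(V)
    P, Q = [0.0] * n, [0.0] * n
    for i in range(n):
        for j in range(n):
            t = theta[i] - theta[j]  # equation (1)
            P[i] += V[i] * V[j] * (G[i][j] * math.cos(t) + B[i][j] * math.sin(t))
            Q[i] += V[i] * V[j] * (G[i][j] * math.sin(t) - B[i][j] * math.cos(t))
    return P, Q

def within_limits(V, v_min, v_max, s_mag, s_max):
    """Check the voltage and branch apparent-power limits of equation (6)."""
    volts_ok = all(lo <= v <= hi for v, lo, hi in zip(V, v_min, v_max))
    flows_ok = all(s <= cap for s, cap in zip(s_mag, s_max))
    return volts_ok and flows_ok

# Two-bus lossless line with reactance 0.1 p.u. (illustrative numbers only).
G = [[0.0, 0.0], [0.0, 0.0]]
B = [[-10.0, 10.0], [10.0, -10.0]]
P, Q = bus_injections([1.0, 1.0], [0.1, 0.0], G, B)
```

In the two-bus example the active injections sum to zero because the line is lossless, which is a quick sanity check on the sign conventions.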

Flexible resources and cost-budget model

Flexible resources comprise four categories: mobile generators, energy storage systems, static reactive power compensation devices, and reserve interconnections. The cumulative investment satisfies:

C_{MG} + C_{ES} + C_{QC} + C_{BK} \leq B

(7)

C_{MG} = \sum_{u \in U_{MG}} (c_{MG}^{unit} \cdot p_{u} + c_{MG}^{inst})

(8)

C_{ES} = \sum_{e \in U_{ES}} (c_{batt} \cdot E_{e} + c_{inv} \cdot P_{e} + c_{ES}^{inst}), E_{e} = P_{e} τ_{e}

(9)

C_{QC} = \sum_{q \in U_{QC}} c_{QC}^{unit} \cdot Q_{q}

(10)

C_{BK} = c_{BK} \cdot | U_{BK} |, | U_{BK} | \leq K_{max}

(11)

where $B$ denotes the total budget. $C_{MG}$ , $C_{ES}$ , $C_{QC}$ , $C_{BK}$ are the investment costs for mobile generators (MG), energy storage (ES), reactive compensation (QC), and reserve interconnections (BK), respectively. $U_{MG}$ , $U_{ES}$ , $U_{QC}$ , and $U_{BK}$ denote the instance sets of the four resource categories, respectively. For each $u \in U_{MG}$ , $p_{u}$ is the unit quota of a mobile generator deployed, $c_{MG}^{unit}$ is the per-unit purchase/capacity cost, and $c_{MG}^{inst}$ is the per-site installation/commissioning cost. For each $e \in U_{ES}$ , $P_{e}$ is the rated charge/discharge power (MW), $τ_{e}$ is the equivalent duration (h), $E_{e}$ is the equivalent energy (MWh), $c_{batt}$ is the specific energy cost (per MWh), $c_{inv}$ is the inverter specific power cost (per MW), and $c_{ES}^{inst}$ is the per-site installation cost. For each $q \in U_{QC}$ , $Q_{q}$ is the installed reactive capacity (MVAr), and $c_{QC}^{unit}$ is the unit cost per MVAr. For reserve interconnections, $c_{BK}$ is the per-link installation cost, $| U_{BK} |$ is the number of selected links, and the cardinality is bounded by $K_{max}$ . |·| denotes set cardinality.

Equations (7)–(11) formulate the investment costs and budget constraints for the four resource types. Equation (7) enforces that the total cost of mobile generators, energy storage units, reactive compensators, and reserve-interconnection lines does not exceed the budget $B$. Equations (8)–(10) compute the deployment costs of mobile generators, storage units, and reactive compensators, respectively, while equation (11) gives the cost and cardinality limit for the set of reserve-interconnection lines. All cost terms are expressed in the same monetary units and remain consistent across training and evaluation, and the same conversion factors are used when mapping costs to reward penalties.
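The cost model in equations (7)–(11) can be sketched in a few lines of Python. The price dictionary keys and all numeric values below are hypothetical placeholders, not figures from the paper; only the structure of the sums follows the equations.

```python
def total_cost(mg_units, es_units, qc_mvar, n_bk, prices):
    """Sum the four cost terms of equations (8)-(11).

    mg_units: list of unit quotas p_u, one entry per mobile-generator site.
    es_units: list of (P_e, tau_e) pairs, with E_e = P_e * tau_e.
    qc_mvar:  list of installed reactive capacities Q_q.
    n_bk:     number of selected reserve interconnections.
    prices:   hypothetical price table (keys are assumptions of this sketch).
    """
    c_mg = sum(prices["mg_unit"] * p + prices["mg_inst"] for p in mg_units)
    c_es = sum(prices["batt"] * (p * tau) + prices["inv"] * p + prices["es_inst"]
               for p, tau in es_units)
    c_qc = sum(prices["qc_unit"] * q for q in qc_mvar)
    c_bk = prices["bk"] * n_bk
    return c_mg + c_es + c_qc + c_bk

def within_budget(cost, budget, n_bk, k_max):
    """Equation (7) together with the cardinality bound of equation (11)."""
    return cost <= budget and n_bk <= k_max

prices = {"mg_unit": 100, "mg_inst": 10, "batt": 50, "inv": 20,
          "es_inst": 5, "qc_unit": 2, "bk": 30}
cost = total_cost([2, 1], [(1.0, 2.0)], [10], 2, prices)
```

Here the example deployment costs 320 for mobile generators, 125 for storage, 20 for reactive compensation, and 60 for two reserve links.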

Resilience metrics and composite scoring

Under a unified evaluation protocol, four resilience metrics and a composite score are defined as follows:

CLSR = \frac{| \{ ℓ \in L_{k} : ℓ_{cwl} \} |}{| L_{k} |}

(12)

Conn = Γ

(13)

LLR = \frac{P_{loss}}{P_{tot}}

(14)

Over = \frac{| L_{over} |}{| L |}

(15)

\begin{matrix} R = ω_{1} \cdot CLSR + ω_{2} \cdot Conn + ω_{3} \cdot (1 - LLR) \\ + ω_{4} \cdot (1 - Over) \end{matrix}

(16)

where $CLSR$ , $Conn$ , $LLR$ , $Over$ are critical-line survival rate, system connectivity, load loss rate, and overload ratio, respectively. $ℓ$ represents a certain path, $ℓ_{cwl}$ represents a connected path that does not exceed the limit. $L_{k}$ denotes the critical-path set, $L$ denotes the full path set. The notation |·| denotes the cardinality of a set. $P_{tot}$ is the total system demand, $P_{loss}$ is the unserved load after power-flow redispatch. $L_{over}$ is the set of overloaded paths, $Γ \in [0, 1]$ is the connectivity coverage. $R$ denotes the composite resilience score. $ω_{1}$ , $ω_{2}$ , $ω_{3}$ , $ω_{4}$ are the weight parameters of the four resilience metrics, respectively. Equations (12)–(15) define four normalized resilience metrics: the critical-line survival rate, system connectivity, load-loss ratio, and overload ratio. Equation (16) aggregates these metrics into the composite resilience score $R$ through a weighted sum, so that higher values indicate better overall resilience performance.
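A short Python sketch shows how equations (12)–(16) combine. The equal weights used as a default are an assumption made for illustration, since the weight values are not stated in this section, and the function names are hypothetical.

```python
def resilience_metrics(n_crit_ok, n_crit, gamma, p_loss, p_tot, n_over, n_lines):
    """Return (CLSR, Conn, LLR, Over) following equations (12)-(15)."""
    clsr = n_crit_ok / n_crit   # surviving critical lines over all critical lines
    conn = gamma                # connectivity coverage in [0, 1]
    llr = p_loss / p_tot        # unserved load over total demand
    over = n_over / n_lines     # overloaded lines over all lines
    return clsr, conn, llr, over

def composite_score(clsr, conn, llr, over, weights=(0.25, 0.25, 0.25, 0.25)):
    """Equation (16); the default weights are illustrative, not the paper's."""
    w1, w2, w3, w4 = weights
    return w1 * clsr + w2 * conn + w3 * (1.0 - llr) + w4 * (1.0 - over)

m = resilience_metrics(n_crit_ok=1, n_crit=2, gamma=0.8,
                       p_loss=20.0, p_tot=100.0, n_over=1, n_lines=10)
R = composite_score(*m)
```

Because LLR and Over enter as $(1 - LLR)$ and $(1 - Over)$, every term increases with better performance, and $R$ lies in $[0, 1]$ whenever the weights are nonnegative and sum to one.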

MDP formulation

Siting, sizing, and switching of the four flexible resources are modeled as a sequential decision process, which serves as the core of the decision layer and defines an MDP $(S, A, P, r)$. The feasible domain is specified by AC power-flow feasibility and operational-safety limits, budget and per-bus deployment caps, and the cross-island rule for reserve interconnections, while the reward is aligned with the composite resilience score. The optimization objective is formulated as follows:

\max_{π} J (π) = E [\sum_{t = 0}^{T} γ^{t} r (s_{t}, a_{t})], \quad a_{t} \sim π (\cdot | s_{t}), \quad γ \in (0, 1)

(17)

\text{s.t. equations (2)–(11)}

(18)

T \leq T_{max}, B_{t} \geq 0, \forall t = 0, \dots, T

(19)

n_{MG} (b) \leq 2, n_{ES} (b) \leq 1, n_{QC} (b) \leq 1, \forall b \in N

(20)

BK (i, j) = \begin{cases} 1, & bus (i) \neq bus (j) \\ 0, & bus (i) = bus (j) \end{cases}

(21)

where $J (π)$ denotes the expected discounted return of policy $π$ . $s_{t} \in S$ and $a_{t} \in A$ are the state and action at time step $t$ . $π (a_{t} | s_{t})$ denotes the action-selection distribution induced by the policy. $r (s_{t}, a_{t})$ is the step-wise reward. $γ$ is the discount factor. $E [\cdot]$ is the expectation over trajectories generated by $π$ and over the sampled contingency scenarios. $T$ is the episode horizon. $T_{max}$ is the maximum number of decision steps per episode. $B_{t}$ is the remaining budget at time $t$ . $n_{MG} (b)$ is the number of mobile generator units deployed at bus $b$ with an upper bound of 2. $n_{ES} (b)$ is the number of energy-storage systems deployed at bus $b$ with an upper bound of 1. $n_{QC} (b)$ is the number of reactive-power compensation units deployed at bus $b$ with an upper bound of 1. $BK (i, j) \in {0, 1}$ is the admissibility indicator for a candidate reserve-interconnection between endpoints $i$ and $j$ . It is topology-dependent and is deterministically evaluated by the cross-island rule via $bus (\cdot)$ , rather than being an optimization decision variable in equations (17) and (18). $bus (\cdot)$ returns the island identifier of a bus under the current topology. Equations (17)–(21) cast the siting, sizing, and switching of flexible resources as a constrained Markov decision process. Resource deployment is represented as sequential decisions over a finite discrete action set $A$ , and the counters $n_{MG} (b)$ , $n_{ES} (b)$ , and $n_{QC} (b)$ record how many units have been deployed at each bus up to time $t$ under their upper bounds. Equation (17) defines the discounted return $J (π)$ . Equation (18) requires that all decisions satisfy the nonlinear AC power-flow and operational-safety constraints in equations (2)–(6), as well as the engineering cost and budget constraints in equations (7)–(11). 
In particular, equation (7) enforces a total-budget inequality, while equation (11) further imposes a cardinality limit on the number of reserve interconnections. Equation (19) bounds the decision horizon and ensures that the remaining budget $B_{t}$ is nonnegative at each step. Equation (20) limits the number of mobile generators, storage units, and reactive compensators that can be placed at each bus. Equation (21) encodes the cross-island rule for reserve-interconnection lines by allowing only lines that connect different islands.
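The action-admissibility rules in equations (19)–(21) lend themselves to a compact check. The sketch below is one possible encoding in Python; the dictionary-based bookkeeping and the function names are assumptions of this example rather than details given in the paper.

```python
# Per-bus deployment caps from equation (20).
CAPS = {"MG": 2, "ES": 1, "QC": 1}

def deploy_allowed(counts, bus, kind, cost, remaining_budget):
    """True if placing one more unit of `kind` at `bus` respects
    the per-bus cap (20) and keeps the remaining budget nonnegative (19).

    counts maps (bus, kind) -> number of units already placed.
    """
    under_cap = counts.get((bus, kind), 0) < CAPS[kind]
    return under_cap and cost <= remaining_budget

def bk_admissible(island_of, i, j):
    """Equation (21): 1 if buses i and j lie in different islands, else 0.

    island_of maps a bus to its island identifier, playing the role of bus(.).
    """
    return 1 if island_of[i] != island_of[j] else 0
```

Evaluating these checks before the AC power-flow solve is a cheap way to discard inadmissible actions early, although the paper does not state the order in which the checks are applied.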

Potential-based composite reward

A normalized, potential-based composite reward is introduced to align policy learning with the composite resilience score.

{\hat{LLR}}_{t} = 1 - LLR_{t}

(22)

{\hat{Over}}_{t} = 1 - Over_{t}

(23)

\begin{matrix} Δ Φ_{t} = ω_{1} (CLSR_{t + 1} - CLSR_{t}) + ω_{2} (Conn_{t + 1} - Conn_{t}) \\ + ω_{3} ({\hat{LLR}}_{t + 1} - {\hat{LLR}}_{t}) + ω_{4} ({\hat{Over}}_{t + 1} - {\hat{Over}}_{t}) \end{matrix}

(24)

\begin{matrix} Ψ_{t} = 5 \cdot \mathbf{1} \{ KBR \} + (10 \cdot \min (1, \frac{Δ Conn}{0.2})) \cdot \mathbf{1} \{ Δ Conn > 0.05 \} \\ + 3 \cdot \mathbf{1} \{ AAP \} - 2 \times 10^{- 6} \cdot Cost_{t} - 50 \cdot \max (0, n_{ES} - 4) \\ - 30 \cdot \max (0, n_{QC} - 6) \end{matrix}

(25)

Δ Conn = Conn_{t + 1} - Conn_{t} - CG_{t}

(26)

r_{t} = Δ Φ_{t} + Ψ_{t}

(27)

(27)

where $CLSR_{t}$ denotes the critical-line survival rate at time $t$, $Conn_{t}$ denotes the system connectivity metric at time $t$, $LLR_{t}$ denotes the load-loss ratio at time $t$, and $Over_{t}$ denotes the overload ratio at time $t$. ${\hat{LLR}}_{t}$ and ${\hat{Over}}_{t}$ are the positive-direction indicators for $LLR_{t}$ and $Over_{t}$. $Δ Φ_{t}$ is the incremental potential composed of weighted metric increments. $ω_{1}$, $ω_{2}$, $ω_{3}$, $ω_{4}$ are the weight parameters of the four resilience metrics, respectively. $Ψ_{t}$ denotes the structural-guidance term that rewards critical-bus restoration and connectivity gains and penalizes action cost and excessive deployments. The indicator function $\mathbf{1} \{\cdot\}$ equals 1 when the condition inside the braces is satisfied and 0 otherwise. $KBR$ denotes the number of restored critical buses. $CG_{t}$ represents the connectivity gain between $t$ and $t + 1$. $AAP$ denotes the number of power sources attached to the attacked island at time $t$. $Cost_{t}$ denotes the immediate engineering cost of the executed action $a_{t}$. $r_{t}$ denotes the instantaneous reward used by FG-DQN. Equations (22) and (23) convert the load-loss ratio and overload ratio into positive-direction forms so that larger values correspond to better performance. Equation (24) defines the incremental potential $Δ Φ_{t}$. Equation (25) introduces the structural guidance term $Ψ_{t}$. Equation (26) defines the connectivity increment $Δ Conn$ used inside the guidance term. Equation (27) combines the incremental potential and structural guidance into the final instantaneous reward $r_{t}$.
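To make the reward in equations (22)–(27) concrete, a minimal Python sketch follows. Two points are assumptions of this example: the indicators on $KBR$ and $AAP$ are treated as boolean flags (true when the corresponding count is positive), and the dictionary layout of the metrics is chosen only for readability.

```python
def shaped_reward(m_t, m_next, weights, kbr, aap, cg, cost, n_es, n_qc):
    """Instantaneous reward r_t = delta_phi + psi, equations (22)-(27).

    m_t, m_next: dicts with keys 'clsr', 'conn', 'llr', 'over'.
    kbr, aap: booleans standing in for the indicators in equation (25)
              (an assumption; the paper defines KBR and AAP as counts).
    """
    w1, w2, w3, w4 = weights
    # Equation (24), with the positive-direction forms (22) and (23) inlined.
    d_phi = (w1 * (m_next["clsr"] - m_t["clsr"])
             + w2 * (m_next["conn"] - m_t["conn"])
             + w3 * ((1 - m_next["llr"]) - (1 - m_t["llr"]))
             + w4 * ((1 - m_next["over"]) - (1 - m_t["over"])))
    d_conn = m_next["conn"] - m_t["conn"] - cg  # equation (26)
    # Equation (25): bonuses first, then cost and over-deployment penalties.
    psi = 5.0 if kbr else 0.0
    if d_conn > 0.05:
        psi += 10.0 * min(1.0, d_conn / 0.2)
    psi += 3.0 if aap else 0.0
    psi -= 2e-6 * cost
    psi -= 50.0 * max(0, n_es - 4)
    psi -= 30.0 * max(0, n_qc - 6)
    return d_phi + psi
```

Since $Δ Φ_{t}$ equals $R_{t + 1} - R_{t}$ for the composite score of equation (16), the undiscounted sum of the potential terms over an episode telescopes to the total change in $R$, while $Ψ_{t}$ contributes the step-wise bonuses and penalties.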

Solution method

High-dimensional decision making under concurrent contingencies and physical constraints is addressed by a solution path comprising feasible-domain preprocessing, reward shaping, policy learning, and execution verification.

Training-execution integrated framework and stabilization mechanisms

An integrated training-execution framework for FG-DQN is established, in which the action-value function $Q_{θ} (s, a)$ is approximated by a DQN. Stability is enforced via feasible-domain prechecks and AC power-flow verification before samples are admitted to the replay buffer. Prioritized experience replay, soft target-network updates, gradient clipping, and learning-rate scheduling are employed.^32–34 The TD target and Huber loss are given below:

y_{t} = r_{t} + γ \max_{a'} Q_{\bar{θ}} (s_{t + 1}, a')

(28)

L (θ) = E_{(s, a, r, s') \sim D} [Huber (y - Q_{θ} (s, a))]

(29)

Huber (z) = \frac{1}{2} z^{2} \, 1_{\{| z | \leq K\}} + K (| z | - \frac{1}{2} K) \, 1_{\{| z | > K\}}, \quad K > 0

(30)

p_{i} = | δ_{i} | + ε_{pr}

(31)

δ_{i} = y_{i} - Q_{θ} (s_{i}, a_{i})

(32)

w_{i} = (N \cdot P (i))^{- β}, \quad P (i) \propto p_{i}^{α}

(33)

where $s_{t}$ denotes the current state at time $t$ . $a_{t}$ denotes the executed action at time $t$ . $s_{t + 1}$ denotes the next state after the environment transition. $a'$ denotes a generic action used inside the maximization. $r_{t}$ denotes the immediate reward. $γ$ denotes the discount factor. $Q_{θ} (s, a)$ denotes the online action-value network with parameters $θ$ . $Q_{\bar{θ}} (s, a)$ denotes the target network with parameters $\bar{θ}$ . $y_{t}$ denotes the temporal-difference target. $L (θ)$ denotes the training objective over mini-batches drawn from the prioritized replay buffer $D$ . For prioritized replay, samples are drawn with probability $P (i)$ , importance weights are $w_{i}$ , and priorities are updated by equation (31) with TD error equation (32). $N$ denotes the buffer size, and $α$ , $β$ , $ε_{pr}$ are hyperparameters. Equation (28) gives the one-step temporal-difference target $y_{t}$ . Equation (29) defines the expected Huber loss. Equation (30) specifies the Huber loss function with threshold $K$ . Equation (31) maps the magnitude of $δ_{i}$ to a sampling priority $p_{i}$ with a small positive offset $ε_{pr}$ . Equation (32) defines the temporal-difference error $δ_{i}$ for each sampled transition. Equation (33) computes the importance-sampling weight $w_{i}$ .
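Equations (28)–(33) can be checked numerically with a small framework-free Python sketch. The default values of $α$, $β$, and $ε_{pr}$ below are placeholders chosen for illustration, and $N$ is taken as the number of transitions passed in, which is a simplification of the buffer size used in equation (33).

```python
def td_target(r, gamma, q_next_target):
    """Equation (28): one-step target using the target network's values."""
    return r + gamma * max(q_next_target)

def huber(z, k=1.0):
    """Equation (30): quadratic near zero, linear beyond threshold k."""
    a = abs(z)
    return 0.5 * z * z if a <= k else k * (a - 0.5 * k)

def per_probs_and_weights(td_errors, alpha=0.6, beta=0.4, eps=1e-3):
    """Equations (31)-(33); alpha, beta, eps defaults are illustrative."""
    priorities = [abs(d) + eps for d in td_errors]         # equation (31)
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    probs = [s / total for s in scaled]                    # P(i) proportional to p_i^alpha
    n = len(td_errors)
    weights = [(n * pr) ** (-beta) for pr in probs]        # equation (33)
    return probs, weights

probs, weights = per_probs_and_weights([0.1, 0.1, 2.0])
```

Transitions with a larger TD error are sampled more often and receive a smaller importance weight, which offsets the sampling bias introduced by prioritization.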

Figure 2 illustrates the MDP-extended three-lane architecture and clarifies the data flow and temporal relations among state generation, action selection, environment feedback, sample ingestion, priority-based sampling, and parameter backpropagation.

Figure 2.

Three-lane framework of the proposed FG-DQN.

At each deployment step, the environment applies the currently selected mobile generators, storage units, reactive compensators, and backup lines to the grid model and performs an AC power-flow feasibility verification based on the formulation in Section 2.2. The updated topology may be split into several islands. For each island that still contains at least one generation or slack bus, an AC power flow is solved and nodal voltages and branch flows are obtained. Islands without any source are treated as fully lost, and their demand is counted as part of the load loss. If any island fails to converge in the AC power flow, or if any voltage or thermal limit in equation (6) is violated, the global post-action state is marked as infeasible and the candidate action is rejected in both training and evaluation. Otherwise, the converged solution is used to compute the resilience indicators, including the critical-line survival rate CLSR, the connectivity index Conn, the load-loss ratio LLR, and the overload ratio Over, and these indicators enter the incremental reward design and the comprehensive resilience score.
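The island-by-island screening described above can be sketched as follows. This is an illustrative Python outline, not the authors' code: `solve_island` is a hypothetical callback that would run the AC power flow and limit checks for one island and return whether it passed.

```python
def find_islands(n_bus, branches):
    """Group buses 0..n_bus-1 into connected components over in-service branches."""
    adj = {b: [] for b in range(n_bus)}
    for i, j in branches:
        adj[i].append(j)
        adj[j].append(i)
    seen, islands = set(), []
    for start in range(n_bus):
        if start in seen:
            continue
        stack, comp = [start], set()
        seen.add(start)
        while stack:  # depth-first traversal of one island
            u = stack.pop()
            comp.add(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        islands.append(comp)
    return islands

def screen_islands(islands, source_buses, demand, solve_island):
    """Return (feasible, lost_load) for the post-action topology.

    Islands without any source bus are counted as fully lost load.
    solve_island(comp) is a hypothetical stand-in for the per-island
    AC power flow plus limit check; one failure rejects the whole state.
    """
    lost = 0.0
    for comp in islands:
        if not comp & source_buses:
            lost += sum(demand[b] for b in comp)
            continue
        if not solve_island(comp):
            return False, lost
    return True, lost

islands = find_islands(5, [(0, 1), (2, 3)])
result = screen_islands(islands, {0}, [1.0, 2.0, 3.0, 4.0, 5.0], lambda comp: True)
```

In the example, buses 2, 3, and 4 are cut off from the only source at bus 0, so their combined demand of 12 is recorded as lost load while the state itself remains feasible.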

To instantiate the three-lane framework into executable steps, the training pseudocode of FG-DQN is formulated as Algorithm 2.

Algorithm 2. FG-DQN Training

Require: Online and target networks $Q$ and $\bar{Q}$; replay buffer $D$ with PER parameters $α$, $β \in [0, 1]$, small constant $ε_{per}$, and minimum sampling probability $p_{min}$; EPISODES $T$; MAX_DEPLOYS_PER_EPISODE $T_{max}$; BATCH_SIZE $N_{batch}$; discount $γ$; initial learning rate $η$ with $OneCycleLR$ scheduler; $ϵ$-greedy schedule; TARGET_UPDATE period $T_{target}$; history window length $K$; TOTAL_BUDGET; reward weights $ω_{1}$, $ω_{2}$, $ω_{3}$, $ω_{4}$ and cost penalty $λ$; training scenario set $Ω_{train}$; feasibility operator $ACPowerFlow(\cdot)$ with AC power-flow and security constraints.
Ensure: Trained parameters $θ$.
1. Initialize $θ$ randomly and set $\bar{θ} \leftarrow θ$.
2. Initialize the prioritized replay buffer $D(B, α, ε_{per}, p_{min})$.
3. For episode $e = 1$ to $T$:
4.   Sample a scenario from $Ω_{train}$ and reset the environment.
5.   Initialize state $s_{0}$ by concatenating the last $K$ steps, set $s \leftarrow s_{0}$.
6.   For step $t = 1$ to $T_{max}$:
7.     With probability $ϵ$ choose a random action $a \in A$, otherwise select $a = \arg\max_{a} Q(s, a)$; obtain $(feasible, s') \in Ω_{powerFlow}(s, \mathrm{TOTAL\_BUDGET})$.
8.     If feasible, compute the reward $r \leftarrow ω_{1} ΔCLSR + ω_{2} ΔConn - ω_{3} ΔLLR - ω_{4} ΔOver - λ Cost(a)$, store the transition in $D$.
9.     Else, assign a small reject penalty $r_{ir} < 0$ and continue to the next step.
10.     Initialize priority $p_{i} \leftarrow \max_{j} p_{j}$ (or $|r| + ε_{per}$) and push $(s, a, r, s')$ into $D$.
11.     Sample $N_{batch}$ transitions from $D$ with $P(i) \propto p_{i}$ and compute importance weights $w_{i} = (N \cdot P(i))^{-β}$.
12.     Take a gradient step on $L$, apply gradient clipping with global norm ≤ 5.0, update priorities $p_{i} \leftarrow |δ_{i}| + ε_{per}$.
13.     If (gradient-step count) mod $T_{target} = 0$, then set $\bar{θ} \leftarrow θ$.
14.     Step the LR scheduler, update $ϵ$ and $β$ according to their schedules.
15.     Set $s \leftarrow s'$.
16.   End for.
17.   Record episode statistics (return, loss, feasibility ratio).
18. End for.
Return $θ$.
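
Two steps of Algorithm 2, the incremental reward in step 8 and the periodic target-network synchronization in step 13, can be sketched in Python. The weights follow Table 2 and the period of 80 gradient steps follows the same table; the values of `lam` and `reject_penalty` are hypothetical placeholders, since $λ$ and $r_{ir}$ are not given numerically, and the parameter dictionaries merely stand in for network weights.

```python
def step_reward(d_clsr, d_conn, d_llr, d_over, cost, feasible,
                w=(8.0, 5.0, 6.0, 1.0), lam=1e-7, reject_penalty=-0.1):
    """Reward of step 8, or the reject penalty of step 9 when the action is infeasible.

    lam and reject_penalty are hypothetical values chosen for illustration only.
    """
    if not feasible:
        return reject_penalty
    return (w[0] * d_clsr + w[1] * d_conn
            - w[2] * d_llr - w[3] * d_over
            - lam * cost)

def maybe_sync(online, target, grad_steps, period=80):
    """Hard copy of online parameters into the target every `period` gradient steps."""
    if grad_steps % period == 0:
        target.clear()
        target.update(online)
        return True
    return False

# Toy usage: CLSR and Conn improve, LLR drops by 0.05, no overload change, zero cost.
r = step_reward(0.1, 0.2, -0.05, 0.0, 0.0, feasible=True)
online_params = {"w": 1.0}
target_params = {"w": 0.0}
synced = maybe_sync(online_params, target_params, grad_steps=80)
```

Because the LLR and Over terms enter with a minus sign, a reduction in either indicator (a negative increment) raises the reward.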

Width-limited optimality search for reserve-interconnection line combinations

A width-limited hybrid exhaustive/beam search is employed at the execution layer to select near-optimal subsets of reserve-interconnection lines while enforcing AC power-flow feasibility and controlling combinatorial growth.³⁵ A scoring function for the subsets is defined as:

$$J(S) = α_{1} \cdot ΔConn + α_{2} \cdot ΔCLSR - α_{3} \cdot ΔLLR - α_{4} \cdot Over - α_{5} \cdot Cost + α_{6} \cdot AIS \tag{34}$$

where $S$ denotes a candidate subset of reserve-interconnection lines selected from the current candidate set, and $α_{1}, \ldots, α_{6}$ are weighting coefficients that align the search with the composite resilience objective. $ΔConn$ and $ΔCLSR$ are the improvements in system connectivity and in the CLSR after applying $S$. $ΔLLR$ is the change in the load-loss ratio after applying $S$, for which lower values are better. $Over$ is the residual overload ratio after applying $S$, $Cost$ is the engineering cost associated with $S$, and $AIS$ is the number of attacked islands that have been reconnected to the power grid. Figure 3 illustrates the layer-wise contraction-expansion search with AC-feasibility pruning and width-limited retention.

Starting from the current candidate set, a contraction step removes dominated or infeasible options. If the remaining size does not exceed $K_{max}$, all feasible subsets are exhaustively evaluated and the best subset is selected. Otherwise, the search proceeds layer by layer: an initial beam is created, each beam element is expanded by adding one candidate line, AC power-flow feasibility is checked for every new subset, infeasible subsets are discarded, and the top $W$ feasible subsets are retained according to the scoring function $J(S)$. The process continues until no improvement is observed or the subset size reaches $K_{max}$; the highest-scoring feasible subset is then returned to the policy layer for execution. This procedure confines complexity to the execution layer and suppresses training variance without enlarging the action space. Here, $W$ is the beam width retained at each expansion layer, and $K_{max}$ is the maximum allowed subset size, which also serves as the threshold below which exhaustive evaluation is used.³⁶ To translate the execution-side flow in Figure 3 into executable steps, the pseudocode for the reserve-interconnection search is formulated as Algorithm 3.

Figure 3.

Width-limited search for reserve-interconnection combinations.

Algorithm 3. Execution-Side Beam Search for Reserve-Interconnection Subset Selection

Require: Candidate set $C$ of reserve interconnections; maximum subset size $K_{max}$; beam width $W$; TOTAL_BUDGET; weights $ω_{1}$, $ω_{2}$, $ω_{3}$, $ω_{4}$ and cost penalty $λ$; state $s$; $ACPowerFlowSubset(s, S, \mathrm{TOTAL\_BUDGET})$ returning $feasible$, $ΔCLSR$, $ΔConn$, $ΔLLR$, $ΔOver$, and $Cost(S)$.
Ensure: Feasible subset $S^{*}$ maximizing the composite resilience score.
1. Let $N_{c} \leftarrow |C|$. Initialize $best_{set} \leftarrow Ø$, $best_{score} \leftarrow -\infty$.
2. If $N_{c} \leq K_{max}$: evaluate all $S \subseteq C$ with $1 \leq |S| \leq K_{max}$. Keep the feasible $S$ with the largest $J(S)$ as $S^{*}$.
3. Else:
4.   $Beam \leftarrow \{Ø\}$.
5.   For $level = 1$ to $K_{max}$:
6.     $Pool \leftarrow Ø$.
7.     For each $B \in Beam$ and $b \in C \setminus B$ do
8.       $S \leftarrow B \cup \{b\}$.
9.       If $ACPowerFlowSubset$ returns feasible, compute $J(S)$ and append $(S, J(S))$ to $Pool$.
10.     End for.
11.     $Beam \leftarrow$ top $W$ subsets in $Pool$ by $J(S)$.
12.     If $Beam$ is empty, break.
13.     If $Beam$ is not empty, set $S^{*} \leftarrow \arg\max_{S \in Beam} J(S)$.
14.   End for.
Return $S^{*}$.
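
A runnable sketch of the width-limited search is given below. It is an illustration under stated assumptions rather than the authors' code: `evaluate` is a hypothetical callback that stands in for the AC power-flow subset check together with the score $J(S)$, the candidate names and gains are invented, and two small deviations from the pseudocode are made deliberately. The sketch keeps the best subset seen across all levels instead of overwriting it at every level, and it skips subsets already placed in the pool at the current level.

```python
from itertools import combinations

def select_subset(candidates, evaluate, k_max=5, width=3):
    """Return (best_subset, best_score) among feasible subsets.

    evaluate(S) -> (feasible, score) is a hypothetical stand-in for the
    AC power-flow subset check plus the scoring function J(S).
    """
    cands = list(candidates)
    best_set, best_score = frozenset(), float("-inf")

    if len(cands) <= k_max:
        # Small candidate set: exhaustive evaluation of all non-empty subsets.
        for k in range(1, len(cands) + 1):
            for combo in combinations(cands, k):
                s = frozenset(combo)
                feasible, score = evaluate(s)
                if feasible and score > best_score:
                    best_set, best_score = s, score
        return best_set, best_score

    beam = [frozenset()]
    for _ in range(k_max):
        pool = {}
        for b in beam:
            for c in cands:
                if c in b:
                    continue
                s = b | {c}
                if s in pool:
                    continue  # already scored at this level
                feasible, score = evaluate(s)
                if feasible:
                    pool[s] = score
        if not pool:
            break
        # Width-limited retention: keep only the top `width` feasible subsets.
        ranked = sorted(pool.items(), key=lambda kv: kv[1], reverse=True)[:width]
        beam = [s for s, _ in ranked]
        if ranked[0][1] > best_score:
            best_set, best_score = ranked[0]
    return best_set, best_score

# Toy problem (hypothetical): additive gains, lines "a" and "b" cannot coexist.
gain = {"a": 3.0, "b": 2.0, "c": 1.0, "d": -1.0}

def toy_eval(s):
    feasible = not {"a", "b"} <= s
    return feasible, sum(gain[x] for x in s)

beam_result = select_subset(gain, toy_eval, k_max=2, width=2)
exhaustive_result = select_subset(gain, toy_eval, k_max=5, width=2)
```

In the toy problem both branches return the subset {"a", "c"}; on real systems the beam branch is a heuristic and need not match the exhaustive optimum.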

Computational complexity and engineering feasibility

By performing width-limited optimization at the execution layer, the exponential growth of structural combinations is reduced to a number of evaluations that grows linearly with the product of the number of candidates and the beam width. Formally, for a candidate set of size $N_{c}$, beam width $W$, and maximum subset size $K_{max}$, the beam search requires at most $K_{max} W N_{c}$ calls to the AC power-flow-based feasibility operator, which corresponds to a time complexity of $O(K_{max} W N_{c} C_{pf})$ per contingency, where $C_{pf}$ denotes the cost of one AC power-flow solve. With the settings $K_{max} = 5$ and $W = 3$ used in the experiments, this bound is $15 N_{c}$ solves per contingency. Each candidate subset is subjected to AC power-flow verification to guarantee physical consistency and engineering interpretability. The stabilization mechanism, which combines prioritized replay and target networks, improves sample efficiency and convergence stability without expanding the action space.
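
The gap between the beam-search bound and full enumeration can be checked with a few lines of Python. The settings $K_{max} = 5$ and $W = 3$ come from Tables 1 and 2; the candidate count of 20 is a hypothetical value chosen only for illustration.

```python
from math import comb

def beam_call_bound(n_c, k_max, width):
    """Upper bound on feasibility calls: at most width * n_c expansions per level, k_max levels."""
    return k_max * width * n_c

def exhaustive_calls(n_c, k_max):
    """Number of non-empty subsets of size at most k_max drawn from n_c candidates."""
    return sum(comb(n_c, k) for k in range(1, k_max + 1))

n_c = 20  # hypothetical number of candidate reserve interconnections
bound = beam_call_bound(n_c, k_max=5, width=3)
full = exhaustive_calls(n_c, k_max=5)
```

For these values the beam search needs at most 300 power-flow solves, whereas enumerating every subset of up to five lines would need 21,699.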

Experiments and results

Experimental platform and unified settings

A unified experimental platform is established on the IEEE-30 and IEEE-57 systems, standardizing N-3 scenario generation, feasible-domain enforcement, FG-DQN-based policy learning, width-limited execution search, and evaluation under consistent resilience metrics. The workflow of parallel N-3 contingency sampling, based on weight fusion and weighted draws from critical buses and vulnerable lines, is illustrated in Figure 4. The corresponding numerical settings are summarized in Table 1.

Figure 4.

Vulnerable component identification and N-3 scenario generation process.

Table 1.

Scenario library, evaluation settings, and planning parameters.

Item	Setting and metric
Fault order	N-3 parallel failures
System	IEEE-30 and IEEE-57
Scenario generation	The training set uses vulnerability-weighted simple random sampling; the test set consists of the fixed scenario set S1–S4
Vulnerability weight sources	Joint evaluation of structural metrics and N-1 static safety verification
Power flow model	AC power flow with safety constraints such as capacity limits, node voltages, and connectivity; the feasible region is predefined on the environment side
Training set $Ω_{train}$	Shared with online interaction and experience replay, used for policy learning and stability assessment
Test set $Ω_{test}$	Fixed N-3 combinations grouped as S1–S4, used for offline evaluations such as heatmaps and box plots
Method comparison	FG-DQN, GA, Vanilla_DQN, Greedy-1
Ablation configurations	Baseline, A1 NoWindow, A2 NoPER, NoRefConv, NoGRU, NoLSTM
Evaluation metrics	CLSR, Conn, LLR, Over, and dimension-weighted composite scores; the radar charts display LLR and Over as $1 - LLR$ and $1 - Over$ to ensure four-axis alignment
$Δ$ definition	Incremental improvement over the unoptimized baseline for the four metrics; higher $Δ$ values are better for CLSR and Conn, while lower $Δ$ values are better for LLR and Over
Random seed	5 seeds; training curves show the cross-seed mean, and box plots show the distribution
$B$	20,000,000
$c_{MG}^{unit}$	500
$c_{MG}^{inst}$	50
$c_{batt}$	300
$c_{inv}$	80
$c_{ES}^{inst}$	50
$c_{QC}^{unit}$	70
$c_{BK}$	100,000
$K_{max}$	5
Feasibility check	After each deployment, AC power-flow verification is performed; the environment rejects invalid actions, which are not recorded as valid experience

Weights are first computed from betweenness centrality and N-1 security margins, and weighted sampling then constructs fixed training and test sets with reproducible seeds. The feasible domain of the environment is jointly defined by AC power-flow convergence and operational-safety limits. Each deployment action is verified through power-flow and limit-violation checks before it is admitted to the replay buffer. Policy learning is performed using a deep Q-network with prioritized experience replay, periodic target-network synchronization, gradient clipping, and learning-rate scheduling. At the execution layer, reserve-interconnection subsets are selected by a width-limited beam search with embedded AC power-flow checks. Evaluation is conducted under unified criteria that include connectivity, CLSR, load-loss ratio, and overload ratio, along with a weighted composite resilience score.
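
The weighted, seed-reproducible construction of N-3 scenarios can be sketched as follows. The fusion rule is an assumption: the text states only that betweenness centrality and N-1 security margins are combined, so the convex combination in `fuse` and its `mix` parameter are hypothetical, as are the component names and scores.

```python
import random

def fuse(betweenness, margin, mix=0.5):
    """Assumed fusion rule: high centrality and low N-1 margin both raise vulnerability."""
    return {k: mix * betweenness[k] + (1.0 - mix) * (1.0 - margin[k])
            for k in betweenness}

def sample_contingency(weights, order=3, rng=None):
    """Draw `order` distinct components, each draw proportional to the remaining weights."""
    if rng is None:
        rng = random.Random()
    pool = dict(weights)
    chosen = []
    for _ in range(order):
        names = list(pool)
        pick = rng.choices(names, weights=[pool[n] for n in names], k=1)[0]
        chosen.append(pick)
        del pool[pick]
    return tuple(sorted(chosen))

def build_scenarios(weights, n, seed):
    """Fixed scenario list: the same seed always yields the same N-3 combinations."""
    rng = random.Random(seed)
    return [sample_contingency(weights, rng=rng) for _ in range(n)]

# Hypothetical normalized scores for five components.
betweenness = {"L1": 0.9, "L2": 0.4, "L3": 0.2, "B7": 0.7, "B9": 0.1}
margin = {"L1": 0.2, "L2": 0.6, "L3": 0.9, "B7": 0.3, "B9": 0.8}
vulnerability = fuse(betweenness, margin)
train_set = build_scenarios(vulnerability, n=5, seed=1)
```

Fixing the seed makes the scenario library identical across runs, which is what allows all methods to be compared on the same contingencies.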

To ensure reproducibility and implementation consistency, unified settings are applied to training and evaluation, including scenario construction and statistical metrics, budget and portfolio configurations, and critical components such as experience replay, target networks, exploration policy, optimizer and learning-rate scheduling, gradient clipping, and history-window length. The specific configurations are listed in Table 2. These settings are kept identical for both test systems and for all comparison methods so that performance differences are attributable to methodological differences rather than unequal hyperparameter tuning effort.

Table 2.

Hyperparameters of training and stabilization.

Hyperparameter	Setting and metrics
Training rounds	3000
$T_{max}$	12
$ω_{1}$	8.0
$ω_{2}$	5.0
$ω_{3}$	6.0
$ω_{4}$	1.0
Huber threshold $K$	1.0
$α$	0.6
$ε_{pr}$	$10^{- 6}$
$W_{beam}$	3
$n_{MG}^{max}$	2
$n_{ES}^{max}$	1
$n_{QC}^{max}$	1
Target-network update period	Every 80 gradient steps
Number of simultaneous attacks	3
Experience replay capacity $N_{replay}$	60,000
Batch size $N_{batch}$	128
$γ$	0.99
Learning rate $LR$ and scheduling	$1.2 \times 10^{- 3}$ , One Cycle LR
Optimizer	Adam
Loss function	Smooth L1 loss
Target network synchronization	Synchronize every 80 iterations
$ϵ$-greedy exploration	1.0 → 0.2 → 0.05 (piecewise-linear decrease)
PER importance parameter	$0.6$
PER correction factor	$0.4 \to 1.0$, increased linearly over training
PER stabilization term	$ϵ = 1 \times 10^{-3}$ , minimum sampling probability $1 \times 10^{-12}$
Gradient clipping norm	5.0
History window length K	Training default 6, scan range 1–6, optimal 4
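
The exploration and PER-correction schedules listed in Table 2 can be expressed with one piecewise-linear helper. The endpoint values come from the table; the breakpoints at episodes 1000 and 3000 are assumptions, because the table does not state where the segments change.

```python
def piecewise_linear(step, points):
    """Linear interpolation through sorted (step, value) points, clamped at both ends."""
    if step <= points[0][0]:
        return points[0][1]
    for (s0, v0), (s1, v1) in zip(points, points[1:]):
        if step <= s1:
            return v0 + (v1 - v0) * (step - s0) / (s1 - s0)
    return points[-1][1]

# Breakpoint positions are hypothetical; only the values 1.0, 0.2, 0.05 and 0.4, 1.0 are from Table 2.
EPS_POINTS = [(0, 1.0), (1000, 0.2), (3000, 0.05)]
BETA_POINTS = [(0, 0.4), (3000, 1.0)]

eps_mid = piecewise_linear(500, EPS_POINTS)
beta_mid = piecewise_linear(1500, BETA_POINTS)
```

Early episodes therefore explore heavily with weak importance correction, while late episodes are nearly greedy with full correction.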

To enhance the transparency and reproducibility of the implementations used in the comparative study reported in Section 4.2, the core workflows of GA, Greedy-1, and Vanilla DQN are documented in Appendix 1 as Algorithms A1–A3. In addition, convergence evidence for GA, Greedy-1, and Vanilla DQN is provided in Appendix 2 as Figures B1–B3. All three methods are evaluated under an identical environment model, contingency sets, budget rules, and metric computation pipeline, and they differ only in the decision rule used to select deployment actions and reserve-interconnection choices.

Operational metrics and method comparison

Post-operation stability

Learning and convergence behaviors are first examined for the proposed method alone, across training iterations, to assess stability and scalability under different network topologies.

Figure 5 shows that the episode reward increases steadily with training iterations and then enters a narrow fluctuation band in the later stage. The peak-to-trough range remains controlled and no long-term drift is observed. Preselection of the feasible domain and empirical filtering in prioritized experience replay reduce the proportion of invalid samples, leading to progressively stabilized value estimation. Exploration and exploitation remain balanced, and the curve trajectory is consistent with the operational-feasibility requirement. The convergence pattern is reproducible and numerically interpretable.

Figure 5.

Training reward curve on the IEEE-30 system.

Figure 6 shows sustained improvement on the larger network, with a longer convergence phase but fluctuations that remain within a controllable range. Width-limited optimization at the execution layer suppresses noise from combinatorial explosion, and periodic synchronization of the target network maintains gradient stability in the high-dimensional action space. Relative to Figure 5, the larger scale mainly produces a moderate increase in convergence time and jitter amplitude, while the overall stable convergence trend remains intact.

Figure 6.

Training reward curve on the IEEE-57 system.

Comparison with other methods

After the behavior of the individual method has been characterized, comparative analysis is carried out along two dimensions: (i) system-level overall performance and (ii) the incremental performance distribution across methods.

In Figure 7(a) and (b), the radar charts plot the LLR and Over axes as $1 - LLR$ and $1 - Over$, respectively, so that all four axes share the same semantics and higher values indicate better resilience. In Figure 7(a), for the IEEE-30 system, FG-DQN attains the largest overall radar area. It clearly improves connectivity and reduces load loss, while keeping the CLSR and overload ratio competitive with the strongest comparison methods. Some comparison methods achieve slightly higher values on a single metric, which indicates an inherent trade-off among the four indicators under the common budget and feasibility constraints. FG-DQN is designed to optimize a composite resilience score with embedded AC feasibility checks rather than to maximize each indicator in isolation, so it yields a more balanced polygon instead of strict dominance on every axis. In Figure 7(b), for the IEEE-57 system, the improvements of FG-DQN become more uniform. The method dominates or closely matches the other algorithms on all four axes. The radar polygon of FG-DQN consistently covers a larger region than those of the other methods, indicating that the proposed framework maintains its multi-metric advantages as network scale and topological complexity increase.
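
The notion of radar area used in this comparison can be made concrete with a short sketch. It assumes that the two cost-type axes are inverted as complements ($1 - LLR$ and $1 - Over$) and that the axes are equally spaced; the metric values are hypothetical and are not results from the paper.

```python
import math

def radar_axes(clsr, conn, llr, over):
    """Align all four axes so that higher is better (complement form is an assumption)."""
    return [clsr, conn, 1.0 - llr, 1.0 - over]

def radar_area(radii):
    """Area of a radar polygon with equally spaced axes: sum of adjacent triangle areas."""
    n = len(radii)
    wedge = 0.5 * math.sin(2.0 * math.pi / n)
    return wedge * sum(radii[i] * radii[(i + 1) % n] for i in range(n))

# Hypothetical metric values for one method.
axes = radar_axes(clsr=0.9, conn=0.8, llr=0.1, over=0.2)
area = radar_area(axes)
```

The area depends on the order of the axes, so it is a visual summary of balance across indicators and not a substitute for the composite resilience score.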

Figure 7.

Multi-method synthesis radar charts for the four resilience metrics: (a) IEEE-30 system and (b) IEEE-57 system.

Figure 8(a) and (b) show that both test systems exhibit a consistent improvement trend in the four resilience metrics, with the methods placed on the horizontal axis. FG-DQN produces larger positive gains in CLSR and connectivity and, at the same time, achieves stronger reductions in load-loss ratio and overload ratio. This concurrent favorable movement of all four indicators indicates that structural restoration and supply recovery can progress simultaneously. When the system scale is increased from IEEE-30 to IEEE-57, the advantage is preserved in the more complex topology. Under highly fragmented conditions, however, the decreases in load-loss ratio and overload ratio begin to converge, implying that the performance ceiling is mainly limited by connectivity boundaries, the coverage of candidate interconnections, and budget relaxation, rather than by learning instability. FG-DQN therefore delivers consistent directional improvements with superior magnitude across different system sizes and scenario sets.

Figure 8.

Heatmap of Δ for the four resilience metrics: (a) IEEE-30 system and (b) IEEE-57 system.

Ablation studies and hyperparameter validation

Ablation studies

To identify the causal contributions of the key mechanisms, the temporal sliding window and the prioritized experience replay module are removed separately. The resulting changes in scenario-wise increments, training trajectories, and distribution statistics are then examined in a coupled manner.

Figure 9 shows that the baseline configuration attains positive gains in most scenarios. Removing either the temporal sliding window or the prioritized experience replay (PER) module markedly weakens these gains, and several scenario-action combinations even turn negative. The regions with the most pronounced degradation coincide with the high-impact scenarios, which indicates that the temporal window captures outlier channels and short-term dynamics. At the same time, PER increases the sampling weight and estimation efficiency of critical experiences. Both components are therefore necessary to sustain stable improvements in high-impact settings.

Figure 9.

Ablation experiment: Δ-resilience heatmap.

Figures 10 and 11 show that, under the complete mechanism, the training loss decreases rapidly and then oscillates within a narrow and stable band, corresponding to a steady increase in episode return and curve convergence across different random seeds. Removing either mechanism enlarges the loss oscillations, lowers the mean return, and increases the variance. The learning trajectory becomes more sensitive to the synchronization between the experience distribution and the optimization objective, and the curve-level degradation is consistent with the degradation observed at the scenario level.

Figure 10.

Training loss curve (average of five seeds).

Figure 11.

Training return curve (average of five seeds).

Figure 12(a) and (b) show that the complete mechanism produces distributions with higher medians, tighter interquartile ranges, and shorter lower tails. After either mechanism is removed, the rightward shift becomes weaker and pronounced long tails appear. The agreement between the distributional statistics and the training curves indicates that the key mechanisms accelerate convergence while improving robustness across heterogeneous scenarios; the observed gains are not attributable to a small number of favorable cases.

Figure 12.

Box plots of: (a) composite resilience on the test set and (b) returns (average over the last 100 episodes).

K-value scan

The length of the historical window affects both the coverage of temporal information and the variance of value estimation, so a balance must be maintained between perceptual depth and redundant noise.

Figure 13 shows that performance first increases and then decreases as $K$ grows, with the best and most stable results near $K = 4$. A window that is too small cannot capture the short-term memory of threshold-crossing propagation, whereas an excessively long window introduces redundancy and raises the variance of value estimation. Both test systems exhibit the same trend, and the optimal window length is positively correlated with network scale, providing actionable guidance for multi-scale deployments.
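
The role of the history window can be illustrated with a minimal sketch of how the last $K$ observation frames might be concatenated into one state vector. The class name, the zero padding at the start of an episode, and the two-feature frames are assumptions made for illustration; the paper states only that the state is formed by concatenating the last $K$ steps.

```python
from collections import deque

class HistoryWindow:
    """Keeps the most recent k frames and exposes them as one flat state vector."""

    def __init__(self, k, feature_dim):
        self.k = k
        # Zero padding before any observation arrives (assumed convention).
        self.frames = deque([[0.0] * feature_dim for _ in range(k)], maxlen=k)

    def push(self, frame):
        """Append the newest frame; the oldest one is dropped automatically."""
        self.frames.append(list(frame))

    def state(self):
        """Concatenate frames from oldest to newest."""
        return [x for frame in self.frames for x in frame]

window = HistoryWindow(k=3, feature_dim=2)
window.push([1.0, 2.0])
first_state = window.state()
for frame in ([3.0, 4.0], [5.0, 6.0], [7.0, 8.0]):
    window.push(frame)
later_state = window.state()
```

The state dimension grows linearly with $K$, which is one way to see why a longer window adds input redundancy along with temporal context.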

Figure 13.

K-scan for the baseline, where K = 4 is optimal and most stable.

Conclusions

This study developed an FG-DQN-based framework for the collaborative allocation of mobile generators, energy storage, reactive compensation, and reserve interconnections to enhance power system resilience under extreme events while balancing physical feasibility and computational tractability. AC power-flow feasibility and operational-safety limits were embedded into the environment, and a normalized incremental-potential reward aligned with a composite resilience score was designed. In addition, a width-limited beam search with embedded AC power-flow checks was introduced for reserve-interconnection selection, which helped to preserve physical consistency and control the computational burden. Simulation results on the IEEE-30 and IEEE-57 test systems showed that FG-DQN increased network connectivity and the CLSR, reduced load-loss and overload ratios, and raised the composite resilience score while lowering variance.

One limitation of this study is that the simulations are based on IEEE standard test systems and do not incorporate real operating data. Future extensions may integrate time-synchronized wide-area measurements from existing PMU-based monitoring infrastructures as a source of actual grid operating data. Such data, collected under realistic channel-capacity and communication constraints, can drive a continuously updated scenario pipeline and enable the evaluation of cross-condition generalization under real operating conditions. Another limitation is that feasible-region thresholds and reward weights are manually specified, which constrains transferability across systems. Future research can explore adaptive or data-driven schemes to tune these parameters under safety constraints. The tuning can be driven by state-estimation outputs from time-synchronized wide-area measurements, which is expected to improve generalization across different grids and operating conditions.

Footnotes

Appendix 1

Algorithm A1 presents the GA for portfolio deployment within a unified evaluation framework.

Algorithm A2 outlines the Greedy-1 with one-step look-ahead selection of deployment actions.

Algorithm A3 outlines the Vanilla DQN trained and evaluated under the same setting as FG-DQN but without feasibility-guided screening or beam search.

Appendix 2

Figure B1 presents the generation-wise objective evolution of the GA within the unified evaluation framework.

Figure B2 presents the step-wise indicator evolution of Greedy-1 within the unified evaluation framework.

Figure B3 presents the episode-wise reward trajectories of Vanilla DQN within the unified evaluation framework.

ORCID iD

Aoyu Lei

Ethical considerations

This article does not contain any studies with human or animal participants.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Science and Technology Project of China Southern Power Grid Co., Ltd. [grant number 000005KC24010023].

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data availability statement

Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.

References

Hines

Apt

Talukdar

. Large blackouts in North America: historical trends and policy implications. Energy Policy 2009; 37(12): 5249–5259. https://doi.org/10.1016/j.enpol.2009.07.049

Killenberger

Zielonka

Sasse

J-P

, et al. Weather resilience of the future Swiss electricity system with very high shares of variable renewable energy sources. Environ Res Energy 2025; 2(1): 015003. https://doi.org/10.1088/2753-3751/ada77c

Stanković

Tomsovic

De Caro

, et al. Methods for analysis and quantification of power system resilience. IEEE Trans Power Syst 2023; 38(5): 4774–4787. https://doi.org/10.1109/TPWRS.2022.3212688

Panteli

Trakas

Mancarella

, et al. Power systems resilience assessment: hardening and smart operational enhancement strategies. Proc IEEE 2017; 105(7): 1202–1213. https://doi.org/10.1109/JPROC.2017.2691357

Espinoza

Panteli

Mancarella

, et al. Multi-phase assessment and adaptation of power systems resilience to natural hazards. Elect Power Syst Res 2016; 136: 352–361. https://doi.org/10.1016/j.epsr.2016.03.019

Raoufi

Vahidinasab

Mehran

. Power systems resilience metrics: a comprehensive review of challenges and outlook. Sustainability 2020; 12(22): 9698. https://doi.org/10.3390/su12229698

Zhao

Wang

. Learning sequential distribution system restoration via graph-reinforcement learning. IEEE Trans Power Syst 2022; 37(2): 1601–1611. https://doi.org/10.1109/TPWRS.2021.3102870

Bedoya

Wang

Liu

C-C

. Distribution system resilience under asynchronous information using deep reinforcement learning. IEEE Trans Power Syst 2021; 36(5): 4235–4245. https://doi.org/10.1109/TPWRS.2021.3056543

Zhang

Eseye

Knueven

, et al. Curriculum-based reinforcement learning for distribution system critical load restoration. IEEE Trans Power Syst 2023; 38(5): 4418–4431. https://doi.org/10.1109/TPWRS.2022.3209919

10.

Cao

Zhao

, et al. Deep reinforcement learning enabled physical-model-free two-timescale voltage control method for active distribution systems. IEEE Trans Smart Grid 2022; 13(1): 149–165. https://doi.org/10.1109/TSG.2021.3113085

11.

Xie

Tang

Zhu

, et al. Robustness assessment and enhancement of deep reinforcement learning-enabled load restoration for distribution systems. Reliab Eng Syst Saf 2023; 237: 109340. https://doi.org/10.1016/j.ress.2023.109340

12.

Butt

Huda

Amin

. Design of fault-tolerant control system for distributed energy resources based power network using phasor measurement units. Meas Control 2023; 56(1–2): 269–286. https://doi.org/10.1177/00202940221122185

13.

Zimmerman

Murillo-Sanchez

Thomas

. MATPOWER: steady-state operations, planning, and analysis tools for power systems research and education. IEEE Trans Power Syst 2011; 26(1): 12–19. https://doi.org/10.1109/TPWRS.2010.2051168

14.

Theodorakatos

Babu

Lytras

. Avoiding the Maratos effect in non-convex optimization through piecewise convexity: a case study in optimal PMU placement problem. Algorithms 2025; 19(1): 11. https://doi.org/10.3390/a19010011

15.

Carvajal

Carrión

Jaramillo

. Planning scheme for optimal PMU location considering power system expansion. Energies 2025; 18(13): 3283. https://doi.org/10.3390/en18133283

16.

Theodorakatos

Babu

Theodoridis

, et al. Mathematical models for the single-channel and multi-channel PMU allocation problem and their solution algorithms. Algorithms 2024; 17(5): 191. https://doi.org/10.3390/a17050191

17.

Manousakis

Korres

. Optimal PMU arrangement considering limited channel capacity and transformer tap settings. IET Gener Transm Distrib 2020; 14(24): 5984–5991. https://doi.org/10.1049/iet-gtd.2019.1951

18.

Alexopoulos

Korres

Manousakis

. Complementarity reformulations for false data injection attacks on PMU-only state estimation. Elect Power Syst Res 2020; 189: 106796. https://doi.org/10.1016/j.epsr.2020.106796

19.

Shi

Jiang

, et al. Comprehensive power quality evaluation method of microgrid with dynamic weighting based on CRITIC. Meas Control 2021; 54(5–6): 1097–1104. https://doi.org/10.1177/00202940211016092

20.

Vijay

. Power distribution system resilience: a perspective of the power system operator. Sustain Energy Grid Netw 2025; 44: 101950. https://doi.org/10.1016/j.segan.2025.101950

21.

Zhang

, et al. Resilient dispatching optimization of power system driven by deep reinforcement learning model. Discov Artif Intell 2025; 5: 189. https://doi.org/10.1007/s44163-025-00451-1

22.

Yang

Zhao

, et al. Reinforcement learning in sustainable energy and electric systems: a survey. Annu Rev Control 2020; 49: 145–163. https://doi.org/10.1016/j.arcontrol.2020.03.001

23.

Hedman

O’Neill

Fisher

, et al. Optimal transmission switching with contingency analysis. IEEE Trans Power Syst 2009; 24(3): 1577–1586. https://doi.org/10.1109/TPWRS.2009.2020530

24.

Aziz

Lin

Waseem

, et al. Review on optimization methodologies in transmission network reconfiguration of power systems for grid resilience. Int Trans Electr Energy Syst 2021; 31: e12704. https://doi.org/10.1002/2050-7038.12704

25. Zhou, Mao, Jiang. Backstepping-based fault-tolerant control for strict-feedback nonlinear multi-agent systems: an encoding–decoding scheme. Automatica 2026; 185: 112800. https://doi.org/10.1016/j.automatica.2025.112800

26. Chen, Ren, et al. Field data–driven online prediction model for icing load on power transmission lines. Meas Control 2020; 53(1–2): 126–140. https://doi.org/10.1177/0020294019878872

27. Gholizadeh, Musilek. Explainable reinforcement learning for distribution network reconfiguration. Energy Rep 2024; 11: 5703–5715. https://doi.org/10.1016/j.egyr.2024.05.031

28. Tian, Dong, Gong, et al. Line hardening strategies for resilient power systems considering cyber-topology interdependence. Reliab Eng Syst Saf 2024; 241: 109644. https://doi.org/10.1016/j.ress.2023.109644

29. Ghanbari, Jiang. A comprehensive review on power system resilience: definition, assessment, and enhancement strategies. Int J Electr Power Energy Syst 2025; 172: 111149. https://doi.org/10.1016/j.ijepes.2025.111149

30. Lin, Wang, Yue. Equity-driven distribution power system planning for resilience enhancement. Elect Power Syst Res 2025; 241: 111197. https://doi.org/10.1016/j.epsr.2024.111197

31. Pang, Liu, Zhang, et al. Leveraging electric vehicles for enhancing power system resilience: a review of strategies and challenges. Curr Sustain Renewable Energy Rep 2025; 12: 11. https://doi.org/10.1007/s40518-025-00259-8

32. Nematshahi, Shi, Wang, et al. Deep reinforcement learning based voltage control revisited. IET Gener Transm Distrib 2023; 17: 4826–4835. https://doi.org/10.1049/gtd2.13001

33. Jacob, Paul, Chowdhury, et al. Real-time outage management in active distribution networks using reinforcement learning over graphs. Nat Commun 2024; 15: 4766. https://doi.org/10.1038/s41467-024-49207-y

34. Gallego, Martín, Díaz, et al. Maintaining flexibility in smart grid consumption through deep learning and deep reinforcement learning. Energy AI 2023; 13: 100241. https://doi.org/10.1016/j.egyai.2023.100241

35. Raoufi, Vahidinasab. Power system resilience assessment considering critical infrastructure resilience approaches and government policymaker criteria. IET Gener Transm Distrib 2021; 15(20): 2819–2834. https://doi.org/10.1049/gtd2.12218

36. Bie, Lin, et al. Battling the extreme: a study on the power system resilience. Proc IEEE 2017; 105(7): 1253–1266. https://doi.org/10.1109/JPROC.2017.2679040
