A resilient network recovery framework against cascading failures with deep graph learning

Abstract

Because of the increasing importance and dependencies of infrastructure networks and the potential for massive cascading failures in real-world network systems, maintenance optimization to effectively reduce system performance loss caused by diverse disruptions is of significant interest among researchers and practitioners. In this work, a new recovery framework was developed to rapidly identify important system components for maintenance to improve network resilience against cascading failures. This work determines an optimal maintenance priority by combining real-time network structure importance with other maintenance prioritization based on customer preference. The approach adopts structural graph embedding and deep reinforcement learning to extract real-time network topology information (such as the minimum vertex cover) and update the maintenance priority during the recovery process. In case studies on synthetic networks and a US airport network, the proposed recovery framework with real-time network topology awareness shows better resilience enhancement than other maintenance prioritization strategies. This work improves the understanding of how the changing network structure influences maintenance effects. It also provides insights into the practical usefulness of advanced deep learning in supporting optimal maintenance prioritization to effectively reduce the intensity and extent of cascading failures.

Keywords

Cascading failures, deep graph learning, network resilience, maintenance optimization, system simulation

Introduction

Real-world network systems with increasing scale and complexity, for example, infrastructure networks (such as power grids and transportation networks), are playing increasingly important roles in improving the efficiency of industry and the quality of modern life. Despite the promising benefits brought by diverse network systems, new challenges also emerge with their applications, such as reliability, safety, and maintenance.¹ For instance, a cyber-physical system² is a network of networks with new characteristics that require a fundamental understanding of their interactions and interconnections. These interdependencies strengthen the communication and control of system components and provide intelligence, but they also weaken the system by accelerating failure propagation.³ Thus, unexpected cascading failures have been seen in different network systems, in which the initial breakdown of some components can iteratively trigger other functional components to become inaccessible or unusable, which may result in failures spreading widely across the systems.⁴ Cascading failures have catastrophic effects on personal lives and industries, and people have experienced these failures with devastating influences.⁵ Therefore, significant research has been conducted related to cascading failures in network systems, for example, modeling of cascading failures.⁶ Existing practices on cascading failure modeling can be categorized as shown in Table 1.

Table 1.

Main categories of cascading failure models.

Model category	Main features	Relevant models
Local load redistribution-based models	Network node/edge failure only triggers the failures of other nodes/edges in the neighborhood via local load redistribution	Sandpile model,⁷ Fiber-bundle models,⁸ Threshold model,⁹ etc.
Global load redistribution-based models	Node/edge failure triggers the failures of nonadjacent nodes/edges, and even the nodes/edges far away via global load redistribution	OPA model,¹⁰ Motter-Lai model,¹¹ CASCADE model,¹² etc.
Dependency-based models	Node/edge failure triggers the failures of other nodes/edges through network dependency links or clusters, which are different from connectivity links	Single-layer network model,⁶ Interdependent network model,¹³ Spatially embedded model,¹⁴ etc.

System resilience is the ability of a system to withstand or quickly return to normal condition after the occurrence of events that disrupt the system state.¹⁵ Resilience reflects system-integrated capability against cascading failures, and some studies related to network resilience have also been conducted.^16,17 Because failures are inevitable and the resources to restore the system when failures occur are limited and subject to budgets, optimal maintenance methods or strategies to improve system resilience and mitigate the influences of cascading failures are of great significance and interest in both academia and practice.^18–20 Di Muro et al. proposed a maintenance strategy for interdependent network systems by repairing the network nodes that belong to the largest connected component of each constituent network.²¹ Almoghathawi et al. developed an optimization method, which decides the maintenance priority of failed components in interdependent networks, to achieve a balance between resilience and investment cost.²² For demand-supply networks, Hosseinalipour et al.²³ developed a maintenance resource configuration method that can effectively reduce resource allocation cost while confining cascading failures.

Although efforts have been made to prevent cascading failures, high-impact cascading failures still occur in different network systems. One of the main reasons is that a large network system comprises many distinct components, which are subjected to diverse disruptions. Therefore, a key point of analyzing cascading failures in networks is to investigate the influence of component failures, and some component importance measures have been proposed from different perspectives.^24,25 In practice, the importance of a network component, which drives maintenance prioritization, changes as the network topology is altered by ongoing failures and restoration. Therefore, it is important to identify the components with high importance from the changing network structure so that the maintenance priority can be updated promptly. Doing so helps to optimally allocate the limited maintenance resources and improve system recovery efficiency.²⁶

Dui et al.²⁷ combined residual resilience and some importance measures to optimize the recovery sequence of failed ports and routes in a maritime transportation system during which the traffic is dynamically redistributed. Almoghathawi et al.²⁸ developed a resilience-driven optimization model to address the interdependent network restoration problem, where the disrupted components might be partially restored to minimize the cost associated with the recovery process. However, current research on the component importance measures of network systems in terms of the effects of component restoration on network resilience against cascading failures is very limited. Most centrality-based importance measures are closely related to network topology and interaction, whereas vulnerability-based importance measures mainly quantify the influence of a component failure on a certain performance of a network.²⁹ However, none of them considers both real-time network connectivity and the impact of network maintenance on resilience during the cascading failure-maintenance process.

Among network topology-based features, the minimum vertex cover (MVC) is a key subset of network components/nodes: the smallest set of nodes such that every network edge has at least one endpoint in the set. In other words, it represents a network with the fewest possible nodes while still covering all edges. Since MVC serves as an approximation of the backbone of the entire network system, it can be used to identify critical network components from the perspective of global network connectivity importance. Figure 1 presents an example to illustrate the MVC node set in a network system, where the three middle computers circled in red form the MVC node set. MVC has been applied to different research fields, such as network security,³⁰ computational biology,³¹ and text summarization.³²

Figure 1.

An example of MVC node sets in a network system.

In this paper, the authors propose a new recovery framework for network resilience enhancement against cascading failures by considering real-time global network topology information. The core of the framework is quick MVC identification, that is, identifying important system components for maintenance, based on a structural graph embedding technique and deep reinforcement learning (RL)-based graph learning. By incorporating MVC into other maintenance priority strategies, the framework can optimize recovery resource allocation by updating maintenance priority targets in real time. According to the two case studies, both the system loss caused by cascading failures and the recovery time can be reduced using the proposed recovery framework compared with existing maintenance strategies. This demonstrates the effectiveness of the proposed method in confining cascading failures and enhancing network resilience.

Background

In this section, the authors present the modeling of mixed cascading failures and three existing recovery strategies with specific preferences, which form the basis for the following case studies in the manuscript.

Cascading failure model

In this paper, a single-structured network system is represented as an unweighted, undirected, self-loop-free graph G with N nodes and an adjacency matrix $\{e_{ij}\}_{N \times N}$. $e_{ij} = 1$ if an edge (topological connection) exists between node i and node j, and $e_{ij} = 0$ otherwise. Because a network system generally carries a flow/load of a particular resource (e.g. electricity, data packets), “betweenness centrality” is used to describe network load.³³ The authors use an improved equation³⁴ to calculate the betweenness centrality for undirected graphs. Network load is assumed to be transmitted only along the shortest paths between each pair of network nodes. A path consists of the edges between the two targeted nodes. The shortest path is the one that consists of the smallest number of edges. Network efficiency E(G), defined in equation (1), is used to evaluate network performance in terms of transmission efficiency.³⁵

E(G) = \frac{1}{N (N - 1)} \sum_{i \neq j \in G} ef_{ij}

(1)

A higher E(G) indicates more efficient network load transmission. The transmission efficiency of the shortest path³⁵ between node i and node j is denoted by $ef_{ij}$. Each network node i is assigned a limited capacity $C_i$, which is the maximum load that the node can handle without congestion. A nonlinear capacity-load model³⁶ was adopted to determine the capacity of each network node. When the load on node i at time t, $L_i(t)$, exceeds the node capacity $C_i$, node i suffers an overload failure at time t. A mixed cascading failure model⁶ was applied to simulate the propagation of system failures. This model considers the combined impacts of network load dynamics and network component dependency, which helps to understand some essential features of cascading failures. In this model, dependence clusters of network nodes are introduced to describe the dependence interactions between network nodes apart from topological connections. An example of a single-structured network system with multiple dependence clusters is illustrated in Figure 2.⁶
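As a minimal sketch of equation (1) and the load proxy, the snippet below assumes that $ef_{ij}$ is the reciprocal of the shortest-path hop count (the common definition of global efficiency) and uses standard betweenness centrality from NetworkX as a stand-in for the improved equation of reference 34, which is not reproduced here. The function names `network_efficiency` and `node_loads` are illustrative only.

```python
import networkx as nx


def network_efficiency(G):
    """Equation (1), ASSUMING ef_ij = 1 / d_ij with d_ij the shortest-path hop count.

    Disconnected pairs contribute 0, which matches the limit d_ij -> infinity.
    """
    n = G.number_of_nodes()
    if n < 2:
        return 0.0
    total = 0.0
    for source, dists in nx.all_pairs_shortest_path_length(G):
        for target, d in dists.items():
            if target != source:
                total += 1.0 / d
    return total / (n * (n - 1))


def node_loads(G):
    """Load proxy: standard (unnormalized) betweenness centrality.

    This is a stand-in; the paper uses an improved equation that is not shown.
    """
    return nx.betweenness_centrality(G, normalized=False)


G = nx.path_graph(4)  # toy network: 0 - 1 - 2 - 3
print(network_efficiency(G))
print(node_loads(G))
```

Under this assumption the result agrees with `nx.global_efficiency`, which can serve as a cross-check.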

Figure 2.

A single-structured network system with dependence node clusters.

Because the sizes of dependence clusters in a real-world network usually follow certain distributions,³⁷ the sizes of dependence clusters in a network system are assumed to follow a shifted/scale-adjusted Poisson distribution.⁶ The probability that a node belongs to a dependence cluster of size D, P(D), is obtained based on equation (2),

P (D) = \frac{λ^{D - 1} e^{- λ}}{(D - 1)!}, D \geq 1

(2)

where $λ = D\text{-size} - 1$. The average number of nodes in a dependence cluster in a network system is D-size. Therefore, D-size indicates the relative degree of node dependency regarding dependence clusters. A larger D-size means greater dependency. In this research, a dependence cluster immediately collapses (i.e. all nodes belonging to this cluster fail) if the percentage of failed nodes in this dependence cluster exceeds the cluster collapsing threshold (CCT).³⁸ Thus, a smaller CCT indicates stronger dependency strength inside the dependence clusters.
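The following sketch evaluates equation (2) numerically and checks that the mean cluster size equals D-size, as stated above. It also expresses the CCT rule as a one-line predicate. The function names are hypothetical, and "exceeds" is read here as a strict inequality.

```python
import math


def cluster_size_pmf(D, d_size):
    """P(D) from equation (2) with lambda = D-size - 1, valid for D >= 1."""
    lam = d_size - 1
    return lam ** (D - 1) * math.exp(-lam) / math.factorial(D - 1)


def cluster_collapses(n_failed, cluster_size, cct):
    """True when the failed fraction of a cluster strictly exceeds the CCT."""
    return n_failed / cluster_size > cct


d_size = 4
sizes = range(1, 60)  # the tail beyond this range is negligible for small lambda
probs = [cluster_size_pmf(D, d_size) for D in sizes]
mean_size = sum(D * p for D, p in zip(sizes, probs))
print(sum(probs), mean_size)
```

Because the distribution is a Poisson variable shifted by one, its mean is λ + 1, which is exactly D-size.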

In summary, the mixed cascading failure model describes the iteration of two types of network failures: overload failures caused by load dynamics and dependence cluster collapses caused by node dependency. The maintenance strategy was applied to restore the network system when failures occurred so that a dynamic cascading failure-maintenance process could be observed.

Recovery strategy

Three different existing recovery strategies with specific preferences were employed to restore network system against cascading failures. (1) The high-degree first repair (HDFR) strategy is designed for the repair order assigned based on the degree (the number of topological connections) of failed nodes (i.e. this strategy has the preference that network nodes with higher node degrees are repaired with higher priority). (2) The shortest-time first repair (STFR) strategy assigns the repair order according to the required repair time of failed nodes. The maintenance prioritizes the failed nodes that require shorter repair time. (3) The high load first repair (HLFR) strategy prioritizes repair based on the amount of load transmitted through the failed nodes (i.e. this strategy has the preference that network nodes that carry higher loads are repaired with higher priority).

Ties, which occur when failed nodes have the same value of the prioritized attribute (i.e. degree, required repair time, or load), were broken according to the first fail/first repair policy. Once started, the repair activities of failed nodes do not stop until they are completed. For simplicity, it was assumed that the maintenance resources were available to complete repairs. The number of new repair activities that begin at each round of inspection under each maintenance strategy is determined by the repair proportion p and the total number of failed nodes whose repair activities have not yet started at the time of inspection.
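A minimal sketch of one inspection round under the three strategies is given below. The `FailedNode` record, the `select_repairs` helper, and the choice to round the number of new repairs up are assumptions made for illustration; the text above does not state the rounding rule.

```python
import math
from dataclasses import dataclass


@dataclass
class FailedNode:
    name: str
    fail_time: int    # simulation step at which the node failed
    degree: int       # number of topological connections
    repair_time: int  # required repair time in simulation steps
    load: float       # load carried by the node


# Smaller key value means higher repair priority.
STRATEGY_KEYS = {
    "HDFR": lambda n: -n.degree,
    "STFR": lambda n: n.repair_time,
    "HLFR": lambda n: -n.load,
}


def select_repairs(waiting, strategy, p):
    """Pick the nodes whose repair starts in this inspection round.

    `waiting` holds failed nodes whose repair has not started yet.
    Ties are broken by first fail/first repair (earlier fail_time first).
    """
    key = STRATEGY_KEYS[strategy]
    ordered = sorted(waiting, key=lambda n: (key(n), n.fail_time))
    # ASSUMPTION: round up; round() guards against floating-point noise.
    n_start = math.ceil(round(p * len(ordered), 9))
    return ordered[:n_start]


waiting = [
    FailedNode("A", 0, 3, 2, 5.0),
    FailedNode("B", 1, 5, 1, 2.0),
    FailedNode("C", 2, 5, 4, 9.0),
    FailedNode("D", 3, 1, 1, 1.0),
    FailedNode("E", 4, 2, 3, 3.0),
]
for s in STRATEGY_KEYS:
    print(s, [n.name for n in select_repairs(waiting, s, p=0.6)])
```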

As was mentioned previously, existing maintenance strategies with specific preferences rarely, if ever, consider component importance changing during the cascading failure-maintenance process. Therefore, identifying important components based on MVC from the real-time network structure is of practical significance for updating maintenance priority targets during the recovery process for effective maintenance. The authors propose a recovery framework that can combine real-time MVC importance with other preferences according to customers’ desirability (e.g., network transmission importance) to effectively restore network system from the cascading failures. Case studies were implemented on different network systems to demonstrate the effectiveness of the proposed framework. The details of the recovery framework are presented in the following sections.

Methodology

In this section, the authors propose a new recovery framework that combines MVC importance with existing maintenance strategies regarding other preferences to enhance system resilience against cascading failures. The framework is implemented by rapid MVC detection from the real-time network structure according to the graph embedding technique and the advanced RL-based graph learning.

Deep graph learning for MVC detection

MVC is an NP (non-deterministic polynomial-time)-hard computational problem, which means that no polynomial-time algorithm is known for finding the optimal solution. Traditional methods to solve MVC can be divided into three categories: exact algorithms, approximation algorithms, and heuristic algorithms. Exact algorithms are based on enumeration or linear programming, have exponential time complexity, and are not practical for large-scale graphs. Approximation algorithms (if they are available) are faster, with polynomial time complexity, but guarantee only a bound on solution quality rather than optimality and are not always practically applicable. Heuristic algorithms are fast but do not guarantee solution quality, and they require expert knowledge and repeated designs for different problems.

RL has already shown its potential in optimization and management, and can be combined with graph embedding to derive an MVC solution efficiently. Recently, a few works have applied RL to explore heuristic solutions for large-graph problems.^39–41 Some other efforts use Graph Neural Networks (GNNs) to derive solutions for large-scale graph problems.^42–44 The major difference between RL approaches and GNN approaches for graph problems is that RL approaches can automatically explore solutions for the graph problems without predefined, labeled training datasets. It is important to note that for many NP-hard large-scale graph problems, it is impractical (or extremely computationally intensive) to generate an appropriate training dataset. With an RL approach, it is convenient to construct the reward function for the MVC problem: a constant negative value is added to the reward of the partial solution whenever a node is added to the solution. RL algorithms can then automatically identify the best embedding and the best graph solution simultaneously during training.^45,46

In this study, the authors adopted a state-of-the-art, high-performance deep graph analysis environment (OpenGraphGym)^46,47 to support MVC estimation in a very short time and provide a real-time reference for maintenance priority updates. OpenGraphGym allows the authors to use several graph embeddings to represent graph attributes and features. We apply Structure2vec graph embedding⁴⁸ and the deep Q-learning RL algorithm⁴⁹ to search for the optimal MVC solution in this work (see Figure 3). The Q-learning algorithm is a process of trial-and-error interactions between the agent and the environment.⁵⁰ The goal of the Q-learning algorithm is to find an optimal policy that maximizes the rewards at each episode. We use an approximation function as the policy because of the large number of possible states. At each episode, the agent explores a new problem instance by initializing a new environment. At every step of each episode, the agent chooses an action either randomly or according to the policy. The environment perceives the state and returns the reward signal to the agent. The transition of state, action, and reward, denoted as $\{s_{t}, a_{t}, r_{t}, s_{t + 1}\}$, is pushed to the replay memory buffer. $s_{t}$ and $s_{t + 1}$ are the states at the $t^{th}$ and $(t + 1)^{th}$ steps, respectively. $a_{t}$ and $r_{t}$ are the action and reward at the $t^{th}$ step. Then, the agent applies the value function to compute the “label” of the policy output for the action $a_{t}$. The Bellman equation presented below is used to simplify the value function,

Q (s_{t}, a_{t}) = r_{t} + γ \max_{a_{t + 1}} Q (s_{t + 1}, a_{t + 1})

(3)
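To make equation (3) concrete, the sketch below computes the training "label" for a single transition. It uses a small lookup table in place of the approximation function used in the paper, so `q_table`, `td_target`, and the discount factor value are illustrative assumptions rather than the OpenGraphGym implementation.

```python
from collections import deque


def td_target(q_table, reward, next_state, next_actions, gamma):
    """Right-hand side of equation (3): r_t + gamma * max_a Q(s_{t+1}, a).

    At a termination state there are no further actions, so the label is r_t.
    Unseen (state, action) pairs default to a value of 0.0.
    """
    if not next_actions:
        return reward
    best_next = max(q_table.get((next_state, a), 0.0) for a in next_actions)
    return reward + gamma * best_next


# Replay memory of transitions (s_t, a_t, r_t, s_{t+1}, actions available in s_{t+1}).
replay = deque(maxlen=1000)
replay.append(("s0", "add_node_2", -1.0, "s1", ["add_node_1", "add_node_3"]))

# Hypothetical current Q estimates for the next state.
q_table = {("s1", "add_node_1"): -1.0, ("s1", "add_node_3"): -3.0}

s, a, r, s_next, next_actions = replay[0]
label = td_target(q_table, r, s_next, next_actions, gamma=0.5)
print(label)
```

The reward of -1.0 per added node mirrors the constant negative reward described for the MVC problem.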

Figure 3.

The diagram of the advanced RL-based graph learning for MVC detection.

Finally, each episode ends with a special termination state. Then, the agent will start a new episode with a new problem instance.

As shown in Figure 3, two major parts of the framework are graph embedding and the RL Q function. Two steps are illustrated in the figure. At each step, graph embedding takes the graph as the input and produces the embeddings for all nodes in the graph. Then, the embeddings for all nodes are sent to the RL Q function. The RL Q function computes the scores for all nodes in the graph (shown as q in Figure 3). The nodes with the highest score (e.g., q₂) are marked and added to the partial solution. In Figure 3, two blue nodes are included in the partial solution.

By adding a node to the partial solution at each step, this approach provides a ranked MVC solution, which means that early selected nodes in an MVC solution generally contribute more to the final optimal MVC solution than late selected nodes. To further improve the computational efficiency to deliver a reasonable MVC solution, small graphs were used to train an MVC graph agent within the OpenGraphGym that can provide a reasonable MVC solution for relatively large graphs, which enables real-time MVC detection for real-world large scale networks. This acceleration is valid when the training graphs and testing graphs are of the same graph type and degree distribution.
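The step-by-step construction of a ranked solution can be sketched as follows. The learned Q function is not available here, so the sketch substitutes a simple stand-in score (the number of still-uncovered edges at each node); only the selection loop and the reward bookkeeping reflect the procedure described above. The name `ranked_vertex_cover` is hypothetical.

```python
import networkx as nx


def ranked_vertex_cover(G, score=None):
    """Add the top-scoring node at each step until every edge is covered.

    Returns the nodes in selection order (earliest = highest rank) and the
    accumulated reward, with a constant -1 for every node added.
    """
    if score is None:
        # STAND-IN for the learned Q function: residual degree of the node.
        score = lambda H, v: H.degree(v)
    H = G.copy()  # residual graph that holds only uncovered edges
    ranked, total_reward = [], 0
    while H.number_of_edges() > 0:
        best = max(H.nodes, key=lambda v: score(H, v))
        ranked.append(best)
        total_reward -= 1
        H.remove_node(best)  # all edges incident to `best` are now covered
    return ranked, total_reward


star = nx.star_graph(5)  # node 0 is connected to nodes 1..5
print(ranked_vertex_cover(star))
```

With the stand-in score this reduces to a greedy degree heuristic, so it illustrates the ranking mechanism rather than the quality of the trained agent.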

MVC-based Recovery Framework

Because MVC identifies important network nodes from the perspective of global network connectivity, the proposed recovery framework combines it with the existing maintenance strategies having specific prioritization preferences for optimal maintenance resource allocation. The HDFR strategy and the HLFR strategy, in which load is represented by betweenness centrality, have already been shown to have similar repair effectiveness against the modeled cascading failures.³⁴ Therefore, real-time MVC was only combined with the HLFR strategy and the STFR strategy, respectively, for comparison with the existing maintenance strategies without MVC. The first/top u percent of the nodes in MVC (the most vital nodes in the MVC node set) were assigned a high maintenance priority. The other failed nodes, which do not belong to this top u percent of MVC nodes, are then repaired in the order given by the HLFR or STFR strategy. The order in which to start the repair activities of the top u percent of MVC nodes, if their repair had not yet started, was assigned based on their ranking inside MVC. The parameter u therefore indicates the weight of MVC in restoration: a larger u gives MVC nodes a higher weight in the restoration priority.
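The ordering rule described in this paragraph can be sketched as a small function. The helper name `repair_order`, the use of a plain list for the ranked MVC, and rounding the size of the top-u group up are assumptions for illustration.

```python
import math


def repair_order(waiting, ranked_mvc, u, base_key):
    """Order failed nodes whose repair has not started.

    waiting    : failed, not-yet-started nodes listed in first-fail order
    ranked_mvc : MVC nodes in ranking order (most important first)
    u          : fraction of MVC nodes that receive top priority (e.g. 0.1)
    base_key   : sort key of the base strategy (smaller = repaired earlier)
    """
    # ASSUMPTION: the size of the top-u group is rounded up.
    n_top = math.ceil(round(u * len(ranked_mvc), 9))
    waiting_set = set(waiting)
    # Top-u MVC nodes come first, in their MVC ranking order.
    first = [v for v in ranked_mvc[:n_top] if v in waiting_set]
    first_set = set(first)
    # Remaining nodes follow the base strategy; sorted() is stable, so ties
    # keep the first fail/first repair order of `waiting`.
    rest = sorted((v for v in waiting if v not in first_set), key=base_key)
    return first + rest


ranked_mvc = list("abcdefghij")    # hypothetical ranked MVC of 10 nodes
waiting = ["x", "b", "y", "a", "c"]
load = {"x": 1, "b": 5, "y": 9, "a": 2, "c": 4}
order = repair_order(waiting, ranked_mvc, u=0.2, base_key=lambda v: -load[v])
print(order)  # MVC + HLFR ordering
```

Passing a repair-time key instead of the negative load gives the MVC + STFR variant with the same function.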

The main steps of implementing the recovery framework with real-time MVC detection for system maintenance prioritization against cascading failures are briefly presented as follows:

Step 1. All network nodes are initially functional with limited capacity. Dependence clusters are created. Detect MVC of the network system.

Step 2. Randomly select nodes to break down because of initial disturbances.

Step 3. Dependence clusters collapse if CCT is exceeded. Update the network structure.

Step 4. Network loads are dynamically redistributed over the current network structure. Overloaded nodes break down.

Step 5. Failed nodes are arranged to start the repairing activities based on the maintenance strategy considering MVC importance.

Step 6. Quickly detect MVC from the current network structure.

Step 7. Return to Step 3 until system is recovered to the predetermined level.

The performance of the network system is recorded during the process of cascading failures with maintenance implementation.

As an example, consider the case in which MVC is combined with the HLFR strategy and u = 10%. Every time Step 5 is reached, the failed nodes that belong to the top 10% of the MVC node set and whose repair activities have not yet started are repaired with top priority (if the current MVC contains 100 nodes, only the 10 most important nodes have top priority to start repair). Then, the other failed nodes, which do not belong to the top 10% of the MVC set and whose repair activities have also not yet started, are repaired in the order given by the HLFR strategy if maintenance resources are still available. Note again that only the failed nodes whose repair activities have not yet begun are considered when deciding the repair order at Step 5. In reality, maintenance strategies are implemented at a cost, and different recovery prioritizations usually come with distinct and limited investments. Therefore, there is a pressing need to improve the effectiveness and efficiency of recovery resource allocation. In this work, we focus on exploring the impact of maintenance prioritization and will consider the monetary cost associated with maintenance in future work.

Resilience metrics

In this work, system resilience loss was adopted to evaluate the maintenance effects of the recovery framework.³⁴ Figure 4 illustrates the changing trend of system performance after system interruptions and maintenance implementation. System interruption happens at time t_I, which leads to the degradation of system performance Q(t). Maintenance actions are then applied to restore the system until Q(t) is recovered to a predetermined level Q(t_e) at time t_e. Note that the predetermined level Q(t_e) can be set to be the same, close to, or better than the pre-disruption level Q(t_I).³⁴

Figure 4.

Changing trend of system performance Q(t) after system failures occur and after maintenance implementation.

System resilience loss up to time t is measured by a time-dependent metric, $ℜ (t)$ . It is defined as the proportion of lost performance regarding Q(t) due to cascading failures with respect to a comparative Q(t) if no failure occurred up to time t.³⁴ $ℜ (t)$ is calculated based on Eq. (4).

ℜ (t) = \begin{cases} 0, & t \leq t_{I} \\ \frac{\int_{t_{I}}^{t} (Q (t_{I}) - Q (t)) dt}{Q (t_{I}) (t - t_{I})}, & t > t_{I} \end{cases}

(4)

where $0 \leq ℜ (t) \leq 1, t \in [0, t_{e}]$ . Smaller $ℜ (t)$ denotes less resilience loss. Network load was adopted as Q(t) to calculate resilience loss in terms of network load demand and supply capability.
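For simulation output sampled at discrete steps, equation (4) can be approximated numerically. The sketch below uses trapezoidal integration, which is an assumption about the discretization rather than the authors' stated procedure; `resilience_loss` is a hypothetical helper name.

```python
def resilience_loss(times, q_values):
    """Approximate equation (4) from sampled performance values.

    times[0] is the interruption time t_I and times[-1] is the evaluation
    time t; q_values[k] is Q at times[k], so q_values[0] is Q(t_I).
    The integral of Q(t_I) - Q(t) is evaluated with the trapezoidal rule.
    """
    t_i, t = times[0], times[-1]
    if t <= t_i:
        return 0.0
    q_i = q_values[0]
    lost = 0.0
    for k in range(1, len(times)):
        dt = times[k] - times[k - 1]
        gap_prev = q_i - q_values[k - 1]
        gap_curr = q_i - q_values[k]
        lost += 0.5 * (gap_prev + gap_curr) * dt
    return lost / (q_i * (t - t_i))


# Toy trajectory: performance drops to 60, then recovers to 100 by step 4.
times = [0, 1, 2, 3, 4]
q_values = [100.0, 60.0, 60.0, 80.0, 100.0]
print(resilience_loss(times, q_values))  # 0.25
```

In this toy trajectory the lost area is 100 performance-steps out of a possible 400, giving a resilience loss of 25%.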

The predetermined system recovery level is measured by network efficiency E(G), as described previously. The time to restore the network system to the predetermined level during the cascading failure-maintenance process, T (total repair time), reflects the maintenance effects in terms of the time spent in degraded operation. In the following case studies, we assume that maintenance actions begin as soon as the disruption event occurs, so T is the total repair time from t_I to t_e. A shorter total repair time T is more desirable.

Case studies

Two case studies are conducted, one on a synthetic network structure and one on a real-world network system structure. The results are presented in the following sections. To demonstrate the effectiveness of the proposed recovery framework, the three existing maintenance strategies introduced previously were implemented, and their results are compared with those of the MVC-based recovery framework.

Experiments with synthetic networks

In this case study, the mixed cascading failures are simulated on a Barabási-Albert (BA) scale-free network with maintenance implementation. A BA network is a synthetic network model in which the node degree distribution follows a power law⁵¹ (i.e. the probability that a network node has degree k is $P (k) \sim k^{- r}, r \approx 3$). This property has been observed in many real-life networks such as power grids, communication networks, and the internet.⁵²

The main assumptions made in the numerical simulation are as follows. The required repair times for failed nodes are independent random variables that are uniformly distributed in [1, 4] in terms of simulation steps. Each simulation step represents a fixed duration of time. Once the repair of a node is started, it does not stop until it is completed. CCT is 0.7 (i.e. a dependence cluster instantly collapses once more than 70% of its nodes have failed). The maintenance process stops when network efficiency is recovered to 95% of its initial value. The proportion of failed network nodes whose repair activities have not yet started that is selected to start repair at each round of inspection is p = 0.6.

The adopted BA network examples have 250 nodes with an approximate edge probability of 0.1. To minimize random errors, the authors simulated the cascading failures triggered by 10 different sets of initial random node failures, for each of which they randomly generated a number of BA network realizations of the specified network scale. The 10 numbers of initially failed nodes (v) are 9, 15, 21, 27, 33, 39, 45, 51, 57, and 63, accounting for about 4 to 25% of the total network nodes.

MVC graph agent development and implementation

To improve computational efficiency, the authors created 400 BA graphs of 20 nodes with an approximate edge probability of 0.1 to train the MVC graph agent. These training BA graphs are generated using the function “barabasi_albert_graph” in the graph manipulation library NetworkX.⁵³ This function creates scale-free networks using the BA network model.⁵¹
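A minimal sketch of generating such a training set is shown below. The attachment parameter m = 1 is an assumption chosen because it yields an edge density of exactly 0.1 for 20-node graphs; the paper does not state the value used. The seed handling and the helper name `make_training_graphs` are also illustrative.

```python
import networkx as nx

N_GRAPHS = 400  # number of training graphs stated in the text
N_NODES = 20    # nodes per training graph stated in the text
M_EDGES = 1     # ASSUMPTION: edges attached per new node, giving density 0.1


def make_training_graphs(n_graphs=N_GRAPHS, n_nodes=N_NODES, m=M_EDGES, seed=0):
    """Generate BA graphs with reproducible, distinct seeds."""
    return [
        nx.barabasi_albert_graph(n_nodes, m, seed=seed + k)
        for k in range(n_graphs)
    ]


graphs = make_training_graphs()
print(len(graphs), nx.density(graphs[0]))
```

For the 250-node test networks, a value of m around 13 gives a density close to 0.1 under the same construction, again as an assumption.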

The trained MVC graph agent is tested on random graphs. For the targeted 250-node BA networks with edge probability around 0.1, the agent provided MVC solutions with an average size of 202, whereas the average size of the MVC solutions given by a 2-approximation MVC solver in NetworkX is around 225. The solution size obtained from the NetworkX solver is theoretically guaranteed to be at most two times the optimal solution size.⁵⁴
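The baseline can be reproduced in outline with NetworkX's approximation module. The sketch assumes that the solver referred to above is `min_weighted_vertex_cover`, which offers the factor-two guarantee, and that m = 13 approximates an edge probability of 0.1 for 250 nodes; neither detail is confirmed by the text. Cover sizes will vary with the random graph and are not expected to match the reported averages exactly.

```python
import networkx as nx
from networkx.algorithms.approximation import min_weighted_vertex_cover


def is_vertex_cover(G, cover):
    """Check that every edge has at least one endpoint in `cover`."""
    return all(u in cover or v in cover for u, v in G.edges)


# ASSUMPTION: m = 13 gives a density near 0.1 for a 250-node BA graph.
G = nx.barabasi_albert_graph(250, 13, seed=1)
cover = min_weighted_vertex_cover(G)  # set of nodes, unit weights by default
print(len(cover), is_vertex_cover(G, cover))
```

The same `is_vertex_cover` check can be applied to the output of the trained agent before its solution size is compared with the baseline.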

Based on the trained MVC graph agent, MVC is estimated from the real-time BA network structure changed by the ongoing cascading failures and maintenance implementation. Then, the maintenance priority is updated according to the current MVC and the other maintenance preference.

Results and discussion

MVC is combined with HLFR and STFR, respectively, to implement the proposed recovery framework and dynamically update the maintenance priority. The parameter u, which determines the top percentage of MVC nodes that have high maintenance priorities, is set to 10%, 30%, and 50% of the nodes in MVC, respectively. Figure 5 presents the average resilience loss and T vs. v in BA networks under the proposed recovery framework and the three existing maintenance strategies. The presented results under the existing maintenance strategies for each v correspond to an average of over 90 network realizations. The results under the MVC-based recovery strategies are averaged over 20 realizations. Through the dependence clusters, D-size exerts a notable influence on network robustness against mixed cascading failures; for example, both the recovery time and the network collapse threshold are affected by D-size.⁶ Scenarios with different D-size values are therefore also considered, that is, D-size = 4, 8, 12, and 16. The curves shown in Figure 5 are averaged over the results of the scenarios with the four values of D-size.

Figure 5.

Average results of resilience loss and T as a function of the number of initial failed components v for the proposed recovery framework vs. existing maintenance strategies.

As shown in Figure 5, resilience loss increases as v increases because a larger number of initial randomly failed nodes triggers more dramatic cascading failures, which leads to more damage to the system performance. STFR reduces resilience loss more than HDFR and HLFR do. The MVC + STFR method presents the best maintenance effects on mitigating resilience loss, and MVC + HLFR performs better than HLFR. T also increases as v increases, which is consistent with the changing trend of resilience loss, $ℜ (T)$: restoring the network system to the predetermined level takes more time after the cascading failures caused by a larger v. The longest T is incurred using STFR, and the shortest T is achieved with HDFR or HLFR. The MVC-based recovery framework shortens T when compared with the corresponding existing maintenance strategies. Additionally, a threshold can be observed in the changing trend of T as v increases: the growth rate of T decreases once v exceeds a certain threshold. Table 2 presents the average results of T and resilience loss for different scenarios under the existing maintenance strategies and the proposed recovery framework incorporating MVC.

Table 2.

Average results under different maintenance strategies.

Maintenance strategy	Average T over v (9–63)	Average resilience loss (%) over v (9–63)
HDFR	3.7	14.10
HLFR	3.7	14.03
MVC + HLFR	3.6	13.69
STFR	4.8	12.49
MVC + STFR	4.5	12.23

According to Table 2, MVC + STFR, which combines global network connectivity importance with repair time preference, performs better than STFR. The MVC + HLFR method, which considers the importance of both global network connectivity and transmission capability, has better maintenance effects than HLFR. These results indicate that the proposed recovery framework, which updates the maintenance priority by incorporating real-time MVC importance into other maintenance prioritizations, reduces resilience loss and shortens T relative to the corresponding existing strategies.
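To make the comparison in Table 2 concrete, the following minimal sketch computes the relative change in T and the change in resilience loss (in percentage points) when MVC is added to a baseline strategy. The numbers are copied from Table 2; the helper name `compare` is illustrative and not part of the original framework.

```python
# Values copied from Table 2 (BA networks):
# strategy -> (average T, average resilience loss in %).
table2 = {
    "HDFR": (3.7, 14.10),
    "HLFR": (3.7, 14.03),
    "MVC + HLFR": (3.6, 13.69),
    "STFR": (4.8, 12.49),
    "MVC + STFR": (4.5, 12.23),
}


def compare(baseline, combined, table):
    """Return (relative change in T in %, change in resilience loss in points)."""
    t0, loss0 = table[baseline]
    t1, loss1 = table[combined]
    rel_t = 100.0 * (t1 - t0) / t0   # negative means a shorter recovery time
    d_loss = loss1 - loss0           # negative means less resilience loss
    return round(rel_t, 2), round(d_loss, 2)


for base in ("HLFR", "STFR"):
    rel_t, d_loss = compare(base, "MVC + " + base, table2)
    print(f"MVC + {base} vs {base}: T {rel_t:+.2f}%, loss {d_loss:+.2f} points")
```

Under these values, adding MVC shortens the average T by roughly 2.7% for HLFR and 6.25% for STFR, while the resilience loss falls by 0.34 and 0.26 percentage points, respectively.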

Experiments with the US Top 500 airport network

Disruptions of air transportation systems caused by events such as extreme weather and attacks can lead to huge economic losses.⁵⁵ Studies have been conducted on the robustness of air transportation networks subject to interruptions.⁵⁶ A case study is conducted on the network of the 500 US airports with the largest amount of traffic, built from publicly available data.⁵⁷ This real-world network system consists of 500 nodes and 2980 edges. Network nodes denote airports, and edges represent air routes between airports. Mixed cascading failures and maintenance are simulated in this system to investigate the maintenance effects of different strategies. Figure 6 depicts the US top 500 airport network using the visualization tool Gephi.⁴⁷ Darker colors represent nodes with larger degrees. Some hub nodes have larger degrees than the other nodes.

Figure 6.

Visualization of the US top 500 airport infrastructure network.

MVC graph agent development and implementation

The authors generate 1000 training networks by randomly removing a portion of the nodes from the original airport network. The number of nodes in the training networks ranges from 300 to 480, based on the observed number of functional nodes in the system during the recovery process. The percentage of training networks of each size is presented in Table 3.

Table 3.

The percentage of training graphs with different sizes.

Number of nodes	Percentage of total training networks (%)
480	40
460	10
440	10
420	10
400	7.5
380	7.5
360	5
340	5
320	2.5
300	2.5
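The construction of the training set can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes that the retained nodes are drawn uniformly at random and that each training network is the induced subgraph on those nodes. The ring graph `demo_adj` is a hypothetical 500-node stand-in for the airport network, and all function names are illustrative.

```python
import random

# Size distribution copied from Table 3:
# number of nodes -> percentage of training networks.
SIZE_PERCENT = {480: 40, 460: 10, 440: 10, 420: 10, 400: 7.5,
                380: 7.5, 360: 5, 340: 5, 320: 2.5, 300: 2.5}


def counts_per_size(total, size_percent):
    """Convert the percentages into integer numbers of training networks."""
    return {n: round(total * p / 100) for n, p in size_percent.items()}


def induced_subgraph(adj, keep):
    """Keep only the chosen nodes and the edges between them."""
    keep = set(keep)
    return {u: {w for w in adj[u] if w in keep} for u in keep}


def sample_training_networks(adj, total, size_percent, seed=0):
    """Randomly remove nodes to obtain `total` networks with the given sizes."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    networks = []
    for n_nodes, count in counts_per_size(total, size_percent).items():
        for _ in range(count):
            kept = rng.sample(nodes, n_nodes)  # assumed: uniform node selection
            networks.append(induced_subgraph(adj, kept))
    return networks


# Hypothetical placeholder graph (a 500-node ring), not the airport data.
demo_adj = {i: {(i - 1) % 500, (i + 1) % 500} for i in range(500)}
training_networks = sample_training_networks(demo_adj, 1000, SIZE_PERCENT)
print(len(training_networks))
```

With 1000 networks in total, the percentages in Table 3 translate into 400 networks with 480 nodes, 100 each with 460, 440, and 420 nodes, 75 each with 400 and 380 nodes, 50 each with 360 and 340 nodes, and 25 each with 320 and 300 nodes.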

The average size of the MVC solution found by the trained agent (around 234) is smaller, and therefore better, than that from the 2-opt solver in the NetworkX library (294).
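For context on why a learned agent may return a smaller cover than a generic approximation routine, the sketch below shows a classic matching-based 2-approximation for vertex cover together with a validity check. It is neither the NetworkX solver nor the trained agent used in this study; the function names and the toy path graph are illustrative assumptions.

```python
def matching_vertex_cover(edges):
    """Classic 2-approximation: take both endpoints of a greedy maximal matching."""
    cover = set()
    for u, v in edges:
        # An edge with no endpoint in the cover joins the matching,
        # and both of its endpoints are added to the cover.
        if u not in cover and v not in cover:
            cover.update((u, v))
    return cover


def is_vertex_cover(edges, cover):
    """Check that every edge has at least one endpoint in the cover."""
    return all(u in cover or v in cover for u, v in edges)


# Hypothetical toy graph: the path 0-1-2-3-4, whose minimum vertex cover is {1, 3}.
path_edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
cover = matching_vertex_cover(path_edges)
print(sorted(cover), is_vertex_cover(path_edges, cover))
```

On this path the routine returns four nodes, twice the size of the optimal cover {1, 3}, which illustrates the gap that a better heuristic or a trained agent can close.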

The maintenance actions are implemented once the cascading failures are triggered. During the recovery process, the MVC is calculated from the real-time airport network structure using the trained MVC agent. Then, maintenance priority targets are updated according to the current MVC and the other maintenance preference under consideration.
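One way to picture this joint prioritization is a two-tier ordering: failed nodes that belong to the top u fraction of the current MVC are repaired first, and ties within each tier are broken by the secondary preference. The sketch below is a hypothetical illustration under that assumption, not the authors' exact rule. The function `maintenance_order`, the ranking of MVC nodes, the rounding of u, and the toy repair times and loads are all assumptions introduced here.

```python
import math


def maintenance_order(failed, mvc_ranked, u, secondary_key):
    """Order failed nodes: top-u fraction of MVC nodes first, then the rest.

    failed        : iterable of failed node ids awaiting repair
    mvc_ranked    : list of MVC node ids, most important first (ranking assumed)
    u             : fraction in (0, 1], e.g. 0.1, 0.3, or 0.5
    secondary_key : function node -> sortable value; smaller is repaired earlier
    """
    n_top = math.ceil(u * len(mvc_ranked))  # assumed rounding rule
    top_mvc = set(mvc_ranked[:n_top])
    # Sort by MVC tier first (0 before 1), then by the secondary preference.
    return sorted(
        failed,
        key=lambda node: (0 if node in top_mvc else 1, secondary_key(node)),
    )


# Hypothetical toy data for illustration only.
failed = ["a", "b", "c", "d"]
mvc_ranked = ["c", "x", "a", "y"]
repair_time = {"a": 3, "b": 1, "c": 5, "d": 2}
load = {"a": 40, "b": 10, "c": 5, "d": 20}

# Repair-time preference (shorter repair time first), in the spirit of MVC + STFR.
by_time = maintenance_order(failed, mvc_ranked, 0.5, lambda n: repair_time[n])
# Load preference (higher load first), in the spirit of MVC + HLFR.
by_load = maintenance_order(failed, mvc_ranked, 0.5, lambda n: -load[n])
print(by_time, by_load)
```

In this toy case node "c" is repaired first under both preferences because it lies in the top half of the ranked MVC, even though it has the longest repair time and the lowest load.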

Results and discussion

Similar to the previous case study, the values of parameter u are set to the top 10%, 30%, and 50%, respectively, for the simulation in this case study. Figure 7 shows the average resilience loss $ℜ (T)$ and T versus v for the proposed recovery framework and three existing maintenance strategies. The results shown in Figure 7 for each v are averaged over 90 network realizations for each existing maintenance strategy, whereas the results under the recovery framework considering MVC are averaged over 10 realizations because the network topology is fixed. To reduce the bias resulting from network dependency, scenarios with different D-size (D-size = 4, 8, 12, and 16) are considered, and the results are averaged over these scenarios.

Figure 7.

Average results of resilience loss and T as a function of the number of initial failed components v for different maintenance strategies.

Similar to what was observed on BA networks, resilience loss $ℜ (T)$ and T increase as v increases in the airport network. Figure 7 shows that the differences in resilience loss between the maintenance strategies are not as large as those in BA networks. However, the maintenance strategies that incorporate MVC importance (i.e. MVC + STFR and MVC + HLFR) still result in less resilience loss for most values of v. A shorter T is achieved by applying HDFR, HLFR, and MVC + HLFR. A longer T is needed for network efficiency to recover to the predetermined level when applying STFR, whereas T is shortened with MVC + STFR.

It can also be seen from Figure 7 that the growth rate of T is reduced when v exceeds a noticeable threshold. The threshold of v in the trend of T is around 20 (i.e. about 4% of the total network nodes). This finding is in accordance with the results of Zhou et al.³⁴ under different repair strengths and dependency strengths. It indicates the robustness of the US top 500 airport network to cascading failures with respect to recovery time. Table 4 presents the results averaged over D-size scenarios under the existing maintenance strategies and the proposed recovery framework incorporating MVC.

Table 4.

Average results under different maintenance strategies.

Maintenance strategy	Average T over v (9–63)	Average resilience loss (%) over v (9–63)
HDFR	4	13.11
HLFR	4	12.83
MVC + HLFR	4	12.71
STFR	6	12.36
MVC + STFR	5.5	12.19

Based on Table 4, MVC + STFR reduces both resilience loss $ℜ (T)$ and total repair time T compared with STFR alone, whereas MVC + HLFR reduces resilience loss compared with HLFR alone at the same average T. This is consistent with the results presented in Figure 7 and the results obtained from the case study on BA networks.

Conclusions

In this paper, the authors propose a network recovery framework that updates maintenance priority targets based on the changing importance of system components during the process of cascading failures and maintenance. The recovery framework incorporates global network connectivity importance, represented by MVC, into existing maintenance prioritization strategies to optimize the maintenance priority at each round of inspection. An efficient MVC calculation from the changing network structure is the key step, and the authors provide a solution through OpenGraphGym with graph embedding and deep reinforcement learning.

Case studies are conducted on BA networks and the US top 500 airport network by employing the proposed recovery framework against cascading failures. MVC estimation is incorporated into two existing maintenance strategies with different maintenance prioritization weights. Maintenance effects, measured by resilience loss with respect to network load and by recovery time, are improved by considering real-time MVC in the maintenance prioritization strategy, which demonstrates the effectiveness of the proposed recovery framework.

This work supports the necessity and significance of updating maintenance priority based on the real-time importance of system components during system recovery. In future work, the recovery framework will be applied to other real-life network systems, and monetary costs and benefits will be included to formulate a real-world optimization problem. To make the framework more practical, quick detection of MVC in a large network system is one direction of the authors’ future work. How to determine the importance of MVC in the joint maintenance prioritization remains an unexplored question, and it becomes more complicated when extended to general network topologies. This question will be another direction of future research.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Natural Science Foundation of China (No. 72101116), the Natural Science Foundation of Jiangsu Province (No. BK20210317), and the Fundamental Research Funds for the Central Universities (No. 30921012204).

ORCID iDs

Jian Zhou

David W. Coit

References

1. Sun, Huang, et al. A new fractal reliability model for networks with node fractal growth and no-loop. Physica A Stat Mech Appl 2019; 514: 699–707.

2. Fang, Wang, et al. Performance and reliability improvement of cyber-physical systems subject to degraded communication networks through robust optimization. Comput Ind Eng 2017; 114: 166–174.

3. RJ. Influence of clustering on cascading failures in interdependent systems. IEEE Trans Netw Sci Eng 2019; 6(3): 351–363.

4. Schäfer, Witthaut, Timme, et al. Author correction: dynamically induced cascading failures in power grids. Nat Commun 2018; 9(1): 4032–4113.

5. Andersson, Donalek, Farmer, et al. Causes of the 2003 major grid blackouts in North America and Europe, and recommended means to improve system dynamic performance. IEEE Trans Power Syst 2005; 20(4): 1922–1928.

6. Zhou, Huang, Coit, et al. Combined effects of load dynamics and dependence clusters on cascading failures in network systems. Reliab Eng Syst Saf 2018; 170: 116–126.

7. Lee D-S, Goh K-I, Kahng, et al. Sandpile avalanche dynamics on scale-free networks. Physica A Stat Mech Appl 2004; 338(1–2): 84–91.

8. Moreno, Gómez, Pacheco AF. Instability of scale-free networks under node-breaking avalanches. Europhys Lett 2002; 58(4): 630–636.

9. Moreno, Pastor-Satorras, Vázquez, et al. Critical load and congestion instabilities in scale-free networks. Europhys Lett 2003; 62(2): 292–298.

10. Dobson, Carreras, Lynch, et al. An initial model for complex dynamics in electric power system blackouts. In: Proceedings of the 34th annual Hawaii international conference on system sciences, 2001, pp. 710–718. New York: IEEE.

11. Motter, Lai Y-C. Cascade-based attacks on complex networks. Phys Rev E 2002; 66(6): 065102.

12. Dobson, Carreras, Newman DE. A probabilistic loading-dependent model of cascading failure and possible implications for blackouts. In: Proceedings of the 36th annual Hawaii international conference on system sciences, 2003, p. 10. New York: IEEE.

13. Duan, et al. Universal behavior of cascading failures in interdependent networks. Proc Natl Acad Sci USA 2019; 116(45): 22452–22457.

14. Dong, Tian, Ding. A framework for modeling and structural vulnerability analysis of spatial cyber-physical power systems from an attack–defense perspective. IEEE Syst J 2021; 15: 1369–1380.

15. Henry, Emmanuel Ramirez-Marquez. Generic metrics and quantitative approaches for system resilience as a function of time. Reliab Eng Syst Saf 2012; 99: 114–122.

16. Wang, Pambudi, Wang, et al. Resilience of IOT systems against edge-induced Cascade-of-Failures: a networking perspective. IEEE Internet Things J 2019; 6(4): 6952–6963.

17. Pant, Barker, Ramirez-Marquez, et al. Stochastic measures of resilience and their application to container terminals. Comput Ind Eng 2014; 70: 183–194.

18. Zhou, Huang, Sun, et al. Network resource reallocation strategy based on an improved capacity-load model AICLM. Eksploat Niezawodn 2015; 17(4): 487–495.

19. Xie, Lundteigen, Liu. Reliability and barrier assessment of series–parallel systems subject to cascading failures. Proc IMechE, Part O: J Risk and Reliability 2020; 234(3): 455–469.

20. Lee. State-dependent age replacement policy for a system subject to cascading failures. Proc IMechE, Part O: J Risk and Reliability 2020; 234(2): 359–376.

21. Di Muro, La Rocca, Stanley, et al. Recovery of interdependent networks. Sci Rep 2016; 6: 22834.

22. Almoghathawi, Barker, Albert LA. Resilience-driven restoration model for interdependent infrastructure networks. Reliab Eng Syst Saf 2019; 185: 12–23.

23. Hosseinalipour, Mao, Eun, et al. Prevention and mitigation of catastrophic failures in demand-supply interdependent networks. IEEE Trans Netw Sci Eng 2020; 7(3): 1710–1723.

24. Liu, Ferrario, Zio. Identifying resilient-important elements in interdependent critical infrastructures by sensitivity analysis. Reliab Eng Syst Saf 2019; 189: 423–434.

25. Gao. On the component resilience importance measures for infrastructure systems. Int J Crit Infrastruct Prot 2022; 36: 100481.

26. Almoghathawi, Barker. Component importance measures for interdependent infrastructure network resilience. Comput Ind Eng 2019; 133: 153–164.

27. Dui, Zheng. Resilience analysis of maritime transportation systems based on importance measures. Reliab Eng Syst Saf 2021; 209: 107461.

28. Almoghathawi, González, Barker. Exploring recovery strategies for optimal interdependent infrastructure network resilience. Netw Spat Econ 2021; 21: 229–260.

29. Ramirez-Marquez, Liu, et al. A new resilience-based component importance measure for multi-state networks. Reliab Eng Syst Saf 2020; 193: 106591.

30. Filiol, Franc, Gubbioli, et al. Combinatorial optimisation of worm propagation on an unknown network. Int J Comput Sci 2007; 2(2): 124–130.

31. Pirzada. Applications of graph theory. In: PAMM: Proceedings in applied mathematics and mechanics, 2007, vol. 7, no. 1, pp. 2070013–2070013. Wiley Online Library.

32. Gupta, Kaur, Mirkin, et al. Text summarization through entailment-based minimum vertex cover. In: Proceedings of the third joint conference on lexical and computational semantics (*SEM 2014), Dublin, Ireland, August 23–24, 2014, pp. 75–80.

33. Goh, Kahng, Kim. Universal behavior of load distribution in scale-free networks. Phys Rev Lett 2001; 87(27 Pt 1): 278701.

34. Zhou, Coit, Felder, et al. Resiliency-based restoration optimization for dependent network systems against cascading failures. Reliab Eng Syst Saf 2021; 207: 107383.

35. Latora, Marchiori. Efficient behavior of small-world networks. Phys Rev Lett 2001; 87(19): 198701.

36. Zhou, Huang, Wang, et al. An improved model for cascading failures in complex networks. In: Cloud computing and intelligent systems (CCIS), 2012 IEEE 2nd international conference on, 2012, vol. 2, pp. 721–725. New York: IEEE.

37. Bashan, Parshani, Havlin. Percolation in networks composed of connectivity and dependency links. Phys Rev E 2011; 83(5): 8.

38. Bai, Huang, Wang, et al. Robustness and vulnerability of networks with dynamical dependency groups. Sci Rep 2016; 6: 37749.

39. Barrett, Clements, Foerster, et al. Exploratory combinatorial optimization with reinforcement learning. In: Proc AAAI Conf Artificial Intell 2020; 34(4): 3243–3250.

40. Zheng, Wang, Song. Opengraphgym: a parallel reinforcement learning framework for graph optimization problems. In: International conference on computational science, 2020, pp. 439–452. Springer, Cham, Switzerland.

41. Prouvost, Dumouchelle, Scavuzzo, et al. Ecole: A Gym-like library for machine learning in combinatorial optimization solvers. arXiv preprint arXiv:2011.06069, 2020.

42. Chen, Koltun. Combinatorial optimization with graph convolutional networks and guided tree search. arXiv preprint arXiv:1810.10659, 2018.

43. Fey, Lenssen JE. Fast graph representation learning with PyTorch Geometric. arXiv preprint arXiv:1903.02428, 2019.

44. Wilder, Ewing, Dilkina, et al. End to end learning and optimization on graphs. Adv Neural Inf Proc Syst 2019; 32: 4672–4683.

45. Dai, Khalil, Zhang, et al. Learning combinatorial optimization algorithms over graphs. arXiv preprint arXiv:1704.01665, 2017.

46. Zheng, Wang, Song. OpenGraphGym-MG: Using reinforcement learning to solve large graph optimization problems on MultiGPU Systems. arXiv preprint arXiv:2105.08764, 2021.

47. Bastian, Heymann, Jacomy. Gephi: an open source software for exploring and manipulating networks. In: Third international AAAI conference on weblogs and social media, May 17–20, 2009, San Jose, CA.

48. Khalil, Dai, Zhang, et al. Learning combinatorial optimization algorithms over graphs. Adv Neural Inf Proc Syst 2017; 6348–6358. Long Beach, CA, December 4–9, 2017.

49. Lillicrap, Sutskever, et al. Continuous deep q-learning with model-based acceleration. In: International conference on machine learning, Proceedings of Machine Learning Research (PMLR), 2016, pp. 2829–2838.

50. Mnih, Kavukcuoglu, Silver, et al. Human-level control through deep reinforcement learning. Nature 2015; 518(7540): 529–533.

51. Barabasi A-L, Albert. Emergence of scaling in random networks. Science 1999; 286(5439): 509–512.

52. Nguyen, Shen, Thai MT. Detecting critical nodes in interdependent power networks for vulnerability assessment. IEEE Trans Smart Grid 2013; 4(1): 151–159.

53. Hagberg, Swart, Chult DS. Exploring network structure, dynamics, and function using NetworkX. Los Alamos, NM: Los Alamos National Lab (LANL), 2008.

54. Bar-Yehuda, Even. A local-ratio theorem for approximating the weighted vertex cover problem. Computer Science Department, Technion, 1983.

55. Sun, Wandelt. Complementary strengths of airlines under network disruptions. Saf Sci 2018; 103: 76–87.

56. Skorupski. The simulation-fuzzy method of assessing the risk of air traffic accidents using the fuzzy risk matrix. Saf Sci 2016; 88: 76–87.

57. Colizza, Pastor-Satorras, Vespignani. Reaction–diffusion processes and metapopulation models in heterogeneous networks. Nat Phys 2007; 3(4): 276–282.