
Abstract

Resilience provides a new approach that system administrators can use in the design and analysis of engineering systems to enhance the ability of such systems to withstand uncertain threats. In this article, an improved integrated metric is proposed for the quantitative assessment of resilience. The proposed metric is constructed in the form of a summation of two capacities: absorptive and restorative capacities. A weight coefficient is assigned to each capacity to enhance the flexibility according to various system requirements of stakeholders. In addition, based on the absolute time scale, a new time factor is proposed and incorporated into the resilience metric to quantify the effect of time on system performance. To test the performance of the proposed metric, three experimental studies are conducted wherein the proposed metric is compared with two other metrics reported in the literature. The results indicate that the proposed metric extends the flexibility of the previous metrics to systems where the time scale is addressed, and that the numerical values of resilience lie in a proper range and can be compared conveniently across different engineering systems. Furthermore, an example of an information exchange network is adopted to demonstrate the applicability of the proposed metric.

Keywords

System resilience, integrated metric, quantitative assessment, engineering system, absorptive capacity, restorative capacity

Introduction

The term ‘resilience’ was first proposed by Holling¹ in the study of ecological systems. Since then, the concept of resilience and its evaluation methods have been developed and applied to many real-world complex systems such as national infrastructure facilities,^2–4 ecological systems^5,6 and economic systems.^7,8 In addition, resilience has expanded the library of system attributes such as reliability, robustness, safety and risk. Resilience is commonly studied to assess and improve the capability of a system to bounce back from disruptive events to its normal condition.^9,10 Currently, the development of resilience in complex engineering systems is still in its early stage. Several reported works have made some progress in defining and evaluating resilience for complex engineering systems.

The definitions of resilience reported so far can be classified into two types. One type approaches resilience from a specific disciplinary perspective.⁷ Definitions of this type address aspects of resilience that are specific to particular systems, which makes them appropriate for particular domains but potentially unsuitable for other applications. The other type approaches resilience from a more general perspective.^11–14 Definitions of this type can be applied to various engineering systems. For instance, Vugrin et al.¹¹ defined resilience as a function of the absorptive, adaptive and restorative capacities. The absorptive capacity is the degree to which a system is able to absorb shocks caused by external disruption. The adaptive capacity is the degree to which a system is able to adapt itself temporarily to new disrupted conditions. The restorative capacity is the degree to which a system is able to restore itself if the adaptive capacity is not effective. This definition addresses the abilities of the system to resist disruptive events and ensure timely restoration.^12,14 Note that the adaptive and restorative capacities both refer to recovery activities and that their effects overlap in the recovery phase. Hence, in this study, the adaptive capacity is merged with the restorative capacity to describe the dynamic behaviour of system performance in the system recovery phase.

Compared to the definitions of resilience, the assessment methods for resilience seem to have attracted more attention from engineers and practitioners. In general, resilience assessment methods can be separated into two major categories: qualitative and quantitative methods.¹⁵ The qualitative category includes methods that tend to assess the system resilience without numerical descriptors. For more information on qualitative resilience assessment methods, readers can refer to previous literature.^16–20

The first general quantitative assessment method was proposed by Bruneau et al.,²¹ who aimed to measure the seismic resilience of a community to an earthquake by estimating the expected degradation in the quality of community infrastructure. They proposed a deterministic static metric for measuring the resilience loss of a community to an earthquake. In this pioneering research work, the concept of resilience loss, also known as a ‘resilience triangle’, was proposed, and has been widely utilized as a fundamental guide for quantitative resilience assessment methods.^22–25 Although this approach is utilized for the context of an earthquake, it has the advantage of general applicability and can be extended to many systems.

However, these resilience triangle paradigms are relatively simple and may not be able to represent the dynamic behaviours of various systems. For example, the resilience metric in Bruneau et al.’s study assumes that the normal quality of community infrastructure is at 100%. This assumption is unrealistic because the reference standard is difficult to quantify for practical systems.

Henry and Ramirez-Marquez²⁶ used three system states – stable original state, disrupted state and stable recovered state – to quantify system resilience. The advantage of this metric is that both the disruption and recovery phases are considered. Similar time-dependent metrics have been adopted in other studies.^27–29 However, because these metrics are functions of time, they represent the overall level of system resilience less intuitively than metrics presented as a single numerical value.

Practically, a general resilience metric should satisfy the requirements of both comprehensiveness and universality. Comprehensiveness refers to the ability to capture the essence of system resilience and ensure that the quantification of resilience is consistent with its associated concepts. For instance, a general metric should be able to identify the variations in system performance precisely via its calculated results. Universality refers to the proper form of a resilience metric. For instance, a metric presented as a calculated value is better than a metric presented as a function of time. In addition, a metric whose values lie in a finite closed range such as [0, 1] is preferred to a metric with an unbounded range such as $[0, +\infty)$.

Recently, several studies have investigated the comprehensiveness and universality of resilience metrics. Nan and Sansavini¹³ proposed an integrated metric for assessing the resilience of interdependent infrastructures. Their proposed metric consists of several factors, which are measured from the aspects of the time scale and performance level. Each factor is consistent with the concept of the defined resilience capacities, and the resilience metric has a range of $[0, +\infty)$. Tran et al.^30,31 proposed a novel resilience metric in which each factor is compatibly constructed. An improvement of Tran et al.’s metric over Nan and Sansavini’s is that it has a reference value of 2 when no disruptive event occurs.

However, the resilience metrics proposed by Nan and Sansavini¹³ and Tran et al.^30,31 have two shortcomings. First, their metrics use a relative time scale to model the time factor. This may cause problems because many realistic systems, such as infrastructure systems affected by an earthquake, concern only the absolute time scale. Second, the quantification of the absorptive capacity and that of the restorative capacity overlap each other, which leads to poor performance in scenarios that address the difference between these two capacities. For instance, the absorptive capacity is often emphasized more in an ecological system because the recovery of ecological losses involves huge economic costs and requires a long time, whereas the restorative capacity is more emphasized in a military C2 network because timeliness is essential in the military. Note that these problems also exist in previous resilience metrics.^32–35

Thus, in this study, we aim to develop an improved integrated metric that overcomes the disadvantages existing in previous metrics.^32–35 The major contributions of this article are summarized as follows:

Based on the absolute time scale, a new time factor is proposed and incorporated into the resilience metric to quantify the effect of time on system performance.

The absorptive and restorative capacities are used to quantify the proposed resilience metric in the form of a summation. Two weight coefficients are assigned to the two capacities to enhance the flexibility of the proposed metric according to various system requirements of stakeholders.

Each capacity is quantified from the perspective of state transition because of the dynamic behaviour during the disruption and recovery phases. Three dimensions – transition process, transition time and transition consequence – are used to describe the resilience capacities.

The proposed metric is compared and discussed using experimental scenarios.

A case study on information exchange in a networked engineering system is conducted to validate the proposed metric.

The rest of this article is organized as follows. Section ‘Development of improved integrated metric’ describes the development of the proposed integrated resilience metric. Section ‘Experimental comparison’ presents the comparison of the proposed metric with previously reported ones using three scenarios. Section ‘Case study on information exchange in networked system’ presents a case study on information exchange in a networked system to demonstrate the application of the proposed metric. Section ‘Conclusion’ concludes this article.

Development of improved integrated metric

In this section, we present the development of an improved integrated metric for resilience based on system performance. First, we describe the system performance measure. Second, we present the improved resilience metric.

System performance measure

In this study, an example involving notional system performance data is used to illustrate the construction of the resilience metric, as shown in Figure 1. Because the performance data are notional and can be replaced by real performance data from practical engineering systems, the resilience metric constructed on the basis of such performance data can also be adapted to various systems. The abscissa represents the time t, and the ordinate represents the performance data $y (t)$ . The key time points and performance values illustrated in the figure are defined as follows:

$t_{d}$ = time when the disruption event starts;

$t_{r}$ = time when the recovery action starts;

$t_{s}$ = time when the recovery performance reaches a steady state;

$y_{o}$ = normal operating performance level;

$y_{\min}$ = minimum performance level;

$y_{s}$ = recovered performance level.

Figure 1.

Notional plot of system performance data over time during disruption and recovery phases.

The system performance is assumed to be at a normal operating level before a disruption event occurs. When the disruption event occurs at time $t_{d}$ , the system performance begins to drop until the minimum performance is reached. The time period from $t_{d}$ to $t_{r}$ is regarded as the disruption phase. After time $t_{r}$ , the system begins to recover until a stable level is reached. Similarly, the time period from $t_{r}$ to $t_{s}$ is regarded as the recovery phase. Therefore, the system’s dynamic performance of interest is divided into two phases (disruption and recovery phases) according to the occurrence point of the disruption event and recovery action. Noticeably, the disruption and recovery phases correspond to the absorptive and restorative capacities of system resilience; this correspondence contributes to the convenience of classifying and quantifying the two capacities.

Improved metric

When a disruption event or recovery action is applied to a resilient system, the system will transition from one stable state to another stable one. Normally, three important aspects are considered during the transition period. The first aspect is the manner in which the transition is carried out, which describes the transition process of the system performance. The second aspect is the level to which the system state reaches compared to the initial level. The last aspect is the time taken to finish the transition process. Once these three aspects of the transition period are determined, the absorptive and restorative capacities of the system can be quantified.

By considering these three aspects, both the absorptive and restorative capacities of the system resilience are measured by the process, time and consequence factors. The disruption process factor $δ_{d}$ is used to capture the dynamic behaviour of the system performance during the disruption phase. The disruption consequence factor $σ_{d}$ is used to capture the absorption ability of the system. The disruption time factor $ρ_{d}$ is used to capture the absorption rapidity during the disruption phase. Similarly, the recovery process factor $δ_{r}$ , recovery consequence factor $σ_{r}$ and recovery time factor $ρ_{r}$ are used to capture the dynamic behaviour, recovery ability and recovery rapidity of the system, respectively, during the recovery phase.

After all the factors are determined, an integrated resilience metric is developed to quantify the system resilience as follows

R_{I} = α δ_{d} σ_{d} ρ_{d} + β δ_{r} σ_{r} ρ_{r}

(1)

where $α + β = 1$, $0 \leq α \leq 1$ and $0 \leq β \leq 1$. The products $δ_{d} σ_{d} ρ_{d}$ and $δ_{r} σ_{r} ρ_{r}$ quantify the absorptive and restorative capacities, respectively. The two weight coefficients $α$ and $β$ are assigned to the absorptive and restorative capacities, respectively, according to the requirements of the system administrator. For instance, if a system is designed to have more restorative capacity than absorptive capacity, the system administrator may assign a higher value of $β$. In general, $0 \leq R_{I} < \infty$, and $R_{I}$ has a reference value of 1 for a normal operating scenario, in which the desired performance suffers no loss over time. Typically, $R_{I}$ lies in the range [0, 1]; a scenario where $R_{I} > 1$ can occur only when the system performance surpasses the initial level during the transition period. $R_{I} = 0$ means that $y_{\min}$ and $y_{s}$ are both 0, which indicates that the system breaks down without any recovery action. $R_{I} = 1$ means that the system performance is not disrupted during the whole process. The disruption process factor $δ_{d}$, recovery process factor $δ_{r}$, disruption consequence factor $σ_{d}$, recovery consequence factor $σ_{r}$, disruption time factor $ρ_{d}$ and recovery time factor $ρ_{r}$ are discussed in detail in the following subsections.

Disruption process factor

The disruption process factor $δ_{d}$ accounts for the total performance maintained by a system during the disruption phase. This factor is calculated as follows

δ_{d} = \frac{\int_{t_{d}}^{t_{r}} y (t) d t}{(t_{r} - t_{d}) y_{o}}

(2)

As can be seen, it is measured precisely using the performance data during the disruption phase. Hence, the dynamic behaviour of the system performance during the disruption process can be captured. For instance, consider the three cases shown in Figure 2; when these three cases differ only in the dynamic performance behaviour during the disruption phase, the resilience of the three cases follows the order case 1 > case 2 > case 3, which can be reflected using equation (2).

Figure 2.

Illustration of systems having different performance behaviours during disruption phase.

Disruption consequence factor

The disruption consequence factor $σ_{d}$ accounts for the ability of the system to resist the loss of system performance after a disruption event. This factor is calculated as follows

σ_{d} = \frac{y_{\min}}{y_{o}}

(3)

The minimum performance $y_{\min}$ is commonly used to evaluate the absorption ability.^13,30 Hence, the disruption consequence factor $σ_{d}$ is obtained by normalizing $y_{\min}$ with the desired performance level $y_{o}$, which is used as a benchmark. Noticeably, $y_{\min}$ enters both the disruption process factor and the disruption consequence factor, but multiplying the two factors does not double-count the effect of $y_{\min}$. The disruption process factor is determined by the integral of the performance data over the whole disruption phase; although $y_{\min}$ is included in that integral, it is a single data point and has only a minor effect on the value of the disruption process factor. Therefore, the effect of $y_{\min}$ is reflected mainly in the disruption consequence factor.

Disruption time factor

The disruption process factor and disruption consequence factor can capture the absorptive capacity of system resilience effectively from the perspective of system performance. However, a measure from the perspective of time has not yet been considered, although time is another essential aspect of assessing system resilience. Various measures have been introduced to quantify the effect of time on system resilience. For instance, a time factor has been represented by a function of time^30,34 or incorporated into a function of both time and performance.¹³ Previous studies used the relative time scale to quantify the time factor. The advantage is that the time factor becomes dimensionless. Thus, the factors derived from both the time and performance dimensions can be combined into an integrated metric, in which the performance factor is also dimensionless.

However, the quantification of the time factor in a relative scale may incur problems because many realistic systems are only concerned with the absolute time scale. Figure 3 illustrates the system performance in two cases with different time periods. In case 1, the dynamic performance is carried out in the same manner as that in case 2; however, the time taken in case 1 is only half of that taken in case 2. By fixing the other conditions for both cases, it is natural to conclude that the system performs better in case 1 in terms of system resilience because the resilience is rewarded when the period is shorter in the transition process. However, for metrics using the relative time scale, a conclusion that the two cases have the same resilience would be obtained because the time difference between the cases cannot be reflected. Consequently, metrics that adopt relative time scale factors are not appropriate in such a scenario.

Figure 3.

Illustration of system performance in two cases with different periods.

In response to the above discussion, a novel time factor that is calculated using the absolute time scale is proposed as follows

ρ_{d} = Δ^{B / (t_{r} - t_{d})}

(4)

where $Δ$ is the degradation factor and $B$ is the reference baseline of time. $Δ$ measures the relative importance of the time dimension and lies in the range $0 < Δ \leq 1$. The importance of the time dimension increases as $Δ$ decreases. When $Δ = 1$, no degradation occurs in the time scale, indicating that the length of time has no effect on system resilience, and $ρ_{d}$ has a constant value of 1. The reference baseline $B$ provides a reference time scale according to the requirements of different systems. For instance, $B$ is often counted in hours for infrastructure systems and in months or years for ecosystems. As can be seen, the disruption time factor $ρ_{d}$ is dimensionless and uses the absolute time scale. In addition, $ρ_{d}$ lies in the range between 0 and 1, which makes it convenient to integrate into the resilience metric.
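As a hedged numerical illustration of equation (4), the values $B = 10$, $Δ = 0.9$ and $Δ = 0.5$ below are assumed for demonstration only and are not taken from the experiments. Each row compares a disruption phase lasting 10 time units with one lasting 20 time units:

```latex
\begin{aligned}
\Delta = 0.9:&\quad \rho_{d} = 0.9^{10/10} = 0.9, &\quad \rho_{d} = 0.9^{10/20} \approx 0.949 \\
\Delta = 0.5:&\quad \rho_{d} = 0.5^{10/10} = 0.5, &\quad \rho_{d} = 0.5^{10/20} \approx 0.707
\end{aligned}
```

In both rows $ρ_{d}$ depends on the absolute duration $t_{r} - t_{d}$, and the smaller value of $Δ$ produces the larger change between the two durations, whereas $Δ = 1$ would give $ρ_{d} = 1$ for any duration.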

Recovery process, consequence and time factors

The three aspects of the restorative capacity, that is, the recovery process factor, recovery consequence factor and recovery time factor, are defined in a manner similar to those of the absorptive capacity. The recovery process factor accounts for the total performance maintained by a system through the recovery phase and is given by equation (5). As shown in equation (6), the recovery consequence factor is defined as the ratio of the recovered performance level to the desired performance level. The recovery time factor is defined in a similar manner to the disruption time factor, except that the recovery time $(t_{s} - t_{r})$ is placed in the numerator of the exponent, as shown in equation (7). This adjustment is adopted to reward quick recovery action

δ_{r} = \frac{\int_{t_{r}}^{t_{s}} y (t) dt}{(t_{s} - t_{r}) y_{o}}

(5)

σ_{r} = \frac{y_{s}}{y_{o}}

(6)

ρ_{r} = Δ^{(t_{s} - t_{r}) / B}

(7)

Noticeably, the measures of the restorative and absorptive capacities are almost symmetric. Each capacity is measured in both the time and performance dimensions using three quantified factors. The three quantified factors are all normalized and dimensionless and have the same reference value of one during normal operation. Hence, it is easy to integrate the three factors to quantify the restorative and absorptive capacities. Furthermore, no overlap occurs among the three factors.
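Equations (1) to (7) can be combined in a few lines of code. The following is a minimal sketch rather than the authors' implementation: it assumes that the performance data are sampled at discrete times, that $t_{d}$, $t_{r}$ and $t_{s}$ coincide with sample times, and that the integrals are approximated with the trapezoidal rule. All function and argument names (`integrated_resilience`, `degradation` for $Δ$, `baseline` for $B$) are hypothetical.

```python
def trapezoid(t, y):
    """Trapezoidal-rule integral of samples y taken at times t."""
    return sum((y[i] + y[i + 1]) * (t[i + 1] - t[i]) / 2.0
               for i in range(len(t) - 1))


def window(t, y, start, end):
    """Return the (times, values) of samples with start <= time <= end."""
    pairs = [(ti, yi) for ti, yi in zip(t, y) if start <= ti <= end]
    return [p[0] for p in pairs], [p[1] for p in pairs]


def integrated_resilience(t, y, t_d, t_r, t_s, y_o,
                          alpha, degradation, baseline):
    """Sketch of R_I, equation (1), with beta = 1 - alpha."""
    dis_t, dis_y = window(t, y, t_d, t_r)   # disruption phase
    rec_t, rec_y = window(t, y, t_r, t_s)   # recovery phase
    y_min = min(dis_y + rec_y)              # minimum performance level
    y_s = rec_y[-1]                         # recovered level at t_s

    proc_d = trapezoid(dis_t, dis_y) / ((t_r - t_d) * y_o)   # eq. (2)
    cons_d = y_min / y_o                                     # eq. (3)
    time_d = degradation ** (baseline / (t_r - t_d))         # eq. (4)
    proc_r = trapezoid(rec_t, rec_y) / ((t_s - t_r) * y_o)   # eq. (5)
    cons_r = y_s / y_o                                       # eq. (6)
    time_r = degradation ** ((t_s - t_r) / baseline)         # eq. (7)

    beta = 1.0 - alpha
    return alpha * proc_d * cons_d * time_d + beta * proc_r * cons_r * time_r


# Illustrative data (assumed): linear drop from 100 to 50 over t = 10..20,
# then linear recovery back to 100 over t = 20..30.
t = list(range(31))
y = ([100.0] * 10
     + [100.0 - 5.0 * k for k in range(11)]
     + [50.0 + 5.0 * k for k in range(1, 11)])
print(integrated_resilience(t, y, 10, 20, 30, 100.0,
                            alpha=0.5, degradation=0.9, baseline=10.0))
```

Keeping the six factors as separate named variables mirrors the structure of equation (1) and makes it easy to inspect how each factor contributes for a given choice of $α$.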

Experimental comparison

Three experimental studies are conducted by comparing the proposed metric with two reported resilience metrics by Tran et al.³⁰ and Nan and Sansavini.¹³ They share the following common characteristics:

All the three metrics are derived from the same notional performance data shown in Figure 1.

The absorptive and restorative capacities are considered as measures of system resilience and are quantified using performance data.

All the factors are integrated into a single metric.

Preliminary details of two reported metrics

The two reported metrics are introduced briefly here. Readers can refer to previous literature^13,30 for more details. The resilience metric proposed by Tran et al.³⁰ is given as follows

R_{Tran} = \begin{cases} σ ρ [δ + ζ + 1 - τ^{(ρ - δ)}] & \text{if } ρ - δ \geq 0 \\ σ ρ (δ + ζ) & \text{otherwise} \end{cases}

(8)

where $σ$ , $ρ$ , $δ$ , $ζ$ and $τ$ are the total performance factor, recovery factor, absorption factor, volatility factor and recovery time factor, respectively. The detailed formula for each factor is given as follows³⁰

\begin{matrix} σ = \frac{\sum_{t_{0}}^{t_{final}} y (t)}{y_{o} (t_{final} - t_{0})}, \quad ρ = \frac{y_{s}}{y_{o}}, \quad δ = \frac{y_{\min}}{y_{o}}, \quad τ = \frac{t_{s} - t_{0}}{t_{final} - t_{0}}, \\ ζ = \frac{1}{1 + \exp [- \frac{1}{4} (SNR_{dB} - 15)]} \end{matrix}

The total performance factor $σ$ is calculated using the entire performance data for both the disruption and recovery phases. The recovery factor $ρ$ and absorption factor $δ$ correspond to our recovery consequence factor and disruption consequence factor, respectively. As noted above, the recovery time factor $τ$ is calculated in a relative scale and is different from the absolute time factor proposed in this study. According to the authors’ description, the volatility factor $ζ$ is adopted to measure the ability of a system to provide a smooth transition from a degraded state to a recovered state. For convenience, the volatility factor is not considered in this study.
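Equation (8) can be sketched in code as follows. This is an illustrative reading of the formula, not Tran et al.'s implementation. Two assumptions are made here: the sum in $σ$ is approximated by a trapezoidal integral over the sampled data, and the volatility factor $ζ$ is passed in explicitly as `zeta`, because the text does not state which value is used when it is left out. The function name `tran_resilience` is hypothetical.

```python
def tran_resilience(t, y, t_0, t_s, t_final, y_o, zeta):
    """Sketch of R_Tran, equation (8), for sampled performance data."""
    pairs = [(ti, yi) for ti, yi in zip(t, y) if t_0 <= ti <= t_final]

    # Total performance factor: area under y(t) relative to y_o * duration
    # (trapezoidal approximation of the sum, an assumption of this sketch).
    area = sum((pairs[i][1] + pairs[i + 1][1])
               * (pairs[i + 1][0] - pairs[i][0]) / 2.0
               for i in range(len(pairs) - 1))
    sigma = area / (y_o * (t_final - t_0))

    y_min = min(yi for _, yi in pairs)
    y_s = [yi for ti, yi in pairs if ti <= t_s][-1]   # level at t_s

    rho = y_s / y_o                          # recovery factor
    delta = y_min / y_o                      # absorption factor
    tau = (t_s - t_0) / (t_final - t_0)      # relative recovery time factor

    gap = rho - delta
    if gap >= 0:
        return sigma * rho * (delta + zeta + 1.0 - tau ** gap)
    return sigma * rho * (delta + zeta)


# Undisturbed performance with zeta = 1 (assumed) gives the reference value 2.
t = list(range(31))
print(tran_resilience(t, [100.0] * 31, 0, 15, 30, 100.0, zeta=1.0))
```

Because $τ$ is a ratio of two durations, multiplying every time value by the same constant leaves the result unchanged, which is the relative-time behaviour discussed in the text.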

The resilience metric proposed by Nan and Sansavini¹³ is given as follows

R_{Nan} = δ_{Nan} \times (\frac{RAPI_{RP}}{RAPI_{DP}}) \times {(TAPL)}^{- 1} \times ρ_{Nan}

(9)

where $δ_{Nan}$, $RAPI_{RP}$, $RAPI_{DP}$, $TAPL$ and $ρ_{Nan}$ are the robustness factor, recovery rapidity factor, disruption rapidity factor, time-averaged performance loss factor and recovery ability factor, respectively. The detailed formula for each factor is given as follows¹³

\begin{matrix} δ_{Nan} = y_{\min}, \quad RAPI_{DP} = \frac{y_{o} - y_{\min}}{t_{r} - t_{d}}, \quad RAPI_{RP} = \frac{y_{s} - y_{\min}}{t_{s} - t_{r}}, \\ TAPL = \frac{\int_{t_{d}}^{t_{s}} (y_{o} - y (t)) dt}{(t_{s} - t_{d})}, \quad ρ_{Nan} = \frac{y_{s} - y_{\min}}{y_{o} - y_{\min}} \end{matrix}

As can be seen, the robustness factor $δ_{Nan}$ and recovery ability factor $ρ_{Nan}$ correspond to our disruption consequence factor and recovery consequence factor, respectively. The time-averaged performance loss factor $TAPL$, which is calculated using the performance data for both the disruption and recovery phases, is similar to the total performance factor proposed by Tran et al. The rapidity factors $RAPI_{DP}$ and $RAPI_{RP}$ both include time and performance dimensions and partially overlap with the robustness factor $δ_{Nan}$ and recovery ability factor $ρ_{Nan}$.
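A comparable sketch of equation (9) is given below. It is an illustrative reading of the formula rather than Nan and Sansavini's implementation, and it assumes sampled data with $t_{d}$, $t_{r}$ and $t_{s}$ on sample times, a trapezoidal approximation of the integral in $TAPL$, and a scenario with a real disruption (otherwise $RAPI_{DP}$ and $TAPL$ are zero and the ratio is undefined). The function name `nan_sansavini_resilience` is hypothetical.

```python
def nan_sansavini_resilience(t, y, t_d, t_r, t_s, y_o):
    """Sketch of R_Nan, equation (9), for sampled performance data."""
    pairs = [(ti, yi) for ti, yi in zip(t, y) if t_d <= ti <= t_s]
    y_min = min(yi for _, yi in pairs)   # robustness factor (delta_Nan)
    y_s = pairs[-1][1]                   # recovered level at t_s

    rapi_dp = (y_o - y_min) / (t_r - t_d)   # disruption rapidity
    rapi_rp = (y_s - y_min) / (t_s - t_r)   # recovery rapidity

    # Time-averaged performance loss over [t_d, t_s] (trapezoidal rule).
    loss_area = sum(((y_o - pairs[i][1]) + (y_o - pairs[i + 1][1]))
                    * (pairs[i + 1][0] - pairs[i][0]) / 2.0
                    for i in range(len(pairs) - 1))
    tapl = loss_area / (t_s - t_d)

    rho_nan = (y_s - y_min) / (y_o - y_min)   # recovery ability
    return y_min * (rapi_rp / rapi_dp) / tapl * rho_nan


# Illustrative normalized data (assumed): drop from 1.0 to 0.5 over
# t = 10..20, then full recovery over t = 20..30.
t = list(range(31))
y = ([1.0] * 10
     + [1.0 - 0.05 * k for k in range(11)]
     + [0.5 + 0.05 * k for k in range(1, 11)])
print(nan_sansavini_resilience(t, y, 10, 20, 30, 1.0))
```

For this symmetric example the value is about 2, which illustrates that the metric is not confined to [0, 1]. Note also that $δ_{Nan} = y_{\min}$ is not normalized by $y_{o}$, so the result depends on the units of $y$.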

In general, the comparisons of the three resilience metrics in terms of the time, process and consequence factors are summarized in Table 1. Our proposed metric mainly differs in the time and process factors. Instead of the relative time scale, the absolute time scale is adopted for the time factor in this study. The process factor of Tran et al.³⁰ and Nan and Sansavini¹³ is measured using the entire performance data for both the phases. In contrast, two process factors are proposed in this study to measure the dynamic behaviour of system performance separately for the disruption and recovery phases.

Table 1.

Comparison of the three resilience metrics.

	Time factor	Process factor	Consequence factor
Metric by Nan and Sansavini	Relative scale	Integration	Division
Metric by Tran et al.	Relative scale	Integration	Division
Proposed metric	Absolute scale	Division	Division

Comparison of metrics in three experimental scenarios

Scenario 1

Scenario 1 considers the two cases described in section ‘Disruption time factor’ (Figure 3). In case 1, the dynamic performance is carried out in the same manner as that in case 2; however, the time taken in case 1 is only half of that taken in case 2. The performance data for cases 1 and 2 are created using the same logistic function used by Tran et al.³⁰ A disruption with no recovery action for the system is represented using notional performance data generated by the following equation

y = A_{1} + \frac{K_{1} - A_{1}}{1 + \exp [B_{1} (t - x_{1})]}

(10)

where $K_{1}$, $A_{1}$, $B_{1}$ and $x_{1}$ represent the desired performance level, minimum performance, inflection steepness and location of the inflection point on the $t$-axis, respectively. A recovery action for the system is represented using notional performance data generated by the following equation

y = A_{2} + \frac{K_{2} - A_{2}}{1 + \exp [- B_{2} (t - x_{2})]}

(11)

where a negative sign is added in front of $B_{2}$ to create an increasing function and $K_{2}$ represents the recovered performance level. Based on equations (10) and (11), the notional performance data including those for both the disruption and recovery phases can be generated. The smooth transition between the degradation and recovery portions of the data is achieved by adjusting $A_{1}$ .

Logistic function parameters are used to change the shape of the performance data. Considering that the disruption and recovery actions are the same in this scenario, only the parameter of the recovered level $K_{2}$ is varied in case 1, with the remaining parameters being kept constant ( $x_{1} = 25$ , $x_{2} = 70$ , $B_{1} = 0.5$ , $A_{2} = 20$ ). Case 2 uses the same parameters to generate the performance data, except that the parameters pertaining to the time scale are doubled, that is, $x_{1} = 50$ and $x_{2} = 140$ . The generated performance data for cases 1 and 2 are illustrated in Figure 4.
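The two logistic curves of equations (10) and (11) can be generated as in the sketch below. The values $x_{1} = 25$, $x_{2} = 70$, $B_{1} = 0.5$ and $A_{2} = 20$ are those quoted for case 1; everything else is an assumption made only for illustration: $K_{1} = 100$, $K_{2} = 80$, $B_{2} = 0.5$, $A_{1} = A_{2}$, and a hypothetical switch time `t_switch = 50` at which the data change from the disruption curve to the recovery curve. The text does not specify how the two portions are joined, so the join used here is not the authors' procedure.

```python
import math


def disruption_curve(t, K1, A1, B1, x1):
    """Equation (10): decreasing logistic from K1 down to A1."""
    return A1 + (K1 - A1) / (1.0 + math.exp(B1 * (t - x1)))


def recovery_curve(t, K2, A2, B2, x2):
    """Equation (11): increasing logistic from A2 up to K2."""
    return A2 + (K2 - A2) / (1.0 + math.exp(-B2 * (t - x2)))


def notional_performance(times, K1, K2, A, B1, B2, x1, x2, t_switch):
    """Join the two curves at t_switch (an assumption of this sketch).

    A single floor value A is used for both A1 and A2 so that the two
    portions meet at nearly the same level.
    """
    return [disruption_curve(t, K1, A, B1, x1) if t < t_switch
            else recovery_curve(t, K2, A, B2, x2)
            for t in times]


# x1, x2, B1 and A2 follow the text; K1, K2, B2 and t_switch are assumed.
y = notional_performance(range(101), K1=100.0, K2=80.0, A=20.0,
                         B1=0.5, B2=0.5, x1=25.0, x2=70.0, t_switch=50.0)
print(round(y[0], 3), round(min(y), 3), round(y[-1], 3))
```

Varying `K2` while holding the other arguments fixed reproduces the kind of family of curves described for case 1.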

Figure 4.

Notional performance data generated by equations (10) and (11) for different recovered levels $K_{2}$ .

In case 1, the performance data of the six scenarios during the disruption phase (0–50 time steps) are generated with the same equation (10). As a result, the performance lines with different markers coincide, and the overlapping markers produce the ‘hexagon star’ appearance in the figure. The same occurs in case 2 during the disruption phase (0–70 time steps).

Figure 5 shows the calculated results of the three metrics for cases 1 and 2 with different recovered performance level $K_{2}$ . $R_{Tran}$ , $R_{Nan}$ and $R_{I}$ increase with $K_{2}$ in both the cases, indicating that the system resilience improves as the system performance is recovered to a higher level. For metrics $R_{Tran}$ and $R_{Nan}$ , the resilience results for cases 1 and 2 are almost the same, indicating that the time scale has no effect on the system resilience using the two reported metrics. However, for the proposed metric $R_{I}$ , the value for case 2 is smaller than that for case 1, indicating that the system resilience decreases as the time increases.

Figure 5.

Resilience results for cases 1 and 2 obtained using the three resilience metrics.

As previously mentioned, resilience is rewarded when the time period is shorter in the transition process. When the other conditions are kept constant for both the cases, the system performs better in case 1 in terms of the system resilience. Thus, the relative time scale factor in $R_{Tran}$ and $R_{Nan}$ is not able to capture this trend. In contrast, the absolute time scale factor incorporated in the proposed metric overcomes this drawback.
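This behaviour can be checked directly from the factor definitions. The sketch below assumes that case 2 is an exact time-stretched copy of case 1, that is, $y_{2}(t) = y_{1}(t/c)$ with every time point multiplied by a constant $c$ (here $c = 2$); the symbol $c$ and the subscripts 1 and 2 are introduced only for this illustration.

```latex
\begin{aligned}
\delta_{d,2} &= \frac{\int_{c t_{d}}^{c t_{r}} y_{1}(t/c)\, dt}{c\,(t_{r} - t_{d})\, y_{o}}
             = \frac{c \int_{t_{d}}^{t_{r}} y_{1}(u)\, du}{c\,(t_{r} - t_{d})\, y_{o}} = \delta_{d,1}, \\
\tau_{2} &= \frac{c\, t_{s} - c\, t_{0}}{c\, t_{final} - c\, t_{0}} = \tau_{1}, \qquad
\frac{RAPI_{RP,2}}{RAPI_{DP,2}} = \frac{RAPI_{RP,1} / c}{RAPI_{DP,1} / c} = \frac{RAPI_{RP,1}}{RAPI_{DP,1}}, \\
\rho_{d,2} &= \Delta^{B / (c\,(t_{r} - t_{d}))} = \rho_{d,1}^{\,1/c}, \qquad
\rho_{r,2} = \Delta^{c\,(t_{s} - t_{r}) / B} = \rho_{r,1}^{\,c}.
\end{aligned}
```

Under this assumption the process factors, the relative time factor $τ$ and the rapidity ratio are unchanged by the stretch (exactly for integrals, approximately for sampled sums), so only the absolute time factors $ρ_{d}$ and $ρ_{r}$ in $R_{I}$ respond to the change in duration.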

Scenario 2

Scenario 2 considers three cases for the performance data, as illustrated in Figure 6. The desired performance level $y_{o}$ , minimum performance level $y_{\min}$ , recovered performance level $y_{s}$ , initial disruption time $t_{d}$ , initial recovery time $t_{r}$ and time to reach the steady performance level $t_{s}$ are all the same for the three cases. The only difference lies in the dynamic process during the disruption and recovery phases.

Figure 6.

Performance data for three cases where parameters ${y_{o}, y_{\min}, y_{s}, t_{d}, t_{r}, t_{s}}$ are the same.

Table 2 lists the resilience results calculated by the three metrics. The results of $R_{Tran}$ and $R_{Nan}$ are the same for the three cases. The metric $R_{I}$ provides various resilience values for the three cases under varying weight coefficients.

Table 2.

Resilience calculation for scenario 2 using the three metrics.

	$R_{Tran}$	$R_{Nan}$	$R_{I}$
			$α = 0.8, β = 0.2$	$α = 0.5, β = 0.5$	$α = 0.2, β = 0.8$
Case 1	0.554	0.672	0.350	0.414	0.479
Case 2	0.554	0.672	0.334	0.453	0.572
Case 3	0.554	0.672	0.320	0.493	0.665

For the metric $R_{Tran}$ in the three cases, the recovery factor $ρ$, absorption factor $δ$ and recovery time factor $τ$ are equal because parameters ${y_{o}, y_{\min}, y_{s}, t_{d}, t_{r}, t_{s}}$ are the same. As the sum of the performance data for the three cases is the same in this scenario, the resilience values calculated by the metric $R_{Tran}$ are equal too. Similarly, the above analysis also applies to the results obtained by $R_{Nan}$. Hence, it can be concluded that the dynamic process presented in Figure 6 cannot be properly captured by the metrics $R_{Tran}$ and $R_{Nan}$.

For the metric $R_{I}$, the absorptive and restorative capacities are clearly distinguished and represented by $δ_{d} σ_{d} ρ_{d}$ and $δ_{r} σ_{r} ρ_{r}$, respectively. The performance data during the disruption and recovery phases are quantified separately by the process factors $δ_{d}$ and $δ_{r}$, respectively, which differ from the total performance factor $σ$ used in metric $R_{Tran}$ and the time-averaged performance loss factor $TAPL$ used in metric $R_{Nan}$. Hence, the proposed metric can provide different resilience values that correctly reflect the differences in the absorptive and restorative capacities among the three cases. In addition, separating these two capacities enables the addition of weights $(α, β)$ to reflect the different importance of the two capacities. For instance, when values $(α = 0.8, β = 0.2)$ are set (indicating that the absorptive capacity is more important than the restorative capacity), the order case 1 > case 2 > case 3 is obtained. This result agrees with the inference from the performance data that case 1 performs the best during the disruption phase. When the values $(α = 0.2, β = 0.8)$ are set (indicating that the absorptive capacity is less important than the restorative capacity), the order case 1 < case 2 < case 3 is obtained. This result agrees with the inference from the performance data that case 3 performs the best during the recovery phase.

Note that when $α$ and $β$ both equal 0.5, this does not indicate that the absorptive capacity and restorative capacity are treated equally. This phenomenon can be explained as follows. Although the areas under the performance lines are almost the same, the performance line in the disruption phase (recovery phase) is not linearly related to the absorptive capacity (restorative capacity). As a result, the resilience values, which are a summation of the absorptive capacity and restorative capacity, are different for the three cases.

Scenario 3

Scenario 3 considers the performance data for four cases, as illustrated in Figure 7. For cases 1–3, all factors except the initial moment of recovery time are the same. Case 4 undergoes the same disruption phase as case 2 and the same recovery phase as case 1. The disruption time for cases 1–3 follows the order case 1 < case 2 < case 3, indicating that the absorptive capacity for case 3 is the highest. Similarly, case 3 has the highest restorative capacity among cases 1–3 because it takes the least time.

Figure 7.

Performance data for four cases where the parameters $\{y_{\min}, y_{s}\}$ are the same.

Table 3 lists the resilience results calculated by the three metrics. The resilience values obtained by the metric $R_{Tran}$ remain the same for cases 1–3. This metric fails to represent the dynamic process as long as the summation of the performance data remains constant. Assigning the same resilience value to these cases is inappropriate for systems in which the absorptive and restorative capacities are treated differently.

Table 3.

Resilience calculation for four cases using three metrics.

	$R_{Tran}$	$R_{Nan}$	$R_{I}$
			$α = 0.8, β = 0.2$	$α = 0.5, β = 0.5$	$α = 0.2, β = 0.8$
Case 1	0.554	0.209	0.254	0.385	0.517
Case 2	0.554	0.672	0.334	0.453	0.572
Case 3	0.554	2.167	0.373	0.496	0.619
Case 4	0.471	0.422	0.325	0.431	0.536

For the metric $R_{Nan}$, the resilience follows the order case 1 < case 4 < case 2 < case 3. The metric value for case 3 is the maximum because this case takes the most time to reach the lowest performance level and the least time to recover to the desired performance level. Similarly, the metric value for case 1 is the minimum because this case takes the least time to reach the lowest performance level and the longest time to recover to the desired performance level. In general, the metric $R_{Nan}$ reflects the trends of the four cases in terms of resilience evaluation. However, the recovery rapidity factor $RAPI_{RP}$ and disruption rapidity factor $RAPI_{DP}$ are integrated into the metric $R_{Nan}$ in the form of a ratio, which can push the values beyond the range [0, 1] and is disadvantageous for resilience evaluation. For instance, case 3 obtains a resilience value of 2.167.

$R_{I}$ overcomes the drawbacks of $R_{Tran}$ and $R_{Nan}$ by correctly reflecting the trend of the four cases, as $R_{Nan}$ does, while remaining within the range [0, 1]. Moreover, the settings $α = 0.8, β = 0.2$ and $α = 0.2, β = 0.8$ shift the emphasis between the two capacities, and both sets of results agree with the trend shown in Figure 7.
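The role of the weights can be checked directly against Table 3. The sketch below assumes that $R_{I}$ is a plain weighted sum, $R_{I} = α A + β B$, where $A$ and $B$ stand for the absorptive and restorative capacities; this form is inferred from the description of the metric as a weighted summation of two capacities, and the names `TABLE_3` and `implied_capacities` are hypothetical. Under that assumption, the two outer weight settings determine $A$ and $B$, and the middle column should equal their average up to the three-decimal rounding of the table.

```python
# R_I values copied from Table 3 for (alpha, beta) = (0.8, 0.2), (0.5, 0.5), (0.2, 0.8).
TABLE_3 = {
    "case 1": (0.254, 0.385, 0.517),
    "case 2": (0.334, 0.453, 0.572),
    "case 3": (0.373, 0.496, 0.619),
    "case 4": (0.325, 0.431, 0.536),
}


def implied_capacities(r_a, r_b, w_a=(0.8, 0.2), w_b=(0.2, 0.8)):
    """Solve a1*A + b1*B = r_a and a2*A + b2*B = r_b for (A, B) by Cramer's rule.

    Assumption: R_I is linear in the weights, R_I = alpha*A + beta*B.
    """
    (a1, b1), (a2, b2) = w_a, w_b
    det = a1 * b2 - a2 * b1
    absorptive = (r_a * b2 - r_b * b1) / det
    restorative = (a1 * r_b - a2 * r_a) / det
    return absorptive, restorative


for case, (r_80_20, r_50_50, r_20_80) in TABLE_3.items():
    absorptive, restorative = implied_capacities(r_80_20, r_20_80)
    predicted_mid = 0.5 * absorptive + 0.5 * restorative
    print(f"{case}: A={absorptive:.3f}, B={restorative:.3f}, "
          f"predicted R_I(0.5, 0.5)={predicted_mid:.4f}, tabulated={r_50_50}")
```

If the assumed linear form holds, the predicted and tabulated middle-column values should differ only by rounding, and the implied capacities show which of the two terms drives the ordering of the cases under each weight setting.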

Case study on information exchange in networked system

The application of the proposed resilience metric is demonstrated using a model of information exchange in a networked system. Many networked systems such as the Internet, World Wide Web, organizational networks and social networks rely on information exchange to facilitate the overall performance of the system. However, the connectivity can be vulnerable to disruptions, leading to node and/or link failures as well as degradation in system performance. In this study, a model proposed by Dodds et al.³⁶ to study organizational networks is used to simulate information exchange in a network. This information exchange model has also been adopted by Tran et al.,³⁰ who define network disruptions as node removal events and recovery strategies as link rewiring. Applying the proposed resilience metric to this problem involves three parts: the system description, the analysis of potential disruptions and recovery actions, and the system performance measurement.

System description of information exchange in networks of interest

The system of interest can be represented as a network, which consists of nodes and links. Nodes are individual members within a system, and links are potential paths of information flow. A model based on the one adopted in previous studies^30,36 is used to simulate information exchange in a network. The goal of the networked system is to successfully enable information sharing between nodes during the operation period.

A scale-free network is adopted for network topology initialization because such networks exist in many real-world complex systems.^37,38 The scale-free topology is created using the Barabási–Albert (BA) preferential attachment model.³⁹ The BA model begins with a small number of initially connected nodes ($m_{0}$). The network grows by adding one node at each step, where each added node links with $m$ existing nodes in the network. The probability $p(k_{i})$ of a node with degree $k_{i}$ being linked is proportional to the degree of that node according to the following equation

$$p(k_{i}) = \frac{k_{i}}{\sum_{j} k_{j}} \tag{12}$$
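The growth rule of equation (12) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the simulation code used in the study: the function name `barabasi_albert`, the fully connected seed of $m_{0}$ nodes and the redrawing of repeated targets are all choices made here for the example.

```python
import random


def barabasi_albert(n, m0, m, rng):
    """Grow an undirected network of n nodes by preferential attachment.

    Assumptions: the m0 seed nodes form a complete graph (needs m0 >= 2),
    and each new node links to m distinct existing nodes (needs m <= m0).
    Returns an adjacency mapping {node: set of neighbours}.
    """
    adj = {i: set() for i in range(m0)}
    for i in range(m0):
        for j in range(i + 1, m0):
            adj[i].add(j)
            adj[j].add(i)

    # Each node appears in `stubs` once per incident link, so a uniform draw
    # from `stubs` selects node i with probability k_i / sum_j k_j.
    stubs = [node for node, nbrs in adj.items() for _ in nbrs]

    for new in range(m0, n):
        targets = set()
        while len(targets) < m:  # redraw until m distinct targets are found
            targets.add(rng.choice(stubs))
        adj[new] = set()
        for target in targets:
            adj[new].add(target)
            adj[target].add(new)
            stubs.extend([new, target])
    return adj


network = barabasi_albert(n=100, m0=3, m=2, rng=random.Random(0))
degrees = sorted((len(nbrs) for nbrs in network.values()), reverse=True)
print("largest degrees:", degrees[:5], "smallest degree:", degrees[-1])
```

Redrawing duplicates makes the selection only approximately proportional to degree when $m > 1$, which is a common simplification in implementations of this model.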

Information exchange is realized by passing messages from the source node to the sink node. Each node in the network creates a new message with probability $μ$ at every time step in the simulation. The node creating the message is designated as the source node. The sink node of the message is randomly selected from the set of nodes in the network. Once messages are created, they are forwarded along the shortest path in the network to their sink nodes. The shortest path is computed using the Dijkstra algorithm.⁴⁰ Messages are passed from one node to a neighbouring node in a single time step. Each node is assumed to have complete knowledge of the current network topology, allowing nodes to determine the shortest path from themselves to another node in the network. As stated by Tran et al.,³⁰ these messages are not strictly designated and can represent any type of information flow. This model is not validated by empirical data but is meant to investigate the fundamental behaviours of networked systems and provide possible methods to improve network resilience. The limitation of such an abstraction of networked systems is that additional considerations of higher fidelity and domain-specific models are required to provide actionable recommendations to decision makers.
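One time step of this message-passing model might look as follows. The sketch is hypothetical: the names `shortest_path` and `step` are invented here, links are given unit weight, the sink is drawn from nodes other than the source, and a message whose current node or sink has been removed, or whose sink is unreachable, is simply dropped. None of these details is specified in the text above.

```python
import heapq
import random


def shortest_path(adj, source, sink):
    """Dijkstra's algorithm with unit link weights; returns the node list or None."""
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == sink:
            break
        if d > dist[u]:
            continue  # stale heap entry
        for v in adj[u]:
            if d + 1 < dist.get(v, float("inf")):
                dist[v] = d + 1
                prev[v] = u
                heapq.heappush(heap, (d + 1, v))
    if sink not in dist:
        return None
    path = [sink]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


def step(adj, messages, mu, rng):
    """Advance one time step.

    `messages` is a list of (current_node, sink) pairs. Returns the messages
    still in transit and the number delivered in this step, i.e. y(t).
    """
    delivered = 0
    in_transit = []
    for node, sink in messages:  # forward every message by one hop
        if node not in adj or sink not in adj:
            continue  # assumption: messages tied to removed nodes are dropped
        path = shortest_path(adj, node, sink)
        if path is None:
            continue  # assumption: undeliverable messages are dropped
        if path[1] == sink:
            delivered += 1
        else:
            in_transit.append((path[1], sink))
    for node in adj:  # each node creates a message with probability mu
        if rng.random() < mu:
            others = [v for v in adj if v != node]
            if others:
                in_transit.append((node, rng.choice(others)))
    return in_transit, delivered
```

Recomputing the path at every hop reflects the stated assumption that each node knows the current topology, so a message is rerouted automatically after a node removal or a link rewiring.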

Analysis of disruption and recovery action in networks of interest

Node removals are considered as disruption events. Nodes can be removed uniformly at random, that is, random node failures, or in a targeted manner, that is, intentional network attacks. Targeted node removal is based on the node degree; nodes with the highest node degrees are removed each time. Node degrees are recalculated following any changes to the network topology.
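The two disruption types can be sketched as follows. The helper names are hypothetical, the network is stored as an adjacency mapping `{node: set of neighbours}`, and ties between equally connected nodes are broken by dictionary order, which is an assumption since the text does not state a tie-breaking rule.

```python
import random


def remove_node(adj, node):
    """Delete a node and every link attached to it."""
    for nbr in adj.pop(node):
        adj[nbr].discard(node)


def remove_targeted(adj, r):
    """Remove r nodes, always taking the node with the highest current degree.

    The degree is re-evaluated after every single removal, so the choice
    tracks the changing topology.
    """
    removed = []
    for _ in range(min(r, len(adj))):
        target = max(adj, key=lambda v: len(adj[v]))  # first maximum on ties
        remove_node(adj, target)
        removed.append(target)
    return removed


def remove_random(adj, r, rng):
    """Remove r nodes chosen uniformly at random."""
    removed = rng.sample(list(adj), min(r, len(adj)))
    for node in removed:
        remove_node(adj, node)
    return removed


# Small example: node 0 is a hub, so the targeted attack removes it first.
example = {0: {1, 2, 3, 4}, 1: {0, 2}, 2: {0, 1}, 3: {0}, 4: {0}}
print(remove_targeted(example, 1), example)
```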

The recovery actions considered in this study focus on link rewiring, wherein nodes affected by a disruption rewire any links disconnected by the disruption. Two recovery strategies are considered: random rewiring and preferential attachment. In random rewiring, disconnected nodes randomly decide whom they rewire to. In preferential attachment, disconnected nodes decide whom they rewire based on the probability equation given by equation (12), which gives preference to highly connected nodes.
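A minimal sketch of the two rewiring strategies is given below. The function name `rewire`, the exclusion of existing neighbours from the candidate list and the fall-back to a uniform choice when every candidate has degree zero are assumptions made for this illustration.

```python
import random


def rewire(adj, node, strategy, rng):
    """Create one new link for `node` and return the chosen partner (or None).

    strategy = "random":       partner drawn uniformly from the candidates.
    strategy = "preferential": partner drawn with probability proportional
                               to its current degree, as in equation (12).
    """
    candidates = [v for v in adj if v != node and v not in adj[node]]
    if not candidates:
        return None
    if strategy == "random":
        partner = rng.choice(candidates)
    elif strategy == "preferential":
        weights = [len(adj[v]) for v in candidates]
        if sum(weights) > 0:
            partner = rng.choices(candidates, weights=weights, k=1)[0]
        else:
            partner = rng.choice(candidates)  # assumption: uniform fall-back
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    adj[node].add(partner)
    adj[partner].add(node)
    return partner


# Node 2 lost its only link; under preferential attachment it can only
# reconnect to node 0 or node 1, because node 3 has degree zero.
network = {0: {1}, 1: {0}, 2: set(), 3: set()}
print(rewire(network, 2, "preferential", random.Random(0)))
```

Calling `rewire` once per lost link, in the order in which the links were lost, would be one way to process the rewiring requests.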

Node removals and recovery actions are implemented within discrete time steps. For instance, a disruption event lasting from time $t_{d}$ to $t_{r}$ may include several instances of node removals. The occurrence frequency of node removals, time point of occurrence and number of nodes removed for each occurrence can be adjusted.

System performance measurement for information exchange networks

The system performance $y (t)$ is measured as the total number of messages received in a network at each time step in a simulation. The system performance data are then used to calculate the system resilience using the proposed resilience metric.

Six scenarios are considered in this case, as listed in Table 4. The initial network topologies are all scale-free, and each network is subjected to a disruption event during time steps 50–70. Node removal occurs every five time steps during the disruption phase, with $r$ nodes being removed each time. In scenarios with a recovery action, the recovery phase spans time steps 90–110. Link rewiring occurs every five time steps during the recovery phase, with disconnected links being rewired in chronological order, that is, the links disconnected earliest are rewired first. Because the simulation is stochastic, 100 repetitions are run for each scenario. The stochasticity results from randomness in message generation, the BA algorithm, node removals and link rewiring. Hence, statistical analysis of the output simulation data is carried out to characterize the resulting uncertainty in system performance.

Table 4.

Resilience of information exchange networks under different disruption and recovery strategies.

	Disruption type	Recovery action	$R_{I}$
			Case 1	Case 2	Case 3	Case 4
Scenario 1	Degree-based	Preferential	0.463	0.2434	0.2558	0.258
Scenario 2	Degree-based	Random	0.4667	0.2485	0.2614	0.3057
Scenario 3	Degree-based	None	0.331	0.0441	0.0471	0.0475
Scenario 4	Random	Preferential	0.6959	0.4216	0.4526	0.4550
Scenario 5	Random	Random	0.7013	0.4421	0.4545	0.4686
Scenario 6	Random	None	0.6639	0.2693	0.2777	0.2967

Figure 8 shows the mean performance data for each of the considered scenarios. The initial network topologies are created using the BA model, with $N = 100$ nodes, $m_{0} = 3$ and $m = 2$. In this case, the number of nodes removed at each removal is held constant at $r = 5$ throughout the removal period. As can be seen, networks subjected to degree-based removals experience larger degradation in performance than those subjected to random removals. Compared to scenarios without a recovery action, scenarios with random and preferential rewiring can restore the performance to its normal level in a short time.

Figure 8.

Case 1: Mean performance data for networks subjected to (a) degree-based (Degree) attacks and (b) random attacks. Two adaptation strategies are considered – random (Rand) and preferential attachment (Pref) – in addition to no recovery action (None).

Different values of the parameter $r$ are considered to study the effect of the number of removed nodes on the performance of the information exchange network. Three other cases of node removal are adopted: in case 2, $r$ increases linearly with the occurrence frequency of node removal, with an initial value of $r = 6$; in case 3, $r$ is kept constant at 10 during the removal process; and in case 4, $r$ decreases linearly with the occurrence frequency of node removal, with an initial value of $r = 14$. The total number of removed nodes during the disruption period is equal to 50 in each of the three cases. The performance results for these three cases are shown in Figures 9–11.
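The removal schedules can be written out explicitly. The sketch below assumes five removal events at time steps 50, 55, 60, 65 and 70, which is consistent with 25 nodes being removed in total when $r = 5$. The increment of two nodes per event in the linear cases is not stated in the text; it is inferred from the initial values and the total of 50 removed nodes. The function name `removal_schedule` is hypothetical.

```python
def removal_schedule(r0, increment, start=50, stop=70, interval=5):
    """Return (time step, nodes removed) pairs for one disruption phase.

    Assumption: removals happen at start, start + interval, ..., stop.
    """
    times = range(start, stop + 1, interval)
    return [(t, r0 + i * increment) for i, t in enumerate(times)]


cases = {
    "case 1 (constant r = 5)": removal_schedule(5, 0),
    "case 2 (increasing from 6)": removal_schedule(6, 2),
    "case 3 (constant r = 10)": removal_schedule(10, 0),
    "case 4 (decreasing from 14)": removal_schedule(14, -2),
}
for name, schedule in cases.items():
    total = sum(r for _, r in schedule)
    print(f"{name}: {[r for _, r in schedule]} -> total {total}")
```

Under these assumptions, cases 2–4 remove the same number of nodes overall and differ only in when the heaviest removals occur within the disruption phase.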

Figure 9.

Case 2: Linear increase in number of removed nodes $r$ with occurrence frequency of node removal, with initial value of $r_{0} = 6$. All other conditions remain the same as those in case 1.

Figure 10.

Case 3: Constant number of removed nodes ($r = 10$) during removal period. All other conditions remain the same as those in case 1.

Figure 11.

Case 4: Linear decrease in number of removed nodes $r$ with occurrence frequency of node removal, with initial value of $r_{0} = 14$. All other conditions remain the same as those in case 1.

Figures 9–11 show that although the performance data during the disruption period differ among the three cases, it is difficult to determine from the performance plots which case has the highest resilience. Hence, the proposed resilience metric is used for further quantitative analysis. Table 4 lists the resilience values calculated by metric $R_{I}$. The resilience for $r = 5$ (i.e. case 1) is considerably higher than that for the other three cases. This is because during the entire disruption phase, only 25 nodes are removed in case 1, whereas 50 nodes are removed in cases 2–4. For cases 2–4, the resilience in each scenario follows the order case 2 < case 3 < case 4, indicating that the case in which $r$ decreases linearly with the occurrence frequency of node removal has the highest resilience.
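The ordering stated above can be confirmed mechanically from the tabulated values. The short check below copies the $R_{I}$ values from Table 4; the names `TABLE_4` and `ordering_holds` are hypothetical.

```python
# R_I values from Table 4, listed as (case 1, case 2, case 3, case 4).
TABLE_4 = {
    "Scenario 1": (0.463, 0.2434, 0.2558, 0.258),
    "Scenario 2": (0.4667, 0.2485, 0.2614, 0.3057),
    "Scenario 3": (0.331, 0.0441, 0.0471, 0.0475),
    "Scenario 4": (0.6959, 0.4216, 0.4526, 0.4550),
    "Scenario 5": (0.7013, 0.4421, 0.4545, 0.4686),
    "Scenario 6": (0.6639, 0.2693, 0.2777, 0.2967),
}


def ordering_holds(values):
    """True if case 2 < case 3 < case 4 < case 1 for one scenario."""
    case1, case2, case3, case4 = values
    return case2 < case3 < case4 < case1


for scenario, values in TABLE_4.items():
    print(scenario, ordering_holds(values))
```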

Conclusion

This article presents an improved metric for the quantitative assessment of system resilience based on the system absorptive capacity and system restorative capacity. Each capacity is evaluated from the perspective of state transition and quantified by the time, process and consequence factors. The proposed metric has three advantages. First, a new time factor is proposed and incorporated into the resilience metric to quantify the effect of time on system performance. Second, system resilience is classified into two capacities based on the classification of the disruption and restoration phases. Two weight coefficients are assigned to the two capacities, which enhance the flexibility of the proposed metric when the stakeholder has different requirements for the absorptive and restorative capacities. Third, the numerical values of resilience under the proposed metric lie in a proper range and can be compared conveniently across different engineering systems. Three scenarios are used to validate the performance of the proposed metric, wherein the metric is compared with two previously reported metrics. In addition, an example of information exchange networks is used to demonstrate the application of the proposed metric.

The proposed resilience metric does not consider cost–benefit analysis or threat probabilities. For example, given knowledge of specific threat likelihoods, a system designer cannot use the assessments produced by our metric to determine whether enhancing the absorptive capacity or the restorative capacity would improve the system resilience more economically. In addition, the proposed metric may cause confusion owing to the summation form of the two capacities and the imbalance of the two weight coefficients. These aspects will be dealt with in future studies by improving the proposed metric.

Footnotes

Handling Editor: James Baldwin

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship and/or publication of this article: This work was supported by the National Natural Science Foundation of China (grant numbers: 71701207 and 51705526), the National Defence Science and Technology Project Fund of the Central Military Commission (grant number: 3101097) and Science & Technology on Reliability & Environmental Engineering Laboratory.

ORCID iD

Congcong Cheng

References

1. Holling. Resilience and stability of ecological systems. Annu Rev Ecol Syst 1973; 4: 1–23.

2. Panteli, Trakas, Mancarella, et al. Power systems resilience assessment: hardening and smart operational enhancement strategies. P IEEE 2017; 105: 1202–1213.

3. Kwasinski. Quantitative model and metrics of electrical grids’ resilience evaluated at a power distribution level. Energies 2016; 9: 93.

4. Zhao, Liu, Zhuo. Hybrid hidden Markov models for resilience metric in a dynamic infrastructure system. IFAC PapersOnline 2016; 49: 343–348.

5. Biggs, Schlüter, Biggs, et al. Toward principles for enhancing the resilience of ecosystem services. Ann Rev Environ Res 2012; 37: 421–448.

6. Hipsey, Hamilton, Hanson, et al. Predicting the resilience and recovery of aquatic systems: a framework for model evolution within environmental observatories. Water Res Res 2015; 51: 7023–7043.

7. Martin. Regional economic resilience, hysteresis and recessionary shocks. J Econ Geogr 2012; 12: 1–32.

8. Wink. Regional economic resilience: policy experiences and issues in Europe. Raumforschung Raumordnung 2014; 72: 83–84.

9. Gao, Barzel, Barabási A-L. Universal resilience patterns in complex networks. Nature 2016; 530: 307–312.

10. Kim, Chen Y-S, Linderman. Supply network disruption and resilience: a network structural perspective. J Oper Manage 2015; 33–34: 43–59.

11. Vugrin, Warren, Ehlen. A resilience assessment framework for infrastructure and economic systems: quantitative and qualitative resilience analysis of petrochemical supply chains to a hurricane. Process Saf Prog 2011; 30: 280–290.

12. Zhang, Mahadevan, Sankararaman, et al. Resilience-based network design under uncertainty. Reliab Eng Syst Safe 2018; 169: 364–379.

13. Nan, Sansavini. A quantitative method for assessing resilience of interdependent infrastructures. Reliab Eng Syst Safe 2017; 157: 35–53.

14. Vugrin, Warren, Ehlen, et al. A framework for assessing the resilience of infrastructure and economic systems. In: Gopalakrishnan, Peeta (eds) Sustainable and resilient critical infrastructure systems. Berlin; Heidelberg: Springer, 2010, pp.77–116.

15. Hosseini, Barker, Ramirez-Marquez. A review of definitions and measures of system resilience. Reliab Eng Syst Safe 2016; 145: 47–61.

16. Speranza, Wiesmann, Rist. An indicator framework for assessing livelihood resilience in the context of social–ecological dynamics. Global Environ Chang 2014; 28: 109–119.

17. Kahan, Allen, George. An operational framework for resilience. J Homel Secur Emerg 2009; 6: 83.

18. Labaka, Hernantes, Sarriegi. Resilience framework for critical infrastructures: an empirical study in a nuclear plant. Reliab Eng Syst Safe 2015; 141: 92–105.

19. Vlacheas, Stavroulaki, Demestichas, et al. Towards end-to-end network resilience. Int J Crit Infr Prot 2013; 6: 159–178.

20. Shirali GHA, Mohammadfam, Motamedzade, et al. Assessing resilience engineering based on safety culture and managerial factors. Process Saf Prog 2012; 31: 17–18.

21. Bruneau, Chang, Eguchi, et al. A framework to quantitatively assess and enhance the seismic resilience of communities. Earthq Spectra 2003; 19: 733–752.

22. Adams, Bekkem, Toledo-Durán, et al. Freight resilience measures. J Transport Eng 2012; 138: 1403–1409.

23. Sahebjamnia, Torabi, Mansouri. Integrated business continuity and disaster recovery planning: towards organizational resilience. Eur J Oper Res 2015; 242: 261–273.

24. Zobel. Representing perceived tradeoffs in defining disaster resilience. Decis Support Syst 2011; 50: 394–403.

25. Zobel, Khansa. Characterizing multi-event disaster resilience. Comp Oper Res 2014; 42: 83–94.

26. Henry, Ramirez-Marquez. Generic metrics and quantitative approaches for system resilience as a function of time. Reliab Eng Syst Safe 2012; 99: 114–122.

27. Wood, Shiltz, Nudell, et al. A framework for evaluating the resilience of dynamic real-time market mechanisms. IEEE T Smart Grid 2016; 7: 2904–2912.

28. Barker, Ramirez-Marquez, Rocco. Resilience-based network component importance measures. Reliab Eng Syst Safe 2013; 117: 89–97.

29. Pant, Barker, Ramirez-Marquez, et al. Stochastic measures of resilience and their application to container terminals. Comp Indus Eng 2014; 70: 183–194.

30. Tran, Balchanos, Domerçant, et al. A framework for the quantitative assessment of performance-based system resilience. Reliab Eng Syst Safe 2017; 158: 73–84.

31. Tran, Domerçant, Mavris. A network-based cost comparison of resilient and robust system-of-systems. Proc Comp Sci 2016; 95: 126–133.

32. Cox, Prager, Rose. Transportation security and the role of resilience: a foundation for operational metrics. Transp Pol 2011; 18: 307–317.

33. Cimellaro, Reinhorn, Bruneau. Seismic resilience of a hospital system. Struct Infrastruct E 2010; 6: 127–144.

34. Francis, Bekera. A metric and frameworks for resilience analysis of engineered and infrastructure systems. Reliab Eng Syst Safe 2014; 121: 90–103.

35. Cimellaro, Renschler, Reinhorn, et al. PEOPLES: a framework for evaluating resilience. J Struct Eng 2016; 142: 04016063.

36. Dodds, Watts, Sabel. Information exchange and the robustness of organizational networks. P Natl Acad Sci USA 2003; 100: 12516–12521.

37. Albert, Barabási. Statistical mechanics of complex networks. Rev Mod Phys 2002; 74: 47–97.

38. Newman MEJ. The structure and function of complex networks. SIAM Rev 2003; 45: 167–256.

39. Barabási, Albert. Emergence of scaling in random networks. Science 1999; 286: 509–512.

40. Dijkstra. A discipline of programming. Zug: Pearson Schweiz AG, 1976.

Improved integrated metric for quantitative assessment of resilience
