Reliability modeling and simulation: advancements with data-driven techniques and expert knowledge integration

Abstract

The reliability of a system is the likelihood that the system operates as expected for a predefined length of time. There have been numerous notable efforts to model systems with the goal of analyzing their reliability and possibly quantifying it. Simulation has shown a great capacity to address the challenge of quantitatively evaluating a system’s reliability. Traditionally, expert knowledge has been the gold standard for modeling and analyzing system reliability. However, the prevalence of data that stem from systems of interest, enabled by readily available Internet of Things devices and coupled with simulation, has the potential to change the way the reliability of systems is evaluated. Namely, data provide an opportunity for both automated modeling and continuous reliability assessment of systems through simulation. This article presents an overview of reliability modeling and simulation for cyber-physical systems, highlighting opportunities arising from data availability and its integration with expert knowledge. We use case studies from our research to illustrate these opportunities and advancements.

Keywords

Reliability modeling and simulation, data-driven, fusion of data and expert knowledge

1. Introduction

Reliability is a fundamental metric that quantifies the likelihood of a system performing its intended function over a specified period of time, i.e., the likelihood that the system fails after a specified point in time. Reliability holds significant importance, particularly in safety-critical systems, as it closely relates to the overall safety. Although an unreliable system may not automatically be deemed unsafe, an unsafe system is inherently unreliable. The reason for this distinction lies in the fact that not all system faults pose a risk to human safety. In a formal sense, reliability is typically quantified using the following definition or formula:

$$R(t) = \Pr\{T > t\} = \int_{t}^{\infty} f(x)\,dx,$$

where $T$ is the random variable corresponding to the time that a system functions as expected, $f$ is the failure probability density function, and $t$ is the point in time of interest (typically starting from time zero). Traditional reliability calculations do not incorporate the impact of repair-related downtimes. Alternative and more comprehensive metrics are, however, available that also consider periods of downtime, namely availability, dependability, and performability. Availability distinguishes itself from reliability by considering equipment downtime resulting from repairs or maintenance.¹ It quantifies the ratio of operational time to the total lifetime of the system. Dependability, however, represents a more comprehensive measure that encompasses all other metrics evaluating the operational quality of a system.² Performability is a combined measure that takes into account both performance and reliability aspects.³
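
As a brief worked illustration of the definition of $R(t)$, assume (purely for the sake of the example) a constant failure rate $\lambda$, so that $f(x) = \lambda e^{-\lambda x}$. The symbol $\lambda$ and the numerical values below are assumptions introduced here and are not taken from the case studies. Then

$$R(t) = \int_{t}^{\infty} \lambda e^{-\lambda x}\,dx = e^{-\lambda t}, \qquad \mathrm{E}[T] = \int_{0}^{\infty} R(t)\,dt = \frac{1}{\lambda}.$$

With $\lambda = 10^{-3}$ failures per hour, for example, $R(500) = e^{-0.5} \approx 0.61$, i.e., the system would survive the first 500 hours with a probability of roughly 61%, and its expected time to failure would be 1000 hours.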

Modeling and simulation (M&S) have long been used to study the behaviors of systems. In particular, M&S enable evaluation of different scenarios for their impact on the overall reliability of a system.^4,5 Using simulation with appropriate reliability-centered models, we can explore how various operational conditions, changes in system configurations, repair and maintenance strategies, or new components impact system performance and reliability. This analysis helps in identifying potential vulnerabilities, components of high importance, and beneficial repair and maintenance strategies, thus supporting decisions to improve a system’s reliability.

This article provides an overview of reliability M&S for cyber-physical systems (CPS), highlighting the opportunities created by the availability of extensive data and the integration of expert knowledge with data-driven techniques. In this context, we also present case studies from our research to illustrate these developments and suggest potential future directions.

The rest of the article is structured as follows: In Section 2, we provide background on reliability M&S and an overview of the related state of the art. Section 3 describes emerging trends in reliability M&S, followed by the presentation of two corresponding case studies in Section 4. Finally, a summary and an outlook are provided in Section 5.

2. Background and state of the art

Reliability M&S are fundamental techniques used for assessment and enhancement of the performance and dependability of complex systems. In today’s interconnected and technology-driven world with ever-increasing complexity of systems, ensuring reliability of systems and components is a paramount concern across diverse industries, including aerospace, automotive, telecommunications, and manufacturing. Reliability M&S provide valuable tools for comprehending, predicting, and optimizing the reliability characteristics of these systems.

Reliability modeling encompasses the development of models that capture the behaviors, failure patterns, and performance attributes of systems or components, as well as their interdependencies. Reliability models enable engineers and researchers to analyze systems’ reliability and availability, estimate failure rates, and identify critical areas for improvement. However, the quantitative assessment can be greatly supported by simulation, which enables execution of these models to study their behaviors under different assumptions. By simulating the operation and behavior of a system over time, reliability simulation allows for the evaluation of different scenarios, identification of vulnerabilities, and optimization of maintenance and repair strategies. Furthermore, simulation enables the introduction of different types of uncertainties in an algorithmic rather than purely mathematical manner, thereby facilitating more realistic models and assessments.

A key advantage of reliability M&S lies in its ability to address complex systems where analytical solutions may be challenging or impractical. With advancements in computational power and modeling techniques, it has become feasible to model and simulate intricate systems with a high degree of accuracy and fidelity. In addition, the integration of data-driven approaches and Digital Twins has further enhanced the capabilities of reliability M&S by incorporating real-time data and capturing the dynamic behavior of systems in operation.^6,7

In the following, we provide an overview of the most significant reliability M&S approaches, followed by an overview of the three main aspects of reliability pertaining to CPS.

2.1. Reliability M&S

M&S provide valuable tools for evaluating the performance and reliability of complex systems. Some significant examples of how M&S have been applied for reliability analysis are described as follows.

One of the most widely used simulation paradigms for reliability M&S is Discrete-Event Simulation (DES). In CPS, the interactions between physical processes and computational elements introduce complexities that can affect system reliability. DES allows modeling of these interactions in a discrete manner, capturing events, such as component failures and repairs, that occur at specific points in time. These events are represented in models of the reliability-relevant behaviors of the system. Through the calculation of reliability metrics, such as mean time to failure (MTTF) and availability, DES facilitates the quantitative evaluation of a system’s reliability.
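
The following is a minimal sketch of this idea for a single repairable component, not an implementation of any of the tools cited in this section. The function name `simulate_component`, its parameters, and the choice of exponentially distributed failure and repair durations are assumptions made for illustration only.

```python
import heapq
import random


def simulate_component(mttf, mttr, horizon, seed=0):
    """Discrete-event simulation of one repairable component.

    Times to failure and repair durations are drawn from exponential
    distributions (an assumption of this sketch). Returns the fraction
    of the horizon during which the component was up and the number of
    failures observed.
    """
    rng = random.Random(seed)
    events = []  # future-event list ordered by time: (time, event_type)
    heapq.heappush(events, (rng.expovariate(1.0 / mttf), "failure"))
    clock, up_time, failures, is_up = 0.0, 0.0, 0, True

    while events:
        event_time, kind = heapq.heappop(events)
        if event_time > horizon:
            break
        if is_up:
            up_time += event_time - clock  # accumulate operational time
        clock = event_time
        if kind == "failure":
            is_up = False
            failures += 1
            repair_done = clock + rng.expovariate(1.0 / mttr)
            heapq.heappush(events, (repair_done, "repair"))
        else:  # repair completed, schedule the next failure
            is_up = True
            next_failure = clock + rng.expovariate(1.0 / mttf)
            heapq.heappush(events, (next_failure, "failure"))

    if is_up:
        up_time += horizon - clock  # component was up until the end
    return up_time / horizon, failures


if __name__ == "__main__":
    availability, failures = simulate_component(100.0, 10.0, 1_000_000.0, seed=1)
    print(f"estimated availability: {availability:.4f}, failures: {failures}")
```

For long horizons, the estimated availability should approach the ratio of mean up time to mean cycle time. Replacing the two `expovariate` calls with other samplers is one way such a sketch could accommodate non-exponential behavior.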

In the work by Ram,⁸ an extensive overview of the recent research in M&S for reliability engineering is provided. The book covers topics, such as the performance evaluation of complex engineering systems, the modeling of non-exponential distributions in reliability analysis, optimal software rejuvenation policies, multivariate analysis for reliability modeling, and more. The book, furthermore, provides insights and case studies to bridge the gap in knowledge pertaining to engineering applications.

Another example of using DES for reliability evaluation is the RAMSAS (Reliability Analysis by Modeling and Simulation) method, which is a model-based approach for system reliability analysis.⁹ This method can be integrated into various phases of the system development process, complementing other well-known techniques, such as Failure Modes, Effects, and Criticality Analysis (FMECA), Fault Tree Analysis (FTA), and Reliability Block Diagram (RBD). RAMSAS has been applied in different application domains, including avionics, automotive, and satellite systems,^10,11 to improve modeling of both intended and dysfunctional system behaviors.

More specifically and more recently, Chiacchio et al.¹² introduced a simulation framework called Stochastic Hybrid Fault Tree Automaton (SHyFTA) for designing and analyzing dynamic reliability problems in real industrial applications. SHyFTA combines deterministic and stochastic models using shared variables, particularly using Dynamic Fault Trees (DFT) for modeling complex dependable systems. Furthermore, Pourhassan et al.¹³ propose a simulation-based approach for generalized reliability assessment. The proposed model accommodates stochastic degradation processes and randomly occurring shocks across two-state, multistate, and continuous degradation scenarios, with customizable time to failure distributions and shock densities. Two case studies in a sugar plant and an example from literature demonstrate the model’s validity and applicability.

2.2. Three aspects of reliability in CPS

CPS, being complex systems, consist of multiple interconnected aspects, influenced by external factors. Numerous potential issues can arise, each associated with the different aspects of CPS and having their own distinct nature. Three key aspects, namely hardware, software, and human interaction, significantly impact the reliability of a CPS. Despite their distinctiveness, it is crucial to consider all these aspects collectively since they exert mutual influence on one another. It is worth noting that these aspects are essential for understanding the system’s reliability, but they do not encompass all the factors relevant to reliability, which is often intertwined with security and safety. In the following, we provide a brief overview of hardware, software, and human reliability in CPS, exemplifying the successful use of M&S for each reliability aspect. For a more in-depth overview of the concrete models and methodologies being used for reliability assessment for each of these three aspects, we refer the reader to our extensive review article.¹⁴

Hardware reliability has been very well defined and has a very sound theoretical background. The reliability of hardware components refers to the likelihood of them functioning as intended within a given timeframe. The main reliability measures are defined through the concepts of faults and failures, such as “mean time to failure” or MTTF or “mean time to repair” or MTTR. The assumption for using these two measures is that both repair times and interfailure times are exponentially distributed, i.e., the rates are constant. These are also the most common reliability features for describing fault models of systems’ components. If failures and repairs are non-exponentially distributed, then we use the notions of failure distributions and repair distributions.
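
To make the distinction between the exponential and the non-exponential case concrete, the sketch below uses the Weibull distribution as one commonly used example of a failure distribution; this choice, as well as all function and parameter names, is an assumption made for illustration. With a shape parameter of 1, the Weibull distribution reduces to the exponential case with a constant failure rate.

```python
import math


def weibull_reliability(t, shape, scale):
    """R(t) = exp(-(t / scale) ** shape); shape == 1 is the exponential case."""
    return math.exp(-((t / scale) ** shape))


def weibull_mttf(shape, scale):
    """Mean time to failure of a Weibull lifetime: scale * Gamma(1 + 1 / shape)."""
    return scale * math.gamma(1.0 + 1.0 / shape)


def steady_state_availability(mttf, mttr):
    """Long-run fraction of time a repairable component is operational."""
    return mttf / (mttf + mttr)


if __name__ == "__main__":
    # Constant failure rate (shape = 1) versus wear-out behavior (shape = 2).
    for shape in (1.0, 2.0):
        print(shape, weibull_reliability(500.0, shape, 1000.0), weibull_mttf(shape, 1000.0))
    print(steady_state_availability(mttf=100.0, mttr=10.0))
```

A shape parameter above 1 models a failure rate that increases with age, which the two scalar measures MTTF and MTTR alone cannot express.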

In hardware reliability, diverse modeling approaches are employed to evaluate and predict the performance and dependability of hardware components and systems. These modeling techniques aid in understanding failure mechanisms, analyzing failure probabilities, and designing robust hardware. To analyze the interconnectedness of faults in components and systems that may result in system failure, Fault Trees (FTs)^15–17 can be employed. Other common modeling approaches in hardware reliability encompass Failure Modes and Effects Analysis (FMEA), which systematically identifies and prioritizes potential failure modes;¹⁸ RBD, a graphical method representing complex system reliability;¹⁹ Markov Models, useful for analyzing system reliability and availability;²⁰ DES, which models uncertainty and variability;²¹ Physics of Failure (PoF) Models, employing physics-based principles to understand failure mechanisms;²² Dynamic Reliability Models, accounting for changes in system states and interactions over time;²³ and the emerging use of Artificial Intelligence and Machine Learning techniques for data-driven decision-making in hardware reliability.^24,25 By integrating these modeling approaches, a comprehensive understanding of hardware reliability is achieved, enabling robust design and maintenance decisions.

In the context of hardware reliability assessment, various simulation methodologies have been used extensively. For example, DES has been used to evaluate FTs¹⁶ and RBDs.¹⁹ Monte Carlo simulation is used in the work by Ruijters and Stoelinga¹⁷ to evaluate FTs and in the work by Distefano and Xing²⁶ to evaluate RBDs. Finally, Lazarova-Molnar et al.¹⁵ use proxel-based simulation²⁷ to determine the instantaneous availability of basic FT components and the entire system.
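
As a minimal sketch of how Monte Carlo simulation can evaluate a static FT, consider the hypothetical tree TOP = (A AND B) OR C with independent, non-repairable components and exponentially distributed lifetimes. The tree, the failure rates, and all function names are assumptions made for illustration and do not reproduce the methods of the cited works.

```python
import math
import random


def top_event(failed):
    """Hypothetical fault tree: TOP = (A AND B) OR C."""
    return (failed["A"] and failed["B"]) or failed["C"]


def monte_carlo_unreliability(rates, mission_time, runs=100_000, seed=0):
    """Estimate the probability that the top event has occurred by mission_time."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        # A non-repairable component has failed by mission_time if its
        # sampled exponential lifetime is not longer than mission_time.
        failed = {name: rng.expovariate(rate) <= mission_time
                  for name, rate in rates.items()}
        if top_event(failed):
            hits += 1
    return hits / runs


def exact_unreliability(rates, mission_time):
    """Closed-form value for the same tree, assuming independent components."""
    p = {name: 1.0 - math.exp(-rate * mission_time)
         for name, rate in rates.items()}
    return 1.0 - (1.0 - p["A"] * p["B"]) * (1.0 - p["C"])


if __name__ == "__main__":
    rates = {"A": 0.01, "B": 0.02, "C": 0.001}  # assumed failures per hour
    print(monte_carlo_unreliability(rates, mission_time=100.0))
    print(exact_unreliability(rates, mission_time=100.0))
```

For a tree this small the closed form is available and serves as a check on the estimate; simulation becomes attractive when repairs, dependencies, or non-exponential distributions make the analytical route impractical.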

Software reliability is considered to be an attribute of software quality, alongside functionality, usability, performance, serviceability, capability, installability, maintainability, and documentation. Software reliability analysis is mainly related to design errors, and, therefore, there is a different approach to it than the one usually used for hardware. As specified in the work by Rosenberg et al.,²⁸ software reliability comprises the following three activities: (a) Error prevention, (b) Fault detection and removal, and (c) Measurements to maximize reliability, thus supporting the first two activities.

Hence, to enhance the reliability of software, it is crucial to prioritize it right from the initial stages of development, including the requirements specification phase. The sooner reliability is incorporated, the greater its potential. Common practices employed to enhance software reliability encompass processes, such as debugging, early error detection, swift recovery, dynamic and static analysis, and evolution.

Compared with hardware reliability, software reliability is also linked to many unquantifiable factors, such as programmers’ skills or software project management skills. In the case of hardware, the skills of workers that build the hardware also matter, so the design aspect is common for both hardware and software. However, in software, there is no wear out. Therefore, most of the reliability in software is tackled through certifications and quality assurance methods,²⁹ and there is very little in terms of widely accepted software reliability assessment methods. One such attempt is presented in the work by Shi et al.,²⁹ where the authors propose a systematic software reliability prediction approach based on software metrics.

Various approaches have considered and successfully used simulation for software reliability assessment. For instance, Tausworthe and Lyu³⁰ discuss several simulation approaches for evaluating software reliability. In a study by Gokhale and Lyu,³¹ simulation procedures are developed to assess the influence of individual components on application reliability in the context of fault detection and repair strategies. Finally, Gokhale et al.³² use DES to model the failure behavior of a terminating application, considering both instantaneous and explicit repair scenarios.

Human Reliability: The majority of CPS are designed for interaction with humans and, thus, humans have, to a large extent, influence on the operation of these systems. This influence encompasses reliability. There have been numerous research efforts directed at estimating the impact of unexpected human interaction with a wide range of CPS. One example here is smart buildings, where a lot of research has been done on the topic of occupants’ behavior.³³ Humans are, however, highly uncertain in their behavior and, thus, modeling of their behavior needs different approaches from the ones applicable to software and hardware aspects. Moreover, the types of interactions that can occur might differ from one system to another. Therefore, it may be necessary to provide custom solutions with respect to the type of system and interactions. In an attempt to study reliability due to human behavior,³⁴ Hollnagel discusses the concept of a “human error” and relates it to what is termed “human reliability.” He, furthermore, emphasizes the importance of developing a model to anticipate failures in joint machine–human systems. He concludes that this, however, is a very complex problem, as these failures are of a complex nature and need careful consideration.

Approaches that include human uncertainty in the overall reliability assessment of CPS are not plentiful. In one of them, Bessani et al.³⁵ present a model to include operator’s responsiveness together with machines’ faults and failures to evaluate the reliability of a system. Furthermore, Fan et al.³⁶ present a platform and associated methodology to effectively generate accident scenarios by modeling human–machine interaction errors using model-level fault injection, followed by simulation to produce dynamic evolutions of accident scenarios. These are notable efforts toward providing accurate and holistic, and therefore useful, reliability measures of CPS. In that context, Bolton et al.³⁷ recently introduced a novel method for generating human errors based on the task-based taxonomy of erroneous human behavior. In this case, the authors show how the method can be used with formal system modeling and formal verification with model checking to prove whether or not potentially unanticipated erroneous behavior could contribute to system failures.

Further notable recent attempts at addressing human behavior in the context of reliability of CPS have been presented in literature.^38–40 These efforts address critical aspects of system reliability and safety in different domains. Che et al.³⁸ focus on load-sharing man–machine systems and analyze the impact of machine degradation, human errors, and random shocks on system reliability. Che et al.⁴⁰ introduce an FT-based approach for aviation risk analysis, with a specific emphasis on mental workload overload, proposing safety recommendations to enhance aviation system dependability. Finally, Zheng et al.³⁹ introduce Systems Analysis for Formal Pharmaceutical Human Reliability (SAFPH), a next-generation human reliability analysis tailored for the pharmaceutical industry, which aims to comprehensively understand human actions and decision-making in pharmaceutical processes to improve safety and reliability. These research efforts contribute valuable insights into their respective fields, and offer methods and strategies to address human and technical factors influencing system dependability.

Several researchers employ simulation for human reliability assessment. For instance, Angelopoulou et al.⁴¹ introduce a human reliability assessment simulation model that accounts for performance shaping factors influencing human work within complex Industry 4.0 systems. Another approach, presented by Musharraf et al.,⁴² uses a Bayesian method to improve human reliability assessment by leveraging data generated in a simulator. In addition, an extensive overview of simulation approaches to human reliability assessment is presented in the work by Boring.⁴³

3. Emerging trends in reliability M&S

Evaluation of systems has undergone a significant transformation with the increasing availability of data. This transformation has been particularly accelerated by the development of the Internet of Things (IoT), which not only provides new opportunities for analyzing system reliability but also allows for the validation of existing approaches. Traditionally, reliability analysis heavily relied on expert knowledge, and this remains essential, especially for safety-critical systems, where the occurrence of faults can have catastrophic consequences. One prominent example that requires extensive expert knowledge is the design of FTs for aviation system reliability, as highlighted in the work by Netjasov and Janic.⁴⁴

It is, however, important to note that a considerable portion of systems used in manufacturing and other domains is not safety-critical. In these cases, the impact of faults and failures is primarily measured in terms of financial costs rather than posing risks to individuals or the environment.

Overall, while expert knowledge continues to play a vital role in ensuring system reliability, advancements in the IoT have opened up new possibilities for analyzing and validating reliability approaches. This is particularly relevant in non-safety-critical manufacturing systems, where the focus shifts toward minimizing financial implications resulting from faults and failures. The availability of data has indeed become a game changer in system evaluation, shaping the way we approach and improve system reliability.

In the following, we elaborate on two emerging advancements in reliability M&S. Namely, we first introduce the concept of data-driven reliability modeling, followed by an approach to combine and enhance data-driven models with expert knowledge in a systematic way. For both advancements, we provide illustrative case studies in the subsequent section.

3.1. Data-driven reliability modeling

The conventional approach to reliability modeling, which heavily depends on expert knowledge, exhibits various limitations. As systems grow in complexity, it becomes increasingly challenging to maintain them effectively and identify vulnerabilities that can impact their reliability. In addition, a rising number and variety of failures add further complexity to the systems. Expert knowledge, while invaluable, can become a bottleneck as systems become increasingly complex.⁶

Another shortcoming of conventional reliability modeling is the static nature of the models, which fail to account for changes in a system over time. As physical systems undergo modifications, manually developed models can quickly become outdated and require updating. This manual model updating process becomes labor-intensive and tedious, particularly in systems where frequent changes in system topology and configuration occur.

Finally, conventional reliability models are typically developed for a specific system and its particular configuration. As a result, these models do not generalize well to other systems, limiting their applicability beyond the specific context for which they were developed.

Considering the challenges and limitations regarding conventional reliability modeling, it is crucial to address these issues by adopting an approach that involves dynamically generating accurate reliability models for systems characterized by high complexity. This need is particularly significant when it comes to short-term decision-making processes, where up-to-date and accurate models are essential for informed and effective decision-making.⁴⁵

Data-driven reliability modeling addresses this need by leveraging real-world data collected from the operation of a system. Compared to conventional modeling approaches, data-driven reliability modeling offers several advantages. It facilitates a more accurate representation of real-world system behavior, as it takes into account the variability and complexity that may not be captured by manually developed models. Moreover, data-driven models can adapt and learn from new data, enabling continuous improvement and updating of reliability assessments as more data become available.

The availability and ease of collecting data is the key driver behind the development of data-driven reliability modeling techniques. Recent advancements in the context of Industry 4.0 have facilitated the effective gathering of data through the use of IoT devices. These data play a crucial role in supplementing expert knowledge and constructing more accurate reliability models.⁶ Various types of data can be collected and used in data-driven reliability modeling, including the following:

Event data: Data related to system failures, faults, maintenance activities, repairs, and other relevant events.

Sensor data: Measurements of temperature, pressure, vibration, voltage, or any other relevant variables that can indicate the health and condition of system components.

Operational data: Data on operational processes, workload, operating conditions, environmental factors, and usage patterns.

Contextual data: Additional contextual information, such as system configuration, design specifications, and component specifications.

To fully harness the potential of the collected data, it is essential to develop new and sophisticated approaches for extracting insights that are relevant for reliability modeling. For example, sensor data are typically collected in the form of time series without explicitly capturing fault occurrences. Consequently, there is a need for approaches that prioritize event detection, enabling the extraction of fault occurrences and other relevant events from the time series data. Moreover, the development of accurate root cause analysis methodologies becomes crucial to extract event dependencies and model them effectively.⁴⁶
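
As a deliberately simple sketch of event detection in sensor time series, the function below reports a fault event whenever a signal stays above a fixed threshold for a minimum number of consecutive samples. The fixed-threshold rule, the function name `detect_fault_events`, and its parameters are assumptions chosen for illustration; practical approaches are typically more sophisticated and are not prescribed by the works cited here.

```python
def detect_fault_events(samples, threshold, min_length=3):
    """Return (start_index, end_index) pairs of runs above the threshold.

    A run counts as a fault event only if it spans at least min_length
    consecutive samples, which filters out isolated spikes.
    """
    events = []
    start = None  # index at which the current run above threshold began
    for i, value in enumerate(samples):
        if value > threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_length:
                events.append((start, i - 1))
            start = None
    # Close a run that is still open at the end of the series.
    if start is not None and len(samples) - start >= min_length:
        events.append((start, len(samples) - 1))
    return events


if __name__ == "__main__":
    vibration = [1, 1, 5, 5, 5, 1, 5, 1, 5, 5, 5, 5]  # made-up readings
    print(detect_fault_events(vibration, threshold=3, min_length=3))
```

The resulting index pairs can be mapped to timestamps and stored as event data, turning raw sensor readings into the failure and repair events that reliability models require.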

Figure 1 illustrates the feedback loop that can be facilitated through data-driven reliability assessment, where model generation is the key component. Based on a system of interest, data such as event data, sensor data, operational data and contextual data are collected. It is important to note that data-driven reliability modeling is particularly suitable for systems that are not safety-critical, as the collection of data in such scenarios is safe and feasible. In safety-critical systems, collecting data for reliability analysis would be impractical due to the potential risks to human lives associated with failures.

Figure 1.

Data-driven reliability assessment process (adapted from the work by Lazarova-Molnar and Mohamed⁴⁶).

Once collected, the data undergo preprocessing to ensure its usability for data-driven reliability modeling. This preprocessing phase involves tasks, such as event detection in sensor data and transforming the data into a format suitable for model generation methods. Next, the preprocessed data are used to generate reliability models. Once a reliability model is derived, simulation and data analytics techniques are used to support decision-making processes. These decisions may regard, for example, system configuration, purchasing decisions, or maintenance scheduling.
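
One small step of such a pipeline, the derivation of model parameters from preprocessed event data, can be sketched as follows. The event-log format, the function name `estimate_mttf_mttr`, and the assumption that the component is operational at time zero are hypothetical and serve only to illustrate the idea.

```python
def estimate_mttf_mttr(event_log):
    """Estimate MTTF and MTTR from a chronologically ordered event log.

    event_log is a list of (timestamp, event) pairs with event being
    either "failure" or "repair". The component is assumed to be up at
    time 0. The estimates are the sample means of the observed up and
    down durations.
    """
    up_durations, down_durations = [], []
    last_time, is_up = 0.0, True
    for timestamp, event in event_log:
        if event == "failure" and is_up:
            up_durations.append(timestamp - last_time)
            last_time, is_up = timestamp, False
        elif event == "repair" and not is_up:
            down_durations.append(timestamp - last_time)
            last_time, is_up = timestamp, True
    if not up_durations or not down_durations:
        raise ValueError("log must contain at least one failure and one repair")
    mttf = sum(up_durations) / len(up_durations)
    mttr = sum(down_durations) / len(down_durations)
    return mttf, mttr


if __name__ == "__main__":
    log = [(100.0, "failure"), (110.0, "repair"),
           (310.0, "failure"), (320.0, "repair")]
    mttf, mttr = estimate_mttf_mttr(log)
    print(mttf, mttr, mttf / (mttf + mttr))
```

The estimated parameters can then be used to parameterize a simulation model, and the estimation can be repeated as new events arrive so that the model stays up to date.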

Several contributions propose methods for data-driven reliability modeling. For example, Lu et al.⁴⁷ propose a Bayesian approach to model the reliability of manufacturing systems based on data. Alsina et al.²⁵ use machine learning techniques and manufacturing component data to predict reliabilities. Zou et al.⁴⁸ introduce a novel data-driven stochastic manufacturing reliability model that captures production dynamics and identifies the causes of persistent production failures in both deterministic and stochastic scenarios. Lugaresi and Matta⁴⁹ employ Process Mining for automated manufacturing system discovery and digital twin generation. In our work, Friederich and Lazarova-Molnar,⁵⁰ we propose an approach for data-driven reliability modeling of smart manufacturing systems using Process Mining and stochastic Petri nets (SPNs). In addition, in another contribution,¹⁵ we present an approach for data-driven FTA based on time series data of a system.

In summary, data-driven reliability modeling is a promising research direction for addressing the limitations of conventional approaches using real-world data, facilitating accurate modeling, and supporting informed decision-making in complex systems.

3.2. Fusing data and expert knowledge for reliability model extraction

Disregarding the cognitive abilities and expert knowledge of humans results in significant information loss, which can only be partially offset by the expensive endeavor of collecting extensive data. However, advancements in system designs, data collection technologies, blockchain-based data storage, and access frameworks have rendered the reliance solely on expert knowledge for model building obsolete.

Combining expert knowledge with data has been a common practice in fields such as machine learning, artificial intelligence, and decision-making. The goal is to leverage the expertise of domain specialists or subject matter experts along with the information contained in data to improve the quality and accuracy of predictions, models, or decision-making processes.

One common approach to combining expert knowledge with data is through the use of expert systems or knowledge-based systems.^51–53 These systems integrate expert knowledge, typically represented as a set of rules or a knowledge base, with data-driven techniques to solve complex problems. Expert systems have been used in diverse domains, such as medicine,⁵⁴ finance,⁵⁵ and engineering,⁵⁶ to assist with diagnosis, decision-making, and prediction tasks.

Another approach is Bayesian statistics, which allows for the incorporation of prior knowledge or beliefs into the analysis of data.^57,58 Bayesian methods provide a framework for updating prior beliefs using observed data to obtain posterior beliefs. By combining prior knowledge with data, Bayesian inference can yield more accurate estimates and predictions.
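
A standard textbook instance of such an update, given here only as an illustrative sketch, is the conjugate gamma prior for a constant failure rate with exponentially distributed times between failures. The symbols $\lambda$, $\alpha$, $\beta$, $n$, and $\tau$ are assumptions introduced for this example: $\lambda$ is the failure rate, $\alpha$ and $\beta$ are the shape and rate parameters of the prior that encodes the expert's belief, and $n$ failures are observed over a cumulative operating time $\tau$.

$$\lambda \sim \mathrm{Gamma}(\alpha, \beta), \qquad \lambda \mid n, \tau \sim \mathrm{Gamma}(\alpha + n, \beta + \tau), \qquad \mathrm{E}[\lambda \mid n, \tau] = \frac{\alpha + n}{\beta + \tau}.$$

For example, a prior with $\alpha = 2$ and $\beta = 4$ years (a prior mean of 0.5 failures per year), combined with $n = 3$ failures observed over $\tau = 2$ years, gives a posterior mean of $5/6 \approx 0.83$ failures per year, which lies between the prior mean and the purely data-based estimate of $3/2 = 1.5$ failures per year.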

Recent advancements in machine learning, such as deep learning and neural networks, have also explored ways to incorporate expert knowledge into data-driven models. One common approach is transfer learning, where knowledge from one domain or task is transferred to another related domain or task.^59,60 This allows models to benefit from pre-existing expertise and generalization capabilities.

Furthermore, the field of explainable AI (XAI) has gained attention, aiming to combine expert knowledge with data to provide interpretable and understandable models. Explainable AI techniques allow experts to contribute their domain knowledge, verify model behavior, and ensure ethical considerations are taken into account.^61,62

Considering some of the popular types of reliability models, i.e., FTs and Petri nets, there are basically two aspects in which data and expert knowledge can be fused: building the structure of a model, also known as qualitative analysis, and identifying the associated parameters or the quantitative analysis. In classical Simulation, both the structure and the parameters of the system of interest are fully characterized by experts. In Table 1, the areas in which data-driven methodologies and human cognitive capabilities can be combined are illustrated. When only expert knowledge is used for both qualitative and quantitative aspects, it is referred to as “Classical Simulation” (Table 1). If the information source is limited to data without incorporating expert knowledge, the approach is termed “Data-driven simulation.” The gray cells in Table 1 highlight the areas where the fusion of data and expert knowledge can enhance accuracy and leverage all available information from both data and human cognitive abilities.

Table 1.

Areas in which data and expert knowledge can be fused.

Several studies in risk and reliability analysis have made use of experts’ knowledge. Traditionally, only Bayesian statistics was applied to take experts’ opinions into account because probability is the only language in which uncertainty can be consistently expressed. This process requires the use of prior distributions for reporting expert opinions.^63,64 Data availability has had its influence on reliability analysis, and the Bayesian approach, especially Bayesian Networks, is a well-suited framework for using both data and experts’ knowledge.⁶⁵ However, most of the existing research on reliability assessment makes use of experts’ opinions only in quantitative analysis, i.e., to estimate a model’s parameters (red fields in Table 1).

In our previous study,⁶⁶ we point out the gap in the availability of systematic methods for fusing data with expert knowledge for the purpose of qualitative and quantitative reliability analysis of CPS. In this study, we introduce a systematic way of converting human knowledge into formal statements that can be easily combined with data-extracted facts for data-driven FTA. Different types of statements from experts can be recorded in various ways. Some examples are as follows:

Whenever components A and B fail (regardless of the rest), the whole system fails;

Whenever components A and B fail (not sure about the other components), the whole system fails to function;

Component A fails approximately once every 2 years;

Component A is highly reliable and rarely fails.

The first two examples concern qualitative analysis, while the last two examples inform us about the quantitative part of the model. Some of the methods that can help with converting these statements into computer understandable language are black lists and white lists^67,68 for the qualitative analysis, and Fuzzy set theory⁶⁹ and Bayesian statistics⁶³ for the quantitative analysis.

In light of fusing data and expert knowledge, it is important to mention the concept of cognitive digital twins (CDTs). CDTs, which can also be used for reliability M&S, aim to integrate a human dimension into classic digital twins. With this, CDTs represent a promising evolution of the current digital twin concept toward a more intelligent, comprehensive, and full lifecycle representation of complex systems.⁷⁰ CDTs are essentially dynamic data-driven simulation models that not only integrate real-time data for updating the model but also fuse it with human knowledge, which is seen as an important factor in designing, updating, optimizing, and validating the model.

4. Case studies

Building on the previous section, we present two case studies from our research to illustrate both emerging trends in the field of Reliability M&S presented above. In both case studies, we automatically extract reliability models that are subsequently simulated, showcasing how the whole Reliability M&S process can be automated. The first case study illustrates a purely data-driven approach to reliability modeling using SPNs, whereas the second case study illustrates the concept of seamlessly fusing data and expert knowledge for FT modeling and analysis.

4.1. Case study 1: Data-driven reliability assessment of production systems

In this section, we present a case study that demonstrates the application of data-driven reliability assessment to a production system. To do so, we first describe the case study system and the data we extracted from it, followed by the data-driven model generation and validation process. Finally, we use the generated model to aid decision-making in maintenance staffing.

The presented data-driven reliability assessment approach is novel, as it does not merely use data for typical M&S tasks, such as model parameterization, calibration, validation, or sensitivity analysis. Approaches of that kind usually assume an initial, manually derived model. Instead, our approach uses data to generate an explicit simulation model using process mining and statistical techniques. Because the model is generated from data, it is very flexible and can quickly be calibrated or regenerated under changing conditions.

4.1.1. Case study system and extracted data

Figure 2 provides an overview of the case study system, which is a flow production line commonly found in manufacturing systems. The production line is fully automated and consists of five resource components: a manufacturing execution system (MES), two automated guided vehicles (AGVs), and two assembly cells. Both assembly cells work concurrently, performing the same assembly operation. The MES controls the production process by initiating new production orders, directing them to either Assembly Cell 1 or 2, and marking orders as completed. When a new production order is initiated and assigned to one of the assembly cells, the AGVs transport the raw material to the designated cell.

Figure 2.

Overview of the case study system.

The AGVs and the assembly cells are susceptible to failures, while the MES is always fully operational. In the event of a production resource failure, the resource stops operating and a repair crew is dispatched to repair the malfunctioning resource. The maintenance policy in place is purely reactive. Each AGV has an unlimited buffer and a capacity of one, while Assembly Cells 1 and 2 each have a finite buffer and a capacity of one.

To simulate streaming data and conduct experiments, we developed a simulation model based on the described case study system. We used this simulation model to generate synthetic data in the form of event logs, which capture the production process, and state logs, which record the operational state changes of production resources. The data captured by these logs will later be used to generate a simulation model.

Event logs can be extracted from information systems, such as MES, enterprise resource planning (ERP), or supply chain management (SCM). Each entry in the event log captures the execution of an event, including the timestamp, order identifier, the production resource involved in the corresponding activity, and the event type (i.e., $NA$ for events without activities, $start$ for the beginning of an activity, and $end$ for the completion of an activity).

State logs, in contrast, can be extracted from control systems, such as programmable logic controllers (PLC) or supervisory control and data acquisition (SCADA). Each entry in the state log captures an operational state change of a resource at a specific time, including the timestamp, the production resource changing its state, and the new state the resource transitioned to (i.e., $idle$, $busy$, $failure$, $repaired$).

The event and state logs that we used to generate the reliability model capture the production of 4431 orders over a 1-month time span. Excerpts of the event and state logs are displayed in Tables 2 and 3, respectively.

Table 2.

Excerpt of the generated event log.

Timestamp	Order ID	Resource	Event	Event type
01-06-2023 00:00:05	442	mes	new_order	NA
01-06-2023 00:00:05	442	mes	direct_to_line1	NA
01-06-2023 00:00:05	442	agv1	agv1_transport_to_cell1_buffer	start
01-06-2023 00:04:08	442	agv1	agv1_transport_to_cell1_buffer	end
01-06-2023 00:04:08	442	mes	enter_cell1	NA
01-06-2023 00:04:08	442	cell1	cell1_operation	start
01-06-2023 00:07:39	443	mes	new_order	NA
01-06-2023 00:07:39	443	mes	direct_to_line1	NA
01-06-2023 00:07:39	443	agv1	agv1_transport_to_cell1_buffer	start
01-06-2023 00:10:29	443	agv1	agv1_transport_to_cell1_buffer	end
01-06-2023 00:11:53	442	cell1	cell1_operation	end
01-06-2023 00:11:53	442	mes	order_completed	NA
01-06-2023 00:11:53	443	mes	enter_cell1	NA
…	…	…	…	…

Table 3.

Excerpt of the generated state log.

Timestamp	Resource	State
01-06-2023 00:00:05	agv1	busy
01-06-2023 00:04:08	agv1	idle
01-06-2023 00:04:08	cell1	busy
01-06-2023 00:07:39	agv1	busy
01-06-2023 00:10:29	agv1	idle
01-06-2023 00:11:53	cell1	idle
01-06-2023 00:11:53	cell1	busy
01-06-2023 00:18:05	agv1	busy
01-06-2023 00:20:15	cell1	idle
01-06-2023 00:21:09	agv1	idle
01-06-2023 00:21:09	cell1	busy
01-06-2023 00:21:31	agv1	busy
01-06-2023 00:25:07	agv2	busy
…	…	…
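As a minimal sketch of how such a state log could be processed, the following Python snippet sums the time each resource spends in each state. The rows are copied from the excerpt in Table 3; the function name `time_in_state` and the day-month-year reading of the timestamps are assumptions made for illustration, not part of the original tooling.

```python
from collections import defaultdict
from datetime import datetime

# Rows copied from the state-log excerpt (timestamp, resource, new state).
STATE_LOG = [
    ("01-06-2023 00:00:05", "agv1", "busy"),
    ("01-06-2023 00:04:08", "agv1", "idle"),
    ("01-06-2023 00:04:08", "cell1", "busy"),
    ("01-06-2023 00:07:39", "agv1", "busy"),
    ("01-06-2023 00:10:29", "agv1", "idle"),
    ("01-06-2023 00:11:53", "cell1", "idle"),
    ("01-06-2023 00:11:53", "cell1", "busy"),
    ("01-06-2023 00:18:05", "agv1", "busy"),
    ("01-06-2023 00:20:15", "cell1", "idle"),
]


def time_in_state(log, fmt="%d-%m-%Y %H:%M:%S"):
    """Sum the seconds each resource spent in each state.

    A state lasts until the same resource's next entry. The final (open)
    state of every resource is ignored because its end is unknown.
    The timestamp format is an assumption (day-month-year).
    """
    totals = defaultdict(float)
    last = {}  # resource -> (timestamp, state) of its most recent entry
    for stamp, resource, state in log:
        t = datetime.strptime(stamp, fmt)
        if resource in last:
            t_prev, s_prev = last[resource]
            totals[(resource, s_prev)] += (t - t_prev).total_seconds()
        last[resource] = (t, state)
    return dict(totals)


durations = time_in_state(STATE_LOG)
print(durations[("agv1", "busy")], durations[("cell1", "busy")])
```

The same accumulation could be applied to the $failure$ and $repaired$ entries to obtain resource downtime.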

4.1.2. Reliability model generation

This section describes the generation of a reliability model based on the event and state logs, using our approach proposed in the work by Friederich and Lazarova-Molnar.⁵⁰ Our approach combines process mining and statistical analysis to generate a reliability model using SPNs as the modeling formalism. The class of SPNs considered can be defined as follows:

SPN = (P, T, A, m_{0})

where:

$P = \{P_{1}, P_{2}, \dots, P_{p}\}$ is the set of places, drawn as circles,

$T = \{T_{1}, T_{2}, \dots, T_{q}\}$ is the set of transitions along with their distribution functions or weights, drawn as bars,

$A = A^{I} \cup A^{O} \cup A^{H}$ is the set of arcs, where $A^{O}$ is the set of output arcs, $A^{I}$ is the set of input arcs, and $A^{H}$ is the set of inhibitor arcs, and each of the arcs has a multiplicity assigned to it,

$m_{0}$ is the initial marking of the Petri net.

Each transition $T_{i}$ can be either timed or immediate. A timed transition is drawn as a hollow bar and is assigned a probability distribution function that describes the firing time of the corresponding activity. An immediate transition is drawn as a filled-out bar and is assigned a weight that describes the firing probability of the associated event.
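The definition above can be mirrored in a small data structure. The following Python sketch is illustrative only: the class and field names are hypothetical, firing-time distributions are stored as plain tuples, and the enabling rule shown (sufficient tokens on input places, no inhibitor place at or above its arc multiplicity) is the standard Petri net rule rather than code from the original work.

```python
from dataclasses import dataclass, field


@dataclass
class Transition:
    name: str
    timed: bool                # True: timed transition, False: immediate
    distribution: tuple = ()   # e.g. ("exp", 0.5) for timed transitions
    weight: float = 1.0        # firing weight, used by immediate transitions


@dataclass
class SPN:
    places: set
    transitions: dict                                   # name -> Transition
    input_arcs: dict = field(default_factory=dict)      # (place, transition) -> multiplicity
    output_arcs: dict = field(default_factory=dict)     # (transition, place) -> multiplicity
    inhibitor_arcs: dict = field(default_factory=dict)  # (place, transition) -> multiplicity
    marking: dict = field(default_factory=dict)         # place -> tokens (initially m_0)

    def enabled(self, t):
        """True if all input places hold enough tokens and no inhibitor
        place has reached the multiplicity of its inhibitor arc."""
        for (p, tr), k in self.input_arcs.items():
            if tr == t and self.marking.get(p, 0) < k:
                return False
        for (p, tr), k in self.inhibitor_arcs.items():
            if tr == t and self.marking.get(p, 0) >= k:
                return False
        return True

    def fire(self, t):
        """Consume tokens along input arcs and produce them along output arcs."""
        if not self.enabled(t):
            raise ValueError(f"transition {t} is not enabled")
        for (p, tr), k in self.input_arcs.items():
            if tr == t:
                self.marking[p] = self.marking.get(p, 0) - k
        for (tr, p), k in self.output_arcs.items():
            if tr == t:
                self.marking[p] = self.marking.get(p, 0) + k


# Hypothetical three-place net: an operation that is blocked while "failed" is marked.
net = SPN(
    places={"waiting", "done", "failed"},
    transitions={"operate": Transition("operate", timed=True, distribution=("exp", 0.5))},
    input_arcs={("waiting", "operate"): 1},
    output_arcs={("operate", "done"): 1},
    inhibitor_arcs={("failed", "operate"): 1},
    marking={"waiting": 1, "done": 0, "failed": 0},
)
```

In this sketch, placing a token in `failed` disables `operate`, which is the role inhibitor arcs play when a fault model is attached to a resource activity.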

To generate the reliability model, we first extract a manufacturing process model and then integrate it with fault models for production resources.

The extraction of the manufacturing process model uses the information captured by the event log and involves the following four steps:

Identification of a Petri net that represents the material flow within a production line using process discovery algorithms. The material flow is the path that production orders follow through the system.⁴⁹

Determination of transition types (i.e., timed or immediate). Timed transitions correspond to the arrival of new production orders or to resource activities.

Estimation of probability distributions for timed transitions and extraction of weights for immediate transitions.

Extraction of resource capacities and buffers.

The extraction of the fault models for production resources uses the information captured by the state log and involves the following two steps:

Creation of necessary places and transitions from a fault model template.

Estimation of resource failure and repair distributions.

Finally, the fault models are integrated into the manufacturing process model using inhibitor arcs preventing the corresponding resource activity timed transition from firing.
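To make the estimation of failure and repair distributions more concrete, the sketch below extracts times to failure and times to repair for one resource from state-log entries and fits an exponential rate by maximum likelihood. The log entries are made up, the function names are hypothetical, and the exponential family is an assumption chosen for brevity; the text does not state which distribution families or fitting procedure are actually used.

```python
from datetime import datetime


def failure_repair_samples(log, resource, fmt="%d-%m-%Y %H:%M:%S"):
    """Return (times_to_failure, times_to_repair) in hours for one resource.

    Time to repair: from a 'failure' entry to the next 'repaired' entry.
    Time to failure: from a 'repaired' entry to the next 'failure' entry.
    The time before the first logged failure is not counted.
    """
    ttf, ttr, last_event = [], [], None
    for stamp, res, state in log:
        if res != resource or state not in ("failure", "repaired"):
            continue
        t = datetime.strptime(stamp, fmt)
        if last_event is not None and last_event[1] != state:
            hours = (t - last_event[0]).total_seconds() / 3600.0
            (ttr if state == "repaired" else ttf).append(hours)
        last_event = (t, state)
    return ttf, ttr


def exponential_rate(samples):
    """Maximum-likelihood rate of an exponential distribution: n / sum(x)."""
    return len(samples) / sum(samples)


# Made-up log entries for illustration only (not taken from the case study data).
LOG = [
    ("01-06-2023 02:00:00", "agv1", "failure"),
    ("01-06-2023 03:00:00", "agv1", "repaired"),
    ("01-06-2023 13:00:00", "agv1", "failure"),
    ("01-06-2023 13:30:00", "agv1", "repaired"),
    ("02-06-2023 09:30:00", "agv1", "failure"),
]
ttf, ttr = failure_repair_samples(LOG, "agv1")
failure_rate = exponential_rate(ttf)  # failures per hour
repair_rate = exponential_rate(ttr)   # repairs per hour
```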

Figure 3 depicts the generated reliability model obtained through the described model generation process. We successfully extracted and parameterized the manufacturing process model and the resource fault models. For each timed transition, the corresponding distribution function including parameters is shown. For each immediate transition, the corresponding weight is displayed. Furthermore, the capacity of one for both assembly cells, and their finite buffer sizes, have also been extracted. This data-driven reliability model can now be used to simulate and analyze the case study system.

Figure 3.

Generated reliability model of the case study system.

4.1.3. Simulation and validation of the generated model

We simulate the generated reliability models using DES. DES is a popular method for simulation of SPNs due to its versatility and efficiency in simulating complex systems.⁷¹ In DES, the system state changes only at discrete points in time at which events occur, and the simulation clock advances from one event to the next. Events are triggered based on the state of the system and on sampled values of random variables. This approach allows for the modeling of stochastic systems with a high degree of realism, as it can capture the effects of randomness and variability in the system. DES is particularly useful in simulating SPNs, as it can effectively handle the probabilistic transitions and random variables that are inherent in these network models.⁷²
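The event-driven mechanism can be illustrated with a deliberately small example. The sketch below simulates a single repairable resource with exponentially distributed times to failure and repair, using a priority queue of future events. It is not the simulator used in the case study; the function name, the rates, and the horizon are assumptions chosen for illustration.

```python
import heapq
import random


def simulate_downtime(failure_rate, repair_rate, horizon, seed=0):
    """Simulate one repairable resource and return its total downtime.

    Future events are kept in a priority queue ordered by time, and the
    clock jumps from one event to the next until the horizon is reached.
    """
    rng = random.Random(seed)
    events = [(rng.expovariate(failure_rate), "failure")]
    downtime, failed_at = 0.0, None
    while events:
        clock, kind = heapq.heappop(events)
        if clock >= horizon:
            break
        if kind == "failure":
            failed_at = clock
            heapq.heappush(events, (clock + rng.expovariate(repair_rate), "repaired"))
        else:
            downtime += clock - failed_at
            failed_at = None
            heapq.heappush(events, (clock + rng.expovariate(failure_rate), "failure"))
    if failed_at is not None:  # resource is still down when the horizon is reached
        downtime += horizon - failed_at
    return downtime


# 20 independent replications over 1000 hours (hypothetical rates per hour).
replications = [simulate_downtime(0.1, 1.0, 1000.0, seed=s) for s in range(20)]
mean_downtime = sum(replications) / len(replications)
```

Running several replications with different seeds, as in the last lines, yields the sample of outputs on which confidence intervals can be computed.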

Validation is an essential step to ensure that the generated reliability model accurately reflects the behavior of the real-world system being modeled. For this case study, we use quantitative validation, which is typically conducted using either input–output transformations (IOT) or historical input data (HID). In the case of IOT, only the output from the actual system is compared to the output generated by the reliability model, without taking into account real data for the input random variables. However, when using HID, both the output and input data from the actual system are incorporated into the evaluation.⁷¹

We use the following two key performance indicators (KPIs) to compare the real system with the generated reliability model:

Production volume: Total amount of production orders completed over a given period of time.

Resource downtime: Amount of time that resources in the manufacturing system are not operational due to breakdowns, maintenance, or other reasons.

To assess the similarity between the outputs of the real system and the reliability model, we calculate confidence intervals using the t-distribution. By comparing the confidence interval of a KPI for the real system with that of the reliability model, we can determine whether their outputs are statistically different, and thus whether the reliability model accurately captures the behavior of the real system. To do so, let $X_{obs}$ and $X_{sim}$ be the observed and simulated outputs, with means $μ_{obs}$ and $μ_{sim}$ and sample standard deviations $SD_{obs}$ and $SD_{sim}$, and let $n$ be the number of data points in the observed and simulated data, respectively. The confidence intervals for $μ_{obs}$ and $μ_{sim}$ can then be calculated as follows:

CI_{obs} = [μ_{obs} - t_{\frac{α}{2}, n - 1} \cdot \frac{SD_{obs}}{\sqrt{n}}, μ_{obs} + t_{\frac{α}{2}, n - 1} \cdot \frac{SD_{obs}}{\sqrt{n}}]

CI_{sim} = [μ_{sim} - t_{\frac{α}{2}, n - 1} \cdot \frac{SD_{sim}}{\sqrt{n}}, μ_{sim} + t_{\frac{α}{2}, n - 1} \cdot \frac{SD_{sim}}{\sqrt{n}}]

where $t_{\frac{α}{2}, n - 1}$ is the t-value corresponding to the desired level of confidence and degrees of freedom.
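As a minimal sketch, the two intervals and the overlap check can be computed as follows. The data values are made up, the function names are hypothetical, and the critical t-value is passed in as a parameter (taken from a t-table) because the Python standard library does not provide a t-distribution quantile function.

```python
from math import sqrt
from statistics import mean, stdev


def t_confidence_interval(data, t_crit):
    """Return (lower, upper) = mean -/+ t_crit * SD / sqrt(n).

    t_crit is the critical value t_{alpha/2, n-1}, looked up by the caller.
    """
    n = len(data)
    half_width = t_crit * stdev(data) / sqrt(n)
    return mean(data) - half_width, mean(data) + half_width


def intervals_overlap(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Made-up daily production volumes; t_{0.025, 9} = 2.262 (95% level, n = 10).
observed = [148, 152, 150, 149, 151, 153, 147, 150, 152, 148]
simulated = [150, 151, 149, 152, 148, 150, 153, 149, 151, 150]
ci_obs = t_confidence_interval(observed, 2.262)
ci_sim = t_confidence_interval(simulated, 2.262)
print(ci_obs, ci_sim, intervals_overlap(ci_obs, ci_sim))
```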

Figure 4 illustrates the validation results for our extracted reliability model using IOT and 100 simulation replications at a confidence level of 95%. We validate the model using the same data from which it was generated, and an unseen dataset capturing the production system in another month. As depicted, the confidence intervals of the model and the real system overlap for both KPIs, representing the production volume over one day and the total downtime of all production resources over one day. Consequently, we can assume the generated model to be valid for the system under study.

Figure 4.

Validation results.

Model Application and Decision Support: The generated and validated data-driven reliability model can now be applied to support various decisions regarding, for example, resource maintenance or system configuration. To apply the model, adjustments may be needed, such as modifying distribution functions for timed transitions, modifying firing probabilities for immediate transitions, or adding new model components. After the modifications, the model is simulated to assess the impact on the system. Similar to the validation process, this involves selecting an appropriate KPI that can be used to compare the model before and after modifications have been applied.

To demonstrate the application of the previously generated and validated reliability model, we consider the following scenario. The production manager wants to analyze how reducing the repair time for production resources will affect both the total resource downtime and the production volume. Based on this analysis, the production manager can make informed decisions, such as investing in better training for existing repair crews or determining the need for additional repair crews.

We test this scenario by adjusting the distribution functions of the repair transitions of the production resources in the SPN. Specifically, we decrease the duration of the repair activities by a factor ranging from 1 to 3 with a step of 0.1. A reduction factor of 3 implies that the repair activities are performed three times faster compared to the original configuration. Mathematically, the reduction factor $R$ can be described as $R = \frac{T_{o}}{T_{r}}$ where $T_{o}$ is the original repair duration and $T_{r}$ is the reduced repair duration. Figure 5 demonstrates the significant impact of reducing the repair duration on both resource downtime and production volume. The figure presents the mean values of both KPIs after conducting 100 simulation replications for each reduction factor.
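The parameter sweep itself is simple to set up. The following sketch generates the reduction factors and the corresponding reduced repair durations from $R = \frac{T_{o}}{T_{r}}$; the original duration of 30 minutes is a hypothetical value, and in the actual experiment the scaling is applied to the repair distributions of the SPN rather than to a single number.

```python
def scaled_repair_time(original, factor):
    """Reduced repair duration T_r = T_o / R for a reduction factor R >= 1."""
    if factor < 1:
        raise ValueError("reduction factor must be at least 1")
    return original / factor


# Reduction factors 1.0, 1.1, ..., 3.0 (built from integers to avoid float drift).
factors = [i / 10 for i in range(10, 31)]

# Hypothetical original mean repair duration of 30 minutes.
reduced = {r: scaled_repair_time(30.0, r) for r in factors}
```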

Figure 5.

Effect of reducing the repair time duration of production resources.

4.2. Case study 2: Hybrid learning of FTs

The classical way of analyzing FTs relies extensively on expert knowledge, and while this is essential for some systems, FTA can benefit substantially from data-driven M&S techniques. In this section, we present a methodology to learn FTs from both data and expert knowledge, which we call hybrid learning of FTs. The highlights of this case study can be summarized as follows:

Introducing a systematic way of converting human knowledge into formal statements that can be combined with facts extracted from data;

Considering the fusion of data and expert knowledge at different levels for M&S of the system;

Applying hybrid learning of FTs for a multistate repairable FT.

To showcase and assess how the availability of expert knowledge and data can affect the performance and accuracy of a system’s reliability measures, we start with an introduction to FTA through a multistate FT case study taken from the research literature on FTs.⁷³ Multistate FTs have the same structure as regular FTs, except that the components or the system may have more than two functioning levels. In other words, the state space of the system and its components may be represented by $\{0, 1, \dots, M\}$, where 0 indicates a completely failed state, $M$ indicates a perfectly working state, and the others are degraded states (Table 4).

Table 4.

Reliability and maintainability distribution functions of the basic events of the FT in Figure 6.

	Components	Reliability probability distribution (rate in f/h)	Maintainability probability distribution
1	Disk (Dij)	Exp(8.0e-05)	Weibull(5, 0.75)
2	Processor (Pi)	Exp(5.0e-07)	Exp(0.25)
3	Memory (Mj)	Exp(3.0e-08)	Weibull(5, 20)
4	Bus (N)	Exp(2.0e-09)	Exp(0.006)
5	Power supply (PS)	Refer to Figure 7	Refer to Figure 7

A fault-tolerant multiprocessor (FTMP) consists of two processors $Pi$ $(i = 1, 2)$ with private memories $Mi$ $(i = 1, 2)$ and a shared memory $M3$ (Figure 6). A processor and a memory form a processing unit. Each processing unit is connected to a mirrored disk system $Dij$ ($i = 1, 2$ and $j = 1, 2$), forming a processing subsystem. Both processing subsystems and $M3$ are connected via an interconnection bus $N$. Bobbio et al.⁷³ refined the description of the multiprocessor system by adding the power supply component $PS$, whose failure causes a system failure. The $PS$ is a multistate basic component that is modeled with three possible modes: working, defective, and failed. The first corresponds to normal behavior, the second to a defective working mode in which an abnormal voltage is provided, and the last to a situation in which the $PS$ cannot work at all. As anticipated, the failed mode causes the whole system to be down. According to the literature, the failure distribution of all components (except for the $PS$) is assumed to be exponential with the failure rates given in Table 4, expressed in failures per hour (f/h). The state change diagram for the multistate component $PS$ is illustrated in Figure 7, where the transition from the working to the defective state follows an exponential probability distribution with a rate of 3.0e-05 ($Exp(3.0e-05)$). The $PS$ fails according to the $Normal(0.25, 0.1)$ probability distribution function, and it is repaired with a duration that follows the $Uniform(0.1, 0.2)$ probability distribution function.

Figure 6.

A fault tree of an FTMP system with a multistate component $PS$ .

Table 5.

2 × 2 confusion matrix that depicts all four possible outcomes in classification.

MCS of the learned FT	MCS of the original FT
	Identified	Not identified
Identified	True positive (TP)	False positive (FP)
Not identified	False negative (FN)	True negative (TN)

Table 6.

Results of learning fault trees of the FTMP from data and expert knowledge under different scenarios.

		Scenario 1	Scenario 2	Scenario 3	Scenario 4
Structure learning measures	Sen	0.200 ± 0.016	0.473 ± 0.018	0.900 ± 0.018	0.982 ± 0.008
	Spe	0.989 ± 3e-04	0.990 ± 6e-04	0.999 ± 3e-04	0.998 ± 4e-04
	ACC	0.987 ± 3e-04	0.989 ± 6e-04	0.998 ± 3e-04	0.998 ± 4e-04
	F-mea	0.075 ± 0.006	0.195 ± 0.014	0.778 ± 0.038	0.792 ± 0.038
Parameter learning measures	${\hat{U}}_{n}$	2.54e-06 ± 6.43e-07	1.91e-06 ± 6.02e-07	5.73e-06 ± 3.94e-07	6.36e-06 ± 0.00
	RMSE	3.48e-06 ± 5.85e-07	4.05e-06 ± 5.48e-07	5.78e-07 ± 3.58e-07	0.00 ± 0.00

Figure 7.

State change diagram for the multistate component PS.

There are two aspects of FTA:

Qualitative Analysis:

Cut sets indicate which combinations of component failures lead to system failures. A minimal cut set (MCS) is a cut set of which no proper subset is a cut set. For example, in the sub-tree of FTMP displayed in Figure 8, $\{D11, D12\}$ is an MCS, and $\{D11, D12, P1\}$ is a cut set. The structure function of the sub-tree in Figure 8 equals:

S1 = (D11 \cdot D12) + P1 + (M1 \cdot M3),

where $\{D11, D12\}$, $\{P1\}$, and $\{M1, M3\}$ are the minimal cut sets. The structure of a system’s FT can be fully knowledge-driven, fully data-driven, or a combination of both. When the structure is fully knowledge-driven, the experts’ task becomes more tedious and error-prone as the system grows in complexity or changes its behavior over time.
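The relation between cut sets, minimal cut sets, and the structure function can be sketched in a few lines of Python. The component names are those of the sub-tree in Figure 8; the function names are hypothetical and the code is only an illustration of the definitions, not the learning algorithm used in the case study.

```python
def minimal_cut_sets(cut_sets):
    """Keep only the cut sets that have no proper subset among the cut sets."""
    unique = {frozenset(cs) for cs in cut_sets}
    return {cs for cs in unique if not any(other < cs for other in unique)}


def system_fails(failed_components, mcs):
    """Structure function: the (sub)system fails if all members of at least
    one minimal cut set have failed."""
    failed = set(failed_components)
    return any(cs <= failed for cs in mcs)


# Cut sets of the sub-tree, including one non-minimal cut set.
cut_sets = [{"D11", "D12"}, {"D11", "D12", "P1"}, {"P1"}, {"M1", "M3"}]
mcs = minimal_cut_sets(cut_sets)
```

Here `{"D11", "D12", "P1"}` is dropped because it contains the cut sets `{"D11", "D12"}` and `{"P1"}`.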

Quantitative Analysis:

In quantitative analysis, one objective is to measure how reliable the system of interest is. Assuming that the failure times of the components and the times it takes to repair them are independent and governed by probability distribution functions, quantitative FTA starts with estimating the failure and repair probability distribution functions. Often, in FTA, exponential or Erlang distributions are considered for modeling time-to-failure or time-to-repair associated with basic events, and domain experts specify the parameters of these distributions, e.g., an exponential distribution with a rate of 0.5.

Once the FT has been fully extracted and characterized, both in structure and parameters, reliability or unavailability of the system can be calculated.

As shown in Table 1, expert knowledge and data can be fused for both structure learning and parameter learning.

In this case study, we implement the methodology presented in the work by Niloofar and Lazarova-Molnar⁶⁶ by extracting the MCSs of the FTMP from time series data of faults received from this system and then combining this information with the statements from experts to estimate the system reliability.

The set of minimal cut sets for the FT from Figure 6 has 11 elements: $MCS = \{\{PS\}, \{N\}, \{D11, D12, M3, M2\}, \{D11, D12, P2\}, \{D11, D12, D21, D22\}, \{P1, M3, M2\}, \{P1, P2\}, \{P1, D21, D22\}, \{M1, M3, M2\}, \{M1, M3, P2\}, \{D21, D22, M1, M3\}\}$. Applying the proxel-based simulation method,^74,75 we calculated instantaneous unavailabilities of the basic events and the top event, which are shown in Figure 9.

Figure 8.

A sub-tree of the FTMP fault tree from Figure 6.

The hybrid FTA methodology consists of the following steps:

Generating time series data by simulating the original (ground truth) FT of Figure 6 with the parameters in Table 4 and Figure 7;

Learning/Reconstructing the FT from the time series data simulated from the original FT and prior knowledge from the expert, under different scenarios;

Calculating the performance measures obtained from the hybrid learning of FT.

For this case study, we consider the following four scenarios:

No knowledge from experts and 20% data availability;

20% knowledge from experts and 20% data availability;

No knowledge from experts and 90% data availability;

90% knowledge from experts and 90% data availability (expert’s knowledge and data might overlap).

By 20% or 90% availability of knowledge from experts, we mean that 20% or 90% of the MCSs are derived from expert knowledge. For instance, in the FT of Figure 8, if 1 out of 3 MCSs is derived from expert knowledge, the expert knowledge availability is approximately 33%.

Figure 9.

Instantaneous unavailabilities for the fault-tolerant multiprocessor from Figure 6.

Systematic Fusion of Data and Expert Knowledge: To benefit the most from the human prior knowledge of the system, we need to find a way to transfer this information into statements that can be seamlessly fused with data.

For the qualitative analysis of the FTMP system of Figure 6, let us assume that we have a piece of expert knowledge in the following form: “whenever events $M1$, $M2$, and $M3$ happen, the whole system fails.” This information yields the $MCS : \{M1, M3, M2\}$. However, we also have the information that “whenever events $M1$ and $M2$ happen, there is a probability that the system fails,” i.e., the expert is not sure about the outcome of the simultaneous failures of the components corresponding to $M1$ and $M2$. These statements from experts can be validated later by data as they are being collected, or they can be implemented in the model as weighted cut sets: $CS : \{\dots, \{M1, M3, M2\}_{p_{1}}, \{M1, M2\}_{p_{2}}, \dots\}$. Weighted cut sets assign each cut set a probability of causing a system failure, and these probabilities can be updated later as more data are collected.

For the quantitative analysis, information received from experts can be implemented as prior probability distribution functions ($π(θ)$) or point estimates ($\hat{θ}$). In the case of prior probability distributions, the collected data $(X = \{x_{1}, x_{2}, \dots, x_{n}\})$ can be used to update the prior distribution function to a posterior distribution function $(π(θ | X))$ using Bayesian theory.

For instance, assume an expert believes that an item fails twice a year. This belief from the expert can be converted to an exponential distribution with a failure rate of $λ = \frac{2}{365} = 0.00548$ failures per day. Hence, $π(θ) \sim Exp(0.00548)$. However, collected data on failure times in a year for this item shows: $X = \{120, 240, 320\}$, meaning that the item has failed after 120, 240, and 320 days of working. The posterior distribution function for an exponential prior distribution function, $Exp(λ)$, follows a Gamma distribution with parameters $α = n + 1$ and $β = n \bar{x} + λ$, where $\bar{x}$ is the mean of the observation vector $X$, and $n$ is the number of observations in $X$.⁷⁶ Based on the Gamma distribution function, the updated estimate for the failure rate is the mean of the Gamma distribution function: $\frac{α}{β}$. Hence,

\begin{matrix} mean (π (θ | X)) = \frac{α}{β} = \frac{n + 1}{n \bar{x} + 0.00548} \\ = \frac{3 + 1}{680 + 0.00548} = 0.00588, \end{matrix}

where $\bar{x} = 226.67$ is the mean of $X$, and $n = 3$ is the number of observations in $X$. We can see that the initial expert’s belief is updated from 0.00548 to a slightly larger value, 0.00588, meaning that the item is somewhat less reliable than initially assumed by the expert.
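The update can be reproduced with a few lines of Python. This sketch follows the Gamma posterior parameters as stated in the text ($α = n + 1$, $β = n \bar{x} + λ$); the function name is hypothetical, and the prior rate is computed directly as two failures per 365 days.

```python
def posterior_failure_rate(observations, prior_rate):
    """Posterior mean of an exponential failure rate.

    With an Exp(prior_rate) prior on the rate and exponentially distributed
    failure times, the posterior is Gamma(alpha, beta) with
    alpha = n + 1 and beta = sum(observations) + prior_rate.
    """
    n = len(observations)
    alpha = n + 1
    beta = sum(observations) + prior_rate  # sum(x) equals n * mean(x)
    return alpha / beta


# Expert belief: two failures per year; observed failure times in days.
estimate = posterior_failure_rate([120, 240, 320], prior_rate=2 / 365)
print(round(estimate, 5))  # prints 0.00588
```

Because the prior rate is tiny compared with the sum of the observations, the posterior mean is dominated by the data in this example.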

Performance Evaluation: To compare the reconstructed FT structure with that of the original FT, we use the 2 × 2 confusion matrix of Table 5 that depicts all four possible outcomes.

In this confusion matrix, TP represents the number of sets that are in the MCS of both the reconstructed FT and the original FT (correctly identified sets). FP is the number of sets in the MCS of the reconstructed FT that are not in the MCS of the original FT (incorrectly identified sets). FN is the number of undetected sets, and finally, TN is the number of sets that are correctly undetected. Using the confusion matrix, we calculate the sensitivity (Sen), specificity (Spe), accuracy (ACC), and F-measure (F-mea):

\begin{matrix} Sen = \frac{TP}{(TP + FN)}, Spe = \frac{TN}{(TN + FP)}, \\ ACC = \frac{(TP + TN)}{(TP + TN + FP + FN)}, \\ F\text{-}mea = \frac{2 TP}{(2 TP + FP + FN)} \end{matrix}

Larger values of the above-mentioned measures indicate higher performance in structure learning.
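For illustration, the four measures can be computed directly from the confusion-matrix counts. The counts in the example are hypothetical and are not taken from the experiments reported in this article.

```python
def structure_learning_measures(tp, fp, fn, tn):
    """Sensitivity, specificity, accuracy, and F-measure from confusion counts."""
    return {
        "Sen": tp / (tp + fn),
        "Spe": tn / (tn + fp),
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "F-mea": 2 * tp / (2 * tp + fp + fn),
    }


# Hypothetical counts: 9 of 11 true MCSs found, 1 spurious set, 89 correctly absent.
measures = structure_learning_measures(tp=9, fp=1, fn=2, tn=89)
```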

When the structure of the FT is extracted from the data/expert knowledge, the unavailability of the system can be calculated using proxel-based simulation. Since unavailabilities are calculated as transient solutions for each time step, we have a vector of instantaneous unavailabilities calculated for the extracted FT ${\hat{U}}_{i}, i = 1, \dots, n$ , where $n$ is the total number of time steps. For the original FT, there is also an associated vector of instantaneous unavailabilities: $U_{i}, i = 1, \dots, n$ . Root Mean Square Error (RMSE) is used to compare these vectors of unavailabilities:

RMSE = \sqrt{\frac{\sum_{i = 1}^{n} {(U_{i} - {\hat{U}}_{i})}^{2}}{n}}

Better estimation of unavailability leads to a smaller distance between $\{{\hat{U}}_{i}\}$ and $\{U_{i}\}$, and hence to smaller values of RMSE. We also report ${\hat{U}}_{n}$ and $U_{n}$ as the final stable unavailability values.
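A minimal sketch of the RMSE computation is given below. The two unavailability vectors are made-up values over three time steps, used only to show the calculation.

```python
from math import sqrt


def rmse(true_values, estimates):
    """Root mean square error between two equally long sequences."""
    if len(true_values) != len(estimates):
        raise ValueError("sequences must have the same length")
    n = len(true_values)
    return sqrt(sum((u - u_hat) ** 2 for u, u_hat in zip(true_values, estimates)) / n)


# Made-up instantaneous unavailabilities of the original and the learned FT.
u_true = [1.0e-06, 4.0e-06, 6.0e-06]
u_est = [1.0e-06, 3.0e-06, 4.0e-06]
error = rmse(u_true, u_est)
```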

Summary of Experimental Results: In this case study, we assume that the expert knowledge of the FT parameters, i.e., the probability distribution functions, is complete, and that partial knowledge only concerns the structure of the FT. Hence, the structure is learned through the fusion of data and expert knowledge.

We observe that as we enhance the algorithm with prior knowledge of the system, the measures reported in Table 6 also improve. The true unavailability of the system is $U_{n}$ = 6.36109e-06, and ${\hat{U}}_{n}$ is always underestimated since the whole set of minimal cut sets is not observed from data or reported by experts. The best results are achieved in Scenario 4, where we have high availability of knowledge from experts and data.

5. Conclusion

Reliability M&S is gaining in importance with the ever-increasing complexity of systems that surround us. The conventional approach to reliability modeling heavily relies on expert knowledge, which poses limitations as systems grow in complexity. In this article, we provide an overview of the use of M&S for reliability analysis of CPS, with the aim of pointing out how new technologies that enable easy collection of large quantities of data from systems can impact and transform the traditional ways of conducting such analysis. With this, we outline and describe the emerging trends in reliability modeling, mostly through data-driven and hybrid modeling that fuses data and expert knowledge. Learning reliability models from data leverages real-world data collected from systems to improve the accuracy and adaptability of extracted models. It is, however, important that in the whole process, we do not neglect the knowledge that we already have about these systems.

The systematic fusion of data and expert knowledge has the potential to play a significant role in enhancing reliability modeling. While data-driven approaches provide valuable insights, incorporating human expertise and cognitive abilities is essential. Combining expert knowledge with data-driven techniques allows for more efficient and more accurate model extraction, and improved predictions and decision-making. This integration can be achieved through approaches, such as expert systems, Bayesian statistics, transfer learning, and explainable AI.

To illustrate the new trends in reliability M&S, and based on our research developments, we presented two case studies. The first case study illustrates the data-driven reliability assessment of production systems, where the extracted models are described using SPNs. The second case study focuses on the fusion of data and expert knowledge to extract reliability models in the form of FTs.

These advancements offer significant potential for addressing the limitations of conventional reliability modeling. By leveraging data and expert knowledge, researchers and practitioners can enhance the accuracy, adaptability, and decision-making capabilities of reliability models. Furthermore, these advancements are very beneficial in terms of the utilization of Digital Twins technologies, which rely on having models updated with changes in the real systems. By developing approaches that automate simulation model extraction, be it using solely data or incorporating human expertise, we enable Digital Twins, in this case reliability-centered Digital Twins.

Footnotes

Funding

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Author biographies

Sanja Lazarova-Molnar holds two full professorships, at the Karlsruhe Institute of Technology and the University of Southern Denmark. Her research focuses on data-driven simulation, digital twins, and cyber-physical systems modeling for reliability and energy efficiency enhancement. She leads activities focused on digital twins and data-driven simulation in several European and national projects. Furthermore, she assumes leadership roles in IEEE and represents The Society for Modeling & Simulation International (SCS) to the Winter Simulation Conference (WSC) Board of Directors, contributing significantly to these professional organizations. She was also one of the Proceedings Editors for the WSC in 2019 and 2020, and an associate editor of SIMULATION: Transactions of The Society for Modeling and Simulation International.

Jonas Friederich holds a PhD in Computer Science from the University of Southern Denmark. His research interests cover Modeling and Simulation, Process Mining, Machine Learning, and Computer Vision.

Parisa Niloofar is an associate professor in the Software Engineering section of the Faculty of Engineering at the University of Southern Denmark (SDU). Her current research interests include Data-driven Simulation and Modeling, Cognitive Digital Twins, Hybrid Learning, Graphical Modeling, Missing Values, and Reliability Analysis. She obtained her PhD in Statistics, specializing in graphical modeling and missing value imputation. Her current research on (hybrid) data-driven reliability modeling of cyber-physical systems has added value to the literature.
