Abstract

Systems fail. Period. No matter how much planning and fault analysis is performed, it is impossible to create a perfectly reliable machine. The existing approach to improving reliability invariably involves advances in fault prediction and detection to include specific mechanisms to overcome a particular failure or mitigate its effect. While this has gone a long way in increasing the operational life of a machine, the overall complexity of systems has increased sharply, and it is becoming more and more difficult to predict and account for all possible failure modes. What is discussed here is a possible shift in approach from specific repair strategies to autonomous self-repair. Rather than focusing on mitigating or reducing the probability of failure, the focus is instead on what can be done to correct a failure that will invariably occur at some point during operation. By taking this approach, it is not just expected failures that can be designed for; unexpected failure modes are also inherently compensated for, extending the potential life of a system and reducing the need for through-life servicing.

I. Introduction

It is impossible to discuss the concepts of self-healing and self-repair without having some notion of their meanings. There are currently no universally accepted definitions of these terms, only intuitive notions about the concepts involved. It is not the purpose of this article to suggest a new taxonomy, but instead to look at what the overall aims of this emerging field are and perhaps reflect on what is achievable now. To complicate matters, there are currently many terms for similar ideas and, conversely, many distinctly different ideas that are referred to by the same name. Furthermore, different fields of research such as electronics or mechanical design can have vastly different interpretations and objectives. A good example of this is modular or physical^1–3 redundancy in electronics: these concepts could perhaps be thought of as inefficient if the same principles are applied to a purely mechanical system, which would then contain more material or elements than are strictly necessary for an optimized design.

In layman’s terms, perhaps what we are looking for in self-repair are systems that are able to maintain some degree of functionality after a failure has occurred. This might be a controversial interpretation, however, as it can be argued that certain self-preservation or preemptive actions, such as prognostics or mitigation through fault tolerance, are an intrinsic element of self-healing, and hence, we should not focus solely on what happens after the event of a failure.

The above definition is similar to the general or biological definition of “resilience,” which is commonly interpreted as the ability to recover from adversity.⁴ Hence, fault-tolerant approaches might better fall under this general umbrella of “resilience” rather than self-healing.

Fundamentally, one crucial distinction is the difference between a reactive and a proactive system. In fault tolerance, where the system is able to absorb a finite number of fault events without explicit repair or reconfiguration, it is assumed that failure can to a certain extent be prevented. For the purpose of this discussion, however, we will assume that failure can and does occur.

II. Achieving Self-Repair

To achieve a self-repairing system, it is clear that the system must have an element of self-awareness. Amor-Segan et al.⁵ state that the ultimate aim is to develop a system with “the ability to autonomously predict or detect and diagnose failure conditions, confirm any given diagnosis, and perform appropriate corrective intervention(s).” Following this logic, Figure 1 offers a proposed approach that can theoretically be applied to any system. By breaking the process down into a number of finite steps, we can better assess the current progress toward achieving an idealized self-repairing system.

Figure 1. Proposed approach to self-correcting systems

Perhaps the first point that can be drawn from this proposed process is that the underlying cause of fault is not considered crucial. There is a whole research sector dedicated to function-based failure analysis,^6–8 and while there will invariably be some degree of crossover between the disciplines, here, it is better to focus instead on what happens after failure has occurred.
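The staged process described in this section (detect, diagnose, confirm, correct) can be sketched as a single control step. This is only an illustrative outline under stated assumptions: the `Diagnosis` type, the `self_repair_step` function, the status strings, and the `pump_a` component label are hypothetical names introduced here and are not taken from Figure 1 or from Amor-Segan et al.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Diagnosis:
    component: str     # hypothetical label of the suspected component
    confidence: float  # belief in the diagnosis, between 0 and 1


def self_repair_step(
    detect: Callable[[], bool],
    diagnose: Callable[[], Optional[Diagnosis]],
    confirm: Callable[[Diagnosis], bool],
    correct: Callable[[Diagnosis], bool],
) -> str:
    """Run one pass of detect -> diagnose -> confirm -> correct.

    Returns a status string naming the stage at which the pass ended.
    """
    if not detect():
        return "nominal"
    diagnosis = diagnose()
    if diagnosis is None:
        return "undiagnosed"
    if not confirm(diagnosis):
        # No intervention is attempted on a doubtful diagnosis.
        return "unconfirmed"
    return "corrected" if correct(diagnosis) else "correction_failed"


# Usage with stand-in callbacks (assumed values, for illustration only).
status = self_repair_step(
    detect=lambda: True,
    diagnose=lambda: Diagnosis("pump_a", 0.93),
    confirm=lambda d: d.confidence >= 0.9,
    correct=lambda d: True,
)
print(status)
```

Passing each stage in as a callable keeps the step independent of the underlying cause of the fault, which mirrors the point made above that the cause itself is not considered crucial.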

III. Detection and Diagnosis

Any critical fault will almost invariably lead to a fundamental change in the behavior of the system. This could perhaps be most easily interpreted as a deviation from the prescribed behavior, detected using either internal or external telemetric data. One of the difficulties with complex systems is in defining “expected behavior”; however, this problem is not insurmountable, and a great deal of progress has been made in this research area.⁹
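As a minimal sketch of detection as a deviation from prescribed behavior, the function below compares telemetry samples with an expected signal and flags a fault only when the deviation persists. The function name, the tolerance, the persistence count, and the sample data are all assumptions made for illustration; how "expected behavior" is obtained in practice is exactly the difficulty noted above.

```python
from typing import Sequence


def detect_deviation(
    observed: Sequence[float],
    expected: Sequence[float],
    tolerance: float,
    persistence: int = 3,
) -> int:
    """Return the sample index at which a persistent deviation is confirmed, or -1.

    A deviation counts only if |observed - expected| exceeds `tolerance`
    for `persistence` consecutive samples, so isolated glitches are ignored.
    """
    run = 0
    for i, (obs, exp) in enumerate(zip(observed, expected)):
        if abs(obs - exp) > tolerance:
            run += 1
            if run >= persistence:
                return i
        else:
            run = 0
    return -1


# Assumed telemetry: one isolated glitch at index 2, then a sustained drift.
expected = [1.0] * 8
observed = [1.0, 1.02, 1.5, 1.0, 1.4, 1.5, 1.6, 1.7]
print(detect_deviation(observed, expected, tolerance=0.1))  # prints 6
```

Requiring persistence is one simple way to separate transient upsets from sustained changes in behavior; it trades detection latency for fewer false alarms.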

Conversely, the diagnosis of a fault is perhaps a more difficult proposition. This is partly due to the difficulty in validating large, complex system models because of the vast number of possible system states.⁵ Furthermore, there is the issue of confidence in diagnosis, that is, how much certainty must be present to initiate repair? Because of this, an additional step is proposed in which the diagnosis must be confirmed, to avoid undesirable events such as “good” components being unnecessarily removed or routed around. Several methods are currently available for this:

Model-based abductive reasoning. Compare observation with predicted observation—if “X” is expected but “Y” is obtained, then “Y” must be corrected to get it to match;

Bayesian belief networks. A probabilistic graphical model that represents a set of random variables and their conditional dependencies: if “X” and “Y” happen, it is likely a failure with “Z”;

Case-based reasoning methods. Anecdotal evidence—if “X” happens, do “Y”—it accounts for expected failure only.
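The probabilistic idea behind the second method, combined with the confirmation step proposed above, can be illustrated with a single application of Bayes' rule over a few mutually exclusive hypotheses. This is a deliberately reduced sketch, not a full belief network: the hypothesis names, priors, likelihoods, and the 0.95 confirmation threshold are invented for illustration, and repeated observations are assumed to be conditionally independent.

```python
from typing import Dict, Optional


def posterior(prior: Dict[str, float], likelihood: Dict[str, float]) -> Dict[str, float]:
    """Apply Bayes' rule over mutually exclusive hypotheses.

    `likelihood[h]` is the probability of the observed symptoms given hypothesis h.
    """
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("evidence is impossible under every hypothesis")
    return {h: p / total for h, p in unnormalized.items()}


def confirmed_fault(belief: Dict[str, float], threshold: float = 0.95) -> Optional[str]:
    """Return the most likely fault only if its belief reaches the threshold."""
    best = max(belief, key=belief.get)
    if best != "healthy" and belief[best] >= threshold:
        return best
    return None


# Assumed numbers: symptoms "X" and "Y" are far more likely under an actuator fault.
prior = {"healthy": 0.90, "sensor_fault": 0.07, "actuator_fault": 0.03}
symptoms = {"healthy": 0.01, "sensor_fault": 0.10, "actuator_fault": 0.90}

belief_1 = posterior(prior, symptoms)     # after the first observation
belief_2 = posterior(belief_1, symptoms)  # after a second, independent observation

print(confirmed_fault(belief_1))  # belief still below the threshold
print(confirmed_fault(belief_2))  # belief now high enough to act on
```

With these assumed numbers, one observation is not enough to initiate repair, while a second consistent observation is. That is one way of expressing "how much certainty must be present" as an explicit, tunable threshold.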

Currently, there has been some progress in these areas in electronics with built-in self-testing (BIST). Silicon electronic devices are susceptible to a variety of upset events, including transitory events (e.g. random single upset events caused by radiation) and permanent fault conditions that can be triggered by a vast variety of events. Rather than eliminating the underlying cause, BIST has been developed for computer dynamic random access memory (DRAM), where special structures are included in the memory chips that are activated when attached to a production test machine. This enables rapid and reliable allocation of redundant memory cells to replace faulty cells that are commonly found in high-density memory. Perhaps what we now seek is a shift from this external detection to in-system detection and correction, such as self-contained BIST logic that can operate independently of the (expensive) production test machine during the operational lifetime of the memory chip. Data error detection and repair are particularly pervasive in electronic systems; they protect critical memory areas such as on-chip cache, which cannot tolerate transient upset errors.
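The allocation of redundant cells to replace faulty ones can be illustrated with a toy software model of a memory that tests itself and remaps failing rows to spares. Everything here is a simplifying assumption made for illustration: the `SelfRepairingMemory` class, the stuck-at-zero fault model, and the two test patterns are hypothetical, and the check is a plain write/read pattern test rather than any particular production BIST algorithm.

```python
from typing import Dict, Iterable, List, Sequence


class SelfRepairingMemory:
    """Toy row-addressed memory with spare rows and a built-in self-test."""

    def __init__(self, rows: int, spares: int, stuck_at_zero: Iterable[int] = ()):
        self.rows = rows
        self._cells = [0] * (rows + spares)
        self._stuck = set(stuck_at_zero)   # simulated defects: physical rows that always read 0
        self._remap: Dict[int, int] = {}   # logical row -> spare physical row
        self._free_spares = list(range(rows, rows + spares))

    def _physical(self, row: int) -> int:
        return self._remap.get(row, row)

    def write(self, row: int, value: int) -> None:
        self._cells[self._physical(row)] = value

    def read(self, row: int) -> int:
        phys = self._physical(row)
        return 0 if phys in self._stuck else self._cells[phys]

    def _row_ok(self, row: int, patterns: Sequence[int]) -> bool:
        for pattern in patterns:
            self.write(row, pattern)
            if self.read(row) != pattern:
                return False
        return True

    def self_test_and_repair(self, patterns: Sequence[int] = (0x55, 0xAA)) -> List[int]:
        """Test every logical row and remap failing rows to spare rows.

        Overwrites stored data. Returns the logical rows that could not be repaired.
        """
        unrepaired = []
        for row in range(self.rows):
            # Retry with the next spare in case a spare is itself defective.
            while not self._row_ok(row, patterns):
                if not self._free_spares:
                    unrepaired.append(row)
                    break
                self._remap[row] = self._free_spares.pop(0)
        return unrepaired


memory = SelfRepairingMemory(rows=8, spares=2, stuck_at_zero={3})
print(memory.self_test_and_repair())  # rows left unrepaired
memory.write(3, 42)
print(memory.read(3))                 # served from a spare row after repair
```

Because the test and the remapping both live inside the memory object, no external tester is involved, which is the shift toward in-system detection and correction that the paragraph above describes.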

IV. Corrective Actions

Perhaps of most interest in self-repairing systems is the corrective action itself. If it were possible to fully automate this process, there would be huge potential savings in maintenance, repair, and operations (MRO) costs. The precise methodology employed will almost invariably have to be application specific; however, a number of possible approaches are available:

Physical redundancy. An alternative load path or system is available should the primary system fail:

Currently, this is the easiest approach to include and is already implemented on mission-critical systems;

At a very basic level, this can simply be a complete facsimile of the primary system (modular redundancy) that can take over if failure occurs;

Its relative efficiency can perhaps be measured by how much of the primary system has to be physically replicated to provide the backup.

Self-repair. The system, as a whole, has the ability to partially or fully fix a given fault to continue operation:

This is the approach that is perhaps most achievable in the immediate future;

One approach is to extend the concept of redundancy to the use of degenerate modules that have the ability to perform the same function or yield the same output even if they are structurally different;¹⁰

Using this approach, rather than having individual backups for each module, a single spare module can be reconfigured to provide cover for any defective module;

Alternatively, self-repair through self-reconfiguration does not necessarily require additional materials; instead, performance can be sacrificed to ensure continued functionality utilizing only the currently available resources.

Self-healing. The system is able to physically bring itself back to its initial state of operation after a fault has occurred:

True self-healing systems are currently prohibitively expensive and infeasible for all but the most basic of systems or limited to exotic materials;

An idealized example would be the ability to automatically re-straighten a mechanical element (through a chemical process) after it has been bent, or to physically fix thermal damage in an electronic component;

An alternative approach would be to have entirely adaptable systems, such as “smart dust,”¹¹ where there is a finer level of granularity and near infinite possibilities for reconfiguration.
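Of the corrective actions listed above, self-repair through reconfiguration lends itself most readily to a small sketch: a single spare that can take on the function of whichever module fails first, with later failures leading to degraded operation. The `ReconfigurableSystem` class, its role names, and its status strings are hypothetical and chosen only for illustration; real reconfiguration would be application specific, as noted earlier.

```python
from typing import Callable, Dict, Optional, Set


class ReconfigurableSystem:
    """Named modules plus one generic spare that can assume any module's function."""

    def __init__(self, functions: Dict[str, Callable[[float], float]]):
        self._functions = dict(functions)        # role -> behavior the role must provide
        self._failed: Set[str] = set()
        self._spare_role: Optional[str] = None   # role currently covered by the spare

    def report_failure(self, role: str) -> str:
        """Record a failed module and reconfigure the spare if it is still free."""
        if role not in self._functions:
            raise KeyError(role)
        self._failed.add(role)
        if self._spare_role is None:
            self._spare_role = role
        return "covered_by_spare" if self._spare_role == role else "degraded"

    def run(self, role: str, x: float) -> Optional[float]:
        """Evaluate a role, or return None if its function has been lost."""
        if role in self._failed and role != self._spare_role:
            return None
        return self._functions[role](x)


system = ReconfigurableSystem({"scale": lambda x: 2 * x, "offset": lambda x: x + 1})
print(system.report_failure("scale"))   # the spare takes over this role
print(system.run("scale", 3.0))
print(system.report_failure("offset"))  # no spare left, so the system degrades
print(system.run("offset", 1.0))
```

Compared with full modular redundancy, only one extra module is carried here instead of one backup per module, at the cost of covering only one failure at a time.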

To better emphasize the distinction between the corrective actions, Table 1 shows a simple example of the repair of a car-tire puncture and how this compares to biological approaches.

One salient point that becomes apparent when looking at biological parallels is that the corrective actions do not necessarily have to occur in isolation. In the broken-skin example, it is common for self-repair and self-healing to occur sequentially as parts of one coupled process. This is, in turn, normally preceded by assisted repair, in which the wound is externally bandaged.

Table 1. Biological inspiration for self-healing

Corrective process	Car tire mechanism example	General approach	Biological parallel in broken skin
Redundancy	Run-flat tire—stiffened tire wall that is able to temporarily carry load in the event pneumatic pressure is lost	If primary load/electronic/signal path fails, an alternative is used instead	Areas of skin continuously worn develop calluses to provide additional protection against skin breakage
Self-repair	Tire-weld or a similar substance is used within tire to automatically seal puncture	System is repaired using some peripheral materials automatically	A scab is formed over the cut to prevent further damage and enable continued operation
Self-healing	Low transition temperature rubber tire that is able to automatically melt to seal a puncture	System is healed/repaired at a molecular level with little or no evidence that repair has taken place	Epithelialization, collagen synthesis, contraction, and remodeling occur to produce a near-perfect restoration of the skin

V. Current Progress

The electronics domain is perhaps leading the way with regard to self-repairing systems. An evolution from external testing to self-contained testing is already underway, with built-in self-repair (BISR) proposed as the next step. Error detection and correction (EDC) methods offer BISR functionality via special hardware structures. A limitation here is that permanent faults cannot generally be handled by EDC, and system failure will result. Protection against permanent faults can be provided by introducing system redundancy such as Triple Modular Redundancy (TMR), which was first proposed more than 50 years ago;¹² however, one must assume that the voting logic itself is trustworthy or else can also be replicated.
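The TMR idea and the caveat about the voter can be made concrete with a short sketch. The majority voter below is a generic software analogue, not a hardware design from the cited work, and the reliability function uses the standard two-out-of-three expression under the assumption that the three modules fail independently with the same reliability; the function names and the numerical values are illustrative.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Hashable:
    """Return the value produced by a strict majority of replicas."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority: replicas disagree")
    return value


def tmr_reliability(r_module: float, r_voter: float = 1.0) -> float:
    """Probability that the voter and at least two of three modules work.

    Assumes independent failures and identical module reliability.
    """
    return r_voter * (3 * r_module**2 - 2 * r_module**3)


# One faulty replica is outvoted by the other two.
print(majority_vote([1, 1, 0]))

# Assumed reliabilities: TMR helps only while the voter is trustworthy enough.
print(round(tmr_reliability(0.9), 4))                # ideal voter
print(round(tmr_reliability(0.9, r_voter=0.92), 4))  # imperfect voter
```

With these assumed numbers, an ideal voter lifts a module reliability of 0.9 to about 0.972, whereas a voter of reliability 0.92 brings the overall figure below that of a single module. This is why the voting logic must either be trusted or replicated itself.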

A less popular approach is that of fine-grained fault tolerance employing interconnect interleaving and quadded logic,¹³ which requires additional logic and signal routing hardware but is able to “absorb” certain permanent fault events without loss of functionality. The basic principle at work here is the fine granularity of the underlying transistor and interconnect structure, which offers many possibilities for reconfiguration and fault tolerance. Beyond this, there is significant interest in new bio-inspired approaches that use cellular-based architectures. Inspired by the early observations of Von Neumann on the intrinsic fault-tolerant properties of biological systems,⁴ this offers the possibility of electronic systems whose operation is governed by localized interactions between electronic “cells,” that is, circuits not requiring global coordination, and hence, BIST and BISR can be executed at the cellular level.^14,15

VI. Conclusion

Looking at the overall concept of product reliability from the perspective of the user, a system with an integral resilience mechanism would appear to be more “reliable”: it is able to maintain operation for a longer period of time than would otherwise have been possible. However, from a design perspective, systems with additional procedures built in are invariably more complex, and hence, the primary system becomes intrinsically less “reliable,” even though it is able to bring itself back to a normal operating condition. Getting the balance right between this intrinsic reliability and apparent reliability is important to ensure that self-healing technologies are accepted by the end user.

Despite vast improvements in system modeling and prediction, most machines still fail in the face of unexpected damage,¹⁶ and one of the long-standing challenges of creating a reliable system is achieving robust performance under uncertainty.¹⁷ Self-repairing techniques inherently must be designed to compensate for a wide variety of failure modes, thus overcoming some of the problems associated with uncertainty. Although specific solutions have not been suggested, proposed methodologies for developing self-repairing strategies should not focus on a finite number of underlying causes. Instead, the focus should be on how these causes manifest, how they can be detected, and ultimately how they can be corrected autonomously.

Footnotes

Funding

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

References

1. Habinc. Functional triple modular redundancy (FTMR) VHDL design methodology for redundancy in combinatorial and sequential logic design and assessment report. Available online at http://www.gaisler.com/doc/fpgan003n01-0-2.pdf (2002).

2. Davies, Steffen, Dixon, Goodall, Zolotas. Modelling of high redundancy actuation utilising multiple moving coil actuators. In International Federation of Automatic Control (IFAC) world congress, Seoul, South Korea, 2008, pp.3228–33.

3. Davies, Tsunashima, Goodall, Dixon, Steffen. Fault detection in high redundancy actuation using an interacting multiple-model approach. In IFAC symposium on fault detection, supervision and safety of technical processes (SafeProcess), Barcelona, Spain, 2009, pp.1228–33.

4. Von Neumann. Probabilistic logics and the synthesis of reliable organisms from unreliable components. Automata Studies 1956; 34: 43–98.

5. Amor-Segan, McMurran, Dhadyalla, Jones RP. Towards the self-healing vehicle. In Automotive electronics, 2007: 3rd Institution of Engineering and Technology conference (IET), June 2007, pp.1–7.

6. Tumer, Stone. Mapping function to failure during high-risk component development. Research in Engineering Design 2003; 14(1): 25–33.

7. Arunajadai, Stone, Tumer. A framework for creating a function-based design tool for failure mode identification. In Proceedings of the 2002 ASME design engineering technical conference: Design theory and methodology conference, Montreal, QC, Canada, 2002.

8. Tumer, Stone, Roberts, Brown. A function-based exploration of JPL’s problem/failure reporting database. In Proceedings of the 2003 ASME international mechanical engineering congress and expo (IMECE2003-42769), Washington, DC, 2003.

9. Visinsky, Cavallaro, Walker. Robotic fault detection and fault tolerance: A survey. Reliability Engineering & System Safety 1994; 46(2): 139–58.

10. Edelman, Gally. Degeneracy and complexity in biological systems. Proceedings of the National Academy of Sciences 2001; 98(24): 13763–8.

11. Kahn, Katz, Pister KS. Next century challenges: Mobile networking for “Smart Dust.” In Proceedings of the 5th annual ACM/IEEE international conference on mobile computing and networking, August 1999, pp.271–8.

12. Lyons, Vanderkulk. The use of triple-modular redundancy to improve computer reliability. IBM Journal of Research and Development 1962; 6(2): 200–9.

13. Jensen. Quadded NOR logic. IEEE Transactions on Reliability 1963; R-12(3): 22–31.

14. David, McWilliam, Purvis. Designing convergent cellular automata. BioSystems 2008; 96(1): 80–5.

15. Tyrrell, Greensted. Evolving dependability. Journal on Emerging Technologies in Computing Systems 2007; 3(2).

16. Bongard, Zykov, Lipson. Resilient machines through continuous self-modeling (sic.). Science 2006; 314(5802): 1118–21.

17. Thrun, Burgard, Fox. Probabilistic Robotics. Cambridge, MA: MIT Press, 2005.

Concepts of Self-Repairing Systems
