Abstract
Fault-tolerant systems are expected to operate in a variety of devices ranging from standard PCs to embedded devices. In addition, the emergence of new software technologies has required these applications to meet the needs of heterogeneous software platforms. However, the existing approaches to build fault-tolerant systems are often targeted at a particular platform and software technology. The objective of this paper is to discuss the use of
1. Introduction
At present, a number of software development technologies (e.g., component-based approach, aspect-oriented programming, and web services) can be employed for building systems that can be run on a variety of hardware platforms ranging from standard PCs to networked embedded devices. This scenario is also valid for reliable systems which are often required to run on a variety of hardware platforms including embedded devices. In this paper we are concerned with examining two types of heterogeneity:
Device Heterogeneity. Fault-tolerant systems are often deployed with heterogeneous devices which can range from PCs to embedded devices. However, this heterogeneity is expected to be adversely affected by the emergence of new hardware platforms.
Software Language/Middleware Heterogeneity. There are currently a large number of fault-tolerant policies, each of which requires a particular procedure and strategy. They are normally based on heterogeneous programming languages and technologies (e.g., publish-subscribe systems, web service applications, tuple spaces, and message-oriented toolkits).
The aim of this paper is to investigate approaches that can lead to the development of middleware solutions that require different programming models in different environments. For this purpose, we introduce
The policy is deployed in the form of component plugins, which are destroyed when no longer required.
Flexibility. Fault-tolerant systems can be developed and deployed independently of target platforms. The kernel can plug in the targeted platforms of a particular abstraction or behavior that is implemented.
Reusability/Modularity. Developers can reuse the existing components and processes that are employed for particular platforms.
Transference of Skills. The employment of different technologies to build applications for each targeted device and applicability does not allow for the transfer of skills across different tools. Skill sets and areas of expertise are rarely transferable when they rely on different technologies. A generic approach can bring about the transference of skills because developers only utilize a single tool for developing applications based on a variety of technologies.
Technology Independency.
Interoperability.
The paper is structured as follows. Section 2 discusses the basic concepts and research challenges. Section 3 reviews selected works related to the topics discussed in this paper. Section 4 outlines the generic framework for fault-tolerant system development, including an in-depth examination of the benefits obtained. Sections 5 and 6 then examine case studies which draw on the constructed prototype to show that the proposed approach attains an acceptable standard of performance and an adequate resource consumption overhead. Finally, Section 7 concludes the paper and adds some comments about ongoing work.
2. Background and Research Challenges
Before examining the
Sophisticated applications must now take account of a wide range of software technologies and middleware platforms to meet a large number of requirements. As illustrated in Figure 1(a), a reliable system may have to use an implementation that has already been developed as Java or XPCOM (https://developer.mozilla.org/en/XPCOM) components, and a multithreaded component technology may also be necessary. In addition, multiple distribution abstractions may be required: for example, a publish-subscribe binding when the application is operating over ad hoc wireless networks, along with Web Service middleware when the application needs to interact with a legacy service in the established infrastructure.

Furthermore, end users currently rely on a variety of devices ranging from PCs to smartphones. They are also often interested in accessing data from a number of sources including sensor motes (e.g., data coming from urban rivers, such as temperature and depth levels). Since these different sources have heterogeneous sensor protocols and interfaces, this scenario requires a good deal of effort on the part of application developers. Moreover, information systems that rely on sensors (e.g., information systems for disaster management) usually depend on integrating and composing several services, where each service handles different sensors that monitor and collect specific contextual information. Since the services are typically developed independently, it is important to adhere to a standardized interface to ensure interoperability.
Against this backdrop,
Reliable systems are also built on a diverse range of hardware platforms, as illustrated in Figure 1(b). Instead of depending on porting applications across these platforms with the aid of the software technologies that are available,
Design diversity [2] means that multiple functionally equivalent software components are independently generated from the same initial specification. Two or more versions of the software component are independently developed from this specification, each by a group that does not interact with any other, and, whenever possible, employs different algorithms.
However, the provision of software redundancy involves the following: (i) an increase in the cost of creating the software and (ii) a greater degree of complexity in the system, caused by the addition of redundant components. Ideally, the added software redundancy should be incorporated into the original system in a structured and nonintrusive manner to enable the application designers to construct dependable systems.
2.1. Recovery Block
Recovery block [3] is a technique devised by Randell [4] from what, to some extent, was observed to be the practice at that time. The description outlined here has been slightly changed from the original description so that it is in accordance with the approach for component-based systems development. In a system with recovery blocks [3], the design of the system is broken down into fault recoverable blocks/modules (i.e., reliable system components). Each critical system component requires the separate development of alternative variants (modules of differing design aimed at a common specification) and one adjudicator to check the results produced by the variants (by means of an acceptance test). On entry to a recovery block, the state of the reliable system component (or of the whole system) must be saved to permit backward error recovery, that is, to establish a checkpoint.
The primary alternate is executed, and then the acceptance test is evaluated to provide an assessment of its outcome. If the acceptance test is passed, the outcome is regarded as successful and the recovery block can be exited. The information on the state of the component system obtained on entry (i.e., at the checkpoint) can be discarded. However, if the test fails or if any errors are detected by other means during the execution of the alternate, an exception is raised and backward error recovery is invoked. This restores the state of the component system to what it had been on entry. After this recovery, the next alternate is executed and then the acceptance test is applied again. This sequence continues until either an acceptance test is passed or all of the alternates have failed it. If all the alternates either fail the test or result in an exception (due to an internal error being detected), a failure exception will be signaled to the environment of the recovery block.
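The control flow described above (checkpoint, alternate, acceptance test, rollback, next alternate) can be sketched in plain Java. This is only an illustrative sketch, not the framework's implementation: the class `RecoveryBlockSketch`, the exception `RecoveryBlockFailure`, and the modelling of component state as a list of integers are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

/** Illustrative recovery block; every name in this class is hypothetical. */
public class RecoveryBlockSketch {

    /** Signalled to the environment when every alternate has failed. */
    public static class RecoveryBlockFailure extends RuntimeException {
        public RecoveryBlockFailure(String message) { super(message); }
    }

    /** Runs the alternates in order until one passes the acceptance test. */
    public static List<Integer> run(List<Integer> state,
                                    List<UnaryOperator<List<Integer>>> alternates,
                                    Predicate<List<Integer>> acceptanceTest) {
        // Establish the checkpoint on entry to the recovery block.
        List<Integer> checkpoint = new ArrayList<>(state);
        for (UnaryOperator<List<Integer>> alternate : alternates) {
            // Each alternate works on a fresh copy of the checkpoint, so
            // backward error recovery amounts to discarding that copy.
            List<Integer> working = new ArrayList<>(checkpoint);
            try {
                List<Integer> result = alternate.apply(working);
                if (acceptanceTest.test(result)) {
                    return result; // test passed: exit the block
                }
            } catch (RuntimeException e) {
                // An error detected during execution also triggers recovery.
            }
        }
        throw new RecoveryBlockFailure("all alternates failed the acceptance test");
    }

    /** Example acceptance test for a sort: output must be in ascending order. */
    public static boolean isSorted(List<Integer> xs) {
        for (int i = 1; i < xs.size(); i++) {
            if (xs.get(i - 1) > xs.get(i)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        List<UnaryOperator<List<Integer>>> alternates = new ArrayList<>();
        alternates.add(xs -> xs);                            // faulty primary
        alternates.add(xs -> { xs.sort(null); return xs; }); // correct alternate
        System.out.println(run(Arrays.asList(3, 1, 2), alternates,
                               RecoveryBlockSketch::isSorted));
    }
}
```

In this sketch the faulty primary returns its input unchanged, fails the acceptance test, and the second alternate is then run from the restored checkpoint.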
2.2. N-Version Programming Technique
Among the design diversity techniques, it is worth highlighting the N-version programming technique [2]. In an N-version software system, each module is formed of up to N different implementations. Each variant carries out the same task but, it is hoped, in a different way. Each version then submits its answer to a voter or decider, which determines the correct answer (e.g., the answer given by the majority of the versions) and returns this as the result of the N-version component system.
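A minimal Java sketch of this scheme is given below. It assumes a strict-majority decider and runs the versions on a thread pool; the class `NVersionSketch` and its methods `majority` and `execute` are hypothetical names chosen for illustration and are not taken from the framework.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.BinaryOperator;

/** Illustrative N-version component; every name here is hypothetical. */
public class NVersionSketch {

    /** Decider: returns the answer given by more than half of the n versions. */
    public static <T> Optional<T> majority(List<T> answers, int n) {
        Map<T, Integer> counts = new HashMap<>();
        for (T answer : answers) {
            int votes = counts.merge(answer, 1, Integer::sum);
            if (2 * votes > n) return Optional.of(answer);
        }
        return Optional.empty();
    }

    /** Runs all versions in parallel and submits their answers to the decider. */
    public static Optional<Integer> execute(List<BinaryOperator<Integer>> versions,
                                            int a, int b) {
        ExecutorService pool = Executors.newFixedThreadPool(versions.size());
        try {
            List<Future<Integer>> pending = new ArrayList<>();
            for (BinaryOperator<Integer> version : versions) {
                pending.add(pool.submit(() -> version.apply(a, b)));
            }
            List<Integer> answers = new ArrayList<>();
            for (Future<Integer> f : pending) {
                try {
                    answers.add(f.get());
                } catch (ExecutionException e) {
                    // A version that throws simply casts no vote.
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while voting", e);
                }
            }
            return majority(answers, versions.size());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<BinaryOperator<Integer>> versions = new ArrayList<>();
        versions.add((x, y) -> x * y);                 // direct multiplication
        versions.add((x, y) -> {                       // repeated addition (y >= 0)
            int sum = 0;
            for (int i = 0; i < y; i++) sum += x;
            return sum;
        });
        versions.add((x, y) -> x + y);                 // deliberately faulty version
        System.out.println(execute(versions, 3, 4));
    }
}
```

With three versions, a single faulty version is outvoted; if no answer reaches a strict majority, the sketch returns an empty `Optional` rather than guessing.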
There are a few differences between the recovery block and the N-version techniques, but they are important. In traditional recovery blocks, each alternate is executed serially until the adjudicator finds an acceptable result, whereas the N-version technique has always been designed to be executed in parallel. In a serial scheme, the time cost of trying out multiple alternatives may be too high, especially for a real-time system. Another important difference between the two methods is the distinction between the roles of the adjudicator and the decider. The recovery block technique requires a specific adjudicator to be built for each fault-recoverable block (reliable system component); in the N-version technique, a single default decider (e.g., majority voting) may be used. Assuming that the programmer can create a sufficiently simple adjudicator, the recovery block technique will yield a system which is very unlikely to enter an incorrect state. Both types of system involve engineering tradeoffs, especially in monetary cost, and the engineer should explore the design space in order to decide on the best solution for the project at hand.
2.3. State-Based Variant Execution
The ability to reconfigure dynamically (for example, to replace faulty components and/or to change the computation performed in fault situations) is a crucial factor in the development of reliable systems. Given the diversity of designs (components and their different variants), the selection of the variant to be executed should ideally depend on the state of the system and/or of the component.
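One simple way to picture state-dependent variant selection is a table that maps each state to the variant that should run in it. The Java sketch below is purely illustrative: the `State` values, the class `StateBasedSketch`, and the `register`, `setState`, and `send` methods are assumptions made for this example and do not describe the framework's API.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.function.Function;

/** Illustrative state-based variant selection; every name is hypothetical. */
public class StateBasedSketch {

    /** Hypothetical component states, invented for this example. */
    public enum State { NORMAL, DEGRADED }

    private final Map<State, Function<String, String>> variants =
            new EnumMap<>(State.class);
    private State current = State.NORMAL;

    /** Associates a variant with the state in which it should be executed. */
    public void register(State state, Function<String, String> variant) {
        variants.put(state, variant);
    }

    /** Reconfigures the component by changing its current state. */
    public void setState(State state) {
        current = state;
    }

    /** Delegates the call to whichever variant matches the current state. */
    public String send(String message) {
        Function<String, String> variant = variants.get(current);
        if (variant == null) {
            throw new IllegalStateException("no variant for state " + current);
        }
        return variant.apply(message);
    }

    public static void main(String[] args) {
        StateBasedSketch component = new StateBasedSketch();
        component.register(State.NORMAL, m -> "direct:" + m);
        component.register(State.DEGRADED, m -> "relay:" + m);
        System.out.println(component.send("hello"));
        component.setState(State.DEGRADED);
        System.out.println(component.send("hello"));
    }
}
```

The caller always invokes `send`; which variant actually runs changes only when the state changes, which keeps reconfiguration out of the application code.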
Consider the following motivating example: components
2.4. Sensor Web Enablement
As well as the issues referred to above, one of the main challenges for the application developer is how to integrate data that has been acquired from different types of sensors. Existing sensors use a large variety of sensor protocols (e.g., Sun SPOT ZigBee protocol, XBee/ZigBee, and GumStix Wi-Fi) and sensor interfaces (e.g., nesC), and most applications are still dealing with this by integrating sensor resources through their own mechanisms. However, this manual bridging of the gap between sensor resources and applications leads to an extensive adaptation effort and is considered to be a key cost factor in large-scale deployment scenarios [6].
The challenge posed by this diversity of protocols, interfaces, and sensor devices was taken up by the Open Geospatial Consortium (OGC), which in 2003 began to lay down a set of standards (http://www.ogcnetwork.net/swe) with the aim of establishing the “Sensor Web” [1]. This can be defined as an infrastructure that allows for the interoperable usage of sensor resources by ensuring that their discovery, access, tasking, and eventing and alerting are carried out in a standardized way. The Sensor Web thus conceals the underlying layers, the network communication details, and the heterogeneous sensor hardware from the applications built on top of it and allows users to share sensor resources more easily [6]. In the Sensor Web paradigm, all the sensors report their position and are available on the World Wide Web; in addition, their metadata is registered so that they can all be uniformly accessed (and some of them even controlled) via the internet [1].
The realization of the vision of sensor webs and networks is being pursued by the Sensor Web Enablement (SWE) working group of OGC through the establishment of several (XML-based) encodings for describing sensor resources and sensor observations and through several standard interface definitions of web services. The first generation of SWE includes standards for [1] (a) description of sensor data; (b) description of sensor metadata including properties and the behavior of the sensors; (c) access to observations and sensor metadata based on standardized data formats and appropriate query and filter mechanisms; and (d) setting of tasks for sensors to obtain measurement data.
FlexFT adopts OGC SWE standards to provide standardized access to sensor observations. The most important standard in this context is the Sensor Observation Service [7], which consists of a pull-based service for querying as well as inserting measured sensor data and metadata.
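To make the pull-based style of access concrete, the sketch below builds a GetObservation query URL. It assumes an SOS 2.0 endpoint that supports the key-value-pair (KVP) binding; the source does not state which SOS version or binding is used, and the endpoint address, offering name, and property name in the example are placeholders.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative builder for an SOS GetObservation request (KVP binding assumed). */
public class SosRequestSketch {

    /** Builds the query URL; the endpoint and identifiers are supplied by the caller. */
    public static String getObservationUrl(String endpoint,
                                           String offering,
                                           String observedProperty) {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("service", "SOS");
        params.put("version", "2.0.0");
        params.put("request", "GetObservation");
        params.put("offering", offering);
        params.put("observedProperty", observedProperty);

        StringBuilder url = new StringBuilder(endpoint).append('?');
        boolean first = true;
        for (Map.Entry<String, String> entry : params.entrySet()) {
            if (!first) url.append('&');
            url.append(entry.getKey()).append('=').append(encode(entry.getValue()));
            first = false;
        }
        return url.toString();
    }

    /** Percent-encodes a parameter value as UTF-8. */
    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        // Placeholder endpoint and identifiers, not taken from the paper.
        System.out.println(getObservationUrl(
                "http://example.org/sos", "river_1", "temperature"));
    }
}
```

A client would issue an HTTP GET on the resulting URL and parse the XML response; inserting observations uses a separate operation and is not covered by this sketch.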
3. Related Work
This section presents related work on technologies for building component-based systems. We first review each platform and highlight its main features and contributions. Then, we outline how our work contributes to the state of the art.
SaveCCM [8, 9] is a component model designed for developing vehicular real-time systems. Within this domain, SaveCCM addresses the safety-critical subsystems responsible for controlling vehicular dynamics, which include power-train, steering, and braking. However, SaveCCM only supports RTXC OS [10] and Microsoft Windows OSs and thus is only deployable in environments where they are supported. SaveCCM does not support reconfiguration at runtime; all configuration is carried out at compile time. This prevents the use of SaveCCM in systems that need dynamic configuration, such as a scenario in which new functionalities have to be deployed at runtime.
RUNES (Reconfigurable, Ubiquitous, Networked Embedded Systems) [11, 12] is a software platform aimed at providing the software fabric for developing networked embedded systems. It is based on a component model which encapsulates the characteristics of the devices and also allows for the dynamic reconfiguration of the network of embedded systems. The component model is carried out by implementing a runtime API and the components themselves for particular devices. To support reconfiguration, the RUNES architecture employs metamodels which are updated by the API runtime whenever a component is created or destroyed. Although RUNES is able to handle changes occurring in the network of devices, fault tolerance techniques can only be supported at device level whereas the
The Loosely coupled Component Infrastructure (LooCI) [13] is designed to support embedded Java ME (microedition) platforms such as Sun SPOT or Java ME smartphones. LooCI comprises an easy-to-use component model and a simple yet extensible networking framework. Each LooCI node is connected via a common event-bus communication substrate. Like other embedded component platforms, such as RUNES [12] or OpenCOM [14], LooCI components support runtime reconfiguration, concrete interface definitions, and introspection and support for the rewiring of bindings. LooCI was recently ported to a number of sensor devices and Android platforms and is thus capable of creating component-based platforms in a heterogeneous environment. Unlike
The component-based operating system (OS) Lorien [15] allows users to experiment freely with software at any system level (e.g., MAC, drivers, routing, and scheduling), and code can be (un)loaded dynamically during experiment runtime without resetting the nodes. The OS can also be used as a boot manager to run other Wireless Sensor Network (WSN) OSs of the user's choice. Lorien runs on T-Mote class devices and provides all the benefits of OpenCOM while running on resource-constrained devices. It is particularly targeted at providing runtime reconfiguration (flexibility) for OSs that run on WSNs. While it provides flexibility for WSN OSs, Lorien does not provide a generic framework for constructing reconfigurable fault-tolerant systems. Furthermore, Lorien does not provide an implementation that supports interaction with the Sensor Web paradigm.
The middleware developed in the context of the MORE project (Network-centric Middleware for GrOup communication and Resource Sharing across Heterogeneous Embedded Systems) [16] targets heterogeneous embedded systems in the Service-Oriented Architecture (SOA) context. MORE middleware allows XML-based information (e.g., SOA data, XML-based policies) to be transferred to embedded services nodes in an efficient manner. The idea is to reduce the consumption of resources (e.g., battery, processing time) in the devices. To achieve this goal, the μSOA approach is proposed to reduce the message size and parsing overhead. According to the authors, a μSOA message requires only 2.5% of the size of a standard SOAP message. In contrast,
In summary, we argue that
4. Framework
The Fault-Tolerant Component Frameworks. This layer is responsible for providing mechanisms for developing reliable component-based systems. These mechanisms are implemented in the form of component frameworks. A Component Framework (CF) has been defined as “collections of rules and interfaces that govern the interaction of a set of components ‘plugged into’ them” [17]. A CF embodies rules and interfaces that make sense in a specific application domain.
Component Runtime Kernel Layer. This layer provides support for the development of component-based reliable systems. That is, this layer provides the inherent component model operations of

4.1. Component Model
4.2. Framework Classes
Figure 3 shows the classes and interfaces that comprise the

The abstract class
5. Case Study One: Design Diversity Techniques
This section provides some simple examples that illustrate how the
5.1. N-Version Programming Technique
We examine the implementation of a simple reliable component based on the N-version programming technique [2] using the
Figure 4 shows the

N-version programming realization.
In addition, the
public Integer multiply(Integer a, Integer b) { Object[] params = { a, b }; return (Integer) this.execute("multiply", params); }
5.2. Recovery Block Technique
This section gives an example that illustrates how the
Figure 5 shows the

Recovery block realization.
In addition, the
public int return (int }
5.3. State-Based Variant Execution
This section gives a simple example to illustrate how the

State-based variant execution realization.
public void send(String message) { Object[] args = { message }; this.execute("send", args); }
5.4. Experimental Results
The N-version programming technique [2] (the example discussed earlier) was implemented (together with other design diversity techniques) and deployed on two different hardware platforms: a standard PC and a Sun SPOT (Sun Small Programmable Object Technology) (http://www.sunspotworld.com/docs/Red/spot-developers-guide.pdf).
The experiment was run on a desktop with a 2.67 GHz Intel i5 processor and 8 GB of RAM, running the Ubuntu 12 operating system. Sun SPOT is a Wireless Sensor Network (WSN) mote developed by Sun Microsystems. Unlike other available mote systems, Sun SPOT is built on the Squawk Java Virtual Machine [19]. For comparative purposes, the Squawk Java Virtual Machine was used on both platforms. The application code used to assess the cost of utilizing the
Table 1 shows the average performance (measured in ms) and the memory consumption (measured in bytes) of the main operations of the N-version programming technique: (a) to load and instantiate
*Sun SPOT (Sun Small Programmable Object Technology).
On the basis of these values, it can be argued that the proposed approach has an acceptable performance and resource consumption overhead across heterogeneous platforms. It should be stressed that these values are consistent with the study conducted by Hehmann et al. [20], which states that reconfiguration delays should not exceed 250 ms. Moreover, according to [21], for multimedia applications, delays of less than 150 ms are not even noticeable, and the maximum tolerable delay is 400 ms.
6. Case Study Two: Sensor Web Enablement
This section gives an example that illustrates how the
6.1. Example
The example setting employed is illustrated in Figure 7 and described as follows.
A set (3 to 5) of Sun SPOT sensors work together to monitor the temperature of the surrounding environment. Each sensor uses a different port number (in the range of 66 to 70) to broadcast, once every second, the collected temperature data. Owing to its simplicity, the adopted communication protocol is the radiogram protocol, which provides datagram-based communication (with no guarantees of delivery or ordering) between two devices. By default, the
The base station employs an N-version programming approach to collect the information that has been broadcast. In other words, the base station instantiates the
In order to validate the N-version programming technique, some erroneous temperature values were injected (by placing a heat source close to the sensors). As expected, these results were disregarded because the correct temperature values were returned by other sensors/variants.
The next step is to store the consensual temperature (and corresponding metadata useful for discovery and human assistance) in the
As discussed earlier, the
Sun SPOT sensors and variants.

6.2. Experimental Results
To perform our experiments with
Unlike the previous implementation, the N-version programming technique [2] was only implemented and deployed in the Sun SPOT base station. Three scenarios were employed to evaluate this implementation. The only difference between these three scenarios is the number (three to five) of sensors/variants employed. The first experiment employs three sensors, while the second and third experiments employ four and five sensors, respectively.
Table 3 shows the average performance of these three experiments, that is, the average of the execution for 1000 samples where each sample represents the performance (measured in ms) of the main operations of the N-version programming technique combined with the Sensor Web Enablement approach. The following operations were assessed:
To execute a redundant operation, that is, to execute the To execute the method for the respective result of the (majority) decision, and To store the consensual temperature (and corresponding metadata that is useful for discovery and human assistance) in the
Performance of the experiments.
On the basis of these values, the difference between the three experiments suggests that the execution time of the approach grows linearly with the number of sensor nodes. However, analysis of the data shows a great variation in the performance of individual samples; that is, several outliers were observed while the results were being obtained. This might be due to overhead inherent to the Java Virtual Machine (JVM), such as garbage collection.
7. Concluding Remarks
This paper discussed the use of a generic component-based framework for the construction of adaptive fault-tolerant systems that can integrate and reuse technologies and deploy them across heterogeneous devices. We have implemented a framework prototype and evaluated the potential benefits by means of two case studies and performance measurements. These show that the proposed framework can deal with a wide degree of heterogeneity with minimal resource overheads.
With regard to our generalized approach, it should be emphasized that
The generality of hardware and software is achieved by means of the so-called loader and binder extension plugins [14] that we borrowed from OpenCOM. In short, the loader plugin encapsulates the complexity of loading software in a particular deployment environment (e.g., a loader for an assembly-based software component on the Sun SPOT sensor mote or a loader for deploying an N-version system based on Java multithreading). The binder plugin provides a wide range of “binding mechanisms.” Using binders, developers are free to implement whatever binding mechanisms might be required in the underlying deployment environment. For example, a developer may implement a binder that creates connections between Java components or a binder that connects components written in assembly language. In this way, one can create fault-tolerant software for a variety of environments such as sensor nodes, mobile phones, and desktop PCs.
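The division of labor between loaders and binders can be sketched as two small plugin interfaces registered with a kernel. The Java below is a hypothetical illustration only: the interfaces `Loader` and `Binder`, the `Kernel` class, and the names "java" and "direct" are invented for this example and are not the OpenCOM or framework API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Illustrative loader/binder plugins; every name here is hypothetical. */
public class PluginSketch {

    /** Loader plugin: knows how to instantiate components in one environment. */
    public interface Loader { Object load(String componentName); }

    /** Binder plugin: knows one way of connecting a client to a service. */
    public interface Binder { void bind(Client client, Service service); }

    public interface Service { String handle(String request); }

    public static class Client {
        public Service service;
        public String call(String request) { return service.handle(request); }
    }

    /** Kernel that delegates loading and binding to whichever plugin is named. */
    public static class Kernel {
        private final Map<String, Loader> loaders = new HashMap<>();
        private final Map<String, Binder> binders = new HashMap<>();

        public void addLoader(String environment, Loader loader) {
            loaders.put(environment, loader);
        }

        public void addBinder(String kind, Binder binder) {
            binders.put(kind, binder);
        }

        public Object load(String environment, String componentName) {
            Loader loader = loaders.get(environment);
            if (loader == null) throw new IllegalArgumentException("no loader: " + environment);
            return loader.load(componentName);
        }

        public void bind(String kind, Client client, Service service) {
            Binder binder = binders.get(kind);
            if (binder == null) throw new IllegalArgumentException("no binder: " + kind);
            binder.bind(client, service);
        }
    }

    /** Builds a kernel with one in-process Java loader and one direct binder. */
    public static Kernel demoKernel() {
        Map<String, Supplier<Object>> registry = new HashMap<>();
        registry.put("client", Client::new);
        registry.put("echo", () -> (Service) request -> "echo:" + request);

        Kernel kernel = new Kernel();
        kernel.addLoader("java", name -> registry.get(name).get());
        kernel.addBinder("direct", (client, service) -> client.service = service);
        return kernel;
    }

    public static void main(String[] args) {
        Kernel kernel = demoKernel();
        Client client = (Client) kernel.load("java", "client");
        Service echo = (Service) kernel.load("java", "echo");
        kernel.bind("direct", client, echo);
        System.out.println(client.call("hello"));
    }
}
```

Supporting a new deployment environment would then mean registering another `Loader` or `Binder` under a new name, leaving the kernel and the application components unchanged.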
With regard to future studies, two different directions can be envisaged:
Fault Tolerance Techniques. Regarding the examples discussed in Section 5, we plan to incorporate other fault tolerance techniques into the
Multihop Communication. Regarding the example discussed in Section 6, we plan to utilize the
Acknowledgments
The authors would like to express their gratitude for the support granted by CNPq and FAPESP to the INCT-SEC (National Institute of Science and Technology—Critical Embedded Systems—Brazil) processes 573963/2008-9 and 08/57870-9. Dr. D. M. Beder and Dr. J. Ueyama are also grateful to CNPq for the support provided for the REACT project (process 483699/01881-5). Dr. J. Ueyama would also like to thank FAPESP (process 2008/05346-4), CNPq (process 474803/2009-0), and RNP (CIA2-RIO) for their financial support. Dr. J. P. de Albuquerque and Dr. J. Ueyama are also grateful for the support granted by FAPESP (process 2008/58161-1). Finally, Dr. J. P. de Albuquerque would also like to thank the Alexander von Humboldt Foundation for its sponsorship.
