An error-detecting approach for fault tolerance parallel recomputing with parallel digital terrain analysis

Abstract

In recent years, the research of parallel digital terrain analysis has become a hot spot. Using parallel computing technology to solve data-intensive computing problems has become a new trend in digital terrain analysis (DTA). On the other hand, with the development of hardware technology and new applications, how to ensure the reliability of the results of parallel computing is one of key problems. We can improve the system’s ability to provide the right service by properly adopting fault tolerance technology. A parallel error-detecting approach based on parallel recomputing technology is presented and implemented by the combination of redundant process and multi-threads technology. Adopting a parallel comparison between the results of the process and the ones of its copy process, on the same data block, it can improve the efficiency of fault tolerance parallel recomputing. Furthermore, considering the relationship between the error-detection of the results and the recomputing, a modified scheme is proposed to make them to be executed concurrently. According to the error-detecting analysis of slope algorithm from DTA, it proves the effectiveness of the error-detecting approach based on fault-tolerant recomputing and achieves minor overhead.

Keywords

Digital terrain analysis parallel recomputing fault tolerance recomputing error detection

Introduction

At present, the research of the parallel technology has become a new developing direction and a hot topic in the digital terrain analysis.^1,2 With the rapid development of software and hardware technology, the rising popularity of multi-core processors, the parallel computing has been used to solve more complex problems in many fields. In order to solve the problem of huge amounts of spatial geographic data analysis, serial to parallel computation becomes an inevitable trend for the computing architecture of the digital terrain analysis. At the same time, the utilization of multi-core or multi-node computing resources can improve the ability to handle the complex algorithms in the digital terrain analysis.

Recently, as the updating and developing of hardware technology, the speed of parallel computation is becoming fast. But the reliability of the parallel computation is becoming a new challenge. We must apply an appropriate fault tolerant technique to improve system’s ability, which can provide the right services. In the digital terrain analysis, the grid DEM (Digital Elevation Model) has a huge amount of data. We need the GB level of space to store the DEM data. Under the circumstances, the traditional software fault tolerance cannot meet the demand and we introduce a parallel recomputing technology to solve a lot of task-intensive and data-intensive applications.³ However, the study of error-detection approach based on the parallel recomputing is rare according to the analysis of the existing literatures. According to the characteristics of digital terrain analysis, the authors propose an approach of error detection based on parallel reomputing. By using the threads to compare the execution results and the processes to execute the algorithm of digital terrain analysis, it can provide a fast error-detecting method. Besides, there is no load balance problem using the multi-thread technology, so parallel recomputing can greatly enhance the performance of the system. By executing comparison between the results of the original process and the ones of the copy process on the thread level, a fast performance of error detection can be achieved.

On the basis of data-intensive characteristics of parallel digital terrain analysis, we put forward a kind of error-detecting method using the redundant processes and multi-thread technology to realize a fast parallel recomputing. Furthermore, a modified scheme is proposed to make the error-detection procedure of the results and the recomputing procedure to be concurrently executed on a fine data granularity which improves the effectiveness of the fault-tolerant recomputing.

Related work

One of the key steps of fault tolerance is error detection. In parallel system, there are two kinds of common faults, hardware and software failure. A hardware failure can be further divided into two categories: data errors and control flow errors.⁴ The data error is referred as a hardware failure which causes the changes of content in data storage unit. The data error usually does not immediately cause the system failure, but the error will spread other parts of program with the execution of the program and it eventually leads to system failure. The control flow error is referred as a hardware failure that makes the instruction error in the relevant control unit, which can make program flow to a wrong branch. In some cases, the control flow error can be detected by the system’s structure and reported to the system.

Neumann⁵ first put forward the ideas of the redundancy. The redundancy contains hardware redundancy, software redundancy, time redundancy and information redundancy. Nahmsuk introduced EDDI (error detection by duplicated instructions),⁶ which is a method of instruction-level redundancy. Through compile-time complicated instruction, it can obtain a redundant copy and insert the compare instruction before store or branch instruction. By comparing original program and calculation results of the redundant copy of the program, if the results are not the same, then there is a hardware failure. EDDI can achieve a high error-detecting rate. But the copy instruction will bring more overheads. Then, Nahmsuk et al. put forward a new hardware error detection technology based on redundant instruction, namely ED4I (Error detection by diverse data and duplicated instructions).⁷ This is also a method of instruction level redundancy. Using the difference transformation in the complex instruction, the number of instruction operand and a copy of the program of instruction operand will keep k-times relationship and k values have a decisive influence on error detection rate.

Fu et al. put forward a kind of error detection method by redundant processes for MPI programs, namely REDReP.⁸ The method dynamically creates a process copy for every MPI process and the intermediate results are compared at the same time. But for huge amounts of spatial data, the big data will be divided into many data blocks and we need more processes to handle these data blocks. For the inter-process communication and storage of the intermediate states, it can bring more communication overhead and memory overhead. For a new fault tolerant method, namely the parallel recomputing,³ it also copies a simple calculation to an idle process. By comparing the calculation results in part, it can detect whether or not there are errors. In view of the parallel error-detecting approach of the digital terrain analysis, Song et al.⁹ presented an approach based on the adjacent process, which realizes the error detection by comparing the block redundancy rows. The error-detecting overhead of the approach is much lower than redundant processes. It can be very good at detecting calculation error according to the checking result set.

Oriented parallel recomputing of the error-Detecting

Parallel recomputing

The recomputing method was first used for the online fault detection in an arithmetic logic unit. Patel and Fung¹⁰ proposed the shift operation of time redundant in 1982. The principle is that comparing the calculation results of before and after the shift is used to check the rightness. The fault detection ability depends on the shifts of digits. The basic idea of the parallel recomputing technology is that the program determines whether or not a task is recomputed by comparing the results of two redundant tasks. In case of an error, the program will be rolled back recently to save the state of the computing of the task. Then, the approach made use of fault-free processes to complete the calculation of fault recovery.¹¹ The recovery technology can greatly improve the fault-tolerant performance of parallel program based on the parallel recomputing process.

The main points of the parallel recomputing technology include the segmentation of program instructions and parallel scheduling problem. According to the instruction length of a parallel task, a range of the variables and the constraints are first determined in the parallel task and then the task is divided into several parts. The partition strategy is based on the user’s guide of initial correction of error in every part. Clearly, the partition method requires the experiences of the developers and could be complex for some specific applications. The scheduling bookkeeping records looping blocks that are executed by every thread. Each thread has been out of circulation blocks. When a thread fails, according to the scheduling bookkeeping of threads, we can make sure how many circulation blocks will be determined, and parallel recomputing scheduling scheme will be built.

Parallel error detection approach

The digital terrain analysis is facing the huge amounts of data in scientific computation. The data-parallel method is one of the effective methods in the parallel digital terrain analysis. Through the parallelization of data, it can improve the efficiency of system processing. The grid DEM data as the source for all kinds of terrain factors can calculate the parameters of the surface topography, terrain morphology analysis and statistical features of the terrain analysis. Due to resource limitation of competition, it will lead to failure for the processes and the computed results may not be right. How to detect the error is the focus of the paper.

The paper is based on multi-thread way of error detection. Reading, writing and comparing the data will be done in thread way. The process will handle the calculation of the data and send the results to the corresponding thread to compare the rightness of the computation. The work flow of parallel error detection is shown in Figure 1. The specific process is shown as follows:

Step 1: Data partition. The data is divided into many blocks. According to the size of DEM, the data will be divided by row or column. The number of blocks is decided by the idle threads.

Figure 1.

Parallel error-detecting process.

Step 2: Reading the data. Each data block will be read into memory. Afterwards, the new data will be sent to the process and the redundant copy of the process.

Step 3: Data calculation. The data block will be processed by the process and the copy of the process respectively.

Step 4: Error detection. The calculated results of the process and the redundant copy of the process will be sent to the master node. The paper takes the fuzzy electoral law and compares the calculated results. The error point will be controled within the limit of the fault-tolerant coefficient, namely: ɛ. In the difference of terrain analysis algorithms, we can set different values to ensure correct operation of the fault-tolerant mechanism.

Step 5: Judgement. If the accumulated value is beyond the reasonable scope, it will turn to the first step and carry on the parallel recomputing.

In the parallelization of error detection workflow, some work must be done. First of all, the original of data block must carry on the reasonable division, which can improve the memory usage and effectively control the load balance. Second, the computation results of each process will be detected so that the recomputing can effectively correct the failure.

Before the error detection is done, the priority is to access and store the data set. Second, the copy of the process and the process will deal with the data. Because of the large amount of raster data, it will need more resources overhead by reading the data from hard disk into the memory. We adopt a multi-thread model in which each time a thread reads a data block and uses the mutual mechanism to avoid the conflicts of the reading and writing data at the same time.

Select data blocks

The number of data blocks is determined by the number of threads; each thread alone reads and writes a block of data. The data size of DEM reaches megabytes (MB) or gigabytes (GB) in parallel digital terrain analysis. Assuming that the number of threads is M, the size of the DEM data is MSize, the divided blocks are equal in size except for the last data block, and the size of a data block is calculated as follows (except for the last one data block)

FTSize = \frac{MSize}{M}

(1)

The size of the last data block is given in formula (2)

LTSize = MSize - (M - 1) \times FTSize

(2)

Assuming we have data set including n data blocks B = {B₁, B₂, … , B_n}. Here, B_i is the data block which will be dealt with by the ith process P_i. The calculation results R_i = {r_kl}, where r_kl is the element of result set that the data block B_i is calculated by the process P_i. The whole results are fused to gain the required terrain factor.

Error-detection of the result set

In parallel digital terrain analysis, grid DEM is to describe the surface with regular grid unit. Generally, the raster data stores the data with two-dimensional array. The value of each unit in array represents the elevation value of center of each grid. To judge the error, we can figure out the comparing result whether or not the result is in the error range. When the data is divided into many small data blocks, for example, the data block B_i is calculated by the corresponding process P_i. At the same time, B′_i (a copy of the data block B_i) is calculated by P′_i (a copy of the process P_i). The calculation results will be returned to the master node and the master node uses multithread technology to compare the result set. The calculation results (R_i and R′_i) are compared line by line and count the error point. The detailed process is given as follows.

Two detection threads were set to be started at the same time and threads will receive the data from the master node. In the master node, the threads will compare the data by line and statistical error point. The difference of comparison between two results by line are shown in Figure 2.

C = C + 1, if r_{kl} \neq r_{kl}' k = 1, 2, \dots, K, l = 1, 2, \dots, L

(3)

where C stands for the number of different data in the result set.

Calculate error-point ratio, τ. It is gained by formula (4)

τ = \frac{C}{i \times j}

(4)

Check whether or not error-point number is within a reasonable range.

Figure 2.

Process of the comparison of difference by row and column.

If $τ < ɛ$ , it means the comparison result is in the reasonable range, otherwise the program will take a fault-tolerant process.

Error detection overhead

In this paper, the overhead of error detection approach contains two aspects: communication overhead and comparison overhead. The process between sending and receiving data will generate the communication overhead. In dual-redundancy error detection mode, the processes not only produce communication overhead, but also have comparison overhead. The master node uses the multithread mode in the paper. Reading and writing the data will be finished in threads. The computing data will be handled in the processes mode. Ignoring the overhead of threads creation, the comparison overhead is in connection with the size of the data block and the comparison approach. Assuming N denotes the number of processes except for the main process, the number of redundant processes is also N. The total number of processes is 2N + 1. The number of threads is M and D stands for the communication between two processes. The comparison of each intermediate results is d₁, d_2,…, d_M. To simplify the discussion, assuming that once of data comparison and data communication take a unit time, we can use the data to measure the amount of overhead.

Under normal circumstances, regardless of the communication, computing overlapping and the time of threads starting and stopping, the time of the parallel program includes the communication time and compared time, namely

T = T_{c} + T_{m} = 2 \cdot N \cdot D + \sum_{i = 1}^{M} d_{i}

(5)

Algorithm description

ErrorDetecting(B, N, M, V)

Input:

B: denotes the original DEM,

N: represents the number of processes,

M: represents the number of data blocks,

V: denotes the threshold.

Output:

ErrorCnt: represents the statistical error rate.

Begin

/*Initialize the system environment, including the number of processes and current process*/

Init();

/* Number 0 process read the original DEM from a disk into memory*/

if(RankID==0) then

for(int i = 1;i< = N;i++) do

/* According to division strategy, process read the partitioned block */

ReadPartitionedBlock(Block);

CopyBlock(B_i); //Generate a copy of the data B_i

//Data blocks are sent to the process and redundant process

SendBlock(B_i, B′_i, P_i, P′_i);

#pragma omp parallel //multi-threads

for(int i = 1;i< = M;i++) do

/*Receive the computation results from the process and redundant process*/

Receive R_i and R′_i

ErrorCnt = compare(R_i, R′_i); //error detection,

If(ErrorCnt>V) RecoveryBlock(B_i);

end for

else

/*Receive Data from number 0 process*/

Receive B_i and B′_i

/*B_i is computed by process P_i, and then generate R_i */

R_i = Process(P_i, B_i);

R′_i = Process(P′_i, B′_i);

/* R_i and R′_i are send to the number 0 process*/

SendBlock(R_i, R′_i, 0);

end

End

Modification of error-detecting

In conventional error detection methods, we can see that after the task with the data is finished computing, the computed results will be used to detect whether or not there is an error. Whenever the errors of computed results are detected, we usually adopt the suitable recovery technology to correct these errors. However, these methods have brought a lot of overhead and cost problems. Since the detecting process occurs after the computing task is finished, there is a larger delay before the recovery. So it is a coarse-granularity error detection method. If we start the error detection process during the task is executed, and once an error is detected the recomputing process is early started. This method can reduce the waiting time of the detection process. Since a multi-thread method and redundant processes strategy are adopted for detecting the result set, we propose a real-time detection method, namely the computation and detection performed simultaneously, so as to speed up the efficiency of error detection.

Notation

In order to analyze the problem conveniently, the following is some notations.

n: the number of logical data blocks computed in the system, n = 4 in the example of this problem.

P, P′, P₀: dual processes, P′ is a copy of P.P₀ denotes the main process.

B_i: the i^th logical data block while the block B is partitioned into n small blocks. It’s a logical partition. (n > i > 0)

B′_i: a copy of logical data block B_i

NP: a new process that the data block B_i will be recomputed. (n > i > 0)

R_i: the computed results of data block B_i. (n > i > 0).

R′_i: the computed results of data block B′_i. (n > i > 0).

C_i: the ith current computing state of the process P that is executed successfully. (n > i > 0)

D_i: the ith current detecting state of the process P₀. (n > i > 0)

A modified method

First, we assume that these logical blocks are independent with each other. Each process is not vulnerable to be both transient and permanent faults. And we should consider the best case that no error appears in the computation process of the program.

Second, the logical block B₁ is executed by process P. At the same time, the copy of block B₁, B₁′ is executed by the copy of process P, P′. When the data blocks B₁ and B₁′ are finished, their result sets R₁ and R₁′ are generated, and they are sent to a comparison process to detect whether or not an error occurs. The process P₀ will receive the two computation result sets from the process P and process P′. The comparison of result sets R₁, R₁′ and the computation of the logical data block B₂ will start execution at the same time. Once an error is detected by the comparison process P₀, a new process NP will be started to recompute the logical data block B₁.

Similarly, the logical data block B₂ and the copy of B₂′ are executed, respectively, by the process P and P′. The R₂ and R₂′ are compared by the comparison process P₀ and P₀ determines whether the recomputed process is started or not. Simultaneously, P is in C₃ state and P₀ is in D₂ state. If the block B₂ needs to be recomputed, then start a new process NP to compute the logical data block B₂, as shown in Figure 3. The data blocks B₃ and B₄ are also processed by this way. This method is a fine-granularity error detection process compared the conventional error detection methods such as recomputing method, but it reduces the waiting time of the comparison process and recomputing process.

Figure 3.

Process of the modified method.

From the above analysis, we can draw a conclusion that the computation time of the optimization method is much less than the one of the original method. We assume that CT denotes the time of original computation method and DT is the time of error detection. Because the error detection method is simply compared with no other operations, CT is greater than DT. The overhead of the original parallel recomputing method is CT + DT + CT/4, as shown in Figure 4.

Figure 4.

Overhead of the original method.

Whatever the computation or detection, each logical block has the same time, namely CT/4 or DT/4. When the logical blocks B₁, B₂, and B₃ are computed with the redundancy strategy, the result sets from two processes are compared. When the error with any block is detected, for instance B₂, this block is again recomputed by a new process. The computation tasks of other blocks behind this block, for instance B₃, will not be suspended but continue to execute, the new computation will not incur the delay, as shown in Figure 5(a). Because the parallel recomputing process of B₂ is executed with the original computing process at the same time, the whole time is also CT. But the logical block B₄ is a special case. When B₄ is computed and the result sets are detected, if an error occurs, the computing time is CT + DT/4 + CT/4 since the original computation has been over. As shown in Figure 5(b), there is additional detection time DT/4 and recomputing time CT/4 if the result of B₄ is not right.

Figure 5.

Overhead of the modified method.

Algorithm description

Input: ErrorDetecting(B,N,M,m).

B denotes the original DEM,

N represents the number of processes,

M represents the number of data blocks,

m denotes the number of logical block.

Output:null.

Begin

/*Initialize the system environment, including the number of processes and current process*/

Init();

/* Number 0 process read the original DEM from a disk into memory*/

if(RankID==0)

/* According to division strategy, process read the partitioned block */

for(int i = 1;i < = N;i++)

/*Data is send to the process and B_i will be partitioned N blocks*/

ReadPartitionedBlock(Block);

#pragma omp parallel //multi-threads

for(int j = 1;j< = M;j++)

/*Partitioned logical block And B_jk is computed by process P_k, and then generate R_k

for(int k = 1;k< = m;k++)

B′_jk = CopyBlock(B_jk);

SendBlock(B_jk,B′_jk,P_k,P′_k);

startTime = currentTime();

end for

ReceiveResult(R_k,R′_k,P_k,P′_k); // R_k and R′_k are compared by Process P₀

findError = ErrorDetection(P′_i,R_k,R′_k);

if(findError) Recomputed(B_ii);

endTime = currentTime();

Output(startTime-endTime);

end for

else

/*Receive Data from number 0 process*/

Receive B_jk and B′_jk

/*B_jk is computed by process P_k, and then generate R_k

R_k = Process(P_k, B_jk);

R′_k = Process(P′_k, B′_jk);

/*R_k and R′_k are send to the number 0 process*/

SendBlock(R_k, R′_k, 0);

end if

End

Experiments and analysis

The experiments are made for verifying the error detection performance on a small-scale cluster system. The configuration on the experiments is as follows: processor configuration, XeonE5645 2.8 GHz quad-core processors and 8G memory, using Gigabit Ethernet connectivity between nodes. In the cluster system, we use the master-slave parallel computing model. A primary node is responsible for the data distribution and error detection. The software environment of the system is as follows: GDAL 1.6.1, OpenMP 1.5.4, GCC 4.4.7, and MPICH2. The dimensions of the two DEM data sets are 31492 × 13717 and 17087 × 11412, respectively. The size of the former is 1.61 GB and another one is 376 MB with TIFF format. The data are floating-point type.

The experiment adopts the slope algorithm of DEM to validate the relationship between the data granularity and computing nodes. Gradient slope is the degree of surface unit steep slow, usually the ratio of the vertical height to the horizontal distance, of slope surface is called slope. The slope S of a point on surface unit is the function of elevation variation of terrain curved surface function, Z = f(x, y), in the direction from east to west or from south to north

S = arc \tan \sqrt{f_{x}^{2} + f_{y}^{2}}

(6)

Herein, $f_{x}^{2}$ is the rate of change in the direction from south to north and $f_{y}^{2}$ is the rate of change in the direction from east to west. In general, $f_{x}^{2}$ and $f_{y}^{2}$ are got by the numeric differential method in 3 × 3 moving window.

In software fault-tolerant approach, the effective approach of parallel program error-detecting is to use redundant processes.^12,13 If we use the process comparison, one process needs to send data to another process and another process will compare whether or not the results of computation with the data are equal to the ones of the original data. The process will need to pay more communication overheads and may lead to load imbalance. In the paper, we adopt a method to combine the redundant processes and threads. Through dividing the data block, it will speed up the data processing ability and parallel comparison by using the threads at the same time. It can also improve the efficiency of error detection. In order to verify the efficiency of error detection, the slope algorithm is used to verify the ability of error detection in the experiments.

The experimental results of the parallel slope algorithm are shown in Figure 6. The approach proposed by the paper differs from the method of redundant processes, as shown in Figure 6(a), and the data size is 374 MB. The results show that the more the division number, the smaller the amount of data blocks. The original method adopts redundant processes to compute and compare. The overhead of the new method and modified method are relatively lower than the overhead of the original method. At the same time, the communication overhead will increase while the overhead of thread is less. With the increasing number of the data blocks, the number of threads is increasing rapidly and the two kinds of approach tend to be equal. Figure 6(b) looks like Figure 6(a), the difference is that data size is 1.61 GB. It is shown that error detection approach based on parallel recomputing can achieve the better performance. We adjust the number of threads to observe the change of computing overheads. When the number of threads exceeds a certain value, the time of the computation will become more and more.

Figure 6.

Overheads of parallel slope algorithms with different data size.

Conclusions

The digital terrain analysis has the data-intensive characteristics, so adopting the data parallel is one of the effective strategies for parallel digital terrain analysis. We propose an error-detecting approach for parallel digital terrain analysis. First, the approach adopts multiple threads to read and write the data blocks in the master node. Second, the process and a copy of process will handle the partitioned data blocks respectively. Finally, the computing results will be returned. According to the returned results, the master node uses multi-thread technology based on the parallel comparison of the results. Using the multi-thread technology can effectively compare and recover the right result. In a mixed computing mode, in the case of failure the program can still run when the master node can start a new thread to replace the failure of the thread. However, the number of threads should not be too many. Too many threads starting and closing will spend more overheads.

To improve error detection efficiency, the paper also presented a modified approach to make the error detection and the recomputing to be done concurrently in a fine data granularity. However, we only consider the error detection by comparing the computing results from a process and its copy process, without comparing the rightness of the reading and writing data. The comparison of the result set is mainly done in the primary node. If the other nodes are also adopted the approach, it may improve the efficiency. At the same time, it will bring more overheads and may cause the load imbalance. In the future, we will further study these problems.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work has been substantially supported by the National Natural Science Foundation of China (grant no. 41171298).

References

Liu

. Application of cluster computing in the digital watershed model. Adv Water Sci 2006; 17: 841–846.

Liu

Jiao

. Research of parallel DEM algorithm. China J Image Graph 2002; 7: 506–512.

Panfeng

Yunfei

Hongyi

. Parallel recomputing: a new approach for fault-tolerant high performance computing. Comput Sci China 2009; 36: 21–25.

Fu HY. Research on fault tolerant parallel OpenMP program. PhD Thesis. National University of Defense Technology, 2010.

Neumann

. Probabilistic logic and the synthesis of reliable organisms from unreliable components, Princeton, NJ: Princeton University Press, 1956.

Nahmsuk

Philip

Edward

. Error detection by duplicated instructions in super-scalar processors. IEEE Transact Reliab 2002; 51: 63–75.

Nahmsuk

Subhasish

Edward

. Error detection by diverse data and duplicated instructions. IEEE Transact Comput 2002; 51: 180–199.

Song

Yang

. Implementation of MPI program error detection by redundant processes. Microelectron Comput 2009; 26: 53–56.

Song

Liu

Tang

. Research on the fault tolerant algorithm of parallel digital terrain analysis. Geogr Geographic Inform Sci China 2013; 29: 1–5.

10.

Dubrova

. Fault tolerant design: an introduction, London: Kluwer Academic Publishers, 2008.

11.

Shang

Jin

. Fault-tolerant mechanism of the distributed cluster computers. Tsinghua Sci Technol 2007; 12: 186–191.

12.

Ding

. Using a parallel recomputing OpenMP fault tolerance mechanism. J Software 2012; 23: 411–427.

13.

Reed

Mendes

. Reliability challenges in large systems. Future Generation Comput Syst 2006; 22: 293–302.