Abstract
In a distributed mobile e-health care system, e-health service providers exchange data on the fly in response to user queries without any centralized control. Local databases in e-health service providers might be intercepted during the exchange of data and read by intruders; and malicious transactions may damage data that is highly confidential. In this case any centralized control for securing data cannot be assumed to protect confidential data. Therefore, securing health information from malicious attacks has become a major concern. Although prevention techniques are available, the history of system break-ins guarantees that there is no foolproof technique that totally eliminates security loopholes in a computer system. Hence, efficient damage assessment and recovery techniques are needed. Traditional methods require scanning the entire log from the point of attack to the end which is a slow procedure. In this paper, we present an efficient damage assessment and recovery algorithm to recover the database from malicious transactions. The algorithm is based on data dependency and uses a single matrix. The results of this work prove that our algorithm performs better than the other algorithms in both the damage assessment and the recovery stages.
1. Introduction
Information warfare is one of the most effective weapons used in today's wars. It had many definitions throughout history. In 1980, information warfare was used as a military term [1]. Then, it grew up to affect our life more during the Gulf War in 1991 [1]. Libicki and Fellow [2] compared the definition of information warfare to the process of discovering the nature of an elephant by a blind man. If the blind man touched the elephant's leg, he will think that it is a tree, while if he touched the tail, he will think that it is a rope. The aim of this comparison is to say that information warfare has different meanings and dimensions. It can be defined according to the dimension that the person tackles.
In this research, information warfare is defined as the set of mechanisms used to gain access to private information using information warfare weapons such as computer viruses, Trojan horses, logic bombs, trap doors, and denial of service [3].
Securing data has become a crucial issue during the web era because business applications are based on intensive sharing of information through the internet. Defensive information warfare passes through three phases in order to secure data: prevention, detection, and recovery. The prevention phase tries to prevent an attack on the database. However, there is always a chance for a successful attack on the computer system. As a legitimate transaction reads damaged data to update valid data, damage will spread in the database. When prevention fails, the intrusion detection phase starts identifying the attack. Although there are numerous researches in the intrusion detection area [4], existing detection techniques cannot guarantee that the malicious transaction will be detected immediately. Hence, until the attack is detected, part of the database will be affected. Thus, recovery will be needed.
Recovery is one of the main phases in defensive information warfare because it ensures software tolerance. Upon transaction processing, it reads and manipulates data items which in turn might be read and updated by other transactions. This leads to interdependency between the transactions. Hence, when a malicious transaction is detected by the intrusion detection system, many transactions might be affected after reading malicious data. All the affected transactions must be rolled back to recover the database to its consistent state. There are two issues to be considered in order to have efficient recovery of the database. First, the recovery process must be carried out in the shortest possible time to minimize denial of service. Second, the accuracy of the algorithm is a crucial issue in order to roll back only affected transactions or affected data items. The benign part of the database should stay available to the running applications.
In this paper, we propose new efficient damage assessment and recovery algorithms that use a single matrix to store the dependency between data items in the log file. When an intrusion detection system detects a malicious transaction, the matrix will be used to identify the damaged data items for later recovery. The algorithms do not need to scan the entire log file in order to discover the affected part of the database. This will cut down on damage assessment time. Using only one data structure, the matrix, saves space and computational time because there is no need for graphs or logical operations as other approaches do. All of this increases the efficiency of our approach over previously proposed ones.
The remainder of the paper is organized as follows. Section 2 covers the literature review. Section 3 presents the proposed model. Section 4 discusses the experimental results, and Section 5 concludes the paper.
2. Literature Review
Damage assessment and recovery algorithms have been of great interest to researchers in the last two decades. There is always an intrusion detection system that detects the malicious transactions. Then the recovery process starts, where all the transactions from the malicious one to the end of the log file should be assessed and classified into benign or affected transaction. There are two approaches to do so: data dependency and transaction dependency. Using the data dependency approach, affected data items will be checked if they were updated by an operation of a malicious transaction, while using transaction dependency approach, a transaction will be considered as affected if it reads a value that was updated by a malicious transaction.
One of the previous approaches was the user isolation approach [5]. This approach assumes that all the users are malicious users until a certain period of time passes. Then, the actions of the user will be identified as either malicious and will be aborted or nonmalicious and will be committed.
Clustering the log file is also one of the suggested approaches. In [6], the authors proposed segmenting the log files into clusters. However, the size of the cluster was a weakness because two different clusters may contain two dependent transactions. This issue was solved in [7]. The authors proposed segmenting the log file into clusters and further subclusters. The clusters were subclustered according to two different approaches: number of data items or space occupied.
In [8], the authors suggested an agent-based approach, where each agent is responsible for a set of data items. The agents keep the version of the data item to help identify whether it is affected or not.
Other approaches suggested using matrices in recovery. In [9, 10], the authors used a matrix that saves transaction dependencies. After each committed transaction, a new row is added to the matrix. The cells of the matrix hold 0 value in case there is no dependency, 1 in case of dependency, a transaction ID in case there is a transaction that reads a data item that was last modified by another transaction, and finally a negative transaction ID if a data item was modified by a set of transactions. In the last case a complementary array was needed to save the list of transactions. The static allocation of the matrix will cause a problem if the transactions and data items grow in size. The difference between our work and the work presented in [9, 10] is the fact that we will use only one matrix without any complementary structures which will save memory and execution time. In addition, we will use data dependency instead of transaction dependency, which locates more precisely the affected data items.
The authors of [11] suggested using column dependency. The attributes of the database are referred to as columns. The transaction takes into consideration the columns that were read or modified by any operation of the transaction. This approach decreased recovery time and showed that as the number of malicious transactions increases, the number of inconsistencies after reexecution will increase too.
Damage assessment and recovery for distributed databases were suggested in [12]. The proposed approach keeps a log file at each site, where there are two processes that are responsible for damage assessment and repair: a local DAR manager and a local DAR executor. Scanning the log file is the responsibility of the local DAR executor that will identify affected subtransactions and clean them. The damage assessment and recovery algorithms work in parallel, which in turn reduce execution time.
In [13], the authors suggested a recovery approach for real time database systems based on transaction fusion. Upon recovery, the malicious and the valuable affected transactions are fused into new transactions based on their dependencies and time limit. Transaction fusion reduced the number of transactions which in turn reduced recovery time.
The Retro system was suggested in [14]. It is a self-healing system based on transactional dependency and repairs the database from intrusion at the operating system level. The model performs selective recovery by undoing only suspect transactions in order to reduce the effect of those transactions as much as possible.
In [15], the authors propose a damage assessment model that deals with new intertransaction dependencies: phantom dependency, pseudoidentity dependency, domain integrity dependency, and reference integrity dependency. They describe their robust damage assessment approach and show that in some cases the recovery is incomplete unless the new proposed dependencies are considered. In [16], the ITDM prototype model for damage assessment implements the damage assessment model described in [15] using SQL rewriting. The SQL statements that are sent to the DBMS were modified in order to obtain the read/write dependencies. They showed that exploiting the suggested intertransaction dependencies of [15] leads to more consistency in the database.
The Before Images approach was suggested in [17]. The Before Image table is a table that is similar to the stored table in the database but without constraints. These tables are inaccessible by the users. This approach keeps track of the deleted data items. When a data item is deleted or updated in the initial table, a copy of the old record will be added to the Before Image table that refers to the initial table. To control the size of the Before Image table, a time window was suggested for the deleted data items. When these items are out of date, they will be deleted from the Before Image table. The performance of the algorithm shows degradation due to the queries performed on the Before Image table.
The authors of [18] propose a new approach that is based on transaction reverse dependency log file. The transaction reverse dependency log file is used to generate the Undo Transaction Set (UTS). The authors present a recursive algorithm for generating the UTS and another algorithm that uses a stack. Then, they showed that the stack performs better than the recursive one because recursive functions cause a lot of overhead whenever the transaction number increases.
In [19], the authors propose that each application needs its own type of recovery based on its nature. The authors suggested a recovery algorithm for a banking system by exploiting the deposit/withdrawal transactions. They reduced recovery time by incorporating the semantics of the transactions. The limitation of this approach is the fact that it cannot be applied to any database system since it is not generic.
Fuzzy dependencies were suggested in [20]. The suggested approach consists of “Fuzzy Value Generator” that interacts with the database and the “fuzzy dependency storage”. The advantage of this approach is the fact that it does not require intensive search for transactional dependencies in the log file. This boosts the recovery process. However, the disadvantage of this approach is the lack of accuracy.
3. The Proposed Model
The proposed model is based on data dependency and uses a matrix to store the dependencies between data items. Data dependency is considered to be more accurate than transaction dependency. The data dependency approach identifies the affected data items by each operation of the transactions and only deals with these data items upon recovery. However, transaction dependency rolls back all the affected transactions regardless of the operations that compose the transaction. The proposed algorithm uses a single matrix for identifying the affected data items during the damage assessment process, which saves memory and reduces recovery time.
3.1. Assumptions
The proposed model is based on the following assumptions. First, the algorithm will receive the set of malicious transactions from an intrusion detection system. Second, the history generated by the database management systems is rigorously serializable as serializability theory provides correctness. Third, there is a sequential log file that saves all the read/write operations of the committed transactions. This log file is inaccessible by users and will be used upon recovery. Fourth, the transactions have sequential IDs that are incremented on the arrival of new transaction. This means that when transaction
3.2. Definitions
Definition 1.
A write operation
Definition 2.
A blind write is when a transaction
Definition 3.
A write operation
Definition 4.
A data value
Definition 5.
A transaction management mechanism guarantees rigorousness if it guarantees strictness, and no data item is written until the transaction that previously read it either commits or aborts [14].
Definition 6.
A transaction
Definition 7.
A write operation is called a valid write if the value is written by a benign transaction and is independent of any contaminated data [25].
3.3. The Damage Assessment Algorithm
The proposed damage assessment algorithm is based on data dependency. The idea of this approach is to connect the data items together to show the dependency between them. Then, the algorithm will detect the directly affected data item(s) by the malicious transaction. After that, the algorithm will be able to find any data items that are reachable by the malicious data items. Those data items are called affected data items. Data items that are not reachable by affected data will be considered clean and will be available to the user upon the recovery process.
A two-dimensional matrix will keep track of the dependencies between the data items. This matrix will directly point out the affected data items without any need to scan the whole matrix or the log file. Only one matrix will be used to represent the dependencies. No other data structures or logical operations are needed in the algorithm.
Initially, the matrix is empty. One of the transaction's operations is commot., a row will be added to the matrix to represent this transaction and the data that this transaction used to write the new values of the data items are added as columns. Although data dependencies are the key of the algorithm, saving the transactions will help accurately identify the affected data items. The columns in the matrix represent the data items used in the operation to write a new value. Each cell in the matrix holds the data item that was written by a given transaction.
Another feature that increases the efficiency of the proposed algorithm is adding the first column to the matrix, which is BW column or blind write column. It does not correspond to a specific data item; instead, it will be used if the transaction blindly wrote an item without reading any other data item. The importance of this column will be revealed if an affected data item was blindly written after being affected and then it will become clean. This helps in reducing the number of affected data items which in turn reduces recovery time. Table 1 shows the matrix that will be generated after the commitment of the transactions as follows.
The dependency matrix.
Sample Transactions Set. Sample transactions set is as follows:
In [26], the authors used multiple matrices for damage assessment in order to discover dependencies which require logical operations between the matrices. However, in our approach, the single matrix reduces the damage assessment time.
Assume that
Considering the data item B, we will move to the transactions
Receive a set of malicious transactions S While there is unprocessed transaction in S Select the minimum unprocessed transaction id Identify the data items set D written by Add D to the affected set Associate with each data item the affecting transaction index i For each unprocessed data item in the affected set Find the index k for the item For If Remove k from the affected Move to another data item in the affected set Else If Add
3.4. The Recovery Algorithm
During recovery, any executing or new operation of a transaction should be prevented from accessing malicious and affected data items. Only affected operations of the transactions will be reexecuted. The other part of the database will be available. The operations of malicious transactions will be undone. The recovery algorithm is summarized in Algorithm 2.
Receive the set of malicious and affected data items For each affected data item Retrieve the operation from the log file Update the value by the old value before the operation Update the database For each malicious data item Retrieve the operation from the log file Delete the operation
3.5. Check Points
The matrix may grow and the data inside it may become obsolete. Thus, to save memory, the above algorithm requires a checkpoint to get rid of the matrix at a specific time interval after which we suspect that the data is clean and the malicious transactions were detected by the intrusion detection system. This time interval should not be too short in order not to need to go back to previous check points and reread the log file. Additionally, it should not be too long so that the size of the matrix and the log file can be controlled. However, the intrusion detection system may detect a malicious transaction after the check point. In this case, the dependency matrix has to be reconstructed. It will be time consuming if we reread the log file and reconstruct the matrix. To solve this issue, we will keep a compressed structure of the dependency matrix that will help in reconstructing the matrix without going back to the log file.
Since the matrix is a sparse matrix, the condensed storage technique that we used is the Condensed Row Storage (CRS) [26]. The CRS format makes no assumption about the sparsity structure of the matrix and does not store any unnecessary element of the matrix. Assuming we have an One-dimensional vector One-dimensional vector One-dimensional vector
Thus, the CRS of the matrix in Table 1 will be as follows:
These vectors will only be used in case we needed to reconstruct the matrix after a check point. Since we assume that the log file is rigorous and serializable, we will only build the matrix starting from the malicious transaction. The transactions that precede the malicious one do not need to be recovered. The vectors will be refreshed at each new check point to hold the values of the new matrix.
The CRS will only be built for one check point backward. If in rare cases the intrusion detection system detected an attacker before the check point of the CRS (worst case scenario), then the log file will be used and the dependency matrix will be rebuilt.
3.6. Example
Consider a database for health care management system. We will only consider the process of keeping patients records in the health care system. It contains information about the following:
Doctor (DrID, DrName, DrSpecialization). Patient (PID, PName, PGender). Disease (DID, DName). PatientRecord (PID, DrID, DID). PatientBillItems (PBID, PID, Nitems, cost). PatientBill (BID, PID,
Consider the following insert transactions in the database:
The dependency matrix M will hold the dependencies of the above committed transactions, as shown in Table 2. The transactions
Sample matrix.
Consider the case where damage assessment algorithm receives from the intrusion detection system that
4. Experimental Results
In order to test the performance of the algorithm, we created a simulated system. This system creates random transactions. When a transaction is committed, the log file will be updated. The algorithm requires a dynamic dependency matrix that is also updated after each committed transaction. The matrix will be available during the damage assessment procedure. The log file is required to be available during the recovery process. The system generates random transactions and accordingly creates the log file. The system is connected to SQL server 2012. The Northwind database was used during the experiments [27]. The simulation system is implemented using Java programming language (JDK 1.6) on NetBeans IDE 6.8. The computer system used is as follows: the processor is Intel Core i5 CPU 2410 M at 2.3 GHz with 6 GB RAM.
4.1. The Damage Assessment Algorithm Performance Analysis
Figure 1 shows the time needed by the damage assessment algorithm to discover the affected data items. x-axis shows the transaction ID of the malicious transaction and y-axis shows the time in microseconds. As shown in the figure, as the transaction ID increases, the time required for damage assessment decreases. This is one of the advantages of the rigorous serializability of the log file. As the transaction ID increases, there is no need to go back to the log file and check the previous transactions. The log file contains 1080 records and each record refers to a transaction. When the malicious transaction ID is 500, the algorithm has to traverse only 580 transactions, whereas when the malicious transaction ID is 1000, the algorithm has to traverse 80 transactions only. This decreases damage assessment time and in turn reduces the system denial of service time.

Performance time of damage assessment algorithm.
We compared our algorithm to three other ones: traditional algorithm, traditional clustered algorithm, and hybrid cluster algorithm. All the algorithms were tested in the same simulation system that contains 200 transactions. Figure 2 shows this comparison. As shown in the figure, our algorithm performs faster than the hybrid cluster one by 2218 times. This comparison shows that our algorithm performs the best among the four comparisons. This is due to the fact that our algorithm is based on matrices which allow the algorithm m to go directly to a specific index. In addition, using only one data structure saved time because no logical operators are needed. In addition, our algorithm does not need to refer to the log file during the damage assessment stage.

Comparison of different damage assessment algorithms.
4.2. The Recovery Algorithm Performance Analysis
The recovery process starts when the damage assessment process stops. The output of the damage assessment stage is a set of affected data items to be recovered. Figure 3 shows the time needed to recover the affected data items. As revealed in the chart, as the number of data items increases, the recovery time increases. The database consisted of 500 transactions. Figure 4 shows the performance of different approaches in the recovery stage. The chart shows the efficiency of our algorithm. Our algorithm will refer to the log file records using the index of the transaction. Indexing helps a lot in saving the recovery time.

Performance of recovery algorithm.

Comparison of recovery performance in different approaches.
5. Conclusion
In this paper, we presented a new efficient approach of damage assessment and recovery in databases. Our approach is based on data dependency because it is more accurate than transaction dependency. The dependencies between data items are stored in a dynamic matrix that will grow as new transactions commit and new data items are updated. Using only one data structure in the damage assessment process saves memory and processing time. Being able to directly access the affected data items that are all present in one column reduces processing overhead. All the aforementioned properties of the approach make it efficient and reduce recovery time which reduces denial of service of the database. The algorithms were implemented and compared to other approaches.
Footnotes
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgment
This work was partially funded by the Lebanese American University, Beirut, Lebanon.
