
Abstract

Task synthesis, function fusion, and resource integration are used to achieve higher performance and reliability in integrated modular avionics systems. However, the system integration process introduces safety issues. To discover potential hazards for function-layer safety analysis, we propose an efficient biclustering algorithm, DeCluster, which mines all biclusters with maximal variant usage rate or maximal low usage rate in a real-valued function-resource dataset. First, it constructs a sample weighted graph whose edge weights contain every resource relation between two samples that satisfies the definitions; second, it mines all maximal variant usage rate or low usage rate biclusters in the graph. To improve mining efficiency, DeCluster uses several pruning strategies. Experimental results on simulated data show that DeCluster is more efficient than the RAP algorithm.

Keywords

Bicluster, real valued, function-resource matrix, algorithm

Introduction

In recent years, integrated modular avionics (IMA) has greatly enhanced aircraft performance and has become one of the major research interests in the air transport industry. However, the IMA system integration process causes several safety issues: (1) resource integration leads to failure spread, (2) function information fusion leads to failure implication and chaos, and (3) task synthesis leads to the expansion of failure damage. Therefore, it is necessary to analyze safety¹ in IMA systems.

In an IMA system, the function architecture is defined by a well-formed organization based on the scheduling policy. According to the sensor data, the function operation and the system functions can be organized dynamically. The functional framework is defined by the corresponding task unit, and such an organization is evaluated by how well it completes the corresponding task. Unlike the process introduced in ARP 4754, functions are derived not directly from conceptual studies but from task requirements. The relation between functions and resources is denoted as a function-resource matrix, in which each row denotes a resource, each column denotes a function, and each value is the degree to which one function uses one resource. For example, if a resource has a computing capability of 100 MHz and function F₁ needs 60 MHz of it to compute some results, the dependence degree of F₁ on this computing resource is 0.6. Mining the function-resource matrix reveals the usage relation between a group of functions and a group of resources.

Given two functions F₁ and F₂, suppose that the resources called by each function are F₁==>R₁R₂ and F₂==>R₂R₃R₄, and that F₁ and F₂ must cooperate to complete a task T, so both functions may be called at the same time. Resource R₂ then supports F₁ and F₂ simultaneously. There are two possible conditions: (1) R₂ has high effectiveness for F₁ but low effectiveness for F₂; (2) R₂ has high effectiveness for both F₁ and F₂. The safety degree of the first condition is higher than that of the second. The reason is that, in the first condition, R₂ can serve F₁ and F₂ simultaneously, whereas in the second condition R₂ must serve the two functions through multiple accesses. From the perspective of functional safety, if resource R₂ has defects, its influence in the first condition is lower than in the second.

Therefore, for a group of functions, function-resource matrix mining can identify both the resources that satisfy all functional demands simultaneously and the resources that satisfy them only through multiple accesses; that is, it mines biclusters with variant usage rate or low usage rate in the function-resource matrix. Such biclusters can be used to analyze the safety relation between functions and resources, so that the safety of a function can be derived from the safety of its resources.

Biclustering² is a method for clustering the condition set and the gene set simultaneously, so biclustering algorithms have mainly been used for biological analysis. Cheng and Church² used a low mean squared residue to generate biclusters. Madeira and Oliveira³ present biclustering algorithms for biological analysis. However, the mining efficiency of these algorithms is poor. MicroCluster⁴ uses a weighted directed range multigraph to generate deterministic biclusters. SAMBA⁵ is designed to mine constant-value biclusters. OPSM⁶ can find biclusters with coherent trends of up- or down-regulation. Biclusters with constant columns can be inferred by xMotifs.⁷ RAP⁸ mines constant-row biclusters using a range support measure, which identifies meaningful patterns that are coherent for a substantial fraction of the transactions or samples in the dataset. It adopts a gene growth method: during mining, RAP generates (n+1)-level biclusters from n-level biclusters by applying the range support criterion. However, it does not use efficient pruning techniques, so its mining efficiency is poor.

To use biclusters for disease detection, some biclustering methods mine differential coexpression biclusters, whose genes are coexpressed in one microarray but not in the other. Okada and Inoue⁹ and Serin and Vingron¹⁰ used a two-step approach to infer differential coexpression biclusters, but its efficiency is low. Therefore, Wang et al.¹¹ and Southworth et al.¹² constructed a difference matrix to mine discriminative biclusters. The DiBiCLUS¹³ algorithm mines differential coexpression biclusters in two discretized gene expression datasets. The subspace differential coexpression (SDC) algorithm¹⁴ is another method for mining SDC patterns: it infers patterns that are coexpressed over a large percentage of the conditions in one microarray dataset but over a much smaller percentage of the conditions in the other. However, none of the above biclustering algorithms can mine biclusters with variant usage rate and biclusters with low usage rate simultaneously.

Based on the above analysis, and to overcome the inability of existing biclustering methods to mine biclusters with variant usage rate and low usage rate at the same time in a real-valued function-resource matrix, we propose a new biclustering algorithm, DeCluster, which mines all biclusters with maximal variant usage rate or maximal low usage rate simultaneously in the real-valued function-resource matrix. First, it constructs a sample weighted graph whose edge weights contain every resource relation between two samples that satisfies the definitions; second, it mines all maximal variant usage rate or low usage rate biclusters in the graph. To improve mining efficiency, DeCluster uses several pruning strategies to mine maximal biclusters without candidate maintenance.

Problem definition

Biclustering was first proposed for microarray dataset analysis. A microarray dataset is defined as a real-valued gene expression matrix, denoted as $D = C \times G$, where the column set $C = \{c_{1}, c_{2}, \dots, c_{m}\}$ represents the different experimental conditions or samples, the row set $G = \{g_{1}, g_{2}, \dots, g_{n}\}$ represents genes, and the element $D_{ij}$ is a real number that represents the expression level of gene i under experimental condition j. Each row corresponds to the expression levels of one gene over all the experimental conditions, and each column corresponds to the expression levels of all genes under one experimental condition.

A bicluster B is defined as a submatrix of D, where $B = X \times Y$ with $X \subseteq C$, $Y \subseteq G$, and the genes in Y show similar behavior (coexpression) under the experimental conditions X. Let M be the set of all biclusters in D that satisfy the given coexpression condition. Then $N = P \times Q \in M$ is a maximal bicluster if and only if there does not exist another bicluster $E = J \times K \in M$ such that $P \subseteq J$ and $Q \subseteq K$. A bicluster can be denoted as Samples(Genes), which lists the samples of the bicluster followed by its coexpressed genes.

In this paper, we propose a biclustering algorithm for mining function-resource matrix datasets. A function-resource matrix is a two-dimensional real-valued matrix D in which the set of resources is the row set R and the set of functions is the column set F. Each value $D_{ij}$ is a real number that represents the ability validity, or usage rate, of resource i in supporting function j. The domain of $D_{ij}$ is [0,1], where 0 denotes that the resource is not required while the function executes and 1 denotes that the resource must be used while the function executes. An example is shown in Table 1.

Table 1.

A real-valued function-resource matrix.

	F₁	F₂	F₃	F₄	F₅
R₁	0.8	0.1	0.12	0.09	0.9
R₂	0.2	0.9	0.19	0.21	1
R₃	0.9	0.3	0.29	0.28	0.55
R₄	0.58	1	0.2	0.21	0.9

The purpose of the proposed biclustering is to mine a set of resources that are used under a set of functions. For example, given two functions F₁ and F₂ (F₁==>R₁R₂R₃, F₂==>R₂R₄), the two functions may be executed simultaneously. For resource R₂, there are three situations for satisfying F₁ and F₂: (1) the usage rate of R₂ is high for one function and low for the other, as shown in Table 2; (2) the usage rate of R₂ is high for both F₁ and F₂, as shown in Table 3; and (3) the usage rate of R₂ is low for both F₁ and F₂, as shown in Table 4.

Table 2.

A real-valued variant usage rate function-resource matrix.

	F₁	F₂
R₁	0	0.8
R₂	0.8	0.1
R₃	0.5	0
R₄	0	0.1

Table 3.

A real-valued nonvariant usage rate function-resource matrix.

	F₁	F₂
R₁	0.8	0
R₂	0.8	0.7
R₃	0.5	0
R₄	0	0.1

Table 4.

A real-valued low usage rate function-resource matrix.

	F₁	F₂
R₁	0.8	0
R₂	0.2	0.1
R₃	0.5	0
R₄	0	0.1

Safety analysis of the first and third conditions is more important than that of the second. The reason is that, in the first and third conditions, resource R₂ can satisfy F₁ and F₂ simultaneously, whereas in the second condition R₂ must satisfy the two functions separately. The proposed DeCluster algorithm mines the first and third conditions.

Definition 1. Let D be a real-valued function-resource usage rate matrix; α measures the association degree of functions under one resource; β restricts the low usage rate of functions under one resource; r is any resource in D; and F₁ and F₂ are any two functions in D. For relevance under F₁ and F₂, r should meet the following conditions:

$$\forall r \in R:\ \left(\max_{f \in \{F_{1}, F_{2}\}} D_{r, f} - \min_{f \in \{F_{1}, F_{2}\}} D_{r, f}\right) \leq \alpha \cdot \min_{f \in \{F_{1}, F_{2}\}} |D_{r, f}| \quad \text{and} \quad \max_{f \in \{F_{1}, F_{2}\}} D_{r, f} \leq \beta$$

A bicluster is a low usage rate bicluster if all of its resources and functions meet the above formula.
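As a minimal sketch of the pairwise test in Definition 1, the check can be written as a small Python function. The function name and the thresholds α = 1 and β = 0.5 are illustrative choices (the values are borrowed from the experimental section), and the sample values are taken from Table 1.

```python
def is_low_usage_pair(v1, v2, alpha, beta):
    """Definition 1 for one resource under two functions.

    v1, v2: usage rates of the resource under the two functions.
    Returns True when the spread is within alpha * min and the
    larger rate does not exceed beta.
    """
    hi, lo = max(v1, v2), min(v1, v2)
    return (hi - lo) <= alpha * abs(lo) and hi <= beta


# Resource R1 of Table 1 under F2 and F3 (0.1 and 0.12): both rates are low.
print(is_low_usage_pair(0.1, 0.12, alpha=1.0, beta=0.5))
# Resource R1 of Table 1 under F1 and F2 (0.8 and 0.1): 0.8 exceeds beta.
print(is_low_usage_pair(0.8, 0.1, alpha=1.0, beta=0.5))
```

The first call holds and the second fails, because under F₁ the usage rate of R₁ is above β.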

From Definition 1, α and β together restrict a bicluster to resources that have a low usage rate under each of its functions, e.g. bicluster F₂F₃F₄(R₁R₃) in Table 1.

Definition 2. Let D be a function-resource usage rate matrix; γ is a user-defined parameter measuring the variant usage rate of functions in resources; β is a parameter restricting the low usage rate of resources; r is any resource in D; and F₁ and F₂ are two functions in D. For variant usability under F₁ and F₂, r should satisfy $\max_{f \in \{F_{1}, F_{2}\}} D_{r, f} \geq \beta$ and $\frac{\max_{f \in \{F_{1}, F_{2}\}} D_{r, f}}{\min_{f \in \{F_{1}, F_{2}\}} D_{r, f}} \geq \gamma$. If at least one resource in a bicluster satisfies these conditions under two functions while satisfying the conditions in Definition 1 under the other functions, the bicluster is one with variant usage rate of resources.
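The pairwise condition in Definition 2 might be checked as in the sketch below. The function name is hypothetical, and the thresholds β = 0.5 and γ = 4 are borrowed from the experimental section. The ratio test is written as a multiplication so that a usage rate of 0 does not cause a division by zero; treating the ratio as satisfied in that case is an assumption, since the definition does not cover it.

```python
def is_variant_pair(v1, v2, beta, gamma):
    """Definition 2 for one resource under two functions.

    Assumption: max/min >= gamma is tested as max >= gamma * min,
    so a minimum of 0 counts as satisfying the ratio condition.
    """
    hi, lo = max(v1, v2), min(v1, v2)
    return hi >= beta and hi >= gamma * lo


# Resource R1 of Table 1 under F1 and F2 (0.8 and 0.1): ratio 8.
print(is_variant_pair(0.8, 0.1, beta=0.5, gamma=4.0))
# Resource R3 of Table 1 under F1 and F2 (0.9 and 0.3): ratio 3.
print(is_variant_pair(0.9, 0.3, beta=0.5, gamma=4.0))
```

With these thresholds, R₁ qualifies under F₁F₂ while R₃ does not, because its ratio of 3 is below γ.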

For convenience of description, a resource that satisfies the formula in Definition 2 under two functions and the formula in Definition 1 under the other functions is defined as a resource with variant usage rate, as follows:

Definition 3. Let D be a function-resource usage rate matrix; γ is a user-defined parameter measuring the variant usage rate of functions in resources; β is a parameter restricting the low usage rate of resources; α is a user-defined parameter measuring the degree of association of functions in resources; r is any resource in D; and F is a function set in D. For r to be a resource with variant usage rate under F, it should meet the following conditions:

$$\max_{f \in F} D_{r, f} \geq \beta \quad \text{and} \quad \frac{\max_{f \in F} D_{r, f}}{\operatorname{max2}_{f \in F} D_{r, f}} \geq \gamma$$

and

$$\forall r \in R:\ \left(\operatorname{max2}_{f \in F} D_{r, f} - \min_{f \in F} D_{r, f}\right) \leq \alpha \cdot \min_{f \in F} |D_{r, f}| \quad \text{and} \quad \operatorname{max2}_{f \in F} D_{r, f} \leq \beta$$

where max denotes the maximal value, min denotes the minimal value, and max2 denotes the second maximal value.
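A short sketch of the test in Definition 3 follows, reading max, max2, and min over the whole function set F (an assumption about the intended scope of the second formula). The function name is hypothetical, the thresholds come from the experimental section, and the ratio is again tested by multiplication to avoid dividing by zero.

```python
def is_variant_resource(values, alpha, beta, gamma):
    """Definition 3 for one resource under a function set F.

    values: usage rates of the resource under each function in F
            (at least two values).
    """
    ordered = sorted(values, reverse=True)
    top, second, low = ordered[0], ordered[1], ordered[-1]
    # One function uses the resource heavily compared with all others.
    has_single_high = top >= beta and top >= gamma * second
    # The remaining functions satisfy the low usage rate condition.
    rest_is_low = (second - low) <= alpha * abs(low) and second <= beta
    return has_single_high and rest_is_low


# Resource R1 of Table 1 under F1..F4: high under F1 only.
print(is_variant_resource([0.8, 0.1, 0.12, 0.09], 1.0, 0.5, 4.0))
# Resource R3 of Table 1 under F1..F4: 0.9 / 0.3 is below gamma.
print(is_variant_resource([0.9, 0.3, 0.29, 0.28], 1.0, 0.5, 4.0))
```

Under these thresholds R₁ is a resource with variant usage rate for F₁F₂F₃F₄, whereas R₃ is not.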

Therefore, the resources in a bicluster with variant usage rate of resources must be those meeting Definitions 1 and 3. We now define the relationship among resources. The relationship among resources in real-valued data has the same form of expression as that in discrete data, with only minor differences in definition.

Definition 4. Assume that the real-valued usage rates of resource R₁ under functions F₁ and F₂ are V₁ and V₂. R₁ then has the following four forms of expression under F₁ and F₂: (1) if V₁ and V₂ satisfy Definition 2 and V₁≥V₂, the contribution rate of R₁ to F₁ and F₂ satisfies the requirement of variance and is expressed as “R₁”; (2) if V₁ and V₂ satisfy Definition 2 and V₂≥V₁, the contribution rate of R₁ to F₁ and F₂ satisfies the requirement of variance and is expressed as “*R₁”; (3) if V₁ and V₂ satisfy Definition 1, the contribution rate of R₁ to F₁ and F₂ satisfies the requirement of low usage rate and is expressed as “−R₁”; and (4) if V₁ and V₂ satisfy neither Definition 1 nor Definition 2, the resource is not recorded.
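The four cases of Definition 4 can be sketched as a labeling function. The function name and the ASCII labels `"R"`, `"*R"`, and `"-R"` are illustrative stand-ins for the forms “R₁”, “*R₁”, and “−R₁”; checking Definition 2 before Definition 1 is an assumption that only matters when both could hold at once.

```python
def label(v1, v2, alpha, beta, gamma):
    """Return the Definition 4 form of a resource under (F1, F2).

    "R"  : variant usage rate, higher under the first function
    "*R" : variant usage rate, higher under the second function
    "-R" : low usage rate under both functions
    None : neither definition holds, so nothing is recorded
    """
    hi, lo = max(v1, v2), min(v1, v2)
    if hi >= beta and hi >= gamma * lo:              # Definition 2
        return "R" if v1 >= v2 else "*R"
    if (hi - lo) <= alpha * abs(lo) and hi <= beta:  # Definition 1
        return "-R"
    return None


# Resources of Table 1 under the function pair (F1, F2).
for name, v1, v2 in [("R1", 0.8, 0.1), ("R2", 0.2, 0.9), ("R3", 0.9, 0.3)]:
    print(name, label(v1, v2, alpha=1.0, beta=0.5, gamma=4.0))
```

With α = 1, β = 0.5, and γ = 4 (the values from the experimental section), R₁ is labeled `"R"`, R₂ is labeled `"*R"`, and R₃ is not recorded.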

Therefore, each resource in bicluster mined with DeCluster algorithm meets the first or second condition in Definition 4 under all functions. To improve the mining efficiency of the algorithm, DeCluster algorithm mines biclusters with maximal variant usage rate or maximal low usage rate in real-valued function-resource matrix by using sample growth method without candidate maintenance. The mining process of this algorithm will be introduced in the next section.

The DeCluster algorithm

Constructing sample relational weighted graph

An undirected sample relational weighted graph was used for mining biclusters by Wang et al.¹¹,¹⁵ The reason is that it can improve mining efficiency when the number of samples is far smaller than the number of items. Therefore, the DeCluster algorithm adopts an undirected sample relational weighted graph (hereinafter referred to as the sample weighted graph) to mine biclusters with maximal variant usage rate or maximal low usage rate in the real-valued function-resource matrix.

Definition 5. A sample weighted graph is denoted as the set $G = \{E, V, W\}$. Each vertex in the vertex set V represents a function. An edge between a pair of vertices means that resources with variant usage rate or low usage rate exist under the two functions represented by this pair of vertices. The set of edges is denoted E. The weight of each edge is the set of resources satisfying the definition of variant usage rate or the definition of low usage rate under the two functions connected by this edge. The set of weights is denoted W.

According to the description in Definition 4, when the resources among functions satisfy the definition of variant usage rate, the weight between two functions is not commutative. For instance, the weight under F₁F₂ is R₁*R₂R₃, while the weight under F₂F₁ is R₁*R₂R₃−R₅. Therefore, in Definition 5, the weight of each edge is the weight under $F_iF_j$, where i<j. Figure 1 shows the sample weighted graph corresponding to Table 1.
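The construction of the sample weighted graph can be sketched as follows. This is an illustrative reading of Definitions 4 and 5, not the authors' implementation: the dictionary layout, the function names, and the ASCII labels are assumptions, and the thresholds α = 1, β = 0.5, γ = 4 are taken from the experimental section, so the result need not coincide with Figure 1.

```python
from itertools import combinations

# Table 1: usage rate of each resource under F1..F5.
TABLE_1 = {
    "R1": [0.8, 0.1, 0.12, 0.09, 0.9],
    "R2": [0.2, 0.9, 0.19, 0.21, 1.0],
    "R3": [0.9, 0.3, 0.29, 0.28, 0.55],
    "R4": [0.58, 1.0, 0.2, 0.21, 0.9],
}
FUNCTIONS = ["F1", "F2", "F3", "F4", "F5"]


def label(v1, v2, alpha, beta, gamma):
    """Definition 4 form of a resource under an ordered function pair."""
    hi, lo = max(v1, v2), min(v1, v2)
    if hi >= beta and hi >= gamma * lo:              # Definition 2
        return "R" if v1 >= v2 else "*R"
    if (hi - lo) <= alpha * abs(lo) and hi <= beta:  # Definition 1
        return "-R"
    return None


def build_graph(matrix, functions, alpha, beta, gamma):
    """Map each pair (Fi, Fj), i < j, to its weight {resource: form}."""
    graph = {}
    for i, j in combinations(range(len(functions)), 2):
        weight = {}
        for resource, row in matrix.items():
            form = label(row[i], row[j], alpha, beta, gamma)
            if form is not None:
                weight[resource] = form
        if weight:  # an edge exists only when some resource qualifies
            graph[(functions[i], functions[j])] = weight
    return graph


graph = build_graph(TABLE_1, FUNCTIONS, alpha=1.0, beta=0.5, gamma=4.0)
for edge, weight in graph.items():
    print(edge, weight)
```

Only pairs with i<j are stored, which mirrors the convention that the weight of each edge is recorded under $F_iF_j$.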

Figure 1.

The sample weighted graph constructed from Table 1.

Mining maximal bicluster

According to Definitions 1 and 2, when a new function is added to a bicluster, the weights of all edges between the new function and the existing functions must be intersected with the resource set of the bicluster being extended, which ensures that the resources under the new function and under the existing functions satisfy the constraints in Definition 1 or 2. To support the pruning strategies, the intersection must take into account not only the resources themselves but also the symbols in front of them during sample growth; that is, the symbols also take part in the intersection. The operational rules for these symbols are given in the definition below.

Definition 6. According to the descriptions in Definitions 3 and 4, the intersection rules for the forms of expression of resource R₁ are as follows: (1) the intersection of “R₁” and “R₁” is “R₁”; (2) the intersection of “−R₁” and “−R₁” is “−R₁”; (3) the intersection of “*R₁” and “−R₁” is “*R₁”; (4) the intersection of “R₁” and “−R₁” is “R₁”; and (5) the intersection of “*R₁” and “*R₁” is “*R₁.”
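The symbol rules of Definition 6 lend themselves to a small lookup table, sketched below. The fifth rule is read here as pairing “*R₁” with “*R₁”, which is an assumption based on the statement in the text that an intersection of “R₁” with “*R₁” does not occur; that combination therefore returns `None`. The names and ASCII labels are illustrative.

```python
from functools import reduce

# Unordered pairs of forms mapped to the resulting form.
RULES = {
    frozenset(["R"]): "R",          # rule (1): R with R
    frozenset(["-R"]): "-R",        # rule (2): -R with -R
    frozenset(["*R", "-R"]): "*R",  # rule (3): *R with -R
    frozenset(["R", "-R"]): "R",    # rule (4): R with -R
    frozenset(["*R"]): "*R",        # rule (5), assumed: *R with *R
}


def intersect_forms(a, b):
    """Intersect two forms of the same resource; None if undefined."""
    return RULES.get(frozenset([a, b]))


def intersect_sequence(forms):
    """Apply the rules in sequence, as done while functions are added."""
    return reduce(intersect_forms, forms)


print(intersect_sequence(["-R", "*R", "-R"]))
print(intersect_sequence(["R", "-R", "*R"]))
```

The first sequence keeps the form `"*R"`, while the second reaches the undefined combination and yields `None`.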

It can be seen from Definition 3 that an intersection between “R₁” and “*R₁” does not occur, so no rule for it is provided in Definition 6. As functions are added during extension, several forms of expression of the same resource must be intersected; the intersection is computed by applying the rules in Definition 6 in sequence.

We now describe in detail how DeCluster uses pruning strategies to mine all biclusters with maximal variant usage rate or maximal low usage rate in the sample weighted graph without candidate maintenance. Maximal biclusters are identified with the prior detection method proposed by Wang et al.,¹¹ which needs no candidate maintenance. That is, if the resources under the current candidate sample and those under some prior candidate sample (an already mined sample) have an inclusion relation such that all biclusters produced by the current candidate sample can also be produced by that prior candidate sample, the current candidate sample can be pruned.

In the backward-checking pruning, if F₁ is a prior candidate function of F₂, the weight on the edge between the two functions is the resource information of F₂F₁ rather than F₁F₂. Because resources under F₁F₂ and F₂F₁ have different forms of expression, the sample weighted graph built by this algorithm is a directed graph rather than an undirected graph: for F_n and F_m, edges must be built for both F_nF_m and F_mF_n. However, the weights on F_nF_m and F_mF_n differ only by interchanging the forms “R₁” and “*R₁.” Therefore, to save storage space, only the weight on the $F_iF_{i+1}$ edge is stored, and the weight on the $F_{i+1}F_i$ edge is computed from it.
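Deriving the weight of a reversed edge from the stored one amounts to swapping the two variant forms, as in this minimal sketch. The dictionary representation of a weight and the ASCII labels are assumptions made for illustration.

```python
# Reversing the function order swaps "R" and "*R"; "-R" is symmetric.
SWAP = {"R": "*R", "*R": "R", "-R": "-R"}


def reverse_weight(weight):
    """Compute the weight of edge (Fj, Fi) from the weight of (Fi, Fj)."""
    return {resource: SWAP[form] for resource, form in weight.items()}


stored = {"R1": "R", "R2": "*R", "R3": "-R"}
print(reverse_weight(stored))
```

Because the mapping is its own inverse, applying `reverse_weight` twice returns the stored weight, so only one direction needs to be kept in memory.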

Above, resource R₁ is written as “R₁” or “*R₁” purely for the convenience of designing pruning strategies. If a resource in the current candidate function to be extended has the form “R₁,” this resource can be pruned according to the lemma below.

Lemma 1. Assuming that P is the bicluster with variant usage rate to be extended currently; M is the candidate function set of P and N is the prior candidate function set of P. If the form of expression is “R_j” for any resource R_j in candidate function M_i(M_i∈M) and there is a prior candidate function N_j(N_j∈N) under which resource R_j also exists and resource R_j must exist in PN_jM_p and PN_jM_i for other candidate samples M_p in M, resource R_j in M_i can be obtained by extension of prior candidate function N_j.

Proof of Lemma 1. Proof by contradiction is adopted. Resource expression form of current candidate function M_i is “R_j”; a prior candidate function N_j (N_j∈N) exists; resource R_j also exists under N_j, and resource R_j must exist in PN_jM_p and PN_jM_i. Thus, M_i can be pruned. In line with description (1) in Definition 4, for resource R_j, “high usage rate” is under some function in P. In accordance with the definitions of variant usage rate or low usage rate, resource R_j must be “low usage rate” under candidate function M_i and prior candidate function N_j. So, the bicluster extended currently must be a bicluster with variant usage rate. As only one “high usage rate” can exist for each resource under all functions in the bicluster with variant usage rate, and if resource R_j must exist in PN_jM_p and PN_jM_i, the bicluster with variant usage rate gained through extension of PM_i can be obtained through extension of PN_jM_i. Thus, M_i can be pruned. This contradicts the assumption, so the original proof is established.□

If a resource in the current candidate function to be extended meets the form of “*R₁,” it should be judged whether this resource can be pruned according to the weight of prior candidate function. Therefore, the following lemma can be used for pruning.

Lemma 2. Assuming that P is the bicluster with variant usage rate to be extended currently; M is the candidate function set of P and N is the prior candidate function set of P. If the form of expression is “*R_j” for any resource R_j in candidate function M_i(M_i∈M) and there is a prior candidate function N_j(N_j∈N) under which resource R_j with the form of expression “−R_j” also exists and resource R_j must exist in PN_jM_p and PN_jM_i for other candidate samples M_p in M, resource R_j in M_i can be obtained by extension of prior candidate function N_j.

Proof of Lemma 2. Proof by contradiction is adopted. The resource expression form under the current candidate function M_i is “*R_j”; a prior candidate function N_j (N_j∈N) exists under which the expression form of resource R_j is “−R_j”; and resource R_j must exist in PN_jM_p and PN_jM_i. Thus, M_i can be pruned. In line with description (2) in Definition 4, the “high usage rate” of resource R_j is under M_i. In accordance with the definitions of variant usage rate and low usage rate, resource R_j must have a “low usage rate” under P. Since the expression form of R_j under N_j is “−R_j,” the bicluster being extended must be a bicluster with variant usage rate. As only one “high usage rate” can exist for each resource under all functions in a bicluster with variant usage rate, and resource R_j must exist in PN_jM_p and PN_jM_i, the bicluster with variant usage rate obtained by extending PM_i can also be obtained by extending PN_jM_i. Thus, M_i can be pruned. This contradicts the assumption, so the lemma is established.□

Similarly, if a resource in the current candidate function to be extended meets the form of “−R₁,” it should be judged whether this resource can be pruned according to the weight of prior candidate function. Therefore, the following lemma can be used for pruning.

Lemma 3. Assuming that P is the bicluster with variant usage rate to be extended currently; M is the candidate function set of P and N is the prior candidate function set of P. If the form of expression is “−R_j” for any resource R_j in candidate function M_i(M_i∈M) and there is a prior candidate function N_j(N_j∈N) under which resource R_j with the form of expression “−R_j” also exists and resource R_j must exist in PN_jM_p and PN_jM_i for other candidate samples M_p in M, resource R_j in M_i can be obtained by extension of prior candidate function N_j.

Proof of Lemma 3. Proof by contradiction is adopted. The resource expression form under the current candidate function M_i is “−R_j”; a prior candidate function N_j (N_j∈N) exists under which the expression form of resource R_j is also “−R_j”; and resource R_j must exist in PN_jM_p and PN_jM_i. Thus, M_i can be pruned. In line with description (3) in Definition 4, resource R_j has a “low usage rate” under M_i. In accordance with the definitions of variant usage rate and low usage rate, resource R_j must have a “low usage rate” under all functions of P or a “high usage rate” under only one function of P. Since the expression form of R_j under N_j is “−R_j,” the bicluster being extended may be a bicluster with variant usage rate or a bicluster with low usage rate. At most one “high usage rate” can exist for each resource under all functions in such a bicluster, and the form of R_j under M_i is “low usage rate.” Hence, if resource R_j must exist in PN_jM_p and PN_jM_i, the bicluster obtained by extending PM_i can also be obtained by extending PN_jM_i. Thus, M_i can be pruned. This contradicts the assumption, so the lemma is established.□

Lemma 4. Assuming that P is the bicluster with variant usage rate to be extended currently; M is the candidate function set of P and N is the prior candidate function set of P. If the same prior candidate function N_j(N_j∈N) exists for any resource R_j in candidate function M_i(M_i∈M), making each resource R_j in candidate function M_i satisfy pruning conditions in Lemma 1 or 2 or 3, and PN_jM_i.Resources is the same as PM_i.Resources, candidate function M_i can be pruned.

Proof of Lemma 4. The proof follows by merging the proofs of Lemmas 1, 2, and 3, so it is omitted here.

It can be seen from Lemma 4 that a candidate function can be pruned only if all of its resources can be obtained by extension from the same prior candidate function; otherwise, this candidate function will be extended.

Output strategy. Assume that P is the bicluster with variant usage rate currently being extended, M is the candidate function set of P, and N is the prior candidate function set of P. If all candidate functions M_i (M_i∈M) of P meet the pruning conditions in Lemma 4 and there is no prior candidate function N_j (N_j∈N) that makes P.Resources a subset of PN_j.Resources, P can be output. Likewise, if every candidate function M_i that does not meet the pruning conditions in Lemma 4 makes PM_i.Resources a subset of P.Resources, and there is no prior candidate function N_j (N_j∈N) that makes P.Resources a subset of PN_j.Resources, P can be output.

The above output strategy follows from the definition of maximality: a bicluster can be output if neither a successor nor a prior extension is its superset. We explain the mining process through an example. The data are the function-resource usage matrix shown in Table 1, from which the weighted graph among functions in Figure 1 is constructed. The process by which DeCluster mines the matrix in Table 1 is shown in Figure 2. The DeCluster algorithm is described as follows:

Figure 2.

Example mining process of DeCluster algorithm.

Algorithm 1: DeCluster algorithm

Input: coherent threshold: α, low usage rate threshold: β, variant usage rate threshold: γ, function-resource matrix: D

Output: all biclusters with maximal variant usage rate or maximal low usage rate meeting the thresholds

Initial values: sample weighted graph G = Null, current bicluster to be extended Q = Null, S_i = Null, and S_j = Null.

(1) Scan dataset D and construct its sample weighted graph;

(2) For each pair of samples in D, if all samples linked to the current extended samples satisfy the pruning conditions in Lemma 4, stop extending the current samples and move on to extend the next samples;

(3) If the current extending sample linked to the current extended samples does not satisfy the pruning conditions in Lemma 4, merge the current extending sample into the current extended samples;

(4) Go to (2) until all pairs of samples in D have been extended.
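For checking the output of the algorithm on small inputs, an exhaustive reference enumeration can be helpful. The sketch below is not the DeCluster search and uses none of its pruning lemmas: it simply tries every set of at least two functions, covers only low usage rate biclusters (Definition 1), and then filters for maximality. The function names, the 0-based column indices, and the thresholds α = 1 and β = 0.5 are assumptions.

```python
from itertools import combinations

# Table 1: usage rate of each resource under F1..F5 (columns 0..4).
TABLE_1 = {
    "R1": [0.8, 0.1, 0.12, 0.09, 0.9],
    "R2": [0.2, 0.9, 0.19, 0.21, 1.0],
    "R3": [0.9, 0.3, 0.29, 0.28, 0.55],
    "R4": [0.58, 1.0, 0.2, 0.21, 0.9],
}


def low_usage_resources(matrix, cols, alpha, beta):
    """Resources meeting Definition 1 for every pair of the given columns.

    For non-negative values, testing the overall max and min is
    equivalent to testing every pair of columns.
    """
    keep = []
    for resource, row in matrix.items():
        values = [row[c] for c in cols]
        hi, lo = max(values), min(values)
        if (hi - lo) <= alpha * abs(lo) and hi <= beta:
            keep.append(resource)
    return frozenset(keep)


def maximal_low_usage_biclusters(matrix, n_functions, alpha, beta):
    """Brute-force list of maximal (functions, resources) pairs."""
    found = []
    for size in range(2, n_functions + 1):
        for cols in combinations(range(n_functions), size):
            resources = low_usage_resources(matrix, cols, alpha, beta)
            if resources:
                found.append((frozenset(cols), resources))
    # Drop every bicluster contained in a different, larger one.
    return [
        (f, r) for f, r in found
        if not any((f, r) != (g, s) and f <= g and r <= s for g, s in found)
    ]


result = maximal_low_usage_biclusters(TABLE_1, 5, alpha=1.0, beta=0.5)
for funcs, resources in result:
    print(sorted(f"F{c + 1}" for c in funcs), sorted(resources))
```

On Table 1 this enumeration yields F₃F₄(R₁R₂R₃R₄), F₂F₃F₄(R₁R₃), and F₁F₃F₄(R₂); the second of these is the example bicluster cited after Definition 1. The cost grows exponentially with the number of functions, which is the kind of search effort the pruning strategies are meant to avoid.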

Experimental result and analysis

In this section, we compare the mining efficiency of the proposed algorithm with that of an existing algorithm. The hardware environment is an Intel(R) Core(TM)2 Duo 2.53 GHz CPU with 4 GB of memory; the software environment is the Microsoft Windows 7 SP1 operating system; and the programming and operating environment is Microsoft Visual C++ 6.0 SP6. The experimental data are simulation data. To fully test the performance of the algorithm, we randomly generated three datasets, each containing 20 functions (sampling sites) and 1000 resources. Table 5 gives the proportions of the values 0, 0.1, 0.2, and 0.8 in each row of each dataset.

Table 5.

The proportion of each value in three real-valued dataset.

	0	0.1	0.2	0.8
D₁	0.2	0.2	0.2	0.4
D₂	0.2	0.3	0.3	0.2
D₃	0.4	0.2	0.2	0.2
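One way such datasets might be generated is sketched below. It assumes that the proportions in Table 5 hold exactly in every row of 20 values and that each row is an independent random permutation; the function name, the `seed` parameter, and this generation procedure are illustrative assumptions, since the text does not describe how the data were produced.

```python
import random

# Proportions of the values 0, 0.1, 0.2, and 0.8 from Table 5.
PROPORTIONS = {
    "D1": {0.0: 0.2, 0.1: 0.2, 0.2: 0.2, 0.8: 0.4},
    "D2": {0.0: 0.2, 0.1: 0.3, 0.2: 0.3, 0.8: 0.2},
    "D3": {0.0: 0.4, 0.1: 0.2, 0.2: 0.2, 0.8: 0.2},
}


def generate_dataset(proportions, n_rows=1000, n_cols=20, seed=0):
    """Build n_rows rows (resources) of n_cols usage rates (functions)."""
    rng = random.Random(seed)
    # One row with the exact number of occurrences of each value.
    template = []
    for value, share in proportions.items():
        template.extend([value] * round(share * n_cols))
    rows = []
    for _ in range(n_rows):
        row = template[:]
        rng.shuffle(row)  # random positions, fixed proportions
        rows.append(row)
    return rows


d1 = generate_dataset(PROPORTIONS["D1"])
print(len(d1), len(d1[0]), d1[0].count(0.8))
```

Smaller test cases, such as the first 10 functions and 200 resources, can then be obtained by slicing the rows and columns in order.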

We compare the mining efficiency of the DeCluster algorithm with that of the RAP algorithm. To compare the scalability of the algorithms, we generated multiple groups of datasets with different numbers of functions and resources from the three datasets in Table 5, selecting functions and resources in the order in which they appear in each dataset. The variant usage rate parameter is 4 and the low usage rate parameter is 0.5. Figure 3(a) and (b) compares the running time of the two algorithms on dataset D₁ when the relevancy parameter is 1, the number of functions is 10 and 20, respectively, and the number of resources is 200, 400, 600, 800, and 1000. The mining time of both algorithms increases with the number of resources in the dataset, and DeCluster is more efficient than RAP at every data size. In particular, when the number of resources is high, the mining efficiency of DeCluster is nearly 20 times higher than that of RAP. The reason is that RAP mines biclusters by resource extension, so as the number of resources grows it needs more iterations to mine all biclusters meeting the threshold conditions. DeCluster, by contrast, uses efficient pruning strategies during mining, which pay off especially when the number of resources is high and the data are dense, so that more maximal biclusters are produced. Therefore, DeCluster has a higher pruning efficiency. Figure 4(a) and (b) compares the running time on dataset D₁ under different numbers of functions and resources when the relevancy parameter of the two algorithms is 2. As in Figure 3(a) and (b), DeCluster is more efficient than RAP at every data size. When the number of resources is low, the pruning efficiency of DeCluster is not significantly higher than that of RAP; however, as the number of resources increases, it becomes significantly higher.

Figure 3.

Running time comparison between the two algorithms under different numbers of resources and functions in D₁ when α = 1: (a) 10 functions and (b) 20 functions.

Figure 4.

Running time comparison between the two algorithms under different numbers of resources and functions in D₁ when α = 2: (a) 10 functions and (b) 20 functions.

Figures 5(a) and (b) and 6(a) and (b) compare the running time of the two algorithms on dataset D₂ for different numbers of functions and resources when the relevancy parameter α is 1 and 2, respectively. The proportions of the values 0.1 and 0.2 are larger in D₂ than in D₁, so, by the definitions of variant usage rate and low usage rate, mining D₂ produces more biclusters than mining D₁ under the same parameters. As a result, when the number of functions is 20, RAP cannot mine datasets with more than 400 resources within the limited memory space, whereas DeCluster completes every mining run within 10 s. Figures 7(a) and (b) and 8(a) and (b) give the same comparison on dataset D₃ when the relevancy parameter α is 1 and 2, respectively.

Figure 5.

Running time of the two algorithms for different numbers of resources and functions in D₂ when α = 1: (a) 10 functions and (b) 20 functions.

Figure 6.

Running time of the two algorithms for different numbers of resources and functions in D₂ when α = 2: (a) 10 functions and (b) 20 functions.

Figure 7.

Running time of the two algorithms for different numbers of resources and functions in D₃ when α = 1: (a) 10 functions and (b) 20 functions.

Figure 8.

Running time of the two algorithms for different numbers of resources and functions in D₃ when α = 2: (a) 10 functions and (b) 20 functions.

Conclusions

To use resource capability to compute the health degree of a set of functions, this paper proposes an efficient bicluster mining algorithm, DeCluster, that mines all biclusters with maximal variant usage rate and maximal low usage rate in a real-valued function-resource matrix. The mining process has two steps. First, DeCluster scans the original function-resource matrix and builds the sample weighted graphs that satisfy the definitions of maximal variant usage rate and maximal low usage rate biclusters. Second, it uses a sample growth method to mine all maximal variant usage rate biclusters and maximal low usage rate biclusters. To improve mining efficiency, DeCluster applies multiple pruning strategies that ensure maximal biclusters are mined without candidate maintenance. However, because real test data are lacking, all experimental results in this paper were obtained on artificially generated data. Our next research direction is to mine variant usage rate and low usage rate biclusters in function-resource matrices measured in a real environment.
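The first step summarized above, building a weighted graph over samples, can be illustrated with a short sketch. It is not the DeCluster implementation. The membership rule used here (a resource joins an edge when both functions use it at or below a low-usage threshold of 0.5) is an assumption made for illustration, since the formal definitions are given earlier in the paper, and the names `build_weighted_graph` and `resources_for` are hypothetical. Functions are treated as the samples, with rows as resources and columns as functions.

```python
from itertools import combinations

import numpy as np


def build_weighted_graph(matrix, low_threshold=0.5, min_resources=1):
    """Map each function pair (i, j) to the set of resources both use at a low rate.

    Assumed rule: usage <= low_threshold counts as low usage.
    """
    low = matrix <= low_threshold  # boolean mask, resources x functions
    graph = {}
    for i, j in combinations(range(matrix.shape[1]), 2):
        shared = frozenset(np.flatnonzero(low[:, i] & low[:, j]).tolist())
        if len(shared) >= min_resources:
            graph[(i, j)] = shared
    return graph


def resources_for(graph, functions):
    """Intersect the edge weights over all pairs in a set of functions."""
    pairs = combinations(sorted(functions), 2)
    sets = [graph.get(pair, frozenset()) for pair in pairs]
    return frozenset.intersection(*sets) if sets else frozenset()


# Toy matrix: 4 resources (rows) x 3 functions (columns).
usage = np.array([
    [0.1, 0.2, 0.9],
    [0.3, 0.4, 0.2],
    [0.8, 0.1, 0.1],
    [0.5, 0.5, 0.5],
])
graph = build_weighted_graph(usage)
print(graph[(0, 1)])                     # resources 0, 1 and 3
print(resources_for(graph, {0, 1, 2}))   # resources 1 and 3
```

Growing a set of functions one at a time and intersecting edge weights, as `resources_for` does, is one simple way to picture the sample growth step. The pruning strategies that make DeCluster efficient are not modelled here.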

Footnotes

Authors’ contributions

LZ and ZX conceived and designed the experiments; LZ performed the experiments; MW and LZ analyzed the data; MW and TY wrote the paper.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This paper is supported by Aviation Foundation under Grant No. 20155553036 and No. 20155515002 and it is also supported by National Key Basic Research Program of China under Grant No. 2014CB744900.

References

1. Leveson N. Engineering a safer world: systems thinking applied to safety. Massachusetts, USA: MIT Press, 2011.

2. Cheng Y and Church GM. Biclustering of expression data. In: Proc. 8th Int’l conf. intelligent systems for molecular biology (ISMB00), California, USA, 19–23 August 2000, pp.93–103. USA: ACM Press, 2000.

3. Madeira and Oliveira. Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Trans Comput Biol Bioinform 2004; 1: 24–45.

4. Zhao and Zaki. MicroCluster: efficient deterministic biclustering of Microarray data. IEEE Intell Syst 2005; 20: 40–49.

5. Subramanian, Tamayo, Mootha, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. PNAS 2005; 102: 15545–15550.

6. Ben-Dor, Chor, Karp, et al. Discovering local structure in gene expression data: the order-preserving submatrix problem. J Comput Biol 2003; 10: 373–384.

7. Murali TM and Kasif S. Extracting conserved gene expression motifs from gene expression data. In: Proc. Pac symp biocomput, Lihue, Hawaii, 3–7 January 2003, pp.77–88. Singapore: World scientific press.

8. Pandey G, Atluri G, Steinbach M, et al. An association analysis approach to biclustering. In: Proc. ACM conf. on knowledge discovery and data mining, Paris, France, 28 June–1 July 2009, pp.677–686. USA: ACM Press.

9. Okada and Inoue. Identification of differentially expressed gene modules between two-class DNA microarray data. Bioinformation 2009; 4: 134–137.

10. Serin and Vingron. Debi: discovering differentially expressed biclusters using a frequent itemset approach. Algorithms Mol Biol 2011; 6: 18.

11. Wang M, Shang XQ, Zhang SH, et al. FDCluster mining frequent closed discriminative bicluster without candidate maintenance in multiple microarray datasets. In: Proceedings of ICDM workshops, Sydney, Australia, 14–17 December 2010, pp.779–786. USA: IEEE press.

12. Southworth, Owen and Kim. Aging mice show a decreasing correlation of gene expression within genetic modules. PLoS Genet 2009; 5: e1000776.

13. Odibat O, Reddy CK and Giroux CN. Differential biclustering for gene expression analysis. In: Proceedings of the ACM conference on bioinformatics and computational biology (BCB), New York, USA, 2–4 August 2010, pp.275–284. USA: ACM press.

14. Fang G, Kuang R, Pandey G, et al. Subspace differential coexpression analysis: problem definition and a general approach. In: Proceedings of the 15th Pacific symposium on biocomputing (PSB), Big Island of Hawaii, 4–8 January 2010, vol. 15, pp.145–156. Singapore: World scientific press.

15. Wang M, Shang X, Miao M, et al. FTCluster: efficient mining fault-tolerant biclusters in microarray dataset. In: Proceedings of ICDM 2011 workshop on biological data mining and its applications in healthcare, Vancouver, Canada, 11–14 December 2011, pp.1075–1082. USA: IEEE press.