Abstract
Protein-protein interactions (PPIs) are essential to a number of biological processes. The PPIs generated by biological experiment are both time-consuming and expensive. Therefore, many computational methods have been proposed to identify PPIs. However, most of these methods are limited as they are difficult to compute and rely on a large number of homologous proteins. Accordingly, it is urgent to develop effective computational methods to detect PPIs using only protein sequence information. The kernel parameter of relevance vector machine (RVM) is set by experience, which may not obtain the optimal solution, affecting the prediction performance of RVM. In this work, we presented a novel computational approach called GWORVM-BIG, which used Bi-gram (BIG) to represent protein sequences on a position-specific scoring matrix (PSSM) and GWORVM classifier to perform classification for predicting PPIs. More specifically, the proposed GWORVM model can obtain the optimum solution of kernel parameters using gray wolf optimizer approach, which has the advantages of less control parameters, strong global optimization ability, and ease of implementation compared with other optimization algorithms. The experimental results on yeast and human data sets demonstrated the good accuracy and efficiency of the proposed GWORVM-BIG method. The results showed that the proposed GWORVM classifier can significantly improve the prediction performance compared with the RVM model using other optimizer algorithms including grid search (GS), genetic algorithm (GA), and particle swarm optimization (PSO). In addition, the proposed method is also compared with other existing algorithms, and the experimental results further indicated that the proposed GWORVM-BIG model yields excellent prediction performance. For facilitating extensive studies for future proteomics research, the GWORVMBIG server is freely available for academic use at http://219.219.62.123:8888/GWORVMBIG.
Keywords
Introduction
Protein-protein interactions (PPIs) are playing key roles in the fields of biological processes. It is much important to have knowledge of PPIs that can provide a certain help in understanding molecular functions of biological processes and propose a new method for practical medical applications, and bring about a deep understanding of disease mechanisms. Despite many high-throughput methods like yeast 2-hybrid system,1,2 protein chips 3 and immunoprecipitation 4 are commonly used to detect PPIs. However, these experimental approaches are both expensive and time-consuming. In addition, the approaches mentioned above result in high rates of false negatives and false positives.5,6 Accordingly, a lot of computational approaches based on different types of data, such as protein domain, genomic information, and protein structure have been presented to detect PPIs. However, most of the aforementioned approaches are limited as they are difficult to calculate and depend on large amounts of homologous proteins. As a result, developing effective computational approaches based on protein sequence information to detect PPIs is much essential.
Till now, a lot of computational methods based on sequence have been proposed to identify PPIs. Yu and Guo 7 reported a computational approach based on secondary structures to identify PPIs, which found that most of the interacting regions are taken up by helix and disordered structures. Pitre et al 8 proposed a novel sequence-based computational approach called protein-protein interaction prediction engine (PIPE), which can detect PPIs for any target pair of the yeast Saccharomyces cerevisiae proteins. Xia et al 9 presented a computational approach using protein sequence information, which combined rotation forest with autocorrelation descriptor to identify PPIs. Zhao et al 10 proposed a model based on position-specific scoring matrix (PSSM) and auto covariance for predicting bioluminescent proteins and yielded a high test accuracy of 90.71%. Shi et al 11 proposed an effective method based on protein sequence, which employed a support vector machine (SVM) as classifier and used correlation coefficient (CC) transformation as a feature extraction method. Zahiri et al 12 proposed a novel evolutionary-based feature extraction algorithm for PPI prediction called PPIevo, which extracts the evolutionary feature from the PSSM of the protein sequence. In spite of this, there is still space to improve the accuracy and efficiency of the existing methods.
In this work, we proposed a computational method called GWORVM-BIG, which used Bi-gram (BIG) to represent protein sequences on a PSSM and GWORVM classifier to perform classification for predicting PPIs. More specifically, the proposed GWORVM model can obtain the optimum solution of kernel parameters using gray wolf optimizer (GWO) approach, which has the advantages of less control parameters, strong global optimization ability and ease of implementation compared with other optimization algorithms. The experimental results on yeast and human data sets demonstrated the good accuracy and efficiency of the proposed GWORVM-BIG method. The results showed that the proposed GWORVM classifier can significantly improve the prediction performance compared with the relevance vector machine (RVM) model using other optimizer algorithms including grid search (GS), genetic algorithm (GA), and particle swarm optimization (PSO). In addition, the proposed method is also compared with other existing algorithms, and the experimental results further indicated that the proposed GWORVM-BIG model yields excellent prediction performance. For facilitating extensive studies for future proteomics research, the GWORVM-BIG server is freely available for academic use at http://219.219.62.123:8888/GWORVMBIG.
Materials and Method
Data set
To verify effectiveness of the proposed GWORVM-BIG prediction model, yeast and human data sets were employed in the experiment, which can be obtained from the publicly available Database of Interacting Proteins (DIP). 13 For better implementing the proposed approach, the protein pairs with less than 50 residues are removed, because they might just be fragments. The protein pairs with too much sequence identity are generally considered to be homologous; thus, the pairs which have ⩾40% sequence identity are also deleted for eliminating the bias to these homologous sequence pairs. As a result, 5594 positive protein pairs were selected to create the positive data set and 5594 negative protein pairs were selected to build the negative data set from the yeast data set. Similarly, we selected 3899 positive protein pairs to create the positive data set and 4262 negative protein pairs to build the negative data set from the human data set. Consequently, the yeast data set contains 11 188 protein pairs and the human data set contains 8161 protein pairs.
Position-specific scoring matrix
PSSM is a useful tool that was originally applied for detecting distantly related proteins. Each protein sequence can be transformed into a PSSM 14 using the position-specific iterated BLAST (PSI-BLAST) 15
where N represents the length of a protein sequence, 20 represents a total of 20 amino acids, and
In this article, PSI-BLAST is employed to construct each protein sequence PSSM. To obtain highly and widely homologous sequences, the e-value parameter of PSI_BLAST was set 0.001, and 3 iterations were selected. Consequently, the PSSM of each protein sequence can be expressed as a 20-dimensional matrix that consists of L × 20 elements, where L represents the number of residues of a protein. The columns of the matrix represent the 20 amino acids.
BIG feature extraction method
The BIG has been applied for protein fold recognition. 16 In the work, we employed BIG feature extraction method 17 to represent a given protein sequence based on its PSSM. In detail, the BIG feature vector was computed through counting the BIG frequencies of occurrences in PSSM. A PSSM of a protein sequence contains L rows and 20 columns, where L is the length of protein sequence and 20 represents 20 amino acids. The function of BIG feature extraction method can be given as equation (2)
Equation (2) gives 400 frequencies of occurrences
These BIG features can also be expressed as
where
Finally, each protein sequence was transformed into a 400-dimensional vector using BIG method.
Relevance vector machine
Relevance vector machine was always experimented on the binary classification.
18
Given a train data with input
where
where
The label
For the sake of making the value of most components of the weight vector
Because
The integral of the product of
Because
The iterative process of
Here,
Gray wolf optimizer
In recent years, optimization algorithms based on meta-heuristic have been extensively employed to solve many optimization problems in different fields. The meta-heuristics are inspired from nature and animal’s behaviors, typically related to physical phenomena or evolutionary concepts. As a new meta-heuristic algorithm, GWO was first developed by Mirjalili et al.
19
The GWO approach simulates the social leadership and hunting behavior of gray wolves in nature. Gray wolves live in a pack which composes of 5 to 12 wolves on average. The leader of the group is named alpha, whose responsibility is to make a decision about habitat, hunting, and so on. The second in the group is called beta, which can provide a certain help to alpha in decision-making. The lowest ranking gray wolf is called
where
where
where
In the GWO approach,
RVM based on GWO
Generally, the kernel parameters of RVM are selected by experience, which may not obtain the optimum solution, affecting the prediction accuracy of RVM. To obtain the optimum solution, many researchers expect to solve the problem using GS, GA, 20 and PSO. 21 In the article, GWO approach was used to obtain the optimum solution of kernel parameters of RVM for the first time. The classification flowchart of RVM based on GWO is given in Figure 1.

The classification flowchart of GWORVM classification algorithm.
Performance evaluation
In the study, to evaluate the feasibility and effectiveness of the proposed method, we calculated the value of 5 parameters: Accuracy (Ac), Sensitivity (Sn), Specificity (Sp), Precision (Pe), and Matthews correlation coefficient (Mcc). They are expressed as follows
where TP represents true positives, FP represents false positives, TN represents true negatives, and FN represents false negatives. True positives represent the count of true interacting pairs correctly predicted. True negatives are the number of true noninteracting pairs predicted correctly. False positives defined as the count of true noninteracting pairs falsely predicted, and false negatives represent true interacting pairs falsely predicted to be noninteracting pairs. Moreover, a receiver operating characteristic (ROC) curve was created to evaluate the performance of our proposed method.
Results and Discussion
Performance of the proposed method
To evaluate the efficiency of the GWORVM-BIG, experiments were carried out using the same feature extraction method and a different classifier (GWORVM, GSRVM, GARVM, and PSORVM) on yeast and human data sets, respectively. For averting the overfitting, the data sets were divided into the training sets and independent test sets. More specifically, we randomly selected 1 out of 5 of the data sets as independent test sets and selected the remaining data sets as training sets. Furthermore, we also performed 5-fold cross-validation tests to benchmark the performance of the GWORVM-BIG. In the experiment, the optimum solution of the kernel parameters of RVM was obtained using GWO, GS, GA, and PSO approaches on yeast and human data sets, respectively. The optimum solutions are displayed in Tables 1 and 2. In addition, we selected the Gaussian function as the kernel and set up beta = 0, where beta represents classification or regression. Here, “beta = 0” represents classification. The experimental results are shown in Tables 3 to 10 on yeast and human data sets.
Kernel parameter optimal value of Gauss kernel RVM based on different optimization algorithms on yeast data set.
Abbreviation: RVM, relevance vector machine.
Kernel parameter optimal value of Gauss kernel RVM based on different optimization algorithm on human data set.
Abbreviation: RVM, relevance vector machine.
Fivefold cross-validation results shown using RVM based on GWO on yeast data set.
Abbreviations: RVM, relevance vector machine; GWO, gray wolf optimizer.
It can be seen from Table 3 that the GWORVM-BIG obtained 97.12%, 96.91%, 97.53%, and 93.81% average accuracy, sensitivity, precision, and Mcc on yeast data set. Table 7 shows that the GWORVM-BIG also achieved good results of average accuracy, sensitivity, precision, and Mcc of 94.567%, 95.55%, 93.08%, and 89.51% on human data set. It can be seen from Tables 4 and 8 that the average accuracy, sensitivity, precision, and Mcc of GSRVM-BIG are 94.79%, 90.58%, 97.11%, and 89.79%; and 92.15%, 91.78%, 91.08%, and 85.45% on yeast and human data sets, respectively. As shown in Tables 5 and 9, the GARVM-BIG obtained 94.79%, 90.58%, 97.11%, and 89.79%; and 92.15%, 91.78%, 91.08%, and 85.45% average accuracy, sensitivity, precision, and Mcc on yeast and human data sets, respectively. Tables 6 and 10 shown that the average accuracy, sensitivity, precision, and Mcc of PSORVM-BIG are 94.79%, 90.58%, 97.11%, and 89.79%; and 92.15%, 91.78%, 91.08%, and 85.45% on yeast and human data sets, respectively.
Fivefold cross-validation results shown using RVM based on GS on yeast data set.
Abbreviations: RVM, relevance vector machine; GS, grid search.
Fivefold cross-validation results shown using RVM based on GA on yeast data set.
Abbreviations: RVM, relevance vector machine; GA, genetic algorithm.
Fivefold cross-validation results shown using RVM based on PSO on yeast data set.
Abbreviations: RVM, relevance vector machine; PSO, particle swarm optimization.
Fivefold cross-validation results shown using RVM based on GWO on human data set.
Abbreviations: RVM, relevance vector machine; GWO, gray wolf optimizer.
Fivefold cross-validation results shown using RVM based on GS on human data set.
Abbreviations: RVM, relevance vector machine; GS, grid search.
Fivefold cross-validation results shown using RVM based on GA on human data set.
Abbreviations: RVM, relevance vector machine; GA, genetic algorithm.
Fivefold cross-validation results shown using RVM based on PSO on human data set.
Abbreviations: RVM, relevance vector machine; PSO, particle swarm optimization.
In the experiment, comparing the GWORVM-BIG with GSRVM-BIG, GARVM-BIG and PSORVM-BIG showed that the GWORVM-BIG method is accurate, robust, and effective for predicting PPIs. Similarly, as shown in Figures 2 and 3, the ROC curves of GWORVM-BIG are also significantly better than GSRVM-BIG, GARVM-BIG, and PSORVM-BIG. This clearly proved that the GWORVM is an accurate and robust classifier for predicting PPIs. The major reason is that GWO algorithm has the advantages of less control parameters, strong global optimization ability, and ease of implementation compared with other optimization algorithms. The GWO approach simulates social leadership and hunting behavior of gray wolves in nature, which can obtain the optimal solution position through continuous iteration optimization. It has stronger robustness to the change of its relevant parameters and can adaptively adjust the convergence factor during the process of iteration. The GWO realizes the balance of population between the global search ability and local development ability using feedback mechanism of searching individual information Therefore, the optimization efficiency of GWO algorithm is superior to other intelligent optimization algorithms.

Comparison of ROC curves of RVM based on different intelligent optimization algorithms on yeast data set. ROC indicates receiver operating characteristic; RVM, relevance vector machine.

Comparison of ROC curves of RVM based on different intelligent optimization algorithms on human data set. ROC indicates receiver operating characteristic; RVM, relevance vector machine.
Comparison with the SVM based on GWO method
Despite the proposed GWORVM-BIG achieving good predictive performance, for further evaluating the prediction performance of RVM, the comparison of prediction accuracy was implemented between GWORVM classifier and the state-of-the-art SVM based on GWO algorithm on yeast and human data sets using the same feature extraction method (BIG). In the experiment, we used the LIBSVM tool 22 to perform classification and GWO algorithm to optimize the RBF kernel parameters of SVM. Here, the optimum solution of kernel parameters was obtained, which were set up c = 0.6 and g = 0.3.
As shown in Table 11, the GWOSVM obtained 92.54% average accuracy, 95.49% average sensitivity, 92.97% average precision, and 89.56% average Mcc on yeast data set. It can be seen from these experimental results that the prediction performance of GWORVM-BIG is significantly better than the GWOSVM-BIG. At the same time, as displayed in Figure 4, the ROC curves of GWORVM classifier are also significantly better than GWOSVM classifier. This clearly proved that the GWORVM classifier is an accurate and robust classifier for predicting PPIs.
Fivefold cross-validation results shown using GWOSVM on yeast data set.

Comparison of ROC curves between GWORVM and GWOSVM on yeast data set. ROC indicates receiver operating characteristic.
In this study, comparing GWORVM with GWOSVM showed that the classification performance of RVM is significantly better than SVM using the same feature extraction method. The major reason is that RVM classifier reduces the amount of calculation of the kernel and overcomes the disadvantage that kernels of SVM required meeting the condition of Mercer. As a result, the experimental results of this study indicate that the GWORVM-BIG prediction model can obtain high accuracy for predicting PPIs.
Comparison with other methods
For demonstrating the effectiveness of the proposed GWORVM-BIG prediction model, we also compared the performance of GWORVM-BIG with existing PPI predictor on yeast and human data sets. It can be found from Tables 12 and 13 that the prediction performance of the GWORVM-BIG model is obviously higher than other prediction models on yeast and human data sets. It is proved from these comparison results that the proposed model called GWORVM-BIG can improve the prediction accuracy relative to current exiting approaches.
The prediction results of different prediction models on yeast data set.
The prediction results of different prediction models on human data set.
The proposed GWORVM-BIG prediction model obtains the good prediction results relative to current exiting approaches. The major reason is that the proposed method adopted effective feature extraction method and classifier. Specifically, there are 3 reasons: (1) the PSSM matrix is a much useful tool for representing protein sequence, which can not only describe the order information but also retain sufficient prior information for the protein sequence. (2) The BIG probabilities represented each protein sequence by its PSSM and calculated the BIG feature using the probability information contained in PSSM. The BIGs features from PSSMs can significantly reduce the sparsity level, which helps in improving the recognition performance. (3) GWO algorithm was employed to obtain the optimal solution of kernel parameters of RVM, which improved prediction performance of RVM. As a result, the experimental results demonstrated that the GWORVM-BIG prediction model is very suitable for predicting PPIs.
Conclusions
In this study, we proposed a computational method called GWORVM-BIG, which used BIG to represent protein sequences on a PSSM and GWORVM classifier to perform classification for predicting PPIs. The experimental results on yeast and human data sets demonstrated the good accuracy and efficiency of the proposed GWORVM-BIG method. The comparisons showed that the proposed GWORVM classifier can significantly improve the prediction performance. The major improvements of the proposed method may attribute to following reasons: (1) the PSSM matrix is a much useful tool for representing protein sequence, which can not only describe the order information but also retain sufficient prior information for the protein sequence. (2) The BIG feature extraction method represented each protein sequence by its PSSM and calculated the BIG feature using the probability information contained in PSSM. The BIGs features from PSSMs can significantly reduce the sparsity level, which helps in improving the recognition performance. (3) GWO algorithm has the advantages of less control parameters, strong global optimization ability, and ease of implementation compared with other optimization algorithms, which were employed to obtain the optimal solution of kernel parameters of RVM and improved prediction performance of RVM.
Footnotes
Acknowledgements
The authors thank all the editors and anonymous reviewers for their constructive advices.
Funding:
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the independent innovation fund of double top of School of Computer Science and Technology, under grant number 2018ZZCX14.
Declaration of conflicting interests:
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Author Contributions
J-YA, Z-HY and YZ conceived the algorithm, carried out analyses, prepared the data sets, carried out experiments, and wrote the manuscript; D-FW designed, performed, and analyzed experiments and wrote the manuscript. All authors read and approved the final manuscript.
