Abstract
When a new genome is sequenced, its function and structure are important concerns, and the methods used to infer them are based on protein sequence similarity. However, sequence groups differ in parameters such as the number of group members and the intra- and inter-class variability, so a method that performs well on one group may not perform well on another. Learning similarity in a supervised manner could therefore provide a general framework for fitting a similarity function to a specific sequence class. Here we describe a novel method that learns a similarity function between proteins by using a binary classifier, with pairs of equivalent sequences (belonging to the same class) as positive training samples and pairs of non-equivalent sequences (belonging to different classes) as negative training samples. For sequence pair representation, we propose to use advanced techniques from fuzzy theory, including a sigmoid-type function for normalization and the class of Dombi operators, which provide a more robust method. Under some additional constraints, the learned function turns out to be a valid kernel or metric function, and we present a new way of learning it, along with a new parameter-weighting technique. Using a dataset of archaeal, bacterial, and eukaryotic 3-phosphoglycerate kinase (3PGK) sequences and clusters from COG, we evaluate this equivalence learning method from a protein classification point of view. A receiver operating characteristic (ROC) analysis shows that we obtain a much more robust and accurate methodology for protein classification when these techniques are applied together. (See online Supplementary Material at www.liebertonline.com).
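As an illustration of the pair representation outlined in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes each sequence has already been turned into a numeric feature vector (the abstract does not say how), squashes each feature with a sigmoid-type function, and combines the two sequences feature by feature with the Dombi conjunctive operator. The function names, the parameters `center`, `slope`, and `lam`, and the all-pairs labelling scheme are hypothetical choices made for this example.

```python
import math


def sigmoid(x, center=0.0, slope=1.0):
    """Sigmoid-type normalization of a raw feature into (0, 1).

    `center` and `slope` are illustrative parameters, not values from the paper.
    """
    return 1.0 / (1.0 + math.exp(-slope * (x - center)))


def dombi_and(a, b, lam=2.0):
    """Dombi conjunctive operator (t-norm) on membership values in [0, 1].

    T(a, b) = 1 / (1 + (((1-a)/a)^lam + ((1-b)/b)^lam)^(1/lam)), with lam > 0.
    """
    if a <= 0.0 or b <= 0.0:
        return 0.0  # limit value when either argument is 0
    s = ((1.0 - a) / a) ** lam + ((1.0 - b) / b) ** lam
    return 1.0 / (1.0 + s ** (1.0 / lam))


def pair_features(u, v, center=0.0, slope=1.0, lam=2.0):
    """Represent a pair of sequences as one vector (assumed construction)."""
    return [
        dombi_and(sigmoid(x, center, slope), sigmoid(y, center, slope), lam)
        for x, y in zip(u, v)
    ]


def make_pairs(vectors, labels):
    """Build training samples: label 1 for same-class pairs, 0 otherwise."""
    samples = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            target = 1 if labels[i] == labels[j] else 0
            samples.append((pair_features(vectors[i], vectors[j]), target))
    return samples
```

The list returned by `make_pairs` could then be passed to any binary classifier; the classifier's score for a pair would play the role of the learned similarity. Which classifier the authors use, and the constraints that make the result a valid kernel or metric, are not specified in the abstract and are not modelled here.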
