Abstract
This work addresses the issues of data representation and incorporation of domain knowledge into the design of learning systems for reasoning about protein families. Given the limited expressive capacity of a particular method, a mixture of protein annotation and fold recognition experts, each implementing a different underlying representation, should provide a robust method for assigning sequences to families. These ideas are illustrated using two data-driven learning methods that make use of different prior information and employ independent, yet complementary, projections of a family: hidden Markov models (HMMs) based on a multiple sequence alignment and neural networks (NNs) based on global sequence descriptors of proteins. Examination of seven protein families indicates that combining a generative (HMM) and a discriminative (NN) method is better than either method on its own. Biologically, human 4-hydroxyphenylpyruvic acid dioxygenase, involved in tyrosinemia type 3, is predicted to be structurally and functionally related to the glyoxalase I family.
Get full access to this article
View all access options for this article.
