Abstract
A key component in genome sequence analysis is the identification of regions of the genome that contain regulatory information. In higher eukaryotes, this information is organized into modular units called cis-regulatory modules. Each module contains multiple binding sites for a specific combination of several transcription factors. In this article, we propose a hidden Markov model (HMM) to identify transcription factor binding sites (TFBSs) and cis-regulatory modules (CRMs). For a given genomic sequence, we first select potential TFBSs from a large database (e.g., TRANSFAC), then construct an HMM where the TFBSs are only counted when they occur within a specialized CRM state. The novel features of the proposed method include that it does not assume a small set of TFBSs for a given gene; on the other hand, the method utilizes information from a large collection of well-characterized TFBSs and therefore is computationally more efficient and robust than the de novo methods. Our approach is applied to three data sets with experimentally evaluated TFBSs. The method shows better specificity and sensitivity than other similar computational tools in identifying CRMs and TFBSs. The executable codes of our programs and module predictions across the fly Drosophila genome are available at www.stat.purdue.edu/∼jingwu/module/.
Get full access to this article
View all access options for this article.
