
Abstract

Identifying transcription factor binding sites (TFBSs) is a problem for which computational methods offer great promise. The expectation maximization (EM) technique has been used successfully to find TFBSs in DNA sequences, but inappropriate initialization of EM has led to poor performance or poor running-time scalability on a given data set. In this study, we integrated the fuzzy C-means and EM approaches to DNA motif discovery through a sequential integration approach, which defines the final solution as the set of solutions obtained by solving the objectives in cascade. The new method is explained in detail and tested on chromatin immunoprecipitation sequencing (ChIP-seq) data sets for different transcription factors (TFs) with various motif patterns. The proposed algorithm also provides an efficient process for finding a target motif and for analyzing its similarity to known motifs. A comparison with the well-known motif-finding tool MEME-ChIP shows the advantages of the proposed framework. Experimental results show that we were able to find the true motifs for all TFs tested, and that the motifs found by our algorithm were more similar to the known JASPAR motifs for the STAT1, GATA1, and JUN TFs than those found by MEME-ChIP.
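The cascade described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes k-mers are one-hot encoded, uses plain fuzzy C-means to obtain cluster centers, converts each center into a position weight matrix (PWM), and then refines that PWM with a simplified two-component (motif versus uniform background) EM. All function names, parameters, and the toy k-mers are hypothetical and chosen only for illustration.

```python
import random

ALPHABET = "ACGT"

def one_hot(kmer):
    """Encode a k-mer as a flat vector of length 4*k (assumed encoding)."""
    return [1.0 if base == letter else 0.0 for base in kmer for letter in ALPHABET]

def fuzzy_c_means(points, n_clusters, m=2.0, n_iter=50, seed=0):
    """Plain fuzzy C-means with fuzzifier m; returns (centers, memberships)."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    u = []
    for _ in range(n):  # random memberships, each row normalized to sum to 1
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        total = sum(row)
        u.append([v / total for v in row])
    centers = []
    for _ in range(n_iter):
        # Center update: mean of the points weighted by membership**m.
        centers = []
        for c in range(n_clusters):
            w = [u[i][c] ** m for i in range(n)]
            w_sum = sum(w)
            centers.append([sum(w[i] * points[i][d] for i in range(n)) / w_sum
                            for d in range(dim)])
        # Membership update: proportional to distance**(-2/(m-1)).
        for i, p in enumerate(points):
            inv = []
            for ctr in centers:
                sq = sum((p[d] - ctr[d]) ** 2 for d in range(dim))
                inv.append(max(sq, 1e-12) ** (-1.0 / (m - 1.0)))
            total = sum(inv)
            u[i] = [v / total for v in inv]
    return centers, u

def center_to_pwm(center, pseudocount=0.01):
    """Turn a cluster center into a PWM whose rows each sum to 1."""
    pwm = []
    for j in range(len(center) // 4):
        col = [center[4 * j + b] + pseudocount for b in range(4)]
        total = sum(col)
        pwm.append([v / total for v in col])
    return pwm

def em_refine(kmers, pwm, prior=0.5, n_iter=20, pseudocount=0.01):
    """Simplified EM: each k-mer is drawn from the motif PWM or a uniform background."""
    k = len(pwm)
    background = 0.25 ** k
    idx = [[ALPHABET.index(b) for b in kmer] for kmer in kmers]
    for _ in range(n_iter):
        # E-step: posterior probability that each k-mer is a motif instance.
        z = []
        for row in idx:
            like = 1.0
            for j, b in enumerate(row):
                like *= pwm[j][b]
            z.append(prior * like / (prior * like + (1.0 - prior) * background))
        # M-step: re-estimate the PWM from posterior-weighted base counts.
        counts = [[pseudocount] * 4 for _ in range(k)]
        for row, zi in zip(idx, z):
            for j, b in enumerate(row):
                counts[j][b] += zi
        pwm = [[c / sum(col) for c in col] for col in counts]
        prior = min(max(sum(z) / len(z), 1e-6), 1.0 - 1e-6)
    return pwm, prior

def seed_and_refine(kmers, n_clusters=2):
    """Cascade: fuzzy C-means supplies the starting PWMs, EM refines each one."""
    centers, _ = fuzzy_c_means([one_hot(s) for s in kmers], n_clusters)
    return [em_refine(kmers, center_to_pwm(c)) for c in centers]

# Made-up toy k-mers (not data from the study).
toy_kmers = ["TTCCGGGAA", "TTCCAGGAA", "TTCCTGGAA", "TTCCCGGAA",
             "ACGTACGTA", "GGGTTTAAA", "CATGCATGC", "AGAGAGAGA"]
results = seed_and_refine(toy_kmers, n_clusters=2)
```

The point of the sketch is the ordering: the clustering step supplies the starting point that EM would otherwise have to guess, which is the initialization problem the abstract identifies. How the study selects among the refined candidates and scores them against known motifs is not specified here.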




Sequential Integration of Fuzzy Clustering and Expectation Maximization for Transcription Factor Binding Site Identification
