Abstract

The identification of regions of DNA sequences that code for proteins is one of the most fundamental applications in bioinformatics. These protein-coding regions contrast with other DNA regions that encode functional RNA molecules, provide structural stability to chromosomes, serve as genetic raw material, represent molecular fossils, or have no known purpose (sometimes called “junk DNA”). A number of approaches have been suggested for differentiating between the protein-coding and non-protein-coding regions of DNA. Some of these approaches are based on digital signal processing (DSP) techniques. These DSP techniques rely on the phenomenon that protein-coding regions have a prominent power spectrum peak at frequency f = ⅓, arising from the length of codons (three nucleotides). This article partitions the identification of protein-coding regions into four discrete steps. With this partitioning, DSP techniques can be easily described and compared according to their unique implementations of the processing steps. We compare the approaches and discuss the strengths and weaknesses of each in the context of different applications. Our work provides an accessible introduction to, and comparative review of, DSP methods for the identification of protein-coding regions. Additionally, by breaking down the approaches into four steps, we suggest new combinations that may be worthy of future study.
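As a rough illustration of the f = ⅓ peak described above, the following Python sketch maps a DNA string to four binary indicator sequences (one per nucleotide), evaluates a single discrete Fourier coefficient at f = ⅓ for each, and sums the squared magnitudes. It is a minimal sketch under stated assumptions, not the method of any particular paper in the review: the function names `period3_power` and `sliding_period3`, the binary-indicator mapping, and the default window of 351 bases with a step of 3 are illustrative choices made here.

```python
import cmath


def period3_power(seq):
    """Sum over A, C, G, T of |X(f = 1/3)|^2 for each binary indicator sequence.

    Assumption: each nucleotide is mapped to a 0/1 indicator sequence;
    other numerical mappings are possible and are not shown here.
    """
    seq = seq.upper()
    total = 0.0
    for base in "ACGT":
        # One DFT coefficient at f = 1/3: sum_n u[n] * exp(-j * 2*pi * n / 3),
        # where u[n] = 1 if seq[n] == base, else 0.
        coeff = sum(
            cmath.exp(-2j * cmath.pi * n / 3)
            for n, symbol in enumerate(seq)
            if symbol == base
        )
        total += abs(coeff) ** 2
    return total


def sliding_period3(seq, window=351, step=3):
    """Return (start, power / window) for each window slid along the sequence.

    The window length and step are hypothetical defaults chosen for this sketch.
    """
    return [
        (start, period3_power(seq[start:start + window]) / window)
        for start in range(0, len(seq) - window + 1, step)
    ]
```

A sequence with perfect three-base periodicity, such as `"ATG" * 40`, gives a large value from `period3_power`, while a homopolymer such as `"A" * 120` gives a value near zero. In practice the windowed values from `sliding_period3` would be plotted against position and high regions inspected as candidate coding regions; how the resulting curve is thresholded or classified is left open here.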



Gene Prediction Based on DNA Spectral Analysis: A Literature Review
