Sage Journals

Abstract

Identifying a word (pattern) in a long sequence of letters is not an easy task. Several models have been proposed for this purpose under the assumption that the sequence of letters is described by a Markov chain. The Markovian hypothesis, however, restricts the distribution of the sojourn time in a state, which is necessarily geometric in a discrete-time process. This is the main drawback of applying Markov chains to real problems. Semi-Markov processes are more general: the sojourn time in a state can be governed by any distribution function. The goal of this article is to compute the first hitting time (position) of a word (pattern) in a semi-Markov sequence. To this end, we use the auxiliary prefix and backward chain. As an example application, the proposed model is tested on a bacteriophage DNA sequence that is lacking the enzyme SmaI. We compute the probability that a word occurs for the first time after n nucleotides in a DNA sequence. The corresponding probability distribution, the mean waiting position, the variance, and the rate of occurrence of the word are obtained.
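To make the notion of a first hitting position concrete, the following is a minimal Monte Carlo sketch in Python. It is not the method of the article, which works with the auxiliary prefix and backward chain; it only simulates a discrete-time semi-Markov letter sequence and records where a word first ends. The alphabet, the uniform jump probabilities, the sojourn law (uniform on 1 to 3, chosen because it is not geometric), the example word, and all function names are illustrative assumptions.

```python
import random

ALPHABET = "ACGT"  # assumed nucleotide alphabet, for illustration only


def simulate_semi_markov(length, jump, sojourn_sampler, rng):
    """Generate `length` letters from a discrete-time semi-Markov chain.

    jump[a] is a dict of next-letter probabilities with no self-transition;
    sojourn_sampler(letter, rng) returns an integer >= 1, the number of
    consecutive positions spent in `letter` before the next jump.
    """
    letters = []
    state = rng.choice(ALPHABET)  # assumed uniform initial letter
    while len(letters) < length:
        letters.extend(state * sojourn_sampler(state, rng))
        nxt = jump[state]
        state = rng.choices(list(nxt), weights=list(nxt.values()))[0]
    return "".join(letters[:length])


def first_hit(seq, word):
    """1-based position of the last letter of the first occurrence of
    `word` in `seq`, or None if the word does not occur."""
    i = seq.find(word)
    return None if i < 0 else i + len(word)


def estimate_first_hit(word, length, runs, seed=0):
    """Simulate `runs` sequences and collect first hitting positions.

    Returns (positions, mean). Sequences in which the word never appears
    within `length` letters are dropped, so the mean is a truncated
    estimate, not the exact mean waiting position.
    """
    rng = random.Random(seed)
    # Hypothetical embedded chain: jump uniformly to one of the other letters.
    jump = {a: {b: 1 / 3 for b in ALPHABET if b != a} for a in ALPHABET}

    # Hypothetical non-geometric sojourn law: uniform on {1, 2, 3}.
    def sampler(letter, r):
        return r.randint(1, 3)

    hits = (first_hit(simulate_semi_markov(length, jump, sampler, rng), word)
            for _ in range(runs))
    positions = [h for h in hits if h is not None]
    mean = sum(positions) / len(positions) if positions else float("nan")
    return positions, mean


if __name__ == "__main__":
    positions, mean = estimate_first_hit("AAC", length=500, runs=2000)
    print(f"{len(positions)} hits, truncated mean position {mean:.1f}")
```

An empirical histogram of `positions` approximates the first-hitting distribution only up to the simulated length, whereas the article derives these quantities analytically.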



Identification of Words in Biological Sequences Under the Semi-Markov Hypothesis
