Approximate pattern matching with gap constraints

Abstract

Pattern matching is a key issue in sequential pattern mining, and many researchers now focus on pattern matching with gap constraints. However, most of these studies address exact pattern matching, which is a special case of approximate pattern matching; the approximate problem is the more challenging task. In this study, we introduce an approximate pattern matching problem with Hamming distance. Its objective is to compute the number of approximate occurrences of pattern P with gap constraints in sequence S under similarity constraint d. To solve the problem effectively, we propose an efficient algorithm named Single-rOot Nettree for approximate pattern matchinG with gap constraints (SONG), based on a new non-linear data structure, the Single-root Nettree. Theoretical analysis and experiments reveal an interesting law: the ratio M(P,S,d)/N(P,S,m) approximately follows a binomial distribution, where M(P,S,d) is the number of approximate occurrences whose distance to pattern P is d (0≤d≤m), and N(P,S,m) is the number of approximate occurrences whose distance to P is no more than m, the length of pattern P. Experimental results on real biological data validate the efficiency and effectiveness of SONG.
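To make the counting problem concrete, the following is a minimal Python sketch of a straightforward dynamic program. It is not the SONG algorithm and does not use a Nettree. It rests on several assumptions that the abstract does not state: every gap between consecutive pattern characters shares the same bounds [min_gap, max_gap], an occurrence is a sequence of positions in S whose consecutive gaps fall within those bounds, no separate length constraint is applied, and the function name count_by_distance and its parameters are hypothetical.

```python
from typing import List


def count_by_distance(pattern: str, seq: str, min_gap: int, max_gap: int) -> List[int]:
    """Count gap-constrained occurrences of `pattern` in `seq`, grouped by Hamming distance.

    Returns a list `counts` of length len(pattern) + 1, where counts[k] is the
    number of occurrences whose Hamming distance to `pattern` is exactly k.
    Assumption (not from the source): all gaps share the bounds [min_gap, max_gap].
    """
    m, n = len(pattern), len(seq)
    if m == 0 or n == 0:
        return [0] * (m + 1)

    # ways[pos][k]: number of partial occurrences whose latest character is
    # seq[pos] and that have accumulated exactly k mismatches so far.
    ways = [[0] * (m + 1) for _ in range(n)]
    for pos in range(n):
        ways[pos][0 if seq[pos] == pattern[0] else 1] = 1

    for j in range(1, m):
        nxt = [[0] * (m + 1) for _ in range(n)]
        for pos in range(n):
            miss = 0 if seq[pos] == pattern[j] else 1
            # The previous position must satisfy min_gap <= pos - prev - 1 <= max_gap.
            lo = max(0, pos - max_gap - 1)
            hi = pos - min_gap - 1
            for prev in range(lo, hi + 1):
                for k in range(j + 1):  # at most j mismatches after j characters
                    c = ways[prev][k]
                    if c:
                        nxt[pos][k + miss] += c
        ways = nxt

    return [sum(ways[pos][k] for pos in range(n)) for k in range(m + 1)]


if __name__ == "__main__":
    counts = count_by_distance("AB", "AAB", min_gap=0, max_gap=1)
    d = 1
    print("occurrences at distance exactly d:", counts[d])
    print("occurrences within distance d:", sum(counts[: d + 1]))
    print("all gap-constrained occurrences:", sum(counts))
```

Under these assumptions, counts[d] corresponds to the quantity the abstract calls M(P,S,d), and sum(counts) corresponds to N(P,S,m), so the ratio counts[d] / sum(counts) is the quantity the abstract describes as approximately binomial. The sketch runs in O(m · n · (max_gap − min_gap + 1) · m) time and is meant only to illustrate the problem definition, not to compete with a Nettree-based method.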

Keywords

Approximate pattern matching, Gap constraints, Length constraint, Hamming distance, Nettree
