Abstract

The Monge-Elkan distance is a straightforward yet popular measure of the mutual similarity of two sets of objects. It was initially proposed in the field of databases and has since found broad use in other fields. Nowadays, it is especially relevant to the analysis of new-generation sequencing data, where it serves as a measure of dissimilarity between the genomes of two distinct organisms, particularly when applied to unassembled reads. This article provides an algorithm to calculate the p-value associated with the Monge-Elkan distance. Given the object-level null distribution, that is, the distribution of distances between independently and identically sampled objects such as reads, the method yields the null distribution of the Monge-Elkan distance, which in turn allows the p-value to be calculated. We also demonstrate an application to sequencing data, where individual reads are compared by the Levenshtein distance.
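The abstract does not spell out the algorithm, so the following Python sketch is only an illustration of the general idea and not the authors' method. It assumes a one-sided Monge-Elkan distance defined as the sum, over objects in the first set, of the minimum Levenshtein distance to any object in the second set (divide by the size of the first set for the usual mean). It further assumes that all pairwise distances are independent draws from the object-level null distribution, and that small distances count as evidence against the null. All function names (`levenshtein`, `monge_elkan_sum`, `null_pmf_of_sum`, `p_value`) are hypothetical.

```python
from typing import Dict, Sequence


def levenshtein(s: str, t: str) -> int:
    """Edit distance via the standard row-by-row dynamic program."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (cs != ct)))    # substitution or match
        prev = curr
    return prev[-1]


def monge_elkan_sum(a: Sequence[str], b: Sequence[str]) -> int:
    """Assumed one-sided form: sum over x in a of the distance to its nearest y in b."""
    return sum(min(levenshtein(x, y) for y in b) for x in a)


def null_pmf_of_sum(object_pmf: Dict[int, float], n: int, m: int) -> Dict[int, float]:
    """Null PMF of the sum of n minima, each taken over m distances.

    Simplifying assumption (not taken from the article): every pairwise
    distance is an independent draw from object_pmf.
    """
    # PMF of the minimum of m i.i.d. draws: P(min = k) = P(D >= k)^m - P(D > k)^m.
    min_pmf: Dict[int, float] = {}
    tail = 1.0  # P(D >= k) for the current k
    for k in sorted(object_pmf):
        next_tail = max(tail - object_pmf[k], 0.0)  # P(D > k)
        min_pmf[k] = tail ** m - next_tail ** m
        tail = next_tail

    # n-fold convolution of min_pmf gives the PMF of the sum of n minima.
    total: Dict[int, float] = {0: 1.0}
    for _ in range(n):
        nxt: Dict[int, float] = {}
        for s, ps in total.items():
            for k, pk in min_pmf.items():
                nxt[s + k] = nxt.get(s + k, 0.0) + ps * pk
        total = nxt
    return total


def p_value(observed_sum: int, sum_pmf: Dict[int, float]) -> float:
    """Lower-tail p-value: probability of a sum at most as large as observed."""
    return sum(p for s, p in sum_pmf.items() if s <= observed_sum)
```

In use, one would estimate `object_pmf` empirically, for instance from Levenshtein distances between randomly paired reads, compute `monge_elkan_sum` on the two read sets, and pass both to `p_value`. The plain convolution loop is quadratic in the support size; fast polynomial multiplication could speed it up, but whether the article does so is not stated in the abstract.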

An Algorithm to Calculate the p-Value of the Monge-Elkan Distance
