Abstract

Similarity between objects (documents, persons, answers to a questionnaire, etc.) is generally determined through relations between representations of these objects. In the case of binary representations, the presence of a property (e.g. an index term) carries a weight of one and its absence a weight of zero. Many similarity studies ignore common zeros; this is called the zero insensitive case. In this article, however, we study the zero sensitive case. Clearly, answers to binary questionnaires (yes-no, encoded as 1-0) are zero sensitive, as people who answer ‘no’ to the same questions are more similar than those who give different answers. We present a wish list for such a zero sensitive approach to similarity. Distinguishing between common zeros and common ones leads to an ‘identity-similarity’ theory, and hence moves beyond a pure similarity theory. We present two approaches to the similarity measurement of presence-absence data for the case where common zeros matter and have the same effect as common ones. For the case where common ones and common zeros are treated differently, a totally new approach is proposed. In each case a coding approach is used, leading to new representations, which in turn lead to a similarity ranking. Examples of functions respecting these rankings are given.

When discussing similarity in general terms, authors should clearly state which requirements they impose on the notion of ‘similarity’. Only then can the question of the best measure for a given study be discussed in a meaningful way.
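The contrast between the zero insensitive and the zero sensitive case can be illustrated with two standard textbook coefficients. The sketch below is not the coding approach or the ranking functions proposed in the article; it assumes the Jaccard index as a zero insensitive example and the simple matching coefficient as a zero sensitive example in which common zeros weigh as much as common ones. The function names and the sample answers are hypothetical.

```python
from typing import Sequence, Tuple


def match_counts(x: Sequence[int], y: Sequence[int]) -> Tuple[int, int, int]:
    """Return (common ones, common zeros, mismatches) for two 0/1 arrays."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("arrays must be non-empty and of equal length")
    ones = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    zeros = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    return ones, zeros, len(x) - ones - zeros


def jaccard(x: Sequence[int], y: Sequence[int]) -> float:
    """Zero insensitive: common zeros do not enter the formula."""
    ones, _, mismatches = match_counts(x, y)
    denom = ones + mismatches
    # Assumption: report NaN when both arrays contain only zeros,
    # because the index is undefined in that case.
    return ones / denom if denom else float("nan")


def simple_matching(x: Sequence[int], y: Sequence[int]) -> float:
    """Zero sensitive: common zeros count exactly like common ones."""
    ones, zeros, _ = match_counts(x, y)
    return (ones + zeros) / len(x)


# Hypothetical yes-no questionnaire answers (1 = yes, 0 = no).
a = [1, 0, 0, 0, 0]
b = [0, 0, 0, 0, 1]
print(jaccard(a, b))          # no common 'yes' answers
print(simple_matching(a, b))  # three shared 'no' answers are counted
```

For these two respondents the Jaccard index is 0 because they share no ‘yes’ answer, whereas the simple matching coefficient is 3/5 because the three shared ‘no’ answers count. Which behaviour is appropriate depends on the requirements imposed on the notion of similarity.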

Keywords

zero-sensitive similarity; absence-presence data; ranking of identical arrays; radix 4 encoding

An approach to similarity measurement of absence-presence data: the case that common zeros matter