An evaluation of some conflation algorithms for information retrieval

Abstract

The characteristics of conflation algorithms are discussed, and examples are given of some algorithms that have been used in information retrieval systems. Comparative experiments with a range of keyword dictionaries and with the Cranfield document test collection suggest that there is relatively little difference in the performance of the algorithms, despite the widely disparate means by which they have been developed and by which they operate.

Keywords

Affixes, conflation, free text, stemming algorithm, string similarity, suffix stripping
