A key element in modern text retrieval systems is the weighting of individual words for importance. Early in the development of document retrieval methods it was recognized that performance could be improved if weights were based, at least in part, on the frequencies of individual terms in the database. This observation led investigators to propose inverse document frequency (IDF) weighting, which has become the most commonly used approach. IDF weighting can be given some justification on probabilistic grounds; however, many different formulas have been tried, and it is difficult to distinguish among them on a purely theoretical basis. Witten, Moffat and Bell have proposed a monotonicity condition as fundamental: ‘a term that appears in many documents should not be regarded as more important than a term that appears in a few’. Based on this monotonicity assumption and probabilistic arguments, we show here how the TREC data can be used to learn ideal global weights. Using cross-validation, we show that these learned weights yield a modest but statistically significant improvement over IDF weights. One conclusion is that IDF weights are close to optimal under the probabilistic assumptions that are commonly made.
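For reference, a commonly used form of inverse document frequency weighting (the particular variant studied here may differ) assigns to a term $t$ the global weight
$$\mathrm{idf}(t) = \log\frac{N}{n_t},$$
where $N$ is the total number of documents in the collection and $n_t$ is the number of documents containing $t$. This weight decreases monotonically as a term appears in more documents, consistent with the monotonicity condition quoted above.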