
Abstract

This study investigated whether document retrieval can be improved if documents are divided into smaller sub-documents, or passages, and the retrieval scores for these passages are incorporated in the final retrieval score for the whole document. The documents were segmented by sliding a window of a given size across the document and extracting the words displayed each time the window stopped. A retrieval score was calculated for each extracted passage, and the highest score obtained by a passage of that size was taken as the document’s passage-level score for that window size. A range of window sizes was tried.
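The windowing procedure described above can be sketched as follows. This is a minimal illustration rather than the study's implementation: the step by which the window advances, the tokenisation, and the placeholder `overlap_score` scorer are all assumptions, since the abstract does not specify them.

```python
from typing import Callable, List, Sequence


def sliding_windows(words: Sequence[str], size: int, step: int) -> List[List[str]]:
    """Return the word windows seen as a window of `size` words advances by `step`."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    if len(words) <= size:
        # A document shorter than the window is treated as a single passage.
        return [list(words)]
    last_start = len(words) - size
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        # Add a final window so the tail of the document is covered.
        starts.append(last_start)
    return [list(words[s:s + size]) for s in starts]


def overlap_score(query: Sequence[str], passage: Sequence[str]) -> float:
    """Placeholder scorer (assumption): count of passage words that are query terms."""
    terms = set(query)
    return float(sum(1 for w in passage if w in terms))


def passage_level_score(
    query: Sequence[str],
    words: Sequence[str],
    size: int,
    step: int = 0,
    scorer: Callable[[Sequence[str], Sequence[str]], float] = overlap_score,
) -> float:
    """Highest score obtained by any window of `size` words in the document."""
    if step <= 0:
        step = max(1, size // 2)  # assumed default: half-overlapping windows
    return max(scorer(query, p) for p in sliding_windows(words, size, step))
```

Any retrieval function can be passed as `scorer`; the document's passage-level score for a given window size is simply the maximum over its windows.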

The experimental results indicated that using a fixed window size of 50 words gave better results than other window sizes for the TREC-5 and TREC-6 test collections. This window size yielded a significant retrieval improvement of 24% compared to using the whole-document retrieval score (using the traditional tf*idf weighting scheme with cosine normalisation). However, combining this window score and the whole-document retrieval score did not yield a retrieval improvement.
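For readers unfamiliar with the baseline, one common form of tf*idf weighting with cosine normalisation is sketched below. The exact tf and idf variants used in the study are not given in the abstract, so the raw term frequency and the log(N/df) inverse document frequency here are assumptions, and the function names are illustrative only.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List, Mapping, Sequence


def document_frequencies(docs: Iterable[Sequence[str]]) -> Dict[str, int]:
    """Number of documents in which each term occurs at least once."""
    df: Counter = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return dict(df)


def tfidf_vector(tokens: Sequence[str], df: Mapping[str, int], n_docs: int) -> Dict[str, float]:
    """Weight each term by raw tf times log(N / df) (assumed variant)."""
    tf = Counter(tokens)
    return {
        t: f * math.log(n_docs / df[t])
        for t, f in tf.items()
        if df.get(t, 0) > 0
    }


def cosine(u: Mapping[str, float], v: Mapping[str, float]) -> float:
    """Inner product of two weight vectors divided by the product of their lengths."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0
```

The same scoring function can be applied to a whole document or to a single window of words, which is what allows the passage-level and whole-document scores to be compared.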

Using a variable window size (ranging from 50 to 400 words) yielded a retrieval improvement of about 5% over using a fixed window size of 50. Different window sizes were found to work best for different queries. If the best window size to use for each query could be predicted accurately, a maximum retrieval improvement of 42% could be obtained.
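The "maximum improvement" figure is an upper bound that assumes the best window size is known for every query. A rough sketch of how such a bound can be computed from per-query results is given below; the `oracle_gain` helper and its input layout are hypothetical, and it works with any per-query effectiveness measure (the abstract does not state which measure was used).

```python
from typing import Mapping


def oracle_gain(scores: Mapping[str, Mapping[int, float]], baseline_size: int) -> float:
    """Relative gain in mean effectiveness from choosing the best window size
    for each query, compared with using `baseline_size` for every query.

    `scores[query_id][window_size]` holds the effectiveness obtained for that
    query when documents are ranked with that window size.
    """
    if not scores:
        raise ValueError("no queries supplied")
    n = len(scores)
    baseline = sum(per_size[baseline_size] for per_size in scores.values()) / n
    if baseline == 0:
        raise ValueError("baseline effectiveness is zero; relative gain is undefined")
    oracle = sum(max(per_size.values()) for per_size in scores.values()) / n
    return (oracle - baseline) / baseline
```

Because the oracle picks the best size after the fact, the value returned is a ceiling on what any window-size predictor could achieve on the same queries.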

Subsequent work suggests that the usefulness of passage-level evidence in document retrieval depends on the weighting scheme and type of normalisation used in the retrieval method.



Incorporating window-based passage-level evidence in document retrieval
