Revisiting the relationship between document length and relevance
Source
Conference on Information and Knowledge Management
Proceeding of the 17th ACM conference on Information and knowledge management
Napa Valley, California, USA
SESSION: IR: theory
Pages 419-428  
Year of Publication: 2008
ISBN:978-1-59593-991-3
Authors
David E. Losada  Univ. Santiago de Compostela, Santiago de Compostela, Spain
Leif Azzopardi  Univ. Glasgow, Glasgow, Scotland, UK
Mark Baillie  Univ. Strathclyde, Glasgow, Scotland, UK
Sponsors
ACM: Association for Computing Machinery
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
SIGIR: ACM Special Interest Group on Information Retrieval
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1458082.1458139

ABSTRACT

The scope hypothesis in Information Retrieval (IR) states that a relationship exists between document length and relevance, such that the likelihood of relevance increases with document length. A number of empirical studies have provided statistical evidence supporting the scope hypothesis. However, these studies make the implicit assumption that modern test collections are complete (i.e., all documents are assessed for relevance). As a consequence, the observed evidence is misleading. In this paper we perform a deeper analysis of document length and relevance, taking into account that test collections are incomplete. We first demonstrate that previous evidence supporting the scope hypothesis was an artefact of the test collection, where there is a bias towards longer documents in the pooling process. We then evaluate whether this length bias affects system comparison when using incomplete test collections. The results indicate that test collections are problematic when considering MAP as a measure of effectiveness but are relatively robust when using bpref. The study implies that retrieval models should not be tuned to favour longer documents, and that designers of new test collections should take measures against length bias during the pooling process in order to create more reliable and robust test collections.
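
The contrast between MAP and bpref on an incomplete collection can be illustrated with a small sketch. It is not taken from the paper: the function names and the toy ranking are assumptions made for illustration, and the bpref variant shown is one common formulation (with min(R, N) in the denominator), which may differ from the one the authors used. Average precision treats every unjudged document as non-relevant, whereas bpref ignores unjudged documents altogether.

```python
# Illustrative sketch only: function names and toy data are assumptions,
# not taken from the paper.

def average_precision(ranking, relevant):
    """AP for one query; any document not in `relevant` counts as a miss.

    MAP is the mean of this value over all queries.
    """
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def bpref(ranking, relevant, nonrelevant):
    """bpref for one query, using judged documents only.

    R = number of judged relevant documents,
    N = number of judged non-relevant documents.
    Unjudged documents in the ranking are skipped entirely.
    """
    r_count, n_count = len(relevant), len(nonrelevant)
    if r_count == 0:
        return 0.0
    denom = min(r_count, n_count)
    nonrel_seen, total = 0, 0.0
    for doc in ranking:
        if doc in nonrelevant:
            nonrel_seen += 1
        elif doc in relevant:
            if denom == 0:
                total += 1.0  # no judged non-relevant documents to penalise
            else:
                total += 1.0 - min(nonrel_seen, denom) / denom
    return total / r_count


# Toy query: d3 and d5 were never assessed (they fell outside the pool).
ranking = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d4"}
nonrelevant = {"d2"}

# The same ranking with the unjudged documents removed.
judged_only = [d for d in ranking if d in relevant or d in nonrelevant]

ap_full = average_precision(ranking, relevant)
ap_judged = average_precision(judged_only, relevant)
bpref_full = bpref(ranking, relevant, nonrelevant)
bpref_judged = bpref(judged_only, relevant, nonrelevant)

print(ap_full, ap_judged)        # AP changes when unjudged documents are dropped
print(bpref_full, bpref_judged)  # bpref is unchanged
```

In this toy case average precision shifts when the unjudged documents are removed while bpref does not, which is the kind of sensitivity to incomplete judgments that the abstract refers to.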

