Minimal test collections for retrieval evaluation
Source: Annual ACM Conference on Research and Development in Information Retrieval
Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Seattle, Washington, USA
SESSION: Evaluation 1--user models and test collections
Pages: 268 - 275  
Year of Publication: 2006
ISBN:1-59593-369-7
Authors
Ben Carterette  University of Massachusetts Amherst, Amherst, MA
James Allan  University of Massachusetts Amherst, Amherst, MA
Ramesh Sitaraman  University of Massachusetts Amherst, Amherst, MA
Sponsors
SIGIR: ACM Special Interest Group on Information Retrieval
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 19,   Downloads (12 Months): 226,   Citation Count: 22
DOI: http://doi.acm.org/10.1145/1148170.1148219

ABSTRACT

Accurate estimation of information retrieval evaluation metrics such as average precision requires large sets of relevance judgments. Building sets large enough for evaluation of real-world implementations is at best inefficient, at worst infeasible. In this work we link evaluation with test collection construction to gain an understanding of the minimal judging effort that must be done to have high confidence in the outcome of an evaluation. A new way of looking at average precision leads to a natural algorithm for selecting documents to judge and allows us to estimate the degree of confidence by defining a distribution over possible document judgments. A study with annotators shows that this method can be used by a small group of researchers to rank a set of systems in under three hours with 95% confidence.
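
For readers unfamiliar with the quantities the abstract refers to, the following is a minimal sketch, not the paper's algorithm. It computes average precision in its conventional textbook form and then illustrates the idea of "confidence under a distribution over possible judgments" with plain Monte Carlo sampling. The function names, the toy data, and in particular the assumption that every unjudged document is independently relevant with probability 0.5 are hypothetical choices made for illustration only.

```python
import random
from typing import Dict, Sequence, Set


def average_precision(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Conventional average precision of one ranked list.

    `ranking` holds document ids in rank order; `relevant` is the set of
    ids judged relevant. Relevant documents missing from the ranking
    still count in the denominator.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / len(relevant)


def confidence_a_beats_b(
    ranking_a: Sequence[str],
    ranking_b: Sequence[str],
    judged: Dict[str, bool],
    p_relevant: float = 0.5,  # ASSUMPTION: prior for unjudged documents
    samples: int = 2000,
    seed: int = 0,
) -> float:
    """Estimate P(AP(a) > AP(b)) by sampling the unjudged documents.

    Illustrative Monte Carlo stand-in; it is not the estimator used in
    the paper.
    """
    rng = random.Random(seed)
    docs = sorted(set(ranking_a) | set(ranking_b))
    unjudged = [d for d in docs if d not in judged]
    wins = 0
    for _ in range(samples):
        # Start from the known judgments, then fill in the unknown ones.
        relevant = {d for d, rel in judged.items() if rel}
        relevant.update(d for d in unjudged if rng.random() < p_relevant)
        ap_a = average_precision(ranking_a, relevant)
        ap_b = average_precision(ranking_b, relevant)
        if ap_a > ap_b:
            wins += 1
    return wins / samples


if __name__ == "__main__":
    system_a = ["d1", "d2", "d3", "d4"]
    system_b = ["d4", "d3", "d2", "d1"]
    # Only d1 has been judged so far; the rest are sampled.
    print(confidence_a_beats_b(system_a, system_b, judged={"d1": True}))
```

In this sketch, once every document has been judged there is nothing left to sample and the estimate collapses to 0 or 1, which is one way to picture why a small, well-chosen set of judgments can be enough to rank systems with high confidence.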


