Minimal test collections for retrieval evaluation
Source
Annual ACM Conference on Research and Development in Information Retrieval
Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Seattle, Washington, USA
SESSION: Evaluation 1--user models and test collections
Pages: 268 - 275
Year of Publication: 2006
ISBN: 1-59593-369-7
Bibliometrics
Downloads (6 Weeks): 22, Downloads (12 Months): 205, Citation Count: 22
ABSTRACT
Accurate estimation of information retrieval evaluation metrics such as average precision requires large sets of relevance judgments. Building sets large enough for evaluation of real-world implementations is at best inefficient, at worst infeasible. In this work we link evaluation with test collection construction to gain an understanding of the minimal judging effort that must be done to have high confidence in the outcome of an evaluation. A new way of looking at average precision leads to a natural algorithm for selecting documents to judge and allows us to estimate the degree of confidence by defining a distribution over possible document judgments. A study with annotators shows that this method can be used by a small group of researchers to rank a set of systems in under three hours with 95% confidence.

Information retrieval metrics such as average precision require large sets of relevance judgments to be accurately estimated. Building these sets is often inefficient, or outright infeasible, for many real-world retrieval implementations. We present a new way of looking at average precision that allows us to estimate the confidence in an evaluation based on the size of the test collection. We use this to build an algorithm for selecting the best documents to judge, so as to have maximum confidence in an evaluation with a minimal number of relevance judgments. A study with annotators shows how the algorithm can be used by a small group of researchers to quickly rank a set of systems with 95% confidence.
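To make the idea of "confidence under incomplete judgments" concrete, here is a minimal Python sketch. It is not the paper's algorithm: it computes standard average precision, then estimates by Monte Carlo sampling how often one system outscores another when unjudged documents are filled in at random. The function names, the independence assumption, and the prior `p_rel=0.5` are all illustrative assumptions, not taken from the source.

```python
import random


def average_precision(ranking, relevant):
    """Average precision of a ranked list of doc ids, given the set of relevant ids."""
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
    return precision_sum / len(relevant) if relevant else 0.0


def confidence_a_beats_b(rank_a, rank_b, judged, p_rel=0.5, trials=2000, seed=0):
    """Monte Carlo estimate of P(AP_a > AP_b) over the unjudged documents.

    judged maps doc id -> True/False for documents already judged.
    ASSUMPTION: each unjudged document is relevant independently with
    probability p_rel (a hypothetical prior chosen for illustration).
    """
    rng = random.Random(seed)
    unjudged = sorted((set(rank_a) | set(rank_b)) - set(judged))
    known_relevant = {doc for doc, rel in judged.items() if rel}
    wins = 0
    for _ in range(trials):
        # Sample one possible completion of the relevance judgments.
        sampled = {doc for doc in unjudged if rng.random() < p_rel}
        relevant = known_relevant | sampled
        if average_precision(rank_a, relevant) > average_precision(rank_b, relevant):
            wins += 1
    return wins / trials


if __name__ == "__main__":
    system_a = ["d1", "d2", "d3", "d4"]
    system_b = ["d2", "d1", "d4", "d3"]
    # Only one document judged so far; the rest are sampled.
    print(confidence_a_beats_b(system_a, system_b, {"d1": True}))
```

Once every document is judged, nothing is left to sample and the estimate collapses to exactly 0 or 1. In this simplified picture, each additional judgment removes one source of uncertainty, which is why the choice of which document to judge next matters.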
CITED BY 22
Ben Carterette, Virgil Pavlu, Evangelos Kanoulas, Javed A. Aslam, James Allan, Evaluation over thousands of queries, Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, July 20-24, 2008, Singapore, Singapore
Thomas Mandl, Christa Womser-Hacker, Giorgio Di Nunzio, Nicola Ferro, How robust are multilingual information retrieval systems?, Proceedings of the 2008 ACM symposium on Applied computing, March 16-20, 2008, Fortaleza, Ceara, Brazil
Javed A. Aslam, Evangelos Kanoulas, Virgil Pavlu, Stefan Savev, Emine Yilmaz, Document selection methodologies for efficient and effective learning-to-rank, Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, July 19-23, 2009, Boston, MA, USA