| A statistical method for system evaluation using incomplete judgments |
| Full text |
Pdf
(585 KB)
|
| Source
|
Annual ACM Conference on Research and Development in Information Retrieval
archive
Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
table of contents
Seattle, Washington, USA
SESSION: Evaluation 2
table of contents
Pages: 541 - 548
Year of Publication: 2006
ISBN:1-59593-369-7
|
|
Authors
|
|
| Sponsors |
|
| Publisher |
|
| Bibliometrics |
Downloads (6 Weeks): 16, Downloads (12 Months): 183, Citation Count: 27
|
|
|
ABSTRACT
We consider the problem of large-scale retrieval evaluation, and we propose a statistical method for evaluating retrieval systems using incomplete judgments. Unlike existing techniques that (1) rely on effectively complete, and thus prohibitively expensive, relevance judgment sets, (2) produce biased estimates of standard performance measures, or (3) produce estimates of non-standard measures thought to be correlated with these standard measures, our proposed statistical technique produces unbiased estimates of the standard measures themselves.Our proposed technique is based on random sampling. While our estimates are unbiased by statistical design, their variance is dependent on the sampling distribution employed; as such, we derive a sampling distribution likely to yield low variance estimates. We test our proposed technique using benchmark TREC data, demonstrating that a sampling pool derived from a set of runs can be used to efficiently and effectively evaluate those runs. We further show that these sampling pools generalize well to unseen runs. Our experiments indicate that highly accurate estimates of standard performance measures can be obtained using a number of relevance judgments as small as 4% of the typical TREC-style judgment pool.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
E. C. Anderson. Monte carlo methods and importance sampling. Lecture Notes for Statistical Genetics, October 1999.
|
 |
2
|
Javed A. Aslam , Virgiliu Pavlu , Robert Savell, A unified model for metasearch, pooling, and system evaluation, Proceedings of the twelfth international conference on Information and knowledge management, November 03-08, 2003, New Orleans, LA, USA
[doi> 10.1145/956863.956953]
|
 |
3
|
|
| |
4
|
J. A. Aslam, V. Pavlu, and E. Yilmaz. A sampling technique for efficiently estimating measures of query retrieval performance using incomplete judgments. In Proceedings of the 22nd ICML Workshop on Learning with Partially Classified Training Data, August 2005. Copyright held by authors.
|
 |
5
|
|
 |
6
|
|
 |
7
|
|
| |
8
|
D. Harman. Overview of the third text REtreival conference (TREC-3). In D. Harman, editor, Overview of the Third Text REtrieval Conference (TREC-3), pages 1--19, Gaithersburg, MD, USA, Apr. 1995. U. S. Government Printing Office, Washington D. C.
|
| |
9
|
P. Kantor, M.-H. Kim, U. Ibraev, and K. Atasoy. Estimating the number of relevant documents in enormous collections. In D. Cfd, editor, Proceedings of tthe 62nd Annual Meeting of the American Sociaty for Information Science, volume 36, pages 507--514, 1999.
|
| |
10
|
J. A. Rice. Mathematical Statistics and Data Analysis. Wadsworth and Brooks/Cole, 1988.
|
| |
11
|
E. M. Voorhees and D. Harman. Overview of the seventh text retrieval conference (TREC-7). In Proceedings of he Seventh Text REtrieval Conference (TREC-7), pages 1--24, 1999.
|
 |
12
|
|
CITED BY 27
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Ben Carterette , Virgil Pavlu , Evangelos Kanoulas , Javed A. Aslam , James Allan, Evaluation over thousands of queries, Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, July 20-24, 2008, Singapore, Singapore
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Tanuja Bompada , Chi-Chao Chang , John Chen , Ravi Kumar , Rajesh Shenoy, On the robustness of relevance measures with incomplete judgments, Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, July 23-27, 2007, Amsterdam, The Netherlands
|
|
|
|
|
|
|
|
|
Peter Bailey , Nick Craswell , Ian Soboroff , Paul Thomas , Arjen P. de Vries , Emine Yilmaz, Relevance assessment: are judges exchangeable and does it matter, Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, July 20-24, 2008, Singapore, Singapore
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Javed A. Aslam , Evangelos Kanoulas , Virgil Pavlu , Stefan Savev , Emine Yilmaz, Document selection methodologies for efficient and effective learning-to-rank, Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, July 19-23, 2009, Boston, MA, USA
|
|
|
|
|