A statistical method for system evaluation using incomplete judgments
Source: Annual ACM Conference on Research and Development in Information Retrieval
Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Seattle, Washington, USA
SESSION: Evaluation 2
Pages: 541 - 548  
Year of Publication: 2006
ISBN:1-59593-369-7
Authors
Javed A. Aslam  Northeastern University, Boston, MA
Virgil Pavlu  Northeastern University, Boston, MA
Emine Yilmaz  Northeastern University, Boston, MA
Sponsors
SIGIR: ACM Special Interest Group on Information Retrieval
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 16,   Downloads (12 Months): 183,   Citation Count: 27
DOI: http://doi.acm.org/10.1145/1148170.1148263

ABSTRACT

We consider the problem of large-scale retrieval evaluation, and we propose a statistical method for evaluating retrieval systems using incomplete judgments. Unlike existing techniques that (1) rely on effectively complete, and thus prohibitively expensive, relevance judgment sets, (2) produce biased estimates of standard performance measures, or (3) produce estimates of non-standard measures thought to be correlated with these standard measures, our proposed statistical technique produces unbiased estimates of the standard measures themselves. Our proposed technique is based on random sampling. While our estimates are unbiased by statistical design, their variance is dependent on the sampling distribution employed; as such, we derive a sampling distribution likely to yield low variance estimates. We test our proposed technique using benchmark TREC data, demonstrating that a sampling pool derived from a set of runs can be used to efficiently and effectively evaluate those runs. We further show that these sampling pools generalize well to unseen runs. Our experiments indicate that highly accurate estimates of standard performance measures can be obtained using a number of relevance judgments as small as 4% of the typical TREC-style judgment pool.
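
The abstract states only that the estimates come from random sampling and that their variance depends on the sampling distribution. As a rough illustration of that general idea, and not the estimator derived in the paper, the sketch below uses textbook importance sampling to estimate the fraction of relevant documents in a ranked list from a handful of judgments. The function name `estimate_precision`, its parameters, and the toy data are all hypothetical.

```python
import random


def estimate_precision(relevance, probs, n_samples, rng):
    """Importance-sampling estimate of the fraction of relevant documents.

    relevance: list of 0/1 judgments; only the sampled entries are looked up,
               which stands in for judging only the sampled documents.
    probs:     sampling probability for each document (positive, sums to 1).
    rng:       a random.Random instance, passed in for reproducibility.
    """
    k = len(relevance)
    # Draw document indices with replacement according to probs.
    drawn = rng.choices(range(k), weights=probs, k=n_samples)
    # Reweight each sampled judgment by 1 / (k * p) so that the expected
    # value of every term equals the true fraction of relevant documents.
    return sum(relevance[i] / (k * probs[i]) for i in drawn) / n_samples


# Toy ranked list (assumed data): 4 of 10 documents are relevant.
judgments = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
# A non-uniform distribution that favours highly ranked documents.
weights = [1.0 / rank for rank in range(1, 11)]
total = sum(weights)
top_heavy = [w / total for w in weights]

estimate = estimate_precision(judgments, top_heavy, 5, random.Random(0))
```

Any strictly positive distribution keeps this estimator unbiased; the choice of distribution changes only how much individual estimates scatter around the true value, which is the property the abstract says the authors optimize.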


