Hypothesis testing with incomplete relevance judgments
Source
Conference on Information and Knowledge Management
Proceedings of the sixteenth ACM Conference on Information and Knowledge Management
Lisbon, Portugal
SESSION: IR evaluation (IR)
Pages 643-652  
Year of Publication: 2007
ISBN:978-1-59593-803-9
Authors
Ben Carterette  University of Massachusetts Amherst, Amherst, MA
Mark D. Smucker  University of Massachusetts Amherst, Amherst, MA
Sponsors
SIGIR: ACM Special Interest Group on Information Retrieval
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1321440.1321530

ABSTRACT

Information retrieval experimentation generally proceeds in a cycle of development, evaluation, and hypothesis testing. Ideally, the evaluation and testing phases should be short and easy, so as to maximize the amount of time spent in development. Recent work has reduced the assessor effort needed to evaluate retrieval systems, but for the most part it has not investigated the effects of these methods on tests of significance. In this work, we explore in detail the effects of reduced sets of judgments on the sign test. We demonstrate both analytically and empirically the relationship between the power of the test, the number of topics evaluated, and the number of judgments available. Using these relationships, we can determine the number of topics and judgments needed for the least-cost but highest-confidence significance evaluation. Specifically, testing pairwise significance over 192 topics with fewer than 5 judgments for each is as good as testing significance over 25 topics with an average of 166 judgments for each: 85% less effort with no additional errors.
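
For readers unfamiliar with the sign test named in the abstract, the following is a minimal sketch of the standard exact two-sided sign test and its power as a function of the number of topics. It is not the authors' code. The function names, the parameter `p_win` (the assumed probability that one system beats the other on a single topic), and the default `alpha=0.05` are all illustrative assumptions. The sketch holds `p_win` fixed and does not model how incomplete judgments change it, which is the relationship the paper itself studies.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def sign_test_p_value(wins, losses):
    """Exact two-sided sign test p-value.

    `wins` and `losses` count topics where system A scored higher or
    lower than system B; tied topics are assumed to be dropped already.
    """
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # Probability of a split at least this lopsided under the null (p = 0.5).
    tail = sum(binom_pmf(i, n, 0.5) for i in range(k + 1))
    return min(1.0, 2.0 * tail)


def sign_test_power(n_topics, p_win, alpha=0.05):
    """Probability that the test rejects at level alpha when system A
    truly wins each topic independently with probability p_win
    (an assumed, hypothetical effect size)."""
    return sum(
        binom_pmf(w, n_topics, p_win)
        for w in range(n_topics + 1)
        if sign_test_p_value(w, n_topics - w) <= alpha
    )


if __name__ == "__main__":
    # Hypothetical outcome: A beats B on 20 of 25 topics.
    print("p-value:", sign_test_p_value(20, 5))
    # Power at the two topic-set sizes mentioned in the abstract,
    # for an assumed per-topic win probability of 0.6.
    for n in (25, 192):
        print(n, "topics, power:", sign_test_power(n, 0.6))
```

With `p_win` held fixed, power grows with the number of topics. Whether that gain survives the extra noise from having very few judgments per topic is the trade-off the paper quantifies.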



