A comparison of statistical significance tests for information retrieval evaluation
Full text: PDF (1.04 MB)
Source
Conference on Information and Knowledge Management
Proceedings of the sixteenth ACM Conference on Information and Knowledge Management
Lisbon, Portugal
SESSION: IR evaluation (IR)
Pages 623-632
Year of Publication: 2007
ISBN: 978-1-59593-803-9
Authors
Mark D. Smucker, University of Massachusetts Amherst, Amherst, MA
James Allan, University of Massachusetts Amherst, Amherst, MA
Ben Carterette, University of Massachusetts Amherst, Amherst, MA
Sponsors
SIGIR: ACM Special Interest Group on Information Retrieval
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
ACM: Association for Computing Machinery
Publisher
ACM, New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 39,   Downloads (12 Months): 231,   Citation Count: 10
DOI: http://doi.acm.org/10.1145/1321440.1321528

APPENDICES and SUPPLEMENTS
This is the original PDF as published in the proceedings. An error was found in the Conclusion and corrected post-publication. The Corrected Version of Record is now posted in the ACM Digital Library. See Full Text above.


ABSTRACT

Information retrieval (IR) researchers commonly use three tests of statistical significance: the Student's paired t-test, the Wilcoxon signed rank test, and the sign test. Other researchers have previously proposed using both the bootstrap and Fisher's randomization (permutation) test as non-parametric significance tests for IR but these tests have seen little use. For each of these five tests, we took the ad-hoc retrieval runs submitted to TRECs 3 and 5-8, and for each pair of runs, we measured the statistical significance of the difference in their mean average precision. We discovered that there is little practical difference between the randomization, bootstrap, and t tests. Both the Wilcoxon and sign test have a poor ability to detect significance and have the potential to lead to false detections of significance. The Wilcoxon and sign tests are simplified variants of the randomization test and their use should be discontinued for measuring the significance of a difference between means.
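
As a rough illustration of the randomization (permutation) test mentioned in the abstract, the sketch below compares two runs by their per-topic average precision. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `randomization_test`, its parameters, the choice of 10,000 sampled sign flips, and the example scores are all hypothetical. It relies on the standard observation that, for paired data, swapping the two system labels on a topic is the same as flipping the sign of that topic's score difference.

```python
import random


def randomization_test(scores_a, scores_b, num_samples=10_000, seed=0):
    """Approximate two-sided paired randomization test on a difference in means.

    scores_a and scores_b hold per-topic scores (e.g. average precision)
    for two runs over the same topics, in the same order.
    Returns an estimated p-value.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both runs must be scored on the same topics")

    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = abs(sum(diffs)) / n  # |difference in mean score|

    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    hits = 0
    for _ in range(num_samples):
        # Under the null hypothesis the system labels are exchangeable per
        # topic, so each difference keeps or flips its sign with equal chance.
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(permuted) / n >= observed:
            hits += 1
    return hits / num_samples


if __name__ == "__main__":
    # Hypothetical per-topic average precision values, for illustration only.
    run_a = [0.42, 0.31, 0.55, 0.12, 0.67, 0.29, 0.48, 0.36]
    run_b = [0.40, 0.25, 0.51, 0.15, 0.60, 0.22, 0.47, 0.30]
    print("estimated p-value:", randomization_test(run_a, run_b))
```

Sampling sign flips rather than enumerating all 2^n label assignments keeps the cost manageable for the topic counts typical of TREC collections; with very few topics, exact enumeration would be feasible instead.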

