A comparison of statistical significance tests for information retrieval evaluation
Full text: PDF (1.04 MB)
Source
Conference on Information and Knowledge Management
Proceedings of the sixteenth ACM Conference on Information and Knowledge Management
Lisbon, Portugal
SESSION: IR evaluation (IR)
Pages 623-632
Year of Publication: 2007
ISBN: 978-1-59593-803-9
APPENDICES and SUPPLEMENTS
This is the original PDF as published in the proceedings. An error was found in the Conclusion and corrected post-publication. The Corrected Version of Record is now posted in the ACM Digital Library. See Full Text above.
ABSTRACT
Information retrieval (IR) researchers commonly use three tests of statistical significance: the Student's paired t-test, the Wilcoxon signed rank test, and the sign test. Other researchers have previously proposed using both the bootstrap and Fisher's randomization (permutation) test as non-parametric significance tests for IR, but these tests have seen little use. Using the ad-hoc retrieval runs submitted to TRECs 3 and 5-8, we applied each of these five tests to every pair of runs and measured the statistical significance of the difference in their mean average precision. We discovered that there is little practical difference between the randomization, bootstrap, and t tests. Both the Wilcoxon and sign tests have a poor ability to detect significance and have the potential to lead to false detections of significance. The Wilcoxon and sign tests are simplified variants of the randomization test, and their use should be discontinued for measuring the significance of a difference between means.
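As an illustration of the kind of test the abstract recommends, the following is a minimal sketch of a paired randomization (permutation) test on the difference in mean average precision between two runs. It is not the authors' implementation: the function name `randomization_test`, its parameters, and the example scores are hypothetical, and it assumes that per-topic average precision is already available for both runs in the same topic order. Because enumerating every sign assignment is infeasible for many topics, the sketch estimates the p-value by Monte Carlo sampling.

```python
import random
from statistics import mean


def randomization_test(scores_a, scores_b, n_samples=10000, seed=0):
    """Estimate a two-sided p-value for the difference in means of paired scores.

    scores_a, scores_b: per-topic scores (for example, average precision)
    for two runs, listed in the same topic order. Names are illustrative.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both runs must be scored on the same topics")

    # The mean of the per-topic differences equals the difference in MAP.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(mean(diffs))

    rng = random.Random(seed)
    at_least_as_extreme = 0
    for _ in range(n_samples):
        # Under the null hypothesis the two run labels are exchangeable on
        # each topic, so each difference is equally likely to carry either sign.
        permuted = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(permuted)) >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_samples


if __name__ == "__main__":
    # Hypothetical per-topic average precision for two runs on five topics.
    run_a = [0.42, 0.31, 0.77, 0.15, 0.58]
    run_b = [0.39, 0.25, 0.70, 0.18, 0.41]
    print(randomization_test(run_a, run_b))
```

The sign test and the Wilcoxon test can be seen as the same procedure applied after replacing each difference by its sign or by its signed rank, which is the sense in which the abstract calls them simplified variants of the randomization test.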