Evaluating evaluation metrics based on the bootstrap
Source: Annual ACM Conference on Research and Development in Information Retrieval
Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
Seattle, Washington, USA
Session: Evaluation 2
Pages: 525-532
Year of Publication: 2006
ISBN: 1-59593-369-7
Author
Tetsuya Sakai, Toshiba Corporate R&D Center
Sponsors
SIGIR: ACM Special Interest Group on Information Retrieval
ACM: Association for Computing Machinery
Publisher
ACM, New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 12,   Downloads (12 Months): 96,   Citation Count: 16
DOI: http://doi.acm.org/10.1145/1148170.1148261

ABSTRACT

This paper describes how the Bootstrap approach to statistics can be applied to the evaluation of IR effectiveness metrics. First, we argue that Bootstrap Hypothesis Tests deserve more attention from the IR community, as they are based on fewer assumptions than traditional statistical significance tests. We then describe straightforward methods for comparing the sensitivity of IR metrics based on Bootstrap Hypothesis Tests. Unlike the heuristics-based "swap" method proposed by Voorhees and Buckley, our method estimates the performance difference required to achieve a given significance level directly from Bootstrap Hypothesis Test results. In addition, we describe a simple way of examining the accuracy of rank correlation between two metrics based on the Bootstrap Estimate of Standard Error. We demonstrate the usefulness of our methods using test collections and runs from the NTCIR CLIR track for comparing seven IR metrics, including those that can handle graded relevance and those based on the Geometric Mean.
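
To make the idea of a Bootstrap Hypothesis Test concrete, the following is a minimal Python sketch of a paired test for two systems scored on the same topics. It follows the general shift-and-resample recipe from Efron and Tibshirani and is an illustration only: the paper's exact procedure may differ, and the function names (studentized_mean, paired_bootstrap_asl), the default of 1000 resamples, and the seed parameter are assumptions introduced here, not taken from the paper.

```python
import math
import random
import statistics


def studentized_mean(values):
    """Return mean / (sample standard deviation / sqrt(n))."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    if sd == 0.0:
        # All values identical: the statistic is 0 or unbounded.
        return 0.0 if mean == 0.0 else math.copysign(math.inf, mean)
    return mean / (sd / math.sqrt(n))


def paired_bootstrap_asl(scores_x, scores_y, num_resamples=1000, seed=0):
    """Estimate a two-sided achieved significance level (ASL).

    scores_x and scores_y hold per-topic metric values for two systems,
    aligned by topic. Hypothetical helper, not the paper's code.
    """
    if len(scores_x) != len(scores_y) or len(scores_x) < 2:
        raise ValueError("need two aligned score lists with at least 2 topics")

    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(scores_x, scores_y)]
    n = len(diffs)
    observed = abs(studentized_mean(diffs))

    # Shift the differences so that the null hypothesis (mean = 0) holds.
    mean_diff = statistics.fmean(diffs)
    shifted = [d - mean_diff for d in diffs]

    count = 0
    for _ in range(num_resamples):
        # Resample topics with replacement from the shifted differences.
        sample = rng.choices(shifted, k=n)
        if abs(studentized_mean(sample)) >= observed:
            count += 1
    return count / num_resamples


if __name__ == "__main__":
    # Made-up per-topic scores, for illustration only.
    system_a = [0.42, 0.35, 0.51, 0.28, 0.47, 0.39, 0.44, 0.31]
    system_b = [0.40, 0.30, 0.49, 0.29, 0.41, 0.36, 0.45, 0.27]
    print("estimated ASL:", paired_bootstrap_asl(system_a, system_b))
```

A small ASL suggests the observed difference is unlikely under the null hypothesis. The only distributional assumption is that the topics form a random sample from some population, which is the sense in which such tests rest on fewer assumptions than, for example, the paired t-test.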


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

1. Asakawa, S. and Selberg, E.: The New MSN Search Engine Developed by Microsoft (in Japanese), Information Processing Society of Japan Magazine, Vol. 46, No. 9, pp. 1008-1015, 2005.
3. Efron, B. and Tibshirani, R.: An Introduction to the Bootstrap, Chapman & Hall/CRC, 1993.
6. Johnson, D. H.: The Insignificance of Statistical Significance Testing, Journal of Wildlife Management, Vol. 63, Issue 3, pp. 763-772, 1999.
9. NTCIR: http://research.nii.ac.jp/ntcir/
10. Sakai, T.: The Effect of Topic Sampling on Sensitivity Comparisons of Information Retrieval Metrics, NTCIR-5 Proceedings, pp. 505-512, 2005.
11. Sakai, T. et al.: Toshiba BRIDJE at NTCIR-5 CLIR: Evaluation using Geometric Means, NTCIR-5 Proceedings, pp. 56-63, 2005.
12. Sakai, T.: On the Reliability of Information Retrieval Metrics based on Graded Relevance, Information Processing and Management, to appear, 2006.
16. Voorhees, E. M.: Overview of the TREC 2004 Robust Retrieval Track, TREC 2004 Proceedings, 2005.
17. Vu, H.-T. and Gallinari, P.: On Effectiveness Measures and Relevance Functions in Ranking INEX Systems, AIRS 2005 Proceedings, LNCS 3689, pp. 312-327, 2005.
