ABSTRACT
This paper presents a novel way of examining the accuracy of the evaluation measures commonly used in information retrieval experiments. It validates several of the rules of thumb experimenters use, such as that a good experiment needs at least 25 queries and that 50 is better, while challenging other beliefs, such as that the common evaluation measures are equally reliable. As an example, we show that Precision at 30 documents has about twice the average error rate of Average Precision. These results can help information retrieval researchers design experiments that provide a desired level of confidence in their results. In particular, we suggest that researchers using Web measures such as Precision at 10 documents will need to use many more than 50 queries, or will have to require a very large difference in evaluation scores between two methods before concluding that the two methods are actually different.
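For readers unfamiliar with the two measures the abstract compares, the following is a minimal sketch of Precision at k documents and (non-interpolated) Average Precision for a single query, assuming the standard textbook definitions. It is not taken from the paper; the function names and the boolean-list input format are illustrative assumptions.

```python
from typing import Optional, Sequence


def precision_at_k(relevance: Sequence[bool], k: int) -> float:
    """Fraction of the top-k ranked documents that are relevant.

    `relevance[i]` is True when the document at rank i+1 is relevant.
    Ranks past the end of the list count as non-relevant.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(relevance[:k]) / k


def average_precision(relevance: Sequence[bool],
                      num_relevant: Optional[int] = None) -> float:
    """Sum of the precision at each relevant retrieved document,
    divided by the total number of relevant documents for the query.

    `num_relevant` defaults to the number of relevant documents in the
    ranking (an assumption: it presumes every relevant document was retrieved).
    """
    if num_relevant is None:
        num_relevant = sum(relevance)
    if num_relevant == 0:
        return 0.0
    hits = 0
    total = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / num_relevant


# Hypothetical ranking for one query: relevant documents at ranks 1 and 3.
ranking = [True, False, True, False, False]
p_at_2 = precision_at_k(ranking, 2)
ap = average_precision(ranking)
```

Precision at k looks only at a fixed cutoff, whereas Average Precision depends on the rank of every relevant document. In an experiment, each measure would be averaged over all queries in the test set.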
CITED BY 93
Steven M. Beitzel, Eric C. Jensen, Abdur Chowdhury, David Grossman, Using titles and category names from editor-driven taxonomies for automatic evaluation, Proceedings of the twelfth international conference on Information and knowledge management, November 03-08, 2003, New Orleans, LA, USA
Douglas W. Oard, Dagobert Soergel, David Doermann, Xiaoli Huang, G. Craig Murray, Jianqiang Wang, Bhuvana Ramabhadran, Martin Franz, Samuel Gustman, James Mayfield, Liliya Kharevych, Stephanie Strassel, Building an information retrieval test collection for spontaneous conversational speech, Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, July 25-29, 2004, Sheffield, United Kingdom
Steven M. Beitzel, Eric C. Jensen, Abdur Chowdhury, David Grossman, Ophir Frieder, Using manually-built web directories for automatic evaluation of known-item retrieval, Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, July 28-August 01, 2003, Toronto, Canada
Steven M. Beitzel, Eric C. Jensen, Ophir Frieder, Abdur Chowdhury, Greg Pass, Surrogate scoring for improved metasearch precision, Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, August 15-19, 2005, Salvador, Brazil
Jack G. Conrad, Xi S. Guo, Peter Jackson, Monem Meziou, Database selection using actual physical and acquired logical collection resources in a massive domain-specific operational environment, Proceedings of the 28th international conference on Very Large Data Bases, p.71-82, August 20-23, 2002, Hong Kong, China
Jun Yan, Ning Liu, Qiang Yang, Benyu Zhang, Qiansheng Cheng, Zheng Chen, Mining Adaptive Ratio Rules from Distributed Data Sources, Data Mining and Knowledge Discovery, v.12 n.2-3, p.249-273, May 2006
Carina F. Dorneles, Carlos A. Heuser, Viviane Moreira Orengo, Altigran S. da Silva, Edleno S. de Moura, A strategy for allowing meaningful and comparable scores in approximate matching, Proceedings of the sixteenth ACM conference on information and knowledge management, November 06-10, 2007, Lisbon, Portugal
Jianhan Zhu, Jun Wang, Vishwa Vinay, Ingemar J. Cox, Topic (query) selection for IR evaluation, Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, July 19-23, 2009, Boston, MA, USA
Tanuja Bompada, Chi-Chao Chang, John Chen, Ravi Kumar, Rajesh Shenoy, On the robustness of relevance measures with incomplete judgments, Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, July 23-27, 2007, Amsterdam, The Netherlands
Susan L. Price, Marianne Lykke Nielsen, Lois M. L. Delcambre, Peter Vedsted, Jeremy Steinhauer, Using semantic components to search for domain-specific documents: An evaluation from the system perspective and the user perspective, Information Systems, v.34 n.8, p.778-806, December 2009
Carina F. Dorneles, Marcos Freitas Nunes, Carlos A. Heuser, Viviane P. Moreira, Altigran S. da Silva, Edleno S. de Moura, A strategy for allowing meaningful and comparable scores in approximate matching, Information Systems, v.34 n.8, p.740-756, December 2009