ABSTRACT
In dynamic environments such as the World Wide Web, a changing document collection, query population, and set of search services demand frequent repetition of search effectiveness (relevance) evaluations. Reconstructing static test collections, such as those in TREC, requires considerable human effort, as large collection sizes demand judgments deep into retrieved pools. In practice, it is common to perform shallow evaluations over small numbers of live engines (often pairwise, engine A vs. engine B) without system pooling. Although these evaluations are not intended to construct reusable test collections, their utility depends on conclusions generalizing to the query population as a whole. We leverage the bootstrap estimate of the reproducibility probability of hypothesis tests to determine the query sample sizes required to ensure such generalization, finding they are much larger than those required for static collections. We propose a semiautomatic evaluation framework to reduce this effort. We validate this framework against a manual evaluation of the top ten results of ten Web search engines across 896 queries in navigational and informational tasks. Augmenting manual judgments with pseudo-relevance judgments mined from Web taxonomies reduces both the chance of missing a correct pairwise conclusion and the chance of reaching an erroneous one by approximately 50%.
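As a rough illustration of the bootstrap idea mentioned above, the following minimal sketch estimates how often a pairwise test would reach significance again if the evaluation were repeated on a fresh query sample of a given size. It is not the procedure used in the paper: the exact two-sided sign test, the function names `sign_test_p` and `reproducibility_probability`, and the defaults for `alpha` and `n_boot` are all assumptions chosen to keep the example short and dependency-free.

```python
import math
import random


def sign_test_p(diffs):
    """Exact two-sided sign test p-value on paired differences (ties dropped).

    The sign test is an assumption made for this sketch, not the test
    named in the source.
    """
    pos = sum(d > 0 for d in diffs)
    neg = sum(d < 0 for d in diffs)
    n = pos + neg
    if n == 0:
        return 1.0  # no informative pairs, so no evidence of a difference
    k = min(pos, neg)
    # One-sided binomial tail P(X <= k) under p = 0.5, doubled and capped at 1.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def reproducibility_probability(diffs, sample_size, alpha=0.05,
                                n_boot=2000, seed=0):
    """Bootstrap estimate of the chance that the test rejects at level alpha.

    diffs: per-query score differences (engine A minus engine B).
    sample_size: hypothetical number of queries in a repeated evaluation.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_boot):
        # Resample queries with replacement to mimic a new query sample.
        resample = rng.choices(diffs, k=sample_size)
        if sign_test_p(resample) < alpha:
            rejections += 1
    return rejections / n_boot
```

Under these assumptions, calling `reproducibility_probability(diffs, sample_size=m)` for increasing `m` and picking the smallest `m` whose estimate exceeds a chosen threshold gives one way to reason about the query sample size a pairwise comparison would need.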