Repeatable evaluation of search services in dynamic environments
Source
ACM Transactions on Information Systems (TOIS) archive
Volume 26, Issue 1 (November 2007)
Article No. 1  
Year of Publication: 2007
ISSN:1046-8188
Authors
Eric C. Jensen  Summize, Inc.
Steven M. Beitzel  Illinois Institute of Technology
Abdur Chowdhury  Summize, Inc.
Ophir Frieder  Illinois Institute of Technology and Georgetown University
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1292591.1292592

ABSTRACT

In dynamic environments, such as the World Wide Web, a changing document collection, query population, and set of search services demand frequent repetition of search effectiveness (relevance) evaluations. Reconstructing static test collections, such as those in TREC, requires considerable human effort, as large collection sizes demand judgments deep into retrieved pools. In practice, it is common to perform shallow evaluations over small numbers of live engines (often pairwise, engine A vs. engine B) without system pooling. Although these evaluations are not intended to construct reusable test collections, their utility depends on conclusions generalizing to the query population as a whole. We use the bootstrap estimate of the reproducibility probability of hypothesis tests to determine the query sample sizes required to ensure this generalization, and find that they are much larger than those required for static collections. We propose a semiautomatic evaluation framework to reduce this effort. We validate this framework against a manual evaluation of the top ten results of ten Web search engines across 896 queries in navigational and informational tasks. Augmenting manual judgments with pseudo-relevance judgments mined from Web taxonomies reduces both the chance of missing a correct pairwise conclusion and the chance of reaching an erroneous one by approximately 50%.
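
The idea of a bootstrap estimate of reproducibility probability can be illustrated with a minimal Python sketch. It is not the authors' procedure, which the abstract does not spell out. It assumes a list of per-query score differences between two engines (engine A minus engine B) and swaps in an exact two-sided sign test as the hypothesis test. The function names, the significance level of 0.05, and the number of resamples are hypothetical choices made for illustration only.

```python
import math
import random


def sign_test_p(diffs):
    """Two-sided exact sign test p-value for paired differences.

    Zero differences (ties) are dropped. This test is an assumption of
    the sketch, not necessarily the test used in the article.
    """
    pos = sum(1 for d in diffs if d > 0)
    neg = sum(1 for d in diffs if d < 0)
    n = pos + neg
    if n == 0:
        return 1.0
    k = min(pos, neg)
    # Probability of k or fewer minority signs under a fair coin.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def reproducibility_probability(diffs, sample_size, alpha=0.05,
                                n_boot=2000, seed=0):
    """Fraction of bootstrap resamples in which the test rejects H0.

    Each resample draws `sample_size` query differences with replacement
    from the observed differences, mimicking a repeated evaluation.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_boot):
        resample = rng.choices(diffs, k=sample_size)
        if sign_test_p(resample) < alpha:
            rejections += 1
    return rejections / n_boot


if __name__ == "__main__":
    # Synthetic, made-up differences: engine A is slightly better on average.
    data_rng = random.Random(42)
    observed = [data_rng.gauss(0.05, 0.3) for _ in range(200)]
    for m in (50, 100, 200, 400):
        print(m, reproducibility_probability(observed, m))
```

Evaluating reproducibility_probability at increasing values of sample_size and taking the smallest one that reaches a chosen target (for instance 0.8, an assumed threshold) gives a rough estimate of how many queries a pairwise comparison would need.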


