Capturing collection size for distributed non-cooperative retrieval
Source
Annual ACM Conference on Research and Development in Information Retrieval
Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Seattle, Washington, USA
SESSION: Distributed IR
Pages: 316 - 323
Year of Publication: 2006
ISBN: 1-59593-369-7
ABSTRACT
Modern distributed information retrieval techniques require accurate knowledge of collection size. In non-cooperative environments, where detailed collection statistics are not available, the size of the underlying collections must be estimated. While several approaches for the estimation of collection size have been proposed, their accuracy has not been thoroughly evaluated. An empirical analysis of past estimation approaches across a variety of collections demonstrates that their prediction accuracy is low. Motivated by ecological techniques for the estimation of animal populations, we propose two new approaches for the estimation of collection size. We show that our approaches are significantly more accurate than previous methods, and are more efficient in their use of the resources required to perform the estimation.
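As an illustration of the kind of ecological technique the abstract alludes to, the sketch below applies the classic two-sample capture-recapture (Lincoln-Petersen) estimator to document identifiers. It is not taken from the paper and may differ from both of the approaches the authors propose. The function name `lincoln_petersen`, the sample lists, and the assumption that each sample is a uniform random draw of document IDs from the collection are all hypothetical.

```python
def lincoln_petersen(first_sample, second_sample):
    """Estimate population size from two independent samples.

    Assumption (not from the paper): each sample is a list of document
    identifiers drawn uniformly at random from the same collection.
    """
    first = set(first_sample)      # documents "captured and marked"
    second = set(second_sample)    # documents in the second capture
    recaptured = len(first & second)  # marked documents seen again
    if recaptured == 0:
        raise ValueError("no overlap between samples; estimate is undefined")
    # N is approximately (n1 * n2) / m
    return len(first) * len(second) / recaptured


# Hypothetical document IDs returned by two rounds of sampling queries.
sample_a = ["d1", "d2", "d3", "d4", "d5", "d6"]
sample_b = ["d4", "d5", "d6", "d7", "d8", "d9", "d10", "d11"]
estimate = lincoln_petersen(sample_a, sample_b)
print(estimate)
```

In a non-cooperative setting, documents can only be obtained through a search interface, so the uniform-sampling assumption above is unlikely to hold exactly; the sketch only conveys the basic idea of inferring size from sample overlap.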