Capturing collection size for distributed non-cooperative retrieval
Source: Annual ACM Conference on Research and Development in Information Retrieval
Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Seattle, Washington, USA
Session: Distributed IR
Pages: 316-323
Year of Publication: 2006
ISBN: 1-59593-369-7
Authors
Milad Shokouhi  RMIT University, Melbourne, Australia
Justin Zobel  RMIT University, Melbourne, Australia
Falk Scholer  RMIT University, Melbourne, Australia
S. M. M. Tahaghoghi  RMIT University, Melbourne, Australia
Sponsors
SIGIR: ACM Special Interest Group on Information Retrieval
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1148170.1148227

ABSTRACT

Modern distributed information retrieval techniques require accurate knowledge of collection size. In non-cooperative environments, where detailed collection statistics are not available, the size of the underlying collections must be estimated. While several approaches for the estimation of collection size have been proposed, their accuracy has not been thoroughly evaluated. An empirical analysis of past estimation approaches across a variety of collections demonstrates that their prediction accuracy is low. Motivated by ecological techniques for the estimation of animal populations, we propose two new approaches for the estimation of collection size. We show that our approaches are significantly more accurate than previous methods, and are more efficient in their use of the resources required to perform the estimation.
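
The abstract does not spell out the two proposed estimators, so the following is only an illustration of the general ecological idea it alludes to, not the authors' method. It is a minimal sketch of the classic Lincoln-Petersen capture-recapture estimate, assuming two independent random samples of document identifiers have been drawn from a collection. The function name and the sample data are hypothetical.

```python
def lincoln_petersen(first_sample, second_sample):
    """Estimate population size from two samples of identifiers.

    Illustrative capture-recapture estimate: N ~= n1 * n2 / m, where
    n1 and n2 are the sample sizes and m is the number of items that
    appear in both samples. Assumes both samples are drawn uniformly
    at random, which query-based sampling may not satisfy in practice.
    """
    first, second = set(first_sample), set(second_sample)
    recaptured = len(first & second)  # items seen in both samples
    if recaptured == 0:
        raise ValueError("samples share no items; estimate is undefined")
    return len(first) * len(second) / recaptured


# Hypothetical document ids: 100 in each sample, 20 in common.
estimate = lincoln_petersen(range(0, 100), range(80, 180))
print(estimate)
```

The smaller the overlap between the samples relative to their sizes, the larger the estimated collection; with no overlap at all the estimate is undefined, which is why the sketch raises an error in that case.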

