Estimating corpus size via queries
Source Conference on Information and Knowledge Management archive
Proceedings of the 15th ACM international conference on Information and knowledge management table of contents
Arlington, Virginia, USA
SESSION: Ranking and estimation table of contents
Pages: 594 - 603  
Year of Publication: 2006
ISBN:1-59593-433-2
Authors
Andrei Broder  Yahoo! Research, Sunnyvale, CA
Marcus Fontoura  Yahoo! Research, Sunnyvale, CA
Vanja Josifovski  Yahoo! Research, Sunnyvale, CA
Ravi Kumar  Yahoo! Research, Sunnyvale, CA
Rajeev Motwani  Stanford University, Stanford, CA
Shubha Nabar  Stanford University, Stanford, CA
Rina Panigrahy  Stanford University, Stanford, CA
Andrew Tomkins  Yahoo! Research, Sunnyvale, CA
Ying Xu  Stanford University, Stanford, CA
Sponsors
ACM: Association for Computing Machinery
SIGIR: ACM Special Interest Group on Information Retrieval
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1183614.1183699

ABSTRACT

We consider the problem of estimating the size of a collection of documents using only a standard query interface. Our main idea is to construct an unbiased, low-variance estimator that closely approximates the size of any set of documents defined by certain conditions, including that each document in the set must match at least one query from a uniformly sampleable query pool of known size, fixed in advance. Using this basic estimator, we propose two approaches to estimating corpus size. The first approach requires a uniform random sample of documents from the corpus. The second approach avoids this notoriously difficult sample generation problem and instead uses two fairly uncorrelated sets of terms as query pools; its accuracy depends on the degree of correlation between the two sets of terms. Experiments on a large TREC collection and on three major search engines demonstrate the effectiveness of our algorithms.
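
The abstract does not spell out the estimator, so the following is only a minimal sketch of one construction consistent with its description, not the authors' implementation. It assumes each matching document is weighted by the inverse of the number of pool queries it matches (which makes the per-query sum unbiased for the covered set size divided by the pool size), and it assumes the two-pool approach combines estimates with a capture-recapture style ratio. The toy corpus DOCS and the names search, pool_matches, and estimate_covered are hypothetical and exist only for illustration.

```python
from typing import Callable, Sequence, Set

# Toy corpus (hypothetical): document id -> set of terms it contains.
DOCS = {0: {"a", "x"}, 1: {"a", "b"}, 2: {"x"}, 3: {"z"}}


def search(query: str) -> Set[int]:
    """Stand-in for a query interface: ids of documents containing the term."""
    return {doc_id for doc_id, terms in DOCS.items() if query in terms}


def pool_matches(doc_id: int, pool: Sequence[str]) -> int:
    """Number of queries in the pool that the document matches."""
    return sum(1 for q in pool if q in DOCS[doc_id])


def estimate_covered(
    sampled: Sequence[str],
    pool: Sequence[str],
    keep: Callable[[int], bool] = lambda doc_id: True,
) -> float:
    """Estimate how many documents match at least one pool query and pass `keep`.

    `sampled` must be queries drawn uniformly from `pool`. Each result is
    weighted by 1 / (number of pool queries it matches), so a document that
    many queries return is not over-counted (assumed weighting).
    """
    total = 0.0
    for q in sampled:
        for doc_id in search(q):
            if keep(doc_id):
                total += 1.0 / pool_matches(doc_id, pool)
    return len(pool) * total / len(sampled)


# Two query pools built from unrelated terms (assumption: roughly uncorrelated).
pool_a, pool_b = ["a", "b"], ["x", "y"]

# Sampling every query once makes the toy estimates exact; a real run would
# draw a random subset of each pool instead.
size_a = estimate_covered(pool_a, pool_a)
size_b = estimate_covered(pool_b, pool_b)
size_ab = estimate_covered(
    pool_a, pool_a, keep=lambda d: pool_matches(d, pool_b) > 0
)

# If membership in the two covered sets were independent, then
# |corpus| ~ |A| * |B| / |A and B| (capture-recapture style, assumed here).
corpus_estimate = size_a * size_b / size_ab
print(size_a, size_b, size_ab, corpus_estimate)
```

In a real setting the sampled queries would be a uniform random draw from the pool (for example with random.choices), and computing a document's pool-match count would require fetching the document and checking it against the pool. Correlation between the two pools biases the final ratio, which is the dependence on correlation that the abstract mentions.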


