Estimating corpus size via queries
Source
Conference on Information and Knowledge Management
Proceedings of the 15th ACM international conference on Information and knowledge management
Arlington, Virginia, USA
SESSION: Ranking and estimation
Pages: 594 - 603
Year of Publication: 2006
ISBN: 1-59593-433-2
Authors
Andrei Broder, Yahoo! Research, Sunnyvale, CA
Marcus Fontoura, Yahoo! Research, Sunnyvale, CA
Vanja Josifovski, Yahoo! Research, Sunnyvale, CA
Ravi Kumar, Yahoo! Research, Sunnyvale, CA
Rajeev Motwani, Stanford University, Stanford, CA
Shubha Nabar, Stanford University, Stanford, CA
Rina Panigrahy, Stanford University, Stanford, CA
Andrew Tomkins, Yahoo! Research, Sunnyvale, CA
Ying Xu, Stanford University, Stanford, CA
ABSTRACT
We consider the problem of estimating the size of a collection of documents using only a standard query interface. Our main idea is to construct an unbiased and low-variance estimator that can closely approximate the size of any set of documents defined by certain conditions, including that each document in the set must match at least one query from a uniformly sampleable query pool of known size, fixed in advance.

Using this basic estimator, we propose two approaches to estimating corpus size. The first approach requires a uniform random sample of documents from the corpus. The second approach avoids this notoriously difficult sample generation problem, and instead uses two fairly uncorrelated sets of terms as query pools; the accuracy of the second approach depends on the degree of correlation between the two sets of terms.

Experiments on a large TREC collection and on three major search engines demonstrate the effectiveness of our algorithms.
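The abstract does not spell out the estimator, so the following is only a minimal sketch of one way such a query-pool estimator could work, not the authors' implementation. It assumes that each document returned for a sampled query is weighted by the reciprocal of the number of pool queries it matches, so that every covered document contributes a total weight of one across the pool. The function names, the `search` and `terms_of` callables, and the toy corpus are all hypothetical. The two-pool combination uses a standard capture-recapture ratio, which is an assumption here and holds only when the two pools are uncorrelated.

```python
import random
from typing import Callable, Dict, Iterable, List, Sequence, Set


def estimate_covered_size(
    pool: Sequence[str],
    sampled_queries: Iterable[str],
    search: Callable[[str], Iterable[int]],
    terms_of: Callable[[int], Set[str]],
) -> float:
    """Estimate how many documents match at least one query in `pool`.

    Assumed weighting: a document that matches k pool queries gets
    weight 1/k each time it is returned, so its weights sum to 1
    over the whole pool.
    """
    pool_set = set(pool)
    total_weight = 0.0
    num_samples = 0
    for query in sampled_queries:
        num_samples += 1
        for doc_id in search(query):
            # Number of pool queries this document matches (at least 1,
            # because it matched `query`, which is drawn from the pool).
            matching = len(terms_of(doc_id) & pool_set)
            total_weight += 1.0 / matching
    # Scale the average per-query weight by the known pool size.
    return len(pool) * total_weight / num_samples


def combine_two_pools(size_a: float, size_b: float, size_ab: float) -> float:
    """Capture-recapture style corpus estimate (assumes independent pools)."""
    return size_a * size_b / size_ab


# Hypothetical toy corpus: document id -> set of terms it contains.
CORPUS: Dict[int, Set[str]] = {1: {"a", "b"}, 2: {"b"}, 3: {"c"}, 4: {"z"}}
POOL: List[str] = ["a", "b", "c"]


def toy_search(query: str) -> List[int]:
    return [doc_id for doc_id, terms in CORPUS.items() if query in terms]


def toy_terms(doc_id: int) -> Set[str]:
    return CORPUS[doc_id]


# Using every pool query once recovers the covered set size exactly.
exact = estimate_covered_size(POOL, POOL, toy_search, toy_terms)

# A uniform random sample of pool queries gives a noisy estimate instead.
rng = random.Random(0)
sample = [rng.choice(POOL) for _ in range(200)]
approx = estimate_covered_size(POOL, sample, toy_search, toy_terms)
```

In this toy setup, document 4 matches no pool query, so it lies outside the set being estimated; that is why a single pool measures only the covered subset and a second step is needed to reach the full corpus size.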