|
ABSTRACT
We revisit a problem introduced by Bharat and Broder almost a decade ago: how to sample random pages from a search engine's index using only the search engine's public interface? Such a primitive is particularly useful in creating objective benchmarks for search engines.The technique of Bharat and Broder suffers from two well recorded biases: it favors long documents and highly ranked documents. In this paper we introduce two novel sampling techniques: a lexicon-based technique and a random walk technique. Our methods produce biased sample documents, but each sample is accompanied by a corresponding "weight", which represents the probability of this document to be selected in the sample. The samples, in conjunction with the weights, are then used to simulate near-uniform samples. To this end, we resort to three well known Monte Carlo simulation methods: rejection sampling, importance sampling and the Metropolis-Hastings algorithm.We analyze our methods rigorously and prove that under plausible assumptions, our techniques are guaranteed to produce near-uniform samples from the search engine's index. Experiments on a corpus of 2.4 million documents substantiate our analytical findings and show that our algorithms do not have significant bias towards long or highly ranked documents. We use our algorithms to collect fresh data about the relative sizes of Google, MSN Search, and Yahoo!.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
|
| |
2
|
|
| |
3
|
J. Battelle. John Battelle's searchblog. battellemedia.com/archives/001889.php, 2005.
|
| |
4
|
|
| |
5
|
|
| |
6
|
|
| |
7
|
M. Cheney and M. Perry. A comparison of the size of the Yahoo! and Google indices. vburton.ncsa.uiuc.edu/indexsize.html, 2005.
|
| |
8
|
|
| |
9
|
dmoz. The open directory project. dmoz.org.
|
| |
10
|
A. Dobra and S. Fienberg. How large is the World Wide Web? Web Dynamics, pages 23--44, 2004.
|
| |
11
|
|
 |
12
|
|
| |
13
|
W. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97--109, 1970.
|
| |
14
|
|
| |
15
|
|
| |
16
|
|
| |
17
|
S. Lawrence and C. Giles. Searching the World Wide Web. Science, 5360(280):98, 1998.
|
| |
18
|
S. Lawrence and C. Giles. Accessibility of information on the Web. Nature, 400:107--109, 1999.
|
| |
19
|
J. S. Liu. Monte Carlo Strategies in Scientific Computing. Springer, 2001.
|
| |
20
|
T. Mayer. Our blog is growing up and so has our index. www.ysearchblog.com/archives/000172.html.
|
| |
21
|
N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller. Equations of state calculations by fast computing machines. J. of Chemical Physics, 21:1087--1091, 1953.
|
| |
22
|
G. Price. More on the total database size battle and Googlewhacking with Yahoo. blog.searchenginewatch.com/blog/050811-231448, 2005.
|
| |
23
|
P. Rusmevichientong, D. Pennock, S. Lawrence, and C. Lee Giles. Methods for sampling pages uniformly from the World Wide Web. In AAAI Fall Symp., 2001.
|
| |
24
|
J.von Neumann. Various techniques used in connection with random digits. In John von Neumann, Collected Works, volume V. Oxford, 1963.
|
CITED BY 24
|
|
Arjun Dasgupta , Nan Zhang , Gautam Das , Surajit Chaudhuri, Privacy preservation of aggregates in hidden databases: why and how?, Proceedings of the 35th SIGMOD international conference on Management of data, June 29-July 02, 2009, Providence, Rhode Island, USA
|
|
|
|
|
|
Andrei Broder , Marcus Fontura , Vanja Josifovski , Ravi Kumar , Rajeev Motwani , Shubha Nabar , Rina Panigrahy , Andrew Tomkins , Ying Xu, Estimating corpus size via queries, Proceedings of the 15th ACM international conference on Information and knowledge management, November 06-11, 2006, Arlington, Virginia, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Gautam Das , Nick Koudas , Manos Papagelis , Sushruth Puttaswamy, Efficient sampling of information in social networks, Proceeding of the 2008 ACM workshop on Search in social media, October 30-30, 2008, Napa Valley, California, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Edward Bortnikov , Maxim Gurevich , Idit Keidar , Gabriel Kliot , Alexander Shraer, Brahms: Byzantine resilient random membership sampling, Computer Networks: The International Journal of Computer and Telecommunications Networking, v.53 n.13, p.2340-2359, August, 2009
|
|
|
|
|
|
Yutaka Matsuo , Hironori Tomobe , Takuichi Nishimura, Robust estimation of Google counts for social network extraction, Proceedings of the 22nd national conference on Artificial intelligence, p.1395-1401, July 22-26, 2007, Vancouver, British Columbia, Canada
|
|
|
|
|