ACM Home Page
Please provide us with feedback. Feedback
Random sampling from a search engine's index
Full text PdfPdf (848 KB)
Source
Journal of the ACM (JACM) archive
Volume 55 ,  Issue 5  (October 2008) table of contents
Article No. 24  
Year of Publication: 2008
ISSN:0004-5411
Authors
Ziv Bar-Yossef  Technion and Google Haifa, Haifa, Israel
Maxim Gurevich  Technion, Haifa, Israel
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 36,   Downloads (12 Months): 395,   Citation Count: 0
Additional Information:

abstract   references   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1411509.1411514
What is a DOI?

ABSTRACT

We revisit a problem introduced by Bharat and Broder almost a decade ago: How to sample random pages from the corpus of documents indexed by a search engine, using only the search engine's public interface? Such a primitive is particularly useful in creating objective benchmarks for search engines.

The technique of Bharat and Broder suffers from a well-recorded bias: it favors long documents. In this article we introduce two novel sampling algorithms: a lexicon-based algorithm and a random walk algorithm. Our algorithms produce biased samples, but each sample is accompanied by a weight, which represents its bias. The samples, in conjunction with the weights, are then used to simulate near-uniform samples. To this end, we resort to four well-known Monte Carlo simulation methods: rejection sampling, importance sampling, the Metropolis--Hastings algorithm, and the Maximum Degree method.

The limited access to search engines force our algorithms to use bias weights that are only “approximate”. We characterize analytically the effect of approximate bias weights on Monte Carlo methods and conclude that our algorithms are guaranteed to produce near-uniform samples from the search engine's corpus. Our study of approximate Monte Carlo methods could be of independent interest.

Experiments on a corpus of 2.4 million documents substantiate our analytical findings and show that our algorithms do not have significant bias towards long documents. We use our algorithms to collect comparative statistics about the corpora of the Google, MSN Search, and Yahoo! search engines.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
Aldous, D. 1987. On the Markov chain simulation method for uniform combinatorial distributed and simulated annealing. Prob. Eng. Inf. Sci. 1, 33--46.
2
 
3
4
 
5
Battelle, J. 2005. John Battelle's searchblog. http://battellemedia.com/archives/001889.php.
 
6
 
7
Bondar, S. 2006. Search engine indexing limits: Where do the bots stop? (Available at http://www.sitepoint.com/article/indexing-limits-where-bots-stop.)
 
8
 
9
10
 
11
 
12
Cheney, M., and Perry, M. 2005. A comparison of the size of the Yahoo! and Google indices. Available at http://vburton.ncsa.uiuc.edu/indexsize.html.
 
13
Davison, B. D. 2004. The potential of the metasearch engine. In Proceedings of the Annual Meeting of the American Society for Information Science and Technology (ASIST). Vol. 41. 393--402.
 
14
 
15
Dobra, A., and Fienberg, S. E. 2004. How large is the World Wide Web? Web Dyn. 23--44.
 
16
 
17
 
18
19
 
20
Hastings, W. K. 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 1, 97--109.
 
21
 
22
 
23
 
24
Hesterberg, T. C. 1988. Advances in importance sampling. Ph.D. dissertation, Stanford University.
 
25
 
26
Lawrence, S., and Giles, C. L. 1998. Searching the World Wide Web. Science 5360, 280, 98.
 
27
Lawrence, S., and Giles, C. L. 1999. Accessibility of information on the Web. Nature 400, 107--109.
 
28
Liu, J. S. 1996. Metropolized independent sampling with comparisons to rejection sampling and importance sampling. Stat. Comput. 6, 113--119.
 
29
Liu, J. S. 2001. Monte Carlo Strategies in Scientific Computing. Springer-Verlag, New York.
 
30
Marshal, A. 1956. The use of multi-stage sampling schemes in Monte Carlo computations. In Proceedings of the Symposium on Monte Carlo Methods, M. Meyer, Ed. Vol. 21. Wiley, New York, 123--140.
 
31
Mayer, T. 2005. Our blog is growing up - and so has our index. Available at http://www.ysearchblog.com/archives/000172.html.
32
 
33
Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., and Teller, E. 1953. Equations of state calculations by fast computing machines. J. Chem. Phys. 21, 1087--1091.
 
34
Price, G. 2005. More on the total database size battle and Googlewhacking with Yahoo. Available at http://blog.searchenginewatch.com/blog/050811-231448.
 
35
Rusmevichientong, P., Pennock, D., Lawrence, S., and Giles, C. L. 2001. Methods for sampling pages uniformly from the World Wide Web. In Proceedings of AAAI Fall Symposium on Using Uncertainty within Computation.
 
36
Siegmund, D. 1985. Sequential Analysis - Tests and Confidence Intervals. Springer-Verlag, New York.
 
37
 
38
Stuart, A. and Ord, J.-K. 1994. Kendall's Advanced Theory of Statistics. Vol. 1, Distribution Theory. Haddon Arnold, London, UK.
 
39
Sullivan, D. 2005. Search engine sizes. http://searchenginewatch.com/showPage.html?page=2156481.
 
40
Véronis, J. 2005. Yahoo: Missing pages? (2). http://aixtal.blogspot.com/2005/08/yahoo-missing-pages-2.html.
 
41
von Neumann, J. 1963. Various techniques used in connection with random digits. In John von Neumann, Collected Works. Vol. V. Oxford.

Collaborative Colleagues:
Ziv Bar-Yossef: colleagues
Maxim Gurevich: colleagues