ACM Home Page
Please provide us with feedback. Feedback
Optimal crawling strategies for web search engines
Full text PdfPdf (345 KB)
Source International World Wide Web Conference archive
Proceedings of the 11th international conference on World Wide Web table of contents
Honolulu, Hawaii, USA
SESSION: Crawling table of contents
Pages: 136 - 147  
Year of Publication: 2002
ISBN:1-58113-449-5
Authors
J. L. Wolf  IBM Watson Research Center, Yorktown Heights, NY
M. S. Squillante  IBM Watson Research Center, Yorktown Heights, NY
P. S. Yu  IBM Watson Research Center, Yorktown Heights, NY
J. Sethuraman  Columbia University, New York, NY
L. Ozsen  Northwestern University, Evanston, IL
Sponsors
ACM: Association for Computing Machinery
: WWW'02
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 8,   Downloads (12 Months): 91,   Citation Count: 23
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/511446.511465
What is a DOI?

ABSTRACT

Web Search Engines employ multiple so-called crawlers to maintain local copies of web pages. But these web pages are frequently updated by their owners, and therefore the crawlers must regularly revisit the web pages to maintain the freshness of their local copies. In this paper, we propose a two-part scheme to optimize this crawling process. One goal might be the minimization of the average level of staleness over all web pages, and the scheme we propose can solve this problem. Alternatively, the same basic scheme could be used to minimize a possibly more important search engine embarrassment level metric: The frequency with which a client makes a search engine query and then clicks on a returned url only to find that the result is incorrect. The first part our scheme determines the (nearly) optimal crawling frequencies, as well as the theoretically optimal times to crawl each web page. It does so within an extremely general stochastic framework, one which supports a wide range of complex update patterns found in practice. It uses techniques from probability theory and the theory of resource allocation problems which are highly computationally efficient -- crucial for practicality because the size of the problem in the web environment is immense. The second part employs these crawling frequencies and ideal crawl times as input, and creates an optimal achievable schedule for the crawlers. Our solution, based on network flow theory, is exact as well as highly efficient. An analysis of the update patterns from a highly accessed and highly dynamic web site is used to gain some insights into the properties of page updates in practice. Then, based on this analysis, we perform a set of detailed simulation experiments to demonstrate the quality and speed of our approach.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
R. Ahuja, T. Magnanti and J. Orlin, Network Flows, Prentice Hall, 1993.
 
2
A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke and S. Raghavan, "Searching the Web", ACM Transactions on Internet Technology, 1(1), 2001.
 
3
 
4
A. Broder, Personal communication.
 
5
J. Challenger, P. Dantzig, A. Iyengar, M. S. Squillante, and L. Zhang. Efficiently serving dynamic data at highly accessed web sites. Preprint, May 2001.
6
 
7
E. Coffman, Z. Liu and R. Weber, "Optimal Robot Scheduling for Web Search Engines", INRIA Research Report, 1997.
 
8
F. Douglas, A. Feldmann and B. Krishnamurthy, "Rate of Change and other Metrics: A Live Study of the World Wide Web", USENIX Symposium on Internetworking Technologies and Systems, 1999.
 
9
B. Fox, "Discrete Optimization via Marginal Analysis", Management Science, 13:210--216, 1966.
 
10
G. Frederickson and D. Johnson, "The Complexity of Selection and Ranking in X+Y and Matrices with Sorted Columns", Journal of Computer and System Science, 24:197--208, 1982.
11
 
12
 
13
International Business Machines Corporation, Optimization Subroutine Library Guide and Reference, IBM, 1995.
 
14
N. Katoh and T. Ibaraki, "Resource Allocation Problems", in Handbook of Combinatorial Optimization, D-Z. Du and P. Pardalos, editors, Kluwer Academic Press, 2000.
 
15
 
16
 
17
S. Lawrence and C. Giles, "Accessibility of Information on the Web", Nature, 400:107--109, 1999.
 
18
19
 
20
M. Pinedo, Scheduling: Theory, Algorithms and Systems, Prentice-Hall, 1995.
21
 
22
W. Press, B. Flannery, S. Teukolsky and W. Vetterling, Numerical Recipes, Cambridge University Press, 1986.
 
23
S. M. Ross. Stochastic Processes. John Wiley and Sons, Second Edition, 1997.
 
24
K. Sigman. Stationary Marked Point Processes: An Intuitive Approach. Chapman and Hall, 1995.
 
25
M. Squillante, D. Yao and L. Zhang, "Web Traffic Modeling and Web Server Performance Analysis", IEEE Conference on Decision and Control, 1999.
 
26
J. Talim, Z. Liu, P. Nain and E. Coffman, "Optimizing the Number of Robots for Web Search Engines", Telecommunication Systems Journal, 17(1-2):234-243, 2001.
 
27
 
28
R. W. Wolff. Stochastic Modeling and the Theory of Queues. Prentice Hall, 1989.
 
29
G. Zipf, Human Behavior and the Principle of Least Effort, Addison-Wesley, 1949.

CITED BY  23

Collaborative Colleagues:
J. L. Wolf: colleagues
M. S. Squillante: colleagues
P. S. Yu: colleagues
J. Sethuraman: colleagues
L. Ozsen: colleagues