|
ABSTRACT
Web Search Engines employ multiple so-called crawlers to maintain local copies of web pages. But these web pages are frequently updated by their owners, and therefore the crawlers must regularly revisit the web pages to maintain the freshness of their local copies. In this paper, we propose a two-part scheme to optimize this crawling process. One goal might be the minimization of the average level of staleness over all web pages, and the scheme we propose can solve this problem. Alternatively, the same basic scheme could be used to minimize a possibly more important search engine embarrassment level metric: The frequency with which a client makes a search engine query and then clicks on a returned url only to find that the result is incorrect. The first part our scheme determines the (nearly) optimal crawling frequencies, as well as the theoretically optimal times to crawl each web page. It does so within an extremely general stochastic framework, one which supports a wide range of complex update patterns found in practice. It uses techniques from probability theory and the theory of resource allocation problems which are highly computationally efficient -- crucial for practicality because the size of the problem in the web environment is immense. The second part employs these crawling frequencies and ideal crawl times as input, and creates an optimal achievable schedule for the crawlers. Our solution, based on network flow theory, is exact as well as highly efficient. An analysis of the update patterns from a highly accessed and highly dynamic web site is used to gain some insights into the properties of page updates in practice. Then, based on this analysis, we perform a set of detailed simulation experiments to demonstrate the quality and speed of our approach.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
R. Ahuja, T. Magnanti and J. Orlin, Network Flows, Prentice Hall, 1993.
|
| |
2
|
A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke and S. Raghavan, "Searching the Web", ACM Transactions on Internet Technology, 1(1), 2001.
|
| |
3
|
|
| |
4
|
A. Broder, Personal communication.
|
| |
5
|
J. Challenger, P. Dantzig, A. Iyengar, M. S. Squillante, and L. Zhang. Efficiently serving dynamic data at highly accessed web sites. Preprint, May 2001.
|
 |
6
|
|
| |
7
|
E. Coffman, Z. Liu and R. Weber, "Optimal Robot Scheduling for Web Search Engines", INRIA Research Report, 1997.
|
| |
8
|
F. Douglas, A. Feldmann and B. Krishnamurthy, "Rate of Change and other Metrics: A Live Study of the World Wide Web", USENIX Symposium on Internetworking Technologies and Systems, 1999.
|
| |
9
|
B. Fox, "Discrete Optimization via Marginal Analysis", Management Science, 13:210--216, 1966.
|
| |
10
|
G. Frederickson and D. Johnson, "The Complexity of Selection and Ranking in X+Y and Matrices with Sorted Columns", Journal of Computer and System Science, 24:197--208, 1982.
|
 |
11
|
|
| |
12
|
|
| |
13
|
International Business Machines Corporation, Optimization Subroutine Library Guide and Reference, IBM, 1995.
|
| |
14
|
N. Katoh and T. Ibaraki, "Resource Allocation Problems", in Handbook of Combinatorial Optimization, D-Z. Du and P. Pardalos, editors, Kluwer Academic Press, 2000.
|
| |
15
|
|
| |
16
|
|
| |
17
|
S. Lawrence and C. Giles, "Accessibility of Information on the Web", Nature, 400:107--109, 1999.
|
| |
18
|
|
 |
19
|
Venkata N. Padmanabhan , Lili Qiu, The content and access dynamics of a busy Web site: findings and implications, Proceedings of the conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, p.111-123, August 28-September 01, 2000, Stockholm, Sweden
|
| |
20
|
M. Pinedo, Scheduling: Theory, Algorithms and Systems, Prentice-Hall, 1995.
|
 |
21
|
James Pitkow , Peter Pirolli, Life, death, and lawfulness on the electronic frontier, Proceedings of the SIGCHI conference on Human factors in computing systems, p.383-390, March 22-27, 1997, Atlanta, Georgia, United States
[doi> 10.1145/258549.258805]
|
| |
22
|
W. Press, B. Flannery, S. Teukolsky and W. Vetterling, Numerical Recipes, Cambridge University Press, 1986.
|
| |
23
|
S. M. Ross. Stochastic Processes. John Wiley and Sons, Second Edition, 1997.
|
| |
24
|
K. Sigman. Stationary Marked Point Processes: An Intuitive Approach. Chapman and Hall, 1995.
|
| |
25
|
M. Squillante, D. Yao and L. Zhang, "Web Traffic Modeling and Web Server Performance Analysis", IEEE Conference on Decision and Control, 1999.
|
| |
26
|
J. Talim, Z. Liu, P. Nain and E. Coffman, "Optimizing the Number of Robots for Web Search Engines", Telecommunication Systems Journal, 17(1-2):234-243, 2001.
|
| |
27
|
|
| |
28
|
R. W. Wolff. Stochastic Modeling and the Theory of Queues. Prentice Hall, 1989.
|
| |
29
|
G. Zipf, Human Behavior and the Principle of Least Effort, Addison-Wesley, 1949.
|
CITED BY 23
|
|
|
|
|
Ziv Bar-Yossef , Andrei Z. Broder , Ravi Kumar , Andrew Tomkins, Sic transit gloria telae: towards an understanding of the web's decay, Proceedings of the 13th international conference on World Wide Web, May 17-20, 2004, New York, NY, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Anirban Dasgupta , Arpita Ghosh , Ravi Kumar , Christopher Olston , Sandeep Pandey , Andrew Tomkins, The discoverability of the web, Proceedings of the 16th international conference on World Wide Web, May 08-12, 2007, Banff, Alberta, Canada
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Jiang-Ming Yang , Rui Cai , Chunsong Wang , Hua Huang , Lei Zhang , Wei-Ying Ma, Incorporating site-level knowledge for incremental crawling of web forums: a list-wise strategy, Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, June 28-July 01, 2009, Paris, France
|
|
|
|
|