Effective page refresh policies for Web crawlers
Source: ACM Transactions on Database Systems (TODS)
Volume 28, Issue 4 (December 2003)
Pages: 390-426
Year of Publication: 2003
ISSN: 0362-5915
Authors
Junghoo Cho, University of California, Los Angeles, California
Hector Garcia-Molina, Stanford University, Stanford, California
Publisher
ACM, New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 30, Downloads (12 Months): 254, Citation Count: 19
DOI: http://doi.acm.org/10.1145/958942.958945

ABSTRACT

In this article, we study how to keep local copies of remote data sources "fresh" when the source data is updated autonomously and independently. In particular, we study the problem of Web crawlers that maintain local copies of remote Web pages for Web search engines. In this context, remote data sources (Web sites) do not notify the copies (Web crawlers) of new changes, so we need to poll the sources periodically to keep the copies up to date. Since polling the sources takes significant time and resources, it is very difficult to keep the copies completely up to date.

This article proposes various refresh policies and studies their effectiveness. We first formalize the notion of "freshness" of copied data by defining two freshness metrics, and we propose a Poisson process as the change model of data sources. Based on this framework, we examine the effectiveness of the proposed refresh policies analytically and experimentally. We show that a Poisson process is a good model for describing the changes of Web pages, and we also show that our proposed refresh policies improve the "freshness" of data very significantly. In certain cases, we obtained orders-of-magnitude improvement over existing policies.
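
To make the interplay between the Poisson change model and periodic polling concrete, the following is a minimal sketch and not the article's own formulation. It assumes that a page changes according to a Poisson process with rate `change_rate`, that the crawler re-fetches it every `refresh_interval` time units, and that "freshness" means the fraction of time the local copy matches the source. The function names and parameters are hypothetical. Under these assumptions the copy is still fresh t time units after a refresh with probability exp(-change_rate * t), and averaging over one refresh interval gives the closed form used below; a small simulation checks it.

```python
import math
import random


def expected_freshness(change_rate: float, refresh_interval: float) -> float:
    """Time-averaged probability that the copy is fresh (assumed definition).

    Average of exp(-change_rate * t) for t in [0, refresh_interval],
    which equals (1 - exp(-x)) / x with x = change_rate * refresh_interval.
    """
    x = change_rate * refresh_interval
    if x == 0:
        return 1.0  # a page that never changes is always fresh
    return -math.expm1(-x) / x  # expm1 keeps precision for small x


def simulated_freshness(change_rate: float, refresh_interval: float,
                        cycles: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity (requires change_rate > 0)."""
    rng = random.Random(seed)
    fresh_time = 0.0
    for _ in range(cycles):
        # Time until the first change after a refresh is exponential.
        first_change = rng.expovariate(change_rate)
        # The copy stays fresh until that change or until the next refresh.
        fresh_time += min(first_change, refresh_interval)
    return fresh_time / (cycles * refresh_interval)


if __name__ == "__main__":
    for interval in (0.1, 1.0, 10.0):
        print(interval,
              round(expected_freshness(1.0, interval), 4),
              round(simulated_freshness(1.0, interval), 4))
```

In this sketch, polling a page ten times as often as it changes keeps it fresh most of the time, while polling it a tenth as often leaves it stale most of the time, which illustrates why the allocation of a limited polling budget across pages matters.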


