ABSTRACT
In this article, we study how to keep local copies of remote data sources "fresh" when the source data is updated autonomously and independently. In particular, we study the problem faced by Web crawlers that maintain local copies of remote Web pages for Web search engines. In this context, remote data sources (Websites) do not notify the copies (Web crawlers) of new changes, so we need to poll the sources periodically to keep the copies up to date. Since polling the sources takes significant time and resources, it is very difficult to keep the copies completely up to date.

This article proposes various refresh policies and studies their effectiveness. We first formalize the notion of "freshness" of copied data by defining two freshness metrics, and we propose a Poisson process as the change model of data sources. Based on this framework, we examine the effectiveness of the proposed refresh policies analytically and experimentally. We show that a Poisson process is a good model to describe the changes of Web pages, and we also show that our proposed refresh policies significantly improve the "freshness" of data. In certain cases, we obtained orders-of-magnitude improvements over existing policies.
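To make the framework concrete, the following is a minimal sketch, not the article's own formulation. It assumes, hypothetically, that freshness means the fraction of time a local copy matches its source, that a page changes as a Poisson process with rate `change_rate`, and that the crawler re-polls it at a fixed `refresh_interval`. The abstract does not state the metric definitions, so these function names, parameters, and definitions are illustrative assumptions only. Under them, the time-averaged freshness has the closed form $(1 - e^{-\lambda I}) / (\lambda I)$, which the simulation checks.

```python
import math
import random


def expected_freshness(change_rate: float, refresh_interval: float) -> float:
    """Closed-form time-averaged freshness under the assumptions above.

    change_rate      -- assumed Poisson change rate (lambda), changes per unit time
    refresh_interval -- assumed fixed time between polls (I)
    """
    x = change_rate * refresh_interval
    if x == 0:
        return 1.0  # a page that never changes is always fresh
    return (1.0 - math.exp(-x)) / x


def simulated_freshness(change_rate: float, refresh_interval: float,
                        n_intervals: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity (requires change_rate > 0)."""
    rng = random.Random(seed)
    fresh_time = 0.0
    for _ in range(n_intervals):
        # After each poll the copy is fresh until the first change, whose
        # waiting time is exponential for a Poisson process.
        first_change = rng.expovariate(change_rate)
        fresh_time += min(first_change, refresh_interval)
    return fresh_time / (n_intervals * refresh_interval)


if __name__ == "__main__":
    for rate in (0.1, 1.0, 10.0):
        print(rate, expected_freshness(rate, 1.0), simulated_freshness(rate, 1.0))
```

Under these assumptions, freshness falls as a page changes faster relative to the polling interval, which is why the choice of refresh policy matters when polling capacity is limited.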