ABSTRACT
Since Web sites are autonomous and independently updated, applications that keep replicas of Web data, such as Web warehouses and search engines, must periodically poll the sites and check for changes. Since this is a resource-intensive task, in order to keep the copies up-to-date, it is important to devise efficient update schedules that adapt to the change rate of the pages and avoid visiting pages not modified since the last visit. In this paper, we propose a new approach that learns to predict the change behavior of Web pages based both on the static features and change history of pages, and refreshes the copies accordingly. Experiments using real-world data show that our technique leads to substantial performance improvements compared to previously proposed approaches.
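The idea of a refresh schedule that adapts to observed change behavior can be illustrated with a minimal sketch. It is not the learning approach proposed in the paper: it uses only the change history (no static page features) and replaces the learned predictor with a simple Laplace-smoothed change frequency. The function names, the example URLs, and the per-cycle `budget` parameter are all assumptions introduced for illustration.

```python
from typing import Dict, List


def change_probability(history: List[bool]) -> float:
    """Laplace-smoothed fraction of past visits on which the page had changed.

    Stand-in for a learned predictor (an assumption, not the paper's model).
    A page with no history gets probability 0.5.
    """
    return (sum(history) + 1) / (len(history) + 2)


def pick_pages_to_refresh(histories: Dict[str, List[bool]], budget: int) -> List[str]:
    """Return up to `budget` URLs, ordered from most to least likely to have changed."""
    ranked = sorted(
        histories,
        key=lambda url: change_probability(histories[url]),
        reverse=True,
    )
    return ranked[:budget]


# Hypothetical change histories: True means the page differed from the stored copy.
histories = {
    "http://example.org/news": [True, True, True, False],
    "http://example.org/about": [False, False, False, False],
    "http://example.org/blog": [True, False, True, False],
}

selected = pick_pages_to_refresh(histories, budget=2)
print(selected)
```

With a budget of two visits, the page that never changed is skipped, which is the kind of saving an adaptive schedule aims for.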