|
ABSTRACT
Continuous queries are queries for which responses given to users must be continuously updated, as the sources of interest get updated. Such queries occur, for instance, during on-line decision making, e.g., traffic flow control, weather monitoring, etc. The problem of keeping the responses current reduces to the problem of deciding how often to visit a source to determine if and how it has been modified, in order to update earlier responses accordingly. On the surface, this seems to be similar to the crawling problem since crawlers attempt to keep indexes up-to-date as pages change and users pose search queries. We show that this is not the case, both due to the inherent differences between the nature of the two problems as well as the performance metric. We propose, develop and evaluate a novel multi-phase (Continuous Adaptive Monitoring) (CAM) solution to the problem of maintaining the currency of query results. Some of the important phases are: The tracking phase, in which changes, to an initially identified set of relevant pages, are tracked. From the observed change characteristics of these pages, a probabilistic model of their change behavior is formulated and weights are assigned to pages to denote their importance for the current queries. During the next phase, the resource allocation phase, based on these statistics, resources, needed to continuously monitor these pages for changes, are allocated. Given these resource allocations, the scheduling phase produces an optimal achievable schedule for the monitoring tasks. An experimental evaluation of our approach compared to prior approaches for crawling dynamic web pages shows the effectiveness of CAM for monitoring dynamic changes. For example, by monitoring just 5% of the page changes, CAM is able to return 90% of the changed information to the users. The experiments also produce some interesting observations pertaining to the differences between the two problems of crawling--to build an index--and the problem of change tracking--to respond to continuous queries.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
| |
2
|
|
 |
3
|
|
| |
4
|
E. Coffman, J. Z. Liu, and R. R. Weber. Optimal robot scheduling for web search engines. Journal of Scheduling, 1998.
|
| |
5
|
|
| |
6
|
D. Fetterly, M. Manasse, M. Najork, and J. Wiener. Crawling towards light: A large scale study of the evolution of Web pages. In First Workshop on Algorithms and Models for the Web-Graph, Vancouver, Canada, nov 2002.
|
| |
7
|
B. Fox. Discrete optimization via marginal analysis. Management Science, 13(3):211--216, 1966.
|
| |
8
|
G. N. Frederickson and D. B. Johnson. The complexity of selection and ranking in x + y and matrices with sorted columns. Journal of Computer and System Sciences, 24:197--208, 1982.
|
| |
9
|
|
| |
10
|
|
| |
11
|
J.K.Lenstra, A. Kan, and P.Brucker. Complexity of machine scheduling problems. Annals of Discrete Mathematics, 1:343--362, 1977.
|
| |
12
|
|
 |
13
|
Ling Liu , Calton Pu , Wei Tang, WebCQ-detecting and delivering information changes on the web, Proceedings of the ninth international conference on Information and knowledge management, p.512-519, November 06-11, 2000, McLean, Virginia, United States
[doi> 10.1145/354756.354860]
|
| |
14
|
L. Liu, C. Pu, W. Tang, and W. Han. Conquer: A continual query system for update monitoring in the www. International Journal of Computer Systems, Science and Engineering, 1999.
|
| |
15
|
M.R.Garey, D.S.Johnson, and R.Sethi. The complexity of flowshop and jobshop scheduling. Mathematics Operation Research, 1:117--129, 1976.
|
 |
16
|
Chris Olston , Boon Thau Loo , Jennifer Widom, Adaptive precision setting for cached approximate values, Proceedings of the 2001 ACM SIGMOD international conference on Management of data, p.355-366, May 21-24, 2001, Santa Barbara, California, United States
|
 |
17
|
James Pitkow , Peter Pirolli, Life, death, and lawfulness on the electronic frontier, Proceedings of the SIGCHI conference on Human factors in computing systems, p.383-390, March 22-27, 1997, Atlanta, Georgia, United States
[doi> 10.1145/258549.258805]
|
 |
18
|
J. L. Wolf , M. S. Squillante , P. S. Yu , J. Sethuraman , L. Ozsen, Optimal crawling strategies for web search engines, Proceedings of the 11th international conference on World Wide Web, May 07-11, 2002, Honolulu, Hawaii, USA
[doi> 10.1145/511446.511465]
|
 |
19
|
Alec Wolman , M. Voelker , Nitin Sharma , Neal Cardwell , Anna Karlin , Henry M. Levy, On the scale and performance of cooperative Web proxy caching, Proceedings of the seventeenth ACM symposium on Operating systems principles, p.16-31, December 12-15, 1999, Charleston, South Carolina, United States
|
|