ACM Home Page
Please provide us with feedback. Feedback
Monitoring the dynamic web to respond to continuous queries
Full text PdfPdf (151 KB)
Source International World Wide Web Conference archive
Proceedings of the 12th international conference on World Wide Web table of contents
Budapest, Hungary
SESSION: Web crawling and measurement table of contents
Pages: 659 - 668  
Year of Publication: 2003
ISBN:1-58113-680-3
Authors
Sandeep Pandey  Indian Institute of Technology, Powai, Mumbai, India
Krithi Ramamritham  Indian Institute of Technology, Powai, Mumbai, India
Soumen Chakrabarti  Indian Institute of Technology, Powai, Mumbai, India
Sponsor
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 11,   Downloads (12 Months): 42,   Citation Count: 8
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/775152.775245
What is a DOI?

ABSTRACT

Continuous queries are queries for which responses given to users must be continuously updated, as the sources of interest get updated. Such queries occur, for instance, during on-line decision making, e.g., traffic flow control, weather monitoring, etc. The problem of keeping the responses current reduces to the problem of deciding how often to visit a source to determine if and how it has been modified, in order to update earlier responses accordingly. On the surface, this seems to be similar to the crawling problem since crawlers attempt to keep indexes up-to-date as pages change and users pose search queries. We show that this is not the case, both due to the inherent differences between the nature of the two problems as well as the performance metric. We propose, develop and evaluate a novel multi-phase (Continuous Adaptive Monitoring) (CAM) solution to the problem of maintaining the currency of query results. Some of the important phases are: The tracking phase, in which changes, to an initially identified set of relevant pages, are tracked. From the observed change characteristics of these pages, a probabilistic model of their change behavior is formulated and weights are assigned to pages to denote their importance for the current queries. During the next phase, the resource allocation phase, based on these statistics, resources, needed to continuously monitor these pages for changes, are allocated. Given these resource allocations, the scheduling phase produces an optimal achievable schedule for the monitoring tasks. An experimental evaluation of our approach compared to prior approaches for crawling dynamic web pages shows the effectiveness of CAM for monitoring dynamic changes. For example, by monitoring just 5% of the page changes, CAM is able to return 90% of the changed information to the users. The experiments also produce some interesting observations pertaining to the differences between the two problems of crawling--to build an index--and the problem of change tracking--to respond to continuous queries.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
 
2
3
 
4
E. Coffman, J. Z. Liu, and R. R. Weber. Optimal robot scheduling for web search engines. Journal of Scheduling, 1998.
 
5
 
6
D. Fetterly, M. Manasse, M. Najork, and J. Wiener. Crawling towards light: A large scale study of the evolution of Web pages. In First Workshop on Algorithms and Models for the Web-Graph, Vancouver, Canada, nov 2002.
 
7
B. Fox. Discrete optimization via marginal analysis. Management Science, 13(3):211--216, 1966.
 
8
G. N. Frederickson and D. B. Johnson. The complexity of selection and ranking in x + y and matrices with sorted columns. Journal of Computer and System Sciences, 24:197--208, 1982.
 
9
 
10
 
11
J.K.Lenstra, A. Kan, and P.Brucker. Complexity of machine scheduling problems. Annals of Discrete Mathematics, 1:343--362, 1977.
 
12
13
 
14
L. Liu, C. Pu, W. Tang, and W. Han. Conquer: A continual query system for update monitoring in the www. International Journal of Computer Systems, Science and Engineering, 1999.
 
15
M.R.Garey, D.S.Johnson, and R.Sethi. The complexity of flowshop and jobshop scheduling. Mathematics Operation Research, 1:117--129, 1976.
16
17
18
19

CITED BY  8

Collaborative Colleagues:
Sandeep Pandey: colleagues
Krithi Ramamritham: colleagues
Soumen Chakrabarti: colleagues