ACM Home Page
Please provide us with feedback. Feedback
Dynamic maintenance of web indexes using landmarks
Full text PdfPdf (234 KB)
Source International World Wide Web Conference archive
Proceedings of the 12th international conference on World Wide Web table of contents
Budapest, Hungary
SESSION: Information retrieval 2 table of contents
Pages: 102 - 111  
Year of Publication: 2003
ISBN:1-58113-680-3
Authors
Lipyeow Lim  Duke University, Durham, NC
Min Wang  IBM T. J. Watson Research Ctr., Hawthorne, NY
Sriram Padmanabhan  IBM T. J. Watson Research Ctr., Hawthorne, NY
Jeffrey Scott Vitter  Purdue University, West Lafayette, IN
Ramesh Agarwal  IBM Almaden Research Ctr., San Jose, CA
Sponsor
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 7,   Downloads (12 Months): 64,   Citation Count: 7
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/775152.775167
What is a DOI?

ABSTRACT

Recent work on incremental crawling has enabled the indexed document collection of a search engine to be more synchronized with the changing World Wide Web. However, this synchronized collection is not immediately searchable, because the keyword index is rebuilt from scratch less frequently than the collection can be refreshed. An inverted index is usually used to index documents crawled from the web. Complete index rebuild at high frequency is expensive. Previous work on incremental inverted index updates have been restricted to adding and removing documents. Updating the inverted index for previously indexed documents that have changed has not been addressed.In this paper, we propose an efficient method to update the inverted index for previously indexed documents whose contents have changed. Our method uses the idea of landmarks together with the diff algorithm to significantly reduce the number of postings in the inverted index that need to be updated. Our experiments verify that our landmark-diff method results in significant savings in the number of update operations on the inverted index.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
 
2
3
 
4
 
5
 
6
J. Cho and H. Garcia-Molina. Estimating frequency of change. Submitted for publication, 2000.
 
7
 
8
C. Clarke and G. Cormack. Dynamic inverted indexes for a distributed full-text retrieval system. Tech. Report CS-95-01, Univ. of Waterloo CS Dept., 1995.
 
9
C. Clarke, G. Cormack, and F. Burkowski. Fast inverted indexes with on-line update. Tech. Report CS-94-40, Univ. of Waterloo CS Dept., 1994.
10
 
11
 
12
D. E. Knuth, J. H. Morris, and V. B. Pratt. Fast pattern matching in strings. SIAM Journal of Computing, 6, 323--350, 1977.
 
13
S. Lawrence and C. L. Giles. Accessibility of information on the web. Nature, 400, 107--109, 1999.
 
14
 
15
 
16
U. Manber and S. Wu. GLIMPSE: A tool to search through entire file systems. In Proceedings of the Winter 1994 USENIX Conf., 23--32. USENIX, 1994.
17
 
18
19
 
20
21
22
 
23
24

CITED BY  7

Collaborative Colleagues:
Lipyeow Lim: colleagues
Min Wang: colleagues
Sriram Padmanabhan: colleagues
Jeffrey Scott Vitter: colleagues
Ramesh Agarwal: colleagues