Dynamic maintenance of web indexes using landmarks
Source
International World Wide Web Conference
Proceedings of the 12th international conference on World Wide Web
Budapest, Hungary
SESSION: Information retrieval 2
Pages: 102 - 111
Year of Publication: 2003
ISBN: 1-58113-680-3
Authors
Lipyeow Lim, Duke University, Durham, NC
Min Wang, IBM T. J. Watson Research Ctr., Hawthorne, NY
Sriram Padmanabhan, IBM T. J. Watson Research Ctr., Hawthorne, NY
Jeffrey Scott Vitter, Purdue University, West Lafayette, IN
Ramesh Agarwal, IBM Almaden Research Ctr., San Jose, CA
ABSTRACT
Recent work on incremental crawling has enabled the indexed document collection of a search engine to be more synchronized with the changing World Wide Web. However, this synchronized collection is not immediately searchable, because the keyword index is rebuilt from scratch less frequently than the collection can be refreshed. An inverted index is usually used to index documents crawled from the web. Complete index rebuild at high frequency is expensive. Previous work on incremental inverted index updates has been restricted to adding and removing documents. Updating the inverted index for previously indexed documents that have changed has not been addressed.

In this paper, we propose an efficient method to update the inverted index for previously indexed documents whose contents have changed. Our method uses the idea of landmarks together with the diff algorithm to significantly reduce the number of postings in the inverted index that need to be updated. Our experiments verify that our landmark-diff method results in significant savings in the number of update operations on the inverted index.
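The abstract does not spell out how landmarks are represented, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes that a landmark is placed every `block` words, that a posting stores a (landmark, offset) pair instead of an absolute word position, and that a separate landmark directory records where each landmark currently sits. The function names `absolute_updates` and `landmark_updates` are hypothetical, and Python's `difflib.SequenceMatcher` stands in for the diff step.

```python
from difflib import SequenceMatcher


def absolute_updates(old, new):
    """Postings to delete plus postings to insert when each posting
    stores an absolute word position (assumed baseline)."""
    old_postings = {(word, pos) for pos, word in enumerate(old)}
    new_postings = {(word, pos) for pos, word in enumerate(new)}
    return len(old_postings ^ new_postings)


def landmark_updates(old, new, block=5):
    """Return (posting updates, landmark directory updates) when each
    posting stores an offset relative to the landmark of its block."""
    # Use the diff to map every surviving old position to its new position.
    new_pos = {}
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            for k in range(i2 - i1):
                new_pos[i1 + k] = j1 + k

    # A posting is untouched if its word and its landmark both survive
    # and the distance between them is the same as before the edit.
    unchanged = 0
    for i in range(len(old)):
        mark = (i // block) * block
        if (i in new_pos and mark in new_pos
                and new_pos[i] - new_pos[mark] == i - mark):
            unchanged += 1

    posting_updates = (len(old) - unchanged) + (len(new) - unchanged)
    # Landmarks whose position changed (or vanished) need a directory update.
    moved = sum(1 for m in range(0, len(old), block) if new_pos.get(m) != m)
    return posting_updates, moved


old = [f"w{i}" for i in range(20)]
new = old[:2] + ["x"] + old[2:]      # insert one word near the top

print(absolute_updates(old, new))    # 37
print(landmark_updates(old, new))    # (7, 3)
```

In this toy case a single inserted word shifts every later absolute position, so almost all postings must be rewritten, whereas with landmark-relative offsets only the edited block's postings change and the remaining cost is a few landmark directory entries. The block size and the way a damaged block is handled are assumptions of the sketch.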
CITED BY 7
Edleno S. de Moura, Célia F. dos Santos, Daniel R. Fernandes, Altigran S. Silva, Pavel Calado, Mario A. Nascimento, Improving Web search efficiency via a locality based static pruning method, Proceedings of the 14th international conference on World Wide Web, May 10-14, 2005, Chiba, Japan
Ronny Lempel, Yosi Mass, Shila Ofek-Koifman, Dafna Sheinwald, Yael Petruschka, Ron Sivan, Just in time indexing for up to the second search, Proceedings of the sixteenth ACM conference on Conference on information and knowledge management, November 06-10, 2007, Lisbon, Portugal
Fei Chen, Byron J. Gao, AnHai Doan, Jun Yang, Raghu Ramakrishnan, Optimizing complex extraction programs over evolving text data, Proceedings of the 35th SIGMOD international conference on Management of data, June 29-July 02, 2009, Providence, Rhode Island, USA