ACM Home Page
Please provide us with feedback. Feedback
Incremental maintenance of length normalized indexes for approximate string matching
Full text PdfPdf (529 KB)
Source
International Conference on Management of Data archive
Proceedings of the 35th SIGMOD international conference on Management of data table of contents
Providence, Rhode Island, USA
SESSION: Research session 11: database optimization table of contents
Pages 429-440  
Year of Publication: 2009
ISBN:978-1-60558-551-2
Authors
Marios Hadjieleftheriou  AT&T Labs - Research, Florham Park, NJ, USA
Nick Koudas  University of Toronto, Toronto, ON, Canada
Divesh Srivastava  AT&T Labs - Research, Florham Park, NJ, USA
Sponsors
ACM: Association for Computing Machinery
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 44,   Downloads (12 Months): 166,   Citation Count: 0
Additional Information:

abstract   references   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1559845.1559891
What is a DOI?

ABSTRACT

Approximate string matching is a problem that has received a lot of attention recently. Existing work on information retrieval has concentrated on a variety of similarity measures TF/IDF, BM25, HMM, etc.) specifically tailored for document retrieval purposes. As new applications that depend on retrieving short strings are becoming popular(e.g., local search engines like YellowPages.com, Yahoo!Local, and Google Maps) new indexing methods are needed, tailored for short strings. For that purpose, a number of indexing techniques and related algorithms have been proposed based on length normalized similarity measures. A common denominator of indexes for length normalized measures is that maintaining the underlying structures in the presence of incremental updates is inefficient, mainly due to data dependent, precomputed weights associated with each distinct token and string. Incorporating updates usually is accomplished by rebuilding the indexes at regular time intervals. In this paper we present a framework that advocates lazy update propagation with the following key feature: Efficient, incremental updates that immediately reflect the new data in the indexes in a way that gives strict guarantees on the quality of subsequent query answers. More specifically, our techniques guarantee against false negatives and limit the number of false positives produced. We implement a fully working prototype and illustrate that the proposed ideas work really well in practice for real datasets.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
Apache Lucene. http://lucene.apache.org/.
 
2
 
3
 
4
R. A. Baeza-Yates and G. Navarro. Faster approximate string matching. Algorithmica, 23(2):127--158, 1999
5
 
6
7
 
8
9
 
10
FAST - Enterprise Search. http://www.fastsearch.com.
 
11
Google. http://www.google.com.
 
12
 
13
 
14
 
15
N. Koudas, A. Marathe, and D. Srivastava. Propagating updates in SPIDER. In ICDE, pages 1146--1153, 2007.
16
 
17
M. Ley. DBLP database. http://dblp.uni-trier.de/xml.
 
18
 
19
Oracle. Oracle Berkeley DB. http://www.oracle.com/technology/products/ berkeley-db/db/ index.html.
 
20
S. E. Robertson, S. Walker, M. Hancock-Beaulieu, A. Gull, and M. Lau. Okapi at TREC. In Text REtrieval Conference (TREC), pages 21--30, 1992.
21
22

Collaborative Colleagues:
Marios Hadjieleftheriou: colleagues
Nick Koudas: colleagues
Divesh Srivastava: colleagues