ACM Home Page
Please provide us with feedback. Feedback
Approximate substring selectivity estimation
Full text PdfPdf (712 KB)
Source Extending Database Technology; Vol. 360 archive
Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology table of contents
Saint Petersburg, Russia
SESSION: Research sessions: Information retrieval table of contents
Pages 827-838  
Year of Publication: 2009
ISBN:978-1-60558-422-5
Authors
Hongrae Lee  University of British Columbia
Raymond T. Ng  University of British Columbia
Kyuseok Shim  Seoul National University
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 20,   Downloads (12 Months): 72,   Citation Count: 0
Additional Information:

abstract   references   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1516360.1516455
What is a DOI?

ABSTRACT

We study the problem of estimating selectivity of approximate substring queries. Its importance in databases is ever increasing as more and more data are input by users and are integrated with many typographical errors and different spelling conventions. To begin with, we consider edit distance for the similarity between a pair of strings. Based on information stored in an extended N-gram table, we propose two estimation algorithms, MOF and LBS for the task. The latter extends the former with ideas from set hashing signatures. The experimental results show that MOF is a light-weight algorithm that gives fairly accurate estimations. However, if more space is available, LBS can give better accuracy than MOF and other baseline methods. Next, we extend the proposed solution to other similarity predicates, SQL LIKE operator and Jaccard similarity.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
 
2
 
3
 
4
 
5
6
7
8
 
9
 
10
W. Cohen, P. Ravikmar and and S. Fienberg. A Comparison of String Metrics for Matching Names and Records. Proc. of KDD Workshop on Data Cleaning, pp. 73--78, 2003.
 
11
Flamingo Project, http://www.ics.uci.edu/flamingo/.
 
12
13
 
14
 
15
 
16
17
 
18
 
19
20
21
22
 
23
Winkler, W. E. The State of Record Linkage and Current Research Problems. Statistics of Income Division, U.S. Bureau of the Census, 1999.
Collaborative Colleagues:
Hongrae Lee: colleagues
Raymond T. Ng: colleagues
Kyuseok Shim: colleagues