|
ABSTRACT
We study the problem of estimating selectivity of approximate substring queries. Its importance in databases is ever increasing as more and more data are input by users and are integrated with many typographical errors and different spelling conventions. To begin with, we consider edit distance for the similarity between a pair of strings. Based on information stored in an extended N-gram table, we propose two estimation algorithms, MOF and LBS for the task. The latter extends the former with ideas from set hashing signatures. The experimental results show that MOF is a light-weight algorithm that gives fairly accurate estimations. However, if more space is available, LBS can give better accuracy than MOF and other baseline methods. Next, we extend the proposed solution to other similarity predicates, SQL LIKE operator and Jaccard similarity.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
| |
2
|
|
| |
3
|
|
| |
4
|
|
| |
5
|
|
 |
6
|
|
 |
7
|
|
 |
8
|
Zhiyuan Chen , Nick Koudas , Flip Korn , S. Muthukrishnan, Selectively estimation for Boolean queries, Proceedings of the nineteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, p.216-225, May 15-18, 2000, Dallas, Texas, United States
[doi> 10.1145/335168.335225]
|
| |
9
|
|
| |
10
|
W. Cohen, P. Ravikmar and and S. Fienberg. A Comparison of String Metrics for Matching Names and Records. Proc. of KDD Workshop on Data Cleaning, pp. 73--78, 2003.
|
| |
11
|
Flamingo Project, http://www.ics.uci.edu/flamingo/.
|
| |
12
|
|
 |
13
|
H. V. Jagadish , Raymond T. Ng , Divesh Srivastava, Substring selectivity estimation, Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, p.249-260, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States
[doi> 10.1145/303976.304001]
|
| |
14
|
|
| |
15
|
|
| |
16
|
|
 |
17
|
P. Krishnan , Jeffrey Scott Vitter , Bala Iyer, Estimating alphanumeric selectivity in the presence of wildcards, Proceedings of the 1996 ACM SIGMOD international conference on Management of data, p.282-293, June 04-06, 1996, Montreal, Quebec, Canada
|
| |
18
|
|
| |
19
|
|
 |
20
|
|
 |
21
|
|
 |
22
|
|
| |
23
|
Winkler, W. E. The State of Record Linkage and Current Research Problems. Statistics of Income Division, U.S. Bureau of the Census, 1999.
|
|