| Estimating the selectivity of approximate string queries |
| Full text |
Pdf
(365 KB)
|
Source
|
ACM Transactions on Database Systems (TODS)
archive
Volume 32 , Issue 2 (June 2007)
table of contents
Article No. 12
Year of Publication: 2007
ISSN:0362-5915
|
|
Authors
|
|
| Publisher |
|
| Bibliometrics |
Downloads (6 Weeks): 13, Downloads (12 Months): 179, Citation Count: 4
|
|
|
ABSTRACT
Approximate queries on string data are important due to the prevalence of such data in databases and various conventions and errors in string data. We present the VSol estimator, a novel technique for estimating the selectivity of approximate string queries. The VSol estimator is based on inverse strings and makes the performance of the selectivity estimator independent of the number of strings. To get inverse strings we decompose all database strings into overlapping substrings of length q (q-grams) and then associate each q-gram with its inverse string: the IDs of all strings that contain the q-gram. We use signatures to compress inverse strings, and clustering to group similar signatures. We study our technique analytically and experimentally. The space complexity of our estimator only depends on the number of neighborhoods in the database and the desired estimation error. The time to estimate the selectivity is independent of the number of database strings and linear with respect to the length of query string. We give a detailed empirical performance evaluation of our solution for synthetic and real-world datasets. We show that VSol is effective for large skewed databases of short strings.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
Björn Blohsfeld , Dieter Korus , Bernhard Seeger, A comparison of selectivity estimators for range queries on metric attributes, Proceedings of the 1999 ACM SIGMOD international conference on Management of data, p.239-250, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States
|
| |
2
|
|
| |
3
|
|
| |
4
|
|
| |
5
|
|
| |
6
|
Cohen, E. 1994. Estimating the size of the transitive closure in linear time. In Proceedings of the 35th Annual Symposium on Foundations of Computer Science (FOCS) (Nov.). 190--200.
|
| |
7
|
|
| |
8
|
|
| |
9
|
|
| |
10
|
Luis Gravano , Panagiotis G. Ipeirotis , H. V. Jagadish , Nick Koudas , S. Muthukrishnan , Divesh Srivastava, Approximate String Joins in a Database (Almost) for Free, Proceedings of the 27th International Conference on Very Large Data Bases, p.491-500, September 11-14, 2001
|
| |
11
|
|
| |
12
|
|
| |
13
|
H. V. Jagadish , Nick Koudas , S. Muthukrishnan , Viswanath Poosala , Kenneth C. Sevcik , Torsten Suel, Optimal Histograms with Quality Guarantees, Proceedings of the 24rd International Conference on Very Large Data Bases, p.275-286, August 24-27, 1998
|
| |
14
|
|
 |
15
|
|
| |
16
|
|
| |
17
|
|
| |
18
|
|
 |
19
|
P. Krishnan , Jeffrey Scott Vitter , Bala Iyer, Estimating alphanumeric selectivity in the presence of wildcards, Proceedings of the 1996 ACM SIGMOD international conference on Management of data, p.282-293, June 04-06, 1996, Montreal, Quebec, Canada
|
 |
20
|
|
 |
21
|
Yossi Matias , Jeffrey Scott Vitter , Min Wang, Wavelet-based histograms for selectivity estimation, Proceedings of the 1998 ACM SIGMOD international conference on Management of data, p.448-459, June 01-04, 1998, Seattle, Washington, United States
|
| |
22
|
|
| |
23
|
Mazeika, A. and Böhlen, M. H. 2006. Cleansing databases of misspelled proper nouns. In Proceedings of the CleanDB Workshop (in conjunction with) the International Conference on Very Large Databases (VLDB).
|
 |
24
|
|
 |
25
|
|
 |
26
|
Viswanath Poosala , Peter J. Haas , Yannis E. Ioannidis , Eugene J. Shekita, Improved histograms for selectivity estimation of range predicates, Proceedings of the 1996 ACM SIGMOD international conference on Management of data, p.294-305, June 04-06, 1996, Montreal, Quebec, Canada
|
| |
27
|
Sahinalp, C., Tasan, M., Macker, J., and Ozsoyoglu, Z. M. 2003. Distance based indexing for string proximity search. In Proceedings of the International Conference on Data Engineering (ICDE). 125--137.
|
| |
28
|
|
| |
29
|
|
| |
30
|
|
| |
31
|
Vernica, R. and Li, C. 2007. Flamingo project. http://www.ics.uci.edu/~flamingo/.
|
CITED BY 4
|
|
|
|
|
|
|
|
|
|
|
Wei Wang , Chuan Xiao , Xuemin Lin , Chengqi Zhang, Efficient approximate entity extraction with edit distance constraints, Proceedings of the 35th SIGMOD international conference on Management of data, June 29-July 02, 2009, Providence, Rhode Island, USA
|
|