| Cost-based variable-length-gram selection for string collections to support approximate queries efficiently |
| Full text |
Pdf
(477 KB)
|
Source
|
International Conference on Management of Data
archive
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
table of contents
Vancouver, Canada
SESSION: Research Session 9: Strings and Time
table of contents
Pages 353-364
Year of Publication: 2008
ISBN:978-1-60558-102-6
|
|
Authors
|
|
Xiaochun Yang
|
Northeastern University, Shenyang, China
|
|
Bin Wang
|
Northeastern University, Shenyang, China
|
|
Chen Li
|
University of California, Irvine, Irvine, CA, USA
|
|
| Sponsors |
|
| Publisher |
|
| Bibliometrics |
Downloads (6 Weeks): 30, Downloads (12 Months): 222, Citation Count: 3
|
|
|
ABSTRACT
Approximate queries on a collection of strings are important in many applications such as record linkage, spell checking, and Web search, where inconsistencies and errors exist in data as well as queries. Several existing algorithms use the concept of "grams," which are substrings of strings used as signatures for the strings to build index structures. A recently proposed technique, called VGRAM, improves the performance of these algorithms by using a carefully chosen dictionary of variable-length grams based on their requencies in the string collection. Since an index structure using fixed-length grams can be viewed as a special case of VGRAM, a fundamental problem arises naturally: what is the relationship between the gram dictionary and the performance of queries? We study this problem in this paper. We propose a dynamic programming algorithm for computing a tight lower bound on the number of common grams shared by two similar strings in order to improve query performance. We analyze how a gram dictionary affects the index structure of the string collection and ultimately the performance of queries. We also propose an algorithm for automatically computing a dictionary of high-quality grams for a workload of queries. Our experiments on real data sets show the improvement on query performance achieved by these techniques. To our best knowledge, this study is the first cost-based quantitative approach to deciding good grams for approximate string queries.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
 |
2
|
|
| |
3
|
V. R. Carvalho and W. W. Cohen. Improving "email speech acts" analysis via n-gram selection. In Analyzing Conversations in Text and Speech (ACTS) Workshop at HLT-NAACL, pages 35--41, 2006.
|
 |
4
|
|
| |
5
|
|
| |
6
|
|
| |
7
|
|
| |
8
|
Luis Gravano , Panagiotis G. Ipeirotis , H. V. Jagadish , Nick Koudas , S. Muthukrishnan , Divesh Srivastava, Approximate String Joins in a Database (Almost) for Free, Proceedings of the 27th International Conference on Very Large Data Bases, p.491-500, September 11-14, 2001
|
| |
9
|
M. Hadjieleftheriou, A. Chandel, N. Koudas, and D. Srivastava. Set similarity selection queries at interactive speeds. In ICDE, 2008.
|
 |
10
|
Bijit Hore , Hakan Hacigumus , Bala Iyer , Sharad Mehrotra, Indexing text data under space constraints, Proceedings of the thirteenth ACM international conference on Information and knowledge management, November 08-13, 2004, Washington, D.C., USA
[doi> 10.1145/1031171.1031212]
|
 |
11
|
H. V. Jagadish , Raymond T. Ng , Divesh Srivastava, Substring selectivity estimation, Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, p.249-260, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States
[doi> 10.1145/303976.304001]
|
| |
12
|
|
| |
13
|
|
 |
14
|
|
 |
15
|
P. Krishnan , Jeffrey Scott Vitter , Bala Iyer, Estimating alphanumeric selectivity in the presence of wildcards, Proceedings of the 1996 ACM SIGMOD international conference on Management of data, p.282-293, June 04-06, 1996, Montreal, Quebec, Canada
|
| |
16
|
|
| |
17
|
C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, 2008.
|
| |
18
|
|
 |
19
|
|
 |
20
|
|
| |
21
|
S. C. Sahinalp, M. Tasan, J. Macker, and Z. M. Özsoyoglu. Distance based indexing for string proximity search. In ICDE, 2003.
|
 |
22
|
|
CITED BY 3
|
|
|
|
|
Wei Wang , Chuan Xiao , Xuemin Lin , Chengqi Zhang, Efficient approximate entity extraction with edit distance constraints, Proceedings of the 35th SIGMOD international conference on Management of data, June 29-July 02, 2009, Providence, Rhode Island, USA
|
|
|
|
|