|
ABSTRACT
There has been considerable interest in similarity join in the research community recently. Similarity join is a fundamental operation in many application areas, such as data integration and cleaning, bioinformatics, and pattern recognition. We focus on efficient algorithms for similarity join with edit distance constraints. Existing approaches are mainly based on converting the edit distance constraint to a weaker constraint on the number of matching q-grams between pair of strings. In this paper, we propose the novel perspective of investigating mismatching q-grams. Technically, we derive two new edit distance lower bounds by analyzing the locations and contents of mismatching q-grams. A new algorithm, Ed-Join, is proposed that exploits the new mismatch-based filtering methods; it achieves substantial reduction of the candidate sizes and hence saves computation time. We demonstrate experimentally that the new algorithm outperforms alternative methods on large-scale real datasets under a wide range of parameter settings.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
A. Andoni , M. Deza , A. Gupta , P. Indyk , S. Raskhodnikova, Lower bounds for embedding edit distance into normed spaces, Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms, January 12-14, 2003, Baltimore, Maryland
|
| |
2
|
|
 |
3
|
|
| |
4
|
|
 |
5
|
Christian Böhm , Bernhard Braunmüller , Florian Krebs , Hans-Peter Kriegel, Epsilon grid order: an algorithm for the similarity join on massive high-dimensional data, Proceedings of the 2001 ACM SIGMOD international conference on Management of data, p.379-388, May 21-24, 2001, Santa Barbara, California, United States
|
| |
6
|
|
| |
7
|
|
 |
8
|
Amit Chandel , Oktie Hassanzadeh , Nick Koudas , Mohammad Sadoghi , Divesh Srivastava, Benchmarking declarative approximate selection predicates, Proceedings of the 2007 ACM SIGMOD international conference on Management of data, June 11-14, 2007, Beijing, China
[doi> 10.1145/1247480.1247521]
|
 |
9
|
|
| |
10
|
|
| |
11
|
S. Chaudhuri, V. Ganti, and R. Kaushik. Data debugger: An operator-centric approach for data quality solutions. IEEE Data Eng. Bull., 29(2):60--66, 2006.
|
| |
12
|
|
| |
13
|
|
| |
14
|
Luis Gravano , Panagiotis G. Ipeirotis , H. V. Jagadish , Nick Koudas , S. Muthukrishnan , Divesh Srivastava, Approximate String Joins in a Database (Almost) for Free, Proceedings of the 27th International Conference on Very Large Data Bases, p.491-500, September 11-14, 2001
|
| |
15
|
L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free (erratum). Technical Report CUCS-011-03, Columbia University, 2003.
|
| |
16
|
|
| |
17
|
|
 |
18
|
|
| |
19
|
|
| |
20
|
|
| |
21
|
|
 |
22
|
|
| |
23
|
W. J. Masek and M. Paterson. A faster algorithm computing string edit distances. J. Comput. Syst. Sci., 20(1):18--31, 1980.
|
 |
24
|
|
 |
25
|
|
 |
26
|
|
| |
27
|
|
| |
28
|
|
 |
29
|
|
 |
30
|
|
| |
31
|
|
 |
32
|
|
| |
33
|
W. E. Winkler. The state of record linkage and current research problems. Technical report, U.S. Bureau of the Census, 1999.
|
 |
34
|
|
 |
35
|
|
CITED BY 3
|
|
|
|
|
Wei Wang , Chuan Xiao , Xuemin Lin , Chengqi Zhang, Efficient approximate entity extraction with edit distance constraints, Proceedings of the 35th SIGMOD international conference on Management of data, June 29-July 02, 2009, Providence, Rhode Island, USA
|
|
|
|
|