ACM Home Page
Please provide us with feedback. Feedback
Ed-Join: an efficient algorithm for similarity joins with edit distance constraints
Full text PdfPdf (3.28 MB)
Source
Proceedings of the VLDB Endowment archive
Volume 1 ,  Issue 1  (August 2008) table of contents
SESSION: Text and keyword query processing table of contents
Pages 933-944  
Year of Publication: 2008
ISSN:2150-8097
Authors
Chuan Xiao  The University of New South Wales, Australia
Wei Wang  The University of New South Wales, Australia
Xuemin Lin  The University of New South Wales, Australia
Publisher
Bibliometrics
Downloads (6 Weeks): 15,   Downloads (12 Months): 94,   Citation Count: 3
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1453856.1453957
What is a DOI?

ABSTRACT

There has been considerable interest in similarity join in the research community recently. Similarity join is a fundamental operation in many application areas, such as data integration and cleaning, bioinformatics, and pattern recognition. We focus on efficient algorithms for similarity join with edit distance constraints. Existing approaches are mainly based on converting the edit distance constraint to a weaker constraint on the number of matching q-grams between pair of strings.

In this paper, we propose the novel perspective of investigating mismatching q-grams. Technically, we derive two new edit distance lower bounds by analyzing the locations and contents of mismatching q-grams. A new algorithm, Ed-Join, is proposed that exploits the new mismatch-based filtering methods; it achieves substantial reduction of the candidate sizes and hence saves computation time. We demonstrate experimentally that the new algorithm outperforms alternative methods on large-scale real datasets under a wide range of parameter settings.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
 
2
3
 
4
5
 
6
 
7
8
9
 
10
 
11
S. Chaudhuri, V. Ganti, and R. Kaushik. Data debugger: An operator-centric approach for data quality solutions. IEEE Data Eng. Bull., 29(2):60--66, 2006.
 
12
 
13
 
14
 
15
L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free (erratum). Technical Report CUCS-011-03, Columbia University, 2003.
 
16
 
17
18
 
19
 
20
 
21
22
 
23
W. J. Masek and M. Paterson. A faster algorithm computing string edit distances. J. Comput. Syst. Sci., 20(1):18--31, 1980.
24
25
26
 
27
 
28
29
30
 
31
32
 
33
W. E. Winkler. The state of record linkage and current research problems. Technical report, U.S. Bureau of the Census, 1999.
34
35


Collaborative Colleagues:
Chuan Xiao: colleagues
Wei Wang: colleagues
Xuemin Lin: colleagues