ACM Home Page
Please provide us with feedback. Feedback
Efficient set joins on similarity predicates
Full text PdfPdf (265 KB)
Source International Conference on Management of Data archive
Proceedings of the 2004 ACM SIGMOD international conference on Management of data table of contents
Paris, France
SESSION: Research sessions: query uncertainty table of contents
Pages: 743 - 754  
Year of Publication: 2004
ISBN:1-58113-859-8
Authors
Sunita Sarawagi  IIT Bombay
Alok Kirpal  IIT Bombay
Sponsor
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 14,   Downloads (12 Months): 109,   Citation Count: 26
Additional Information:

abstract   references   cited by   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1007568.1007652
What is a DOI?

ABSTRACT

In this paper we present an efficient, scalable and general algorithm for performing set joins on predicates involving various similarity measures like intersect size, Jaccard-coefficient, cosine similarity, and edit-distance. This expands the existing suite of algorithms for set joins on simpler predicates such as, set containment, equality and non-zero overlap. We start with a basic inverted index based probing method and add a sequence of optimizations that result in one to two orders of magnitude improvement in running time. The algorithm folds in a data partitioning strategy that can work efficiently with an index compressed to fit in any available amount of main memory. The optimizations used in our algorithm generalize to several weighted and unweighted measures of partial word overlap between sets.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
 
2
R. Ananthakrishna, S. chaudhuri, and V. Ganti. Eliminating fuzzy duplicates in data warehouses. In VLDB, 2002.
 
3
Broder. On the resemblance and containment of documents. In SEQS: Sequences '91, 1998.
 
4
5
6
 
7
 
8
9
 
10
 
11
 
12
L. Gravano, P. G. Ipeirotis, N. Koudas, and D. Srivastava. Text joins for data cleansing and integration in an RDBMS. In ICDE, pages 729--731, 2003.
 
13
 
14
 
15
 
16
17
 
18
S. Melnik and H. Garcia-Molina. Adaptive algorithms for set containment joins. Technical report, Stanford University, 2001.
 
19
 
20
A. E. Monge and C. P. Elkan. The field matching problem: Algorithms and applications. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), 1996.
 
21
22
 
23

CITED BY  27
Collaborative Colleagues:
Sunita Sarawagi: colleagues
Alok Kirpal: colleagues