Leveraging aggregate constraints for deduplication
Source
International Conference on Management of Data
Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data
Beijing, China
SESSION: Data cleaning and integration
Pages: 437 - 448  
Year of Publication: 2007
ISBN:978-1-59593-686-8
Authors
Surajit Chaudhuri  Microsoft Research, Redmond, WA
Anish Das Sarma  Stanford University, Stanford, CA
Venkatesh Ganti  Microsoft Research, Redmond, WA
Raghav Kaushik  Microsoft Research, Redmond, WA
Sponsors
ACM: Association for Computing Machinery
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 7,   Downloads (12 Months): 97,   Citation Count: 4
DOI: http://doi.acm.org/10.1145/1247480.1247530

ABSTRACT

We show that aggregate constraints (as opposed to pairwise constraints), which often arise when integrating multiple sources of data, can be leveraged to enhance the quality of deduplication. Despite the appeal of this approach, we show that the problem is challenging, both semantically and computationally. We define a restricted search space for deduplication that is intuitive in our context, and we solve the problem optimally within that restricted space. Our experiments on real data show that incorporating aggregate constraints significantly enhances the accuracy of deduplication.
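
To make the idea of an aggregate constraint concrete, the following is a toy sketch, not the algorithm from the paper. It assumes a hypothetical billing source whose duplicate records, once merged, should sum to a per-customer total reported by a second source. The record data, the SUM-equality constraint, the hand-written candidate partitions, and all function names are illustrative assumptions; the small list of candidates only stands in for a restricted search space.

```python
from typing import Dict, List, Sequence

# Hypothetical toy data: record id -> billed amount from one source.
AMOUNTS: Dict[int, float] = {0: 30.0, 1: 70.0, 2: 50.0}

# Hypothetical per-customer totals reported by a second source.
EXPECTED_TOTALS: List[float] = [100.0, 50.0]

Partition = Sequence[Sequence[int]]


def group_satisfies(group: Sequence[int],
                    amounts: Dict[int, float],
                    expected_totals: Sequence[float],
                    tol: float = 1e-9) -> bool:
    """Assumed aggregate constraint: the group's summed amount must
    match one of the totals reported by the other source."""
    total = sum(amounts[r] for r in group)
    return any(abs(total - t) <= tol for t in expected_totals)


def count_satisfied(partition: Partition,
                    amounts: Dict[int, float],
                    expected_totals: Sequence[float]) -> int:
    """Number of groups in the partition that satisfy the constraint."""
    return sum(group_satisfies(g, amounts, expected_totals) for g in partition)


def best_partition(candidates: Sequence[Partition],
                   amounts: Dict[int, float],
                   expected_totals: Sequence[float]) -> Partition:
    """Pick the candidate partition with the most satisfied groups."""
    return max(candidates,
               key=lambda p: count_satisfied(p, amounts, expected_totals))


# A small, hand-enumerated candidate space (an assumption for illustration).
CANDIDATES: List[Partition] = [
    [[0], [1], [2]],   # no records merged
    [[0, 1], [2]],     # records 0 and 1 treated as duplicates
    [[0, 1, 2]],       # everything merged
]

if __name__ == "__main__":
    print(best_partition(CANDIDATES, AMOUNTS, EXPECTED_TOTALS))
```

In this sketch, merging records 0 and 1 yields a group total of 100.0, so that grouping satisfies the assumed constraint on both of its groups, whereas a purely pairwise similarity test would have no access to this signal.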


