Leveraging aggregate constraints for deduplication
Source
International Conference on Management of Data
Proceedings of the 2007 ACM SIGMOD international conference on Management of data
Beijing, China
SESSION: Data cleaning and integration
Pages: 437 - 448
Year of Publication: 2007
ISBN: 978-1-59593-686-8
ABSTRACT
We show that aggregate constraints (as opposed to pairwise constraints), which often arise when integrating multiple sources of data, can be leveraged to enhance the quality of deduplication. However, despite the appeal of this approach, the problem is challenging both semantically and computationally. We define a restricted search space for deduplication that is intuitive in our context, and we solve the problem optimally within that restricted space. Our experiments on real data show that incorporating aggregate constraints significantly enhances the accuracy of deduplication.
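To make the idea of an aggregate constraint concrete, here is a minimal Python sketch. It is not the paper's algorithm. The scenario (billed and paid amounts drawn from two sources), the record ids, and the function name `satisfied_groups` are all assumptions made for illustration. It only shows how a constraint over a whole group of records, rather than over a pair, can favor one candidate grouping over another.

```python
from typing import Dict, List

def satisfied_groups(partition: List[List[str]],
                     billed: Dict[str, int],
                     paid: Dict[str, int]) -> int:
    """Count groups whose total billed amount equals their total paid amount.

    The equality of the two sums is a hypothetical aggregate constraint:
    it is checked over a whole group, not over a pair of records.
    """
    count = 0
    for group in partition:
        if sum(billed[r] for r in group) == sum(paid[r] for r in group):
            count += 1
    return count

# Hypothetical data: r1 and r2 may describe the same customer, where
# one source recorded the bill under r1 and another the payment under r2.
billed = {"r1": 100, "r2": 0, "r3": 50}
paid = {"r1": 0, "r2": 100, "r3": 50}

split = [["r1"], ["r2"], ["r3"]]      # treat every record as distinct
merged = [["r1", "r2"], ["r3"]]       # treat r1 and r2 as duplicates

print(satisfied_groups(split, billed, paid))
print(satisfied_groups(merged, billed, paid))
```

In this toy data, merging r1 with r2 makes the amounts balance for every group, whereas keeping them apart leaves two groups violating the constraint. That is the kind of evidence an aggregate constraint can contribute beyond textual similarity.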
CITED BY
Wei Wang, Chuan Xiao, Xuemin Lin, Chengqi Zhang, Efficient approximate entity extraction with edit distance constraints, Proceedings of the 35th SIGMOD international conference on Management of data, June 29-July 02, 2009, Providence, Rhode Island, USA