Unsupervised deduplication using cross-field dependencies
Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Las Vegas, Nevada, USA
Session: Research papers
Pages 310-317
Year of Publication: 2008
ISBN: 978-1-60558-193-4
ABSTRACT
Recent work in deduplication has shown that collective deduplication of different attribute types can improve performance. But although these techniques cluster the attributes collectively, they do not model them collectively. For example, in citations in the research literature, canonical venue strings and title strings are dependent -- because venues tend to focus on a few research areas -- but this dependence is not modeled by current unsupervised techniques. We call this dependence between fields in a record a cross-field dependence. In this paper, we present an unsupervised generative model for the deduplication problem that explicitly models cross-field dependence. Our model uses a single set of latent variables to control two disparate clustering models: a Dirichlet-multinomial model over titles, and a non-exchangeable string-edit model over venues. We show that modeling cross-field dependence yields a substantial improvement in performance -- a 58% reduction in error over a standard Dirichlet process mixture.
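The idea of letting one latent cluster assignment govern both fields can be illustrated with a minimal sketch. This is not the paper's model: the Dirichlet process prior and the sampler are omitted, and the non-exchangeable string-edit model over venues is replaced here by a simple per-edit penalty on Levenshtein distance. All function names, the `beta` and `rate` parameters, and the toy clusters are assumptions made for illustration.

```python
import math
from collections import Counter


def title_log_predictive(words, cluster_counts, vocab_size, beta=0.1):
    """Collapsed Dirichlet-multinomial log predictive of a title's words,
    given word counts already assigned to the cluster (symmetric prior beta)."""
    counts = Counter(cluster_counts)  # copy, so the cluster is not mutated
    total = sum(counts.values())
    logp = 0.0
    for w in words:
        logp += math.log((counts[w] + beta) / (total + beta * vocab_size))
        counts[w] += 1
        total += 1
    return logp


def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def venue_log_score(venue, canonical, rate=1.0):
    """Unnormalized log score: each edit away from the cluster's canonical
    venue string costs `rate`. A stand-in for a string-edit model."""
    return -rate * edit_distance(venue, canonical)


def joint_log_score(record, cluster, vocab_size):
    """One latent cluster scores both fields, so title evidence and venue
    evidence are combined rather than clustered separately."""
    return (title_log_predictive(record["title"], cluster["title_counts"], vocab_size)
            + venue_log_score(record["venue"], cluster["canonical_venue"]))


def best_cluster(record, clusters, vocab_size):
    """Index of the cluster with the highest joint score."""
    scores = [joint_log_score(record, c, vocab_size) for c in clusters]
    return max(range(len(clusters)), key=scores.__getitem__)


# Hypothetical toy data: two venue clusters with typical title words.
clusters = [
    {"canonical_venue": "kdd", "title_counts": {"mining": 5, "clustering": 3}},
    {"canonical_venue": "nips", "title_counts": {"neural": 5, "bayesian": 3}},
]
record = {"title": ["mining", "clustering"], "venue": "kd"}
print(best_cluster(record, clusters, vocab_size=4))
```

In this toy setup the noisy venue string "kd" and the title words both favor the first cluster, so the joint score selects it. The point of the sketch is only that a single assignment variable lets title words inform which canonical venue a noisy venue string belongs to, and vice versa.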
CITED BY
Nilesh Dalvi, Ravi Kumar, Bo Pang, Raghu Ramakrishnan, Andrew Tomkins, Philip Bohannon, Sathiya Keerthi, Srujana Merugu. A web of concepts. Proceedings of the twenty-eighth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, June 29-July 01, 2009, Providence, Rhode Island, USA