|
ABSTRACT
We propose efficient techniques for processing various Top-K count queries on data with noisy duplicates. Our method differs from existing work on duplicate elimination in two significant ways: First, we dedup on the fly only the part of the data needed for the answer --- a requirement in massive and evolving sources where batch deduplication is expensive. The non-local nature of the problem of partitioning data into duplicate groups, makes it challenging to filter only those tuples forming the K largest groups. We propose a novel method of successively collapsing and pruning records which yield an order of magnitude reduction in running time compared to deduplicating the entire data first. Second, we return multiple high scoring answers to handle situations where it is impossible to resolve if two records are indeed duplicates of each other. Since finding even the highest scoring deduplication is NP-hard, the existing approach is to deploy one of many variants of score-based clustering algorithms which do not easily generalize to finding multiple groupings. We model deduplication as a segmentation of a linear embedding of records and present a polynomial time algorithm for finding the R highest scoring answers. This method closely matches the accuracy of an exact exponential time algorithm on several datasets.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
| |
2
|
|
| |
3
|
|
| |
4
|
|
 |
5
|
|
| |
6
|
M. Bilenko. Learnable similarity functions and their applications to clustering and record linkage. In Proceedings of the Nineteenth National Conference on Artificial Intelligence, Sixteenth Conference on Innovative Applications of Artificial Intelligence, July 25--29, 2004, San Jose, California, USA, pages 981--982. AAAI Press / The MIT Press, 2004.
|
| |
7
|
|
| |
8
|
|
| |
9
|
|
| |
10
|
|
| |
11
|
|
 |
12
|
|
| |
13
|
|
 |
14
|
|
| |
15
|
W. Cohen and J. Richman. Learning to match and cluster entity names. In ACM SIGIR' 01 Workshop on Mathematical/Formal Methods in Information Retrieval, 2001.
|
 |
16
|
|
| |
17
|
|
| |
18
|
I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Society, 64:1183--1210, 1969.
|
| |
19
|
Luis Gravano , Panagiotis G. Ipeirotis , H. V. Jagadish , Nick Koudas , S. Muthukrishnan , Divesh Srivastava, Approximate String Joins in a Database (Almost) for Free, Proceedings of the 27th International Conference on Very Large Data Bases, p.491-500, September 11-14, 2001
|
 |
20
|
|
| |
21
|
|
| |
22
|
J. Ko, T. Mitamura, and E. Nyberg. Language-independent probabilistic answer ranking for question answering. In ACL, 2007.
|
| |
23
|
D. Koller and N. Friedman. Structured probabilistic models. Under preparation, 2007.
|
| |
24
|
|
 |
25
|
|
 |
26
|
Andrew McCallum , Kamal Nigam , Lyle H. Ungar, Efficient clustering of high-dimensional data sets with application to reference matching, Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, p.169-178, August 20-23, 2000, Boston, Massachusetts, United States
[doi> 10.1145/347090.347123]
|
| |
27
|
A. McCallum and B. Wellner. Toward conditional models of identity uncertainty with application to proper noun coreference. In Proceedings of the IJCAI-2003 Workshop on Information Integration on the Web, pages 79--86, Acapulco, Mexico, Aug. 2003.
|
| |
28
|
A. E. Monge and C. P. Elkan. The field matching problem: Algorithms and applications. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), 1996.
|
| |
29
|
H. Pasula, B. Marthi, B. Milch, S. Russell, and I. Shpitser. Identity uncertainty and citation matching. In Advances in Neural Processing Systems 15, Vancouver, British Columbia, 2002. MIT Press.
|
| |
30
|
S. Sarawagi. The crf project: a java implementation. http://crf.sourceforge.net, 2004.
|
 |
31
|
|
| |
32
|
S. Sarawagi and A. Kirpal. Scaling up the alias duplicate elimination system: A demostration. In Proc. of the 19th IEEE Int'l Conference on Data Engineering (ICDE), Bangalore, March 2003.
|
 |
33
|
|
 |
34
|
|
 |
35
|
|
 |
36
|
Michael L. Wick , Khashayar Rohanimanesh , Karl Schultz , Andrew McCallum, A unified approach for schema matching, coreference and canonicalization, Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, August 24-27, 2008, Las Vegas, Nevada, USA
[doi> 10.1145/1401890.1401977]
|
 |
37
|
|
|