Scaling up duplicate detection in graph data
Source
Conference on Information and Knowledge Management
Proceedings of the 17th ACM Conference on Information and Knowledge Management
Napa Valley, California, USA
POSTER SESSION: Poster session 1, database
Pages 1325-1326  
Year of Publication: 2008
ISBN:978-1-59593-991-3
Authors
Melanie Herschel  Hasso-Plattner-Institut, Potsdam, Germany
Felix Naumann  Hasso-Plattner-Institut, Potsdam, Germany
Sponsors
ACM: Association for Computing Machinery
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
SIGIR: ACM Special Interest Group on Information Retrieval
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1458082.1458259

ABSTRACT

Duplicate detection identifies different representations of the same real-world object in a database. Recent research has considered the use of relationships among object representations to improve duplicate detection. In the general case where relationships form a graph, research has mainly focused on the quality (effectiveness) of duplicate detection. Scalability has been neglected so far, even though it is crucial for large real-world duplicate detection tasks.

We scale up duplicate detection in graph data (DDG) to large amounts of data using the support of a relational database system. We first generalize the process of DDG and then present how to scale DDG in space (amount of data processed with limited main memory) and in time. Finally, we explore how complex similarity computation can be performed efficiently. Experiments on data an order of magnitude larger than data considered so far in DDG clearly show that our methods scale to large amounts of data.
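
To make the idea of relationship-aware duplicate detection backed by a relational store more concrete, the following is a minimal sketch and not the authors' algorithm. It assumes a toy schema (`obj`, `rel`, `cluster`), a token-based Jaccard similarity, an equal weighting of label and neighbor similarity, and a threshold of 0.5; all of these names and values are hypothetical choices made for illustration. Candidate pairs and neighbor lookups come from SQL queries against SQLite, and pairs are re-examined until no new duplicates are found, so that a detected duplicate (two movie entries) can raise the similarity of related objects (their actors).

```python
import re
import sqlite3


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as 0.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def tokens(label):
    """Lowercased alphanumeric tokens of a label."""
    return set(re.findall(r"[a-z0-9]+", label.lower()))


def load_example():
    """Build a tiny in-memory graph: objects plus relationships (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE obj(id INTEGER PRIMARY KEY, kind TEXT, label TEXT);
        CREATE TABLE rel(src INTEGER, dst INTEGER);
        CREATE TABLE cluster(id INTEGER PRIMARY KEY, rep INTEGER);
    """)
    conn.executemany("INSERT INTO obj VALUES (?, ?, ?)", [
        (1, "movie", "The Matrix"),
        (2, "movie", "Matrix, The"),
        (3, "actor", "Keanu Reeves"),
        (4, "actor", "K. Reeves"),
        (5, "movie", "Speed"),
        (6, "actor", "Sandra Bullock"),
    ])
    conn.executemany("INSERT INTO rel VALUES (?, ?)", [(1, 3), (2, 4), (5, 6)])
    # Initially every object is its own cluster representative.
    conn.execute("INSERT INTO cluster SELECT id, id FROM obj")
    return conn


def neighbor_reps(conn, obj_id):
    """Cluster representatives of all objects related to obj_id (either direction)."""
    rows = conn.execute("""
        SELECT c.rep FROM rel r JOIN cluster c ON c.id = r.dst WHERE r.src = ?
        UNION
        SELECT c.rep FROM rel r JOIN cluster c ON c.id = r.src WHERE r.dst = ?
    """, (obj_id, obj_id))
    return {row[0] for row in rows}


def rep_of(conn, obj_id):
    return conn.execute("SELECT rep FROM cluster WHERE id = ?", (obj_id,)).fetchone()[0]


def detect_duplicates(conn, threshold=0.5):
    """Repeat pairwise comparison until no pair is newly classified as duplicate."""
    # Candidate pairs: objects of the same kind, generated by a self-join.
    pairs = conn.execute("""
        SELECT a.id, a.label, b.id, b.label
        FROM obj a JOIN obj b ON a.kind = b.kind AND a.id < b.id
        ORDER BY a.id DESC, b.id DESC
    """).fetchall()
    changed = True
    while changed:
        changed = False
        for a_id, a_label, b_id, b_label in pairs:
            rep_a, rep_b = rep_of(conn, a_id), rep_of(conn, b_id)
            if rep_a == rep_b:
                continue  # already in the same cluster
            score = (0.5 * jaccard(tokens(a_label), tokens(b_label))
                     + 0.5 * jaccard(neighbor_reps(conn, a_id), neighbor_reps(conn, b_id)))
            if score >= threshold:
                # Merge the two clusters under the smaller representative id.
                conn.execute("UPDATE cluster SET rep = ? WHERE rep = ?",
                             (min(rep_a, rep_b), max(rep_a, rep_b)))
                changed = True
    return dict(conn.execute("SELECT id, rep FROM cluster ORDER BY id").fetchall())


if __name__ == "__main__":
    print(detect_duplicates(load_example()))
```

In this toy data the two actor entries are not similar enough on their labels alone; they are only merged in a second pass, after the two movie entries they relate to have been placed in the same cluster. Keeping objects, relationships, and cluster assignments in database tables rather than in application memory is one simple way to illustrate the "limited main memory" concern, although a realistic system would need far more careful batching and indexing.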

