ACM Home Page
Please provide us with feedback. Feedback
Describing differences between databases
Full text PdfPdf (394 KB)
Source Conference on Information and Knowledge Management archive
Proceedings of the 15th ACM international conference on Information and knowledge management table of contents
Arlington, Virginia, USA
SESSION: Summarization table of contents
Pages: 612 - 621  
Year of Publication: 2006
ISBN:1-59593-433-2
Authors
Heiko Müller  Humboldt-Universität zu Berlin, Berlin, Germany
Johann-Christoph Freytag  Humboldt-Universität zu Berlin, Berlin, Germany
Ulf Leser  Humboldt-Universität zu Berlin, Berlin, Germany
Sponsors
ACM: Association for Computing Machinery
SIGIR: ACM Special Interest Group on Information Retrieval
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 8,   Downloads (12 Months): 65,   Citation Count: 0
Additional Information:

abstract   references   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1183614.1183702
What is a DOI?

ABSTRACT

We study the novel problem of efficiently computing the update distance for a pair of relational databases. In analogy to the edit distance of strings, we define the update distance of two databases as the minimal number of set-oriented insert, delete and modification operations necessary to transform one database into the other. We show how this distance can be computed by traversing a search space of database instances connected by update operations. This insight leads to a family of algorithms that compute the update distance or approximations of it. In our experiments we observed that a simple heuristic performs surprisingly well in most considered cases.Our motivation for studying distance measures for databases stems from the field of scientific databases. There, replicas of a single database are often maintained at different sites, which typically leads to (accidental or planned) divergence of their content. To re-create a consistent view, these differences must be resolved. Such an effort requires an understanding of the process that produced them. We found that minimal update sequences of set-oriented update operations are a proper and concise representation of systematic errors, thus giving valuable clues to domain experts responsible for conflict resolution.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

1
 
2
3
 
4
H. M.Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, P. E. Bourne. The Protein Data Bank. Nucleic Acids Research, Vol. 28(1), 2000, 235--242.
 
5
T. N. Bhat, et al. The PDB data uniformity project, Nucleic Acid Research, Vol. 29(1), 2001, 214--218.
 
6
H. Boutselakis, et al. E-MSD: the European Bioinformatics Institute Macromolecular Structure Database. Nucleic Acid Research, Vol. 31(1), 2003, 458--462.
7
8
9
 
10
 
11
12
 
13
 
14
V. I. Levenshtein. Binary codes capable of correcting insertions and reversals. Sov. Phys. Dokl., 10:707--10, 1966.
 
15
A. E. Monge, C. P. Elkan. An efficient domain-independent algorithm for detecting approximately duplicate database tuples. Proc. SIGMOD Workshop on Data Mining and Knowledge Discovery, 1997.
16
 
17
H. Müller, J.-C. Freytag, and U. Leser. On the Distance of Databases, Technical Report, HUB-IB-199, March 2006.
 
18
 
19
K. Rother, H. Müller, S. Trissl, I. Koch, T. Steinke, R. Preissner, C. Frömmel, U. Leser. COLUMBA: Multidimensional Data Integration of Protein Annotations, Int. Workshop on Data Integration in Life Sciences (DILS 2004), Leipzig, Germany, 2004.
 
20
21
 
22
W. Winkler. The state of record linkage and current research problems. Technical report, Statistical Research Division, U.S. Bureau of the Census, Washington, DC, 1999.
 
23
M. J. Zaki and C.-J. Hsiao. CHARM: An efficient algorithm for closed itemset mining. In Proc. of the Second SIAM Int. Conference on Data Mining, Arlington, VA, 2002.

Collaborative Colleagues:
Heiko Müller: colleagues
Johann-Christoph Freytag: colleagues
Ulf Leser: colleagues