|
ABSTRACT
We study the novel problem of efficiently computing the update distance for a pair of relational databases. In analogy to the edit distance of strings, we define the update distance of two databases as the minimal number of set-oriented insert, delete and modification operations necessary to transform one database into the other. We show how this distance can be computed by traversing a search space of database instances connected by update operations. This insight leads to a family of algorithms that compute the update distance or approximations of it. In our experiments we observed that a simple heuristic performs surprisingly well in most considered cases.Our motivation for studying distance measures for databases stems from the field of scientific databases. There, replicas of a single database are often maintained at different sites, which typically leads to (accidental or planned) divergence of their content. To re-create a consistent view, these differences must be resolved. Such an effort requires an understanding of the process that produced them. We found that minimal update sequences of set-oriented update operations are a proper and concise representation of systematic errors, thus giving valuable clues to domain experts responsible for conflict resolution.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
Marcelo Arenas , Leopoldo Bertossi , Jan Chomicki, Consistent query answers in inconsistent databases, Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, p.68-79, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States
[doi> 10.1145/303976.303983]
|
| |
2
|
|
 |
3
|
|
| |
4
|
H. M.Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, P. E. Bourne. The Protein Data Bank. Nucleic Acids Research, Vol. 28(1), 2000, 235--242.
|
| |
5
|
T. N. Bhat, et al. The PDB data uniformity project, Nucleic Acid Research, Vol. 29(1), 2001, 214--218.
|
| |
6
|
H. Boutselakis, et al. E-MSD: the European Bioinformatics Institute Macromolecular Structure Database. Nucleic Acid Research, Vol. 31(1), 2003, 458--462.
|
 |
7
|
|
 |
8
|
|
 |
9
|
Gao Cong , Anthony K. H. Tung , Xin Xu , Feng Pan , Jiong Yang, FARMER: finding interesting rule groups in microarray datasets, Proceedings of the 2004 ACM SIGMOD international conference on Management of data, June 13-18, 2004, Paris, France
[doi> 10.1145/1007568.1007587]
|
| |
10
|
|
| |
11
|
|
 |
12
|
|
| |
13
|
|
| |
14
|
V. I. Levenshtein. Binary codes capable of correcting insertions and reversals. Sov. Phys. Dokl., 10:707--10, 1966.
|
| |
15
|
A. E. Monge, C. P. Elkan. An efficient domain-independent algorithm for detecting approximately duplicate database tuples. Proc. SIGMOD Workshop on Data Mining and Knowledge Discovery, 1997.
|
 |
16
|
|
| |
17
|
H. Müller, J.-C. Freytag, and U. Leser. On the Distance of Databases, Technical Report, HUB-IB-199, March 2006.
|
| |
18
|
|
| |
19
|
K. Rother, H. Müller, S. Trissl, I. Koch, T. Steinke, R. Preissner, C. Frömmel, U. Leser. COLUMBA: Multidimensional Data Integration of Protein Annotations, Int. Workshop on Data Integration in Life Sciences (DILS 2004), Leipzig, Germany, 2004.
|
| |
20
|
|
 |
21
|
|
| |
22
|
W. Winkler. The state of record linkage and current research problems. Technical report, Statistical Research Division, U.S. Bureau of the Census, Washington, DC, 1999.
|
| |
23
|
M. J. Zaki and C.-J. Hsiao. CHARM: An efficient algorithm for closed itemset mining. In Proc. of the Second SIAM Int. Conference on Data Mining, Arlington, VA, 2002.
|
|