ABSTRACT
Information integration often faces the problem that different data sources represent the same set of real-world objects but give conflicting values for specific properties of these objects. In this paper we present a model of such conflicts and describe an algorithm for efficiently detecting patterns of conflicts in a pair of overlapping data sources. The contradiction patterns we find are a special kind of association rule, describing regularities in conflicts that occur together with certain attribute values, pairs of attribute values, or other conflicts. We therefore adapt existing association rule mining algorithms to mine contradiction patterns. Such patterns are an important tool for human experts who try to find and resolve data quality problems using domain knowledge. We present the results of applying our method to a real-world data set from the life science domain and show how it helps to generate clean data for integrated data warehouses.
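To make the idea of a contradiction pattern concrete, the following is a minimal sketch and not the algorithm from the paper. It assumes two sources stored as dictionaries keyed by a shared object identifier, and it finds rules of the form "attribute = value implies a conflict in another attribute" by brute-force counting rather than by an adapted association rule miner. The function names, thresholds, and sample records are all hypothetical.

```python
from collections import Counter


def find_conflicts(src1, src2):
    """Return {object_id: set of attribute names whose values differ}."""
    conflicts = {}
    for oid in src1.keys() & src2.keys():  # only overlapping objects
        a, b = src1[oid], src2[oid]
        conflicts[oid] = {attr for attr in a.keys() & b.keys() if a[attr] != b[attr]}
    return conflicts


def conflict_rules(src1, conflicts, min_support=2, min_confidence=0.8):
    """Find rules (attr == value in src1) -> conflict in a different attribute.

    Support is the number of objects matching both sides of the rule;
    confidence is that count divided by the objects matching the left side.
    """
    antecedent = Counter()
    joint = Counter()
    for oid, conflicting in conflicts.items():
        for attr, value in src1[oid].items():
            antecedent[(attr, value)] += 1
            for target in conflicting:
                if target != attr:  # skip the trivial self-rule
                    joint[(attr, value, target)] += 1
    rules = []
    for (attr, value, target), n in joint.items():
        confidence = n / antecedent[(attr, value)]
        if n >= min_support and confidence >= min_confidence:
            rules.append((attr, value, target, n, confidence))
    return sorted(rules)


# Hypothetical sample data: two sources describing the same four objects.
src1 = {
    "p1": {"method": "NMR", "resolution": "2.0"},
    "p2": {"method": "NMR", "resolution": "1.5"},
    "p3": {"method": "X-ray", "resolution": "2.5"},
    "p4": {"method": "X-ray", "resolution": "3.0"},
}
src2 = {
    "p1": {"method": "NMR", "resolution": "n/a"},
    "p2": {"method": "NMR", "resolution": "n/a"},
    "p3": {"method": "X-ray", "resolution": "2.5"},
    "p4": {"method": "X-ray", "resolution": "3.0"},
}

rules = conflict_rules(src1, find_conflicts(src1, src2))
for attr, value, target, support, confidence in rules:
    print(f"{attr} = {value} -> conflict in {target} "
          f"(support={support}, confidence={confidence:.2f})")
```

In this toy data every object with `method = NMR` disagrees on `resolution`, so the sketch reports a single rule. A domain expert could read such a rule as a hint that one source encodes the value differently for that class of objects.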