Data fusion
Source
ACM Computing Surveys (CSUR)
Volume 41, Issue 1 (December 2008)
Article No. 1
Year of Publication: 2008
ISSN:0360-0300
Authors
Jens Bleiholder, Hasso-Plattner-Institut, Potsdam, Germany
Felix Naumann, Hasso-Plattner-Institut, Potsdam, Germany
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1456650.1456651

ABSTRACT

The development of the Internet in recent years has made it possible and useful to access many different information systems anywhere in the world to obtain information. While there is much research on the integration of heterogeneous information systems, most commercial systems stop short of the actual integration of available data. Data fusion is the process of fusing multiple records representing the same real-world object into a single, consistent, and clean representation.

This article places data fusion into the greater context of data integration, precisely defines the goals of data fusion, namely, complete, concise, and consistent data, and highlights the challenges of data fusion, namely, uncertain and conflicting data values. We give an overview and classification of different ways of fusing data and present several techniques based on standard and advanced operators of the relational algebra and SQL. Finally, the article features a comprehensive survey of data integration systems from academia and industry, showing if and how data fusion is performed in each.
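
As a rough illustration of the definition above, the following Python sketch fuses several records that describe the same real-world object into one record. It is not taken from the article: the function name fuse_records, the sample data, and the resolution rule (majority vote among non-null values, ties going to the value seen first) are all assumptions chosen for brevity. A missing value (None) stands in for uncertain data, and differing non-null values stand in for conflicting data.

```python
from collections import Counter, defaultdict


def fuse_records(records, key):
    """Merge all records sharing the same `key` value into one record.

    Hypothetical helper for illustration only; the resolution strategy
    (majority vote over non-null values) is an assumption, not the
    article's prescribed method.
    """
    # Group the records that represent the same real-world object.
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)

    fused = []
    for key_value, group in groups.items():
        merged = {key: key_value}

        # Collect every attribute that appears in the group (completeness).
        attrs = []
        for rec in group:
            for attr in rec:
                if attr != key and attr not in attrs:
                    attrs.append(attr)

        for attr in attrs:
            # Ignore missing values; they carry no information.
            values = [rec[attr] for rec in group if rec.get(attr) is not None]
            if values:
                # Conflict: keep the most frequent value. On a tie,
                # Counter.most_common returns the value seen first.
                merged[attr] = Counter(values).most_common(1)[0][0]
            else:
                # Uncertainty: no source supplies a value.
                merged[attr] = None
        fused.append(merged)
    return fused


# Made-up sample data: three sources describe object 1, one describes object 2.
records = [
    {"id": 1, "name": "Alice", "city": "Berlin", "age": None},
    {"id": 1, "name": "Alice", "city": "Potsdam", "age": 30},
    {"id": 1, "name": "Alice", "city": "Potsdam"},
    {"id": 2, "name": "Bob", "city": None, "age": None},
]

for row in fuse_records(records, "id"):
    print(row)
```

Each object ends up with exactly one record (concise), every attribute seen in any source is retained (complete), and each attribute holds a single value (consistent). Other resolution rules, such as preferring a trusted source or the most recent value, could replace the majority vote without changing the overall structure.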



