ABSTRACT
The development of the Internet in recent years has made it possible and useful to access many different information systems anywhere in the world to obtain information. While there is much research on the integration of heterogeneous information systems, most commercial systems stop short of the actual integration of available data. Data fusion is the process of fusing multiple records representing the same real-world object into a single, consistent, and clean representation. This article places data fusion into the greater context of data integration, precisely defines the goals of data fusion, namely, complete, concise, and consistent data, and highlights the challenges of data fusion, namely, uncertain and conflicting data values. We give an overview and classification of different ways of fusing data and present several techniques based on standard and advanced operators of the relational algebra and SQL. Finally, the article features a comprehensive survey of data integration systems from academia and industry, showing if and how data fusion is performed in each.
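As a rough illustration of the definition above, the following minimal sketch fuses records that describe the same real-world object into one record per object. It is not the article's method: the function name `fuse`, the assumption that duplicates already share a common `key` value, and the majority-vote default for resolving conflicting values are all hypothetical choices made only for this example. Missing values (`None`) stand in for uncertain data, and differing non-null values stand in for conflicting data.

```python
from collections import Counter, defaultdict


def fuse(records, key, resolve=None):
    """Fuse records sharing the same `key` value into one record each.

    `resolve` maps a non-empty list of candidate values to a single value.
    The default (majority vote, ties going to the first value seen) is an
    assumption for this sketch, not a strategy taken from the article.
    """
    if resolve is None:
        resolve = lambda values: Counter(values).most_common(1)[0][0]

    # Group records that are assumed to represent the same object.
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)

    fused = []
    for key_value, group in groups.items():
        out = {key: key_value}

        # Collect every attribute seen in the group (completeness).
        attrs = []
        for rec in group:
            for attr in rec:
                if attr != key and attr not in attrs:
                    attrs.append(attr)

        for attr in attrs:
            # Ignore missing values; they carry no information.
            values = [rec[attr] for rec in group if rec.get(attr) is not None]
            # One value per attribute (conciseness and consistency).
            out[attr] = resolve(values) if values else None
        fused.append(out)
    return fused


records = [
    {"id": 1, "name": "Alice", "city": None},
    {"id": 1, "name": "Alice", "city": "Berlin"},
    {"id": 1, "name": "Alicia", "city": "Berlin"},
    {"id": 2, "name": "Bob"},
]
result = fuse(records, key="id")
```

Here the three records with `id` 1 collapse into a single record: the missing `city` is filled from the other sources, and the conflicting `name` values are settled by the resolution function. Passing a different `resolve` (for example `max`, or a function that prefers a trusted source) changes only how conflicts are settled, not how records are grouped.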