|
ABSTRACT
Applications that involve data integration among multiple sources often require a preliminary step of data reconciliation in order to ensure that tuples match correctly across the sources. In dynamic settings such as data mashups, however, traditional offline data reconciliation techniques that require prior availability of the data may not be applicable. The alternative, performing similarity joins at query time, is computationally expensive, while ignoring the mismatch problem altogether leads to an incomplete integration. In this paper we make the assumption that, in some dynamic integration scenarios, users may agree to trade the completeness of a join result in return for a faster computation. We explore the consequences of this assumption by proposing a novel, hybrid join algorithm that involves a combination of exact and approximate join operators, managed using adaptive query processing techniques. The algorithm is optimistic: it can switch between physical join operators multiple times throughout query processing, but it only resorts to approximate join operators when there is statistical evidence that result completeness is compromised. Our experiments show that sensible savings in join execution time can be achieved in practice, at the expense of a modest reduction in result completeness.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
| |
2
|
J. Barateiro and H. Galhardas. A survey of data quality tools. Datenbank-Spektrum, 14:15--21, 2005.
|
| |
3
|
|
| |
4
|
S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In Liu et al. {24}, page 5.
|
| |
5
|
|
 |
6
|
|
| |
7
|
U. Dayal, K. Ramamritham, and T. M. Vijayaraman, editors. Proceedings of the 19th International Conference on Data Engineering, March 5--8, 2003, Bangalore, India. IEEE Computer Society, 2003.
|
| |
8
|
|
| |
9
|
|
| |
10
|
|
| |
11
|
K. Eurviriyanukul, A. A. A. Fernandes, and N. W. Paton. A foundation for the replacement of pipelined physical join operators in adaptive query processing. In T. Grust, H. Höpfner, A. Illarramendi, S. Jablonski, M. Mesiti, S. Müller, P.-L. Patranjan, K.-U. Sattler, M. Spiliopoulou, and J. Wijsen, editors, EDBT Workshops, volume 4254 of Lecture Notes in Computer Science, pages 589--600. Springer, 2006.
|
| |
12
|
S. Ewen, H. Kache, V. Markl, and V. Raman. Progressive query optimization for federated queries. In Y. E. Ioannidis, M. H. Scholl, J. W. Schmidt, F. Matthes, M. Hatzopoulos, K. Böhm, A. Kemper, T. Grust, and C. Böhm, editors, EDBT, volume 3896 of Lecture Notes in Computer Science, pages 847--864. Springer, 2006.
|
| |
13
|
Helena Galhardas , Daniela Florescu , Dennis Shasha , Eric Simon , Cristian-Augustin Saita, Declarative Data Cleaning: Language, Model, and Algorithms, Proceedings of the 27th International Conference on Very Large Data Bases, p.371-380, September 11-14, 2001
|
| |
14
|
Anastasios Gounaris , Norman W. Paton , Rizos Sakellariou , Alvaro A. A. Fernandes , Jim Smith , Paul Watson, Practical Adaptation to Changing Resources in Grid Query Processing, Proceedings of the 22nd International Conference on Data Engineering, p.165, April 03-07, 2006
[doi> 10.1109/ICDE.2006.113]
|
| |
15
|
A. Gounaris, J. Smith, N. W. Paton, R. Sakellariou, A. A. A. Fernandes, and P. Watson. Adapting to changing resource performance in grid query processing. In J.-M. Pierson, editor, DMG, volume 3836 of Lecture Notes in Computer Science, pages 30--44. Springer, 2005.
|
 |
16
|
|
| |
17
|
Luis Gravano , Panagiotis G. Ipeirotis , H. V. Jagadish , Nick Koudas , S. Muthukrishnan , Divesh Srivastava, Approximate String Joins in a Database (Almost) for Free, Proceedings of the 27th International Conference on Very Large Data Bases, p.491-500, September 11-14, 2001
|
 |
18
|
Wook-Shin Han , Jack Ng , Volker Markl , Holger Kache , Mokhtar Kandil, Progressive optimization in a shared-nothing parallel database, Proceedings of the 2007 ACM SIGMOD international conference on Management of data, June 11-14, 2007, Beijing, China
[doi> 10.1145/1247480.1247569]
|
 |
19
|
|
| |
20
|
J. Kang, J. F. Naughton, and S. Viglas. Evaluating window joins over unbounded streams. In Dayal et al. {7}, pages 341--352.
|
| |
21
|
|
| |
22
|
|
| |
23
|
R. Lengu, et al. Symmetric Set Hash Join: A Pipelined Approximate Join Algorithm. Technical Report DISI-TR-08-03, DISI, Università di Genova, Via Dodecaneso, Genova, IT, 2008. Available at: ftp://ftp.disi.unige.it/person/GuerriniG/reports/sshjoinTR.pdf.
|
| |
24
|
L. Liu, A. Reuter, K.-Y. Whang, and J. Zhang, editors. Proceedings of the 22nd International Conference on Data Engineering, ICDE 2006, 3--8 April 2006, Atlanta, GA, USA. IEEE Computer Society, 2006.
|
 |
25
|
Volker Markl , Vijayshankar Raman , David Simmen , Guy Lohman , Hamid Pirahesh , Miso Cilimdzic, Robust query processing through progressive optimization, Proceedings of the 2004 ACM SIGMOD international conference on Management of data, June 13-18, 2004, Paris, France
[doi> 10.1145/1007568.1007642]
|
| |
26
|
F. Naumann and K.-U. Sattler. Information quality: Fundamentals, techniques, and use. Tutorial at EDBT, 2006.
|
| |
27
|
|
| |
28
|
M. A. Shah, J. M. Hellerstein, S. Chandrasekaran, and M. J. Franklin. Flux: An adaptive partitioning operator for continuous query systems. In Dayal et al. {7}, pages 25--36.
|
| |
29
|
A. Thor, D. Aumueller, and E. Rahm. Data Integration Support for Mashups. In IIWeb, 2007.
|
| |
30
|
G. Weikum, A. C. König, and S. Deßloch, editors. Proceedings of the ACM SIGMOD International Conference on Management of Data, Paris, France, June 13--18, 2004. ACM, 2004.
|
| |
31
|
|
| |
32
|
W. E. Winkler. Overview of record linkage and current research directions. Research Report Statistics #2006--2, Statistical Research Division, U.S. Census Bureau, Washington, DC 20233, 2006. Available at http://www.census.gov/srd/papers/pdf/rrs2006-02.pdf.
|
|