ACM Home Page
Please provide us with feedback. Feedback
Time-completeness trade-offs in record linkage using adaptive query processing
Full text PdfPdf (1.30 MB)
Source Extending Database Technology; Vol. 360 archive
Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology table of contents
Saint Petersburg, Russia
SESSION: Research sessions: Query processing table of contents
Pages 851-861  
Year of Publication: 2009
ISBN:978-1-60558-422-5
Authors
Roald Lengu  Università di Genova, Italy
Paolo Missier  University of Manchester, UK
Alvaro A. A. Fernandes  University of Manchester, UK
Giovanna Guerrini  Università di Genova, Italy
Marco Mesiti  Università di Milano, Italy
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 12,   Downloads (12 Months): 47,   Citation Count: 0
Additional Information:

abstract   references   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1516360.1516458
What is a DOI?

ABSTRACT

Applications that involve data integration among multiple sources often require a preliminary step of data reconciliation in order to ensure that tuples match correctly across the sources. In dynamic settings such as data mashups, however, traditional offline data reconciliation techniques that require prior availability of the data may not be applicable. The alternative, performing similarity joins at query time, is computationally expensive, while ignoring the mismatch problem altogether leads to an incomplete integration. In this paper we make the assumption that, in some dynamic integration scenarios, users may agree to trade the completeness of a join result in return for a faster computation. We explore the consequences of this assumption by proposing a novel, hybrid join algorithm that involves a combination of exact and approximate join operators, managed using adaptive query processing techniques. The algorithm is optimistic: it can switch between physical join operators multiple times throughout query processing, but it only resorts to approximate join operators when there is statistical evidence that result completeness is compromised. Our experiments show that sensible savings in join execution time can be achieved in practice, at the expense of a modest reduction in result completeness.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
 
2
J. Barateiro and H. Galhardas. A survey of data quality tools. Datenbank-Spektrum, 14:15--21, 2005.
 
3
 
4
S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In Liu et al. {24}, page 5.
 
5
6
 
7
U. Dayal, K. Ramamritham, and T. M. Vijayaraman, editors. Proceedings of the 19th International Conference on Data Engineering, March 5--8, 2003, Bangalore, India. IEEE Computer Society, 2003.
 
8
 
9
 
10
 
11
K. Eurviriyanukul, A. A. A. Fernandes, and N. W. Paton. A foundation for the replacement of pipelined physical join operators in adaptive query processing. In T. Grust, H. Höpfner, A. Illarramendi, S. Jablonski, M. Mesiti, S. Müller, P.-L. Patranjan, K.-U. Sattler, M. Spiliopoulou, and J. Wijsen, editors, EDBT Workshops, volume 4254 of Lecture Notes in Computer Science, pages 589--600. Springer, 2006.
 
12
S. Ewen, H. Kache, V. Markl, and V. Raman. Progressive query optimization for federated queries. In Y. E. Ioannidis, M. H. Scholl, J. W. Schmidt, F. Matthes, M. Hatzopoulos, K. Böhm, A. Kemper, T. Grust, and C. Böhm, editors, EDBT, volume 3896 of Lecture Notes in Computer Science, pages 847--864. Springer, 2006.
 
13
 
14
 
15
A. Gounaris, J. Smith, N. W. Paton, R. Sakellariou, A. A. A. Fernandes, and P. Watson. Adapting to changing resource performance in grid query processing. In J.-M. Pierson, editor, DMG, volume 3836 of Lecture Notes in Computer Science, pages 30--44. Springer, 2005.
16
 
17
18
19
 
20
J. Kang, J. F. Naughton, and S. Viglas. Evaluating window joins over unbounded streams. In Dayal et al. {7}, pages 341--352.
 
21
 
22
 
23
R. Lengu, et al. Symmetric Set Hash Join: A Pipelined Approximate Join Algorithm. Technical Report DISI-TR-08-03, DISI, Università di Genova, Via Dodecaneso, Genova, IT, 2008. Available at: ftp://ftp.disi.unige.it/person/GuerriniG/reports/sshjoinTR.pdf.
 
24
L. Liu, A. Reuter, K.-Y. Whang, and J. Zhang, editors. Proceedings of the 22nd International Conference on Data Engineering, ICDE 2006, 3--8 April 2006, Atlanta, GA, USA. IEEE Computer Society, 2006.
25
 
26
F. Naumann and K.-U. Sattler. Information quality: Fundamentals, techniques, and use. Tutorial at EDBT, 2006.
 
27
 
28
M. A. Shah, J. M. Hellerstein, S. Chandrasekaran, and M. J. Franklin. Flux: An adaptive partitioning operator for continuous query systems. In Dayal et al. {7}, pages 25--36.
 
29
A. Thor, D. Aumueller, and E. Rahm. Data Integration Support for Mashups. In IIWeb, 2007.
 
30
G. Weikum, A. C. König, and S. Deßloch, editors. Proceedings of the ACM SIGMOD International Conference on Management of Data, Paris, France, June 13--18, 2004. ACM, 2004.
 
31
 
32
W. E. Winkler. Overview of record linkage and current research directions. Research Report Statistics #2006--2, Statistical Research Division, U.S. Census Bureau, Washington, DC 20233, 2006. Available at http://www.census.gov/srd/papers/pdf/rrs2006-02.pdf.
Collaborative Colleagues:
Roald Lengu: colleagues
Paolo Missier: colleagues
Alvaro A. A. Fernandes: colleagues
Giovanna Guerrini: colleagues
Marco Mesiti: colleagues