ABSTRACT
The range of information now available in queryable repositories opens up a host of possibilities for new and valuable forms of data analysis. Database query languages such as SQL and XQuery offer a concise, high-level means of implementing such analyses, facilitating the extraction of relevant data subsets into either generic or bespoke data analysis environments. Unfortunately, the quality of data in these repositories is often highly variable. The data is still useful, but only if the consumer is aware of the data quality problems and can work around them. Standard query languages offer little support for this aspect of data management. In principle, however, it should be possible to embed constraints describing the consumer’s data quality requirements directly into the query, so that the query evaluator can take over responsibility for enforcing them during query processing. Most previous attempts to incorporate information quality constraints into database queries have been based on a small number of highly generic quality measures, which are defined and computed by the information provider. This is a useful approach in some application areas but, in practice, quality criteria are more commonly determined by the user of the information, not by the provider. In this article, we explore an approach to incorporating quality constraints into database queries in which the definition of quality is set by the user, not the provider, of the information. Our approach is based on the concept of a quality view, a configurable quality assessment component into which domain-specific notions of quality can be embedded. We examine how quality views can be incorporated into XQuery, and derive from this the language features that are required in general to embed quality views into any query language. We also propose some syntactic sugar on top of XQuery to simplify the process of querying with quality constraints.