ABSTRACT
Data management is growing in complexity as large-scale applications take advantage of the loosely coupled resources brought together by grid middleware and by abundant storage capacity. Metadata describing the data products used in and generated by these applications is essential to disambiguate the data and enable reuse. Data provenance, one kind of metadata, pertains to the derivation history of a data product starting from its original sources.

In this paper we create a taxonomy of data provenance characteristics and apply it to current research efforts in e-science, focusing primarily on scientific workflow approaches. The main aspect of our taxonomy categorizes provenance systems based on why they record provenance, what they describe, how they represent and store provenance, and how they disseminate it. The survey culminates in an identification of open research problems in the field.