Lineage retrieval for scientific data processing: a survey
Source: ACM Computing Surveys (CSUR)
Volume 37, Issue 1 (March 2005)
Pages: 1-28
Year of Publication: 2005
ISSN: 0360-0300
Authors
Rajendra Bose, Bren School of Environmental Science and Management, University of California, Santa Barbara, CA
James Frew, Bren School of Environmental Science and Management, University of California, Santa Barbara, CA
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1057977.1057978

ABSTRACT

Scientific research relies as much on the dissemination and exchange of data sets as on the publication of conclusions. Accurately tracking the lineage (origin and subsequent processing history) of scientific data sets is thus imperative for the complete documentation of scientific work. Researchers are effectively prevented from determining, preserving, or providing the lineage of the computational data products they use and create, however, because of the lack of a definitive model for lineage retrieval and a poor fit between current data management tools and scientific software. Based on a comprehensive survey of lineage research and previous prototypes, we present a metamodel to help identify and assess the basic components of systems that provide lineage retrieval for scientific data products.

