Efficient provenance storage over nested data collections
Source: Extending Database Technology; Vol. 360
Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology
Saint Petersburg, Russia
SESSION: Research sessions: Provenance
Pages: 958-969
Year of Publication: 2009
ISBN: 978-1-60558-422-5
Authors
Manish Kumar Anand, University of California, Davis
Shawn Bowers, University of California, Davis
Timothy McPhillips, University of California, Davis
Bertram Ludäscher, University of California, Davis
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1516360.1516470

ABSTRACT

Scientific workflow systems are increasingly used to automate complex data analyses, largely due to their benefits over traditional approaches for workflow design, optimization, and provenance recording. Many workflow systems employ a simple dependency model to represent the provenance of data produced by workflow runs. Although commonly adopted, this model does not capture explicit data dependencies introduced by "provenance-aware" processes, and it can lead to inefficient storage when workflow data is complex or structured. We present a provenance model, extending the conventional approach, that supports (i) explicit data dependencies and (ii) nested data collections. Our model adopts techniques from reference-based XML versioning, adding annotations for process and data dependencies. We present strategies and reduction techniques to store immediate and transitive provenance information within our model, and examine trade-offs among update time, storage size, and query response time. We evaluate our approach on real-world and synthetic workflow execution traces, demonstrating significant reductions in storage size, while also reducing the time required to store and query provenance information.
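
As a rough illustration of why nested collections can reduce provenance storage, the following minimal sketch stores a dependency annotation once on a collection node and lets its descendants inherit it, then derives transitive provenance from the expanded immediate dependencies. It is not the authors' encoding: the `Node` class, the inheritance rule, and the functions `immediate_deps` and `transitive_deps` are all hypothetical names and assumptions introduced here for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A data item or collection in a nested (tree-shaped) trace (hypothetical)."""
    node_id: str
    children: list["Node"] = field(default_factory=list)
    # Assumption: an annotation stored here also applies to all descendants.
    depends_on: set[str] = field(default_factory=set)


def immediate_deps(root: Node) -> dict[str, set[str]]:
    """Expand collection-level annotations into per-node immediate dependencies."""
    result: dict[str, set[str]] = {}

    def visit(node: Node, inherited: frozenset[str]) -> None:
        deps = inherited | node.depends_on
        result[node.node_id] = set(deps)
        for child in node.children:
            visit(child, frozenset(deps))

    visit(root, frozenset())
    return result


def transitive_deps(deps: dict[str, set[str]], start: str) -> set[str]:
    """Follow immediate dependencies until no new ancestors appear."""
    seen: set[str] = set()
    stack = list(deps.get(start, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(deps.get(current, ()))
    return seen


# Input item "a"; collection "m" (derived from "a") holding m1 and m2;
# output item "o" derived from m1. Only two annotations are stored.
trace = Node("root", children=[
    Node("a"),
    Node("m", children=[Node("m1"), Node("m2")], depends_on={"a"}),
    Node("o", depends_on={"m1"}),
])
deps = immediate_deps(trace)
print(sorted(deps["m2"]))                  # dependencies inherited from "m"
print(sorted(transitive_deps(deps, "o")))  # all ancestors of "o"
```

In this sketch the storage saving comes from annotating `m` once instead of annotating `m1` and `m2` separately; the price is an expansion step at query time, which is one instance of the trade-off among update time, storage size, and query response time that the abstract mentions.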

