ABSTRACT
Scientific workflow systems are increasingly used to automate complex data analyses, largely due to their benefits over traditional approaches for workflow design, optimization, and provenance recording. Many workflow systems employ a simple dependency model to represent the provenance of data produced by workflow runs. Although commonly adopted, this model does not capture explicit data dependencies introduced by "provenance-aware" processes, and it can lead to inefficient storage when workflow data is complex or structured. We present a provenance model, extending the conventional approach, that supports (i) explicit data dependencies and (ii) nested data collections. Our model adopts techniques from reference-based XML versioning, adding annotations for process and data dependencies. We present strategies and reduction techniques to store immediate and transitive provenance information within our model, and examine trade-offs among update time, storage size, and query response time. We evaluate our approach on real-world and synthetic workflow execution traces, demonstrating significant reductions in storage size, while also reducing the time required to store and query provenance information.
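As a rough illustration of the distinction between immediate and transitive provenance mentioned above, the following Python sketch stores only immediate data dependencies and derives transitive lineage at query time. The item names, the `immediate_deps` mapping, and the `lineage` and `materialize` functions are hypothetical and are not taken from the model described here; the sketch only shows the general storage-versus-query trade-off under those assumptions.

```python
from collections import deque

# Hypothetical immediate dependencies: each data item maps to the items
# it was directly derived from (names are illustrative only).
immediate_deps = {
    "d3": {"d1", "d2"},
    "d4": {"d3"},
    "d5": {"d4", "d2"},
}


def lineage(item, deps):
    """Return every item that `item` transitively depends on."""
    seen = set()
    queue = deque(deps.get(item, ()))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited through another path
        seen.add(current)
        queue.extend(deps.get(current, ()))
    return seen


def materialize(deps):
    """Precompute transitive lineage for every item that has dependencies."""
    return {item: lineage(item, deps) for item in deps}
```

Keeping only immediate edges, as in `immediate_deps`, uses less storage but requires a traversal per query. Precomputing with `materialize` turns a query into a lookup at the cost of extra storage and update work.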