E = MC3: managing uncertain enterprise data in a cluster-computing environment
Source
International Conference on Management of Data
Proceedings of the 35th SIGMOD international conference on Management of data
Providence, Rhode Island, USA
SESSION: Research session 12: probabilistic databases II
Pages 441-454
Year of Publication: 2009
ISBN: 978-1-60558-551-2
Authors
Fei Xu, University of Florida, Gainesville, FL, USA
Kevin Beyer, IBM Almaden Research Center, San Jose, CA, USA
Vuk Ercegovac, IBM Almaden Research Center, San Jose, CA, USA
Peter J. Haas, IBM Almaden Research Center, San Jose, CA, USA
Eugene J. Shekita, IBM Almaden Research Center, San Jose, CA, USA
ABSTRACT
Modern enterprises must manage uncertain data for purposes of risk assessment and decision making under uncertainty. The Monte Carlo approach embodied in the MCDB system of Jampani et al. is well suited for such a task. MCDB can support industrial-strength business-intelligence queries over uncertain warehouse data. Moreover, MCDB's extensible approach to specifying uncertainty can also capture complex stochastic prediction models, allowing sophisticated "what-if" analyses within the DBMS. The MCDB computations can be highly CPU intensive, but offer the potential for massive parallelization. To realize this potential, we provide a new system, called MC3 (Monte Carlo Computation on a Cluster), that extends the MCDB approach to the map-reduce processing framework. MC3 can exploit the robustness and scalability of map-reduce, and can handle data stored in non-relational formats. We show how MCDB query plans over "tuple bundles" can be translated to sequences of map-reduce operations over nested data, and describe different parallelization schemes. We also provide and analyze several novel distributed algorithms for adding pseudorandom number seeds to tuple bundles. These algorithms ensure statistical correctness of the Monte Carlo computations while minimizing the seed length. Our experiments show that MC3 can scale well for a variety of workloads.
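As a rough illustration of the idea of seeded tuple bundles processed in a map-reduce style, the following minimal sketch may help. It is not taken from the paper: the record fields (`id`, `mean`, `sd`), the function names, the normal distribution, and the hash-based seed derivation are all assumptions made for illustration, and the hashing step is a simple stand-in rather than any of the distributed seeding algorithms the abstract refers to.

```python
import hashlib
import random
from collections import defaultdict

N_ITER = 4  # number of Monte Carlo repetitions (hypothetical value)


def seed_for(global_seed: int, key: str) -> int:
    """Derive a per-tuple seed by hashing a global seed and the tuple key.

    Illustrative stand-in only; not the paper's seed-assignment algorithm.
    """
    digest = hashlib.sha256(f"{global_seed}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def make_bundle(record: dict, global_seed: int) -> dict:
    """Attach a seed to a record, giving a toy 'tuple bundle'."""
    return {**record, "seed": seed_for(global_seed, record["id"])}


def map_bundle(bundle: dict):
    """Map step: expand one bundle into (iteration, sampled value) pairs."""
    rng = random.Random(bundle["seed"])
    for i in range(N_ITER):
        yield i, rng.gauss(bundle["mean"], bundle["sd"])


def reduce_sum(pairs):
    """Reduce step: sum the sampled values per Monte Carlo iteration."""
    totals = defaultdict(float)
    for i, value in pairs:
        totals[i] += value
    return [totals[i] for i in range(N_ITER)]


def run(records, global_seed=42):
    bundles = [make_bundle(r, global_seed) for r in records]
    pairs = [p for b in bundles for p in map_bundle(b)]
    return reduce_sum(pairs)


records = [
    {"id": "a", "mean": 100.0, "sd": 5.0},
    {"id": "b", "mean": 50.0, "sd": 2.0},
]
totals = run(records)  # one aggregate value per Monte Carlo iteration
```

Because each bundle carries its own seed, the sampled values in this sketch do not depend on which worker processes the bundle or in what order, which is the property that makes the repetitions reproducible under parallel execution.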
REFERENCES
1. Parag Agrawal, Omar Benjelloun, Anish Das Sarma, Chris Hayworth, Shubha Nabar, Tomoe Sugihara, Jennifer Widom. Trio: a system for data, uncertainty, and lineage. Proceedings of the 32nd international conference on Very large data bases, September 12-15, 2006, Seoul, Korea.
2. L. Antova, C. Koch, and D. Olteanu. MayBMS: Managing incomplete information with probabilistic world-set decompositions. In ICDE, pages 1479-1480, 2007.
3. Jihad Boulos, Nilesh Dalvi, Bhushan Mandhani, Shobhit Mathur, Chris Re, Dan Suciu. MYSTIQ: a system for finding more answers by using probabilities. Proceedings of the 2005 ACM SIGMOD international conference on Management of data, June 14-16, 2005, Baltimore, Maryland. doi:10.1145/1066157.1066277
4. Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber. Bigtable: a distributed storage system for structured data. Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation, p.15-15, November 06-08, 2006, Seattle, WA.
6. P. D. Coddington. Random number generators for parallel computers. The NHSE Review, 2, 1996.
7. Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver, Ramana Yerneni. PNUTS: Yahoo!'s hosted data serving platform. Proceedings of the VLDB Endowment, v.1 n.2, August 2008. doi:10.1145/1454159.1454167
9. L. Devroye. Non-Uniform Random Variate Generation. Springer, 1986.
11. P. W. Glynn and S. Asmussen. Stochastic Simulation: Algorithms and Analysis. Springer, 2007.
12. Hadoop. http://hadoop.apache.org/core/.
15. S. G. Henderson and B. L. Nelson, editors. Simulation. North-Holland, 2006.
16. Ravi Jampani, Fei Xu, Mingxi Wu, Luis Leopoldo Perez, Christopher Jermaine, Peter J. Haas. MCDB: a monte carlo approach to managing uncertain data. Proceedings of the 2008 ACM SIGMOD international conference on Management of data, June 09-12, 2008, Vancouver, Canada. doi:10.1145/1376616.1376686
17. JAQL. http://code.google.com/p/jaql/.
18. JSON. http://www.json.org.
23. M. Mascagni. Some methods of parallel pseudorandom number generation. In R. Schreiber, M. Heath, and A. Ranade, editors, Algorithms for Parallel Processing, pages 277-288. Springer, 1997.
26. Christopher Re, Dan Suciu. Managing Probabilistic Data with MystiQ: The Can-Do, the Could-Do, and the Can't-Do. Proceedings of the 2nd international conference on Scalable Uncertainty Management, p.5-18, October 01-03, 2008, Naples, Italy. doi:10.1007/978-3-540-87993-0_3
27. SimpleDB. http://aws.amazon.com.
28. Sarvjeet Singh, Chris Mayfield, Sagar Mittal, Sunil Prabhakar, Susanne Hambrusch, Rahul Shah. Orion 2.0: native support for uncertain data. Proceedings of the 2008 ACM SIGMOD international conference on Management of data, June 09-12, 2008, Vancouver, Canada. doi:10.1145/1376616.1376744
29. SQLServer Data Services. http://www.microsoft.com/sql/dataservices/default.mspx.
30. A. Srinivasan, D. M. Ceperley, and M. Mascagni. Random number generators for parallel applications. In Monte Carlo Methods in Chemical Physics, pages 13-36. Wiley, 1997.
31. C. J. K. Tan. The PLFG parallel pseudo-random number generator. Future Generation Computer Systems, 18:693-698, 2002.
32. D. Z. Wang, E. Michelakis, M. N. Garofalakis, and J. M. Hellerstein. BayesStore: managing large, uncertain data repositories with probabilistic graphical models. Proc. VLDB, pages 340-351, 2008.