| MCDB: a monte carlo approach to managing uncertain data |
| Full text |
Pdf
(333 KB)
|
Source
|
International Conference on Management of Data
archive
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
table of contents
Vancouver, Canada
SESSION: Research Session 15: Probabilistic I
table of contents
Pages 687-700
Year of Publication: 2008
ISBN:978-1-60558-102-6
|
|
Authors
|
|
Ravi Jampani
|
University of Florida, Gainesville, FL, USA
|
|
Fei Xu
|
University of Florida, Gainesville, FL, USA
|
|
Mingxi Wu
|
University of Florida, Gainesville, FL, USA
|
|
Luis Leopoldo Perez
|
University of Florida, Gainesville, FL, USA
|
|
Christopher Jermaine
|
University of Florida, Gainesville, FL, USA
|
|
Peter J. Haas
|
IBM Almaden Research Center, San Jose, CA, USA
|
|
| Sponsors |
|
| Publisher |
|
| Bibliometrics |
Downloads (6 Weeks): 121, Downloads (12 Months): 507, Citation Count: 8
|
|
|
ABSTRACT
To deal with data uncertainty, existing probabilistic database systems augment tuples with attribute-level or tuple-level probability values, which are loaded into the database along with the data itself. This approach can severely limit the system's ability to gracefully handle complex or unforeseen types of uncertainty, and does not permit the uncertainty model to be dynamically parameterized according to the current state of the database. We introduce MCDB, a system for managing uncertain data that is based on a Monte Carlo approach. MCDB represents uncertainty via "VG functions," which are used to pseudorandomly generate realized values for uncertain attributes. VG functions can be parameterized on the results of SQL queries over "parameter tables" that are stored in the database, facilitating what-if analyses. By storing parameters, and not probabilities, and by estimating, rather than exactly computing, the probability distribution over possible query answers, MCDB avoids many of the limitations of prior systems. For example, MCDB can easily handle arbitrary joint probability distributions over discrete or continuous attributes, arbitrarily complex SQL queries, and arbitrary functionals of the query-result distribution such as means, variances, and quantiles. To achieve good performance, MCDB uses novel query processing techniques, executing a query plan exactly once, but over "tuple bundles" instead of ordinary tuples. Experiments indicate that our enhanced functionality can be obtained with acceptable overheads relative to traditional systems.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
Parag Agrawal , Omar Benjelloun , Anish Das Sarma , Chris Hayworth , Shubha Nabar , Tomoe Sugihara , Jennifer Widom, Trio: a system for data, uncertainty, and lineage, Proceedings of the 32nd international conference on Very large data bases, September 12-15, 2006, Seoul, Korea
|
| |
2
|
|
| |
3
|
L. Antova, C. Koch, and D. Olteanu. 10106 worlds and beyond: Efficient representation and processing of incomplete information. In ICDE, pages 606--615, 2007.
|
| |
4
|
L. Antova, C. Koch, and D. Olteanu. MayBMS: Managing incomplete information with probabilistic world-set decompositions. In ICDE, pages 1479--1480, 2007.
|
 |
5
|
|
| |
6
|
|
| |
7
|
|
| |
8
|
|
 |
9
|
Reynold Cheng , Sarvjeet Singh , Sunil Prabhakar , Rahul Shah , Jeffrey Scott Vitter , Yuni Xia, Efficient join processing over uncertain data, Proceedings of the 15th ACM international conference on Information and knowledge management, November 06-11, 2006, Arlington, Virginia, USA
[doi> 10.1145/1183614.1183719]
|
| |
10
|
|
| |
11
|
|
 |
12
|
|
| |
13
|
L. Devroye. Non-Uniform Random Variate Generation. Springer, 1986.
|
| |
14
|
L. Devroye and G. Lugosi. Combinatorial Methods in Density Estimation. Springer, 2001.
|
| |
15
|
G. Fishman. Monte Carlo: Concepts, Algorithms, and Applications. Springer, 1996.
|
 |
16
|
|
| |
17
|
J. E. Gentle. Random Number Generation and Monte Carlo Methods. Springer, second edition, 2003.
|
| |
18
|
|
| |
19
|
|
 |
20
|
Joseph M. Hellerstein , Peter J. Haas , Helen J. Wang, Online aggregation, Proceedings of the 1997 ACM SIGMOD international conference on Management of data, p.171-182, May 11-15, 1997, Tucson, Arizona, United States
|
| |
21
|
S. G. Henderson and B. L. Nelson, editors. Simulation. North-Holland, 2006.
|
 |
22
|
|
| |
23
|
|
 |
24
|
|
| |
25
|
R. Murthy and J. Widom. Making aggregation work in uncertain and probabilistic databases. In Proc. 1st Int. VLDB Work. Mgmt. Uncertain Data (MUD), pages 76--90, 2007.
|
| |
26
|
A. Nadas. An extension of a theorem by Chow and Robbins on sequential confidence intervals for the mean. Ann. Math. Statist., 40(2):667--671, 1969.
|
| |
27
|
|
| |
28
|
A. O'Hagan and J. J. Forster. Bayesian Inference. Volume 2B of Kendall's Advanced Theory of Statistics. Arnold, second edition, 2004.
|
| |
29
|
|
| |
30
|
C. Re, N. N. Dalvi, and D. Suciu. Query evaluation on probabilistic databases. IEEE Data Eng. Bull., 29(1):25--31, 2006.
|
| |
31
|
C. Re, N. N. Dalvi, and D. Suciu. Efficient top-k query evaluation on probabilistic data. In ICDE, pages 886--895, 2007.
|
| |
32
|
|
| |
33
|
P. Sen and A. Deshpande. Representing and querying correlated tuples in probabilistic databases. In ICDE, pages 596--605, 2007.
|
| |
34
|
R. J. Serfling. Approximation Theorems of Mathematical Statistics. Wiley, 1980.
|
| |
35
|
S. Singh, C. Mayfield, S. Prabhakar, R. Shah, and S. E. Hambrusch. Indexing uncertain categorical data. In ICDE, pages 616--625, 2007.
|
| |
36
|
J. Xie, J. Yang, Y. Chen, H. Wang, and P. Yu. A sampling-based approach to information recovery. In ICDE, 2008. To appear.
|
CITED BY 8
|
|
|
|
|
|
|
|
Eirinaios Michelakis , Rajasekar Krishnamurthy , Peter J. Haas , Shivakumar Vaithyanathan, Uncertainty management in rule-based information extraction systems, Proceedings of the 35th SIGMOD international conference on Management of data, June 29-July 02, 2009, Providence, Rhode Island, USA
|
|
|
|
|
|
|
|
|
Fei Xu , Kevin Beyer , Vuk Ercegovac , Peter J. Haas , Eugene J. Shekita, E = MC3: managing uncertain enterprise data in a cluster-computing environment, Proceedings of the 35th SIGMOD international conference on Management of data, June 29-July 02, 2009, Providence, Rhode Island, USA
|
|
|
|
|
|
|
|