|
ABSTRACT
Many applications today need to manage large data sets with uncertainties. In this paper we describe the foundations of managing data where the uncertainties are quantified as probabilities. We review the basic definitions of the probabilistic data model, present some fundamental theoretical result for query evaluation on probabilistic databases, and discuss several challenges, open problems, and research directions.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
S. Abiteboul and P. Senellart. Querying and updating probabilistic information in XML. In EDBT, pages 1059--1068, 2006.
|
| |
2
|
Ernest Adams. A Primer of Probability Logic. CSLI Publications, Stanford, California, 1998.
|
| |
3
|
R. Ananthakrishna, S. Chaudhuri, and V. Ganti. Eliminating fuzzy duplicates in data warehouses. In VLDB, pages 586--597, 2002.
|
| |
4
|
|
| |
5
|
L. Antova, C. Koch, and D. Olteanu. 10^(10^6) worlds and beyond: Efficient representation and processing of incomplete information. In ICDE, 2007.
|
| |
6
|
L. Antova, C. Koch, and D. Olteanu. World-set decompositions: Expressiveness and efficient algorithms. In ICDT, pages 194--208, 2007.
|
| |
7
|
|
| |
8
|
|
| |
9
|
|
| |
10
|
|
| |
11
|
G. Borriello and F. Zhao. World-Wide Sensor Web: 2006 UW-MSR Summer Institute Semiahmoo Resort, Blaine, WA, 2006. www.cs.washington.edu/mssi/2006/schedule.html.
|
| |
12
|
Doug Burdick , Prasad M. Deshpande , T. S. Jayram , Raghu Ramakrishnan , Shivakumar Vaithyanathan, Efficient allocation algorithms for OLAP over imprecise data, Proceedings of the 32nd international conference on Very large data bases, September 12-15, 2006, Seoul, Korea
|
 |
13
|
|
| |
14
|
T. Choudhury, M. Philipose, D. Wyatt, and J. Lester. Towards activity databases: Using sensors and statistical models to summarize people's lives. IEEE Data Eng. Bull, 29(1):49--58, March 2006.
|
| |
15
|
W. Cohen, P. Ravikumar, and S. Fienberg. A comparison of string distance metrics for name-matching tasks. In IIWeb, pages 73--78, 2003.
|
| |
16
|
|
| |
17
|
Robert G. Cowell , Steffen L. Lauritzen , A. Philip David , David J. Spiegelhalter , V. Nair , J. Lawless , M. Jordan , David J. Spiegelhater, Probabilistic Networks and Expert Systems, Springer-Verlag New York, Inc., Secaucus, NJ, 1999
|
| |
18
|
|
| |
19
|
N. Dalvi, G. Miklau, and D. Suciu. Asymptotic conditional probabilities for conjunctive queries. In ICDT, 2005.
|
| |
20
|
N. Dalvi, Chris Re, and D. Suciu. Query evaluation on probabilistic databases. IEEE Data Engineering Bulletin, 29(1):25--31, 2006.
|
| |
21
|
N. Dalvi and D. Suciu. Efficient query evaluation on probabilistic databases. In VLDB, Toronto, Canada, 2004.
|
| |
22
|
|
 |
23
|
|
| |
24
|
Nilesh Dalvi. Query evaluation on a database given by a random graph. In ICDT, pages 149--163, 2007.
|
 |
25
|
|
| |
26
|
|
 |
27
|
|
 |
28
|
Amol Deshpande , Minos Garofalakis , Rajeev Rastogi, Independence is good: dependency-based histogram synopses for high-dimensional data, Proceedings of the 2001 ACM SIGMOD international conference on Management of data, p.199-210, May 21-24, 2001, Santa Barbara, California, United States
|
| |
29
|
A. Deshpande, C. Guestrin, S. Madden, J. M. Hellerstein, and W. Hong. Model-driven data acquisition in sensor networks. In VLDB, pages 588--599, 2004.
|
| |
30
|
A. Deshpande, C. Guestrin, S. Madden, J. M. Hellerstein, and W. Hong. Using probabilistic models for data management in acquisitional environments. In CIDR, pages 317--328, 2005.
|
| |
31
|
A. Doan, R. Ramakrishnan, F. Chen, P. DeRose, Y. Lee, R. McCann, M. Sayyadian, and W. Shen. Community information management. IEEE Data Engineering Bulletin, Special Issue on Probabilistic Data Management, 29(1):64--72, March 2006.
|
| |
32
|
Magdalena Balazinska , Amol Deshpande , Michael J. Franklin , Phillip B. Gibbons , Jim Gray , Mark Hansen , Michael Liebhold , Suman Nath , Alexander Szalay , Vincent Tao, Data Management in the Worldwide Sensor Web, IEEE Pervasive Computing, v.6 n.2, p.30-40, April 2007
[doi> 10.1109/MPRV.2007.27]
|
 |
33
|
Oren Etzioni , Michael Cafarella , Doug Downey , Stanley Kok , Ana-Maria Popescu , Tal Shaked , Stephen Soderland , Daniel S. Weld , Alexander Yates, Web-scale information extraction in knowitall: (preliminary results), Proceedings of the 13th international conference on World Wide Web, May 17-20, 2004, New York, NY, USA
[doi> 10.1145/988672.988687]
|
| |
34
|
Ivan Felligi and Alan Sunter. A theory for record linkage. Journal of the American Statistical Society, 64:1183--1210, 1969.
|
 |
35
|
|
 |
36
|
|
| |
37
|
Helena Galhardas , Daniela Florescu , Dennis Shasha , Eric Simon , Cristian-Augustin Saita, Declarative Data Cleaning: Language, Model, and Algorithms, Proceedings of the 27th International Conference on Very Large Data Bases, p.371-380, September 11-14, 2001
|
| |
38
|
Minos Garofalakis and Dan Suciu. Special issue on probabilistic data management. IEEE Data Engineering Bulletin, pages 1--72, 2006.
|
| |
39
|
Lise Getoor. An introduction to probabilistic graphical models for relational data. IEEE Data Engineering Bulletin, Special Issue on Probabilistic Data Management, 29(1):32--40, March 2006.
|
 |
40
|
Erich Grädel , Yuri Gurevich , Colin Hirsch, The complexity of query reliability, Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, p.227-234, June 01-04, 1998, Seattle, Washington, United States
[doi> 10.1145/275487.295124]
|
| |
41
|
T. Green and V. Tannen. Models for incomplete and probabilistic information. IEEE Data Engineering Bulletin, 29(1):17--24, March 2006.
|
| |
42
|
|
| |
43
|
L. Gu, R. Baxter, D. Vickers, and C. Rainsford. Record linkage: Current practice and future directions. In CMIS Technical Report No. 03/83, 2003.
|
| |
44
|
|
 |
45
|
Alon Halevy , Michael Franklin , David Maier, Principles of dataspace systems, Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, p.1-9, June 26-28, 2006, Chicago, IL, USA
[doi> 10.1145/1142351.1142352]
|
| |
46
|
|
 |
47
|
|
| |
48
|
D. Heckerman. Tutorial on graphical models, June 2002.
|
 |
49
|
|
| |
50
|
E. Hung, L. Getoor, and V. S. Subrahmanian. PXML: A probabilistic semistructured data model and algebra. In ICDE, 2003.
|
 |
51
|
Ihab F. Ilyas , Volker Markl , Peter Haas , Paul Brown , Ashraf Aboulnaga, CORDS: automatic discovery of correlations and soft functional dependencies, Proceedings of the 2004 ACM SIGMOD international conference on Management of data, June 13-18, 2004, Paris, France
[doi> 10.1145/1007568.1007641]
|
| |
52
|
T. S. Jayram, S. Kale, and E. Vee. Efficient aggregation algorithms for probabilistic data. In SODA, 2007.
|
| |
53
|
T. S. Jayram, R. Krishnamurthy, S. Raghavan, S. Vaithyanathan, and H. Zhu. Avatar information extraction system. IEEE Data Engineering Bulletin, 29(1):40--48, 2006.
|
| |
54
|
|
| |
55
|
R. Karp and M. Luby. Monte-Carlo algorithms for enumeration and reliability problems. In Proceedings of the annual ACM symposium on Theory of computing, 1983.
|
 |
56
|
|
 |
57
|
|
| |
58
|
D. Koller. Representation, reasoning, learning. Computers and Thought 2001 Award talk.
|
 |
59
|
|
| |
60
|
J. Lester, T. Choudhury, N. Kern, G. Borriello, and B. Hannaford. A hybrid discriminative/generative approach for modeling human activities. In IJCAI, pages 766--772, 2005.
|
| |
61
|
J. Madhavan, S. Cohen, X. Dong, A. Halevy, S. Jeffery, D. Ko, and C. Yu. Web-scale data integration: You can afford to pay as you go. In CIDR, pages 342--350, 2007.
|
 |
62
|
|
| |
63
|
Radford Neal. Probabilistic inference using Markov Chain Monte Carlo methods. Technical Report CRG-TR-93-1, Univ. of Toronto, 1993.
|
| |
64
|
Christos Papadimitriou. Computational Complexity. Addison Wesley Publishing Company, 1994.
|
| |
65
|
|
| |
66
|
S. Philippi and J. Kohler. Addressing the problems with life-science databases for traditional uses and systems biology. Nature Reviews Genetics, 7:481--488, June 2006.
|
| |
67
|
J. S. Provan and M. O. Ball. The complexity of counting cuts and of computing the probability that a graph is connected. SIAM J. Comput., 12(4):777--788, 1983.
|
| |
68
|
C. Re, N. Dalvi, and D. Suciu. Efficient Top-k query evaluation on probabilistic data. In ICDE, 2007.
|
| |
69
|
Christopher Ré. Applications of probabilistic constraints. Technical Reprot TR2007-03-03, University of Washington, Seattle, Washington, March 2007.
|
 |
70
|
|
| |
71
|
Sunita Sarawagi. Automation in information extraction and data integration. Tutorial presented at VLDB'2002.
|
| |
72
|
Prithviraj Sen and Amol Deshpande. Representing and querying correlated tuples in probabilistic databases. In ICDE, 2007.
|
| |
73
|
W. Shen, X. Li, and A. Doan. Constraint-based entity matching. In AAAI, pages 862--867, 2005.
|
 |
74
|
|
| |
75
|
L. Valiant. The complexity of enumeration and reliability problems. SIAM J. Comput., 8:410--421, 1979.
|
| |
76
|
|
 |
77
|
|
| |
78
|
|
 |
79
|
|
| |
80
|
William Winkler. The state of record linkage and current research problems. Technical report, Statistical Research Division, U.S. Bureau of the Census, 1999.
|
| |
81
|
Y. Zabiyaka and A. Darwiche. Functional treewidth: Bounding complexity in the presence of functional dependencies. In SAT, pages 116--129, 2006.
|
| |
82
|
alonhalevy.blogspot.com/2007_01_01_archive.html.
|
| |
83
|
www.flickr.com.
|
| |
84
|
base.google.com.
|
| |
85
|
|
|