| Estimating the confidence of conditional functional dependencies |
| Full text |
Pdf
(608 KB)
|
Source
|
International Conference on Management of Data
archive
Proceedings of the 35th SIGMOD international conference on Management of data
table of contents
Providence, Rhode Island, USA
SESSION: Research session 12: probabilistic databases II
table of contents
Pages 469-482
Year of Publication: 2009
ISBN:978-1-60558-551-2
|
|
Authors
|
|
Graham Cormode
|
AT&T Labs, Florham Park, NJ, USA
|
|
Lukasz Golab
|
AT&T Labs, Florham Park, NJ, USA
|
|
Korn Flip
|
AT&T Labs, Florham Park, NJ, USA
|
|
Andrew McGregor
|
University of Massachusetts Amherst, Amherst, MA, USA
|
|
Divesh Srivastava
|
AT&T Labs, Florham Park, NJ, USA
|
|
Xi Zhang
|
SUNY Buffalo, Buffalo, NY, USA
|
|
| Sponsors |
|
| Publisher |
|
| Bibliometrics |
Downloads (6 Weeks): 40, Downloads (12 Months): 160, Citation Count: 1
|
|
|
ABSTRACT
Conditional functional dependencies (CFDs) have recently been proposed as extensions of classical functional dependencies that apply to a certain subset of the relation, as specified by a pattern tableau. Calculating the support and confidence of a CFD (i.e., the size of the applicable subset and the extent to which it satisfies the CFD)gives valuable information about data semantics and data quality. While computing the support is easier, computing the confidence exactly is expensive if the relation is large, and estimating it from a random sample of the relation is unreliable unless the sample is large. We study how to efficiently estimate the confidence of a CFD with a small number of passes (one or two) over the input using small space. Our solutions are based on a variety of sampling and sketching techniques, and apply when the pattern tableau is known in advance, and also the harder case when this is given after the data have been seen. We analyze our algorithms, and show that they can guarantee a small additive error; we also show that relative errors guarantees are not possible. We demonstrate the power of these methods empirically, with a detailed study using both real and synthetic data. These experiments show that it is possible to estimate the CFD confidence very accurately with summaries which are much smaller than the size of the data they represent.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
M. Arlitt and T. Jin. 1998 world cup web site access logs. http://www.acm.org/sigcomm/ITA/, 1998.
|
 |
2
|
Lakshminath Bhuvanagiri , Sumit Ganguly , Deepanjan Kesh , Chandan Saha, Simpler algorithm for estimating frequency moments of data streams, Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, p.708-713, January 22-26, 2006, Miami, Florida
[doi> 10.1145/1109557.1109634]
|
| |
3
|
|
| |
4
|
|
 |
5
|
Andrei Z. Broder , Moses Charikar , Alan M. Frieze , Michael Mitzenmacher, Min-wise independent permutations (extended abstract), Proceedings of the thirtieth annual ACM symposium on Theory of computing, p.327-336, May 24-26, 1998, Dallas, Texas, United States
[doi> 10.1145/276698.276781]
|
| |
6
|
Joshua Brody and Amit Chakrabarti. A multi-round communication lower bound for gap hamming and some consequences. CoRR, abs/0902.2399, 2009.
|
| |
7
|
|
| |
8
|
|
 |
9
|
|
| |
10
|
|
| |
11
|
Gao Cong , Wenfei Fan , Floris Geerts , Xibei Jia , Shuai Ma, Improving data quality: consistency and accuracy, Proceedings of the 33rd international conference on Very large data bases, September 23-27, 2007, Vienna, Austria
|
 |
12
|
|
| |
13
|
G. Cormode and S. Muthukrishnan. Summarizing and mining skewed data streams. In SIAM Conference on Data Mining, 2005.
|
 |
14
|
|
| |
15
|
|
| |
16
|
|
| |
17
|
|
| |
18
|
Lukasz Golab, Howard Karloff, Flip Korn, Divesh Srivastava, and Bei Yu. On generating near-optimal tableaux for conditional functional dependencies. In International Conference on Very Large Data Bases, 2008.
|
| |
19
|
Y. Huhtala, J. Karkkainen, P. Porkka, and H. Toivonen. TANE: An efficient algorithm for discovering functional and approximate dependencies. The Computer Journal, 42(2):100--111, 1999.
|
 |
20
|
Ihab F. Ilyas , Volker Markl , Peter Haas , Paul Brown , Ashraf Aboulnaga, CORDS: automatic discovery of correlations and soft functional dependencies, Proceedings of the 2004 ACM SIGMOD international conference on Management of data, June 13-18, 2004, Paris, France
[doi> 10.1145/1007568.1007641]
|
| |
21
|
T. S. Jayram, Ravi Kumar, and D. Sivakumar. The one-way communication complexity of hamming distance. Theory of Computing, 4(1):129--135, 2008.
|
 |
22
|
|
| |
23
|
|
| |
24
|
|
| |
25
|
A. Metwally, D. Agrawal, and A. El Abbadi. Efficient computation of frequent and top-k elements in data streams. In International Conference on Database Theory, 2005.
|
| |
26
|
|
 |
27
|
|
CITED BY
|
|
Lukasz Golab , Theodore Johnson , J. Spencer Seidel , Vladislav Shkapenyuk, Stream warehousing with DataDepot, Proceedings of the 35th SIGMOD international conference on Management of data, June 29-July 02, 2009, Providence, Rhode Island, USA
|
|