ACM Home Page
Please provide us with feedback. Feedback
Estimating the confidence of conditional functional dependencies
Full text PdfPdf (608 KB)
Source
International Conference on Management of Data archive
Proceedings of the 35th SIGMOD international conference on Management of data table of contents
Providence, Rhode Island, USA
SESSION: Research session 12: probabilistic databases II table of contents
Pages 469-482  
Year of Publication: 2009
ISBN:978-1-60558-551-2
Authors
Graham Cormode  AT&T Labs, Florham Park, NJ, USA
Lukasz Golab  AT&T Labs, Florham Park, NJ, USA
Korn Flip  AT&T Labs, Florham Park, NJ, USA
Andrew McGregor  University of Massachusetts Amherst, Amherst, MA, USA
Divesh Srivastava  AT&T Labs, Florham Park, NJ, USA
Xi Zhang  SUNY Buffalo, Buffalo, NY, USA
Sponsors
ACM: Association for Computing Machinery
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 41,   Downloads (12 Months): 154,   Citation Count: 1
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1559845.1559895
What is a DOI?

ABSTRACT

Conditional functional dependencies (CFDs) have recently been proposed as extensions of classical functional dependencies that apply to a certain subset of the relation, as specified by a pattern tableau. Calculating the support and confidence of a CFD (i.e., the size of the applicable subset and the extent to which it satisfies the CFD)gives valuable information about data semantics and data quality. While computing the support is easier, computing the confidence exactly is expensive if the relation is large, and estimating it from a random sample of the relation is unreliable unless the sample is large.

We study how to efficiently estimate the confidence of a CFD with a small number of passes (one or two) over the input using small space. Our solutions are based on a variety of sampling and sketching techniques, and apply when the pattern tableau is known in advance, and also the harder case when this is given after the data have been seen. We analyze our algorithms, and show that they can guarantee a small additive error; we also show that relative errors guarantees are not possible. We demonstrate the power of these methods empirically, with a detailed study using both real and synthetic data. These experiments show that it is possible to estimate the CFD confidence very accurately with summaries which are much smaller than the size of the data they represent.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
]]M. Arlitt and T. Jin. 1998 world cup web site access logs. http://www.acm.org/sigcomm/ITA/, 1998.
2
 
3
 
4
5
 
6
]]Joshua Brody and Amit Chakrabarti. A multi-round communication lower bound for gap hamming and some consequences. CoRR, abs/0902.2399, 2009.
 
7
 
8
9
 
10
]]F. Chiang and R. Miller. Discovering data quality rules. In International Conference on Very Large Data Bases, 2008.
 
11
12
 
13
]]G. Cormode and S. Muthukrishnan. Summarizing and mining skewed data streams. In SIAM Conference on Data Mining, 2005.
14
 
15
 
16
 
17
 
18
]]Lukasz Golab, Howard Karloff, Flip Korn, Divesh Srivastava, and Bei Yu. On generating near-optimal tableaux for conditional functional dependencies. In International Conference on Very Large Data Bases, 2008.
 
19
]]Y. Huhtala, J. Karkkainen, P. Porkka, and H. Toivonen. TANE: An efficient algorithm for discovering functional and approximate dependencies. The Computer Journal, 42(2):100--111, 1999.
20
 
21
]]T. S. Jayram, Ravi Kumar, and D. Sivakumar. The one-way communication complexity of hamming distance. Theory of Computing, 4(1):129--135, 2008.
22
 
23
 
24
 
25
]]A. Metwally, D. Agrawal, and A. El Abbadi. Efficient computation of frequent and top-k elements in data streams. In International Conference on Database Theory, 2005.
 
26
27


Collaborative Colleagues:
Graham Cormode: colleagues
Lukasz Golab: colleagues
Korn Flip: colleagues
Andrew McGregor: colleagues
Divesh Srivastava: colleagues
Xi Zhang: colleagues