ACM Home Page
Please provide us with feedback. Feedback
An efficient rigorous approach for identifying statistically significant frequent itemsets
Full text PdfPdf (471 KB)
Source
Symposium on Principles of Database Systems archive
Proceedings of the twenty-eighth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems table of contents
Providence, Rhode Island, USA
SESSION: Data analysis and optimization table of contents
Pages 117-126  
Year of Publication: 2009
ISBN:978-1-60558-553-6
Authors
Adam Kirsch  Harvard University, Cambridge, MA, USA
Michael Mitzenmacher  Harvard University, Cambridge, MA, USA
Andrea Pietracaprina  University of Padova, Padova, Italy
Geppino Pucci  University of Padova, Padova, Italy
Eli Upfal  Brown University, Providence, RI, USA
Fabio Vandin  University of Padova, Padova, Italy
Sponsors
ACM: Association for Computing Machinery
SIGACT: ACM Special Interest Group on Algorithms and Computation Theory
SIGMOD: ACM Special Interest Group on Management of Data
SIGART: ACM Special Interest Group on Artificial Intelligence
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 20,   Downloads (12 Months): 95,   Citation Count: 0
Additional Information:

abstract   references   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1559795.1559814
What is a DOI?

ABSTRACT

As advances in technology allow for the collection, storage, and analysis of vast amounts of data, the task of screening and assessing the significance of discovered patterns is becoming a major challenge in data mining applications. In this work, we address significance in the context of frequent itemset mining. Specifically, we develop a novel methodology to identify a meaningful support threshold s* for a dataset, such that the number of itemsets with support at least s* represents a substantial deviation from what would be expected in a random dataset with the same number of transactions and the same individual item frequencies. These itemsets can then be flagged as statistically significant with a small false discovery rate.

Our methodology hinges on a Poisson approximation to the distribution of the number of itemsets in a random dataset with support at least s, for any s greater than or equal to a minimum threshold smin. We obtain this result through a novel application of the Chen-Stein approximation method, which is of independent interest. Based on this approximation, we develop an efficient parametric multi-hypothesis test for identifying the desired threshold s*. A crucial feature of our approach is that, unlike most previous work, it takes into account the entire dataset rather than individual discoveries. It is therefore better able to distinguish between significant observations and random fluctuations. We present extensive experimental results to substantiate the effectiveness of our methodology.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

1
2
 
3
R. Arratia, L. Goldstein, and L. Gordon. Poisson approximation and the Chen-Stein method. Statistical Science, 5(4):403--434, 1990.
 
4
Y. Benjamini, and Y. Hochberg. Controlling the false discovery rate. J. Royal Statistical Society, Series B, 57:289--300, 1995.
 
5
Y. Benjamini, and D. Yekutieli The control of the false discovery rate in multiple testing under dependency Annals of Statistics, 29 (4): 1165--1188, 2001.
 
6
 
7
W. J. Conover. Practical Nonparametric Statistics. Wiley Series in Probability, 3rd Ed., 1999.
 
8
S. Dudoit, J.P. Shaffer, and J.C. Boldrick. Multiple hypothesis testing in microarray experiments. Statistical Science, Vol. 18, No. 1, 2003, p. 71--103.
 
9
W. DuMouchel. Bayesian data mining in large frequency tables, with an application to the FDA spontaneous reporting system. The American Statistician, 53:177--202, 1999.
10
11
 
12
B. Goethals, R. Bayardo, and M.J. Zaki, editors. Proc. of the 2nd Workshop on Frequent Itemset Mining Implementations (FIMI04), volume 126. CEUR-WS Workshop On-line Proceedings, November 2004.
 
13
B. Goethals and M. J. Zaki, editors. Proc. of the 1st Workshop on Frequent Itemset Mining Implementations (FIMI03), volume 90. CEUR-WS Workshop On-line Proceedings, November 2003.
 
14
 
15
 
16
17
 
18
H.O. Lancaster. The Chi-squared Distribution. John Wiley&Sons, New York NY, 1969.
 
19
N. Megiddo, and R. Srikant Discovering predictive association rules. In Proc. of the 4th Intl. Conference on Knowledge Discovery and Data Mining, pages 274--278, 1998.
 
20
 
21
 
22
P.W. Purdom, D. Van Gucht, and D.P. Groth. Average case performance of the apriori algorithm. SIAM J. Computing, 33 (5): 1223--1260, 2004.
23
 
24
 
25
26
 
27
 
28
29

Collaborative Colleagues:
Adam Kirsch: colleagues
Michael Mitzenmacher: colleagues
Andrea Pietracaprina: colleagues
Geppino Pucci: colleagues
Eli Upfal: colleagues
Fabio Vandin: colleagues