| An efficient rigorous approach for identifying statistically significant frequent itemsets |
| Full text |
Pdf
(471 KB)
|
Source
|
Symposium on Principles of Database Systems
archive
Proceedings of the twenty-eighth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
table of contents
Providence, Rhode Island, USA
SESSION: Data analysis and optimization
table of contents
Pages 117-126
Year of Publication: 2009
ISBN:978-1-60558-553-6
|
|
Authors
|
|
Adam Kirsch
|
Harvard University, Cambridge, MA, USA
|
|
Michael Mitzenmacher
|
Harvard University, Cambridge, MA, USA
|
|
Andrea Pietracaprina
|
University of Padova, Padova, Italy
|
|
Geppino Pucci
|
University of Padova, Padova, Italy
|
|
Eli Upfal
|
Brown University, Providence, RI, USA
|
|
Fabio Vandin
|
University of Padova, Padova, Italy
|
|
| Sponsors |
|
| Publisher |
|
| Bibliometrics |
Downloads (6 Weeks): 20, Downloads (12 Months): 95, Citation Count: 0
|
|
|
ABSTRACT
As advances in technology allow for the collection, storage, and analysis of vast amounts of data, the task of screening and assessing the significance of discovered patterns is becoming a major challenge in data mining applications. In this work, we address significance in the context of frequent itemset mining. Specifically, we develop a novel methodology to identify a meaningful support threshold s* for a dataset, such that the number of itemsets with support at least s* represents a substantial deviation from what would be expected in a random dataset with the same number of transactions and the same individual item frequencies. These itemsets can then be flagged as statistically significant with a small false discovery rate. Our methodology hinges on a Poisson approximation to the distribution of the number of itemsets in a random dataset with support at least s, for any s greater than or equal to a minimum threshold smin. We obtain this result through a novel application of the Chen-Stein approximation method, which is of independent interest. Based on this approximation, we develop an efficient parametric multi-hypothesis test for identifying the desired threshold s*. A crucial feature of our approach is that, unlike most previous work, it takes into account the entire dataset rather than individual discoveries. It is therefore better able to distinguish between significant observations and random fluctuations. We present extensive experimental results to substantiate the effectiveness of our methodology.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
|
 |
2
|
Rakesh Agrawal , Tomasz Imieliński , Arun Swami, Mining association rules between sets of items in large databases, Proceedings of the 1993 ACM SIGMOD international conference on Management of data, p.207-216, May 25-28, 1993, Washington, D.C., United States
|
| |
3
|
R. Arratia, L. Goldstein, and L. Gordon. Poisson approximation and the Chen-Stein method. Statistical Science, 5(4):403--434, 1990.
|
| |
4
|
Y. Benjamini, and Y. Hochberg. Controlling the false discovery rate. J. Royal Statistical Society, Series B, 57:289--300, 1995.
|
| |
5
|
Y. Benjamini, and D. Yekutieli The control of the false discovery rate in multiple testing under dependency Annals of Statistics, 29 (4): 1165--1188, 2001.
|
| |
6
|
|
| |
7
|
W. J. Conover. Practical Nonparametric Statistics. Wiley Series in Probability, 3rd Ed., 1999.
|
| |
8
|
S. Dudoit, J.P. Shaffer, and J.C. Boldrick. Multiple hypothesis testing in microarray experiments. Statistical Science, Vol. 18, No. 1, 2003, p. 71--103.
|
| |
9
|
W. DuMouchel. Bayesian data mining in large frequency tables, with an application to the FDA spontaneous reporting system. The American Statistician, 53:177--202, 1999.
|
 |
10
|
|
 |
11
|
Aristides Gionis , Heikki Mannila , Taneli Mielikäinen , Panayiotis Tsaparas, Assessing data mining results via swap randomization, Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, August 20-23, 2006, Philadelphia, PA, USA
[doi> 10.1145/1150402.1150424]
|
| |
12
|
B. Goethals, R. Bayardo, and M.J. Zaki, editors. Proc. of the 2nd Workshop on Frequent Itemset Mining Implementations (FIMI04), volume 126. CEUR-WS Workshop On-line Proceedings, November 2004.
|
| |
13
|
B. Goethals and M. J. Zaki, editors. Proc. of the 1st Workshop on Frequent Itemset Mining Implementations (FIMI03), volume 90. CEUR-WS Workshop On-line Proceedings, November 2003.
|
| |
14
|
|
| |
15
|
|
| |
16
|
|
 |
17
|
|
| |
18
|
H.O. Lancaster. The Chi-squared Distribution. John Wiley&Sons, New York NY, 1969.
|
| |
19
|
N. Megiddo, and R. Srikant Discovering predictive association rules. In Proc. of the 4th Intl. Conference on Knowledge Discovery and Data Mining, pages 274--278, 1998.
|
| |
20
|
|
| |
21
|
|
| |
22
|
P.W. Purdom, D. Van Gucht, and D.P. Groth. Average case performance of the apriori algorithm. SIAM J. Computing, 33 (5): 1223--1260, 2004.
|
 |
23
|
|
| |
24
|
|
| |
25
|
|
 |
26
|
|
| |
27
|
|
| |
28
|
|
 |
29
|
|
|