|
ABSTRACT
This paper considers the framework of the so-called "market basket problem", in which a database of transactions is mined for the occurrence of unusually frequent item sets. In our case, "unusually frequent" involves estimates of the frequency of each item set divided by a baseline frequency computed as if items occurred independently. The focus is on obtaining reliable estimates of this measure of interestingness for all item sets, even item sets with relatively low frequencies. For example, in a medical database of patient histories, unusual item sets including the item "patient death" (or other serious adverse event) might hopefully be flagged with as few as 5 or 10 occurrences of the item set, it being unacceptable to require that item sets occur in as many as 0.1% of millions of patient reports before the data mining algorithm detects a signal. Similar considerations apply in fraud detection applications. Thus we abandon the requirement that interesting item sets must contain a relatively large fixed minimal support, and adopt a criterion based on the results of fitting an empirical Bayes model to the item set counts. The model allows us to define a 95% Bayesian lower confidence limit for the "interestingness" measure of every item set, whereupon the item sets can be ranked according to their empirical Bayes confidence limits. For item sets of size J > 2, we also distinguish between multi-item associations that can be explained by the observed J(J-1)/2 pairwise associations, and item sets that are significantly more frequent than their pairwise associations would suggest. Such item sets can uncover complex or synergistic mechanisms generating multi-item associations. This methodology has been applied within the U.S. Food and Drug Administration (FDA) to databases of adverse drug reaction reports and within AT&T to customer international calling histories. We also present graphical techniques for exploring and understanding the modeling results.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
|
 |
2
|
Rakesh Agrawal , Tomasz Imieliński , Arun Swami, Mining association rules between sets of items in large databases, Proceedings of the 1993 ACM SIGMOD international conference on Management of data, p.207-216, May 25-28, 1993, Washington, D.C., United States
|
| |
3
|
|
| |
4
|
Agresti A (1990) Categorical Data Analysis. New York: John Wiley.
|
| |
5
|
Bishop YMM, Fienberg SE, Holland PW (1975) Discrete Multivariate Analysis Cambridge, MA: MIT Press.
|
 |
6
|
Sergey Brin , Rajeev Motwani , Jeffrey D. Ullman , Shalom Tsur, Dynamic itemset counting and implication rules for market basket data, Proceedings of the 1997 ACM SIGMOD international conference on Management of data, p.255-264, May 11-15, 1997, Tucson, Arizona, United States
|
| |
7
|
Bryck A, Randenbush S (1992) Hierarchical Linear Models. Newbury Park, CA: Sage Publications.
|
| |
8
|
DuMouchel W (1999) Bayesian data mining in large frequency tables, with an application to the FDA Spontaneous Reporting System (with discussion), The American Statistician, 53:177-202.
|
| |
9
|
DuMouchel W, Friedman C, Hripcsak G, Johnson S, Clayton P (1996) Two applications of statistical modeling to natm'al language processing. AI and Statistics V, ch. 39, edited by D. Fisher and H. Lenz, Springer-Verlag.
|
 |
10
|
William DuMouchel , Chris Volinsky , Theodore Johnson , Corinna Cortes , Daryl Pregibon, Squashing flat files flatter, Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, p.6-15, August 15-18, 1999, San Diego, California, United States
[doi> 10.1145/312129.312184]
|
| |
11
|
Johnson N, Kotz S (1969) Discrete Distributions. Houghton Mifflin, now distributed by New York: John Wiley.
|
| |
12
|
Mantel N, Haenszel W (1959) Statistical aspects of the analysis of data from retrospective studies of disease. J. Natl. Cancer Inst. 22" 719-748.
|
| |
13
|
O'Hagan A (1994) Kendall's Advanced Theory of Statistics, vol. 2, Bayesian Inference. New York: Halstead Press (John Wiley).
|
| |
14
|
|
| |
15
|
Simpson EH (1951) The interpretation of interaction in contingency tables. J. Royal ,Statistical Soc., B 13:238-241.
|
CITED BY 24
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Hui Xiong , Shashi Shekhar , Pang-Ning Tan , Vipin Kumar, Exploiting a support-based upper bound of Pearson's correlation coefficient for efficiently identifying strongly correlated pairs, Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, August 22-25, 2004, Seattle, WA, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Adam Kirsch , Michael Mitzenmacher , Andrea Pietracaprina , Geppino Pucci , Eli Upfal , Fabio Vandin, An efficient rigorous approach for identifying statistically significant frequent itemsets, Proceedings of the twenty-eighth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, June 29-July 01, 2009, Providence, Rhode Island, USA
|
|
|
|
|
|
|
|