ACM Home Page
Please provide us with feedback. Feedback
Empirical bayes screening for multi-item associations
Full text PdfPdf (932 KB)
Source International Conference on Knowledge Discovery and Data Mining archive
Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining table of contents
San Francisco, California
Pages: 67 - 76  
Year of Publication: 2001
ISBN:1-58113-391-X
Authors
William DuMouchel  AT&T Labs---Research, Florham Park, NJ
Daryl Pregibon  AT&T Labs---Research, Florham Park, NJ
Sponsors
SIGMOD: ACM Special Interest Group on Management of Data
AAAI : American Association for Artificial Intelligence
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 20,   Downloads (12 Months): 113,   Citation Count: 24
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/502512.502526
What is a DOI?

ABSTRACT

This paper considers the framework of the so-called "market basket problem", in which a database of transactions is mined for the occurrence of unusually frequent item sets. In our case, "unusually frequent" involves estimates of the frequency of each item set divided by a baseline frequency computed as if items occurred independently. The focus is on obtaining reliable estimates of this measure of interestingness for all item sets, even item sets with relatively low frequencies. For example, in a medical database of patient histories, unusual item sets including the item "patient death" (or other serious adverse event) might hopefully be flagged with as few as 5 or 10 occurrences of the item set, it being unacceptable to require that item sets occur in as many as 0.1% of millions of patient reports before the data mining algorithm detects a signal. Similar considerations apply in fraud detection applications. Thus we abandon the requirement that interesting item sets must contain a relatively large fixed minimal support, and adopt a criterion based on the results of fitting an empirical Bayes model to the item set counts. The model allows us to define a 95% Bayesian lower confidence limit for the "interestingness" measure of every item set, whereupon the item sets can be ranked according to their empirical Bayes confidence limits. For item sets of size J > 2, we also distinguish between multi-item associations that can be explained by the observed J(J-1)/2 pairwise associations, and item sets that are significantly more frequent than their pairwise associations would suggest. Such item sets can uncover complex or synergistic mechanisms generating multi-item associations. This methodology has been applied within the U.S. Food and Drug Administration (FDA) to databases of adverse drug reaction reports and within AT&T to customer international calling histories. We also present graphical techniques for exploring and understanding the modeling results.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

1
2
 
3
 
4
Agresti A (1990) Categorical Data Analysis. New York: John Wiley.
 
5
Bishop YMM, Fienberg SE, Holland PW (1975) Discrete Multivariate Analysis Cambridge, MA: MIT Press.
6
 
7
Bryck A, Randenbush S (1992) Hierarchical Linear Models. Newbury Park, CA: Sage Publications.
 
8
DuMouchel W (1999) Bayesian data mining in large frequency tables, with an application to the FDA Spontaneous Reporting System (with discussion), The American Statistician, 53:177-202.
 
9
DuMouchel W, Friedman C, Hripcsak G, Johnson S, Clayton P (1996) Two applications of statistical modeling to natm'al language processing. AI and Statistics V, ch. 39, edited by D. Fisher and H. Lenz, Springer-Verlag.
10
 
11
Johnson N, Kotz S (1969) Discrete Distributions. Houghton Mifflin, now distributed by New York: John Wiley.
 
12
Mantel N, Haenszel W (1959) Statistical aspects of the analysis of data from retrospective studies of disease. J. Natl. Cancer Inst. 22" 719-748.
 
13
O'Hagan A (1994) Kendall's Advanced Theory of Statistics, vol. 2, Bayesian Inference. New York: Halstead Press (John Wiley).
 
14
 
15
Simpson EH (1951) The interpretation of interaction in contingency tables. J. Royal ,Statistical Soc., B 13:238-241.

CITED BY  24

Collaborative Colleagues:
William DuMouchel: colleagues
Daryl Pregibon: colleagues