|
ABSTRACT
Given a user-specified minimum correlation threshold θ and a market basket database with N items and T transactions, an all-strong-pairs correlation query finds all item pairs with correlations above the threshold θ. However, when the number of items and transactions are large, the computation cost of this query can be very high. In this paper, we identify an upper bound of Pearson's correlation coefficient for binary variables. This upper bound is not only much cheaper to compute than Pearson's correlation coefficient but also exhibits a special monotone property which allows pruning of many item pairs even without computing their upper bounds. A Two-step All-strong-Pairs corrElation que Ry (TAPER) algorithm is proposed to exploit these properties in a filter-and-refine manner. Furthermore, we provide an algebraic cost model which shows that the computation savings from pruning is independent or improves when the number of items is increased in data sets with common Zipf or linear rank-support distributions. Experimental results from synthetic and real data sets exhibit similar trends and show that the TAPER algorithm can be an order of magnitude faster than brute-force alternatives.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
Rakesh Agrawal , Tomasz Imieliński , Arun Swami, Mining association rules between sets of items in large databases, Proceedings of the 1993 ACM SIGMOD international conference on Management of data, p.207-216, May 25-28, 1993, Washington, D.C., United States
|
| |
2
|
|
 |
3
|
Sergey Brin , Rajeev Motwani , Craig Silverstein, Beyond market baskets: generalizing association rules to correlations, Proceedings of the 1997 ACM SIGMOD international conference on Management of data, p.265-276, May 11-15, 1997, Tucson, Arizona, United States
|
 |
4
|
Cristian Bucila , Johannes Gehrke , Daniel Kifer , Walker White, DualMiner: a dual-pruning algorithm for itemsets with constraints, Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, July 23-26, 2002, Edmonton, Alberta, Canada
[doi> 10.1145/775047.775054]
|
| |
5
|
|
| |
6
|
E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, R. Motwani, J. Ullman, and C. Yang. Finding interesting associations without support pruning. In ICDE, 2000.
|
 |
7
|
|
| |
8
|
G. Grahne, L. V. Lakshmanan, and X. Wang. Efficient mining of constrained correlated sets. In ICDE, 2000.
|
 |
9
|
Jiawei Han , Jian Pei , Yiwen Yin, Mining frequent patterns without candidate generation, Proceedings of the 2000 ACM SIGMOD international conference on Management of data, p.1-12, May 15-18, 2000, Dallas, Texas, United States
|
| |
10
|
|
 |
11
|
|
| |
12
|
S. K. Kachigan. Multivariate Statistical Analysis: A Conceptual Introduction. Radius Press, 1991.
|
 |
13
|
Raymond Ng , Laks V. S. Lakshmanan , Jiawei Han , Teresa Mah, Exploratory mining via constrained frequent set queries, Proceedings of the 1999 ACM SIGMOD international conference on Management of data, p.556-558, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States
|
| |
14
|
|
| |
15
|
H. T. Reynolds. The Analysis of Cross-classifications. The Free Press, New York, 1977.
|
| |
16
|
R. Rymon. Search through systematic set enumeration. In Int'l. Conf. on Principles of Knowledge Representation and Reasoning, 1992.
|
| |
17
|
H. Xiong, , S. Shekhar, P. Tan, and V. Kumar. Taper: An efficient two-step approach for all-pairs correlation query in transaction databases. In Technical Report 03-020, computer science and engineering, University of Minnesota - Twin Cities, May 2003.
|
| |
18
|
G. Zipf. Human Behavior and Principle of Least Effort: An Introduction to Human Ecology. Addison Wesley, Cambridge, Massachusetts, 1949.
|
CITED BY 13
|
|
|
|
|
|
|
|
Aristides Gionis , Heikki Mannila , Taneli Mielikäinen , Panayiotis Tsaparas, Assessing data mining results via swap randomization, Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, August 20-23, 2006, Philadelphia, PA, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|