ACM Home Page
Please provide us with feedback. Feedback
Exploiting a support-based upper bound of Pearson's correlation coefficient for efficiently identifying strongly correlated pairs
Full text PdfPdf (343 KB)
Source International Conference on Knowledge Discovery and Data Mining archive
Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining table of contents
Seattle, WA, USA
SESSION: Research track papers table of contents
Pages: 334 - 343  
Year of Publication: 2004
ISBN:1-58113-888-1
Authors
Hui Xiong  University of Minnesota
Shashi Shekhar  University of Minnesota
Pang-Ning Tan  Michigan State University
Vipin Kumar  University of Minnesota
Sponsors
SIGMOD: ACM Special Interest Group on Management of Data
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 9,   Downloads (12 Months): 72,   Citation Count: 13
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1014052.1014090
What is a DOI?

ABSTRACT

Given a user-specified minimum correlation threshold θ and a market basket database with N items and T transactions, an all-strong-pairs correlation query finds all item pairs with correlations above the threshold θ. However, when the number of items and transactions are large, the computation cost of this query can be very high. In this paper, we identify an upper bound of Pearson's correlation coefficient for binary variables. This upper bound is not only much cheaper to compute than Pearson's correlation coefficient but also exhibits a special monotone property which allows pruning of many item pairs even without computing their upper bounds. A Two-step All-strong-Pairs corrElation que Ry (TAPER) algorithm is proposed to exploit these properties in a filter-and-refine manner. Furthermore, we provide an algebraic cost model which shows that the computation savings from pruning is independent or improves when the number of items is increased in data sets with common Zipf or linear rank-support distributions. Experimental results from synthetic and real data sets exhibit similar trends and show that the TAPER algorithm can be an order of magnitude faster than brute-force alternatives.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

1
 
2
3
4
 
5
 
6
E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, R. Motwani, J. Ullman, and C. Yang. Finding interesting associations without support pruning. In ICDE, 2000.
7
 
8
G. Grahne, L. V. Lakshmanan, and X. Wang. Efficient mining of constrained correlated sets. In ICDE, 2000.
9
 
10
11
 
12
S. K. Kachigan. Multivariate Statistical Analysis: A Conceptual Introduction. Radius Press, 1991.
13
 
14
 
15
H. T. Reynolds. The Analysis of Cross-classifications. The Free Press, New York, 1977.
 
16
R. Rymon. Search through systematic set enumeration. In Int'l. Conf. on Principles of Knowledge Representation and Reasoning, 1992.
 
17
H. Xiong, , S. Shekhar, P. Tan, and V. Kumar. Taper: An efficient two-step approach for all-pairs correlation query in transaction databases. In Technical Report 03-020, computer science and engineering, University of Minnesota - Twin Cities, May 2003.
 
18
G. Zipf. Human Behavior and Principle of Least Effort: An Introduction to Human Ecology. Addison Wesley, Cambridge, Massachusetts, 1949.

CITED BY  13

Collaborative Colleagues:
Hui Xiong: colleagues
Shashi Shekhar: colleagues
Pang-Ning Tan: colleagues
Vipin Kumar: colleagues