|
ABSTRACT
The problem of assessing the significance of data mining results on high-dimensional 0-1 data sets has been studied extensively in the literature. For problems such as mining frequent sets and finding correlations, significance testing can be done by, e.g., chi-square tests, or many other methods. However, the results of such tests depend only on the specific attributes and not on the dataset as a whole. Moreover, the tests are more difficult to apply to sets of patterns or other complex results of data mining. In this paper, we consider a simple randomization technique that deals with this shortcoming. The approach consists of producing random datasets that have the same row and column margins with the given dataset, computing the results of interest on the randomized instances, and comparing them against the results on the actual data. This randomization technique can be used to assess the results of many different types of data mining algorithms, such as frequent sets, clustering, and rankings. To generate random datasets with given margins, we use variations of a Markov chain approach, which is based on a simple swap operation. We give theoretical results on the efficiency of different randomization methods, and apply the swap randomization method to several well-known datasets. Our results indicate that for some datasets the structure discovered by the data mining algorithms is a random artifact, while for other datasets the discovered structure conveys meaningful information.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
Bezáková, I., Bhatnagar, N., and Vigoda, E. Sampling binary contingency tables with a greedy start. In SODA (2006), SIAM.
|
 |
2
|
Tom Brijs , Gilbert Swinnen , Koen Vanhoof , Geert Wets, Using association rules for product assortment decisions: a case study, Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, p.254-260, August 15-18, 1999, San Diego, California, United States
[doi> 10.1145/312129.312241]
|
 |
3
|
Sergey Brin , Rajeev Motwani , Craig Silverstein, Beyond market baskets: generalizing association rules to correlations, Proceedings of the 1997 ACM SIGMOD international conference on Management of data, p.265-276, May 11-15, 1997, Tucson, Arizona, United States
|
| |
4
|
Chen, Y., Diaconis, P., Holmes, S. P., and Liu, J. S. Sequential Monte Carlo methods for statistical analysis of tables. Journal of the American Statistical Association 100, 469 (2005), 109--120.
|
| |
5
|
Cobb, G. W., and Chen, Y.-P. An application of Markov chain Monte Carlo to community ecology. American Mathematical Monthly 110 (2003), 264--288.
|
| |
6
|
Diaconis, P., and Gangolli, A. Rectangular arrays with fixed margins. In Discrete Probability and Algorithms (1995), pp. 15--41.
|
 |
7
|
|
| |
8
|
Good, P. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. Springer, 2000.
|
| |
9
|
Hastings, W. K. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57 (1970).
|
| |
10
|
|
| |
11
|
|
 |
12
|
Bing Liu , Wynne Hsu , Yiming Ma, Pruning and summarizing the discovered associations, Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, p.125-134, August 15-18, 1999, San Diego, California, United States
[doi> 10.1145/312129.312216]
|
| |
13
|
Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., and Teller, E. Equations of state calculations by fast computing machines. Journal of Chemical Physics 21 (1953).
|
| |
14
|
Milo, R., Shen-Orr, S., Itzkovirz, S., Kashtan, N., Chklovskii, D., and Alon, U. Network motifs: Simple building blocks of complex networks. Science 298, (2002).
|
| |
15
|
Newman, M. The structure and function of complex networks. SIAM Review 45, 2 (2003), 167--256.
|
| |
16
|
Sanderson, J. Testing ecological patterns. American Scientist 88, 332--339 (2000).
|
| |
17
|
Snijders, F. Enumeration and simulation methods for 0-1 matrices with given marginals. Psychometrika 56 (1991), 397--417.
|
 |
18
|
|
| |
19
|
|
 |
20
|
Hui Xiong , Shashi Shekhar , Pang-Ning Tan , Vipin Kumar, Exploiting a support-based upper bound of Pearson's correlation coefficient for efficiently identifying strongly correlated pairs, Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, August 22-25, 2004, Seattle, WA, USA
[doi> 10.1145/1014052.1014090]
|
CITED BY 6
|
|
|
|
|
|
|
|
Hannes Heikinheimo , Eino Hinkkanen , Heikki Mannila , Taneli Mielikäinen , Jouni K. Seppänen, Finding low-entropy sets and trees from binary data, Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, August 12-15, 2007, San Jose, California, USA
|
|
|
|
|
|
|
|
|
Adam Kirsch , Michael Mitzenmacher , Andrea Pietracaprina , Geppino Pucci , Eli Upfal , Fabio Vandin, An efficient rigorous approach for identifying statistically significant frequent itemsets, Proceedings of the twenty-eighth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, June 29-July 01, 2009, Providence, Rhode Island, USA
|
|