ACM Home Page
Please provide us with feedback. Feedback
Digital Library logoTake a look at the new version of this page: [ beta version ]. Tell us what you think.
Assessing data mining results via swap randomization
Full text PdfPdf (388 KB)
Source
ACM Transactions on Knowledge Discovery from Data (TKDD) archive
Volume 1 ,  Issue 3  (December 2007) table of contents
Article No.: 14  
Year of Publication: 2007
ISSN:1556-4681
Authors
Aristides Gionis  Yahoo! Research, Barcelona, Spain
Heikki Mannila  University of Helsinki and Helsinki University of Technology, Helsinki, Finland
Taneli Mielikäinen  Nokia Research Center, Palo Alto, CA
Panayiotis Tsaparas  Search Labs, Microsoft Research, Mountain View, CA
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 28,   Downloads (12 Months): 212,   Citation Count: 1
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1297332.1297338
What is a DOI?

Warning: The download time has expired please click on the item to try again.


ABSTRACT

The problem of assessing the significance of data mining results on high-dimensional 0--1 datasets has been studied extensively in the literature. For problems such as mining frequent sets and finding correlations, significance testing can be done by standard statistical tests such as chi-square, or other methods. However, the results of such tests depend only on the specific attributes and not on the dataset as a whole. Moreover, the tests are difficult to apply to sets of patterns or other complex results of data mining algorithms. In this article, we consider a simple randomization technique that deals with this shortcoming. The approach consists of producing random datasets that have the same row and column margins as the given dataset, computing the results of interest on the randomized instances and comparing them to the results on the actual data. This randomization technique can be used to assess the results of many different types of data mining algorithms, such as frequent sets, clustering, and spectral analysis. To generate random datasets with given margins, we use variations of a Markov chain approach which is based on a simple swap operation. We give theoretical results on the efficiency of different randomization methods, and apply the swap randomization method to several well-known datasets. Our results indicate that for some datasets the structure discovered by the data mining algorithms is expected, given the row and column margins of the datasets, while for other datasets the discovered structure conveys information that is not captured by the margin counts.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
Besag, J. 2004. Markov chain Monte Carlo methods for statistical inference. http://www.ims.nus.edu.sg/Programs/mcmc/files/besag_tl.pdf.
 
2
Besag, J. and Clifford, P. 1989. Generalized Monte Carlo significance tests. Biometrika 76, 4, 633--642.
 
3
Besag, J. and Clifford, P. 1991. Sequential Monte Carlo p-values. Biometrika 78, 2, 301--304.
4
 
5
Bezáková, I., Sinclair, A., Stefankovic, D., and Vigoda, E. 2006. Negative examples for sequential importance sampling of binary contingency tables. http://arxiv.org/abs/math.ST/0606650.
6
7
8
 
9
Chen, Y., Diaconis, P., Holmes, S. P., and Liu, J. S. 2005. Sequential Monte Carlo methods for statistical analysis of tables. J. Amer. Statis. Assoc. 100, 469, 109--120.
 
10
Cobb, G. W. and Chen, Y.-P. 2003. An application of Markov chain Monte Carlo to community ecology. Amer. Math. Month. 110, 264--288.
 
11
Diaconis, P. and Gangolli, A. 1995. Rectangular arrays with fixed margins. In Discrete Probability and Algorithms, 15--41.
 
12
Diaconis, P. and Saloff-Coste, L. 1995. Random walk on contingency tables with fixed row and column sums. Tech. Rep., Department of Mathematics, Harvard University.
13
14
 
15
Fortelius, M. 2006. Neogene of the old world database of fossil mammals (NOW). http://www.helsinki.fi/science/now/.
 
16
Good, P. 2000. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. Springer.
 
17
Hastings, W. K. 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 1, 97--109.
 
18
 
19
20
21
 
22
Megiddo, N. and Srikant, R. 1998. Discovering predictive association rules. In Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining (KDD), New York, 274--278.
 
23
Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., and Teller, E. 1953. Equations of state calculations by fast computing machines. J. Chem. Phys. 21, 1087--1092.
 
24
Mielikäinen, T. 2003. On inverse frequent set mining. In Proceedings of the 2nd Workshop on Privacy Preserving Data Mining (PPDM), IEEE Computer Society, 18--23.
 
25
Milo, R., Shen-Orr, S., Itzkovirz, S., Kashtan, N., Chklovskii, D., and Alon, U. 2002. Network motifs: Simple building blocks of complex networks. Sci. 298, 824--827.
 
26
Newman, M. 2003. The structure and function of complex networks. SIAM Rev. 45, 2, 167--256.
 
27
Ryser, H. J. 1957. Combinatorial properties of matrices of zeros and ones. Canadian J. Math. 9, 371--377.
 
28
Sanderson, J. 2000. Testing ecological patterns. Amer. Sci. 88, 332--339.
 
29
Snijders, F. 1991. Enumeration and simulation methods for 0--1 matrices with given marginals. Psychometrika 56, 397--417.
30
 
31
Tomkins, A. 2006. Private communication.
 
32
33
 
34
35


Collaborative Colleagues:
Aristides Gionis: colleagues
Heikki Mannila: colleagues
Taneli Mielikäinen: colleagues
Panayiotis Tsaparas: colleagues