Assessing data mining results via swap randomization
Source: International Conference on Knowledge Discovery and Data Mining
Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
Philadelphia, PA, USA
Session: Research track papers
Pages: 167-176
Year of Publication: 2006
ISBN: 1-59593-339-5
Authors
Aristides Gionis  University of Helsinki & Helsinki University of Technology
Heikki Mannila  University of Helsinki & Helsinki University of Technology
Taneli Mielikäinen  University of Helsinki & Helsinki University of Technology
Panayiotis Tsaparas  University of Helsinki & Helsinki University of Technology
Sponsors
ACM: Association for Computing Machinery
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 5,   Downloads (12 Months): 61,   Citation Count: 6
DOI: http://doi.acm.org/10.1145/1150402.1150424

ABSTRACT

The problem of assessing the significance of data mining results on high-dimensional 0-1 datasets has been studied extensively in the literature. For problems such as mining frequent sets and finding correlations, significance testing can be done with, e.g., chi-square tests or many other methods. However, the results of such tests depend only on the specific attributes and not on the dataset as a whole. Moreover, the tests are more difficult to apply to sets of patterns or other complex results of data mining. In this paper, we consider a simple randomization technique that deals with this shortcoming. The approach consists of producing random datasets that have the same row and column margins as the given dataset, computing the results of interest on the randomized instances, and comparing them against the results on the actual data. This randomization technique can be used to assess the results of many different types of data mining algorithms, such as frequent sets, clustering, and rankings. To generate random datasets with given margins, we use variations of a Markov chain approach, which is based on a simple swap operation. We give theoretical results on the efficiency of different randomization methods, and apply the swap randomization method to several well-known datasets. Our results indicate that for some datasets the structure discovered by the data mining algorithms is a random artifact, while for other datasets the discovered structure conveys meaningful information.
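
The procedure outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`swap_randomize`, `empirical_p_value`, `pair_support`), the number of swap attempts, the choice to leave the matrix unchanged when an attempted swap is invalid, and the "+1" smoothing of the empirical p-value are all assumptions made for illustration. A swap takes two 1-entries at positions (r1, c1) and (r2, c2) whose opposite corners (r1, c2) and (r2, c1) are both 0 and exchanges them, which leaves every row sum and column sum unchanged.

```python
import random


def swap_randomize(matrix, num_attempts, seed=None):
    """Return a randomized copy of a 0-1 matrix with the same row and column sums.

    Hypothetical helper: each attempt picks two 1-entries at random and swaps
    them if the opposite corners of their 2x2 submatrix are both 0. Invalid
    attempts leave the matrix unchanged (an assumption of this sketch).
    """
    rng = random.Random(seed)
    data = [list(row) for row in matrix]  # work on a copy, never the input
    ones = [(r, c) for r, row in enumerate(data)
            for c, v in enumerate(row) if v == 1]
    if len(ones) < 2:
        return data
    for _ in range(num_attempts):
        i = rng.randrange(len(ones))
        j = rng.randrange(len(ones))
        (r1, c1), (r2, c2) = ones[i], ones[j]
        if r1 != r2 and c1 != c2 and data[r1][c2] == 0 and data[r2][c1] == 0:
            # Move the two 1s to the other diagonal of the 2x2 submatrix.
            data[r1][c1] = data[r2][c2] = 0
            data[r1][c2] = data[r2][c1] = 1
            ones[i], ones[j] = (r1, c2), (r2, c1)
    return data


def pair_support(matrix, a=0, b=1):
    """Example statistic: number of rows containing both attribute a and b."""
    return sum(1 for row in matrix if row[a] == 1 and row[b] == 1)


def empirical_p_value(matrix, statistic, num_samples=100,
                      num_attempts=1000, seed=0):
    """Fraction of randomized datasets whose statistic is at least as large
    as the one observed on the real data (with +1 smoothing, an assumption)."""
    rng = random.Random(seed)
    observed = statistic(matrix)
    hits = 0
    for _ in range(num_samples):
        sample = swap_randomize(matrix, num_attempts,
                                seed=rng.randrange(2 ** 32))
        if statistic(sample) >= observed:
            hits += 1
    return (hits + 1) / (num_samples + 1)


if __name__ == "__main__":
    toy = [
        [1, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 1, 1],
        [0, 1, 1, 0],
        [1, 0, 0, 1],
    ]
    print("observed support:", pair_support(toy))
    print("empirical p-value:", empirical_p_value(toy, pair_support))
```

A small empirical p-value would suggest that the observed statistic is unlikely to arise from the row and column margins alone. How many swap attempts are needed for the chain to mix well depends on the dataset and is not addressed by this sketch.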


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
Bezáková, I., Bhatnagar, N., and Vigoda, E. Sampling binary contingency tables with a greedy start. In SODA (2006), SIAM.
2
3
 
4
Chen, Y., Diaconis, P., Holmes, S. P., and Liu, J. S. Sequential Monte Carlo methods for statistical analysis of tables. Journal of the American Statistical Association 100, 469 (2005), 109--120.
 
5
Cobb, G. W., and Chen, Y.-P. An application of Markov chain Monte Carlo to community ecology. American Mathematical Monthly 110 (2003), 264--288.
 
6
Diaconis, P., and Gangolli, A. Rectangular arrays with fixed margins. In Discrete Probability and Algorithms (1995), pp. 15--41.
7
 
8
Good, P. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. Springer, 2000.
 
9
Hastings, W. K. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57 (1970).
 
10
 
11
12
 
13
Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., and Teller, E. Equation of state calculations by fast computing machines. Journal of Chemical Physics 21 (1953).
 
14
Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskii, D., and Alon, U. Network motifs: Simple building blocks of complex networks. Science 298 (2002).
 
15
Newman, M. The structure and function of complex networks. SIAM Review 45, 2 (2003), 167--256.
 
16
Sanderson, J. Testing ecological patterns. American Scientist 88 (2000), 332--339.
 
17
Snijders, T. Enumeration and simulation methods for 0-1 matrices with given marginals. Psychometrika 56 (1991), 397--417.
18
 
19
20

