ACM Home Page
Please provide us with feedback. Feedback
Tell me something I don't know: randomization strategies for iterative data mining
Full text MovMov (27:15),  PdfPdf (476 KB)
Source
International Conference on Knowledge Discovery and Data Mining archive
Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining table of contents
Paris, France
SESSION: Research track papers table of contents
Pages 379-388  
Year of Publication: 2009
ISBN:978-1-60558-495-9
Authors
Sami Hanhijärvi  Helsinki University of Technology, Espoo, Finland
Markus Ojala  Helsinki University of Technology, Espoo, Finland
Niko Vuokko  Helsinki University of Technology, Espoo, Finland
Kai Puolamäki  Helsinki University of Technology, Espoo, Finland
Nikolaj Tatti  Helsinki University of Technology, Espoo, Finland
Heikki Mannila  Helsinki University of Technology, Espoo, Finland
Sponsors
ACM: Association for Computing Machinery
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 56,   Downloads (12 Months): 206,   Citation Count: 0
Additional Information:

abstract   references   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1557019.1557065
What is a DOI?

ABSTRACT

There is a wide variety of data mining methods available, and it is generally useful in exploratory data analysis to use many different methods for the same dataset. This, however, leads to the problem of whether the results found by one method are a reflection of the phenomenon shown by the results of another method, or whether the results depict in some sense unrelated properties of the data. For example, using clustering can give indication of a clear cluster structure, and computing correlations between variables can show that there are many significant correlations in the data. However, it can be the case that the correlations are actually determined by the cluster structure.

In this paper, we consider the problem of randomizing data so that previously discovered patterns or models are taken into account. The randomization methods can be used in iterative data mining. At each step in the data mining process, the randomization produces random samples from the set of data matrices satisfying the already discovered patterns or models. That is, given a data set and some statistics (e.g., cluster centers or co-occurrence counts) of the data, the randomization methods sample data sets having similar values of the given statistics as the original data set. We use Metropolis sampling based on local swaps to achieve this. We describe experiments on real data that demonstrate the usefulness of our approach. Our results indicate that in many cases, the results of, e.g., clustering actually imply the results of, say, frequent pattern discovery.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), 57(1):289--300, 1995.
 
2
Julian Besag. Markov chain Monte Carlo methods for statistical inference. http://www.ims.nus.edu.sg/Programs/mcmc/files/besag\_tl.pdf, 2004.
 
3
Julian Besag and Peter Clifford. Generalized Monte Carlo significance tests. Biometrica, 76(4):633--642, 1989.
 
4
 
5
G. W. Cobb and Y.-P. Chen. An application of markov chain monte carlo to community ecology. American Mathematical Monthly, (110):264--288, 2003.
 
6
Mikael Fortelius and Jussi Eronen. Now - neogene of the old world. http://www.helsinki.fi/science/now/, 2005. Database of fossil mammals.
 
7
 
8
Charles J. Geyer. Markov chain monte carlo maximum likelihood. In Computing Science and Statistics: Proc. of 23rd Symposium on the Interface Foundation, 1991.
9
 
10
Phillip Good. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. Springer-Verlag, 2000.
 
11
S. Holm. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 1979.
 
12
Szymon Jaroszewicz. Interactive hmm construction based on interesting sequences. In Local Patterns to Global Models (LeGo'08) Workshop at the the European Conference on Principles and Practice of Knowledge Discovery in Databases, 2008.
 
13
David Jensen. Knowledge discovery through induction with randomization testing. In Proc. of the 1991 Knowledge Discovery in Databases Workshop, 1991.
 
14
N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculation by fast computing machines. Journal of Chemical Physics, 21(1):1087--91, 1953.
 
15
Taneli Mielikainen and Heikki Mannila. The pattern ordering problem. Lecture Notes in Computer Science, 2838:327--338, 2003.
 
16
Markus Ojala, Niko Vuokko, Aleksi Kallio, Niina Haiminen, and Heikki Mannila. Randomization of real-valued matrices for assessing the significance of data mining results. In Proc. of the 2008 SIAM International Conference on Data Mining, 2008.
 
17
Christos H. Papadimitriou. Computational Complexity. Addison-Wesley, 1994.
 
18
 
19
 
20
Peter H. Westfall and S. Stanley Young. Resampling-based multiple testing: examples and methods for p-value adjustment. Wiley, 1993.

Collaborative Colleagues:
Sami Hanhijärvi: colleagues
Markus Ojala: colleagues
Niko Vuokko: colleagues
Kai Puolamäki: colleagues
Nikolaj Tatti: colleagues
Heikki Mannila: colleagues