| Tell me something I don't know: randomization strategies for iterative data mining |
| Full text |
Mov
(27:15),
Pdf
(476 KB)
|
Source
|
International Conference on Knowledge Discovery and Data Mining
archive
Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
table of contents
Paris, France
SESSION: Research track papers
table of contents
Pages 379-388
Year of Publication: 2009
ISBN:978-1-60558-495-9
|
|
Authors
|
|
Sami Hanhijärvi
|
Helsinki University of Technology, Espoo, Finland
|
|
Markus Ojala
|
Helsinki University of Technology, Espoo, Finland
|
|
Niko Vuokko
|
Helsinki University of Technology, Espoo, Finland
|
|
Kai Puolamäki
|
Helsinki University of Technology, Espoo, Finland
|
|
Nikolaj Tatti
|
Helsinki University of Technology, Espoo, Finland
|
|
Heikki Mannila
|
Helsinki University of Technology, Espoo, Finland
|
|
| Sponsors |
|
| Publisher |
|
| Bibliometrics |
Downloads (6 Weeks): 56, Downloads (12 Months): 206, Citation Count: 0
|
|
|
ABSTRACT
There is a wide variety of data mining methods available, and it is generally useful in exploratory data analysis to use many different methods for the same dataset. This, however, leads to the problem of whether the results found by one method are a reflection of the phenomenon shown by the results of another method, or whether the results depict in some sense unrelated properties of the data. For example, using clustering can give indication of a clear cluster structure, and computing correlations between variables can show that there are many significant correlations in the data. However, it can be the case that the correlations are actually determined by the cluster structure. In this paper, we consider the problem of randomizing data so that previously discovered patterns or models are taken into account. The randomization methods can be used in iterative data mining. At each step in the data mining process, the randomization produces random samples from the set of data matrices satisfying the already discovered patterns or models. That is, given a data set and some statistics (e.g., cluster centers or co-occurrence counts) of the data, the randomization methods sample data sets having similar values of the given statistics as the original data set. We use Metropolis sampling based on local swaps to achieve this. We describe experiments on real data that demonstrate the usefulness of our approach. Our results indicate that in many cases, the results of, e.g., clustering actually imply the results of, say, frequent pattern discovery.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), 57(1):289--300, 1995.
|
| |
2
|
Julian Besag. Markov chain Monte Carlo methods for statistical inference. http://www.ims.nus.edu.sg/Programs/mcmc/files/besag\_tl.pdf, 2004.
|
| |
3
|
Julian Besag and Peter Clifford. Generalized Monte Carlo significance tests. Biometrica, 76(4):633--642, 1989.
|
| |
4
|
|
| |
5
|
G. W. Cobb and Y.-P. Chen. An application of markov chain monte carlo to community ecology. American Mathematical Monthly, (110):264--288, 2003.
|
| |
6
|
Mikael Fortelius and Jussi Eronen. Now - neogene of the old world. http://www.helsinki.fi/science/now/, 2005. Database of fossil mammals.
|
| |
7
|
|
| |
8
|
Charles J. Geyer. Markov chain monte carlo maximum likelihood. In Computing Science and Statistics: Proc. of 23rd Symposium on the Interface Foundation, 1991.
|
 |
9
|
|
| |
10
|
Phillip Good. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. Springer-Verlag, 2000.
|
| |
11
|
S. Holm. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 1979.
|
| |
12
|
Szymon Jaroszewicz. Interactive hmm construction based on interesting sequences. In Local Patterns to Global Models (LeGo'08) Workshop at the the European Conference on Principles and Practice of Knowledge Discovery in Databases, 2008.
|
| |
13
|
David Jensen. Knowledge discovery through induction with randomization testing. In Proc. of the 1991 Knowledge Discovery in Databases Workshop, 1991.
|
| |
14
|
N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculation by fast computing machines. Journal of Chemical Physics, 21(1):1087--91, 1953.
|
| |
15
|
Taneli Mielikainen and Heikki Mannila. The pattern ordering problem. Lecture Notes in Computer Science, 2838:327--338, 2003.
|
| |
16
|
Markus Ojala, Niko Vuokko, Aleksi Kallio, Niina Haiminen, and Heikki Mannila. Randomization of real-valued matrices for assessing the significance of data mining results. In Proc. of the 2008 SIAM International Conference on Data Mining, 2008.
|
| |
17
|
Christos H. Papadimitriou. Computational Complexity. Addison-Wesley, 1994.
|
| |
18
|
|
| |
19
|
|
| |
20
|
Peter H. Westfall and S. Stanley Young. Resampling-based multiple testing: examples and methods for p-value adjustment. Wiley, 1993.
|
|