|
ABSTRACT
Most data mining algorithms require the setting of many input parameters. Two main dangers of working with parameter-laden algorithms are the following. First, incorrect settings may cause an algorithm to fail in finding the true patterns. Second, a perhaps more insidious problem is that the algorithm may report spurious patterns that do not really exist, or greatly overestimate the significance of the reported patterns. This is especially likely when the user fails to understand the role of parameters in the data mining process.Data mining algorithms should have as few parameters as possible, ideally none. A parameter-free algorithm would limit our ability to impose our prejudices, expectations, and presumptions on the problem at hand, and would let the data itself speak to us. In this work, we show that recent results in bioinformatics and computational theory hold great promise for a parameter-free data-mining paradigm. The results are motivated by observations in Kolmogorov complexity theory. However, as a practical matter, they can be implemented using any off-the-shelf compression algorithm with the addition of just a dozen or so lines of code. We will show that this approach is competitive or superior to the state-of-the-art approaches in anomaly/interestingness detection, classification, and clustering with empirical tests on time series/DNA/text/video datasets.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
Allison, L., Stern, L., Edgoose, T., Dix, T.I. Sequence Complexity for Biological Sequence Analysis. Computers & Chemistry 24(1): 43--55 (2000)
|
| |
2
|
Benedetto, D., Caglioti, E., & Loreto, V. Language trees and zipping. Physical Review Letters 88, 048702, (2002).
|
 |
3
|
|
| |
4
|
Dasgupta, D. & Forrest,S. Novelty Detection in Time Series Data using Ideas from Immunology." In Proc. of the International Conference on Intelligent Systems (1999).
|
| |
5
|
|
| |
6
|
Elkan, C. Using the triangle inequality to accelerate k-Means. In Proc. of ICML 2003. pp 147--153
|
 |
7
|
|
| |
8
|
Ergun, F., Muthukrishnan, S., & Sahinalp, S.C. Comparing Sequences with Segment Rearrangements. FSTTCS 2003:
|
 |
9
|
Christos Faloutsos , King-Ip Lin, FastMap: a fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets, Proceedings of the 1995 ACM SIGMOD international conference on Management of data, p.163-174, May 22-25, 1995, San Jose, California, United States
|
| |
10
|
Martin Farach , Michiel Noordewier , Serap Savari , Larry Shepp , Abraham Wyner , Jacob Ziv, On the entropy of DNA: algorithms and measurements based on memory and rapid convergence, Proceedings of the sixth annual ACM-SIAM symposium on Discrete algorithms, p.48-57, January 22-24, 1995, San Francisco, California, United States
|
| |
11
|
Flexer, A. Statistical evaluation of neural networks experiments: Minimum requirements and current practice. In Proc. of the 13th European Meeting on Cybernetics and Systems Research, vol. 2, pp 1005-1008, Austria, 1996
|
| |
12
|
Gatlin, L. Information Theory and the Living Systems. Columbia University Press, 1972.
|
 |
13
|
Martin Gavrilov , Dragomir Anguelov , Piotr Indyk , Rajeev Motwani, Mining the stock market (extended abstract): which measure is best?, Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, p.487-496, August 20-23, 2000, Boston, Massachusetts, United States
[doi> 10.1145/347090.347189]
|
 |
14
|
|
| |
15
|
Goldberger, A.L., Amaral, L., Glass, L, Hausdorff, J.M., Ivanov, P.Ch., Mark, R.G., Mietus, J.E., Moody, G.B., Peng, C.K., Stanley, H.E.. PhysioBank, PhysioToolkit, and PhysioNet: Circulation 101(23):e215-e220
|
| |
16
|
Goodman, J. Comment on "Language Trees and Zipping", unpublished manuscript, 2002 (available at {http://research.microsoft.com/~joshuago/}.
|
| |
17
|
|
| |
18
|
Keogh, E. http://www.cs.ucr.edu/~eamonn/SIGKDD2004.
|
| |
19
|
Keogh, E. & Folias, T. The UCR Time Series Data Mining Archive. Riverside CA. 2002. {http://www.cs.ucr.edu/~eamonn/TSDMA/index.html}.
|
 |
20
|
|
| |
21
|
|
| |
22
|
Li, M., Badger, J.H., Chen, X., Kwong, S, Kearney, P., & Zhang, H. An information-based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics 17: 149--154, 2001.
|
| |
23
|
Ming Li , Xin Chen , Xin Li , Bin Ma , Paul Vitányi, The similarity metric, Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms, January 12-14, 2003, Baltimore, Maryland
|
| |
24
|
|
 |
25
|
Jessica Lin , Eamonn Keogh , Stefano Lonardi , Bill Chiu, A symbolic representation of time series, with implications for streaming algorithms, Proceedings of the 8th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery, June 13-13, 2003, San Diego, California
[doi> 10.1145/882082.882086]
|
| |
26
|
|
| |
27
|
Loewenstern, D., & Yianilos, P.N. Significantly lower entropy estimates for natural DNA sequences, Journal of Computational Biology, 6(1), 1999.
|
 |
28
|
|
| |
29
|
|
| |
30
|
Ratanamahatana, C.A. & Keogh, E. Making Time-series Classification More Accurate Using Learned Constraints. In proceedings of SIAM International Conference on Data Mining (SDM '04), Lake Buena Vista, Florida, April 22-24, 2004.
|
| |
31
|
Rissanen, J. Modeling by shortest data description. Automatica, vol. 14 (1978), pp. 465--471.
|
| |
32
|
|
| |
33
|
|
 |
34
|
|
| |
35
|
|
| |
36
|
Yairi, T., Kato, Y., & Hori, K. Fault Detection by Mining Association Rules from House-keeping Data, Proc. of Int'l Sym. on AI, Robotics and Automation in Space, 2001.
|
CITED BY 32
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Xian Zhang , Yu Hao , Xiaoyan Zhu , Ming Li , David R. Cheriton, Information distance from a question to an answer, Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, August 12-15, 2007, San Jose, California, USA
|
|
|
|
|
|
Jimeng Sun , Christos Faloutsos , Spiros Papadimitriou , Philip S. Yu, GraphScope: parameter-free mining of large time-evolving graphs, Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, August 12-15, 2007, San Jose, California, USA
|
|
|
|
|
|
|
|
|
Chong Long , Xiaoyan Zhu , Ming Li , Bin Ma, Information shared by many objects, Proceeding of the 17th ACM conference on Information and knowledge management, October 26-30, 2008, Napa Valley, California, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
David Minnen , Thad Starner , Irfan Essa , Charles Isbell, Improving activity discovery with automatic neighborhood estimation, Proceedings of the 20th international joint conference on Artifical intelligence, p.2814-2819, January 06-12, 2007, Hyderabad, India
|
|
|
|
|
|
|
|