ACM Home Page
Please provide us with feedback. Feedback
Towards parameter-free data mining
Full text PdfPdf (771 KB)
Source International Conference on Knowledge Discovery and Data Mining archive
Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining table of contents
Seattle, WA, USA
SESSION: Research track papers table of contents
Pages: 206 - 215  
Year of Publication: 2004
ISBN:1-58113-888-1
Authors
Eamonn Keogh  University of California, Riverside, Riverside, CA
Stefano Lonardi  University of California, Riverside, Riverside, CA
Chotirat Ann Ratanamahatana  University of California, Riverside, Riverside, CA
Sponsors
SIGMOD: ACM Special Interest Group on Management of Data
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 36,   Downloads (12 Months): 245,   Citation Count: 32
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1014052.1014077
What is a DOI?

ABSTRACT

Most data mining algorithms require the setting of many input parameters. Two main dangers of working with parameter-laden algorithms are the following. First, incorrect settings may cause an algorithm to fail in finding the true patterns. Second, a perhaps more insidious problem is that the algorithm may report spurious patterns that do not really exist, or greatly overestimate the significance of the reported patterns. This is especially likely when the user fails to understand the role of parameters in the data mining process.Data mining algorithms should have as few parameters as possible, ideally none. A parameter-free algorithm would limit our ability to impose our prejudices, expectations, and presumptions on the problem at hand, and would let the data itself speak to us. In this work, we show that recent results in bioinformatics and computational theory hold great promise for a parameter-free data-mining paradigm. The results are motivated by observations in Kolmogorov complexity theory. However, as a practical matter, they can be implemented using any off-the-shelf compression algorithm with the addition of just a dozen or so lines of code. We will show that this approach is competitive or superior to the state-of-the-art approaches in anomaly/interestingness detection, classification, and clustering with empirical tests on time series/DNA/text/video datasets.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
Allison, L., Stern, L., Edgoose, T., Dix, T.I. Sequence Complexity for Biological Sequence Analysis. Computers & Chemistry 24(1): 43--55 (2000)
 
2
Benedetto, D., Caglioti, E., & Loreto, V. Language trees and zipping. Physical Review Letters 88, 048702, (2002).
3
 
4
Dasgupta, D. & Forrest,S. Novelty Detection in Time Series Data using Ideas from Immunology." In Proc. of the International Conference on Intelligent Systems (1999).
 
5
 
6
Elkan, C. Using the triangle inequality to accelerate k-Means. In Proc. of ICML 2003. pp 147--153
7
 
8
Ergun, F., Muthukrishnan, S., & Sahinalp, S.C. Comparing Sequences with Segment Rearrangements. FSTTCS 2003:
9
 
10
 
11
Flexer, A. Statistical evaluation of neural networks experiments: Minimum requirements and current practice. In Proc. of the 13th European Meeting on Cybernetics and Systems Research, vol. 2, pp 1005-1008, Austria, 1996
 
12
Gatlin, L. Information Theory and the Living Systems. Columbia University Press, 1972.
13
14
 
15
Goldberger, A.L., Amaral, L., Glass, L, Hausdorff, J.M., Ivanov, P.Ch., Mark, R.G., Mietus, J.E., Moody, G.B., Peng, C.K., Stanley, H.E.. PhysioBank, PhysioToolkit, and PhysioNet: Circulation 101(23):e215-e220
 
16
Goodman, J. Comment on "Language Trees and Zipping", unpublished manuscript, 2002 (available at {http://research.microsoft.com/~joshuago/}.
 
17
 
18
Keogh, E. http://www.cs.ucr.edu/~eamonn/SIGKDD2004.
 
19
Keogh, E. & Folias, T. The UCR Time Series Data Mining Archive. Riverside CA. 2002. {http://www.cs.ucr.edu/~eamonn/TSDMA/index.html}.
20
 
21
 
22
Li, M., Badger, J.H., Chen, X., Kwong, S, Kearney, P., & Zhang, H. An information-based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics 17: 149--154, 2001.
 
23
 
24
25
 
26
 
27
Loewenstern, D., & Yianilos, P.N. Significantly lower entropy estimates for natural DNA sequences, Journal of Computational Biology, 6(1), 1999.
28
 
29
 
30
Ratanamahatana, C.A. & Keogh, E. Making Time-series Classification More Accurate Using Learned Constraints. In proceedings of SIAM International Conference on Data Mining (SDM '04), Lake Buena Vista, Florida, April 22-24, 2004.
 
31
Rissanen, J. Modeling by shortest data description. Automatica, vol. 14 (1978), pp. 465--471.
 
32
 
33
34
 
35
 
36
Yairi, T., Kato, Y., & Hori, K. Fault Detection by Mining Association Rules from House-keeping Data, Proc. of Int'l Sym. on AI, Robotics and Automation in Space, 2001.

CITED BY  32

Collaborative Colleagues:
Eamonn Keogh: colleagues
Stefano Lonardi: colleagues
Chotirat Ann Ratanamahatana: colleagues