|
ABSTRACT
Several important time series data mining problems reduce to the core task of finding approximately repeated subsequences in a longer time series. In an earlier work, we formalized the idea of approximately repeated subsequences by introducing the notion of time series motifs. Two limitations of this work were the poor scalability of the motif discovery algorithm, and the inability to discover motifs in the presence of noise.Here we address these limitations by introducing a novel algorithm inspired by recent advances in the problem of pattern discovery in biosequences. Our algorithm is probabilistic in nature, but as we show empirically and theoretically, it can find time series motifs with very high probability even in the presence of noise or "don't care" symbols. Not only is the algorithm fast, but it is an anytime algorithm, producing likely candidate motifs almost immediately, and gradually improving the quality of results over time.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
Rakesh Agrawal , King-Ip Lin , Harpreet S. Sawhney , Kyuseok Shim, Fast Similarity Search in the Presence of Noise, Scaling, and Translation in Time-Series Databases, Proceedings of the 21th International Conference on Very Large Data Bases, p.490-501, September 11-15, 1995
|
 |
2
|
Alberto Apostolico , Mary Ellen Bock , Stefano Lonardi, Monotony of surprise and large-scale quest for unusual words, Proceedings of the sixth annual international conference on Computational biology, p.22-31, April 18-21, 2002, Washington, DC, USA
[doi> 10.1145/565196.565200]
|
| |
3
|
|
| |
4
|
Buhler, J. (2001). Efficient large-scale sequence comparison by locality-sensitive hashing, Bioinformatics 17: pp. 419--428.
|
 |
5
|
|
| |
6
|
|
| |
7
|
Das, G., Lin, K., Mannila, H., Renganathan, G. & Smyth, P. (1998). Rule discovery from time series. In proceedings of the 4th Int'l Conference on Knowledge Discovery and Data Mining. New York, NY, Aug 27--31. pp 16--22.
|
| |
8
|
Dasgupta., D. & Forrest, S. (1999). Novelty detection in time series data using ideas from immunology. In Proceedings of the 5th International Conference on Intelligent Systems (1999).
|
| |
9
|
Daw, C. S., Finney, C. E. A. & Tracy, E. R. (2001). Symbolic analysis of experimental data. Review of Scientific Instruments.
|
| |
10
|
Durbin, R., Eddy, S., Krogh, A. & Mitchison, G. (1998). Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press.
|
| |
11
|
Engelhardt, B., Chien, S. & Mutz, D. (2000). Hypothesis generation strategies for adaptive problem solving. In Proceedings of the IEEE Aerospace Conference, Big Sky, MT.
|
 |
12
|
|
| |
13
|
|
| |
14
|
|
| |
15
|
Hegland, M., Clarke, W. & Kahn, M. (2002). Mining the MACHO dataset, Computer Physics Communications, Vol 142(1--3), December 15. pp. 22--28.
|
| |
16
|
Hertz, G. & Stormo, G. (1999). Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics, Vol. 15, pp. 563--577.
|
| |
17
|
van Helden, J., Andre, B., & Collado-Vides, J. (1998) Extracting regulatory sites from the upstream region of the yeast genes by computational analysis of oligonucleotides. J. Mol. Biol., Vol. 281, pp. 827--842.
|
| |
18
|
|
| |
19
|
|
 |
20
|
Piotr Indyk , Rajeev Motwani , Prabhakar Raghavan , Santosh Vempala, Locality-preserving hashing in multidimensional spaces, Proceedings of the twenty-ninth annual ACM symposium on Theory of computing, p.618-625, May 04-06, 1997, El Paso, Texas, United States
[doi> 10.1145/258533.258656]
|
 |
21
|
|
| |
22
|
Keogh, E. and Pazzani, M. (1998). An enhanced representation of time series which allows fast and accurate classification clustering and relevance feedback. In 4th International Conference on Knowledge Discovery and Data Mining. New York, NY, Aug 27--31. pp 239--243
|
| |
23
|
Keogh, E., Chakrabarti, K., Pazzani, M. & Mehrotra (2000). Dimensionality reduction for fast similarity search in large time series databases. Journal of Knowledge and Information Systems. pp 263--286.
|
| |
24
|
Lawrence, C. E., Altschul, S. F., Boguski, M. S., Liu, J. S., Neuwald, A. F. & Wootton, J. C. (1993). Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment. Science, Oct. Vol. 262, pp 208--214.
|
| |
25
|
Lawrence. C. &. Reilly. A. (1990). An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins, Vol. 7, pp 41--51.
|
| |
26
|
Lin, J. Keogh, E. Patel, P. & Lonardi, S. (2002). Finding motifs in time series. In the 2nd Workshop on Temporal Data Mining, at the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Edmonton, Alberta, Canada.
|
| |
27
|
|
| |
28
|
|
| |
29
|
Reinert, G., Schbath, S. & Waterman, M. S. (2000). Probabilistic and statistical properties of words: An overview. J. Comput. Bio., Vol. 7, pp 1--46.
|
| |
30
|
Rigoutsos, I. & Floratos, A. (1998) Combinatorial pattern discovery in biological sequences: The Teiresias algorithm, Bioinformatics, 14(1), pp. 55--67.
|
| |
31
|
John F. Roddick , Kathleen Hornsby , Myra Spiliopoulou, An Updated Bibliography of Temporal, Spatial, and Spatio-temporal Data Mining Research, Proceedings of the First International Workshop on Temporal, Spatial, and Spatio-Temporal Data Mining-Revised Papers, p.147-164, September 12, 2000
|
| |
32
|
Scargle, J., (2000). Bayesian Blocks, A new method to analyze structure in photon counting data, Astrophysical Journal, 504, pp 405--418.
|
| |
33
|
Staden, R. (1989). Methods for discovering novel motifs in nucleic acid sequences. Comput. Appl. Biosci., Vol. 5(5). pp 293--298.
|
 |
34
|
|
| |
35
|
|
| |
36
|
|
| |
37
|
|
CITED BY 33
|
|
|
|
|
Jessica Lin , Eamonn Keogh , Stefano Lonardi , Jeffrey P. Lankford , Donna M. Nystrom, Visually mining and monitoring massive time series, Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, August 22-25, 2004, Seattle, WA, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Dragomir Yankov , Eamonn Keogh , Jose Medina , Bill Chiu , Victor Zordan, Detecting time series motifs under uniform scaling, Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, August 12-15, 2007, San Jose, California, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Xiuyao Song , Chris Jermaine , Sanjay Ranka , John Gums, A bayesian mixture model with linear regression mixing proportions, Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, August 24-27, 2008, Las Vegas, Nevada, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
David Minnen , Charles L. Isbell , Irfan Essa , Thad Starner, Discovering multivariate motifs using subsequence density estimation and greedy mixture learning, Proceedings of the 22nd national conference on Artificial intelligence, p.615-620, July 22-26, 2007, Vancouver, British Columbia, Canada
|
|
|
|
|
|
Raffay Hamid , Siddhartha Maddi , Amos Johnson , Aaron Bobick , Irfan Essa , Charles Isbell, A novel sequence representation for unsupervised analysis of human activities, Artificial Intelligence, v.173 n.14, p.1221-1244, September, 2009
|
|
|
|
|
|
|
|
|
David Minnen , Thad Starner , Irfan Essa , Charles Isbell, Improving activity discovery with automatic neighborhood estimation, Proceedings of the 20th international joint conference on Artifical intelligence, p.2814-2819, January 06-12, 2007, Hyderabad, India
|
|
|
|
|