|
ABSTRACT
The parallel explosions of interest in streaming data, and data mining of time series have had surprisingly little intersection. This is in spite of the fact that time series data are typically streaming data. The main reason for this apparent paradox is the fact that the vast majority of work on streaming data explicitly assumes that the data is discrete, whereas the vast majority of time series data is real valued.Many researchers have also considered transforming real valued time series into symbolic representations, nothing that such representations would potentially allow researchers to avail of the wealth of data structures and algorithms from the text processing and bioinformatics communities, in addition to allowing formerly "batch-only" problems to be tackled by the streaming community. While many symbolic representations of time series have been introduced over the past decades, they all suffer from three fatal flaws. Firstly, the dimensionality of the symbolic representation is the same as the original data, and virtually all data mining algorithms scale poorly with dimensionality. Secondly, although distance measures can be defined on the symbolic approaches, these distance measures have little correlation with distance measures defined on the original time series. Finally, most of these symbolic approaches require one to have access to all the data, before creating the symbolic representation. This last feature explicitly thwarts efforts to use the representations with streaming algorithms.In this work we introduce a new symbolic representation of time series. Our representation is unique in that it allows dimensionality/numerosity reduction, and it also allows distance measures to be defined on the symbolic approach that lower bound corresponding distance measures defined on the original series. As we shall demonstrate, this latter feature is particularly exciting because it allows one to run certain data mining algorithms on the efficiently manipulated symbolic representation, while producing identical results to the algorithms that operate on the original data. Finally, our representation allows the real valued data to be converted in a streaming fashion, with only an infinitesimal time and space overhead.We will demonstrate the utility of our representation on the classic data mining tasks of clustering, classification, query by content and anomaly detection.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
| |
2
|
|
 |
3
|
Alberto Apostolico , Mary Ellen Bock , Stefano Lonardi, Monotony of surprise and large-scale quest for unusual words, Proceedings of the sixth annual international conference on Computational biology, p.22-31, April 18-21, 2002, Washington, DC, USA
[doi> 10.1145/565196.565200]
|
 |
4
|
Brian Babcock , Shivnath Babu , Mayur Datar , Rajeev Motwani , Jennifer Widom, Models and issues in data stream systems, Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, June 03-05, 2002, Madison, Wisconsin
[doi> 10.1145/543613.543615]
|
| |
5
|
Bastogne, T., Noura, H., Richard A. amp; Hittinger, J. M. (2002). Application of Subspace Methods to the Identification of a Winding Process. In proceedings of the 4th European Control Conference, Vol. 5, Brussels.
|
| |
6
|
Berndt, D. amp; Clifford, J. (1994) Using Dynamic Time Warping to Find Patterns in Time Series. In proceedings of the Workshop on Knowledge Discovery in Databases, at the 12th Int'l Conference on Artificial Intelligence. July 31-Aug 4, Seattle, WA. pp 229--248.
|
| |
7
|
|
 |
8
|
Corinna Cortes , Kathleen Fisher , Daryl Pregibon , Anne Rogers, Hancock: a language for extracting signatures from data streams, Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, p.9-17, August 20-23, 2000, Boston, Massachusetts, United States
[doi> 10.1145/347090.347094]
|
| |
9
|
Dasgupta, D. amp; Forrest, S. (1996) Novelty Detection in Time Series Data using Ideas from Immunology. In proceedings of The International Conference on Intelligent Systems. June 19--21.
|
| |
10
|
|
| |
11
|
Daw, C. S., Finney, C. E. A. amp; Tracy, E. R. (2001). Symbolic Analysis of Experimental Data. Review of Scientific Instruments. (2002-07-22).
|
| |
12
|
|
| |
13
|
Durbin, R., Eddy, S., Krogh, A. amp; Mitchison, G. (1998). Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press.
|
 |
14
|
Christos Faloutsos , M. Ranganathan , Yannis Manolopoulos, Fast subsequence matching in time-series databases, Proceedings of the 1994 ACM SIGMOD international conference on Management of data, p.419-429, May 24-27, 1994, Minneapolis, Minnesota, United States
|
| |
15
|
Fayyad, U., Reina, C. &. Bradley, P. (1998). Initialization of Iterative Refinement Clustering Algorithms. In proceedings of the 4th International Conference on Knowledge Discovery and Data Mining. New York, NY, Aug 27--31. pp 194--198.
|
| |
16
|
|
 |
17
|
|
| |
18
|
|
 |
19
|
Joseph M. Hellerstein , Elias Koutsoupias , Christos H. Papadimitriou, On the analysis of indexing schemes, Proceedings of the sixteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, p.249-256, May 11-15, 1997, Tucson, Arizona, United States
[doi> 10.1145/263661.263688]
|
 |
20
|
|
| |
21
|
|
 |
22
|
Eamonn Keogh , Kaushik Chakrabarti , Michael Pazzani , Sharad Mehrotra, Locally adaptive dimensionality reduction for indexing large time series databases, Proceedings of the 2001 ACM SIGMOD international conference on Management of data, p.151-162, May 21-24, 2001, Santa Barbara, California, United States
|
 |
23
|
|
 |
24
|
|
| |
25
|
Keogh, E. amp; Pazzani, M. (1998). An Enhanced Representation of Time Series Which Allows Fast and Accurate Classification, Clustering and Relevance Feedback. In proceedings of the 4th Int'l Conference on Knowledge Discovery and Data Mining. New York, NY, Aug 27--31. pp 239--241.
|
| |
26
|
Lin, J., Keogh, E., Lonardi, S. amp; Patel, P. (2002). Finding Motifs in Time Series. In proceedings of the 2nd Workshop on Temporal Data Mining, at the 8th ACM SIGKDD Int'l Conference on Knowledge Discovery and Data Mining. Edmonton, Alberta, Canada, July 23--26. pp. 53--68.
|
| |
27
|
Larsen, R. J. amp; Marx, M. L. (1986). An Introduction to Mathematical Statistics and Its Applications. Prentice Hall, Englewood, Cliffs, N.J. 2nd Edition.
|
| |
28
|
|
| |
29
|
Reinert, G., Schbath, S. amp; Waterman, M. S. (2000). Probabilistic and Statistical Properties of Words: An Overview. Journal of Computational. Biology. Vol. 7, pp 1--46.
|
| |
30
|
John F. Roddick , Kathleen Hornsby , Myra Spiliopoulou, An Updated Bibliography of Temporal, Spatial, and Spatio-temporal Data Mining Research, Proceedings of the First International Workshop on Temporal, Spatial, and Spatio-Temporal Data Mining-Revised Papers, p.147-164, September 12, 2000
|
| |
31
|
|
| |
32
|
Staden, R. (1989). Methods for Discovering Novel Motifs in Nucleic Acid Sequences. Computer Applications in Biosciences. Vol. 5(5). pp 293--298.
|
 |
33
|
|
| |
34
|
|
| |
35
|
|
CITED BY 50
|
|
Jessica Lin , Eamonn Keogh , Stefano Lonardi , Jeffrey P. Lankford , Donna M. Nystrom, Visually mining and monitoring massive time series, Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, August 22-25, 2004, Seattle, WA, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Jessica Lin , Eamonn Keogh , Stefano Lonardi , Jeffrey P. Lankford , Daonna M. Nystrom, VizTree: a tool for visually mining and monitoring massive time series databases, Proceedings of the Thirtieth international conference on Very large data bases, p.1269-1272, August 31-September 03, 2004, Toronto, Canada
|
|
|
|
|
|
Eamonn Keogh , Stefano Lonardi , Chotirat Ann Ratanamahatana , Li Wei , Sang-Hee Lee , John Handley, Compression-based data mining of sequential data, Data Mining and Knowledge Discovery, v.14 n.1, p.99-129, February 2007
|
|
|
|
|
|
|
|
|
Jonathan S. Anstey , Dennis K. Peters , Chris Dawson, An improved feature extraction technique for high volume time series data, Proceedings of the Fourth conference on IASTED International Conference: Signal Processing, Pattern Recognition, and Applications, p.74-81, February 14-16, 2007, Innsbruck, Austria
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Lu-An Tang , Bin Gui , Hong-Yan Li , Gao-Shan Miao , Dong-Qing Yang , Xin-Biao Zhou, PGG: an online pattern based approach for stream variation management, Journal of Computer Science and Technology, v.23 n.4, p.497-515, July 2008
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Sorabh Gandhi , Suman Nath , Subhash Suri , Jie Liu, GAMPS: compressing multi sensor data by grouping and amplitude scaling, Proceedings of the 35th SIGMOD international conference on Management of data, June 29-July 02, 2009, Providence, Rhode Island, USA
|
|
|
|
|
|
David Minnen , Thad Starner , Irfan Essa , Charles Isbell, Improving activity discovery with automatic neighborhood estimation, Proceedings of the 20th international joint conference on Artifical intelligence, p.2814-2819, January 06-12, 2007, Hyderabad, India
|
|
|
|
|