ABSTRACT
Clustering time series is a problem with applications in a wide variety of fields, and it has recently attracted a large amount of research. In this paper we focus on clustering data derived from Autoregressive Moving Average (ARMA) models using k-means and k-medoids algorithms with the Euclidean distance between estimated model parameters. We justify our choice of clustering technique and distance metric by reproducing results obtained in related research. Our research aim is to assess the effects on the clustering of time series of discretising data into binary sequences that record whether each value lies above or below the median, a process known as clipping. It is known that the fitted AR parameters of clipped data tend asymptotically to the parameters for unclipped data. We exploit this result to demonstrate that, for long series, the clustering accuracy when using clipped data from the class of ARMA models is not significantly different to that achieved with unclipped data. Next we show that if the data contains outliers then using clipped data produces significantly better clusterings. We then demonstrate that using clipped series requires much less memory, and that operations such as distance calculations can be much faster. Finally, we demonstrate these advantages on three real-world data sets.
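As a minimal sketch of the clipping step described in the abstract, the Python below maps a series to a binary sequence relative to its median. The function names (`clip`, `pack_bits`, `differing_positions`), the choice to map values equal to the median to 0, and the bit-packing scheme are illustrative assumptions, not details taken from the paper.

```python
from statistics import median

def clip(series):
    """Clip a series: 1 where a value is above the median, 0 otherwise.

    Assumption: values equal to the median are mapped to 0.
    """
    m = median(series)
    return [1 if x > m else 0 for x in series]

def pack_bits(bits):
    """Pack a 0/1 list into one integer (hypothetical storage scheme)."""
    packed = 0
    for b in bits:
        packed = (packed << 1) | b
    return packed

def differing_positions(a, b):
    """Count positions where two packed series of equal length differ."""
    return bin(a ^ b).count("1")

clean = [1.0, 3.0, 2.0, 5.0, 4.0]
with_outlier = [1.0, 3.0, 2.0, 500.0, 4.0]  # one extreme value

# The outlier changes the raw values but not the clipped sequence.
print(clip(clean) == clip(with_outlier))
```

Because the median is insensitive to a single extreme value, the two series above clip to the same binary sequence, which illustrates why clipping may help when data contain outliers. Packing the bits into an integer hints at the memory saving, and for binary sequences the count of differing positions equals the squared Euclidean distance between them.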