|
ABSTRACT
How do we find patterns in author-keyword associations, evolving over time? Or in data cubes (tensors), with product-branchcustomer sales information? And more generally, how to summarize high-order data cubes (tensors)? How to incrementally update these patterns over time? Matrix decompositions, like principal component analysis (PCA) and variants, are invaluable tools for mining, dimensionality reduction, feature selection, rule identification in numerous settings like streaming data, text, graphs, social networks, and many more settings. However, they have only two orders (i.e., matrices, like author and keyword in the previous example). We propose to envision such higher-order data as tensors, and tap the vast literature on the topic. However, these methods do not necessarily scale up, let alone operate on semi-infinite streams. Thus, we introduce a general framework, incremental tensor analysis (ITA), which efficiently computes a compact summary for high-order and high-dimensional data, and also reveals the hidden correlations. Three variants of ITA are presented: (1) dynamic tensor analysis (DTA); (2) streaming tensor analysis (STA); and (3) window-based tensor analysis (WTA). In paricular, we explore several fundamental design trade-offs such as space efficiency, computational cost, approximation accuracy, time dependency, and model complexity. We implement all our methods and apply them in several real settings, such as network anomaly detection, multiway latent semantic indexing on citation networks, and correlation study on sensor measurements. Our empirical studies show that the proposed methods are fast and accurate and that they find interesting patterns and outliers on the real datasets.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
Charu C. Aggarwal , Jiawei Han , Jianyong Wang , Philip S. Yu, A framework for clustering evolving data streams, Proceedings of the 29th international conference on Very large data bases, p.81-92, September 09-12, 2003, Berlin, Germany
|
 |
2
|
Rakesh Agrawal , Tomasz Imieliński , Arun Swami, Mining association rules between sets of items in large databases, Proceedings of the 1993 ACM SIGMOD international conference on Management of data, p.207-216, May 25-28, 1993, Washington, D.C., United States
|
| |
3
|
|
| |
4
|
Ding, C. and Ye, J. 2005. Two-Dimensional singular value decomposition (2dsvd) for 2d maps and images. In SDM.
|
 |
5
|
|
 |
6
|
|
| |
7
|
Drineas, P. and Mahoney, M. W. 2005. A randomized algorithm for a tensor-based generalization of the SVD. Tech. Rep. YALEU/DCS/TR-1327, Yale University.
|
| |
8
|
Enron Dataset. 2004. U.C. Berkeley Enron email analysis. http://bailando.sims.berkeley.edu/enron_email.html.
|
 |
9
|
|
 |
10
|
|
 |
11
|
|
| |
12
|
|
| |
13
|
|
| |
14
|
He, X., Cai, D., and Niyogi, P. 2005. Tensor subspace analysis. In Proceeding of the Conference on Advance, in Neural Information Processing Systems (NIPS).
|
| |
15
|
|
| |
16
|
Jolliffe, I. 2002. Principal Component Analysis. Springer.
|
| |
17
|
|
| |
18
|
|
 |
19
|
|
| |
20
|
Kolda, T. G. 2006. Multilinear operators for higher-order decompositions. Tech. Rep. SAND2006-2081, Sandia National Laboratory/April.
|
| |
21
|
Kolda, T. G. and Bader, B. W. 2008 Tensor decompositions and applications. SIAM Rev. (in revision). http://csmr.ca.sandia.gov/~tgkolda/pubs/bibtgkfiles/SAND2007-6702.pdf.
|
| |
22
|
|
| |
23
|
|
| |
24
|
Kroonenberg, P. and Leeuw, J. D. 1980. Principal component analysis of three-mode data by means of alternating least square algorithms. Psychometrika 45.
|
| |
25
|
Kroonenberg, P. M. 1983. Three-Mode Principal Component Analysis. Theory and Applications. DWO Press.
|
| |
26
|
Lathauwer, L. D., Moor, B. D., and Vandewalle, J. 2006. On the best rank-1 and rank-(r1,r2,. . .,rn) approximation of higher-order tensors. SIAM. Matrix Anal. Appl. 21.
|
 |
27
|
|
 |
28
|
Christos H. Papadimitriou , Hisao Tamaki , Prabhakar Raghavan , Santosh Vempala, Latent semantic indexing: a probabilistic analysis, Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, p.159-168, June 01-04, 1998, Seattle, Washington, United States
[doi> 10.1145/275487.275505]
|
| |
29
|
Spiros Papadimitriou , Anthony Brockwell , Christos Faloutsos, Adaptive, hands-off stream mining, Proceedings of the 29th international conference on Very large data bases, p.560-571, September 09-12, 2003, Berlin, Germany
|
| |
30
|
|
 |
31
|
|
| |
32
|
|
| |
33
|
Shashua, A. and Levin, A. 2001. Linear image coding for regression and classification using the tensor-rank principle. In Proceedings of the IEEE Conference on Computer vision and Patteroo Recongnistion CVPR (1). IEEE Computer Society, 42--49.
|
| |
34
|
Smilde, A., Bro, R., and Geladi, P. 2004. Multi-Way Analysis: Applications in the Chemical Sciences. Wiley, West Sussex, UK.
|
| |
35
|
|
 |
36
|
|
| |
37
|
|
| |
38
|
Tucker, L. R. 1966. Some mathematical notes on three-mode factor analysis. Psychometrika 31, 279--311.
|
| |
39
|
Universität Trier. 2008. http://www.informatik.uni-trier.de/~ley/db/.
|
| |
40
|
|
| |
41
|
|
 |
42
|
|
| |
43
|
Dong Xu , Shuicheng Yan , Lei Zhang , Hong-Jiang Zhang , Zhengkai Liu , Heung-Yeung Shum, Concurrent Subspaces Analysis, Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) - Volume 2, p.203-208, June 20-26, 2005
[doi> 10.1109/CVPR.2005.107]
|
| |
44
|
|
| |
45
|
|
 |
46
|
|
| |
47
|
|
|