|
ABSTRACT
How do we find patterns in author-keyword associations, evolving over time? Or in Data Cubes, with product-branch-customer sales information? Matrix decompositions, like principal component analysis (PCA) and variants, are invaluable tools for mining, dimensionality reduction, feature selection, rule identification in numerous settings like streaming data, text, graphs, social networks and many more. However, they have only two orders, like author and keyword, in the above example.We propose to envision such higher order data as tensors,and tap the vast literature on the topic. However, these methods do not necessarily scale up, let alone operate on semi-infinite streams. Thus, we introduce the dynamic tensor analysis (DTA) method, and its variants. DTA provides a compact summary for high-order and high-dimensional data, and it also reveals the hidden correlations. Algorithmically, we designed DTA very carefully so that it is (a) scalable, (b) space efficient (it does not need to store the past) and (c) fully automatic with no need for user defined parameters. Moreover, we propose STA, a streaming tensor analysis method, which provides a fast, streaming approximation to DTA.We implemented all our methods, and applied them in two real settings, namely, anomaly detection and multi-way latent semantic indexing. We used two real, large datasets, one on network flow data (100GB over 1 month) and one from DBLP (200MB over 25 years). Our experiments show that our methods are fast, accurate and that they find interesting patterns and outliers on the real datasets.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
 |
2
|
Rakesh Agrawal , Tomasz Imieliński , Arun Swami, Mining association rules between sets of items in large databases, Proceedings of the 1993 ACM SIGMOD international conference on Management of data, p.207-216, May 25-28, 1993, Washington, D.C., United States
|
| |
3
|
|
| |
4
|
J. D. Carroll and J. Chang. Analysis of individual differences in multi dimensional scaling via an n-way generalization of 'eckart-young' decomposition. Psychometrika, 35(3), 1970.
|
| |
5
|
L. de Lathauwer. Signal Processing Based on Multilinear Algebra. PhD thesis, Katholieke University of Leuven, Belgium, 1997.
|
 |
6
|
|
| |
7
|
Chris Ding and Jieping Ye. Two-dimensional singular value decomposition (2dsvd) for 2d maps and images. In SDM, 2005.
|
| |
8
|
Petros Drineas and Michael W. Mahoney. A randomized algorithm for a tensor-based generalization of the svd. technical report.
|
 |
9
|
|
| |
10
|
R. A. Harshman. Foundations of the parafac procedure:model and conditions for an explanatory multi-mode factor analysis. UCLA working papers in phonetics, 16, 1970.
|
| |
11
|
|
| |
12
|
Xiaofei He, Deng Cai, and Partha Niyogi. Tensor subspace analysis. In NIPS, 2005.
|
 |
13
|
|
| |
14
|
|
| |
15
|
I. T. Jolliffe. Principal Component Analysis. Springer, 2002.
|
| |
16
|
|
| |
17
|
A. Kapteyn, H. Neudecker, and T. Wansbeek. An approach to n-mode component analysis. Psychometrika, 51(2), 1986.
|
| |
18
|
|
| |
19
|
|
| |
20
|
|
| |
21
|
|
| |
22
|
P. Kroonenberg and J. D. Leeuw. Principal component analysis of three-mode data by means of alternating least square algorithms. Psychometrika, 45, 1980.
|
| |
23
|
S. Muthukrishnan. Data streams: algorithms and applications, volume 1. Foundations and Trends. in Theoretical Computer Science, 2005.
|
 |
24
|
Christos H. Papadimitriou , Hisao Tamaki , Prabhakar Raghavan , Santosh Vempala, Latent semantic indexing: a probabilistic analysis, Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, p.159-168, June 01-04, 1998, Seattle, Washington, United States
[doi> 10.1145/275487.275505]
|
| |
25
|
|
| |
26
|
Amnon Shashua and Anat Levin. Linear image coding for regression and classification using the tensor-rank principle. In CVPR, number 1, 2001.
|
| |
27
|
L. R. Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3), 1966.
|
| |
28
|
|
| |
29
|
N. Viereck, M. Dyrby, and S. B. Engelsen. Monitoring Thermal Processes by NMR Technology. Elsevier Academic Press, 2006.
|
| |
30
|
Dong Xu , Shuicheng Yan , Lei Zhang , Hong-Jiang Zhang , Zhengkai Liu , Heung-Yeung Shum, Concurrent Subspaces Analysis, Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) - Volume 2, p.203-208, June 20-26, 2005
[doi> 10.1109/CVPR.2005.107]
|
| |
31
|
|
 |
32
|
|
CITED BY 13
|
|
|
|
|
Josiane Xavier Parreira , Sebastian Michel , Matthias Bender , Tom Crecelius , Gerhard Weikum, P2P authority analysis for social communities, Proceedings of the 33rd international conference on Very large data bases, September 23-27, 2007, Vienna, Austria
|
|
|
Jimeng Sun , Christos Faloutsos , Spiros Papadimitriou , Philip S. Yu, GraphScope: parameter-free mining of large time-evolving graphs, Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, August 12-15, 2007, San Jose, California, USA
|
|
|
Xi Li , Weiming Hu , Zhongfei Zhang , Xiaoqin Zhang, Robust foreground segmentation based on two effective background models, Proceeding of the 1st ACM international conference on Multimedia information retrieval, October 30-31, 2008, Vancouver, British Columbia, Canada
|
|
|
|
|
|
Hanghang Tong , Spiros Papadimitriou , Jimeng Sun , Philip S. Yu , Christos Faloutsos, Colibri: fast mining of large static and dynamic graphs, Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, August 24-27, 2008, Las Vegas, Nevada, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|