|
ABSTRACT
Motivated by the enormous amounts of data collected in a large IT service provider organization, this paper presents a method for quickly and automatically summarizing and extracting meaningful insights from the data. Termed Clustered Subset Selection (CSS), our method enables program-guided data explorations of high-dimensional data matrices. CSS combines clustering and subset selection into a coherent and intuitive method for data analysis. In addition to a general framework, we introduce a family of CSS algorithms with different clustering components such as k-means and Close-to-Rank-One (CRO) clustering, and Subset Selection components such as best rank-one approximation and Rank-Revealing QR (RRQR) decomposition. From an empirical perspective, we illustrate that CSS is achieving significant improvements over existing Subset Selection methods in terms of approximation errors. Compared to existing Subset Selection techniques, CSS is also able to provide additional insight about clusters and cluster representatives. Finally, we present a case-study of program-guided data explorations using CSS on a large amount of IT service delivery data collection.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
| |
2
|
M. Berry, S. Pulatova, and G. Stewart. Computing sparse reduced-rank approximations to sparse matrices. Technical Report UMIACS TR-2004-32 CMSC TR-4589, University of Maryland, College Park, MD, 2004.
|
 |
3
|
|
 |
4
|
|
 |
5
|
|
| |
6
|
|
| |
7
|
T. F. Chan. Rank revealing QR factorizations. Linear Algebra and Its Applications, 88/89:67--82, 1987.
|
| |
8
|
|
| |
9
|
T. F. Chan and P. C. Hansen. Low-rank revealing QR factorizations. Numerical Linear Algebra with Applications, 1:33--44, 1994.
|
| |
10
|
|
| |
11
|
S. J. Cho and M. A. Hermsmeier. Genetic algorithm guided selection: Variable selection and subset selection. Journal of Chemical Information and Computer Sciences, 42(4):927--936, 2002.
|
| |
12
|
|
 |
13
|
Amit Deshpande , Luis Rademacher , Santosh Vempala , Grant Wang, Matrix approximation and projective clustering via volume sampling, Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, p.1117-1126, January 22-26, 2006, Miami, Florida
[doi> 10.1145/1109557.1109681]
|
| |
14
|
A. Deshpande and S. Vempala. Adaptive sampling and fast low-rank matrix approximation. In RANDOM - APPROX, 2006.
|
| |
15
|
|
| |
16
|
P. Drineas. Randomized Algorithms for Matrix Operations. PhD thesis, Yale University, 2002.
|
| |
17
|
|
| |
18
|
|
 |
19
|
|
| |
20
|
L. V. Foster. Rank and null space calculations using matrix decomposition without column interchanges. Linear Algebra Appl., 74:47--71, 1986.
|
| |
21
|
L. V. Foster and X. Liu. Comparison of rank revealing algorithms applied to matrices with well defined numerical ranks. submitted, 2006.
|
| |
22
|
|
| |
23
|
E. Gallopoulos and D. Zeimpekis. CLSI: A flexible approximation scheme from clustered term-document matrices. In SDM, 2005.
|
| |
24
|
G. Golub and C. V. Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, 1989.
|
| |
25
|
G. H. Golub. Numerical methods for solving linear least squares problems. Numer. Math., 7:206--216, 1965.
|
| |
26
|
|
| |
27
|
|
| |
28
|
Y. P. Hong and C. T. Pan. Rank-revealing QR factorizations and the singular value decomposition. Mathematics of Computation, 58:213--232, 1992.
|
 |
29
|
|
| |
30
|
|
| |
31
|
|
| |
32
|
|
| |
33
|
Y.-D. Kim and S. Choi. A method of initialization for nonnegative matrix factorization. In Acoustics, Speech and Signal Processing (ICASSP), 2007.
|
| |
34
|
|
| |
35
|
C. T. Pan. On the existence and computation of rank-revealing LU factorizations. Linear Algebra Appl., 316:199--222, 2000.
|
| |
36
|
C. T. Pan and P. T. P. Tang. Bounds on singular values revealed by QR factorizations. BIT Numerical Mathematics, 39:740--756, 1999.
|
 |
37
|
Christos H. Papadimitriou , Hisao Tamaki , Prabhakar Raghavan , Santosh Vempala, Latent semantic indexing: a probabilistic analysis, Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, p.159-168, June 01-04, 1998, Seattle, Washington, United States
[doi> 10.1145/275487.275505]
|
| |
38
|
|
| |
39
|
|
| |
40
|
G. Stewart. Four algorithms for the efficient computation of truncated QR approximations to a sparse matrix. Numerische Mathematik, 83:313--323, 1999.
|
| |
41
|
G. Stewart and J. Sun. Matrix Perturbation Theory. Academic Press, New York, 1990.
|
| |
42
|
|
|