|
ABSTRACT
Principal Components Analysis (PCA) is the predominant linear dimensionality reduction technique, and has been widely applied on datasets in all scientific domains. We consider, both theoretically and empirically, the topic of unsupervised feature selection for PCA, by leveraging algorithms for the so-called Column Subset Selection Problem (CSSP). In words, the CSSP seeks the "best" subset of exactly k columns from an m x n data matrix A, and has been extensively studied in the Numerical Linear Algebra community. We present a novel two-stage algorithm for the CSSP. From a theoretical perspective, for small to moderate values of k, this algorithm significantly improves upon the best previously-existing results [24, 12] for the CSSP. From an empirical perspective, we evaluate this algorithm as an unsupervised feature selection strategy in three application domains of modern statistical data analysis: finance, document-term data, and genetics. We pay particular attention to how this algorithm may be used to select representative or landmark features from an object-feature matrix in an unsupervised manner. In all three application domains, we are able to identify k landmark features, i.e., columns of the data matrix, that capture nearly the same amount of information as does the subspace that is spanned by the top k "eigenfeatures."
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
A. d'Aspremont, L. El Ghaoui, M. I. Jordan, and G. R. G.Lanckriet. A Direct Formulation for Sparse PCA Using Semidefinite Programming. SIAM Review, 49(3), July 2007
|
| |
2
|
A. Ben-Hur and I. Guyon. Detecting stable clusters using principal component analysis. Methods Mol Biol, 224:159--182, 2003.
|
 |
3
|
|
 |
4
|
|
 |
5
|
|
| |
6
|
C. Boutsidis, M.W. Mahoney, and P. Drineas. Manuscript in preparation, 2008
|
| |
7
|
J. Cadima, J. O. Cerdeira, and M. Minhoto. Computational aspects of algorithms for variable selection in thecontext of principal components. Computational Statistics & Data Analysis, 47(2):225--236, 2004.
|
| |
8
|
T. F. Chan. Rank revealing QR factorizations. Linear Algebra Appl, 88/89:67--82, 1987.
|
| |
9
|
|
| |
10
|
T. F. Chan and P.C. Hansen. Low-rank revealing QR factorizations. Linear Algebra Appl, 1:33--44, 1994.
|
| |
11
|
|
 |
12
|
Amit Deshpande , Luis Rademacher , Santosh Vempala , Grant Wang, Matrix approximation and projective clustering via volume sampling, Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, p.1117-1126, January 22-26, 2006, Miami, Florida
[doi> 10.1145/1109557.1109681]
|
| |
13
|
|
| |
14
|
|
| |
15
|
P. Drineas, M. W. Mahoney, and S. Muthukrishnan. Relative-error CUR matrix decompositions. http://arxiv.org/abs/0708.3696, 2007.
|
| |
16
|
|
| |
17
|
R.D. Fierro, P.C. Hansen, and P. Hansen. UTV tools: Matlab templates for rank-revealing UTV decompositions. Numerical Algorithms, 20(2-3):165--194, 1999.
|
| |
18
|
L. V. Foster. Rank and null space calculations using matrix decomposition without column interchanges. Linear Algebra Appl, 74:47--71, 1986.
|
| |
19
|
L.V. Foster and Xinrong Liu. Comparison of rank revealing algorithms applied to matrices with welldefined numerical ranks. manuscript, 2006.
|
| |
20
|
|
 |
21
|
|
| |
22
|
G. H. Golub. Numerical methods for solving linear least squares problems. Numer Math, 7:206--216, 1965.
|
| |
23
|
G.H. Golub and C.F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, 1989.
|
| |
24
|
|
| |
25
|
Y. P. Hong and C. T. Pan. Rank-revealing QR factorizations and the singular value decomposition. Math Comp, 58:213--232, 1992.
|
| |
26
|
W. J. Krzanowski. Selection of variables to preserve multivariate data structure,using principal components. Applied Statistics,36(1):22--33, 1987.
|
| |
27
|
F.G. Kuruvilla and P.J. Park and S.L. Schreiber. Vector algebra in the analysis of genome-wide expression data. Genome Biology, 3, 2002.
|
| |
28
|
K. Z. Mao. Identifying critical variables of principal components forunsupervised feature selection. IEEE Transactions on Systems,Man, and Cybernetics, Part B, 35(2):339--344, 2005.
|
 |
29
|
|
| |
30
|
P. Menozzi, A. Piazza, and L. Cavalli-Sforza. Synthetic maps of human gene frequencies in Europeans . Science, 201(4358):786--792, 1978.
|
| |
31
|
|
| |
32
|
Open Directory Project. http://www.dmoz.org/
|
| |
33
|
C. T. Pan. On the existence and computation of rank-revealing LUfactorizations. Linear Algebra Appl, 316:199--222, 2000.
|
| |
34
|
C. T. Pan and P. T. P. Tang. Bounds on singular values revealed by QR factorizations. BIT Numerical Mathematics, 39:740--756, 1999.
|
| |
35
|
P. Paschou, E. Ziv, E.G. Burchard, S. Choudhry, W.R. Cintron, M.WMahoney, and P. Drineas. PCA-Correlated SNPs for Structure Identification in Worldwide Human Populations, PLoS Genetics,9(3), 2007.
|
| |
36
|
P. Paschou, M.W. Mahoney, A. Javed, J.R. Kidd, A.J. Pakstis, S.Gu, K.K. Kidd, and P. Drineas. Intra- and Inter-population genotype reconstruction from tagging SNPs, Genome Research, 17: 96--107, 2007.
|
| |
37
|
N.A. Rosenberg, L.M. Li, R. Ward, and J.K. Pritchard. Informativeness of genetic markers for inference of ancestry. Am J Hum Genet, 73(6):1402--1422, 2003.
|
| |
38
|
G.W. Stewart. Four algorithms for the efficientcomputation of truncated QR approximations to a sparse matrix. Num Math, 83:313--323, 1999.
|
| |
39
|
|
| |
40
|
J. Sun, Y. Xie, H. Zhang, and C. Faloutsos. Less is more: Compact matrix decomposition for large sparse graphs, SDM, 2007.
|
| |
41
|
The International HapMap Consortium. A haplotype map of the human genome. Nature, 437:1299--1320, 2005.
|
| |
42
|
|
 |
43
|
|
| |
44
|
|
|