ACM Home Page
Please provide us with feedback. Feedback
Unsupervised feature selection for principal components analysis
Full text PdfPdf (429 KB)
Source
International Conference on Knowledge Discovery and Data Mining archive
Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining table of contents
Las Vegas, Nevada, USA
SESSION: Research papers table of contents
Pages 61-69  
Year of Publication: 2008
ISBN:978-1-60558-193-4
Authors
Christos Boutsidis  Rensselaer Polytechnic Institute, Troy, NY, USA
Michael W. Mahoney  Yahoo! Research, Sunnyvale, CA, USA
Petros Drineas  Rensselaer Polytechnic Institute, Troy, NY, USA
Sponsors
ACM: Association for Computing Machinery
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 28,   Downloads (12 Months): 396,   Citation Count: 3
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1401890.1401903
What is a DOI?

ABSTRACT

Principal Components Analysis (PCA) is the predominant linear dimensionality reduction technique, and has been widely applied on datasets in all scientific domains. We consider, both theoretically and empirically, the topic of unsupervised feature selection for PCA, by leveraging algorithms for the so-called Column Subset Selection Problem (CSSP). In words, the CSSP seeks the "best" subset of exactly k columns from an m x n data matrix A, and has been extensively studied in the Numerical Linear Algebra community. We present a novel two-stage algorithm for the CSSP. From a theoretical perspective, for small to moderate values of k, this algorithm significantly improves upon the best previously-existing results [24, 12] for the CSSP. From an empirical perspective, we evaluate this algorithm as an unsupervised feature selection strategy in three application domains of modern statistical data analysis: finance, document-term data, and genetics. We pay particular attention to how this algorithm may be used to select representative or landmark features from an object-feature matrix in an unsupervised manner. In all three application domains, we are able to identify k landmark features, i.e., columns of the data matrix, that capture nearly the same amount of information as does the subspace that is spanned by the top k "eigenfeatures."


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
A. d'Aspremont, L. El Ghaoui, M. I. Jordan, and G. R. G.Lanckriet. A Direct Formulation for Sparse PCA Using Semidefinite Programming. SIAM Review, 49(3), July 2007
 
2
A. Ben-Hur and I. Guyon. Detecting stable clusters using principal component analysis. Methods Mol Biol, 224:159--182, 2003.
3
4
5
 
6
C. Boutsidis, M.W. Mahoney, and P. Drineas. Manuscript in preparation, 2008
 
7
J. Cadima, J. O. Cerdeira, and M. Minhoto. Computational aspects of algorithms for variable selection in thecontext of principal components. Computational Statistics & Data Analysis, 47(2):225--236, 2004.
 
8
T. F. Chan. Rank revealing QR factorizations. Linear Algebra Appl, 88/89:67--82, 1987.
 
9
 
10
T. F. Chan and P.C. Hansen. Low-rank revealing QR factorizations. Linear Algebra Appl, 1:33--44, 1994.
 
11
12
 
13
 
14
 
15
P. Drineas, M. W. Mahoney, and S. Muthukrishnan. Relative-error CUR matrix decompositions. http://arxiv.org/abs/0708.3696, 2007.
 
16
 
17
R.D. Fierro, P.C. Hansen, and P. Hansen. UTV tools: Matlab templates for rank-revealing UTV decompositions. Numerical Algorithms, 20(2-3):165--194, 1999.
 
18
L. V. Foster. Rank and null space calculations using matrix decomposition without column interchanges. Linear Algebra Appl, 74:47--71, 1986.
 
19
L.V. Foster and Xinrong Liu. Comparison of rank revealing algorithms applied to matrices with welldefined numerical ranks. manuscript, 2006.
 
20
21
 
22
G. H. Golub. Numerical methods for solving linear least squares problems. Numer Math, 7:206--216, 1965.
 
23
G.H. Golub and C.F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, 1989.
 
24
 
25
Y. P. Hong and C. T. Pan. Rank-revealing QR factorizations and the singular value decomposition. Math Comp, 58:213--232, 1992.
 
26
W. J. Krzanowski. Selection of variables to preserve multivariate data structure,using principal components. Applied Statistics,36(1):22--33, 1987.
 
27
F.G. Kuruvilla and P.J. Park and S.L. Schreiber. Vector algebra in the analysis of genome-wide expression data. Genome Biology, 3, 2002.
 
28
K. Z. Mao. Identifying critical variables of principal components forunsupervised feature selection. IEEE Transactions on Systems,Man, and Cybernetics, Part B, 35(2):339--344, 2005.
29
 
30
P. Menozzi, A. Piazza, and L. Cavalli-Sforza. Synthetic maps of human gene frequencies in Europeans . Science, 201(4358):786--792, 1978.
 
31
 
32
Open Directory Project. http://www.dmoz.org/
 
33
C. T. Pan. On the existence and computation of rank-revealing LUfactorizations. Linear Algebra Appl, 316:199--222, 2000.
 
34
C. T. Pan and P. T. P. Tang. Bounds on singular values revealed by QR factorizations. BIT Numerical Mathematics, 39:740--756, 1999.
 
35
P. Paschou, E. Ziv, E.G. Burchard, S. Choudhry, W.R. Cintron, M.WMahoney, and P. Drineas. PCA-Correlated SNPs for Structure Identification in Worldwide Human Populations, PLoS Genetics,9(3), 2007.
 
36
P. Paschou, M.W. Mahoney, A. Javed, J.R. Kidd, A.J. Pakstis, S.Gu, K.K. Kidd, and P. Drineas. Intra- and Inter-population genotype reconstruction from tagging SNPs, Genome Research, 17: 96--107, 2007.
 
37
N.A. Rosenberg, L.M. Li, R. Ward, and J.K. Pritchard. Informativeness of genetic markers for inference of ancestry. Am J Hum Genet, 73(6):1402--1422, 2003.
 
38
G.W. Stewart. Four algorithms for the efficientcomputation of truncated QR approximations to a sparse matrix. Num Math, 83:313--323, 1999.
 
39
 
40
J. Sun, Y. Xie, H. Zhang, and C. Faloutsos. Less is more: Compact matrix decomposition for large sparse graphs, SDM, 2007.
 
41
The International HapMap Consortium. A haplotype map of the human genome. Nature, 437:1299--1320, 2005.
 
42
43
 
44


Collaborative Colleagues:
Christos Boutsidis: colleagues
Michael W. Mahoney: colleagues
Petros Drineas: colleagues