Microarray data analysis with PCA in a DBMS
Source
Conference on Information and Knowledge Management
Proceedings of the 2nd international workshop on Data and text mining in bioinformatics
Napa Valley, California, USA
Session: Bio-data mining
Pages 13-20  
Year of Publication: 2008
ISBN: 978-1-60558-251-1
Authors
Waree Rinsurongkawong, University of Houston, Houston, TX, USA
Carlos Ordonez, University of Houston, Houston, TX, USA
Sponsors
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
SIGIR: ACM Special Interest Group on Information Retrieval
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1458449.1458456
ABSTRACT

Microarray data sets contain expression levels of thousands of genes. The statistical analysis of such data sets is typically performed outside a DBMS with statistical packages or mathematical libraries. In this work, we focus on analyzing them inside the DBMS. This is a difficult problem because microarray data sets have high dimensionality, but small size. First, due to DBMS limitations on a maximum number of columns per table, the data set has to be pivoted and transformed before analysis. More importantly, the correlation matrix on tens of thousands of genes has millions of values. While most high dimensional data sets can be analyzed with the classical PCA method, small, but high dimensional, data sets can only be analyzed with Singular Value Decomposition (SVD). We adapt the Householder tridiagonalization and QR factorization numerical methods to solve SVD inside the DBMS. Since these mathematical methods require many matrix operations, which are hard to express in SQL, query optimizations and efficient UDFs are developed to get good performance. Our proposed techniques achieve processing times comparable with those from the R package, a well-known statistical tool. We experimentally show our methods scale well with high dimensionality.
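
The abstract names Householder tridiagonalization as one of the numerical steps adapted for SVD. As a rough illustration only, the sketch below reduces a small symmetric matrix to tridiagonal form in plain Python on in-memory lists. It is not the authors' SQL or UDF implementation, which this page does not show; the function name `householder_tridiagonalize` and the sample matrix are assumptions made for the example.

```python
import math


def householder_tridiagonalize(a):
    """Return a tridiagonal matrix similar to the symmetric matrix `a`.

    `a` is a list of lists; the input is not modified.
    Hypothetical helper for illustration, not taken from the paper.
    """
    n = len(a)
    t = [row[:] for row in a]  # work on a copy
    for k in range(n - 2):
        # Entries of column k below the diagonal.
        x = [t[i][k] for i in range(k + 1, n)]
        norm_x = math.sqrt(sum(c * c for c in x))
        if norm_x == 0.0:
            continue  # nothing to eliminate in this column
        # Pick the sign of alpha that avoids cancellation in v[0].
        alpha = -norm_x if x[0] >= 0 else norm_x
        v = x[:]
        v[0] -= alpha
        norm_v = math.sqrt(sum(c * c for c in v))
        # Unit Householder vector, zero-padded to length n.
        w = [0.0] * (k + 1) + [c / norm_v for c in v]
        # p = T w, kappa = w^T p, q = p - kappa w
        p = [sum(t[i][j] * w[j] for j in range(n)) for i in range(n)]
        kappa = sum(w[i] * p[i] for i in range(n))
        q = [p[i] - kappa * w[i] for i in range(n)]
        # T <- T - 2 w q^T - 2 q w^T, which equals H T H for H = I - 2 w w^T.
        for i in range(n):
            for j in range(n):
                t[i][j] -= 2.0 * (w[i] * q[j] + q[i] * w[j])
    return t


# Small symmetric sample matrix (made up for the example).
sample = [
    [4.0, 1.0, -2.0, 2.0],
    [1.0, 2.0, 0.0, 1.0],
    [-2.0, 0.0, 3.0, -2.0],
    [2.0, 1.0, -2.0, -1.0],
]
tridiagonal = householder_tridiagonalize(sample)
```

Each pass applies one orthogonal reflection, so the result keeps the eigenvalues of the input while zeroing every entry more than one position away from the diagonal. A QR-style iteration on the tridiagonal matrix would then recover those eigenvalues; that step is omitted here.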

