A Blocking Strategy to Improve Gene Selection for Classification of Gene Expression Data
Source: IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Volume 4, Issue 2 (April 2007)
Pages: 293-300
Year of Publication: 2007
ISSN: 1545-5963
Publisher: IEEE Computer Society Press, Los Alamitos, CA, USA
DOI Bookmark: 10.1109/TCBB.2007.1014

ABSTRACT

Because microarray gene expression data sets are high-dimensional, machine learning algorithms typically rely on feature selection techniques to classify them effectively. However, the large number of features compared to the number of samples makes feature selection computationally hard and error-prone. This paper interprets feature selection as a stochastic optimization task, where the goal is to select, among an exponential number of alternative gene subsets, the one expected to generalize best in classification. Blocking is an experimental design strategy that compares alternative stochastic configurations under similar experimental conditions, so that observed differences in accuracy can be attributed to actual differences rather than to fluctuations and noise. We propose an original blocking strategy for improving feature selection that aggregates, in a paired way, the validation outcomes of several learning algorithms to assess a gene subset and compare it to others. This differs from conventional wrappers, which commonly use a single learning algorithm to evaluate the relevance of a given set of variables. The rationale of the approach is that, by increasing the number of experimental conditions under which we validate a feature subset, we can lessen the problems related to the scarcity of samples and consequently arrive at a better selection. The paper shows that the blocking strategy significantly improves the performance of a conventional forward selection on a set of 16 publicly available cancer expression data sets. The experiments involve six different classifiers and show that the improvements occur independently of the classification algorithm used after the selection step.
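The paired aggregation described above can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes a greedy forward selection in which every candidate gene subset is validated on the same cross-validation folds by the same classifiers, so that each (fold, classifier) pair acts as a block and all candidates are compared under identical conditions. The two toy classifiers, the synthetic data, and the function names `block_errors` and `blocked_forward_selection` are assumptions made for illustration only.

```python
import random
from statistics import mean


def nearest_centroid(train_X, train_y, x):
    """Predict the class whose mean expression vector is closest to x."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, t in zip(train_X, train_y) if t == label]
        centroid = [mean(col) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def one_nearest_neighbour(train_X, train_y, x):
    """Predict the label of the single closest training sample."""
    dists = [sum((a - b) ** 2 for a, b in zip(x, r)) for r in train_X]
    return train_y[dists.index(min(dists))]


def block_errors(X, y, genes, classifiers, folds):
    """Validation error of one gene subset on every (fold, classifier) block."""
    errors = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        train_X = [[X[i][g] for g in genes] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        for clf in classifiers:
            wrong = 0
            for i in test_idx:
                x = [X[i][g] for g in genes]
                wrong += clf(train_X, train_y, x) != y[i]
            errors.append(wrong / len(test_idx))
    return errors


def blocked_forward_selection(X, y, classifiers, n_select, folds):
    """Greedy forward selection scored by the mean error over shared blocks."""
    selected = []
    while len(selected) < n_select:
        candidates = [g for g in range(len(X[0])) if g not in selected]
        # Every candidate is scored on the same blocks, so comparing the
        # means is equivalent to averaging block-by-block paired differences.
        scores = {
            g: mean(block_errors(X, y, selected + [g], classifiers, folds))
            for g in candidates
        }
        selected.append(min(candidates, key=lambda g: scores[g]))
    return selected


# Synthetic data (an assumption, not from the paper): 30 samples, 6 genes,
# of which only gene 2 carries class information.
rng = random.Random(0)
n_samples, n_genes, informative = 30, 6, 2
y = [i % 2 for i in range(n_samples)]
X = [[rng.gauss(0.0, 1.0) for _ in range(n_genes)] for _ in range(n_samples)]
for i in range(n_samples):
    X[i][informative] = 4.0 * y[i] + rng.uniform(-1.0, 1.0)

# Fixed folds shared by all candidates and all classifiers.
folds = [list(range(j, n_samples, 3)) for j in range(3)]
classifiers = [nearest_centroid, one_nearest_neighbour]
selected = blocked_forward_selection(X, y, classifiers, 2, folds)
print(selected)
```

Because all candidates share the same blocks, the difference between two mean scores equals the mean of their paired block-by-block differences, which removes fold-to-fold and classifier-to-classifier variation from the comparison. A conventional wrapper would correspond to passing a single classifier in `classifiers`.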
Two further validations based on available biological annotation support the claim that blocking strategies in feature selection may improve the accuracy and the quality of the solution. The first validation retrieves the PubMed abstracts associated with the selected genes and matches them against regular expressions describing the biological phenomenon underlying the expression data sets. The second validation uses the Bioconductor package GOstats to perform Gene Ontology statistical analysis.
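The first validation can be sketched as follows, under stated assumptions. The abstract does not say how the matches are scored, so the summary used here (the fraction of selected genes with at least one matching abstract) is an assumption, as are the function name `matching_fraction`, the example pattern, and the gene identifiers. The abstract texts are invented placeholders, and retrieval from PubMed is left out.

```python
import re


def matching_fraction(abstracts_by_gene, pattern):
    """Return the fraction of genes with a matching abstract, and those genes."""
    regex = re.compile(pattern, re.IGNORECASE)
    hits = [
        gene
        for gene, texts in abstracts_by_gene.items()
        if any(regex.search(text) for text in texts)
    ]
    return len(hits) / len(abstracts_by_gene), hits


# Placeholder texts standing in for retrieved abstracts (not real data).
abstracts_by_gene = {
    "GENE_A": ["Placeholder text mentioning acute lymphoblastic leukemia."],
    "GENE_B": [
        "Placeholder text about yeast metabolism.",
        "Placeholder text on leukaemic cell lines.",
    ],
    "GENE_C": ["Placeholder text about plant root growth."],
}

# Hypothetical pattern for a leukemia data set; covers both spellings.
fraction, hits = matching_fraction(abstracts_by_gene, r"leuka?emi[ac]")
print(hits, fraction)
```

A higher fraction for one selection method than for another would suggest, under this assumed score, that its genes are more often discussed in connection with the phenomenon of interest.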

