Peptide programs: applying fragment programs to protein classification
Source
Conference on Information and Knowledge Management
Proceeding of the 2nd international workshop on Data and text mining in bioinformatics
Napa Valley, California, USA
SESSION: Bio-data mining
Pages 37-44
Year of Publication: 2008
ISBN: 978-1-60558-251-1
Authors
Andre O. Falcao, University of Lisbon, Lisbon, Portugal
Daniel Faria, University of Lisbon, Lisbon, Portugal
António Ferreira, University of Lisbon, Lisbon, Portugal
Sponsors
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
SIGIR: ACM Special Interest Group on Information Retrieval
ACM: Association for Computing Machinery
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1458449.1458459

ABSTRACT

Functional prediction and classification of proteins is a central problem in bioinformatics. Alignment methods are useful but have limitations, which have prompted the development and use of machine learning approaches. However, traditional machine learning approaches cannot exploit sequence data directly; instead, they use derived sequence features or kernel functions to obtain a feature space. Because, in theory, all the information necessary to predict a protein's structure and function is contained in its sequence, a methodology that could exploit sequence data directly could be advantageous. A novel machine learning methodology for protein classification, inspired by the concept of fragment programs, is presented. This methodology consists of assigning a minimal computer program to each of the 20 amino acids, and then representing a protein as the program that results from sequentially applying the programs of the amino acids that compose its sequence. The basic concepts of the methodology presented (peptide programs) are discussed, and a framework is proposed for their implementation, including instruction set, virtual machine, evaluation procedures and convergence methods. The methodology is tested on the binary classification of 33,500 enzymes into 182 distinct Enzyme Commission (EC) classes. The average Matthews correlation coefficient of the binary classifiers is 0.75 in training and 0.68 in validation. Overall, the results obtained demonstrate the potential of the proposed methodology and its ability to extract knowledge from sequence data using very few computational resources.
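
The idea of one small program per amino acid, applied in sequence order, can be illustrated with a minimal Python sketch. Everything concrete in it is an assumption made for illustration: the abstract does not specify the instruction set, the virtual machine, or the decision rule, so the four register operations, the four-register state, the clamping, and the "first register is positive" rule below are hypothetical stand-ins rather than the authors' design. The training step (the "convergence methods" of the abstract) is not shown. The Matthews correlation coefficient function uses the standard confusion-matrix definition.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
N_REGISTERS = 4                        # hypothetical register count
LIMIT = 1e6                            # hypothetical clamp to keep values bounded


def empty_peptide_program():
    """Return a table with one (initially empty) fragment program per residue."""
    return {aa: [] for aa in AMINO_ACIDS}


def run(program, sequence):
    """Apply each residue's fragment program, in order, to one shared register state.

    An instruction is a hypothetical (op, dst, src) triple over register indices.
    """
    regs = [1.0] * N_REGISTERS
    for aa in sequence:
        for op, dst, src in program.get(aa, ()):  # unknown residues are skipped
            if op == "add":
                regs[dst] += regs[src]
            elif op == "sub":
                regs[dst] -= regs[src]
            elif op == "mul":
                regs[dst] *= regs[src]
            elif op == "mov":
                regs[dst] = regs[src]
            else:
                raise ValueError(f"unknown op: {op}")
            regs[dst] = max(-LIMIT, min(LIMIT, regs[dst]))
    return regs


def classify(program, sequence):
    """Hypothetical binary decision rule: positive if register 0 ends above zero."""
    return run(program, sequence)[0] > 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Hand-written toy program: 'A' adds register 1 to register 0, 'C' subtracts it.
    program = empty_peptide_program()
    program["A"] = [("add", 0, 1)]
    program["C"] = [("sub", 0, 1)]

    print(run(program, "AAC"))       # [2.0, 1.0, 1.0, 1.0]
    print(classify(program, "AAC"))  # True
    print(mcc(tp=5, tn=5, fp=0, fn=0))  # 1.0
```

In this sketch a protein is never converted into a fixed-length feature vector: the sequence itself drives the computation, which is the property the abstract emphasizes. One classifier of this kind per EC class would give the binary setup described above, with `mcc` scoring each classifier on its own confusion matrix.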

