ABSTRACT
We study the direct relationship between basic protein properties and protein function. Our goal is to develop a new tool for functional prediction that complements and supports other techniques based on sequence or structure information. To define this new measure of similarity between proteins, we collected a set of 453 features and properties that characterize proteins and are believed to be correlated with their structural and functional aspects. These include the composition and fraction of different groups of amino acids, predicted secondary structure content, molecular weight, average hydrophobicity, and isoelectric point, as well as properties extracted from database records of known protein sequences, such as subcellular location and tissue specificity.

We introduce the mixture model of probabilistic decision trees to learn the set of potentially complex relationships between features and function. To study these correlations, trees are created and tested on the Pfam sequence-based classification of proteins and the EC classification of enzyme families. The model is very effective in learning highly diverged protein families or families that are not defined based on sequence. The resulting tree structure indicates the properties that are strongly correlated with structural and functional aspects of protein families, and can be used to suggest a concise definition of a protein family.
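As a rough illustration of the idea described in the abstract, the sketch below computes a handful of sequence-derived features and combines several probabilistic decision trees by a weighted average of their leaf probabilities. It is a minimal sketch under stated assumptions, not the authors' implementation: the Kyte-Doolittle hydropathy scale, the two residue groups, the tree encoding, and the names `protein_features`, `tree_prob`, and `mixture_prob` are all hypothetical choices made for this example, and the abstract does not specify how the mixture is actually weighted or trained.

```python
from collections import Counter

# Kyte-Doolittle hydropathy values, used here as a stand-in scale
# (an assumption; the abstract does not say which scale was used).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Hypothetical residue groups chosen only for illustration.
GROUPS = {"charged": "DEKR", "aromatic": "FWY"}


def protein_features(seq):
    """Return a small dict of composition-based features for one sequence."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    # Fraction of each of the 20 standard amino acids.
    feats = {f"frac_{aa}": counts[aa] / n for aa in KD}
    # Fraction of each residue group.
    for name, members in GROUPS.items():
        feats[f"frac_{name}"] = sum(counts[aa] for aa in members) / n
    # Average hydropathy over residues that have a value on the scale.
    scored = [KD[aa] for aa in seq if aa in KD]
    feats["avg_hydropathy"] = sum(scored) / len(scored) if scored else 0.0
    feats["length"] = n
    return feats


def tree_prob(node, feats):
    """Walk one tree and return the membership probability stored at the leaf.

    A leaf is {"p": probability}; an internal node is
    {"feature": name, "threshold": t, "left": node, "right": node},
    taking the right branch when the feature value exceeds the threshold.
    """
    while "p" not in node:
        branch = "right" if feats[node["feature"]] > node["threshold"] else "left"
        node = node[branch]
    return node["p"]


def mixture_prob(trees, weights, feats):
    """Weighted average of the per-tree probabilities (weights are normalized)."""
    total = sum(weights)
    return sum(w * tree_prob(t, feats) for t, w in zip(trees, weights)) / total
```

With trees of this form, the features tested near the root of each tree are the ones that most strongly separate family members from non-members, which is the sense in which a tree can suggest a concise definition of a family.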