A multi-expert system for the automatic detection of protein domains from sequence information
Source: Annual Conference on Research in Computational Molecular Biology
Proceedings of the seventh annual international conference on Research in computational molecular biology
Berlin, Germany
Pages: 224 - 234  
Year of Publication: 2003
ISBN:1-58113-635-8
Authors
Niranjan Nagarajan  Cornell University, Ithaca, NY
Golan Yona  Cornell University, Ithaca, NY
Sponsors
SIGACT: ACM Special Interest Group on Algorithms and Computation Theory
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/640075.640104

ABSTRACT

We describe a novel method for detecting the domain structure of a protein from sequence information alone. The method is based on analyzing multiple sequence alignments that are derived from a database search. Multiple measures are defined to quantify the domain information content of each position along the sequence, and are combined into a single predictor using a neural network. The output is further smoothed and post-processed using a probabilistic model to predict the most likely transition or boundary positions between domains. The method was assessed using the domain definitions in SCOP for proteins of known structure and was compared to several other existing methods. Our method improves significantly over the best method available, the semi-manual Pfam domain database, while being fully automatic. Our method can also be used to verify domain partitions based on structural data. A few examples of predicted domain definitions and alternative partitions, as suggested by our method, are also discussed.
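
The general shape of such a pipeline (per-position measures from an alignment, a combined score, smoothing, then boundary candidates) can be illustrated with a minimal sketch. This is not the authors' implementation. The two measures used here (column entropy and gap fraction), the fixed equal weights standing in for the neural network, the moving-average smoother standing in for the probabilistic model, the toy alignment, and all function names are assumptions chosen only for illustration.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (bits) of the residues in one alignment column, ignoring gaps."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def gap_fraction(column):
    """Fraction of sequences that have a gap in this column."""
    return sum(1 for c in column if c == "-") / len(column)

def moving_average(values, window=3):
    """Centered moving average; the window is truncated at the sequence ends."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed

def candidate_boundaries(alignment, window=3, threshold=0.2):
    """Return column indices whose smoothed combined score reaches the threshold.

    Assumption: a fixed, equal-weight average replaces the trained neural
    network described in the abstract, and a threshold replaces the
    probabilistic post-processing step.
    """
    columns = list(zip(*alignment))
    max_entropy = math.log2(20)  # upper bound for 20 amino acid types
    scores = [
        0.5 * (column_entropy(col) / max_entropy) + 0.5 * gap_fraction(col)
        for col in columns
    ]
    smoothed = moving_average(scores, window)
    return [i for i, s in enumerate(smoothed) if s >= threshold]

# Hypothetical toy alignment: columns 4 and 5 are gapped in most sequences.
toy_alignment = [
    "ACDE--GHIK",
    "ACDE--GHIK",
    "ACDEFFGHIK",
    "ACDE--GHIK",
]
print(candidate_boundaries(toy_alignment))
```

On this toy input the gapped columns 4 and 5 are the only positions flagged. A real system would use many more measures, learned weights, and a model of domain lengths, so the sketch should be read only as an outline of the data flow.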


