A unified sequence-structure classification of protein sequences: combining sequence and structure in a map of the protein space
Source: Annual Conference on Research in Computational Molecular Biology
Proceedings of the fourth annual international conference on Computational molecular biology
Tokyo, Japan
Pages: 308 - 317  
Year of Publication: 2000
ISBN: 1-58113-186-0
Authors
Golan Yona  Department of Structural Biology, Fairchild Bld D-109, Stanford University, CA
Michael Levitt  Department of Structural Biology, Fairchild Bld D-109, Stanford University, CA
Sponsor
SIGACT: ACM Special Interest Group on Algorithms and Computation Theory
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/332306.332569

ABSTRACT

We analyze all known protein sequences in search of a global map of protein space that is consistent in terms of both sequence and structure. Our goal is to define clusters of homologous protein domains, beyond those detected by sequence-based methods alone, and then to build a three-dimensional (3D) model for each sequence that is homologous to a sequence of known 3D structure. The analysis applies both sequence-based and structure-based metrics to all protein sequences in a non-redundant (NR) database comprising all major sequence databases.

The analysis starts from the sequences of the SCOP database domains, which have known three-dimensional structures. These sequences are first clustered into families based on sequence similarity alone, without incorporating any information from the SCOP classification. Each sequence-based family is represented by a profile, and this profile is used to search the NR database with PSI-BLAST. Since PSI-BLAST can lead to false similarities, several different indices of validity are used to control the procedure. Each of the detected sequences is marked, and a profile is built for the whole cluster of similar sequences. A 3D model is then built for each sequence in the cluster using an alignment made with the profile as well as the known structures of the SCOP representatives in the cluster. Clusters based on SCOP domains are called type-I clusters. In all, we find 1421 type-I clusters with a total of 168,431 sequences (44.5% of our NR database).
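As a rough consistency check on the figures above, the stated counts imply an NR database of roughly 378,500 sequences and an average of about 118.5 sequences per type-I cluster. These are derived estimates, not numbers reported in the abstract, and the 44.5% figure is itself rounded. The variable names below are illustrative only.

```python
# Figures quoted in the abstract.
type_i_clusters = 1421
type_i_sequences = 168_431
fraction_of_nr = 0.445  # "44.5% of our NR database" (a rounded value)

# Derived estimates (not stated in the abstract).
implied_nr_size = type_i_sequences / fraction_of_nr
mean_cluster_size = type_i_sequences / type_i_clusters

print(f"implied NR size: about {implied_nr_size:,.0f} sequences")
print(f"mean type-I cluster size: about {mean_cluster_size:.1f} sequences")
```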

After all members of type-I clusters have been marked, we analyze the remaining sequences. The PSI-BLAST procedure is applied repeatedly, each time with a different query, to search what is left over from the previous run. This gives type-II clusters, which may overlap.
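The repeated search over leftover sequences can be sketched as a simple loop. This is a minimal illustration, not the authors' implementation. The function `build_type_ii_clusters` and the `search` callback are hypothetical; `search` stands in for a PSI-BLAST run plus the validity checks. The abstract does not say exactly how overlaps arise, so the sketch assumes that queries are drawn only from sequences not yet marked, while each search runs over all sequences left after the type-I step.

```python
from typing import Callable, FrozenSet, Iterable, List, Set


def build_type_ii_clusters(
    remaining: Iterable[str],
    search: Callable[[str, FrozenSet[str]], Set[str]],
) -> List[Set[str]]:
    """Greedy sketch: one cluster per query taken from still-unmarked sequences.

    `search(query, pool)` is a hypothetical stand-in for a validated
    PSI-BLAST search; it returns the members of `pool` similar to `query`.
    """
    pool = frozenset(remaining)   # everything not in a type-I cluster (assumption)
    marked: Set[str] = set()      # sequences already placed in some cluster
    clusters: List[Set[str]] = []
    for query in sorted(pool):    # sorted only to make the order deterministic
        if query in marked:
            continue              # already covered by an earlier run
        hits = set(search(query, pool)) | {query}
        clusters.append(hits)
        marked |= hits
    return clusters


# Toy similarity relation replacing real database searches.
toy_neighbors = {
    "s1": {"s2"},
    "s2": {"s1", "s3"},
    "s3": {"s2", "s4"},
    "s4": {"s3"},
    "s5": set(),
}
toy_clusters = build_type_ii_clusters(
    toy_neighbors, lambda q, pool: toy_neighbors[q] & pool
)
print(toy_clusters)
```

In the toy run, `s2` lands in two clusters, which illustrates how clusters built this way may overlap.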

Type-I and type-II clusters are then grouped using higher-level measures of similarity. Pairs of clusters that share proteins (significant overlap in membership) are marked first. The pairs of clusters are then compared using either a structure metric (when 3D structures are known) or a novel sequence profile metric, and clustered into superfamilies and “fold” families.
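The first grouping step, marking cluster pairs with significant membership overlap, might look like the sketch below. The abstract does not define "significant", so the overlap measure (shared members divided by the size of the smaller cluster) and the `min_fraction` threshold are assumptions made for illustration, and `overlapping_pairs` is a hypothetical name. The structure metric and the profile metric used afterwards are not specified in the abstract and are not modeled here.

```python
from itertools import combinations
from typing import List, Sequence, Set, Tuple


def overlapping_pairs(
    clusters: Sequence[Set[str]], min_fraction: float = 0.5
) -> List[Tuple[int, int]]:
    """Return index pairs of clusters whose membership overlap is 'significant'.

    Overlap is measured as |A & B| / min(|A|, |B|); both this measure and
    the default threshold are illustrative assumptions.
    """
    pairs: List[Tuple[int, int]] = []
    for (i, a), (j, b) in combinations(enumerate(clusters), 2):
        shared = len(a & b)
        if shared and shared / min(len(a), len(b)) >= min_fraction:
            pairs.append((i, j))
    return pairs


example_clusters = [{"s1", "s2"}, {"s2", "s3", "s4"}, {"s5"}]
marked_pairs = overlapping_pairs(example_clusters)
strict_pairs = overlapping_pairs(example_clusters, min_fraction=0.6)
print(marked_pairs, strict_pairs)
```

Normalizing by the smaller cluster means a small cluster fully contained in a large one still counts as overlapping; a symmetric measure such as the Jaccard index would be a stricter alternative.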

This analysis avoids the limitation of classifications that are based just on sequence comparison, and allows us to construct a 3D model for a substantial portion of the sequences in the NR database.

