ABSTRACT
We analyze all known protein sequences in search of a global map of protein space that is consistent in terms of both sequence and structure. Our goal is to define clusters of homologous protein domains, beyond those detected by sequence-based methods alone, and then to build a three-dimensional (3D) model for each sequence that is homologous to a sequence of known 3D structure. The analysis applies both sequence-based and structure-based metrics to all protein sequences in a non-redundant (NR) database that combines all major sequence databases.
The analysis starts from the sequences of the SCOP database domains, which have known three-dimensional structures. These sequences are first clustered into families based on sequence similarity alone, without incorporating any information from the SCOP classification. Each sequence-based family is represented by a profile, and this profile is used to search the NR database using PSI-BLAST. Since PSI-BLAST can lead to false similarities, several different indices of validity are used to control the procedure. Each of the detected sequences is marked, and a profile is built for the whole cluster of similar sequences. A 3D model is then built for each sequence in the cluster, using an alignment derived from the profile together with the known structures of the SCOP representatives in the cluster. Clusters based on SCOP domains are called type-I clusters. In all, we find 1421 type-I clusters with a total of 168,431 sequences (44.5% of our NR database).
After all members of type-I clusters have been marked, we analyze the remaining sequences. The PSI-BLAST procedure is applied repeatedly, each time with a different query, to search what is left over from the previous run. This gives type-II clusters, which may overlap.
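The repeated search over the leftover sequences can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: `build_type2_clusters`, `toy_search`, and the `NEIGHBORS` table are hypothetical names, the `search` callable merely stands in for a PSI-BLAST run, and the sketch assumes that each query is searched against all sequences outside the type-I clusters (which is one way the resulting clusters can overlap). The order in which queries are chosen is also an assumption.

```python
from typing import Callable, Dict, Iterable, Set

# Hypothetical toy similarity table standing in for real PSI-BLAST hits.
NEIGHBORS: Dict[str, Set[str]] = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b", "d"},
    "d": {"c"},
    "e": set(),
}


def toy_search(query: str, pool: Set[str]) -> Set[str]:
    """Stand-in for a PSI-BLAST search: return the query's neighbors in the pool."""
    return NEIGHBORS.get(query, set()) & pool


def build_type2_clusters(
    remaining: Iterable[str],
    search: Callable[[str, Set[str]], Set[str]],
) -> Dict[str, Set[str]]:
    """Seed one cluster per still-unmarked query; clusters may share members."""
    pool = set(remaining)          # assumed: every sequence outside type-I clusters
    marked: Set[str] = set()       # sequences already placed in some cluster
    clusters: Dict[str, Set[str]] = {}
    for query in sorted(pool):     # assumed: deterministic, arbitrary query order
        if query in marked:
            continue               # already detected by an earlier query
        hits = search(query, pool) | {query}
        clusters[query] = hits
        marked |= hits             # mark hits so they are not used as new queries
    return clusters


if __name__ == "__main__":
    for seed, members in build_type2_clusters(NEIGHBORS, toy_search).items():
        print(seed, sorted(members))
```

In this toy run, sequence `b` ends up in the clusters seeded by both `a` and `c`, which mirrors the remark that type-II clusters may overlap.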
Type-I and type-II clusters are then grouped using higher-level measures of similarity. Pairs of clusters that share member proteins (significant overlap in membership) are marked first. The pairs of clusters are then compared using either a structure metric (when 3D structures are known) or a novel sequence profile metric, and clustered into superfamilies and "fold" families.
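The first grouping step, marking cluster pairs with significant overlap in membership, can be sketched as follows. This is only an illustration: the abstract does not define "significant", so the overlap measure (shared members divided by the size of the smaller cluster) and the `min_fraction` threshold of 0.5 are assumptions, and `overlap_fraction` and `overlapping_pairs` are hypothetical names.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def overlap_fraction(a: Set[str], b: Set[str]) -> float:
    """Shared members relative to the smaller cluster (an assumed measure)."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def overlapping_pairs(
    clusters: Dict[str, Set[str]],
    min_fraction: float = 0.5,     # assumed threshold, not taken from the paper
) -> List[Tuple[str, str]]:
    """Return the cluster-name pairs whose membership overlap meets the threshold."""
    pairs = []
    for name_a, name_b in combinations(sorted(clusters), 2):
        if overlap_fraction(clusters[name_a], clusters[name_b]) >= min_fraction:
            pairs.append((name_a, name_b))
    return pairs


if __name__ == "__main__":
    example = {
        "I-1": {"p1", "p2", "p3"},
        "II-1": {"p2", "p3", "p4", "p5"},
        "II-2": {"p6"},
    }
    print(overlapping_pairs(example))
```

Pairs marked this way would then be passed on to the structure metric or the profile metric; those comparisons are not sketched here because the abstract does not describe them.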
This analysis avoids the limitations of classifications that are based on sequence comparison alone, and allows us to construct a 3D model for a substantial portion of the sequences in the NR database.