|
ABSTRACT
We apply the concept of subset seeds proposed in [1] to similarity search in protein sequences. The main question studied is the design of efficient seed alphabets to construct seeds with optimal sensitivity/selectivity trade-offs. We propose several different design methods and use them to construct several alphabets. We then perform a comparative analysis of seeds built over those alphabets and compare them with the standard Blastp seeding method [2], [3], as well as with the family of vector seeds proposed in [4]. While the formalism of subset seeds is less expressive (but less costly to implement) than the cumulative principle used in Blastp and vector seeds, our seeds show a similar or even better performance than Blastp on Bernoulli models of proteins compatible with the common BLOSUM62 matrix. Finally, we perform a large-scale benchmarking of our seeds against several main databases of protein alignments. Here again, the results show a comparable or better performance of our seeds versus Blastp.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
G. Kucherov, L. Noé, and M. Roytberg, "A Unifying Framework for Seed Sensitivity and Its Application to Subset Seeds," J. Bioinformatics and Computational Biology, vol. 4, no. 2, pp. 553- 570, Apr. 2006 (preliminary version in WABI '05).
|
| |
2
|
S. Altschul, W. Gish, W. Miller, E. Myers, and D. Lipman, "Basic Local Alignment Search Tool," J. Molecular Biology, vol. 215, pp. 403-410, 1990.
|
| |
3
|
S. Altschul, T. Madden, A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. Lipman, "Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs," Nucleic Acids Research, vol. 25, no. 17, pp. 3389-3402, 1997.
|
| |
4
|
|
| |
5
|
B. Ma, J. Tromp, and M. Li, "PatternHunter: Faster and More Sensitive Homology Search," Bioinformatics, vol. 18, no. 3, pp. 440- 445, 2002.
|
| |
6
|
W.J. Kent, "BLAT-The BLAST-Like Alignment Tool," Genome Research, vol. 12, pp. 656-664, 2002.
|
| |
7
|
M. Li, B. Ma, D. Kisman, and J. Tromp, "PatternHunter II: Highly Sensitive and Fast Homology Search," J. Bioinformatics and Computational Biology, vol. 2, no. 3, pp. 417-439, 2004 (earlier version in GIW '03).
|
| |
8
|
L. Noé and G. Kucherov, "YASS: Enhancing the Sensitivity of DNA Similarity Search," Nucleic Acid Research, vol. 33, pp. W540- W543, 2005.
|
| |
9
|
|
| |
10
|
|
| |
11
|
|
 |
12
|
|
| |
13
|
G. Kucherov, L. Noé, and M. Roytberg, "Multi-Seed Lossless Filtration," Proc. 15th Ann. Combinatorial Pattern Matching Symp. (CPM '04), pp. 297-310, July 2004.
|
| |
14
|
|
| |
15
|
J. Xu, D. Brown, M. Li, and B. Ma, "Optimizing Multiple Spaced Seeds for Homology Search," Proc. 15th Symp. Combinatorial Pattern Matching, pp. 47-58, July 2004.
|
| |
16
|
|
| |
17
|
P. Peterlongo, L. Noé, D. Lavenier, G. Georges, J. Jacques, G. Kucherov, and M. Giraud, "Protein Similarity Search with Subset Seeds on a Dedicated Reconfigurable Hardware," Proc. Second Workshop Parallel Computational Biology, 2007.
|
| |
18
|
V.H. Nguyen and D. Lavenier, "Speeding Up Subset Seed Algorithm for Intensive Protein Sequence Comparison," Proc. Sixth IEEE Int'l Conf. Research, Innovation & Vision for the Future (RIVF '08), pp. 57-63, 2008.
|
| |
19
|
L. Noé and G. Kucherov, "Improved Hit Criteria for DNA Local Alignment," BMC Bioinformatics, vol. 5, no. 149, Oct. 2004.
|
| |
20
|
L. Zhou, J. Stanton, and L. Florea, "Universal Seeds for cDNA-to-Genome Comparison," BMC Bioinformatics, vol. 9, no. 36, 2008.
|
| |
21
|
B. Ma and H. Yao, "Seed Optimization is No Easier Than Optimal Golomb Ruler Design," Proc. Sixth Asia Pacific Bioinformatics Conf. (APBC '08), pp. 133-144, Jan. 2008.
|
| |
22
|
|
| |
23
|
T. Li, K. Fan, J. Wang, and W. Wang, "Reduction of Protein Sequence Complexity by Residue Grouping," J. Protein Eng., vol. 16, pp. 323-330, 2003.
|
| |
24
|
L. Murphy, A. Wallqvist, and R. Levy, "Simplified Amino Acid Alphabets for Protein Fold Recognition and Implications for Folding," J. Protein Eng., vol. 13, pp. 149-152, 2000.
|
| |
25
|
|
| |
26
|
S. Henikoff and J. Henikoff, "Amino Acid Substitution Matrices from Protein Blocks," Proc. Nat'l Academy of Sciences USA, vol. 89, pp. 10915-10919, 1992.
|
| |
27
|
S. Henikoff and J. Henikoff, "Automated Assembly of Protein Blocks for Database Searching," Nucleic Acids Research, vol. 19, no. 23, pp. 6565-6572, 1991.
|
 |
28
|
|
| |
29
|
L. Ilie and S. Ilie, "Long Spaced Seeds for Finding Similarities Between Biological Sequences," Proc. Second Int'l Conf. Bioinformatics & Computational Biology (BIOCOMP '07), pp. 3-8, 2007.
|
| |
30
|
A. Bahr, J. Thompson, J. Thierry, and O. Poch, "BAliBASE (Benchmark Alignment dataBASE): Enhancements for Repeats, Transmembrane Sequences and Circular Permutations," Nucleic Acids Research, vol. 29, no. 1, pp. 323-326, 2001.
|
| |
31
|
A. Stebbings and K. Mizuguchi, "HOMSTRAD: Recent Developments of the Homologous Protein Structure Alignment Database," Nucleic Acids Research, vol. 32, pp. D203-D207, 2004.
|
| |
32
|
A.R. Subramanian, J. Weyer-Menkhoff, M. Kaufmann, and B. Morgenstern, "DIALIGN-T: An Improved Algorithm for Segment-Based Multiple Sequence Alignment," BMC Bioinformatics, vol. 6, no. 66, 2005.
|
| |
33
|
G. Raghava, S. Searle, P. Audley, J. Barber, and G. Barton, "OXBench: A Benchmark for Evaluation of Protein Multiple Sequence Alignment Accuracy," BMC Bioinformatics, vol. 4, no. 47, 2003.
|
| |
34
|
R. Finn, J. Mistry, B. Schuster-Bckler, S. Griffiths-Jones, V. Hollich, T. Lassmann, S. Moxon, M. Marshall, A. Khanna, R. Durbin, S. Eddy, E. Sonnhammer, and A. Bateman, "PFAM: Clans, Web Tools and Services," Nucleic Acids Research, vol. 34, pp. D247-D251, 2006.
|
| |
35
|
R.C. Edgar, "MUSCLE: Multiple Sequence Alignment with High Accuracy and High Throughput," Nucleic Acids Research, vol. 32, no. 5, pp. 1792-1797, 2004.
|
| |
36
|
I. Letunic, R. Copley, B. Pils, S. Pinkert, J. Schultz, and P. Bork, "SMART 5: Domains in the Context of Genomes and Networks," Nucleic Acids Research, vol. 34, no. 1, pp. D257-D260, 2006.
|
| |
37
|
R. Nunez Miguel, J. Shi, and K. Mizuguchi, "Protein Fold Recognition and Comparative Modeling using HOMSTRAD, JOY and FUGUE," Protein Structure Prediction: Bioinformatic Approach, pp. 143-169, Int'l University Line Publishers, 2001.
|
| |
38
|
|
| |
39
|
P. Peterlongo, L. Noé, D. Lavenier, N.V.H., G. Kucherov, and M. Giraud, "Optimal Neighborhood Indexing for Protein Similarity Search," BMC Bioinformatics, vol. 9, no. 534, 2008.
|
INDEX TERMS
Primary Classification:
J.
Computer Applications
J.3
LIFE AND MEDICAL SCIENCES
Subjects:
Biology and genetics
Additional Classification:
F.
Theory of Computation
F.2
ANALYSIS OF ALGORITHMS AND PROBLEM COMPLEXITY
F.2.2
Nonnumerical Algorithms and Problems
Subjects:
Sorting and searching
I.
Computing Methodologies
I.5
PATTERN RECOGNITION
I.5.3
Clustering
Subjects:
Similarity measures
I.6
SIMULATION AND MODELING
I.6.4
Model Validation and Analysis
I.6.5
Model Development
Subjects:
Modeling methodologies
J.
Computer Applications
J.3
LIFE AND MEDICAL SCIENCES
Subjects:
Medical information systems
General Terms:
Algorithms,
Design,
Measurement,
Performance
Keywords:
Protein sequences,
protein databases,
local alignment,
similarity search,
seeds,
subset seeds,
multiple seeds,
seed alphabet,
sensitivity,
selectivity.
|