ACM Home Page
Please provide us with feedback. Feedback
Designing Patterns and Profiles for Faster HMM Search
Full text PdfPdf (504 KB)
Source IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB) archive
Volume 6 ,  Issue 2  (April 2009) table of contents
Pages 232-243  
Year of Publication: 2009
ISSN:1545-5963
Authors
Yanni Sun  Washington University, St. Louis
Jeremy Buhler  Washington University, St. Louis
Publisher
IEEE Computer Society Press  Los Alamitos, CA, USA
Bibliometrics
Downloads (6 Weeks): 12,   Downloads (12 Months): 67,   Citation Count: 0
Additional Information:

abstract   references   index terms   collaborative colleagues  

Tools and Actions: Review this Article  
DOI Bookmark: 10.1109/TCBB.2008.14

ABSTRACT

Profile HMMs are powerful tools for modeling conserved motifs in proteins. They are widely used by search tools to classify new protein sequences into families based on domain architecture. However, the proliferation of known motifs and new proteomic sequence data poses a computational challenge for search, requiring days of CPU time to annotate an organism's proteome. It is highly desirable to speed up HMM search in large databases. We design PROSITE-like patterns and short profiles that are used as filters to rapidly eliminate protein-motif pairs for which a full profile HMM comparison does not yield a significant match. The design of the pattern-based filters is formulated as a multichoice knapsack problem. Profile-based filters with high sensitivity are extracted from a profile HMM based on their theoretical sensitivity and false positive rate. Experiments show that our profile-based filters achieve high sensitivity (near 100 percent) while keeping around 20\times speedup with respect to the unfiltered search program. Pattern-based filters typically retain at least 90 percent of the sensitivity of the source HMM with 30-40\times speedup. The profile-based filters have sensitivity comparable to the multistage filtering strategy HMMERHEAD [15] and are faster in most of our experiments.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
T.K. Attwood, P. Bradley, D.R. Flower, A. Gaulton, N. Maudling, A. Mitchell, G. Moulton, A. Nordle, K. Paine, P. Taylor, A. Uddin, and C. Zygouri, "PRINTS and Its Automatic Supplement," Nucleic Acids Research, vol. 31, pp. 400-402, 2003.
 
2
A. Bateman, L. Coin, R. Durbin, R.D. Finn, V. Hollich, S. Griffiths-Jones, A. Khanna, M. Marshall, S. Moxon, E.L.L. Sonnhammer, D.J. Studholme, C. Yeats, and S.R. Eddy, "The Pfam Protein Families Database," Nucleic Acids Research, vol. 32, pp. D138-D141, 2004.
 
3
C. Bru, E. Courcelle, S. Carrere, Y. Beausse, S. Dalmar, and D. Kahn, "The ProDom Database of Protein Domain Families: More Emphasis on 3D," Nucleic Acids Research, database issue, vol. 33, pp. D212-D215, 2005.
 
4
A.K. Chandra, D.S. Hirschberg, and C.K. Wong, "Approximate Algorithms for Some Generalized Knapsack Problems," Theoretical Computer Science, vol. 3, pp. 293-304, 1976.
 
5
S.R. Eddy, "Profile Hidden Markov Models," Bioinformatics, vol. 14, no. 9, pp. 755-763, 1998.
 
6
P.M.K. Gordon and C.W. Sensen, "Osprey: A Comprehensive Tool Employing Novel Methods for the Design of Oligonucleotides for DNA Sequencing and Microarrays," Nucleic Acids Research, vol. 32, no. 17, p. e133, 2004.
 
7
S. Henikoff, J.G. Henikoff, and S. Pietrokovski, "Blocks+: A Non-Redundant Database of Protein Alignment Blocks Derived from Multiple Compilations," Bioinformatics, vol. 15, no. 6, pp. 471-479, 1999.
 
8
J.G. Henikoff, E.A. Greene, S. Pietrokovski, and S. Henikoff, "Increased Coverage of Protein Families with the Blocks Database Servers," Nucleic Acids Research, vol. 28, pp. 228-230, 2000.
 
9
N. Hulo, A. Bairoch, V. Bulliard, L. Cerutti, E. De Castro, P.S. Langendijk-Genevaux, M. Pagni, and C.J.A. Sigrist, "The PROSITE Database," Nucleic Acids Research, vol. 34, pp. 227-230, 2006.
10
 
11
A. Krogh, M. Brown, I.S. Mian, K. Sjolander, and D. Haussler, "Hidden Markov Models in Computational Biology: Applications to Protein Modeling," J. Molecular Biology, vol. 235, pp. 1501-1531, 1994.
 
12
I. Letunic, R.R. Copley, B. Pils, S. Pinkert, J. Schultz, and P. Bork, "SMART 5: Domains in the Context of Genomes and Networks," Nucleic Acids Research, vol. 1, no. 34, database issue, pp. D257- D260, 2006.
 
13
M. Li, B. Ma, D. Kisman, and J. Tromp, "PatternHunter II: Highly Sensitive and Fast Homology Search," J. Bioinformatics and Computational Biology, vol. 2, no. 3, pp. 417-439, 2004.
 
14
M. Madera, C. Vogel, S.K. Kummerfeld, C. Chothia, and J. Gough, "The SUPERFAMILY Database in 2004: Additions and Improvements," Nucleic Acids Research, vol. 32, pp. D235-D239, 2004.
 
15
E. Portugaly and M. Ninio, "HMMERHEAD--Accelerating HMM Searches on Large Databases (Poster)," Proc. Eighth Ann. Int'l Conf. Computational Molecular Biology (RECOMB), 2004.
 
16
E. Portugaly, A. Harel, N. Linial, and M. Linial, "EVEREST: Automatic Identification and Classification of Protein Domains in All Protein Sequences," BMC Bioinformatics, vol. 7, p. 277, 2006.
 
17
 
18
J. Schultz, F. Milpetz, P. Bork, and C.P. Ponting, "SMART, a Simple Modular Architecture Research Tool: Identification of Signaling Domains," Proc. Nat'l Academy Sciences USA, vol. 95, pp. 5857-5864, 1998.
19
 
20
 
21
J. Xu, D.G. Brown, M. Li, and B. Ma, "Optimizing Multiple Spaced Seeds for Homology Search," Lecture Notes in Computer Science, vol. 3109, pp. 47-58, Springer 2004.

Collaborative Colleagues:
Yanni Sun: colleagues
Jeremy Buhler: colleagues