Pairwise statistical significance of local sequence alignment using multiple parameter sets
Source
Conference on Information and Knowledge Management
Proceeding of the 2nd International Workshop on Data and Text Mining in Bioinformatics
Napa Valley, California, USA
SESSION: Bio-text mining
Pages 53-60
Year of Publication: 2008
ISBN: 978-1-60558-251-1
Authors
Ankit Agrawal  Iowa State University, Ames, IA, USA
Xiaoqiu Huang  Iowa State University, Ames, IA, USA
Sponsors
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
SIGIR: ACM Special Interest Group on Information Retrieval
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1458449.1458462

ABSTRACT

Accurate estimation of the statistical significance of a pairwise alignment is an important problem in sequence comparison. Recently, a comparative study of pairwise statistical significance and database statistical significance was conducted. In this paper, we extend that earlier work on pairwise statistical significance by incorporating multiple parameter sets. Preliminary results for a knowledge discovery application such as homology detection show that using multiple parameter sets for pairwise statistical significance estimates gives significantly better coverage than using a single parameter set, at least at some error levels. Moreover, the fact that performance does not degrade when multiple parameter sets are used is strong evidence that the assumption that the score distribution follows an extreme value distribution remains valid in that setting. Results of pairwise statistical significance using multiple parameter sets are further shown to be significantly better than the database statistical significance estimates reported by BLAST and PSI-BLAST, and comparable to, and at times significantly better than, those of SSEARCH.
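
The extreme value assumption mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it only evaluates the standard Karlin-Altschul form P(S >= x) = 1 - exp(-K m n e^(-lambda x)) for a single sequence pair. The function name `gumbel_pvalue` and the numeric values of `K` and `lam` are assumptions chosen for illustration; in practice these parameters depend on the scoring scheme and, for pairwise significance, on the specific sequence pair.

```python
import math


def gumbel_pvalue(score, m, n, lam, K):
    """Return P(S >= score) under an extreme value (Gumbel) approximation.

    score : alignment score x
    m, n  : lengths of the two sequences
    lam, K: statistical parameters of the score distribution
            (hypothetical inputs here; they must be estimated elsewhere)
    """
    # Expected number of chance alignments scoring at least `score`.
    expected = K * m * n * math.exp(-lam * score)
    # 1 - exp(-expected), computed with expm1 to stay accurate for tiny values.
    return -math.expm1(-expected)


if __name__ == "__main__":
    # Illustrative parameter values only (assumed, not taken from the paper).
    for s in (40, 60, 80, 100):
        p = gumbel_pvalue(s, m=300, n=300, lam=0.267, K=0.041)
        print(f"score={s:4d}  P-value={p:.3e}")
```

A higher score yields a smaller P-value, and for very small expected counts the P-value is approximately equal to the expected count itself.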


