|
ABSTRACT
Pairwise local sequence alignment methods have been the prevailing technique to identify homologous nucleotides between related species. However, existing methods that identify and align all homologous nucleotides in one or more genomes have suffered from poor scalability and limited accuracy. We propose a novel method that couples a gapped extension heuristic with an efficient filtration method for identifying interspersed repeats in genome sequences. During gapped extension, we use the MUSCLE implementation of progressive global multiple alignment with iterative refinement. The resulting gapped extensions potentially contain alignments of unrelated sequence. We detect and remove such undesirable alignments using a hidden Markov model (HMM) to predict the posterior probability of homology. The HMM emission frequencies for nucleotide substitutions can be derived from any time-reversible nucleotide substitution matrix. We evaluate the performance of our method and previous approaches on a hybrid data set of real genomic DNA with simulated interspersed repeats. Our method outperforms a related method in terms of sensitivity, positive predictive value, and localizing boundaries of homology. The described methods have been implemented in freely available software, Repeatoire, available from: http://wwwabi.snv.jussieu.fr/public/Repeatoire.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
B. Ma, J. Tromp, and M. Li, "PatternHunter: Faster and More Sensitive Homology Search," Bioinformatics, vol. 18, no. 3, pp. 440- 445, Mar. 2002.
|
| |
2
|
|
| |
3
|
L. Noé and G. Kucherov, "Improved Hit Criteria for DNA Local Alignment," BMC Bioinformatics, vol. 5, p. 149, Oct. 2004.
|
| |
4
|
|
| |
5
|
S. Kurtz, J.V. Choudhuri, E. Ohlebusch, C. Schleiermacher, J. Stoye, and R. Giegerich, "REPuter: The Manifold Applications of Repeat Analysis on a Genomic Scale," Nucleic Acids Research, vol. 29, no. 22, pp. 4633-4642, Nov. 2001.
|
| |
6
|
J. Jurka, V.V. Kapitonov, A. Pavlicek, P. Klonowski, O. Kohany, and J. Walichiewicz, "Repbase Update, A Database of Eukaryotic Repetitive Elements," Cytogenetic and Genome Research, vol. 110, nos. 1-4, pp. 462-467, 2005.
|
| |
7
|
|
 |
8
|
|
| |
9
|
Y. Sun and J. Buhler, "Designing Multiple Simultaneous Seeds for DNA Similarity Search," J. Computational Biology, vol. 12, no. 6, pp. 847-861, 2005.
|
| |
10
|
J. Xu, D.G. Brown, M. Li, and B. Ma, "Optimizing Multiple Spaced Seeds for Homology Search," Proc. Combinatorial Pattern Matching (CPM), pp. 47-58, 2004.
|
| |
11
|
J. Flannick and S. Batzoglou, "Using Multiple Alignments to Improve Seeded Local Alignment Algorithms," Nucleic Acids Research, vol. 33, no. 14, pp. 4563-4577, 2005.
|
| |
12
|
K.R. Rasmussen, J. Stoye, and E.W. Myers, "Efficient q-Gram Filters for Finding All Epsilon-Matches over a Given Length," J. Computational Biology, vol. 13, pp. 189-203, 2005.
|
| |
13
|
L. Li, C.J. Stoeckert, and D.S. Roos, "OrthoMCL: Identification of Ortholog Groups for Eukaryotic Genomes," Genome Research, vol. 13, no. 9, pp. 2178-2189, Sept. 2003.
|
| |
14
|
D.B. Jaffe, J. Butler, S. Gnerre, E. Mauceli, K. Lindblad-Toh, J.P. Mesirov, M.C. Zody, and E.S. Lander, "Whole-Genome Sequence Assembly for Mammalian Genomes: Arachne 2," Genome Research, vol. 13, no. 1, pp. 91-96, Jan. 2003.
|
| |
15
|
C. Ane and M. Sanderson, "Missing the Forest for the Trees: Phylogenetic Compression and Its Implications for Inferring Complex Evolutionary Histories," Systems Biology, vol. 54, no. 1, pp. I311-I317, 2005.
|
| |
16
|
M. Margulies and 55 Other Authors, "Genome Sequencing in Microfabricated High-Density Picolitre Reactors," Nature, vol. 437, no. 7057, pp. 376-380, 2005.
|
| |
17
|
D. Ryan, M. Rahimi, J. Lund, R. Mehta, and B.A.A. Parviz, "Toward Nanoscale Genome Sequencing," Trends in Biotechnology, vol. 25, pp. 385-389, July 2007.
|
| |
18
|
R.G. Blazej, P. Kumaresan, and R.A. Mathies, "Microfabricated Bioprocessor for Integrated Nanoliter-Scale Sanger DNA Sequencing," Proc. Nat'l Academy Sciences USA, vol. 103, no. 19, pp. 7240- 7245, May 2006.
|
| |
19
|
J. Shendure and H. Ji, "Next-Generation DNA Sequencing," Nature Biotechnology, vol. 26, no. 10, pp. 1135-1145, Oct. 2008.
|
| |
20
|
A.C.E. Darling, B. Mau, F.R. Blattner, and N.T. Perna, "Mauve: Multiple Alignment of Conserved Genomic Sequence with Rearrangements," Genome Research, vol. 14, no. 7, pp. 1394-403, 2004.
|
| |
21
|
M. Höhl, S. Kurtz, and E. Ohlebusch, "Efficient Multiple Genome Alignment," Bioinformatics, vol. 18, no. 1, pp. S312-S320, 2002.
|
| |
22
|
T.J. Treangen and X. Messeguer, "M-GCAT: Interactively and Efficiently Constructing Large-Scale Multiple Genome Comparison Frameworks in Closely Related Species," BMC Bioinformatics, vol. 7, p. 433, Oct. 2006.
|
| |
23
|
C.N. Dewey and L. Pachter, "Evolution at the Nucleotide Level: The Problem of Multiple Whole-Genome Alignment," Human Molecular Genetics, vol. 15, no. 1, pp. R51-56, Apr. 2006.
|
| |
24
|
M. Sammeth and J. Heringa, "Global Multiple-Sequence Alignment with Repeats," Proteins, vol. 64, pp. 263-274, Apr. 2006.
|
| |
25
|
B. Raphael, D. Zhi, H. Tang, and P. Pevzner, "A Novel Method for Multiple Alignment of Sequences with Repeated and Shuffled Elements," Genome Research, vol. 14, no. 11, pp. 2336-2346, 2004.
|
| |
26
|
|
| |
27
|
S. Kumar and A. Filipski, "Multiple Sequence Alignment: In Pursuit of Homologous DNA Positions," Genome Research, vol. 17, no. 2, pp. 127-135, 2007.
|
| |
28
|
S. Schwartz, J.W. Kent, A. Smit, Z. Zhang, R. Baertsch, R.C. Hardison, D. Haussler, and W. Miller, "Human-Mouse Alignments with BLASTZ," Genome Research, vol. 13, no. 1, pp. 103-107, Jan. 2003.
|
| |
29
|
W.R. Pearson, "Rapid and Sensitive Sequence Comparison with FASTP and FASTA," Methods in Enzymology, vol. 183, pp. 63-98, 1990.
|
| |
30
|
M. Blanchette, W. Kent, C. Riemer, L. Elnitski, A.F. Smit, K.M. Roskin, R. Baertsch, K. Rosenbloom, H. Clawson, E. Green, D. Haussler, and W. Miller, "Aligning Multiple Genomic Sequences with the Threaded Blockset Aligner," Genome Research, vol. 14, pp. 708-715, 2004.
|
| |
31
|
B. Morgenstern, K. French, A. Dress, and T. Werner, "DIALIGN: Finding Local Similarities by Multiple Sequence Alignment," Bioinformatics, vol. 14, pp. 290-294, 1998.
|
| |
32
|
Y. Zhang and M.S. Waterman, "An Eulerian Path Approach to Local Multiple Alignment for DNA Sequences," Proc. Nat'l Academy Sciences USA, vol. 102, no. 5, pp. 1285-1290, 2005.
|
| |
33
|
M. Brudno, C.B. Do, G.M. Cooper, M.F. Kim, E. Davydov, N.C.S. Program, E.D. Green, A. Sidow, and S. Batzoglou, "LAGAN and Multi-LAGAN: Efficient Tools for Large-Scale Multiple Alignment of Genomic DNA," Genome Research., vol. 13, no. 4, pp. 721-731, 2003.
|
| |
34
|
|
| |
35
|
L. Wang and T. Jiang, "On the Complexity of Multiple Sequence Alignment," J. Computational Biology, vol. 1, no. 4, pp. 337-348, 1994.
|
| |
36
|
J.D. Thompson, D.G. Higgins, and T. Gibson, "CLUSTAL W: Improving the Sensitivity of Progressive Multiple Sequence Alignment through Sequence Weighting, Position Specific Gap Penalties and Weight Matrix Choice," Nucleic Acids Research, vol. 22, no. 22, pp. 4673-80, 1994.
|
| |
37
|
C. Notredame, D.G. Higgins, and J. Heringa, "T-Coffee: A Novel Method for Fast and Accurate Multiple Sequence Alignment," J. Molecular Biology, vol. 302, no. 1, pp. 205-217, Sept. 2000.
|
| |
38
|
R. Edgar, "MUSCLE: Multiple Sequence Alignment with High Accuracy and High Throughput," Nucleic Acids Research., vol. 32, no. 5, pp. 1792-1797, 2004.
|
| |
39
|
C.B. Do, M.S. Mahabhashyam, M. Brudno, and S. Batzoglou, "ProbCons: Probabilistic Consistency-Based Multiple Sequence Alignment," Genome Research, vol. 15, no. 2, pp. 330-340, Feb. 2005.
|
| |
40
|
A.E. Darling, T.J. Treangen, L. Zhang, C. Kuiken, X. Messeguer, and N.T. Perna, "Procrastination Leads to Efficient Filtration for Local Multiple Alignment," Algorithms in Bioinformatics, vol. 4175, pp. 126-137, 2006.
|
| |
41
|
|
| |
42
|
|
| |
43
|
S.F. Altschul, T.L. Madden, A.A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D.J. Lipman, "Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs," Nucleic Acids Research, vol. 25, no. 17, pp. 3389-3402, Sept. 1997.
|
| |
44
|
W.J. Kent, "BLAT-The BLAST-Like Alignment Tool," Genome Research., vol. 12, no. 4, pp. 656-664, Apr. 2002.
|
| |
45
|
F. Chiaromonte, V.B. Yap, and W. Miller, "Scoring Pairwise Genomic Sequence Alignments," Proc. Pacific Symp. Biocomputing, pp. 115-126, 2002.
|
| |
46
|
Y.-K. Yu, J.C. Wootton, and S.F. Altschul, "The Compositional Adjustment of Amino Acid Substitution Matrices," Proc. Nat'l Academy Sciences USA, vol. 100, no. 26, pp. 15 688-15 693, Dec. 2003.
|
| |
47
|
|
| |
48
|
|
| |
49
|
|
| |
50
|
Guillaume Achaz , Frédéric Boyer , Eduardo P. C. Rocha , Alain Viari , Eric Coissac, Repseek, a tool to retrieve approximate repeats from large DNA sequences, Bioinformatics, v.23 n.1, p.119-121, January 2007
[doi> 10.1093/bioinformatics/btl519]
|
| |
51
|
E.P. Rocha and A. Blanchard, "Genomic Repeats, Genome Plasticity and the Dynamics of Mycoplasma, Evolution," Nucleic Acids Research, vol. 30, no. 9, pp. 2031-2042, May 2002.
|
| |
52
|
J.D. Thompson, F. Plewniak, and O. Poch, "A Comprehensive Comparison of Multiple Sequence Alignment Programs," Nucleic Acids Research, vol. 27, no. 13, pp. 2682-2690, 1999.
|
| |
53
|
|
INDEX TERMS
Primary Classification:
I.
Computing Methodologies
I.2
ARTIFICIAL INTELLIGENCE
I.2.8
Problem Solving, Control Methods, and Search
Subjects:
Heuristic methods
Additional Classification:
F.
Theory of Computation
F.2
ANALYSIS OF ALGORITHMS AND PROBLEM COMPLEXITY
F.2.2
Nonnumerical Algorithms and Problems
Subjects:
Sequencing and scheduling
G.
Mathematics of Computing
G.3
PROBABILITY AND STATISTICS
Subjects:
Markov processes
I.
Computing Methodologies
I.5
PATTERN RECOGNITION
I.5.2
Design Methodology
Subjects:
Pattern analysis
I.6
SIMULATION AND MODELING
I.6.4
Model Validation and Analysis
J.
Computer Applications
J.3
LIFE AND MEDICAL SCIENCES
Subjects:
Biology and genetics
General Terms:
Algorithms,
Design,
Performance,
Theory
Keywords:
Sequence alignment,
genome comparison,
DNA repeats,
local multiple alignment,
hidden Markov model,
gapped extension.
|