|
ABSTRACT
For the last twenty years fragment assembly in DNA sequencing followed the “overlap - layout - consensus” paradigm that is used in all currently available assembly tools. Although this approach proved to be useful in assembling clones, it faces difficulties in genomic shotgun assembly: the existing algorithms make assembly errors and are often unable to resolve repeats even in prokaryotic genomes. Biologists are well-aware of these errors and are forced to carry additional experiments to verify the assembled contigs.
We abandon the classical “overlap - layout - consensus” approach in favor of a new Eulerian Superpath approach that, for the first time, resolves the problem of repeats in fragment assembly. Our main result is the reduction of the fragment assembly to a variation of the classical Eulerian path problem. This reduction opens new possibilities for repeat resolution and allows one to generate error-free solutions of the large-scale fragment assembly problems. The major improvement of EULER over other algorithms is that it resolves all repeats except long perfect repeats that are theoretically impossible to resolve without additional experiments.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
A. Bolotin, S. Mauger, K. Malarme, S. D. Ehrlich, and A. Sorokin. Low-redundancy sequencing of the entire lactococcus lactis IL1403 genome. Antonie van Leeuwenhoek, 76:27-76, 1999.
|
| |
2
|
J. K. Bonfield, K. F. Smith, and R. Staden. A new DNA sequence assembly program. Nucleic Acids Research, 23:4992-4999, 1995.
|
| |
3
|
R. Drmanac, I. Labat, I. Brukner, and R. Crkvenjakov. Sequencing of megabase plus DNA by hybridization: theory of the method. Genomics, 4:114-128, 1989.
|
| |
4
|
D. Gordon, C. Abajian, and P. Green. Consed: a graphical tool for sequence finishing. Genome Research, 8(3):195-202, 1998.
|
| |
5
|
P. Green. Documentation for phrap. http://bozeman.mbt.washington.edu/phrap.docs/phrap, html, 1994.
|
 |
6
|
|
| |
7
|
X. Huang and A. Madan. CAP3: A DNA sequence assembly program. Genome Research, 9:868-877, 1999.
|
| |
8
|
R. Idury and M. Waterman. A new algorithm for DNA sequence assembly. Journal of Computational Biology, 2:291-306, 1995.
|
| |
9
|
Y. Lysov, V. Florentev, A. Khorlin, K. Khrapko, V. Shik, and A. Mirzabekov. DNA sequencing by hybridization with oligonucleotides. Doklady Academy Nauk USSR, 303:1508-1511, 1988.
|
| |
10
|
E. M. Myers. Toward simplifying and accurately formulating fragment assembly. Journal of Computational Biology, 2:275-290, 1995.
|
| |
11
|
E. W. Myers, G. G. Sutton, A. L. Delcher, I. M. Dew, D. P. Fasulo, M. J. Flanigan, S. A. Kravitz, C. M. Mobarry, K. H. Reinert, K. A. Remington, E. L. Anson, R. A. Bolanos, H. H. Chou, C. M. Jordan, A. L. Halpern, S. Lonardi, E. M. Beasley, R. C. Brandon, L. Chen, P. J. Dunn, Z. Lai, Y. Liang, D. R. Nusskern, M. Zhan, Q. Zhang, X. Zheng, G. M. Rubin, M. D. Adams, and J. C. Venter. A whole-genome assembly of Drosophila. Science, 287:2196-2204, 2000.
|
| |
12
|
J. Parkhill, M. Achtman, K. D. James, S. D. Bentley, C. Churcher, S. R. Klee, G. Morelli, D. Basham, D. Brown, T. Chillingworth, R. M. Davies, P. Davis, K. Devlin, T. Feltwell, N. Hamlin, S. Holroyd, K. Jagels, S. Leather, S. Moule, K. Mungall, M. A. Quail, M. A. Rajandream, K. M. Rutherford, M. Simmonds, J. Skelton, S. Whitehead, B. G. Spratt, and B. G. Barrell. Complete dna sequence of a serogroup a strain of neisseria meningitidis Z2491. Nature, 404:502-506, 2000.
|
| |
13
|
J. Parkhill, B. Wren, K. Mungall, J. M. Ketley, C. Churcher, D. Basham, T. Chillingworth, R. M. Davies, T. Feltwell, S. Holroyd, K. Jagels, A. Karlyshev, S. Moule, M. J. Pallen, C. W. Penn, Q. A. Quail, M. A. Rajandream, K. M. Rutherford, A. H. van Vliet, S. Whitehead, and B. G. Barrell. The genome sequence of the food-borne pathogen campylobacter jejuni reveals hypervariable sequences. Nature, 403:665-668, 2000.
|
| |
14
|
|
| |
15
|
P. Pevzner. Huple DNA sequencing: computer analysis. Journal of Biomolecular Structure and Dynamics, 7:63-73, 1989.
|
| |
16
|
P. A. Pevzner. Computational Molecular Biology: An Algorithmic Approach. The MIT Pres, 2000.
|
| |
17
|
J. Roach, C. Boysen, K. Wang, and L. Hood. Pairwise end sequencing: a unified approach to genomic mapping and sequencing. Genomics, 26:345-353, 1995.
|
| |
18
|
G. Sutton, O. White, M. Adams, and A. Kerlavage. TIGR assembler: A new tool for assembling large shotgun sequencing projects. Genome Science Technology, 1:9-19, 1995.
|
| |
19
|
H. Tettelin, D. Radune, S. Kasif, H. Khouri, and S. L. Salzberg. Optimized multiplex PCR: emciently closing a whole-genome shotgun sequencing project. Genomics, 62:500-507, 1999.
|
| |
20
|
M. Waterman. Introduction to Computational Biology. Chapman Hall, 1995.
|
| |
21
|
J. Weber and G. Myers. Whole genome shotgun sequencing. Genome Research, 7:401-409, 1997.
|
Peer to Peer - Readers of this Article have also read:
-
Data structures for quadtree approximation and compression
Communications of the ACM
28, 9
Hanan Samet
-
A hierarchical single-key-lock access control using the Chinese remainder theorem
Proceedings of the 1992 ACM/SIGAPP Symposium on Applied computing
Kim S. Lee
, Huizhu Lu
, D. D. Fisher
-
The GemStone object database management system
Communications of the ACM
34, 10
Paul Butterworth
, Allen Otis
, Jacob Stein
-
Putting innovation to work: adoption strategies for multimedia communication systems
Communications of the ACM
34, 12
Ellen Francik
, Susan Ehrlich Rudman
, Donna Cooper
, Stephen Levine
-
An intelligent component database for behavioral synthesis
Proceedings of the 27th ACM/IEEE Design Automation Conference on
Gwo-Dong Chen
, Daniel D. Gajski
|