Suffix trees for very large genomic sequences
Source
Conference on Information and Knowledge Management
Proceedings of the 18th ACM Conference on Information and Knowledge Management
Hong Kong, China
Poster session 1: DB track
Pages: 1417-1420
Year of Publication: 2009
ISBN: 978-1-60558-512-3
Authors
Marina Barsky, University of Victoria, Victoria, BC, Canada
Ulrike Stege, University of Victoria, Victoria, BC, Canada
Alex Thomo, University of Victoria, Victoria, BC, Canada
Chris Upton, University of Victoria, Victoria, BC, Canada
Sponsors
SIGIR: ACM Special Interest Group on Information Retrieval
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1645953.1646134

ABSTRACT

A suffix tree is a fundamental data structure for string searching algorithms. Unfortunately, current methods for constructing suffix trees do not scale to the large inputs found in real-life applications. All existing practical algorithms perform random access to the input string and therefore require the input to be small enough to fit in main memory.

We are the first to present an algorithm that can construct suffix trees for input sequences significantly larger than the available main memory. As a proof of concept, we show that our method builds the suffix tree for 12 GB of real DNA sequences in 26 hours on a single machine with 2 GB of RAM. This input is four times the size of the human genome, and the construction of suffix trees for inputs of this magnitude has not been reported before.


