ACM Home Page
Please provide us with feedback. Feedback
Serial and parallel methods for i/o efficient suffix tree construction
Full text PdfPdf (562 KB)
Source
International Conference on Management of Data archive
Proceedings of the 35th SIGMOD international conference on Management of data table of contents
Providence, Rhode Island, USA
SESSION: Research session 21: indexing table of contents
Pages 827-840  
Year of Publication: 2009
ISBN:978-1-60558-551-2
Authors
Amol Ghoting  IBM T. J. Watson Research Center, Yorktown Heights, NY, USA
Konstantin Makarychev  IBM T. J. Watson Research Center, Yorktown Heights, NY, USA
Sponsors
ACM: Association for Computing Machinery
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 56,   Downloads (12 Months): 199,   Citation Count: 0
Additional Information:

abstract   references   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1559845.1559931
What is a DOI?

ABSTRACT

Over the past three decades, the suffix tree has served as a fundamental data structure in string processing. However, its widespread applicability has been hindered due to the fact that suffix tree construction does not scale well with the size of the input string. With advances in data collection and storage technologies, large strings have become ubiquitous, especially across emerging applications involving text, time series, and biological sequence data. To benefit from these advances, it is imperative that we realize a scalable suffix tree construction algorithm.

To deal with the aforementioned challenge, the past few years have seen the emergence of several disk-based suffix tree construction algorithms. However, construction times continue to be daunting -- for e.g., indexing the entire Human genome still takes over 30 hours on a system with 2 gigabytes of physical memory. In this paper, first, we empirically demonstrate and argue that all existing suffix tree construction algorithms have a severe limitation -- to glean reasonable disk I/O efficiency, the input string being indexed must fit in main memory. This limitation is attributed to the poor locality properties of existing suffix tree construction algorithms and inhibits both sequential and parallel scalability. To deal with this limitation, second, we show that through careful algorithm design, one of the simplest suffix tree construction algorithms can be re-architected to build a suffix tree in a tiled fashion, allowing the implementation to maintain a constant working set size and fixed memory footprint when indexing strings of any size. Third, we show how improved locality of reference coupled with effective collective communication facilitates an efficient parallelization on massively parallel systems like the IBM Blue Gene/L. Finally, we empirically show that the proposed approach affords improvements of several orders of magnitude when indexing large strings. Furthermore, we demonstrate that the proposed parallelization is scalable and allows one to index the entire Human genome on a 1024 processor system in under 15 minutes.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
 
2
S. Bedathur and J. Haritsa. Search optimized suffix tree storage for biological applications. In Proceedings of the IEEE International Conference on High Performance Computing 2005.
 
3
N. Bray, I. Dubchak, and L. Pachter. AVID:A global alignment program. Genome Research 13(1), 2003.
 
4
 
5
M. Burrows and D. Wheeler. A block sorting lossless data compression algorithm. Technical report, Digital Equipment Corporation, 1994.
 
6
A. Carvalho, A. Freitas, A. Oliveira, and M. Sagot. Efficient extraction of structured motifs using box links. In Proceedings of the 11th Conference on String Processing and Information Retrieval 2004.
 
7
W. Chang and E. Lawler. Sublinear approximate string matching and biological applications. Algorithmica 12(4/5), 1994.
 
8
 
9
R. Clifford and M. Sergot. Distributed and paged suffix trees for large genetic databases. In Proceedings of the Annual Symposium on Combinatorial Pattern Matching 2003.
 
10
A. Delcher, S. Kasif, R. Fleischmann, J. Peterson, O. White, and S. Salzberg. Alignment of whole genomes. Nucleic Acids Res. 27(11), 1999.
 
11
A. Delcher, A. Phillippy, J. Carlton, and S. Salzberg. Fast algorithms for large-scale genome alignment and comparison. Nucleic Acids Res. 30(1), 2002.
 
12
 
13
 
14
 
15
 
16
R. Japp. The top-compressed suffix tree:A disk resident index for large sequences. In Proceedings of the Bioinformatics Workshop at the 21st Annual British National Conference on Databases 2004.
 
17
S. Kurtz, J. Choudhuri, E. Ohlebusch, C. Schleiermacher, J. Stoye, and R. Giegerich. Reputer: The manifold applications of repeat analysis on a genome scale. Nucleic Acids Res. 29, 2001.
 
18
S. Kurtz, A. Phillippy, A. Delcher, M. Smoot, M. Shumway, C. Antonescu, and S. Salzberg. Versatile and open software for comparing large genomes. Genome Bio. 5(R12), 2004.
19
 
20
 
21
NCBI. Public collections of dna and rna sequence reach 100 gigabases. http://www.nlm.nih.gov/news/press_releases/dna_rna_100_gig.html.
22
 
23
B. Phoophakdee and M. Zaki. Trellis+: An effective approach for indexing massive sequences. In Proceedings of the Pacific Symposium on Biocomputing 2008.
 
24
K. Schurmann and J. Stoye. suffix tree construction and storage with limited main memory. Technical report, Universitat Bielefeld, 2003.
 
25
 
26
 
27
 
28
29

Collaborative Colleagues:
Amol Ghoting: colleagues
Konstantin Makarychev: colleagues