ACM Home Page
Please provide us with feedback. Feedback
When indexing equals compression: Experiments with compressing suffix arrays and applications
Full text PdfPdf (351 KB)
Source ACM Transactions on Algorithms (TALG) archive
Volume 2 ,  Issue 4  (October 2006) table of contents
Pages: 611 - 639  
Year of Publication: 2006
ISSN:1549-6325
Authors
Luca Foschini  Scuola Superiore Sant'Anna, Pisa, Italy
Roberto Grossi  Università di Pisa, Pisa, Italy
Ankur Gupta  Duke University, Durham, North Carolina, NC
Jeffrey Scott Vitter  Purdue University, West Lafayette, Indiana, IN
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 13,   Downloads (12 Months): 92,   Citation Count: 4
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1198513.1198521
What is a DOI?

ABSTRACT

We report on a new experimental analysis of high-order entropy-compressed suffix arrays, which retains the theoretical performance of previous work and represents an improvement in practice. Our experiments indicate that the resulting text index offers state-of-the-art compression. In particular, we require roughly 20% of the original text size---without requiring a separate instance of the text. We can additionally use a simple notion to encode and decode block-sorting transforms (such as the Burrows--Wheeler transform), achieving a compression ratio comparable to that of bzip2. We also provide a compressed representation of suffix trees (and their associated text) in a total space that is comparable to that of the text alone compressed with gzip.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
 
2
 
3
4
 
5
 
6
The Canterbury Corpus. 2001. http://corpus.canterbury.ac.nz.
 
7
Chazelle, B., and Guibas, L. J. 1986. Fractional cascading: I. A data structuring technique. Algorithmica 1, 2, 133--162.
 
8
 
9
Fenwick, P. 1996. Punctured elias codes for variable-length coding of the integers. The University of Auckland, NZ. TR 137. ISSN 1173--3500.
 
10
11
 
12
13
 
14
 
15
 
16
 
17
 
18
 
19
 
20
Hon, W., Lam, T., Tse, W., Wong, C., and Yiu, S. 2004. Practical aspects of compressed suffix arrays and fm-index in searching dna sequences. In Proceedings of the 6th Workshop on Algorithm Engineering and Experiments (ALENEX).
 
21
 
22
 
23
Jacobson, G. 1989. Space-efficient static trees and graphs. In Proceedings of the 30th Annual IEEE Symposium on Foundations of Computer Science. IEEE Computer Society Press, Los Alamitos, CA, pp. 549--554.
 
24
 
25
 
26
 
27
28
29
 
30
 
31
 
32
Navarro, G., and Mäkinen, V. 2006. Compressed full-text indexes. Tech. Rep. TR/DCC-2006-6, University of Chile.
 
33
Nelson, M. 2003. Run length encoding/RLE. http://www.datacompression.info/RLE.shtml.
 
34
Oki, M. 2003. http://www.infor.kanazawa-it.ac.jp/~ishii/lhaunix/.
 
35
 
36
 
37
 
38
Rissanen, J., and Langdon, G. G. 1979. Arithmetic coding. IBM J. Res. Devel. 23, 2 (Mar.), 149--162.
 
39
 
40
 
41
Schindler, M. 1999. http://www.compressconsult.com/rangecoder.
 
42
Smith, J. O., III. 2003. http://ccrma-www.stanford.edu/~jos/mdft/Autocorrelation.html.
 
43
TREC. Tipster 3. 2000. http://trec.nist.gov/data/docs_eng.html.
 
44
 
45


Collaborative Colleagues:
Luca Foschini: colleagues
Roberto Grossi: colleagues
Ankur Gupta: colleagues
Jeffrey Scott Vitter: colleagues