| When indexing equals compression: Experiments with compressing suffix arrays and applications |
| Full text |
Pdf
(351 KB)
|
| Source
|
ACM Transactions on Algorithms (TALG)
archive
Volume 2 , Issue 4 (October 2006)
table of contents
Pages: 611 - 639
Year of Publication: 2006
ISSN:1549-6325
|
|
Authors
|
|
Luca Foschini
|
Scuola Superiore Sant'Anna, Pisa, Italy
|
|
Roberto Grossi
|
Università di Pisa, Pisa, Italy
|
|
Ankur Gupta
|
Duke University, Durham, North Carolina, NC
|
|
Jeffrey Scott Vitter
|
Purdue University, West Lafayette, Indiana, IN
|
|
| Publisher |
|
| Bibliometrics |
Downloads (6 Weeks): 13, Downloads (12 Months): 92, Citation Count: 4
|
|
|
ABSTRACT
We report on a new experimental analysis of high-order entropy-compressed suffix arrays, which retains the theoretical performance of previous work and represents an improvement in practice. Our experiments indicate that the resulting text index offers state-of-the-art compression. In particular, we require roughly 20% of the original text size---without requiring a separate instance of the text. We can additionally use a simple notion to encode and decode block-sorting transforms (such as the Burrows--Wheeler transform), achieving a compression ratio comparable to that of bzip2. We also provide a compressed representation of suffix trees (and their associated text) in a total space that is comparable to that of the text alone compressed with gzip.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
| |
2
|
|
| |
3
|
|
 |
4
|
|
| |
5
|
|
| |
6
|
The Canterbury Corpus. 2001. http://corpus.canterbury.ac.nz.
|
| |
7
|
Chazelle, B., and Guibas, L. J. 1986. Fractional cascading: I. A data structuring technique. Algorithmica 1, 2, 133--162.
|
| |
8
|
|
| |
9
|
Fenwick, P. 1996. Punctured elias codes for variable-length coding of the integers. The University of Auckland, NZ. TR 137. ISSN 1173--3500.
|
| |
10
|
|
 |
11
|
|
| |
12
|
|
 |
13
|
|
| |
14
|
|
| |
15
|
|
| |
16
|
|
| |
17
|
|
| |
18
|
|
| |
19
|
|
| |
20
|
Hon, W., Lam, T., Tse, W., Wong, C., and Yiu, S. 2004. Practical aspects of compressed suffix arrays and fm-index in searching dna sequences. In Proceedings of the 6th Workshop on Algorithm Engineering and Experiments (ALENEX).
|
| |
21
|
|
| |
22
|
|
| |
23
|
Jacobson, G. 1989. Space-efficient static trees and graphs. In Proceedings of the 30th Annual IEEE Symposium on Foundations of Computer Science. IEEE Computer Society Press, Los Alamitos, CA, pp. 549--554.
|
| |
24
|
|
| |
25
|
|
| |
26
|
|
| |
27
|
|
 |
28
|
|
 |
29
|
|
| |
30
|
|
| |
31
|
|
| |
32
|
Navarro, G., and Mäkinen, V. 2006. Compressed full-text indexes. Tech. Rep. TR/DCC-2006-6, University of Chile.
|
| |
33
|
Nelson, M. 2003. Run length encoding/RLE. http://www.datacompression.info/RLE.shtml.
|
| |
34
|
Oki, M. 2003. http://www.infor.kanazawa-it.ac.jp/~ishii/lhaunix/.
|
| |
35
|
|
| |
36
|
|
| |
37
|
|
| |
38
|
Rissanen, J., and Langdon, G. G. 1979. Arithmetic coding. IBM J. Res. Devel. 23, 2 (Mar.), 149--162.
|
| |
39
|
|
| |
40
|
|
| |
41
|
Schindler, M. 1999. http://www.compressconsult.com/rangecoder.
|
| |
42
|
Smith, J. O., III. 2003. http://ccrma-www.stanford.edu/~jos/mdft/Autocorrelation.html.
|
| |
43
|
TREC. Tipster 3. 2000. http://trec.nist.gov/data/docs_eng.html.
|
| |
44
|
|
| |
45
|
|
|