ACM Home Page
Please provide us with feedback. Feedback
Indexing compressed text
Full text PdfPdf (473 KB)
Source Journal of the ACM (JACM) archive
Volume 52 ,  Issue 4  (July 2005) table of contents
Pages: 552 - 581  
Year of Publication: 2005
ISSN:0004-5411
Authors
Paolo Ferragina  Università di Pisa, Pisa, Italy
Giovanni Manzini  Università del Piemonte Orientale, Alessandria, Italy
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 24,   Downloads (12 Months): 197,   Citation Count: 25
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1082036.1082039
What is a DOI?

ABSTRACT

We design two compressed data structures for the full-text indexing problem that support efficient substring searches using roughly the space required for storing the text in compressed form.Our first compressed data structure retrieves the occ occurrences of a pattern P[1,p] within a text T[1,n] in O(p + occ log1+ε n) time for any chosen ε, 0<ε<1. This data structure uses at most 5nHk(T) + o(n) bits of storage, where Hk(T) is the kth order empirical entropy of T. The space usage is Θ(n) bits in the worst case and o(n) bits for compressible texts. This data structure exploits the relationship between suffix arrays and the Burrows--Wheeler Transform, and can be regarded as a compressed suffix array.Our second compressed data structure achieves O(p+occ) query time using O(nHk(T)logε n) + o(n) bits of storage for any chosen ε, 0<ε<1. Therefore, it provides optimal output-sensitive query time using o(nlog n) bits in the worst case. This second data structure builds upon the first one and exploits the interplay between two compressors: the Burrows--Wheeler Transform and the LZ78 algorithm.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
 
2
3
 
4
 
5
Burrows, M., and Wheeler, D. 1994. A block sorting lossless data compression algorithm. Tech. Rep. 124, Digital Equipment Corporation.
 
6
Chan, H., Hon, W., and Lam, T. 2004. Compressed index for a dynamic collection of texts. In Proceedings of the 15th Symposium on Combinatorial Pattern Matching. Lecture Notes in Computer Science, vol. 3109. Springer-Verlag, New York, 445--456.
 
7
 
8
Colussi, L., and De Col, A. 1996. A time and space efficient data structure for string searching on large texts. Info. Proc. Lett. 58, 5, 217--222.
 
9
Farach, M., and Thorup, M. 1998. String matching in Lempel--Ziv compressed strings. Algorithmica 20, 4, 388--404.
 
10
Fenwick, P. 1996. The Burrows-Wheeler transform for block sorting text compression: principles and improvements. The Computer J. 39, 9, 731--740.
11
12
 
13
 
14
 
15
Ferragina, P., Manzini, G., Mäkinen, V., and Navarro, G. 2004. An alphabet-friendly FM-index. In Proceedings of the 11th International Symposium on String Processing and Information Retrieval. Lecture Notes in Computer Science, vol. 3246. Springer-Verlag, New York, 150--160.
 
16
 
17
Grabowski, S., Mäkinen, V., and Navarro, G. 2004. First Huffman, then Burrows-Wheeler: An alphabet-independent FM-index. In Proceedings of the 11th International Symposium on String Processing and Information Retrieval. Lecture Notes in Computer Science, vol. 3246. Springer-Verlag, New York, 210--211.
 
18
 
19
20
 
21
Healy, J., Thomas, E. E., Schwartz, J. T., and Wigler, M. 2003. Annotating large genomes with exact word matches. Genome Research 13, 2306--2315.
 
22
 
23
Hon, W., Lam, T., Sung, W., Tse, W., Wong, C., and Yiu, S. 2004b. Practical aspects of compressed suffix arrays and FM-index in searching DNA sequences. In Proceedings of the 6th Workshop on Algorithm Engineering and Experiments. SIAM Press, Philadelphia, Pa., 31--38.
 
24
Huynh, N., Hon, W., Lam, T., and Sung, W. 2004. Approximate string matching using compressed suffix arrays. In Proceedings of the 15th Symposium on Combinatorial Pattern Matching. Lecture Notes in Computer Science, vol. 3109. Springer-Verlag, New York, 434--444.
 
25
Kärkkäinen, J., and Sutinen, E. 1998. Lempel-Ziv index for q-grams. Algorithmica 21, 137--154.
 
26
Kärkkäinen, J., and Ukkonen, E. 1996. Lempel-Ziv parsing and sublinear-size index structures for string matching. In Proceedings of the 3rd South American Workshop on String Processing, N. Ziviani, R. Baeza-Yates, and K. Guimarães, Eds. Carleton University Press, 141--155.
 
27
 
28
 
29
Mäkinen, V., and Navarro, G. 2004. Compressed compact suffix arrays. In Proceedings of the 15th Symposium on Combinatorial Pattern Matching. Lecture Notes in Computer Science, vol. 3109. Springer-Verlag, New York, 420--433.
 
30
Mäkinen, V., Navarro, G., and Sadakane, K. 2004. Advantages of backward searching---efficient secondary memory and distributed implementation of compressed suffix arrays. In Proceedings of the 15th International Symposium on Algorithms and Computation. Lecture Notes in Computer Science, Springer-Verlag, New York.
 
31
32
33
 
34
 
35
 
36
 
37
 
38
 
39
 
40
 
41
 
42
Sadakane, K., and Shibuya, T. 2001. Indexing huge genome sequences for solving various problems. Genome Informatics 12, 175--183.
 
43
 
44
Ziv, J., and Lempel, A. 1978. Compression of individual sequences via variable length coding. IEEE Trans. Info. Theory 24, 530--536.

CITED BY  25

Collaborative Colleagues:
Paolo Ferragina: colleagues
Giovanni Manzini: colleagues