ACM Home Page
Please provide us with feedback. Feedback
Compressed representations of sequences and full-text indexes
Full text PdfPdf (328 KB)
Source
ACM Transactions on Algorithms (TALG) archive
Volume 3 ,  Issue 2  (May 2007) table of contents
Article No. 20  
Year of Publication: 2007
ISSN:1549-6325
Authors
Paolo Ferragina  Università di Pisa, Pisa, Italy
Giovanni Manzini  Università del Piemonte Orientale, Alessandria, Italy
Veli Mäkinen  University of Helsinki, Finland
Gonzalo Navarro  University of Chile, santiago, chile
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 12,   Downloads (12 Months): 127,   Citation Count: 8
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1240233.1240243
What is a DOI?

ABSTRACT

Given a sequence S = s1s2sn of integers smaller than r = O(polylog(n)), we show how S can be represented using nH0(S) + o(n) bits, so that we can know any sq, as well as answer rank and select queries on S, in constant time. H0(S) is the zero-order empirical entropy of S and nH0(S) provides an information-theoretic lower bound to the bit storage of any sequence S via a fixed encoding of its symbols. This extends previous results on binary sequences, and improves previous results on general sequences where those queries are answered in O(log r) time. For larger r, we can still represent S in nH0(S) + o(n log r) bits and answer queries in O(log r/log log n) time.

Another contribution of this article is to show how to combine our compressed representation of integer sequences with a compression boosting technique to design compressed full-text indexes that scale well with the size of the input alphabet Σ. Specifically, we design a variant of the FM-index that indexes a string T[1, n] within nHk(T) + o(n) bits of storage, where Hk(T) is the kth-order empirical entropy of T. This space bound holds simultaneously for all k ≤ α log|Σ| n, constant 0 < α < 1, and |Σ| = O(polylog(n)). This index counts the occurrences of an arbitrary pattern P[1, p] as a substring of T in O(p) time; it locates each pattern occurrence in O(log1+ϵ n) time for any constant 0 < ϵ < 1; and reports a text substring of length ℓ in O(ℓ + log1+ϵ n) time.

Compared to all previous works, our index is the first that removes the alphabet-size dependance from all query times, in particular, counting time is linear in the pattern length. Still, our index uses essentially the same space of the kth-order entropy of the text T, which is the best space obtained in previous work. We can also handle larger alphabets of size |Σ| = O(nβ), for any 0 < β < 1, by paying o(n log|Σ|) extra space and multiplying all query times by O(log |Σ|/log log n).


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
Burrows, M., and Wheeler, D. 1994. A block sorting lossless data compression algorithm. Tech. Rep. 124, Digital Equipment Corporation.
 
2
Chan, W.-L., Hon, W.-K., and Lam, T.-W. 2004. Compressed index for a dynamic collection of texts. In Proceedings of the Symposium on Combinatorial Pattern Matching (CPM). Lecture Notes in Computer Science, vol. 3109. Springer Verlag, Berlin. 445--456.
 
3
 
4
 
5
6
 
7
 
8
9
 
10
Ferragina, P., Manzini, G., Mäkinen, V., and Navarro, G. 2004. An alphabet-friendly FM-index. In Proceedings of the 11th International Symposium on String Processing and Information Retrieval (SPIRE). Lecture Notes in Computer Science, vol. 3246. Springer Verlag, Berlin. 150--160.
11
 
12
Geary, R., Rahman, N., Raman, R., and V.Raman. 2004. A simple optimal representation for balanced parentheses. In Proceedings of the Symposium on Combinatorial Pattern Matching (CPM). Lecture Notes in Computer Science, vol. 3109. Springer Verlag, Berlin. 159--172.
13
 
14
 
15
González, R., and Navarro, G. 2006. Statistical encoding of succinct data structures. In Proceedings of the 17th Annual Symposium on Combinatorial Pattern Matching (CPM). Lecture Notes in Computer Science, vol. 4009. Springer Verlag, Berlin. 295--306.
 
16
Grabowski, S., Navarro, G., Przywarski, R., Salinger, A., and Mäkinen, V. 2006. A simple alphabet-independent FM-index. Int. J. Found. Comput. Sci. 17, 6, 1365--1384.
 
17
 
18
 
19
Healy, J., Thomas, E., Schwartz, J., and Wigler, M. 2003. Annotating large genomes with exact word matches. Genome Res. 13, 2306--2315.
 
20
 
21
Hon, W.-K., Lam, T.-W., Sung, W. K., Tse, W. L., Wong, C.-K., and Yiu, S. M. 2004. Practical aspects of compressed suffix arrays and FM-index in searching DNA sequences. In Proceedings of the Workshop on Algorithm Engineering and Experiments (ALENEX). SIAM, 31--38.
 
22
Huynh, N., Hon, W., Lam, T., and Sung, W. 2004. Approximate string matching using compressed suffix arrays. In Proceedings of the Symposium on Combinatorial Pattern Matching (CPM). Lecture Notes in Computer Science, vol. 3109. Springer Verlag, Berlin. 434--444.
 
23
Jacobson, G. 1989. Space-Efficient static trees and graphs. In Proceedings of the 30th IEEE Symposium on Foundations of Computer Science (FOCS). 549--554.
 
24
Mäkinen, V., and Navarro, G. 2004a. Compressed compact suffix arrays. In Proceedings of the Symposium on Combinatorial Pattern Matching (CPM). Lecture Notes in Computer Science, vol. 3109. Springer Verlag, Berlin. 420--433.
 
25
Mäkinen, V., and Navarro, G. 2004b. New search algorithms and time/space tradeoffs for succinct suffix arrays. Tech. Rep. C-2004-20, Department of Computer Science, University of Helsinki.
 
26
 
27
Mäkinen, V., Navarro, G., and Sadakane, K. 2004. Advantages of backward searching---Efficient secondary memory and distributed implementation of compressed suffix arrays. In Proceedings of the International Symposium on Algorithms and Computation (ISAAC). Lecture Notes in Computer Science, vol. 3341. Springer Verlag, Berlin. 681--692.
 
28
29
 
30
 
31
 
32
Munro, J., Raman, R., Raman, V., and Rao, S. 2003. Succinct representations of permutations. In Proceedings of the International Colloquium on Automata, Languages and Programming (ICALP). Lecture Notes in Computer Science, vol. 2719. Springer Verlag, Berlin. 345--356.
 
33
 
34
 
35
 
36
 
37
38
 
39
Sadakane, K., and Shibuya, T. 2001. Indexing huge genome sequences for solving various problems. Genome Inf. 12, 175--183.


Collaborative Colleagues:
Paolo Ferragina: colleagues
Giovanni Manzini: colleagues
Veli Mäkinen: colleagues
Gonzalo Navarro: colleagues