ACM Home Page
Please provide us with feedback. Feedback
Dynamic entropy-compressed sequences and full-text indexes
Full text PdfPdf (321 KB)
Source
ACM Transactions on Algorithms (TALG) archive
Volume 4 ,  Issue 3  (June 2008) table of contents
Article No. 32  
Year of Publication: 2008
ISSN:1549-6325
Authors
Veli Mäkinen  University of Helsinki, Helsinki, Finland
Gonzalo Navarro  University of Chile, Santiago, Chile
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 13,   Downloads (12 Months): 148,   Citation Count: 1
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1367064.1367072
What is a DOI?

ABSTRACT

We give new solutions to the Searchable Partial Sums with Indels problem. Given a sequence of n k-bit numbers, we present a structure taking kn + o(kn) bits of space, able of performing operations sum, search, insert, and delete, all in O(log n) worst-case time, for any k = O(log n). This extends previous results by Hon et al. [2003c] achieving the same space and O(log n/log log n) time complexities for the queries, yet offering complexities for insert and delete that are amortized and worse than ours, and supported only for k = O(1). Our result matches an existing lower bound for large values of k.

We also give new solutions to the Dynamic Sequence problem. Given a sequence of n symbols in the range [1,σ] with binary zero-order entropy H0, we present a dynamic data structure that requires nH0 + o(n log σ) bits of space, which is able of performing rank and select, as well as inserting and deleting symbols at arbitrary positions, in O(log n log σ) time. Our result is the first entropy-bound dynamic data structure for rank and select over general sequences.

In the case σ = 2, where both previous problems coincide, we improve the dynamic solution of Hon et al. [2003c] in that we compress the sequence. The only previous result with entropy-bound space for dynamic binary sequences is by Blandford and Blelloch [2004], which has the same complexities as our structure, but does not achieve constant 1 multiplying the entropy term in the space complexity.

Finally, we present a new dynamic compressed full-text self-index, for a collection of texts over an alphabet of size σ, of overall length n and hth order empirical entropy Hh. The index requires nHh + o(n log σ) bits of space, for any h ≤ α logsigma n and constant 0 < α < 1. It can count the number of occurrences of a pattern of length m in time O(m log n log σ). Each such occurrence can be reported in O(log2nlog log n) time, and displaying a context of length ℓ from a text takes time O(log n(ℓ log σ + log n log log n)). Insertion/deletion of a text to/from the collection takes O(log n log σ) time per symbol. This significantly improves the space of a previous result by Chan et al. [2004] in exchange for a slight time complexity penalty. We achieve at the same time the first dynamic index requiring essentially nHh bits of space, and the first construction of a compressed full-text self-index within that working space. Previous results achieve at best O(nHh space with constants larger than 1 [Ferragina and Manzini 2000; Arroyuelo and Navarro 2005] and higher time complexities.

An important result we prove in this paper is that the wavelet tree of the Burrows-Wheeler transform of a text, if compressed with a technique that achieves zero-order compression locally (e.g., Raman et al. [2002]), automatically achieves hth order entropy space for any h. This unforeseen relation is essential for the results of the previous paragraph, but it also derives into significant simplifications on many existing static compressed full-text self-indexes that build on wavelet trees.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
Apostolico, A. 1985. The myriad virtues of subword trees. In Combinatorial Algorithms on Words. NATO ISI Series. Springer-Verlag, New York, 85--96.
 
2
Arroyuelo, D., and Navarro, G. 2005. Space-efficient construction of LZ-index. In Proceedings of ISAAC'05. Lecture Notes in Computer Science, vol. 3827. Springer-Verlag, New York, 1143--1152.
 
3
 
4
 
5
Burrows, M., and Wheeler, D. 1994. A block sorting lossless data compression algorithm. Tech. Rep. 124, Digital Equipment Corporation.
 
6
Chan, H.-L., Hon, W.-K., and Lam, T.-W. 2004. Compressed index for a dynamic collection of texts. In Proceedings of the 15th Annual Symposium on Combinatorial Pattern Matching (CPM). Lecture Notes in Computer Science, vol. 3109. Springer-Verlag, New York, 445--456.
7
 
8
Crauser, A., and Ferragina, P. 2002. A theoretical and experimental study on the construction of suffix arrays in external memory. Algorithmica 32, 1, 1--35.
 
9
 
10
Elias, P. 1975. Universal codeword sets and representation of the integers. IEEE Trans. Inf. Theory 21, 2, 194--203.
 
11
Ferragina, P., Giancarlo, R., and Manzini, G. 2006. The myriad virtues of wavelet trees. In Proceedings of the 33rd International Colloquium on Automata, Languages and Programming (ICALP). 560--571.
 
12
 
13
14
15
 
16
17
 
18
 
19
González, R., and Navarro, G. 2006. Statistical encoding of succinct data structures. In Proceedings of the 17th Annual Symposium on Combinatorial Pattern Matching (CPM 2006). Lecture Notes in Computer Science, vol. 4009. Springer-Verlag, New York, 295--306.
 
20
González, R., and Navarro, G. 2008. Improved dynamic rank-select entropy-bound structures. In Proceedings of the 8th Latin American Symposium on Theoretical Informatics (LATIN). Lecture Notes in Computer Science, vol. 4957, Springer-Verlag, 374--386.
 
21
 
22
 
23
 
24
Gupta, A., Hon, W.-K., Shah, R., and Vitter, J. 2006a. Compressed data structures: dictionaries and data-aware measures. In Proceedings of the 5th International Workshop on Experimental Algorithms (WEA). 158--169.
 
25
Gupta, A., Hon, W.-K., Shah, R., and Vitter, J. 2006b. Dynamic rank/select dictionaries with applications to XML indexing. Tech. Rep. CSD TR #06-014, Purdue Univ. July.
 
26
Hon, W.-K., Lam, T.-W., Sadakane, K., and Sung, W.-K. 2003a. Constructing compressed suffix arrays with large alphabets. In Proceedings of the 14th Annual International Symposium on Algorithms and Computation (ISAAC). 240--249.
 
27
 
28
Hon, W.-K., Sadakane, K., and Sung, W.-K. 2003c. Succinct data structures for searchable partial sums. In Proceedings of the International Symposium on Algorithms and Computation (ISAAC'03). Lecture Notes in Computer Science, vol. 2906. Springer-Verlag, New York, 505--516.
 
29
Kärkkäinen, J. 2004. Fast BWT in small space by blockwise suffix sorting. In Proceedings of DIMACS Working Group on the Burrows-Wheeler Transform: Ten Years Later.
 
30
Lee, S., and Park, K. 2007. Static and dynamic rank-select dictionaries for run-length encoded texts. In Proceedings of the 18th Annual Symposium on Combinatorial Pattern Matching (CPM). Lecture Notes in Computer Science, vol. 4580. Springer-Verlag, New York.
 
31
 
32
 
33
34
35
 
36
 
37
 
38
 
39
40


Collaborative Colleagues:
Veli Mäkinen: colleagues
Gonzalo Navarro: colleagues