Compressed indexes for dynamic text collections
Source
ACM Transactions on Algorithms (TALG)
Volume 3, Issue 2 (May 2007)
Article No.: 21
Year of Publication: 2007
ISSN: 1549-6325
Authors
Ho-Leung Chan, University of Hong Kong
Wing-Kai Hon, National Tsing Hua University
Tak-Wah Lam, University of Hong Kong
Kunihiko Sadakane, Kyushu University
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1240233.1240244


ABSTRACT

Let T be a string with n characters over an alphabet of constant size. A recent breakthrough on compressed indexing allows us to build an index for T in optimal space (i.e., O(n) bits), while supporting very efficient pattern matching [Ferragina and Manzini 2000; Grossi and Vitter 2000]. Yet the compressed nature of such indexes also makes them difficult to update dynamically.

This article extends the work on optimal-space indexing to a dynamic collection of texts. Our first result is a compressed solution to the library management problem, where we show an index of O(n) bits for a text collection L of total length n, which can be updated in O(|T| log n) time when a text T is inserted or deleted from L; also, the index supports searching the occurrences of any pattern P in all texts in L in O(|P| log n + occ log^2 n) time, where occ is the number of occurrences.

Our second result is a compressed solution to the dictionary matching problem, where we show an index of O(d) bits for a pattern collection D of total length d, which can be updated in O(|P| log^2 d) time when a pattern P is inserted or deleted from D; also, the index supports searching the occurrences of all patterns of D in any text T in O((|T| + occ) log^2 d) time. When compared with the O(d log d)-bit suffix-tree-based solution of Amir et al. [1995], the compact solution increases the query time by roughly a factor of log d only.

The solution to the dictionary matching problem is based on a new compressed representation of a suffix tree. Precisely, we give an O(n)-bit representation of a suffix tree for a dynamic collection of texts whose total length is n, which supports insertion and deletion of a text T in O(|T| log^2 n) time, as well as all suffix tree traversal operations, including forward and backward suffix links. This work can be regarded as a generalization of the compressed representation of static texts. In the study of the aforementioned result, we also derive the first O(n)-bit representation for maintaining n pairs of balanced parentheses in O(log n/log log n) time per operation, matching the time complexity of the previous O(n log n)-bit solution.
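
To make the two problem statements in the abstract concrete, the following is a minimal, uncompressed reference sketch of the operations each index must support. It is not the article's data structure and achieves none of the stated space or time bounds; the class and method names (NaiveLibrary, NaiveDictionary, insert, delete, search, match) are hypothetical and chosen only for illustration.

```python
class NaiveLibrary:
    """Library management problem: a dynamic collection L of texts.

    Assumption: texts are addressed by an integer id handed out on insert.
    """

    def __init__(self):
        self.texts = {}   # text id -> text
        self.next_id = 0

    def insert(self, text):
        """Add a text T to L and return its id."""
        tid = self.next_id
        self.texts[tid] = text
        self.next_id += 1
        return tid

    def delete(self, tid):
        """Remove the text with the given id from L."""
        del self.texts[tid]

    def search(self, pattern):
        """Return sorted (text id, offset) pairs, one per occurrence of P in L."""
        hits = []
        for tid, text in self.texts.items():
            start = text.find(pattern)
            while start != -1:
                hits.append((tid, start))
                start = text.find(pattern, start + 1)  # allow overlaps
        return sorted(hits)


class NaiveDictionary:
    """Dictionary matching problem: a dynamic collection D of patterns."""

    def __init__(self):
        self.patterns = set()

    def insert(self, pattern):
        self.patterns.add(pattern)

    def delete(self, pattern):
        self.patterns.discard(pattern)

    def match(self, text):
        """Return sorted (offset, pattern) pairs for all patterns of D in T."""
        hits = []
        for p in self.patterns:
            start = text.find(p)
            while start != -1:
                hits.append((start, p))
                start = text.find(p, start + 1)  # allow overlaps
        return sorted(hits)
```

The naive versions scan every stored string on each query, so their cost grows with the total collection size; the results in the abstract replace these scans with an O(n)-bit (respectively O(d)-bit) index while keeping the same insert, delete, and query semantics.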


REFERENCES


6. Burrows, M., and Wheeler, D. J. 1994. A block-sorting lossless data compression algorithm. Tech. Rep. 124, Digital Equipment Corporation, Palo Alto, California.
7. Elias, P. 1975. Universal codeword sets and representation of the integers. IEEE Trans. Inf. Theory 21, 2, 194–203.
10. Ferragina, P., Manzini, G., Mäkinen, V., and Navarro, G. 2004. An alphabet-friendly FM-index. In Proceedings of the International Symposium on String Processing and Information Retrieval. 150–160.
16. Hon, W. K., Lam, T. W., Sung, W. K., Tse, W. L., Wong, C. K., and Yiu, S. M. 2004. Practical aspects of compressed suffix arrays and FM-index in searching DNA sequences. In Proceedings of the Workshop on Algorithm Engineering and Experiments. 31–38.
17. Jacobson, G. 1989. Space-efficient static trees and graphs. In Proceedings of the Symposium on Foundations of Computer Science. 549–554.
20. Mäkinen, V., and Navarro, G. 2004. Run-length FM-index. In Proceedings of the DIMACS Workshop: The Burrows-Wheeler Transform: Ten Years Later. 17–19.
23. Mewes, H. W., and Heumann, K. 1995. Genome analysis: Pattern search in biological macromolecules. In Proceedings of the Symposium on Combinatorial Pattern Matching. 261–285.
29. Sadakane, K. 2007. Compressed suffix trees with full functionality. Theor. Comput. Syst. To appear.
31. Weiner, P. 1973. Linear pattern matching algorithms. In Proceedings of the Symposium on Switching and Automata Theory. 1–11.

