ACM Home Page
Please provide us with feedback. Feedback
Digital Library logoTake a look at the new version of this page: [ beta version ]. Tell us what you think.
Implementing the LZ-index: Theory versus practice
Full text PdfPdf (623 KB)
Source Journal of Experimental Algorithmics (JEA) archive
Volume 13 ,  (February 2009) table of contents
SECTION: 1 - Regular Papers table of contents
Article No.: 2  
Year of Publication: 2009
ISSN:1084-6654
Author
Gonzalo Navarro  University of Chile, Santiago, Chile
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 11,   Downloads (12 Months): 161,   Citation Count: 0
Additional Information:

abstract   references   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1412228.1412230
What is a DOI?

ABSTRACT

The LZ-index is a theoretical proposal of a lightweight data structure for text indexing, based on the Ziv-Lempel trie. If a text of u characters over an alphabet of size σ is compressible to n symbols using the LZ78 algorithm, then the LZ-index takes 4n log2 n (1+o(1)) bits of space (that is, 4 times the entropy of the text) and reports the R occurrences of a pattern of length m in worst case time O(m3 log σ + (m + R)log n). In this paper we face the challenge of obtaining a practical implementation of the LZ-index, which is not at all straightforward from the theoretical proposal. We end up with a prototype that takes the promised space and has average search time Om log u + &sqrt;uR). This prototype is shown to be faster than other competing approaches when we take into account the time to report the positions or text contexts of the occurrences found. We show in detail the process of implementing the index, which involves interesting lessons of theory versus practice.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
Abouelhoda, M., Ohlebusch, E., and Kurtz, S. 2002. Optimal exact string matching based on suffix arrays. In Proceedings of the 9th International Symposium String Processing and Information Retrieval (SPIRE). LNCS 2476. 31--43.
 
2
Agarwal, P. and Erickson, J. 1999. Geometric range searching and its relatives. Contemporary Mathematics 23: Advances in Discrete and Computational Geometry, 1--56.
 
3
Apostolico, A. 1985. The myriad virtues of subword trees. In Combinatorial Algorithms on Words. NATO ISI Series. Springer-Verlag, 85--96.
 
4
Arroyuelo, D. and Navarro, G. 2005. Space-efficient construction of LZ-index. In Proceedings of the 16th Annual International Symposium on Algorithms and Computation (ISAAC). LNCS 3827. 1143--1152.
 
5
Arroyuelo, D. and Navarro, G. 2007a. A lempel-ziv text index on secondary storage. In Proceedings of the 18th Annual Symposium on Combinatorial Pattern Matching (CPM). LNCS 4580. 83--94.
 
6
Arroyuelo, D. and Navarro, G. 2007b. Smaller and faster lempel-ziv indices. In Proceedings of the 18th International Workshop on Combinatorial Algorithms (IWOCA). College Publications, UK, 11--20.
 
7
Arroyuelo, D., Navarro, G., and Sadakane, K. 2006. Reducing the space requirement of LZ-index. In Proceedings of the 17th Annual Symposium on Combinatorial Pattern Matching (CPM). LNCS 4009. 319--330.
 
8
Bell, T., Cleary, J., and Witten, I. 1990. Text compression. Prentice Hall.
 
9
Benoit, D., Demaine, E., Munro, I., Raman, R., Raman, V., and Rao, S. 2005. Representing trees of higher degree. Algorithmica 43, 4, 275--292.
 
10
Chazelle, B. 1988. A functional approach to data structures and its use in multidimensional searching. SIAM Journal on Computing 17, 3, 427--462.
 
11
Ferragina, P. and Manzini, G. 2000. Opportunistic data structures with applications. In Proceedings of the 41st IEEE Symposium Foundations of Computer Science (FOCS). 390--398.
 
12
Ferragina, P. and Manzini, G. 2001. An experimental study of an opportunistic index. In Proceedings of the 12th ACM Symposium on Discrete Algorithms (SODA). 269--278.
 
13
Ferragina, P. and Manzini, G. 2002. On compressing and indexing data. Tech. Rep. TR-02-01, Dipartamento di Informatica, Univ. of Pisa.
 
14
Geary, R., Rahman, N., Raman, R., and Raman, V. 2004. A simple optimal representation for balanced parentheses. In Proceedings of the 15th Annual Symposium on Combinatorial Pattern Matching (CPM). LNCS v. 3109. 159--172.
 
15
González, R., Grabowski, S., Mäkinen, V., and Navarro, G. 2005. Practical implementation of rank and select queries. In Poster Proceedings Volume of the 4th Workshop on Efficient and Experimental Algorithms (WEA). CTI Press and Ellinika Grammata, Greece, 27--38.
 
16
Grossi, R. and Vitter, J. 2000. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. In Proceedings of the 32nd ACM Symposium Theory of Computing (STOC). 397--406.
 
17
Harman, D. 1995. Overview of the Third Text REtrieval Conference. In Proceedings of the 3rd Text REtrieval Conf. (TREC-3). 1--19. NIST Special Publication 500--507.
 
18
Jacobson, G. 1989. Space-efficient static trees and graphs. In Proceedings of the 30th IEEE Symposium Foundations of Computer Science (FOCS). 549--554.
 
19
Kärkkäinen, J. 1995. Suffix cactus: a cross between suffix tree and suffix array. In Proceedings of the 6th Annual Symposium Combinatorial Pattern Matching (CPM). LNCS 937. 191--204.
 
20
Kärkkäinen, J. 1999. Repetition-based text indexes. Ph.D. thesis, Dept. of Computer Science, University of Helsinki, Finland. Also available as Report A-1999-4, Series A.
 
21
Kärkkäinen, J. and Ukkonen, E. 1996a. Lempel-Ziv parsing and sublinear-size index structures for string matching. In Proceedings of the 3rd South American Workshop on String Processing (WSP). Carleton University Press, 141--155.
 
22
Kärkkäinen, J. and Ukkonen, E. 1996b. Sparse suffix trees. In Proceedings of the 2nd Ann. International Conf. on Computing and Combinatorics (COCOON). LNCS 1090.
 
23
Kosaraju, R. and Manzini, G. 1999. Compression of low entropy strings with Lempel-Ziv algorithms. SIAM Journal on Computing 29, 3, 893--911.
 
24
Kurtz, S. 1998. Reducing the space requirements of suffix trees. Report 98-03, Technische Kakultät, Universität Bielefeld.
 
25
Mäkinen, V. 2003. Compact suffix array—a space-efficient full-text index. Fundamenta Informaticae 56, 1--2, 191--210.
 
26
Manber, U. and Myers, G. 1993. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, 935--948.
 
27
Munro, I. 1996. Tables. In Proceedings of the 16th Foundations of Software Technology and Theoretical Computer Science (FSTTCS). LNCS 1180. 37--42.
 
28
Munro, I. and Raman, V. 1997. Succint representation of balanced parentheses, static trees and planar graphs. In Proceedings of the 38th IEEE Symposium Foundations of Computer Science (FOCS). 118--126.
 
29
Munro, I., Raman, V., and Rao, S. 2001. Space efficient suffix trees. Journal of Algorithms, 205--222.
 
30
Navarro, G. 2002. Indexing text using the Ziv-Lempel trie. In Proceedings of the 9th International Symposium String Processing and Information Retrieval (SPIRE). LNCS 2476. 325--336.
 
31
Navarro, G. 2004. Indexing text using the ziv-lempel trie. Journal of Discrete Algorithms (JDA) 2, 1, 87--114.
 
32
Navarro, G. and Mäkinen, V. 2007. Compressed full-text indexes. ACM Computing Surveys 39, 1, article 2.
 
33
Navarro, G., Moura, E., Neubert, M., Ziviani, N., and Baeza-Yates, R. 2000. Adding compression to block addressing inverted indexes. Information Retrieval 3, 1, 49--77.
 
34
Russo, L., Navarro, G., and Oliveira, A. 2007. Approximate string matching with Lempel-Ziv compressed indexes. In Proceedings of the 14th International Symposium on String Processing and Information Retrieval (SPIRE). LNCS 4726. 265--275.
 
35
Russo, L. and Oliveira, A. 2006. A compressed self-index using a ziv-lempel dictionary. In Proceedings of the 13th Symposium on String Processing and Information Retrieval (SPIRE). LNCS 4209. 163--180.
 
36
Sadakane, K. 2000. Compressed text databases with efficient query algorithms based on the compressed suffix array. In Proceedings of the 11th International Symposium Algorithms and Computation (ISAAC). LNCS 1969. 410--421.
 
37
Sadakane, K. 2002. Succint representations of lcp information and improvements in the compressed suffix arrays. In Proceedings of the 13th ACM Symposium on Discrete Algorithms (SODA). 225--232.
 
38
Welch, T. 1984. A technique for high performance data compression. IEEE Computer Magazine 17, 6 (June), 8--19.
 
39
Witten, I., Moffat, A., and Bell, T. 1999. Managing Gigabytes, second ed. Morgan Kaufmann Publishers, New York.
 
40
Ziv, J. and Lempel, A. 1978. Compression of individual sequences via variable length coding. IEEE Trans. on Information Theory 24, 530--536.