ACM Home Page
Please provide us with feedback. Feedback
An experimental study of an opportunistic index
Full text PdfPdf (786 KB)
Source Symposium on Discrete Algorithms archive
Proceedings of the twelfth annual ACM-SIAM symposium on Discrete algorithms table of contents
Washington, D.C., United States
Pages: 269 - 278  
Year of Publication: 2001
ISBN:0-89871-490-7
Authors
Paolo Ferragina  Dipartimento di Informatica, Università di Pisa, Italy
Giovanni Manzini  Dipartimento di Scienze e Tecnologie Avanzate, Università del Piemonte Orientale, Alessandria, Italy and IMC-CNR, Pisa, Italy
Sponsors
SIGACT: ACM Special Interest Group on Algorithms and Computation Theory
SIAM : Society for Industrial and Applied Mathematics
Publisher
Society for Industrial and Applied Mathematics  Philadelphia, PA, USA
Bibliometrics
Downloads (6 Weeks): 16,   Downloads (12 Months): 84,   Citation Count: 19
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Review this Article  

ABSTRACT

The size of electronic data is currently growing at a faster rate than computer memory and disk storage capacities. For this reason compression appears always as an attractive choice, if not mandatory. However space overhead is not the only resource to be optimized when managing large data collections; in fact data turn out to be useful only when properly indexed to support search operations that efficiently extract the user-requested information.

Approaches to combine compression and indexing techniques are nowadays receiving more and more attention. A first step towards the design of a compressed full-text index achieving guaranteed performance in the worst case has been recently done in [10]. This index combines the compression algorithm proposed by Burrows and Wheeler [5] with the suffix array data structure [16]. The index is opportunistic in that it takes advantage of the compressibility of the input data by decreasing the space occupancy at no significant asymptotic slowdown in the query performance.

In this paper we present an implementation of this index and perform an extensive set of experiments on various text collections. The experiments show that our index is compact (its space occupancy is close to the one achieved by the best known compressors), it is fast in counting the number of pattern occurrences, and the cost of their retrieval is reasonable when they are few (i.e., in case of a selective query). In addition, our experiments show that the FM-index is flexible in that it is possible to trade space occupancy for search time by choosing the amount of auxiliary information stored into it.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
 
2
R. Arnold and T. Bell. The Canterbury corpus home page. http://corpus.canterbury.ac.nz.
3
 
4
 
5
M. Burrows and D. Wheeler. A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994.
 
6
 
7
 
8
M. Farach and M. Thorup. String matching in Lempel- Ziv compressed strings. Algorithmica, 20(4):388-404, 1998.
 
9
P. Fenwick. The Burrows-Wheeler transform for block sorting text compression: principles and improvements. The Computer Journal, 39(9):731-740, 1996.
 
10
 
11
G. H. Gonnet, R. A. Baeza-Yates, and T. Snider. Information Retrieval: Data Structures and Algorithms, chapter 5, pages 66-82. Prentice-Hall, 1992.
12
 
13
D. K. Harman, editor. Proc. TREC Text Retrieval Conference. National Institute of Standards, 1992. Special Pubblication 500-207.
 
14
 
15
 
16
 
17
U. Manber and S. Wu. GLIMPSE: A tool to search through entire file systems. In Proceedings of the USENIX Winter 1994 Technical Conference, pages 23- 32, 1994.
 
18
E. Moura, G. Navarro, and N. Ziviani. Indexing compressed text. In N. Ziviani, R. Baeza-Yates, and K. Guimaraes, editors, Proceedings of the 4th South American Workshop on String Processing. Carleton University Press, 1997.
19
 
20
J. I. Munro. Succinct data structures. In Proceeding of the 19th Conference on Foundations of Software Technology and Theoretical Computer Science. Springer- Verlag LNCS n. 1738, 1999.
 
21
 
22
J. Seward. The nzlP2 home page, 1997. http ://sourceware. cygnus, com/bzip2/index, html.
 
23
 
24
25

CITED BY  19

Collaborative Colleagues:
Paolo Ferragina: colleagues
Giovanni Manzini: colleagues