ACM Home Page
Please provide us with feedback. Feedback
Fast and flexible word searching on compressed text
Full text PdfPdf (165 KB)
Source ACM Transactions on Information Systems (TOIS) archive
Volume 18 ,  Issue 2  (April 2000) table of contents
Pages: 113 - 139  
Year of Publication: 2000
ISSN:1046-8188
Authors
Edleno Silva de Moura  Univ. Federal de Minas Gerais, Belo Horizonte, Brazil
Gonzalo Navarro  Univ. de Chile, Santiago, Chile
Nivio Ziviani  Univ. Federal de Minas Gerais, Belo Horizonte, Brazil
Ricardo Baeza-Yates  Univ. de Chile, Santiago, Chile
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 13,   Downloads (12 Months): 123,   Citation Count: 31
Additional Information:

abstract   references   cited by   index terms   review   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/348751.348754
What is a DOI?

ABSTRACT

We present a fast compression technique for natural language texts. The novelties are that (1) decompression of arbitrary portions of the text can be done very efficiently, (2) exact search for words and phrases can be done on the compressed text directly, using any known sequential pattern-matching algorithm, and (3) word-based approximate and extended search can also be done efficiently without any decoding. The compression scheme uses a semistatic word-based model and a Huffman code where the coding alphabet is byte-oriented rather than bit-oriented. We compress typical English texts to about 30% of their original size, against 40% and 35% for Compress and Gzip, respectively. Compression time is close to that of Compress and approximately half of the time of Gzip, and decompression time is lower than that of Gzip and one third of that of Compress. We present three algorithms to search the compressed text. They allow a large number of variations over the basic word and phrase search capability, such as sets of characters, arbitrary regular expressions, and approximate matching. Separators and stopwords can be discarded at search time without significantly increasing the cost. When searching for simple words, the experiments show that running our algorithms on a compressed text is twice as fast as running the best existing software on the uncompressed version of the same text. When searching complex or approximate patterns, our algorithms are up to 8 times faster than the search on uncompressed text. We also discuss the impact of our technique in inverted files pointing to logical blocks and argue for the possibility of keeping the text compressed all the time, decompressing only for displaying purposes.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

1
 
2
AMIR, A. AND BENSON, G. 1992. Efficient two-dimensional compressed matching. In Proceedings of the on 1992 IEEE Data Compression Conference (Mar.), J. Storer and M. Cohn, Eds. IEEE Computer Society Press, Los Alamitos, CA, 279-288.
 
3
 
4
ARA JO, M. D., NAVARRO, G., AND ZIVIANI, N. 1997. Large text searching allowing errors. In Proceedings of the 4th South American Workshop on String Processing, R. Baeza-Yates, Ed. Carleton University Press International Informatics Series, vol. 8. Carleton University Press, Ottawa, Canada, 2-20.
5
6
 
7
BAEZA-YATES, R. AND NAVARRO, G. 1999. Faster approximate string matching. Algorithmica 23, 2, 127-158.
 
8
 
9
 
10
11
12
 
13
 
14
HARMAN, D. 1995. Overview of the third Text Retrieval Conference (TREC-3). In Proceedings of the 3rd Text Retrieval Conference (TREC-3), D. Harman, Ed. National Institute of Standards and Technology, Gaithersburg, MD.
 
15
16
 
17
HORSPOOL, R. N. AND CORMACK, G.V. 1992. Constructing word-based text compression algorithms. In Proceedings of the on 1992 IEEE Data Compression Conference (Mar.), J. Storer and M. Cohn, Eds. IEEE Computer Society Press, Los Alamitos, CA, 62-71.
 
18
HUFFMAN, D.A. 1952. A method for the construction of minimum-redundancy codes. Proc. Inst. Radio Eng. 40, 9 (Sept.), 1098-1101.
 
19
 
20
 
21
22
 
23
MANBER, U. AND WU, S. 1993. Glimpse: A tool to search through entire file systems. Tech. Rep. 93-34. Department of Computer Science, University of Arizona, Tucson, AZ.
 
24
MILIDIU, R., PESSOA, A., AND LABER, E. 1998. In-place, simple, and fast length-restricted prefix coding. In Proceedings of 5th International Symposium on String Processing and Information Retrieval (SPIRE'98, Sept. 1998), IEEE Computer Society, Washington, DC, 50-59.
 
25
 
26
 
27
MOURA, E. S. 1999. Aplica# es de Compress o de Dados a Sistemas de Recupera# o de Informa# o. Ph.D. Dissertation. Department of Computer Science, Universidad Federal de Minas Gerais, Brazil.
 
28
MOURA, E. S., NAVARRO, G., AND ZIVIANI, N. 1997. Indexing compressed text. In Proceedings of the 4th South American Workshop on String Processing, R. Baeza-Yates, Ed. Carleton University Press International Informatics Series, vol. 8. Carleton University Press, Ottawa, Canada, 95-111.
29
 
30
MOURA, E. S., NAVARRO, G., ZIVIANI, N., AND BAEZA-YATES, R. 1998b. Direct pattern matching on compressed text. In Proceedings of 5th International Symposium on String Processing and Information Retrieval (SPIRE'98, Sept. 1998), IEEE Computer Society, Washington, DC, 90-95.
 
31
 
32
33
34
 
35
TURPIN, A. AND MOFFAT, A. 1997. Fast file search using text compression. In Proceedings of the 20th Conference on Australian Computer Science, 1-8.
 
36
WITTEN, I., MOFFAT, A., AND BELL, T. 1999. Managing Gigabytes. 2nd Morgan Kaufmann Publishers Inc., San Francisco, CA.
37
 
38
ZmF, G. K. 1949. Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Addison-Wesley, Reading, MA.
 
39
ZIv, J. AND LEMPEL, A. 1976. On the complexity of finite sequences. IEEE Trans. Inf. Theor. 22, 1, 75-81.
 
40
ZIv, g. AND LEMPEL, A. 1977. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theor. IT-23, 3 (May), 337-343.
 
41
ZIv, J. AND LEMPEL, A. 1978. Compression of individual sequences via variable-rate coding. IEEE Trans. Inf. Theor. IT-24, 5 (Sept.), 530-536.
 
42

CITED BY  31


REVIEW

"Timothy C. Bell : Reviewer"

Drawing on existing compression and search methods, the authors build new technologies that provide a significant improvement in the speed of searching compressed text in exchange for a small loss in compression performance. The methods develo  more...

Collaborative Colleagues:
Edleno Silva de Moura: colleagues
Gonzalo Navarro: colleagues
Nivio Ziviani: colleagues
Ricardo Baeza-Yates: colleagues