|
ABSTRACT
We present a fast compression technique for natural language texts. The novelties are that (1) decompression of arbitrary portions of the text can be done very efficiently, (2) exact search for words and phrases can be done on the compressed text directly, using any known sequential pattern-matching algorithm, and (3) word-based approximate and extended search can also be done efficiently without any decoding. The compression scheme uses a semistatic word-based model and a Huffman code where the coding alphabet is byte-oriented rather than bit-oriented. We compress typical English texts to about 30% of their original size, against 40% and 35% for Compress and Gzip, respectively. Compression time is close to that of Compress and approximately half of the time of Gzip, and decompression time is lower than that of Gzip and one third of that of Compress. We present three algorithms to search the compressed text. They allow a large number of variations over the basic word and phrase search capability, such as sets of characters, arbitrary regular expressions, and approximate matching. Separators and stopwords can be discarded at search time without significantly increasing the cost. When searching for simple words, the experiments show that running our algorithms on a compressed text is twice as fast as running the best existing software on the uncompressed version of the same text. When searching complex or approximate patterns, our algorithms are up to 8 times faster than the search on uncompressed text. We also discuss the impact of our technique in inverted files pointing to logical blocks and argue for the possibility of keeping the text compressed all the time, decompressing only for displaying purposes.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
|
| |
2
|
AMIR, A. AND BENSON, G. 1992. Efficient two-dimensional compressed matching. In Proceedings of the on 1992 IEEE Data Compression Conference (Mar.), J. Storer and M. Cohn, Eds. IEEE Computer Society Press, Los Alamitos, CA, 279-288.
|
| |
3
|
|
| |
4
|
ARA JO, M. D., NAVARRO, G., AND ZIVIANI, N. 1997. Large text searching allowing errors. In Proceedings of the 4th South American Workshop on String Processing, R. Baeza-Yates, Ed. Carleton University Press International Informatics Series, vol. 8. Carleton University Press, Ottawa, Canada, 2-20.
|
 |
5
|
|
 |
6
|
|
| |
7
|
BAEZA-YATES, R. AND NAVARRO, G. 1999. Faster approximate string matching. Algorithmica 23, 2, 127-158.
|
| |
8
|
|
| |
9
|
|
| |
10
|
|
 |
11
|
|
 |
12
|
|
| |
13
|
|
| |
14
|
HARMAN, D. 1995. Overview of the third Text Retrieval Conference (TREC-3). In Proceedings of the 3rd Text Retrieval Conference (TREC-3), D. Harman, Ed. National Institute of Standards and Technology, Gaithersburg, MD.
|
| |
15
|
|
 |
16
|
|
| |
17
|
HORSPOOL, R. N. AND CORMACK, G.V. 1992. Constructing word-based text compression algorithms. In Proceedings of the on 1992 IEEE Data Compression Conference (Mar.), J. Storer and M. Cohn, Eds. IEEE Computer Society Press, Los Alamitos, CA, 62-71.
|
| |
18
|
HUFFMAN, D.A. 1952. A method for the construction of minimum-redundancy codes. Proc. Inst. Radio Eng. 40, 9 (Sept.), 1098-1101.
|
| |
19
|
|
| |
20
|
|
| |
21
|
|
 |
22
|
|
| |
23
|
MANBER, U. AND WU, S. 1993. Glimpse: A tool to search through entire file systems. Tech. Rep. 93-34. Department of Computer Science, University of Arizona, Tucson, AZ.
|
| |
24
|
MILIDIU, R., PESSOA, A., AND LABER, E. 1998. In-place, simple, and fast length-restricted prefix coding. In Proceedings of 5th International Symposium on String Processing and Information Retrieval (SPIRE'98, Sept. 1998), IEEE Computer Society, Washington, DC, 50-59.
|
| |
25
|
|
| |
26
|
|
| |
27
|
MOURA, E. S. 1999. Aplica# es de Compress o de Dados a Sistemas de Recupera# o de Informa# o. Ph.D. Dissertation. Department of Computer Science, Universidad Federal de Minas Gerais, Brazil.
|
| |
28
|
MOURA, E. S., NAVARRO, G., AND ZIVIANI, N. 1997. Indexing compressed text. In Proceedings of the 4th South American Workshop on String Processing, R. Baeza-Yates, Ed. Carleton University Press International Informatics Series, vol. 8. Carleton University Press, Ottawa, Canada, 95-111.
|
 |
29
|
Edleno Silva de Moura , Gonzalo Navarro , Nivio Ziviani , Ricardo Baeza-Yates, Fast searching on compressed text allowing errors, Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, p.298-306, August 24-28, 1998, Melbourne, Australia
[doi> 10.1145/290941.291013]
|
| |
30
|
MOURA, E. S., NAVARRO, G., ZIVIANI, N., AND BAEZA-YATES, R. 1998b. Direct pattern matching on compressed text. In Proceedings of 5th International Symposium on String Processing and Information Retrieval (SPIRE'98, Sept. 1998), IEEE Computer Society, Washington, DC, 90-95.
|
| |
31
|
|
| |
32
|
|
 |
33
|
|
 |
34
|
|
| |
35
|
TURPIN, A. AND MOFFAT, A. 1997. Fast file search using text compression. In Proceedings of the 20th Conference on Australian Computer Science, 1-8.
|
| |
36
|
WITTEN, I., MOFFAT, A., AND BELL, T. 1999. Managing Gigabytes. 2nd Morgan Kaufmann Publishers Inc., San Francisco, CA.
|
 |
37
|
|
| |
38
|
ZmF, G. K. 1949. Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Addison-Wesley, Reading, MA.
|
| |
39
|
ZIv, J. AND LEMPEL, A. 1976. On the complexity of finite sequences. IEEE Trans. Inf. Theor. 22, 1, 75-81.
|
| |
40
|
ZIv, g. AND LEMPEL, A. 1977. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theor. IT-23, 3 (May), 337-343.
|
| |
41
|
ZIv, J. AND LEMPEL, A. 1978. Compression of individual sequences via variable-rate coding. IEEE Trans. Inf. Theor. IT-24, 5 (Sept.), 530-536.
|
| |
42
|
|
CITED BY 31
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Marcos André Gonçalves , Edward A. Fox , Layne T. Watson , Neill A. Kipp, Streams, structures, spaces, scenarios, societies (5s): A formal model for digital libraries, ACM Transactions on Information Systems (TOIS), v.22 n.2, p.270-312, April 2004
|
|
|
|
|
|
|
|
|
|
|
|
Edleno S. de Moura , Célia F. dos Santos , Daniel R. Fernandes , Altigran S. Silva , Pavel Calado , Mario A. Nascimento, Improving Web search efficiency via a locality based static pruning method, Proceedings of the 14th international conference on World Wide Web, May 10-14, 2005, Chiba, Japan
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Edleno Silva de Moura , Celia Francisca dos Santos , Bruno Dos santos de Araujo , Altigran Soares da Silva , Pavel Calado , Mario A. Nascimento, Locality-Based pruning methods for web search, ACM Transactions on Information Systems (TOIS), v.26 n.2, p.1-28, March 2008
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
REVIEW
"Timothy C. Bell : Reviewer"
Drawing on existing compression and search methods, the authors
build new technologies that provide a significant improvement in the
speed of searching compressed text in exchange for a small loss in
compression performance. The methods develo
more...
|