|
ABSTRACT
This paper describes a combination of compression methods which may be used to reduce the size of inverted indexes for very large text databases. These methods are Prefix Omission, Run-Length Encoding, and a novel family of numeric representations called n-s coding. Using these compression methods on two different text sources (the King James Version of the Bible and a sample of Wall Street Journal Stories), the compressed index occupies less than 40% of the size of the original text, even when both stopwords and numbers are included in the index. The decreased time required for I/O can almost fully compensate for the time needed to uncompress the postings. This research is part of an effort to handle very large text databases on the CM-5, a massively parallel MIMD supercomputer.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
|
| |
2
|
Bookstein, A., S.T. Klein, and T. Raita. "Compression Concordances Subject to Clustering." Proceedings: Data Compression Conference (Snowbird, Utah), April 1993, to appear.
|
| |
3
|
|
 |
4
|
|
| |
5
|
Elias, P. "Universal Code-Word Sets and Representations of the Integers." IEEE Transactions on Information Theory, p. 194-203, March 1975.
|
| |
6
|
Goulomb, Solomon. "Run-length Encodings." IEEE Transactions on Information Theory, 12(3):399-401, July 1966.
|
| |
7
|
Harmon, Donna. "How Effective is Suffixing?" Journal of the American Society for Information Science, 42(1):7-15, January 1991.
|
| |
8
|
Haskin, Roger. "Special Purpose Processors for Text Retrieval." Database Engineering, 4(1): 16-29.
|
| |
9
|
Huffman, David. "A Method for the Construction of Minimum-Redundancy Codes." Proceedings of the IRE, 40(9):1098-1101, September 1932.
|
 |
10
|
|
| |
11
|
Moffat, Alistair and Justin Zobel. "Coding for Compression in Full-Text Retrieval Systems." Proceedings IEEE Data Compression Conference (Snowbird, Utah), March 1992, pp. 23-32.
|
 |
12
|
|
| |
13
|
Van Rijsbergen, C.J. Chapter 2, "Automated Text Analysis" in information Retrieval, 2nd edition. Butterworth & Co., London, U.K., 1979, pp. 14-35.
|
| |
14
|
Witten, Ian, Timothy Bell, and Craig Nevill. "Models for Compression in Full-Text Retrieval Systems." Proceedings: Data Compression Conference(Snowbird, Utah), April 1991, pp. 23-32.
|
 |
15
|
|
|