ACM Home Page
Please provide us with feedback. Feedback
Compression of indexes with full positional information in very large text databases
Full text PdfPdf (690 KB)
Source Annual ACM Conference on Research and Development in Information Retrieval archive
Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval table of contents
Pittsburgh, Pennsylvania, United States
Pages: 88 - 95  
Year of Publication: 1993
ISBN:0-89791-605-0
Authors
Sponsor
SIGIR: ACM Special Interest Group on Information Retrieval
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 5,   Downloads (12 Months): 43,   Citation Count: 5
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/160688.160699
What is a DOI?

ABSTRACT

This paper describes a combination of compression methods which may be used to reduce the size of inverted indexes for very large text databases. These methods are Prefix Omission, Run-Length Encoding, and a novel family of numeric representations called n-s coding. Using these compression methods on two different text sources (the King James Version of the Bible and a sample of Wall Street Journal Stories), the compressed index occupies less than 40% of the size of the original text, even when both stopwords and numbers are included in the index. The decreased time required for I/O can almost fully compensate for the time needed to uncompress the postings. This research is part of an effort to handle very large text databases on the CM-5, a massively parallel MIMD supercomputer.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

1
 
2
Bookstein, A., S.T. Klein, and T. Raita. "Compression Concordances Subject to Clustering." Proceedings: Data Compression Conference (Snowbird, Utah), April 1993, to appear.
 
3
4
 
5
Elias, P. "Universal Code-Word Sets and Representations of the Integers." IEEE Transactions on Information Theory, p. 194-203, March 1975.
 
6
Goulomb, Solomon. "Run-length Encodings." IEEE Transactions on Information Theory, 12(3):399-401, July 1966.
 
7
Harmon, Donna. "How Effective is Suffixing?" Journal of the American Society for Information Science, 42(1):7-15, January 1991.
 
8
Haskin, Roger. "Special Purpose Processors for Text Retrieval." Database Engineering, 4(1): 16-29.
 
9
Huffman, David. "A Method for the Construction of Minimum-Redundancy Codes." Proceedings of the IRE, 40(9):1098-1101, September 1932.
10
 
11
Moffat, Alistair and Justin Zobel. "Coding for Compression in Full-Text Retrieval Systems." Proceedings IEEE Data Compression Conference (Snowbird, Utah), March 1992, pp. 23-32.
12
 
13
Van Rijsbergen, C.J. Chapter 2, "Automated Text Analysis" in information Retrieval, 2nd edition. Butterworth & Co., London, U.K., 1979, pp. 14-35.
 
14
Witten, Ian, Timothy Bell, and Craig Nevill. "Models for Compression in Full-Text Retrieval Systems." Proceedings: Data Compression Conference(Snowbird, Utah), April 1991, pp. 23-32.
15


Collaborative Colleagues:
Gordon Linoff: colleagues
Craig Stanfill: colleagues