ACM Home Page
Please provide us with feedback. Feedback
Self-indexing inverted files for fast text retrieval
Full text PdfPdf (485 KB)
Source ACM Transactions on Information Systems (TOIS) archive
Volume 14 ,  Issue 4  (October 1996) table of contents
Pages: 349 - 379  
Year of Publication: 1996
ISSN:1046-8188
Authors
Alistair Moffat  Univ. of Melbourne, Victoria, Australia
Justin Zobel  RMIT, Victoria, Australia
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 32,   Downloads (12 Months): 263,   Citation Count: 82
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/237496.237497
What is a DOI?

ABSTRACT

Query-processing costs on large text databases are dominated by the need to retrieve and scan the inverted list of each query term. Retrieval time for inverted lists can be greatly reduced by the use of compression, but this adds to the CPU time required. Here we show that the CPU component of query response time for conjunctive Boolean queries and for informal ranked queries can be similarly reduced, at little cost in terms of storage, by the inclusion of an internal index in each compressed inverted list. This method has been applied in a retrieval system for a collection of nearly two million short documents. Our experimental results show that the self-indexing strategy adds less than 20% to the size of the compressed inverted file, which itself occupies less than 10% of the indexed text, yet can reduce processing time for Boolean queries of 5-10 terms to under one fifth of the previous cost. Similarly, ranked queries of 40-50 terms can be evaluated in as little as 25% of the previous time, with little or no loss of retrieval effectiveness.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
 
2
BOOKSTEIN, A., KLEIN,S.T.,AND RAITA, T. 1992. Model based concordance compression. In Proceedings of the IEEE Data Compression Conference, J. A. Storer and M. Cohn, Eds. IEEE Computer Society Press, Los Alamitos, Calif., 82-91.
3
4
5
 
6
7
 
8
ELIAS, P. 1975. Universal codeword sets and representations of the integers. IEEE Trans. Inf. Theory IT-21, 2 (Mar.), 194-203.
 
9
10
11
12
 
13
 
14
FRAENKEL,A.S.AND KLEIN, S. T. 1985. Novel compression of sparse bit-strings:Prelimi-nary report. In Combinatorial Algorithms on Words, A. Apostolico and Z. Galil, Eds. NATO ASI Series F, vol. 12. Springer-Verlag, Berlin, 169-183.
 
15
GALLAGER,R.G.AND VAN VOORHIS, D. C. 1975. Optimal source codes for geometrically distributed integer alphabets. IEEE Trans. Inf. Theory IT-21, 2 (Mar.), 228-230.
 
16
GOLOMB, S. W. 1966. Run-length encodings. IEEE Trans. Inf. Theory IT-12, 3 (July), 399-401.
 
17
HARMAN, D. K., Ed. 1992a. Proceedings of TREC Text Retrieval Conference (Washington, D.C., Nov.). National Institute of Standards Special Publication 500-207. NIST, Washing-ton, D.C.
 
18
 
19
HARMAN,D.K.AND CANDELA, G. 1990. Retrieving records from a gigabyte of text on a mini-computer using statistical ranking. J. Am. Soc. Inf. Sci. 41, 8, 581-589.
 
20
HWANG,F.K.AND LIN, S. 1972. A simple algorithm for merging two disjoint linearly ordered sets. SIAM J. Comput. 1, 1, 31-39.
 
21
KENT,A.J.,SACKS-DAVIS, R., AND RAMAMOHANARAO, K. 1990. A signature file scheme based on multiple organizations for indexing very large text databases. J. Am. Soc. Inf. Sci. 41, 7, 508-534.
22
23
 
24
LOVINS, J. B. 1968. Development of a stemming algorithm. Mech. Transl. Comput. 11, 1-2, 22-31.
 
25
 
26
MCDONELL, K. J. 1977. An inverted index implementation. Comput. J. 20, 1, 116-123.
27
 
28
MOFFAT, A., ZOBEL, J., AND KLEIN, S. T. 1995. Improved inverted file processing for large text databases. In Proceedings of the 6th Australasian Database Conference, R. Sacks-Davis and J. Zobel, Eds. 162-171.
 
29
30
 
31
 
32
 
33
 
34
 
35

CITED BY  82

Collaborative Colleagues:
Alistair Moffat: colleagues
Justin Zobel: colleagues