|
ABSTRACT
Query-processing costs on large text databases are dominated by the need to retrieve and scan the inverted list of each query term. Retrieval time for inverted lists can be greatly reduced by the use of compression, but this adds to the CPU time required. Here we show that the CPU component of query response time for conjunctive Boolean queries and for informal ranked queries can be similarly reduced, at little cost in terms of storage, by the inclusion of an internal index in each compressed inverted list. This method has been applied in a retrieval system for a collection of nearly two million short documents. Our experimental results show that the self-indexing strategy adds less than 20% to the size of the compressed inverted file, which itself occupies less than 10% of the indexed text, yet can reduce processing time for Boolean queries of 5-10 terms to under one fifth of the previous cost. Similarly, ranked queries of 40-50 terms can be evaluated in as little as 25% of the previous time, with little or no loss of retrieval effectiveness.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
| |
2
|
BOOKSTEIN, A., KLEIN,S.T.,AND RAITA, T. 1992. Model based concordance compression. In Proceedings of the IEEE Data Compression Conference, J. A. Storer and M. Cohn, Eds. IEEE Computer Society Press, Los Alamitos, Calif., 82-91.
|
 |
3
|
|
 |
4
|
Y. Choueka , A. Fraenkel , S. Klein , E. Segal, Improved techniques for processing queries in full-text systems, Proceedings of the 10th annual international ACM SIGIR conference on Research and development in information retrieval, p.306-315, June 03-05, 1987, New Orleans, Louisiana, United States
[doi> 10.1145/42005.42039]
|
 |
5
|
|
| |
6
|
|
 |
7
|
|
| |
8
|
ELIAS, P. 1975. Universal codeword sets and representations of the integers. IEEE Trans. Inf. Theory IT-21, 2 (Mar.), 194-203.
|
| |
9
|
|
 |
10
|
|
 |
11
|
|
 |
12
|
|
| |
13
|
Donna Harman , R. Baeza-Yates , Edward Fox , W. Lee, Inverted files, Information retrieval: data structures and algorithms, Prentice-Hall, Inc., Upper Saddle River, NJ, 1992
|
| |
14
|
FRAENKEL,A.S.AND KLEIN, S. T. 1985. Novel compression of sparse bit-strings:Prelimi-nary report. In Combinatorial Algorithms on Words, A. Apostolico and Z. Galil, Eds. NATO ASI Series F, vol. 12. Springer-Verlag, Berlin, 169-183.
|
| |
15
|
GALLAGER,R.G.AND VAN VOORHIS, D. C. 1975. Optimal source codes for geometrically distributed integer alphabets. IEEE Trans. Inf. Theory IT-21, 2 (Mar.), 228-230.
|
| |
16
|
GOLOMB, S. W. 1966. Run-length encodings. IEEE Trans. Inf. Theory IT-12, 3 (July), 399-401.
|
| |
17
|
HARMAN, D. K., Ed. 1992a. Proceedings of TREC Text Retrieval Conference (Washington, D.C., Nov.). National Institute of Standards Special Publication 500-207. NIST, Washing-ton, D.C.
|
| |
18
|
|
| |
19
|
HARMAN,D.K.AND CANDELA, G. 1990. Retrieving records from a gigabyte of text on a mini-computer using statistical ranking. J. Am. Soc. Inf. Sci. 41, 8, 581-589.
|
| |
20
|
HWANG,F.K.AND LIN, S. 1972. A simple algorithm for merging two disjoint linearly ordered sets. SIAM J. Comput. 1, 1, 31-39.
|
| |
21
|
KENT,A.J.,SACKS-DAVIS, R., AND RAMAMOHANARAO, K. 1990. A signature file scheme based on multiple organizations for indexing very large text databases. J. Am. Soc. Inf. Sci. 41, 7, 508-534.
|
 |
22
|
|
 |
23
|
|
| |
24
|
LOVINS, J. B. 1968. Development of a stemming algorithm. Mech. Transl. Comput. 11, 1-2, 22-31.
|
| |
25
|
|
| |
26
|
MCDONELL, K. J. 1977. An inverted index implementation. Comput. J. 20, 1, 116-123.
|
 |
27
|
|
| |
28
|
MOFFAT, A., ZOBEL, J., AND KLEIN, S. T. 1995. Improved inverted file processing for large text databases. In Proceedings of the 6th Australasian Database Conference, R. Sacks-Davis and J. Zobel, Eds. 162-171.
|
| |
29
|
|
 |
30
|
|
| |
31
|
|
| |
32
|
|
| |
33
|
|
| |
34
|
|
| |
35
|
|
CITED BY 82
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Sergey Melnik , Sriram Raghavan , Beverly Yang , Hector Garcia-Molina, Building a distributed full-text index for the Web, Proceedings of the 10th international conference on World Wide Web, p.396-406, May 01-05, 2001, Hong Kong, Hong Kong
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Junghoo Cho , Hector Garcia-Molina , Taher Haveliwala , Wang Lam , Andreas Paepcke , Sriram Raghavan , Gary Wesley, Stanford WebBase components and applications, ACM Transactions on Internet Technology (TOIT), v.6 n.2, p.153-186, May 2006
|
|
|
|
|
|
Manolis Terrovitis , Spyros Passas , Panos Vassiliadis , Timos Sellis, A combination of trie-trees and inverted files for the indexing of set-valued attributes, Proceedings of the 15th ACM international conference on Information and knowledge management, November 06-11, 2006, Arlington, Virginia, USA
|
|
|
|
|
|
Holger Bast , Debapriyo Majumdar , Ralf Schenkel , Martin Theobald , Gerhard Weikum, IO-Top-k: index-access optimized top-k query processing, Proceedings of the 32nd international conference on Very large data bases, September 12-15, 2006, Seoul, Korea
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Bin Cui , Ling Liu , Calton Pu , Jialie Shen , Kian-Lee Tan, QueST: querying music databases by acoustic and textual features, Proceedings of the 15th international conference on Multimedia, September 25-29, 2007, Augsburg, Germany
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Flavio Chierichetti , Silvio Lattanzi , Federico Mari , Alessandro Panconesi, On placing skips optimally in expectation, Proceedings of the international conference on Web search and web data mining, February 11-12, 2008, Palo Alto, California, USA
|
|
|
|
|
|
|
|
|
|
|