|
ABSTRACT
The technology underlying text search engines has advanced dramatically in the past decade. The development of a family of new index representations has led to a wide range of innovations in index storage, index construction, and query evaluation. While some of these developments have been consolidated in textbooks, many specific techniques are not widely known or the textbook descriptions are out of date. In this tutorial, we introduce the key techniques in the area, describing both a core implementation and how the core can be enhanced through a range of extensions. We conclude with a comprehensive bibliography of text indexing literature.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
|
 |
2
|
|
 |
3
|
|
| |
4
|
|
 |
5
|
|
| |
6
|
Badue, C., Baeza-Yates, R., Ribeiro-Neto, B., and Ziviani, N. 2001. Distributed query processing using partitioned inverted files. In Proceedings of String Processing and Information Retrieval Symposium, Laguna de San Rafael, Chile. G. Navarro, Ed. IEEE Computer Society, 10--20.
|
| |
7
|
|
| |
8
|
|
| |
9
|
|
| |
10
|
|
| |
11
|
|
| |
12
|
Bayer, R. and McCreight, R. 1972. Organization and maintenance of large ordered indexes. Acta Informatica 1, 173--189.
|
| |
13
|
Beaulieu, M., Baeza-Yates, R., Myaeng, S. H., and Järvelin, K., Eds. 2002. Proceedings of the 25th Annual International ACM SIGIR Conf. on Research and Development in Information Retrieval. Tampere, Finland, ACM Press.
|
| |
14
|
|
| |
15
|
|
| |
16
|
Elisa Bertino , C , Kian-Lee Tan , Beng Chin Ooi , Ron Sacks-Davis , Justin Zobel , Boris Shidlovsky, Indexing Techniques for Advanced Database Systems, Kluwer Academic Publishers, Norwell, MA, 1997
|
 |
17
|
R. M. Bird , J. B. Newsbaum , J. L. Trefftzs, Text file inversion: An evaluation, Proceedings of the fourth workshop on Computer architecture for non-numeric processing, p.42-50, August 01-04, 1978, Blue Mountain Lake, New York, United States
[doi> 10.1145/800128.804166]
|
 |
18
|
|
| |
19
|
|
| |
20
|
|
| |
21
|
|
| |
22
|
Bookstein, A. and Klein, S. T. 1991b. Flexible compression for bitmap sets. In Proceedings of the IEEE Data Compression Conference, Snowbird, UT, J. Storer and M. Cohn, Eds. IEEE Computer Society Press, Los Alamitos, CA. 402--410.
|
 |
23
|
|
| |
24
|
|
| |
25
|
|
| |
26
|
|
| |
27
|
Brisaboa, N. R., Fariña, A., Navarro, G., and Esteller, M. F. 2003. (S,C)-dense coding: An optimized compression code for natural language text databases. In Proceedings of String Processing and Information Retrieval Symposium, Manaus, Brazil, M. A. Nascimento, Ed. Springer, 122--136.
|
 |
28
|
|
| |
29
|
|
| |
30
|
|
 |
31
|
|
 |
32
|
|
 |
33
|
|
| |
34
|
Cacheda, F., Plachouras, V., and Ounis, I. 2004. Performance analysis of distributed architectures to index one terabyte of text. In Proceedings of the European Conference on IR Research, Sunderland, UK, S. McDonald and J. Tait, Eds. 395--408. Lecture Notes in Computer Science, Springer, vol. 2997.
|
 |
35
|
|
 |
36
|
|
| |
37
|
|
 |
38
|
|
 |
39
|
David Carmel , Doron Cohen , Ronald Fagin , Eitan Farchi , Michael Herscovici , Yoelle S. Maarek , Aya Soffer, Static index pruning for information retrieval systems, Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, p.43-50, September 2001, New Orleans, Louisiana, United States
[doi> 10.1145/383952.383958]
|
 |
40
|
Y. Choueka , A. Fraenkel , S. Klein , E. Segal, Improved techniques for processing queries in full-text systems, Proceedings of the 10th annual international ACM SIGIR conference on Research and development in information retrieval, p.306-315, June 03-05, 1987, New Orleans, Louisiana, United States
[doi> 10.1145/42005.42039]
|
 |
41
|
|
 |
42
|
A. S. Fraenkel , S. T. Klein , Y. Choueka , E. Segal, Improved hierarchical bit-vector compression in document retrieval systems, Proceedings of the 9th annual international ACM SIGIR conference on Research and development in information retrieval, p.88-96, September 1986, Palazzo dei Congressi, Pisa, Italy
[doi> 10.1145/253168.253190]
|
 |
43
|
|
 |
44
|
|
| |
45
|
Clarke, C. L. A. and Cormack, G. V. 1995. Dynamic inverted indexes for a distributed full-text retrieval system. Tech. rep. MT-95-01, Department of Computer Science, University of Waterloo, Waterloo, Canada.
|
 |
46
|
|
| |
47
|
Clarke, C. L. A., Cormack, G. V., and Burkowski, F. J. 1994. Fast inverted indexes with on-line update. Tech. rep. CS-94-40, Department of Computer Science, University of Waterloo, Waterloo, Canada.
|
| |
48
|
|
| |
49
|
T. R. Couvreur , R. N. Benzel , S. F. Miller , D. N. Zeitler , D. L. Lee , M. Singhal , N. Shivaratri , W. Y. P. Wong, An analysis of performance and cost factors in searching large text databases using parallel search systems, Journal of the American Society for Information Science, v.45 n.7, p.443-464, Aug. 1994
[doi> 10.1002/(SICI)1097-4571(199408)45:7<443::AID-ASI1>3.0.CO;2-O]
|
 |
50
|
J. K. Cringean , R. England , G. A. Manson , P. Willett, Parallel text searching in serial files using a processor farm, Proceedings of the 13th annual international ACM SIGIR conference on Research and development in information retrieval, p.429-453, September 05-07, 1990, Brussels, Belgium
[doi> 10.1145/96749.98249]
|
| |
51
|
Croft, W. B., Harper, D. J., Kraft, D. H., and Zobel, J., Eds. 2001. Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New Orleans, LA. ACM Press.
|
| |
52
|
|
 |
53
|
|
 |
54
|
|
| |
55
|
Culpepper, J. S. and Moffat, A. 2005. Enhanced byte codes with restricted prefix properties. In Proceedings of the 12th International Symposium on String Processing and Information Retrieval. Buenos Aires, Argentina, M. P. Consens and G. Navarro, Eds. Lecture Notes in Computer Science, vol. 3772, Springer, 1--12.
|
 |
56
|
|
 |
57
|
|
| |
58
|
Owen de Kretser , Alistair Moffat, SEFT: a search engine for text, Software—Practice & Experience, v.34 n.10, p.1011-1023, August 2004
|
| |
59
|
|
 |
60
|
|
 |
61
|
|
| |
62
|
Elias, P. 1975. Universal codeword sets and representations of the integers. IEEE Trans. Inform. Theory IT-21, 2 (March), 194--203.
|
 |
63
|
|
 |
64
|
|
| |
65
|
|
| |
66
|
|
| |
67
|
|
| |
68
|
Fraenkel, A. S. and Klein, S. T. 1985. Novel compression of sparse bit-strings---Preliminary report. In Combinatorial Algorithms on Words, Volume 12, A. Apostolico and Z. Galil, Eds. NATO ASI Series F. Springer, Berlin, Germany, 169--183.
|
| |
69
|
|
 |
70
|
|
| |
71
|
Gallager, R. G. and Van Voorhis, D. C. 1975. Optimal source codes for geometrically distributed integer alphabets. IEEE Trans. Inform. Theory IT--21, 2 (March), 228--230.
|
| |
72
|
|
| |
73
|
Golomb, S. W. 1966. Run-length encodings. IEEE Trans. Inform. Theory IT--12, 3 (July), 399--401.
|
 |
74
|
|
| |
75
|
|
| |
76
|
|
| |
77
|
|
| |
78
|
Harman, D. K. and Candela, G. 1990. Retrieving records from a gigabyte of text on a minicomputer using statistical ranking. J. Amer. Soc. Inform. Science 41, 8 (Aug.), 581--589.
|
| |
79
|
Harper, D. J. 1982. An evaluation of four information storage and retrieval packages. Tech. rep. 7, CSIRO Division of Computing Research, Canberra, Australia.
|
 |
80
|
|
| |
81
|
|
 |
82
|
|
| |
83
|
|
 |
84
|
|
| |
85
|
|
| |
86
|
Ivie, E. L. 1966. Search procedure based on measures of relatedness between documents. Ph.D. thesis. MIT, Cambridge, MA.
|
| |
87
|
Jakobsson, M. 1978. Huffman coding in bit-vector compression. Inform. Pro. Let. 7, 6 (Oct.) 304--307.
|
| |
88
|
|
 |
89
|
Björn T. Jónsson , Michael J. Franklin , Divesh Srivastava, Interaction of query evaluation and buffer management for information retrieval, Proceedings of the 1998 ACM SIGMOD international conference on Management of data, p.118-129, June 01-04, 1998, Seattle, Washington, United States
|
 |
90
|
|
| |
91
|
Kent, A. J., Sacks-Davis, R., and Ramamohanarao, K. 1990. A signature file scheme based on multiple organisations for indexing very large text databases. J. Amer. Soc. Inform. Science 41, 7, 508--534.
|
 |
92
|
|
 |
93
|
|
 |
94
|
|
| |
95
|
|
| |
96
|
Lancaster, F. W. and Fayen, E. G. 1973. Information Retrieval OnLine. Melville, Los Angeles, CA.
|
| |
97
|
|
 |
98
|
|
 |
99
|
Yong Kyu Lee , Seong-Joon Yoo , Kyoungro Yoon , P. Bruce Berra, Index structures for structured documents, Proceedings of the first ACM international conference on Digital libraries, p.91-99, March 20-23, 1996, Bethesda, Maryland, United States
[doi> 10.1145/226931.226950]
|
 |
100
|
|
 |
101
|
|
| |
102
|
|
| |
103
|
Lester, N., Moffat, A., Webber, W., and Zobel, J. 2005. Space-limited ranked query evaluation using adaptive pruning. In Proceedings of the 6th International Conference on Web Informations Systems. Lecture Notes in Computer Science, vol. 3806, Springer, 470--477.
|
 |
104
|
|
| |
105
|
Lester, N., Zobel, J., and Williams, H. E. 2006. Efficient online index maintenance for text retrieval systems. Inform. Proce. Manag. To appear.
|
 |
106
|
Lipyeow Lim , Min Wang , Sriram Padmanabhan , Jeffrey Scott Vitter , Ramesh Agarwal, Dynamic maintenance of web indexes using landmarks, Proceedings of the 12th international conference on World Wide Web, May 20-24, 2003, Budapest, Hungary
[doi> 10.1145/775152.775167]
|
 |
107
|
|
| |
108
|
|
| |
109
|
|
| |
110
|
Luhn, H. P. 1957. A statistical approach to mechanised encoding and searching of library information. IBM J. Resear. Develop. 1, 309--317.
|
| |
111
|
|
| |
112
|
|
| |
113
|
Manber, U. and Wu, S. 1994. GLIMPSE: a tool to search through entire file systems. In USENIX Winter Technical Conference. San Francisco CA. USENIX Association, Berkeley, CA. 23--32.
|
| |
114
|
|
 |
115
|
|
| |
116
|
|
| |
117
|
|
| |
118
|
Matthew, F. W. and Thomson, L. 1967. Weighted term search: a computer program for an inverted coordinate search on magnetic tape. J. Chem. Document. 7, 1, 49--56.
|
| |
119
|
McDonell, K. J. 1977. An inverted index implementation. Comput. J. 20, 1, 116--123.
|
 |
120
|
|
| |
121
|
Moffat, A. 1992. Economical inversion of large text files. Comput. Syst. 5, 2, 125--139.
|
| |
122
|
|
| |
123
|
|
| |
124
|
|
| |
125
|
Moffat, A., Webber, W., Zobel, J., and Baeza-Yates, R. 2005. A pipelined architecture for distributed text query evaluation. Submitted for publication.
|
| |
126
|
Moffat, A. and Zobel, J. 1992a. Coding for compression in full-text retrieval systems. In Proceedings of the IEEE Data Compression Conference, Snowbird, UT, J. A. Storer and M. Cohn, Eds. IEEE Computer Society Press, Los Alamitos, CA, 72--81.
|
 |
127
|
|
 |
128
|
|
| |
129
|
Moffat, A. and Zobel, J. 2004. What does it mean to “measure performance”? In Proceedings of the 5th International Conference on Web Informations Systems, Brisbane, Australia. X. Zhou, S. Su, M. P. Papazoglou, M. E. Owlowska, and K. Jeffrey, Eds. Lecture Notes in Computer Science, vol. 3306. Springer, 1--12.
|
| |
130
|
|
| |
131
|
|
| |
132
|
|
| |
133
|
Noreault, T., Koll, M., and McGill, M. J. 1977. Automatic ranked output from Boolean searches in SIRE. J. Amer, Soc. Inform. Science 28, 333--339.
|
| |
134
|
Perry, S. A. and Willett, P. 1983. A review of the use of inverted files for best match searching in information retrieval systems. J. Inform. Science 6, 59--66.
|
| |
135
|
|
 |
136
|
|
 |
137
|
|
| |
138
|
|
 |
139
|
|
 |
140
|
Berthier Ribeiro-Neto , Edleno S. Moura , Marden S. Neubert , Nivio Ziviani, Efficient distributed algorithms to build inverted files, Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, p.105-112, August 15-19, 1999, Berkeley, California, United States
[doi> 10.1145/312624.312663]
|
| |
141
|
Rice, R. F. 1979. Some practical universal noiseless coding techniques. Tech. Rep. 79--22, Jet Propulsion Laboratory, Pasadena, CA.
|
| |
142
|
Robertson, S. E. 1977. The probability ranking principle in IR. J. Document. 33, 4 (Dec.) 294--304.
|
| |
143
|
Robertson, S. E., Walker, S., Jones, S., Hancock-Beaulieu, M. M., and Gatford, M. 1994. Okapi at TREC-3. In Overview of the 3rd TREC Text REtrieval Conference, Gaithersburg, MD, D. Harman, Ed. NIST, NIST Special Publication 500-226.
|
| |
144
|
Rogers, W., Candela, G., and Harman, D. 1995. Space and time improvements for indexing in information retrieval. In Proceedings of the Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, NV, L. Spitz and D. D. Lewis, Eds.
|
 |
145
|
|
| |
146
|
|
| |
147
|
Salton, G. 1962. The use of citations as an aid to automatic content analysis. Tech. Rep. ISR-2, Section III, Harvard Computation Laboratory, Cambridge, MA.
|
| |
148
|
|
| |
149
|
|
 |
150
|
|
| |
151
|
|
 |
152
|
|
| |
153
|
|
| |
154
|
|
 |
155
|
|
 |
156
|
Paricia Correia Saraiva , Edleno Silva de Moura , Novio Ziviani , Wagner Meira , Rodrigo Fonseca , Berthier Riberio-Neto, Rank-preserving two-level caching for scalable search engines, Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, p.51-58, September 2001, New Orleans, Louisiana, United States
[doi> 10.1145/383952.383959]
|
| |
157
|
|
 |
158
|
|
| |
159
|
Schuegraf, E. J. 1976. Compression of large inverted files with hyperbolic term distribution. Inform. Proc. Manag. 12, 377--384.
|
| |
160
|
|
 |
161
|
|
| |
162
|
Shieh, W.-Y., Chen, T.-F., and Chung, C.-P. 2003. A tree-based inverted file for fast ranked-document retrieval. In Proceedings of the International Conference on Information and Knowledge Engineering. Las Vegas, NV. H. R. Arabnia, Ed. CSREA Press, 64--69.
|
| |
163
|
|
| |
164
|
|
| |
165
|
Shieh, W.-Y., Shann, J. J.-J., and Chung, C.-P. 2003. An inverted file cache for fast information retrieval. J. Inform. Science Eng. 19, 4, 681--695.
|
| |
166
|
|
 |
167
|
|
 |
168
|
|
 |
169
|
|
| |
170
|
|
| |
171
|
|
| |
172
|
|
| |
173
|
Spink, A. and Xu, J. L. 2000. Selected results from a large study of web searching: The Excite study. Inform. Resear.---Int. Electron. J. 6, 1.
|
 |
174
|
|
 |
175
|
C. Stanfill , R. Thau , D. Waltz, A parallel indexed algorithm for information retrieval, Proceedings of the 12th annual international ACM SIGIR conference on Research and development in information retrieval, p.88-97, June 25-28, 1989, Cambridge, Massachusetts, United States
|
| |
176
|
Stellhorn, W. H. 1977. An inverted file processor for information retrieval. IEEE Trans. Comput. 26, 12, 1258--1267.
|
 |
177
|
|
 |
178
|
|
| |
179
|
Teuhola, J. 1978. A compression method for clustered bit-vectors. Inform. Proc. Lett. 7, 6 (Oct.) 308--311.
|
| |
180
|
|
| |
181
|
|
 |
182
|
Anthony Tomasic , Héctor García-Molina , Kurt Shoens, Incremental updates of inverted lists for text document retrieval, Proceedings of the 1994 ACM SIGMOD international conference on Management of data, p.289-300, May 24-27, 1994, Minneapolis, Minnesota, United States
|
| |
183
|
|
| |
184
|
|
| |
185
|
van Rijsbergen, C. J. 1979. Informat. Retriev., 2nd Ed. Butterworths, London, UK.
|
| |
186
|
Vasanthakumar, S. R., Callan, J. P., and Croft, W. B. 1996. Integrating INQUERY with an RDBMS to support text retrieval. Bull. Techn. Comm. Data Eng. 19, 1, 24--33.
|
 |
187
|
|
 |
188
|
|
| |
189
|
|
| |
190
|
Williams, H. E. and Zobel, J. 1999. Compressing integers for fast file access. Comput. J. 42, 3, 193--201.
|
| |
191
|
Williams, H. E., Zobel, J., and Anderson, P. 1999. What's next? Index structures for efficient phrase querying. In Proceedings of the Australasian Database Conference. Auckland, New Zealand. M. Orlowska, Ed. Australian Computer Society, 141--152.
|
 |
192
|
|
| |
193
|
Witten, I. H., Bell, T. C., and Nevill, C. G. 1991. Models for compression in full-text retrieval systems. In Proceedings of the IEEE Data Compression Conference. Snowbird, UT, J. A. Storer and J. H. Reif, Eds. IEEE Computer Society Press, Los Alamitos, CA. 23--32.
|
| |
194
|
|
| |
195
|
|
| |
196
|
Xi, W., Sornil, O., and Fox, E. A. 2002a. Hybrid partition inverted files for large-scale digital libraries. In Proceedings of the Digital Library: IT Opportunities and Challenges in the New Millennium. Beijing Library Press, Beijing, China, 404--418.
|
| |
197
|
|
 |
198
|
|
 |
199
|
|
 |
200
|
|
 |
201
|
|
| |
202
|
|
| |
203
|
|
| |
204
|
Zobel, J., Moffat, A., and Sacks-Davis, R. 1993b. Storage management for files of dynamic records. In Proceedings of the Australasian Database Conference. Brisbane, Australia, 26--38.
|
| |
205
|
|
CITED BY 34
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Marcos Antonio Vaz Salles , Jens-Peter Dittrich , Shant Kirakos Karakashian , Olivier René Girard , Lukas Blunschi, iTrails: pay-as-you-go information integration in dataspaces, Proceedings of the 33rd international conference on Very large data bases, September 23-27, 2007, Vienna, Austria
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Vuk Ercegovac , Vanja Josifovski , Ning Li , Mauricio R. Mediano , Eugene J. Shekita, Supporting sub-document updates and queries in an inverted index, Proceeding of the 17th ACM conference on Information and knowledge management, October 26-30, 2008, Napa Valley, California, USA
|
|
|
Mingjie Zhu , Shuming Shi , Nenghai Yu , Ji-Rong Wen, Can phrase indexing help to process non-phrase queries?, Proceeding of the 17th ACM conference on Information and knowledge management, October 26-30, 2008, Napa Valley, California, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|